Parallel Index Scans

Started by Amit Kapila over 9 years ago, 89 messages
#1 Amit Kapila
amit.kapila16@gmail.com
2 attachment(s)

As of now, the driving table for a parallel query can only be accessed by
a parallel sequential scan, which limits its usage to a certain degree.
Parallelising index scans would allow parallel query to be used in many
more cases.  This patch enables parallelism for btree scans; supporting
parallel index scans for other index types like hash, gist, and spgist
can be done as separate patches.

The basic idea is quite similar to parallel heap scans: each worker
(including the leader, whenever possible) will scan a block and then get
the next block that needs to be scanned.  The parallelism is implemented
at the leaf level of the btree.  The first worker to start a btree scan
will descend to a leaf, while the others wait until it has reached one.
After reading the leaf block, that worker sets the next block to be read,
wakes the first waiting worker, and proceeds to scan tuples from the
block it has read; similarly, each worker, after reading a block, sets
the next block to be read and wakes up the first waiting worker.  This is
achieved by using the condition variable patch [1] proposed by Robert.
Parallelism is supported for both forward and backward scans.
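
To make the handoff protocol concrete, here is a minimal standalone
sketch of the idea.  It uses pthreads, a plain mutex/condition variable,
and integer page numbers purely for illustration; the actual patch keeps
this state in shared memory protected by a spinlock and synchronizes
workers with the proposed condition variable API, and the names below
(seize/release/done, SharedScan) are just for this sketch.

/* Simplified, standalone sketch of the leaf-page handoff protocol. */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef enum { SCAN_NOT_STARTED, SCAN_ADVANCING, SCAN_IDLE, SCAN_DONE } ScanState;

typedef struct
{
	long		next_page;		/* next leaf page to hand out */
	ScanState	state;
	pthread_mutex_t mutex;
	pthread_cond_t cv;
} SharedScan;

#define LAST_PAGE 8				/* pretend the index has 8 leaf pages */

static SharedScan scan = {0, SCAN_NOT_STARTED,
						  PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER};

/* Wait until nobody else is advancing the scan, then take the next page. */
static bool
seize(long *page)
{
	bool		got = false;

	pthread_mutex_lock(&scan.mutex);
	while (scan.state != SCAN_DONE)
	{
		if (scan.state != SCAN_ADVANCING)
		{
			scan.state = SCAN_ADVANCING;
			*page = scan.next_page;
			got = true;
			break;
		}
		pthread_cond_wait(&scan.cv, &scan.mutex);
	}
	pthread_mutex_unlock(&scan.mutex);
	return got;
}

/* Publish the next page and wake one waiting worker. */
static void
release(long next_page)
{
	pthread_mutex_lock(&scan.mutex);
	scan.next_page = next_page;
	scan.state = SCAN_IDLE;
	pthread_cond_signal(&scan.cv);
	pthread_mutex_unlock(&scan.mutex);
}

/* No pages left: mark the scan done and wake everybody. */
static void
done(void)
{
	pthread_mutex_lock(&scan.mutex);
	scan.state = SCAN_DONE;
	pthread_cond_broadcast(&scan.cv);
	pthread_mutex_unlock(&scan.mutex);
}

static void *
worker(void *arg)
{
	long		page;

	while (seize(&page))
	{
		if (page >= LAST_PAGE)	/* past the last leaf page */
		{
			done();
			break;
		}
		release(page + 1);		/* let someone else advance the scan ... */
		printf("worker %ld scans leaf page %ld\n",
			   (long) (intptr_t) arg, page);	/* ... while we "scan" our page */
	}
	return NULL;
}

int
main(void)
{
	pthread_t	w[2];

	for (int i = 0; i < 2; i++)
		pthread_create(&w[i], NULL, worker, (void *) (intptr_t) i);
	for (int i = 0; i < 2; i++)
		pthread_join(w[i], NULL);
	return 0;
}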

The optimizer chooses parallelism based on the number of pages in the
index relation, and the CPU cost of evaluating the rows is divided
equally among the workers.  The Index Scan node is made parallel-aware
and can be used beneath a Gather node, as shown below:

Current Plan for Index Scans
----------------------------------------
Index Scan using idx2 on test  (cost=0.42..7378.96 rows=2433 width=29)
  Index Cond: (c < 10)

Parallel version of plan
----------------------------------
Gather  (cost=1000.42..1243.40 rows=2433 width=29)
  Workers Planned: 1
  ->  Parallel Index Scan using idx2 on test  (cost=0.42..0.10 rows=1431 width=29)
        Index Cond: (c < 10)

Parallel index scans can be used to parallelise aggregate queries as
well.  For example, given a query like: select count(*) from t1 where
c1 > 1000 and c1 < 1100 and c2='aaa' group by c2; the following forms of
parallel plans are possible:

Finalize HashAggregate
  Group Key: c2
  ->  Gather
        Workers Planned: 1
        ->  Partial HashAggregate
              Group Key: c2
              ->  Parallel Index Scan using idx_t1_partial on t1
                    Index Cond: ((c1 > 1000) AND (c1 < 1100))
                    Filter: (c2 = 'aaa'::bpchar)

OR

Finalize GroupAggregate
  Group Key: c2
  ->  Sort
        ->  Gather
              Workers Planned: 1
              ->  Partial GroupAggregate
                    Group Key: c2
                    ->  Parallel Index Scan using idx_t1_partial on t1
                          Index Cond: ((c1 > 1000) AND (c1 < 1100))
                          Filter: (c2 = 'aaa'::bpchar)

In the second plan (GroupAggregate), the Sort + Gather step would be
replaced with GatherMerge once we have a GatherMerge node, as proposed by
Rushabh [2].  Note that the above examples are given only to explain the
usage of parallel index scans; actual plans will be selected based on
cost.

Performance tests
----------------------------
This test has been performed on a community machine (hydra, POWER-7).

Initialize pgbench with a scale factor of 3000 (./pgbench -i -s 3000 postgres).

Count the rows in pgbench_accounts based on the values of aid and bid:

Serial plan
------------------
set max_parallel_workers_per_gather=0;

postgres=# explain analyze select count(aid) from pgbench_accounts
where aid > 1000 and aid < 90000000 and bid > 800 and bid < 900;

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Aggregate  (cost=4714590.52..4714590.53 rows=1 width=8) (actual time=35684.425..35684.425 rows=1 loops=1)
   ->  Index Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.57..4707458.12 rows=2852961 width=4) (actual time=29210.743..34385.271 rows=9900000 loops=1)
         Index Cond: ((aid > 1000) AND (aid < 90000000))
         Filter: ((bid > 800) AND (bid < 900))
         Rows Removed by Filter: 80098999
 Planning time: 0.183 ms
 Execution time: 35684.459 ms
(7 rows)

Parallel Plan
-------------------
set max_parallel_workers_per_gather=2;

postgres=# explain analyze select count(aid) from pgbench_accounts
where aid > 1000 and aid < 90000000 and bid > 800 and bid < 900;

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=3924773.13..3924773.14 rows=1 width=8) (actual time=15033.105..15033.105 rows=1 loops=1)
   ->  Gather  (cost=3924772.92..3924773.12 rows=2 width=8) (actual time=15032.986..15033.093 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=3923772.92..3923772.92 rows=1 width=8) (actual time=15030.354..15030.354 rows=1 loops=3)
               ->  Parallel Index Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.57..3920801.08 rows=1188734 width=4) (actual time=12476.068..14600.410 rows=3300000 loops=3)
                     Index Cond: ((aid > 1000) AND (aid < 90000000))
                     Filter: ((bid > 800) AND (bid < 900))
                     Rows Removed by Filter: 26699666
 Planning time: 0.244 ms
 Execution time: 15036.081 ms
(11 rows)

The above is the median of 3 runs; all runs gave almost the same
execution time.  Here, we can notice that the execution time is reduced
by more than half with two workers, and I have also tested with four
workers, where the time is reduced to roughly one-fourth (9128.420 ms) of
the serial plan.  I think these results are quite similar to what we got
for parallel sequential scans.  Another thing to note is that
parallelising index scans is more beneficial if there is a Filter that
removes many of the rows fetched by the Index Scan, or if the Filter is
costly (for example, it contains a costly function call).  This
observation is also quite similar to what we have observed with parallel
sequential scans.

I think we can parallelise Index Only Scans as well, but I have not
evaluated that yet; it can certainly be done as a separate patch in the
future.

Contributions
--------------------
First patch (parallel_index_scan_v1.patch) implements parallelism at
the IndexAM level - Rahila Syed and Amit Kapila, based on design inputs
and suggestions by Robert Haas
Second patch (parallel_index_opt_exec_support_v1.patch) provides
optimizer and executor support for parallel index scans - Amit Kapila
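
For reference, here is a rough sketch of how the new AM-level entry
points added by the first patch are expected to be driven.  The
index_parallelscan_* and index_beginscan_parallel signatures are taken
from the patch; the wrapper functions, their names, and the shm_toc usage
shown here are only illustrative assumptions, and the real executor
integration is in the second patch.

/*
 * Sketch (not code from the patch) of the expected calling sequence for
 * the new parallel index scan entry points.
 */
#include "postgres.h"
#include "access/genam.h"
#include "storage/shm_toc.h"

/* Leader: size and initialize the shared parallel index scan descriptor. */
static ParallelIndexScanDesc
setup_shared_index_scan(shm_toc *toc, Relation heaprel, Relation indexrel,
						Snapshot snapshot)
{
	Size		pscan_len = index_parallelscan_estimate(indexrel, snapshot);
	ParallelIndexScanDesc pscan = shm_toc_allocate(toc, pscan_len);

	index_parallelscan_initialize(heaprel, indexrel, snapshot, pscan);
	return pscan;
}

/* Leader or worker: attach to the shared descriptor and drive the scan. */
static void
scan_my_share(Relation heaprel, Relation indexrel, ParallelIndexScanDesc pscan,
			  ScanKey scankeys, int nkeys)
{
	IndexScanDesc scan = index_beginscan_parallel(heaprel, indexrel,
												  nkeys, 0, pscan);

	index_rescan(scan, scankeys, nkeys, NULL, 0);
	while (index_getnext(scan, ForwardScanDirection) != NULL)
	{
		/* each backend sees a disjoint subset of the matching heap tuples */
	}
	index_endscan(scan);
}

For a rescan (for example, on the inner side of a nestloop), the shared
state would be reset with index_parallelrescan() before the workers
restart their scans.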

The order to use these patches is: first apply the condition variable
patch [1], then parallel_index_scan_v1.patch, and finally
parallel_index_opt_exec_support_v1.patch.

Thoughts?

[1]: /messages/by-id/CAEepm=0zshYwB6wDeJCkrRJeoBM=jPYBe+-k_VtKRU_8zMLEfA@mail.gmail.com
[2]: /messages/by-id/CAGPqQf09oPX-cQRpBKS0Gq49Z+m6KBxgxd_p9gX8CKk_d75HoQ@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_index_scan_v1.patch (application/octet-stream)
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 40f201b..8d4f9f7 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -77,61 +77,64 @@
   <para>
    The structure <structname>IndexAmRoutine</structname> is defined thus:
 <programlisting>
-typedef struct IndexAmRoutine
-{
-    NodeTag     type;
-
-    /*
-     * Total number of strategies (operators) by which we can traverse/search
-     * this AM.  Zero if AM does not have a fixed set of strategy assignments.
-     */
-    uint16      amstrategies;
-    /* total number of support functions that this AM uses */
-    uint16      amsupport;
-    /* does AM support ORDER BY indexed column's value? */
-    bool        amcanorder;
-    /* does AM support ORDER BY result of an operator on indexed column? */
-    bool        amcanorderbyop;
-    /* does AM support backward scanning? */
-    bool        amcanbackward;
-    /* does AM support UNIQUE indexes? */
-    bool        amcanunique;
-    /* does AM support multi-column indexes? */
-    bool        amcanmulticol;
-    /* does AM require scans to have a constraint on the first index column? */
-    bool        amoptionalkey;
-    /* does AM handle ScalarArrayOpExpr quals? */
-    bool        amsearcharray;
-    /* does AM handle IS NULL/IS NOT NULL quals? */
-    bool        amsearchnulls;
-    /* can index storage data type differ from column data type? */
-    bool        amstorage;
-    /* can an index of this type be clustered on? */
-    bool        amclusterable;
-    /* does AM handle predicate locks? */
-    bool        ampredlocks;
-    /* type of data stored in index, or InvalidOid if variable */
-    Oid         amkeytype;
-
-    /* interface functions */
-    ambuild_function ambuild;
-    ambuildempty_function ambuildempty;
-    aminsert_function aminsert;
-    ambulkdelete_function ambulkdelete;
-    amvacuumcleanup_function amvacuumcleanup;
-    amcanreturn_function amcanreturn;   /* can be NULL */
-    amcostestimate_function amcostestimate;
-    amoptions_function amoptions;
-    amproperty_function amproperty;     /* can be NULL */
-    amvalidate_function amvalidate;
-    ambeginscan_function ambeginscan;
-    amrescan_function amrescan;
-    amgettuple_function amgettuple;     /* can be NULL */
-    amgetbitmap_function amgetbitmap;   /* can be NULL */
-    amendscan_function amendscan;
-    ammarkpos_function ammarkpos;       /* can be NULL */
-    amrestrpos_function amrestrpos;     /* can be NULL */
-} IndexAmRoutine;
+  typedef struct IndexAmRoutine
+  {
+  NodeTag     type;
+
+  /*
+  * Total number of strategies (operators) by which we can traverse/search
+  * this AM.  Zero if AM does not have a fixed set of strategy assignments.
+  */
+  uint16      amstrategies;
+  /* total number of support functions that this AM uses */
+  uint16      amsupport;
+  /* does AM support ORDER BY indexed column's value? */
+  bool        amcanorder;
+  /* does AM support ORDER BY result of an operator on indexed column? */
+  bool        amcanorderbyop;
+  /* does AM support backward scanning? */
+  bool        amcanbackward;
+  /* does AM support UNIQUE indexes? */
+  bool        amcanunique;
+  /* does AM support multi-column indexes? */
+  bool        amcanmulticol;
+  /* does AM require scans to have a constraint on the first index column? */
+  bool        amoptionalkey;
+  /* does AM handle ScalarArrayOpExpr quals? */
+  bool        amsearcharray;
+  /* does AM handle IS NULL/IS NOT NULL quals? */
+  bool        amsearchnulls;
+  /* can index storage data type differ from column data type? */
+  bool        amstorage;
+  /* can an index of this type be clustered on? */
+  bool        amclusterable;
+  /* does AM handle predicate locks? */
+  bool        ampredlocks;
+  /* type of data stored in index, or InvalidOid if variable */
+  Oid         amkeytype;
+
+  /* interface functions */
+  ambuild_function ambuild;
+  ambuildempty_function ambuildempty;
+  aminsert_function aminsert;
+  ambulkdelete_function ambulkdelete;
+  amvacuumcleanup_function amvacuumcleanup;
+  amcanreturn_function amcanreturn;   /* can be NULL */
+  amcostestimate_function amcostestimate;
+  amoptions_function amoptions;
+  amproperty_function amproperty;     /* can be NULL */
+  amvalidate_function amvalidate;
+  amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
+  aminitparallelscan_function aminitparallelscan;    /* can be NULL */
+  ambeginscan_function ambeginscan;
+  amrescan_function amrescan;
+  amparallelrescan_function amparallelrescan;    /* can be NULL */
+  amgettuple_function amgettuple;     /* can be NULL */
+  amgetbitmap_function amgetbitmap;   /* can be NULL */
+  amendscan_function amendscan;
+  ammarkpos_function ammarkpos;       /* can be NULL */
+  amrestrpos_function amrestrpos;     /* can be NULL */
+  } IndexAmRoutine;
 </programlisting>
   </para>
 
@@ -459,6 +462,36 @@ amvalidate (Oid opclassoid);
    invalid.  Problems should be reported with <function>ereport</> messages.
   </para>
 
+  <para>
+<programlisting>
+Size
+amestimateparallelscan (Size index_size);
+</programlisting>
+    Estimate and return the storage space required for a parallel index scan.
+    The <literal>index_size</> parameter indicates the size of the generic
+    parallel index scan information.  The size of the index-type-specific
+    parallel information is added to <literal>index_size</> and the result is
+    returned to the caller.
+  </para>
+
+  <para>
+<programlisting>
+void
+aminitparallelscan (void *target);
+</programlisting>
+   Initialize the parallel index scan state.  It is used to initialize the
+   index-type-specific parallel information, which is stored immediately
+   after the generic parallel information required for parallel index scans.
+   The required state information is set in <literal>target</>.
+  </para>
+
+   <para>
+     The <function>aminitparallelscan</> and <function>amestimateparallelscan</>
+     functions need only be provided if the access method supports <quote>parallel</>
+     index scans.  If it doesn't, the <structfield>aminitparallelscan</> and
+     <structfield>amestimateparallelscan</> fields in its <structname>IndexAmRoutine</>
+     struct must be set to NULL.
+   </para>
 
   <para>
    The purpose of an index, of course, is to support scans for tuples matching
@@ -511,6 +544,23 @@ amrescan (IndexScanDesc scan,
 
   <para>
 <programlisting>
+void
+amparallelrescan (IndexScanDesc scan);
+</programlisting>
+   Restart the parallel index scan.  It resets the parallel index scan state.
+   It must be called only when the scan is restarted, which is typically
+   required for the inner side of a nested-loop join.
+  </para>
+
+  <para>
+   The <function>amparallelrescan</> function need only be provided if the
+   access method supports <quote>parallel</> index scans.  If it doesn't,
+   the <structfield>amparallelrescan</> field in its <structname>IndexAmRoutine</>
+   struct must be set to NULL.
+  </para>
+
+  <para>
+<programlisting>
 boolean
 amgettuple (IndexScanDesc scan,
             ScanDirection direction);
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1b45a4c..5ff6a0f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -104,8 +104,11 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = brinoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = brinvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = brinbeginscan;
 	amroutine->amrescan = brinrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = bringetbitmap;
 	amroutine->amendscan = brinendscan;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index f07eedc..6f7024e 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,8 +61,11 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = ginoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = ginvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = ginbeginscan;
 	amroutine->amrescan = ginrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = gingetbitmap;
 	amroutine->amendscan = ginendscan;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b8aa9bc..5727ef9 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -81,8 +81,11 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = gistoptions;
 	amroutine->amproperty = gistproperty;
 	amroutine->amvalidate = gistvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = gistbeginscan;
 	amroutine->amrescan = gistrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = gistgettuple;
 	amroutine->amgetbitmap = gistgetbitmap;
 	amroutine->amendscan = gistendscan;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..40ddda5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -78,8 +78,11 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = hashoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = hashvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = hashbeginscan;
 	amroutine->amrescan = hashrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = hashgettuple;
 	amroutine->amgetbitmap = hashgetbitmap;
 	amroutine->amendscan = hashendscan;
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 65c941d..84ebd72 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -375,7 +375,7 @@ systable_beginscan(Relation heapRelation,
 		}
 
 		sysscan->iscan = index_beginscan(heapRelation, irel,
-										 snapshot, nkeys, 0);
+										 snapshot, NULL, nkeys, 0, false);
 		index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 		sysscan->scan = NULL;
 	}
@@ -577,7 +577,7 @@ systable_beginscan_ordered(Relation heapRelation,
 	}
 
 	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
-									 snapshot, nkeys, 0);
+									 snapshot, NULL, nkeys, 0, false);
 	index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 	sysscan->scan = NULL;
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 54b71cb..72e1b03 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -207,6 +207,70 @@ index_insert(Relation indexRelation,
 }
 
 /*
+ * index_parallelscan_estimate - estimate storage for ParallelIndexScanDesc
+ *
+ * It calls am specific routine to obtain size of am specific shared
+ * information.
+ */
+Size
+index_parallelscan_estimate(Relation indexrel, Snapshot snapshot)
+{
+	Size		index_size = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
+									  EstimateSnapshotSpace(snapshot));
+
+	/* amestimateparallelscan is optional; assume no-op if not provided by AM */
+	if (indexrel->rd_amroutine->amestimateparallelscan == NULL)
+		return index_size;
+	else
+		return indexrel->rd_amroutine->amestimateparallelscan(index_size);
+}
+
+/*
+ * index_parallelscan_initialize - initialize ParallelIndexScanDesc
+ *
+ * It calls access method specific initialization routine to initialize am
+ * specific information.  Call this just once in the leader process; then,
+ * individual workers attach via index_beginscan_parallel.
+ */
+void
+index_parallelscan_initialize(Relation heaprel, Relation indexrel, Snapshot snapshot, ParallelIndexScanDesc target)
+{
+	Size		offset = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
+								  EstimateSnapshotSpace(snapshot));
+	void	   *amtarget = (char *) ((void *) target) + offset;
+
+	target->ps_relid = RelationGetRelid(heaprel);
+	target->ps_indexid = RelationGetRelid(indexrel);
+	target->ps_offset = offset;
+	SerializeSnapshot(snapshot, target->ps_snapshot_data);
+
+	/* aminitparallelscan is optional; assume no-op if not provided by AM */
+	if (indexrel->rd_amroutine->aminitparallelscan == NULL)
+		return;
+
+	indexrel->rd_amroutine->aminitparallelscan(amtarget);
+}
+
+/*
+ * index_beginscan_parallel - join parallel index scan
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
+						 int norderbys, ParallelIndexScanDesc pscan)
+{
+	Snapshot	snapshot;
+	IndexScanDesc scan;
+
+	Assert(RelationGetRelid(heaprel) == pscan->ps_relid);
+	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
+	RegisterSnapshot(snapshot);
+	scan = index_beginscan(heaprel, indexrel, snapshot, pscan, nkeys, norderbys, true);
+	return scan;
+}
+
+/*
  * index_beginscan - start a scan of an index with amgettuple
  *
  * Caller must be holding suitable locks on the heap and the index.
@@ -214,8 +278,8 @@ index_insert(Relation indexRelation,
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
-				Snapshot snapshot,
-				int nkeys, int norderbys)
+				Snapshot snapshot, ParallelIndexScanDesc pscan,
+				int nkeys, int norderbys, bool temp_snap)
 {
 	IndexScanDesc scan;
 
@@ -227,6 +291,8 @@ index_beginscan(Relation heapRelation,
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->parallel_scan = pscan;
+	scan->xs_temp_snap = temp_snap;
 
 	return scan;
 }
@@ -251,6 +317,8 @@ index_beginscan_bitmap(Relation indexRelation,
 	 * up by RelationGetIndexScan.
 	 */
 	scan->xs_snapshot = snapshot;
+	scan->parallel_scan = NULL;
+	scan->xs_temp_snap = false;
 
 	return scan;
 }
@@ -319,6 +387,22 @@ index_rescan(IndexScanDesc scan,
 }
 
 /* ----------------
+ *		index_parallelrescan  - (re)start a parallel scan of an index
+ * ----------------
+ */
+void
+index_parallelrescan(IndexScanDesc scan)
+{
+	SCAN_CHECKS;
+
+	/* amparallelrescan is optional; assume no-op if not provided by AM */
+	if (scan->indexRelation->rd_amroutine->amparallelrescan == NULL)
+		return;
+
+	scan->indexRelation->rd_amroutine->amparallelrescan(scan);
+}
+
+/* ----------------
  *		index_endscan - end a scan
  * ----------------
  */
@@ -341,6 +425,9 @@ index_endscan(IndexScanDesc scan)
 	/* Release index refcount acquired by index_beginscan */
 	RelationDecrementReferenceCount(scan->indexRelation);
 
+	if (scan->xs_temp_snap)
+		UnregisterSnapshot(scan->xs_snapshot);
+
 	/* Release the scan data structure itself */
 	IndexScanEnd(scan);
 }
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 128744c..94fa368 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,8 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "pgstat.h"
+#include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -62,6 +64,23 @@ typedef struct
 	MemoryContext pagedelcontext;
 } BTVacState;
 
+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+	BlockNumber ps_nextPage;	/* latest or next page to be scanned */
+	uint8		ps_pageStatus;	/* state of scan, see below */
+	int			ps_arrayKeyCount;		/* count indicating number of array
+										 * scan keys processed by parallel
+										 * scan */
+	slock_t		ps_mutex;		/* protects above variables */
+	ConditionVariable cv;		/* used to synchronize parallel scan */
+}	BTParallelScanDescData;
+
+typedef struct BTParallelScanDescData *BTParallelScanDesc;
+
 
 static void btbuildCallback(Relation index,
 				HeapTuple htup,
@@ -110,8 +129,11 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = btoptions;
 	amroutine->amproperty = btproperty;
 	amroutine->amvalidate = btvalidate;
+	amroutine->amestimateparallelscan = btestimateparallelscan;
+	amroutine->aminitparallelscan = btinitparallelscan;
 	amroutine->ambeginscan = btbeginscan;
 	amroutine->amrescan = btrescan;
+	amroutine->amparallelrescan = btparallelrescan;
 	amroutine->amgettuple = btgettuple;
 	amroutine->amgetbitmap = btgetbitmap;
 	amroutine->amendscan = btendscan;
@@ -462,6 +484,183 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 }
 
 /*
+ * btestimateparallelscan - estimate storage for BTParallelScanDescData
+ */
+Size
+btestimateparallelscan(Size index_size)
+{
+	return add_size(index_size, sizeof(BTParallelScanDescData));
+}
+
+/*
+ * btinitparallelscan - initialize BTParallelScanDesc for parallel btree scan
+ */
+void
+btinitparallelscan(void *target)
+{
+	BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
+
+	SpinLockInit(&bt_target->ps_mutex);
+	bt_target->ps_nextPage = InvalidBlockNumber;
+	bt_target->ps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->ps_arrayKeyCount = 0;
+	ConditionVariableInit(&bt_target->cv);
+}
+
+/*
+ * _bt_parallel_seize() -- returns the next block to be scanned for forward
+ *		scans and latest block scanned for backward scans.
+ *
+ *	status - True indicates that the block number returned is either valid
+ *			 and the scan is continuing, or invalid because the scan has just
+ *			 begun.  False indicates that we have reached the end of the scan
+ *			 for the current scankeys, in which case P_NONE is returned.
+ *
+ * Callers must ignore the return value if the status is false.
+ */
+BlockNumber
+_bt_parallel_seize(IndexScanDesc scan, bool *status)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	bool		queued_self = false;
+	uint8		pageStatus;
+	bool		exit_loop = false;
+	BlockNumber nextPage = InvalidBlockNumber;
+	BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
+												(void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+	*status = true;
+	while (1)
+	{
+		SpinLockAcquire(&btscan->ps_mutex);
+		pageStatus = btscan->ps_pageStatus;
+		if (so->arrayKeyCount < btscan->ps_arrayKeyCount)
+			*status = false;
+		else if (pageStatus == BTPARALLEL_DONE)
+			*status = false;
+		else if (pageStatus != BTPARALLEL_ADVANCING)
+		{
+			btscan->ps_pageStatus = BTPARALLEL_ADVANCING;
+			nextPage = btscan->ps_nextPage;
+			exit_loop = true;
+		}
+		SpinLockRelease(&btscan->ps_mutex);
+		if (*status == false)
+			return P_NONE;
+		if (exit_loop)
+			break;
+		if (queued_self)
+		{
+			ConditionVariableSleep(WAIT_EVENT_PARALLEL_INDEX_SCAN);
+			queued_self = false;
+		}
+		else
+		{
+			ConditionVariablePrepareToSleep(&btscan->cv);
+			queued_self = true;
+		}
+	}
+	if (queued_self)
+	{
+		ConditionVariableCancelSleep();
+		queued_self = false;
+	}
+
+	*status = true;
+	return nextPage;
+}
+
+/*
+ * _bt_parallel_release() -- Advances the parallel scan to allow scan of next
+ *		page
+ *
+ * It updates the value of next page that allows parallel scan to move forward
+ * or backward depending on scan direction.  It wakes up one of the sleeping
+ * workers.
+ *
+ * For backward scan, next_page holds the latest page being scanned.
+ * For forward scan, next_page holds the next page to be scanned.
+ */
+void
+_bt_parallel_release(IndexScanDesc scan, BlockNumber next_page)
+{
+	BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
+												(void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+	SpinLockAcquire(&btscan->ps_mutex);
+	btscan->ps_nextPage = next_page;
+	btscan->ps_pageStatus = BTPARALLEL_IDLE;
+	SpinLockRelease(&btscan->ps_mutex);
+	ConditionVariableSignal(&btscan->cv);
+}
+
+/*
+ * _bt_parallel_done() -- Finishes the parallel scan
+ *
+ * This must be called when there are no pages left to scan. Notify end of
+ * parallel scan to all the other workers associated with this scan.
+ */
+void
+_bt_parallel_done(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTParallelScanDesc btscan;
+	bool		status_changed = false;
+
+	/* Do nothing, for non-parallel scans */
+	if (scan->parallel_scan == NULL)
+		return;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+	/*
+	 * Ensure to mark parallel scan as done no more than once for single scan.
+	 * We rely on this state to initiate the next scan for multiple array
+	 * keys, see _bt_advance_array_keys.
+	 */
+	SpinLockAcquire(&btscan->ps_mutex);
+	if (so->arrayKeyCount >= btscan->ps_arrayKeyCount &&
+		btscan->ps_pageStatus != BTPARALLEL_DONE)
+	{
+		btscan->ps_pageStatus = BTPARALLEL_DONE;
+		status_changed = true;
+	}
+	SpinLockRelease(&btscan->ps_mutex);
+
+	/* wake up all the workers associated with this parallel scan */
+	if (status_changed)
+		ConditionVariableBroadcast(&btscan->cv);
+}
+
+/*
+ * _bt_parallel_advance_scan() -- Advances the parallel scan
+ *
+ * It updates the count of array keys processed for both local and parallel
+ * scans.
+ */
+void
+_bt_parallel_advance_scan(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
+												(void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+	so->arrayKeyCount++;
+	SpinLockAcquire(&btscan->ps_mutex);
+	if (btscan->ps_pageStatus == BTPARALLEL_DONE)
+	{
+		btscan->ps_nextPage = InvalidBlockNumber;
+		btscan->ps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		btscan->ps_arrayKeyCount++;
+	}
+	SpinLockRelease(&btscan->ps_mutex);
+}
+
+/*
  *	btrescan() -- rescan an index relation
  */
 void
@@ -481,6 +680,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	}
 
 	so->markItemIndex = -1;
+	so->arrayKeyCount = 0;
 	BTScanPosUnpinIfPinned(so->markPos);
 	BTScanPosInvalidate(so->markPos);
 
@@ -521,6 +721,31 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 }
 
 /*
+ *	btparallelrescan() -- reset parallel scan
+ */
+void
+btparallelrescan(IndexScanDesc scan)
+{
+	if (scan->parallel_scan)
+	{
+		BTParallelScanDesc btscan;
+
+		btscan = (BTParallelScanDesc) OffsetToPointer(
+												(void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+		/*
+		 * Ideally, we don't need to acquire spinlock here, but being
+		 * consistent with heap_rescan seems to be a good idea.
+		 */
+		SpinLockAcquire(&btscan->ps_mutex);
+		btscan->ps_nextPage = InvalidBlockNumber;
+		btscan->ps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		SpinLockRelease(&btscan->ps_mutex);
+	}
+}
+
+/*
  *	btendscan() -- close down a scan
  */
 void
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index ee46023..baa69fc 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
@@ -544,8 +546,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
+	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	BlockNumber blkno;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -564,6 +568,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (!so->qual_ok)
 		return false;
 
+	/*
+	 * For parallel scans, get the page from the shared state.  If the scan
+	 * has not started yet, proceed to find the first leaf page to scan,
+	 * keeping the other workers waiting until we have descended to the
+	 * appropriate leaf page to be scanned for matching tuples.
+	 *
+	 * If the scan has already begun, skip finding the first leaf page and
+	 * directly scan the page stored in the shared structure, or the page to
+	 * its left in the case of a backward scan.
+	 */
+	if (scan->parallel_scan != NULL)
+	{
+		blkno = _bt_parallel_seize(scan, &status);
+		if (status == false)
+		{
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno == P_NONE)
+		{
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno != InvalidBlockNumber)
+		{
+			if (!_bt_parallel_readpage(scan, blkno, dir))
+				return false;
+			goto readcomplete;
+		}
+	}
+
 	/*----------
 	 * Examine the scan keys to discover where we need to start the scan.
 	 *
@@ -743,7 +779,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * there.
 	 */
 	if (keysCount == 0)
-		return _bt_endpoint(scan, dir);
+	{
+		bool		match;
+
+		match = _bt_endpoint(scan, dir);
+		if (!match)
+		{
+			/* No match, indicate (parallel) scan finished */
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+		}
+
+		return match;
+	}
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -993,6 +1041,14 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * because nothing finer to lock exists.
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
+
+		/*
+		 * mark parallel scan as done, so that all the workers can finish
+		 * their scan
+		 */
+		_bt_parallel_done(scan);
+		BTScanPosInvalidate(so->currPos);
+
 		return false;
 	}
 	else
@@ -1060,6 +1116,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
+readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
@@ -1154,6 +1211,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	page = BufferGetPage(so->currPos.buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/* allow the next page to be processed by a parallel worker */
+	if (scan->parallel_scan)
+	{
+		if (ScanDirectionIsForward(dir))
+			_bt_parallel_release(scan, opaque->btpo_next);
+		else
+			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+	}
+
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1278,21 +1345,16 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
  * if pinned, we'll drop the pin before moving to next page.  The buffer is
  * not locked on entry.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page.  For success on a scan using a non-MVCC snapshot we hold
- * a pin, but not a read lock, on that page.  If we do not hold the pin, we
- * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
- *
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
+ * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
+ * to InvalidBuffer.  We return TRUE to indicate success.
  */
 static bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Relation	rel;
-	Page		page;
-	BTPageOpaque opaque;
+	BlockNumber blkno = InvalidBlockNumber;
+	bool		status = true;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1319,13 +1381,25 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}
 
-	rel = scan->indexRelation;
-
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
-		/* We must rely on the previously saved nextPage link! */
-		BlockNumber blkno = so->currPos.nextPage;
+
+		/*
+		 * We must rely on the previously saved nextPage link for non-parallel
+		 * scans!
+		 */
+		if (scan->parallel_scan != NULL)
+		{
+			blkno = _bt_parallel_seize(scan, &status);
+			if (status == false)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+			blkno = so->currPos.nextPage;
 
 		/* Remember we left a page with data */
 		so->currPos.moreLeft = true;
@@ -1333,11 +1407,68 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		/* release the previous buffer, if pinned */
 		BTScanPosUnpinIfPinned(so->currPos);
 
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+	else
+	{
+		/* Remember we left a page with data */
+		so->currPos.moreRight = true;
+
+		/* For parallel scans, get the last page scanned */
+		if (scan->parallel_scan != NULL)
+		{
+			blkno = _bt_parallel_seize(scan, &status);
+			if (status == false)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+			BTScanPosUnpinIfPinned(so->currPos);
+		}
+
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+
+	/* Drop the lock, and maybe the pin, on the current page */
+	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+	return true;
+}
+
+/*
+ *	_bt_readnextpage() -- Read next page containing valid data for scan
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page.  The caller is responsible for releasing the lock and
+ * pin on the buffer on success.  We return TRUE to indicate success.
+ *
+ * If there are no more matching records in the given direction, we drop all
+ * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation	rel;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		status = true;
+
+	rel = scan->indexRelation;
+
+	if (ScanDirectionIsForward(dir))
+	{
 		for (;;)
 		{
-			/* if we're at end of scan, give up */
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
 			if (blkno == P_NONE || !so->currPos.moreRight)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1345,10 +1476,10 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			CHECK_FOR_INTERRUPTS();
 			/* step right one page */
 			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
-			/* check for deleted page */
 			page = BufferGetPage(so->currPos.buf);
 			TestForOldSnapshot(scan->xs_snapshot, rel, page);
 			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			/* check for deleted page */
 			if (!P_IGNORE(opaque))
 			{
 				PredicateLockPage(rel, blkno, scan->xs_snapshot);
@@ -1359,14 +1490,30 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			}
 
 			/* nope, keep going */
-			blkno = opaque->btpo_next;
+			if (scan->parallel_scan != NULL)
+			{
+				blkno = _bt_parallel_seize(scan, &status);
+				if (status == false)
+				{
+					_bt_relbuf(rel, so->currPos.buf);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+			}
+			else
+				blkno = opaque->btpo_next;
 			_bt_relbuf(rel, so->currPos.buf);
 		}
 	}
 	else
 	{
-		/* Remember we left a page with data */
-		so->currPos.moreRight = true;
+		/*
+		 * for parallel scans, current block number needs to be retrieved from
+		 * shared state and it is the responsibility of caller to pass the
+		 * correct block number.
+		 */
+		if (blkno != InvalidBlockNumber)
+			so->currPos.currPage = blkno;
 
 		/*
 		 * Walk left to the next page with data.  This is much more complex
@@ -1401,6 +1548,12 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			if (!so->currPos.moreLeft)
 			{
 				_bt_relbuf(rel, so->currPos.buf);
+
+				/*
+				 * mark parallel scan as done, so that all the workers can
+				 * finish their scan
+				 */
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1412,6 +1565,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			/* if we're physically at end of index, return failure */
 			if (so->currPos.buf == InvalidBuffer)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1432,9 +1586,60 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
 					break;
 			}
+
+			/*
+			 * For parallel scans, get the last page scanned as it is quite
+			 * possible that by the time we try to fetch previous page, other
+			 * worker has also decided to scan that previous page.  We could
+			 * avoid that by doing _bt_parallel_release once we have read the
+			 * current page, but it is bad to make other workers wait till we
+			 * read the page.
+			 */
+			if (scan->parallel_scan != NULL)
+			{
+				_bt_relbuf(rel, so->currPos.buf);
+				blkno = _bt_parallel_seize(scan, &status);
+				if (status == false)
+				{
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+				so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			}
 		}
 	}
 
+	return true;
+}
+
+/*
+ *	_bt_parallel_readpage() -- Read current page containing valid data for scan
+ *
+ * On success, the lock, and possibly the pin, on the buffer is released.
+ * We return TRUE to indicate success.
+ */
+static bool
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	/* initialize moreLeft/moreRight appropriately for scan direction */
+	if (ScanDirectionIsForward(dir))
+	{
+		so->currPos.moreLeft = false;
+		so->currPos.moreRight = true;
+	}
+	else
+	{
+		so->currPos.moreLeft = true;
+		so->currPos.moreRight = false;
+	}
+	so->numKilled = 0;			/* just paranoia */
+	so->markItemIndex = -1;		/* ditto */
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
 	/* Drop the lock, and maybe the pin, on the current page */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 063c988..15ea6df 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -590,6 +590,10 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 			break;
 	}
 
+	/* advance parallel scan */
+	if (scan->parallel_scan != NULL)
+		_bt_parallel_advance_scan(scan);
+
 	return found;
 }
 
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d570ae5..4f39e93 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -60,8 +60,11 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = spgoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = spgvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = spgbeginscan;
 	amroutine->amrescan = spgrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = spggettuple;
 	amroutine->amgetbitmap = spggetbitmap;
 	amroutine->amendscan = spgendscan;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index dc1f79f..01b358c 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -903,7 +903,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	if (OldIndex != NULL && !use_sort)
 	{
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0, false);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 009c1b7..f8e662b 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -722,7 +722,7 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, index_natts, 0);
+	index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, index_natts, 0, false);
 	index_rescan(index_scan, scankeys, index_natts, NULL, 0);
 
 	while ((tup = index_getnext(index_scan,
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index a364098..43c610c 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -26,6 +26,7 @@
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
 #include "utils/memutils.h"
+#include "access/relscan.h"
 
 
 /* ----------------------------------------------------------------
@@ -48,6 +49,7 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	 * extract necessary information from index scan node
 	 */
 	scandesc = node->biss_ScanDesc;
+	scandesc->parallel_scan = NULL;
 
 	/*
 	 * If we have runtime keys and they've not already been set up, do it now.
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..45566bd 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -540,9 +540,9 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	 */
 	indexstate->ioss_ScanDesc = index_beginscan(currentRelation,
 												indexstate->ioss_RelationDesc,
-												estate->es_snapshot,
+												estate->es_snapshot, NULL,
 												indexstate->ioss_NumScanKeys,
-											indexstate->ioss_NumOrderByKeys);
+									 indexstate->ioss_NumOrderByKeys, false);
 
 	/* Set it up for index-only scan */
 	indexstate->ioss_ScanDesc->xs_want_itup = true;
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..77d2990 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -1022,9 +1022,9 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	 */
 	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
 											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot,
+											   estate->es_snapshot, NULL,
 											   indexstate->iss_NumScanKeys,
-											 indexstate->iss_NumOrderByKeys);
+									  indexstate->iss_NumOrderByKeys, false);
 
 	/*
 	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a9efee8..76ff463 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3382,6 +3382,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PARALLEL_FINISH:
 			event_name = "ParallelFinish";
 			break;
+		case WAIT_EVENT_PARALLEL_INDEX_SCAN:
+			event_name = "ParallelIndexScan";
+			break;
 		case WAIT_EVENT_SAFE_SNAPSHOT:
 			event_name = "SafeSnapshot";
 			break;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 56943f2..e058150 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5129,8 +5129,8 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 				 * extreme value has been deleted; that case motivates not
 				 * using SnapshotAny here.
 				 */
-				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty,
-											 1, 0);
+				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty, NULL,
+											 1, 0, false);
 				index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 				/* Fetch first tuple in sortop's direction */
@@ -5161,8 +5161,8 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 			/* If max is requested, and we didn't find the index is empty */
 			if (max && have_data)
 			{
-				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty,
-											 1, 0);
+				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty, NULL,
+											 1, 0, false);
 				index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 				/* Fetch first tuple in reverse direction */
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 1036cca..e777678 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -108,6 +108,12 @@ typedef bool (*amproperty_function) (Oid index_oid, int attno,
 /* validate definition of an opclass for this AM */
 typedef bool (*amvalidate_function) (Oid opclassoid);
 
+/* estimate size of parallel scan descriptor */
+typedef Size (*amestimateparallelscan_function) (Size index_size);
+
+/* prepare for parallel index scan */
+typedef void (*aminitparallelscan_function) (void *target);
+
 /* prepare for index scan */
 typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
 														   int nkeys,
@@ -120,6 +126,9 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 											   ScanKey orderbys,
 											   int norderbys);
 
+/* (re)start parallel index scan */
+typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+
 /* next valid tuple */
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 												 ScanDirection direction);
@@ -189,8 +198,11 @@ typedef struct IndexAmRoutine
 	amoptions_function amoptions;
 	amproperty_function amproperty;		/* can be NULL */
 	amvalidate_function amvalidate;
+	amestimateparallelscan_function amestimateparallelscan;		/* can be NULL */
+	aminitparallelscan_function aminitparallelscan;		/* can be NULL */
 	ambeginscan_function ambeginscan;
 	amrescan_function amrescan;
+	amparallelrescan_function amparallelrescan; /* can be NULL */
 	amgettuple_function amgettuple;		/* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 81907d5..4410ea3 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -83,6 +83,8 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 typedef struct IndexScanDescData *IndexScanDesc;
 typedef struct SysScanDescData *SysScanDesc;
 
+typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
+
 /*
  * Enumeration specifying the type of uniqueness check to perform in
  * index_insert().
@@ -131,16 +133,23 @@ extern bool index_insert(Relation indexRelation,
 			 Relation heapRelation,
 			 IndexUniqueCheck checkUnique);
 
+extern Size index_parallelscan_estimate(Relation indexrel, Snapshot snapshot);
+extern void index_parallelscan_initialize(Relation heaprel, Relation indexrel, Snapshot snapshot, ParallelIndexScanDesc target);
+extern IndexScanDesc index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
+						 int norderbys, ParallelIndexScanDesc pscan);
+
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
-				int nkeys, int norderbys);
+				ParallelIndexScanDesc pscan,
+				int nkeys, int norderbys, bool temp_snap);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 					   Snapshot snapshot,
 					   int nkeys);
 extern void index_rescan(IndexScanDesc scan,
 			 ScanKey keys, int nkeys,
 			 ScanKey orderbys, int norderbys);
+extern void index_parallelrescan(IndexScanDesc scan);
 extern void index_endscan(IndexScanDesc scan);
 extern void index_markpos(IndexScanDesc scan);
 extern void index_restrpos(IndexScanDesc scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..75a4c96 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -609,6 +609,8 @@ typedef struct BTScanOpaqueData
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
 								 * there are any unsatisfiable array keys) */
+	int			arrayKeyCount;	/* count indicating number of array scan keys
+								 * processed */
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
@@ -652,7 +654,25 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
 /*
- * prototypes for functions in nbtree.c (external entry points for btree)
+ * The flags below are used to indicate the state of a parallel scan.
+ *
+ * BTPARALLEL_NOT_INITIALIZED implies that the scan has not yet started.
+ *
+ * BTPARALLEL_ADVANCING implies that one of the workers or the backend is
+ * advancing the scan to a new page; others must wait.
+ *
+ * BTPARALLEL_IDLE implies that no backend is advancing the scan; someone can
+ * start doing it.
+ *
+ * BTPARALLEL_DONE implies that the scan is complete (including error exit).
+ */
+#define BTPARALLEL_NOT_INITIALIZED 0x01
+#define BTPARALLEL_ADVANCING 0x02
+#define BTPARALLEL_DONE 0x03
+#define BTPARALLEL_IDLE 0x04
+/*
+ * prototypes for functions in nbtree.c (external entry points for btree and
+ * functions to maintain state of parallel scan)
  */
 extern Datum bthandler(PG_FUNCTION_ARGS);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
@@ -662,10 +682,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 		 ItemPointer ht_ctid, Relation heapRel,
 		 IndexUniqueCheck checkUnique);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern Size btestimateparallelscan(Size size);
+extern void btinitparallelscan(void *target);
+extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber next_page);
+extern void _bt_parallel_done(IndexScanDesc scan);
+extern void _bt_parallel_advance_scan(IndexScanDesc scan);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		 ScanKey orderbys, int norderbys);
+extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index de98dd6..ca843f9 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -21,6 +21,9 @@
 #include "access/tupdesc.h"
 #include "storage/spin.h"
 
+#define OffsetToPointer(base, offset)\
+((void *)((char *)base + offset))
+
 /*
  * Shared state for parallel heap scan.
  *
@@ -126,8 +129,20 @@ typedef struct IndexScanDescData
 
 	/* state data for traversing HOT chains in index_getnext */
 	bool		xs_continue_hot;	/* T if must keep walking HOT chain */
+	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
+	ParallelIndexScanDesc parallel_scan;		/* parallel index scan
+												 * information */
 }	IndexScanDescData;
 
+/* Generic structure for parallel scans */
+typedef struct ParallelIndexScanDescData
+{
+	Oid			ps_relid;
+	Oid			ps_indexid;
+	Size		ps_offset;		/* Offset in bytes of am specific structure */
+	char		ps_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+} ParallelIndexScanDescData;
+
 /* Struct for heap-or-index scans of system tables */
 typedef struct SysScanDescData
 {
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1c9bf13..a147ff3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -784,6 +784,7 @@ typedef enum
 	WAIT_EVENT_MQ_RECEIVE,
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
+	WAIT_EVENT_PARALLEL_INDEX_SCAN,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP
 } WaitEventIPC;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 932be2f..8c2ed53 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -165,6 +165,7 @@ BTPageOpaque
 BTPageOpaqueData
 BTPageStat
 BTPageState
+BTParallelScanDesc
 BTScanOpaque
 BTScanOpaqueData
 BTScanPos
@@ -1259,6 +1260,8 @@ OverrideSearchPath
 OverrideStackEntry
 PACE_HEADER
 PACL
+ParallelIndexScanDesc
+ParallelIndexScanDescData
 PATH
 PBOOL
 PCtxtHandle
parallel_index_opt_exec_support_v1.patch (application/octet-stream)
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 8d4f9f7..33ea60b 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -110,6 +110,8 @@
   bool        amclusterable;
   /* does AM handle predicate locks? */
   bool        ampredlocks;
+  /* does AM support parallel scan? */
+  bool        amcanparallel;
   /* type of data stored in index, or InvalidOid if variable */
   Oid         amkeytype;
 
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 5ff6a0f..a13b5e7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -92,6 +92,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6f7024e..38dd077 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 5727ef9..d300545 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -69,6 +69,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 40ddda5..304656a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -66,6 +66,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = INT4OID;
 
 	amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 94fa368..cf2ed62 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -19,6 +19,7 @@
 #include "postgres.h"
 
 #include "access/nbtree.h"
+#include "access/parallel.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -117,6 +118,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
+	amroutine->amcanparallel = true;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 4f39e93..98c5a4e 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -48,6 +48,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = spgbuild;
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9b29f09 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -60,6 +60,7 @@
 
 static bool TargetListSupportsBackwardScan(List *targetlist);
 static bool IndexSupportsBackwardScan(Oid indexid);
+static bool GatherSupportsBackwardScan(Plan *node);
 
 
 /*
@@ -485,7 +486,7 @@ ExecSupportsBackwardScan(Plan *node)
 			return false;
 
 		case T_Gather:
-			return false;
+			return GatherSupportsBackwardScan(node);
 
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
@@ -566,6 +567,25 @@ IndexSupportsBackwardScan(Oid indexid)
 }
 
 /*
+ * GatherSupportsBackwardScan - does a Gather plan support backward scan?
+ *
+ * Returns true if the outer plan node of the Gather supports backward scan.
+ * As of now, we can support a backward scan only if the outer node of the
+ * Gather is an IndexScan node.
+ */
+static bool
+GatherSupportsBackwardScan(Plan *node)
+{
+	Plan	   *outer_node = outerPlan(node);
+
+	if (nodeTag(outer_node) == T_IndexScan)
+		return IndexSupportsBackwardScan(((IndexScan *) outer_node)->indexid) &&
+			TargetListSupportsBackwardScan(outer_node->targetlist);
+	else
+		return false;
+}
+
+/*
  * ExecMaterializesOutput - does a plan type materialize its output?
  *
  * Returns true if the plan node type is one that automatically materializes
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 5aa6f02..c55b7c4 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -28,6 +28,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeSeqscan.h"
+#include "executor/nodeIndexscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -193,6 +194,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 				ExecSeqScanEstimate((SeqScanState *) planstate,
 									e->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanEstimate((IndexScanState *) planstate,
+									  e->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanEstimate((ForeignScanState *) planstate,
 										e->pcxt);
@@ -245,6 +250,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
 										 d->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeDSM((IndexScanState *) planstate,
+										   d->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
 											 d->pcxt);
@@ -689,6 +698,9 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 			case T_SeqScanState:
 				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeWorker((IndexScanState *) planstate, toc);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
 												toc);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 77d2990..a7d4c25 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -22,6 +22,9 @@
  *		ExecEndIndexScan		releases all storage.
  *		ExecIndexMarkPos		marks scan position.
  *		ExecIndexRestrPos		restores scan position.
+ *		ExecIndexScanEstimate	estimates DSM space needed for parallel index scan
+ *		ExecIndexScanInitializeDSM initialize DSM for parallel indexscan
+ *		ExecIndexScanInitializeWorker attach to DSM info in parallel worker
  */
 #include "postgres.h"
 
@@ -515,6 +518,15 @@ ExecIndexScan(IndexScanState *node)
 void
 ExecReScanIndexScan(IndexScanState *node)
 {
+	bool		reset_parallel_scan = true;
+
+	/*
+	 * if we are here to just update the scan keys, then don't reset parallel
+	 * scan
+	 */
+	if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+		reset_parallel_scan = false;
+
 	/*
 	 * If we are doing runtime key calculations (ie, any of the index key
 	 * values weren't simple Consts), compute the new key values.  But first,
@@ -540,10 +552,16 @@ ExecReScanIndexScan(IndexScanState *node)
 			reorderqueue_pop(node);
 	}
 
-	/* reset index scan */
-	index_rescan(node->iss_ScanDesc,
-				 node->iss_ScanKeys, node->iss_NumScanKeys,
-				 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+	/* reset (parallel) index scan */
+	if (node->iss_ScanDesc)
+	{
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		if (reset_parallel_scan)
+			index_parallelrescan(node->iss_ScanDesc);
+	}
 	node->iss_ReachedEnd = false;
 
 	ExecScanReScan(&node->ss);
@@ -1018,22 +1036,29 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Initialize scan descriptor.
+	 * for parallel-aware node, we initialize the scan descriptor after
+	 * initializing the shared memory for parallel execution.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
-											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot, NULL,
-											   indexstate->iss_NumScanKeys,
+	if (!node->scan.plan.parallel_aware)
+	{
+		/*
+		 * Initialize scan descriptor.
+		 */
+		indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+												indexstate->iss_RelationDesc,
+												   estate->es_snapshot, NULL,
+												 indexstate->iss_NumScanKeys,
 									  indexstate->iss_NumOrderByKeys, false);
 
-	/*
-	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
-	 * index AM.
-	 */
-	if (indexstate->iss_NumRuntimeKeys == 0)
-		index_rescan(indexstate->iss_ScanDesc,
-					 indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
+		/*
+		 * If no run-time keys to calculate, go ahead and pass the scankeys to
+		 * the index AM.
+		 */
+		if (indexstate->iss_NumRuntimeKeys == 0)
+			index_rescan(indexstate->iss_ScanDesc,
+					   indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
 				indexstate->iss_OrderByKeys, indexstate->iss_NumOrderByKeys);
+	}
 
 	/*
 	 * all done.
@@ -1595,3 +1620,91 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 	else if (n_array_keys != 0)
 		elog(ERROR, "ScalarArrayOpExpr index qual found where not allowed");
 }
+
+/* ----------------------------------------------------------------
+ *						Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanEstimate
+ *
+ *		estimates the space required to serialize indexscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanEstimate(IndexScanState *node,
+					  ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+													 estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeDSM
+ *
+ *		Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeDSM(IndexScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+	index_parallelscan_initialize(node->ss.ss_currentRelation,
+								  node->iss_RelationDesc,
+								  estate->es_snapshot,
+								  piscan);
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeWorker
+ *
+ *		Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc)
+{
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 2a49639..f216c26 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -418,6 +418,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	List	   *qpquals;
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
+	Cost		cpu_run_cost = 0;
 	Cost		indexStartupCost;
 	Cost		indexTotalCost;
 	Selectivity indexSelectivity;
@@ -620,11 +621,33 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 
-	run_cost += cpu_per_tuple * tuples_fetched;
+	cpu_run_cost += cpu_per_tuple * tuples_fetched;
 
 	/* tlist eval costs are paid per output row, not per tuple scanned */
 	startup_cost += path->path.pathtarget->cost.startup;
-	run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+	cpu_run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+
+	/* Adjust costing for parallelism, if used. */
+	if (path->path.parallel_workers > 0)
+	{
+		double		parallel_divisor = path->path.parallel_workers;
+		double		leader_contribution;
+
+		/*
+		 * We divide only the cpu cost among workers and the division uses the
+		 * same formula as we use for seq scan.  See cost_seqscan.
+		 */
+		leader_contribution = 1.0 - (0.3 * path->path.parallel_workers);
+		if (leader_contribution > 0)
+			parallel_divisor += leader_contribution;
+
+		path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
+
+		/* The CPU cost is divided among all the workers. */
+		cpu_run_cost /= parallel_divisor;
+	}
+
+	run_cost += cpu_run_cost;
 
 	path->path.startup_cost = startup_cost;
 	path->path.total_cost = startup_cost + run_cost;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 2952bfb..54a8a2f 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include <limits.h>
 #include <math.h>
 
 #include "access/stratnum.h"
@@ -108,6 +109,8 @@ static bool bms_equal_any(Relids relids, List *relids_list);
 static void get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 				IndexOptInfo *index, IndexClauseSet *clauses,
 				List **bitindexpaths);
+static int index_path_get_workers(PlannerInfo *root, RelOptInfo *rel,
+					   IndexOptInfo *index);
 static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 				  IndexOptInfo *index, IndexClauseSet *clauses,
 				  bool useful_predicate,
@@ -811,6 +814,51 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 }
 
 /*
+ * index_path_get_workers
+ *	  Compute the number of parallel workers that can be used for an index scan
+ */
+static int
+index_path_get_workers(PlannerInfo *root, RelOptInfo *rel, IndexOptInfo *index)
+{
+	int			parallel_workers;
+	int			parallel_threshold;
+
+	/*
+	 * If this relation is too small to be worth a parallel scan, just return
+	 * without doing anything ... unless it's an inheritance child. In that
+	 * case, we want to generate a parallel path here anyway.  It might not be
+	 * worthwhile just for this relation, but when combined with all of its
+	 * inheritance siblings it may well pay off.
+	 */
+	if (index->pages < (BlockNumber) min_parallel_relation_size &&
+		rel->reloptkind == RELOPT_BASEREL)
+		return 0;
+
+	/*
+	 * Select the number of workers based on the log of the size of the
+	 * relation.  This probably needs to be a good deal more sophisticated,
+	 * but we need something here for now.  Note that the upper limit of the
+	 * min_parallel_relation_size GUC is chosen to prevent overflow here.
+	 */
+	parallel_workers = 1;
+	parallel_threshold = Max(min_parallel_relation_size, 1);
+	while (index->pages >= (BlockNumber) (parallel_threshold * 3))
+	{
+		parallel_workers++;
+		parallel_threshold *= 3;
+		if (parallel_threshold > INT_MAX / 3)
+			break;				/* avoid overflow */
+	}
+
+	/*
+	 * In no case use more than max_parallel_workers_per_gather workers.
+	 */
+	parallel_workers = Min(parallel_workers, max_parallel_workers_per_gather);
+
+	return parallel_workers;
+}
+
+/*
  * build_index_paths
  *	  Given an index and a set of index clauses for it, construct zero
  *	  or more IndexPaths.
@@ -1042,8 +1090,43 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  NoMovementScanDirection,
 								  index_only_scan,
 								  outer_relids,
-								  loop_count);
+								  loop_count,
+								  0);
 		result = lappend(result, ipath);
+
+		/*
+		 * If appropriate, consider parallel index scan.  We don't allow
+		 * parallel index scan for bitmap scans.
+		 */
+		if (index->amcanparallel &&
+			!index_only_scan &&
+			rel->consider_parallel &&
+			outer_relids == NULL &&
+			scantype != ST_BITMAPSCAN)
+		{
+			int			parallel_workers = 0;
+
+			parallel_workers = index_path_get_workers(root, rel, index);
+
+			if (parallel_workers > 0)
+			{
+				ipath = create_index_path(root, index,
+										  index_clauses,
+										  clause_columns,
+										  orderbyclauses,
+										  orderbyclausecols,
+										  useful_pathkeys,
+										  index_is_ordered ?
+										  ForwardScanDirection :
+										  NoMovementScanDirection,
+										  index_only_scan,
+										  outer_relids,
+										  loop_count,
+										  parallel_workers);
+
+				add_partial_path(rel, (Path *) ipath);
+			}
+		}
 	}
 
 	/*
@@ -1066,8 +1149,38 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  BackwardScanDirection,
 									  index_only_scan,
 									  outer_relids,
-									  loop_count);
+									  loop_count,
+									  0);
 			result = lappend(result, ipath);
+
+			/* If appropriate, consider parallel index scan */
+			if (index->amcanparallel &&
+				!index_only_scan &&
+				rel->consider_parallel &&
+				outer_relids == NULL &&
+				scantype != ST_BITMAPSCAN)
+			{
+				int			parallel_workers = 0;
+
+				parallel_workers = index_path_get_workers(root, rel, index);
+
+				if (parallel_workers > 0)
+				{
+					ipath = create_index_path(root, index,
+											  index_clauses,
+											  clause_columns,
+											  NIL,
+											  NIL,
+											  useful_pathkeys,
+											  BackwardScanDirection,
+											  index_only_scan,
+											  outer_relids,
+											  loop_count,
+											  parallel_workers);
+
+					add_partial_path(rel, (Path *) ipath);
+				}
+			}
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f657ffc..8fbde7c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5297,7 +5297,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	indexScanPath = create_index_path(root, indexInfo,
 									  NIL, NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
-									  NULL, 1.0);
+									  NULL, 1.0, 0);
 
 	return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index abb7507..0ac3b19 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -744,10 +744,8 @@ add_path_precheck(RelOptInfo *parent_rel,
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
  *	  such paths yet.  Hence, GatherPaths must not be created for a rel until
- *	  we're done creating all partial paths for it.  We do not currently build
- *	  partial indexscan paths, so there is no need for an exception for
- *	  IndexPaths here; for safety, we instead Assert that a path to be freed
- *	  isn't an IndexPath.
+ *	  we're done creating all partial paths for it.  As in add_path, we make
+ *	  an exception for IndexPaths here as well.
  */
 void
 add_partial_path(RelOptInfo *parent_rel, Path *new_path)
@@ -826,9 +824,12 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		{
 			parent_rel->partial_pathlist =
 				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
-			/* we should not see IndexPaths here, so always safe to delete */
-			Assert(!IsA(old_path, IndexPath));
-			pfree(old_path);
+
+			/*
+			 * Delete the data pointed-to by the deleted cell, if possible
+			 */
+			if (!IsA(old_path, IndexPath))
+				pfree(old_path);
 			/* p1_prev does not advance */
 		}
 		else
@@ -860,10 +861,9 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	}
 	else
 	{
-		/* we should not see IndexPaths here, so always safe to delete */
-		Assert(!IsA(new_path, IndexPath));
 		/* Reject and recycle the new path */
-		pfree(new_path);
+		if (!IsA(new_path, IndexPath))
+			pfree(new_path);
 	}
 }
 
@@ -1019,7 +1019,8 @@ create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count)
+				  double loop_count,
+				  int parallel_workers)
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
@@ -1031,9 +1032,9 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.pathtarget = rel->reltarget;
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
-	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_aware = parallel_workers > 0 ? true : false;
 	pathnode->path.parallel_safe = rel->consider_parallel;
-	pathnode->path.parallel_workers = 0;
+	pathnode->path.parallel_workers = parallel_workers;
 	pathnode->path.pathkeys = pathkeys;
 
 	/* Convert clauses to indexquals the executor can handle */
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 5d18206..400091a 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -239,6 +239,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->amoptionalkey = amroutine->amoptionalkey;
 			info->amsearcharray = amroutine->amsearcharray;
 			info->amsearchnulls = amroutine->amsearchnulls;
+			info->amcanparallel = amroutine->amcanparallel;
 			info->amhasgettuple = (amroutine->amgettuple != NULL);
 			info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
 			info->amcostestimate = amroutine->amcostestimate;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index e777678..baca400 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -184,6 +184,8 @@ typedef struct IndexAmRoutine
 	bool		amclusterable;
 	/* does AM handle predicate locks? */
 	bool		ampredlocks;
+	/* does AM support parallel scan? */
+	bool		amcanparallel;
 	/* type of data stored in index, or InvalidOid if variable */
 	Oid			amkeytype;
 
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..e03d591 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -14,6 +14,7 @@
 #ifndef NODEINDEXSCAN_H
 #define NODEINDEXSCAN_H
 
+#include "access/parallel.h"
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
@@ -22,6 +23,9 @@ extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
 extern void ExecReScanIndexScan(IndexScanState *node);
+extern void ExecIndexScanEstimate(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeDSM(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc);
 
 /*
  * These routines are exported to share code with nodeIndexonlyscan.c and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4fa3661..78fdece 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1325,6 +1325,7 @@ typedef struct
  *		SortSupport		   for reordering ORDER BY exprs
  *		OrderByTypByVals   is the datatype of order by expression pass-by-value?
  *		OrderByTypLens	   typlens of the datatypes of order by expressions
+ *		pscan_len		   size of parallel index scan descriptor
  * ----------------
  */
 typedef struct IndexScanState
@@ -1351,6 +1352,9 @@ typedef struct IndexScanState
 	SortSupport iss_SortSupport;
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
+
+	/* This is needed for parallel index scan */
+	Size		iss_PscanLen;
 } IndexScanState;
 
 /* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 3a1255a..7e4b475 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -622,6 +622,7 @@ typedef struct IndexOptInfo
 	bool		amsearchnulls;	/* can AM search for NULL/NOT NULL entries? */
 	bool		amhasgettuple;	/* does AM have amgettuple interface? */
 	bool		amhasgetbitmap; /* does AM have amgetbitmap interface? */
+	bool		amcanparallel;	/* does AM support parallel scan? */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
 	void		(*amcostestimate) ();	/* AM's cost estimator */
 } IndexOptInfo;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 71d9154..206fe0a 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -47,7 +47,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count);
+				  double loop_count,
+				  int parallel_workers);
 extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						Path *bitmapqual,
#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#1)
Re: Parallel Index Scans

On Thu, Oct 13, 2016 at 8:48 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

As of now, the driving table for parallel query is accessed by
parallel sequential scan which limits its usage to a certain degree.
Parallelising index scans would further increase the usage of parallel
query in many more cases. This patch enables the parallelism for the
btree scans. Supporting parallel index scan for other index types
like hash, gist, spgist can be done as separate patches.

I would like to get input on the method of selecting parallel workers
for scanning an index.  Currently the patch selects the number of
workers based on the size of the index relation, and the upper limit of
parallel workers is max_parallel_workers_per_gather.  This is quite
similar to what we do for parallel sequential scan, except that for a
parallel seq. scan we use the parallel_workers option if it was
provided by the user during Create Table.  The user can provide the
parallel_workers option as below:

Create Table .... With (parallel_workers = 4);

Is it desirable to have a similar option for parallel index scans, and
if yes, what should the interface be?  One possible way could be to
allow the user to provide it during Create Index as below:

Create Index .... With (parallel_workers = 4);

If the above syntax looks sensible, then we might need to think about
what should be used for parallel index builds.  It seems to me that the
parallel tuple sort patch [1] proposed by Peter G. is using the above
syntax for getting the parallel workers input from the user for
parallel index builds.

Another point which needs some thought is whether it is a good idea to
use the index relation size to calculate the number of parallel workers
for an index scan.  I think ideally for index scans it should be based
on the number of pages to be fetched/scanned from the index.

[1]: /messages/by-id/CAM3SWZTmkOFEiCDpUNaO4n9-1xcmWP-1NXmT7h0Pu3gM2YuHvg@mail.gmail.com
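
To illustrate the last point, here is a rough sketch (not part of the
attached patches; the function name and the estimated_index_pages input
are hypothetical, and it assumes the usual planner headers plus
<limits.h>) of how the same log-based heuristic could be driven by the
estimated number of index pages to be fetched rather than by the total
index size:

static int
index_path_get_workers_from_estimate(double estimated_index_pages)
{
	int			parallel_workers = 1;
	int			parallel_threshold = Max(min_parallel_relation_size, 1);

	/* Too few pages expected to be fetched to justify parallelism. */
	if (estimated_index_pages < parallel_threshold)
		return 0;

	/* Same power-of-three progression as in the patch, but on the estimate. */
	while (estimated_index_pages >= parallel_threshold * 3)
	{
		parallel_workers++;
		parallel_threshold *= 3;
		if (parallel_threshold > INT_MAX / 3)
			break;				/* avoid overflow */
	}

	return Min(parallel_workers, max_parallel_workers_per_gather);
}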

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#3Rahila Syed
rahilasyed90@gmail.com
In reply to: Amit Kapila (#2)
Re: Parallel Index Scans

Another point which needs some thought is whether it is a good idea to
use the index relation size to calculate the number of parallel workers
for an index scan.  I think ideally for index scans it should be based
on the number of pages to be fetched/scanned from the index.

IIUC, it's not possible to know the exact number of pages scanned from
an index in advance.
What we are essentially making parallel is the scan of the leaf pages,
so it would make sense to base the number of workers on the number of
leaf pages.
Having said that, I think it will not make much difference compared to
the existing method, because currently the total number of index pages
is used to calculate the number of workers.  As far as I understand, in
large indexes the difference between the number of leaf pages and the
total number of pages is not significant; in other words, internal
pages form a small fraction of the total pages.
Also, the calculation is based on the log of the number of pages, so it
will make even less difference.
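
As a rough worked example (the figures are purely illustrative, assuming
a leaf fanout of a few hundred and the default min_parallel_relation_size
of 1024 pages): an index of 100,000 total pages gives 100,000 / 1024 ~ 98,
which lies between 3^4 = 81 and 3^5 = 243, so the patch's heuristic picks
5 workers; using only its roughly 99,700 leaf pages gives ~ 97, which
falls in the same power-of-three bucket and hence the same 5 workers.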

Thank you,
Rahila Syed


#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Rahila Syed (#3)
Re: Parallel Index Scans

On Tue, Oct 18, 2016 at 4:08 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:

Another point which needs some thought is whether it is a good idea to
use the index relation size to calculate the number of parallel workers
for an index scan.  I think ideally for index scans it should be based
on the number of pages to be fetched/scanned from the index.

IIUC, it's not possible to know the exact number of pages scanned from
an index in advance.

We can't find the exact number of index pages to be scanned, but I
think we can find the estimated number of pages to be fetched (refer to
cost_index).
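
For illustration, a rough sketch of the kind of estimate that costing
path already produces (modelled on genericcostestimate; the function
name and the exact clamping are illustrative, not taken from the patch,
and it assumes the planner headers plus <math.h>):

static double
estimate_index_pages_fetched(IndexOptInfo *index, Selectivity indexSelectivity)
{
	/* Expected number of index tuples the scan will visit. */
	double		numIndexTuples = indexSelectivity * index->rel->tuples;

	/* Convert tuples to pages, assuming uniform tuple density. */
	if (index->pages > 1 && index->tuples > 1)
		return ceil(numIndexTuples * index->pages / index->tuples);

	return 1.0;
}

Such an estimate could then be fed into the same log-based worker
calculation that the patch currently applies to index->pages.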

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#5Peter Geoghegan
pg@heroku.com
In reply to: Amit Kapila (#2)
Re: Parallel Index Scans

On Mon, Oct 17, 2016 at 8:08 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Create Index .... With (parallel_workers = 4);

If the above syntax looks sensible, then we might need to think about
what should be used for parallel index builds.  It seems to me that the
parallel tuple sort patch [1] proposed by Peter G. is using the above
syntax for getting the parallel workers input from the user for
parallel index builds.

Apparently you see a similar issue with other major database systems,
where similar storage parameter things are kind of "overloaded" like
this (they are used by both index creation, and by the optimizer in
considering whether it should use a parallel index scan). That can be
a kind of a gotcha for their users, but maybe it's still worth it. In
any case, the complaints I saw about that were from users who used
parallel CREATE INDEX with the equivalent of my parallel_workers index
storage parameter, and then unexpectedly found this also forced the
use of parallel index scan. Not the other way around.

Ideally, the parallel_workers storage parameter will rarely be
necessary because the optimizer will generally do the right thing in
all cases.

--
Peter Geoghegan


#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Geoghegan (#5)
Re: Parallel Index Scans

On Thu, Oct 20, 2016 at 7:39 AM, Peter Geoghegan <pg@heroku.com> wrote:

On Mon, Oct 17, 2016 at 8:08 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Create Index .... With (parallel_workers = 4);

If the above syntax looks sensible, then we might need to think about
what should be used for parallel index builds.  It seems to me that the
parallel tuple sort patch [1] proposed by Peter G. is using the above
syntax for getting the parallel workers input from the user for
parallel index builds.

Apparently you see a similar issue with other major database systems,
where similar storage parameter things are kind of "overloaded" like
this (they are used by both index creation, and by the optimizer in
considering whether it should use a parallel index scan). That can be
a kind of a gotcha for their users, but maybe it's still worth it.

I have also checked and found that you are right.  In SQL Server, they
use the max degree of parallelism (MAXDOP) parameter, which I think is
common to all SQL statements.

In
any case, the complaints I saw about that were from users who used
parallel CREATE INDEX with the equivalent of my parallel_workers index
storage parameter, and then unexpectedly found this also forced the
use of parallel index scan. Not the other way around.

I can understand that it can be confusing to users, so another option
could be to provide separate parameters like parallel_workers_build
and parallel_workers, where the first can be used for index builds and
the second for scans.  My personal opinion is to have one parameter,
so that users have one less thing to learn about parallelism.

Ideally, the parallel_workers storage parameter will rarely be
necessary because the optimizer will generally do the right thing in
all cases.

Yeah, we can choose not to provide any parameter for parallel index
scans, but some users might want to have a parameter similar to
parallel table scans, so it could be handy for them to use.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#7Peter Geoghegan
pg@heroku.com
In reply to: Amit Kapila (#6)
Re: Parallel Index Scans

On Wed, Oct 19, 2016 at 8:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I have also checked and found that you are right.  In SQL Server, they
use the max degree of parallelism (MAXDOP) parameter, which I think is
common to all SQL statements.

It's not just that one that does things this way, for what it's worth.

I can understand that it can be confusing to users, so another option
could be to provide separate parameters like parallel_workers_build
and parallel_workers, where the first can be used for index builds and
the second for scans.  My personal opinion is to have one parameter,
so that users have one less thing to learn about parallelism.

That's my first instinct too, but I don't really have an opinion yet.

I think that this is the kind of thing where it could make sense to
take a "wait and see" approach, and then make a firm decision
immediately prior to beta. This is what we did in deciding the name of
and fine details around what ultimately became the
max_parallel_workers_per_gather GUC (plus related GUCs and storage
parameters).

--
Peter Geoghegan


#8Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#6)
Re: Parallel Index Scans

On Wed, Oct 19, 2016 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Ideally, the parallel_workers storage parameter will rarely be
necessary because the optimizer will generally do the right thing in
all cases.

Yeah, we can choose not to provide any parameter for parallel index
scans, but some users might want to have a parameter similar to
parallel table scans, so it could be handy for them to use.

I think the parallel_workers reloption should override the degree of
parallelism for any sort of parallel scan on that table. Had I
intended it to apply only to sequential scans, I would have named it
differently.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#8)
Re: Parallel Index Scans

On Thu, Oct 20, 2016 at 10:33 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Oct 19, 2016 at 11:07 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Ideally, the parallel_workers storage parameter will rarely be
necessary because the optimizer will generally do the right thing in
all cases.

Yeah, we can choose not to provide any parameter for parallel index
scans, but some users might want to have a parameter similar to
parallel table scans, so it could be handy for them to use.

I think the parallel_workers reloption should override the degree of
parallelism for any sort of parallel scan on that table. Had I
intended it to apply only to sequential scans, I would have named it
differently.

I think there is a big difference in the size of the relation to scan
between a parallel sequential scan and a parallel (range) index scan,
which could make it difficult for the user to choose the value of this
parameter.  Why do you think that the parallel_workers reloption should
suffice for all types of scans on a table?  I could only think of
providing it on the grounds that fewer config knobs make life easier.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#10Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#9)
Re: Parallel Index Scans

On Fri, Oct 21, 2016 at 9:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the parallel_workers reloption should override the degree of
parallelism for any sort of parallel scan on that table. Had I
intended it to apply only to sequential scans, I would have named it
differently.

I think there is a big difference in the size of the relation to scan
between a parallel sequential scan and a parallel (range) index scan,
which could make it difficult for the user to choose the value of this
parameter.  Why do you think that the parallel_workers reloption should
suffice for all types of scans on a table?  I could only think of
providing it on the grounds that fewer config knobs make life easier.

Well, we could do that, but it would be fairly complicated and it
doesn't seem to me to be the right place to focus our efforts. I'd
rather try to figure out some way to make the planner smarter, because
even if users can override the number of workers on a
per-table-per-scan-type basis, they're probably still going to find
using parallel query pretty frustrating unless we make the
number-of-workers formula smarter than it is today. Anyway, even if
we do decide to add more reloptions than just parallel_degree someday,
couldn't that be left for a separate patch?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#11Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#10)
Re: Parallel Index Scans

On Fri, Oct 21, 2016 at 10:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Oct 21, 2016 at 9:27 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the parallel_workers reloption should override the degree of
parallelism for any sort of parallel scan on that table. Had I
intended it to apply only to sequential scans, I would have named it
differently.

I think there is a big difference in the size of the relation to scan
between a parallel sequential scan and a parallel (range) index scan,
which could make it difficult for the user to choose the value of this
parameter.  Why do you think that the parallel_workers reloption should
suffice for all types of scans on a table?  I could only think of
providing it on the grounds that fewer config knobs make life easier.

Well, we could do that, but it would be fairly complicated and it
doesn't seem to me to be the right place to focus our efforts. I'd
rather try to figure out some way to make the planner smarter, because
even if users can override the number of workers on a
per-table-per-scan-type basis, they're probably still going to find
using parallel query pretty frustrating unless we make the
number-of-workers formula smarter than it is today. Anyway, even if
we do decide to add more reloptions than just parallel_degree someday,
couldn't that be left for a separate patch?

That makes sense to me.  As of now, the patch doesn't consider
reloptions for parallel index scans.  So, I think we can leave it as it
is and later, in a separate patch, decide whether the table's reloption
or a separate reloption for the index would be the better choice.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#11)
2 attachment(s)
Re: Parallel Index Scans

On Sat, Oct 22, 2016 at 9:07 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Oct 21, 2016 at 10:55 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I have rebased the patch (parallel_index_scan_v2) on top of the latest
commit e8ac886c (condition variables).  I have removed the usage of
ConditionVariablePrepareToSleep as it is no longer mandatory.  I have
also updated the docs for the wait event introduced by this patch
(thanks to Dilip for noticing it).  There is no change in the
parallel_index_opt_exec_support patch, but I am attaching it here as
well for easier reference.
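
For reference, a minimal sketch of the resulting wait pattern
(illustrative only, not the patch's actual btree code; BTPARALLEL_ADVANCING
is an assumed state name and btscan an assumed local variable pointing to
the shared BTParallelScanDescData).  Since ConditionVariableSleep now
prepares the sleep itself on first use, a worker can simply loop:

	for (;;)
	{
		bool		advancing;

		SpinLockAcquire(&btscan->ps_mutex);
		advancing = (btscan->ps_pageStatus == BTPARALLEL_ADVANCING);
		SpinLockRelease(&btscan->ps_mutex);

		if (!advancing)
			break;

		ConditionVariableSleep(&btscan->cv, WAIT_EVENT_PARALLEL_INDEX_SCAN);
	}
	ConditionVariableCancelSleep();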

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_index_scan_v2.patch (application/octet-stream)
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 40f201b..8d4f9f7 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -77,61 +77,64 @@
   <para>
    The structure <structname>IndexAmRoutine</structname> is defined thus:
 <programlisting>
-typedef struct IndexAmRoutine
-{
-    NodeTag     type;
-
-    /*
-     * Total number of strategies (operators) by which we can traverse/search
-     * this AM.  Zero if AM does not have a fixed set of strategy assignments.
-     */
-    uint16      amstrategies;
-    /* total number of support functions that this AM uses */
-    uint16      amsupport;
-    /* does AM support ORDER BY indexed column's value? */
-    bool        amcanorder;
-    /* does AM support ORDER BY result of an operator on indexed column? */
-    bool        amcanorderbyop;
-    /* does AM support backward scanning? */
-    bool        amcanbackward;
-    /* does AM support UNIQUE indexes? */
-    bool        amcanunique;
-    /* does AM support multi-column indexes? */
-    bool        amcanmulticol;
-    /* does AM require scans to have a constraint on the first index column? */
-    bool        amoptionalkey;
-    /* does AM handle ScalarArrayOpExpr quals? */
-    bool        amsearcharray;
-    /* does AM handle IS NULL/IS NOT NULL quals? */
-    bool        amsearchnulls;
-    /* can index storage data type differ from column data type? */
-    bool        amstorage;
-    /* can an index of this type be clustered on? */
-    bool        amclusterable;
-    /* does AM handle predicate locks? */
-    bool        ampredlocks;
-    /* type of data stored in index, or InvalidOid if variable */
-    Oid         amkeytype;
-
-    /* interface functions */
-    ambuild_function ambuild;
-    ambuildempty_function ambuildempty;
-    aminsert_function aminsert;
-    ambulkdelete_function ambulkdelete;
-    amvacuumcleanup_function amvacuumcleanup;
-    amcanreturn_function amcanreturn;   /* can be NULL */
-    amcostestimate_function amcostestimate;
-    amoptions_function amoptions;
-    amproperty_function amproperty;     /* can be NULL */
-    amvalidate_function amvalidate;
-    ambeginscan_function ambeginscan;
-    amrescan_function amrescan;
-    amgettuple_function amgettuple;     /* can be NULL */
-    amgetbitmap_function amgetbitmap;   /* can be NULL */
-    amendscan_function amendscan;
-    ammarkpos_function ammarkpos;       /* can be NULL */
-    amrestrpos_function amrestrpos;     /* can be NULL */
-} IndexAmRoutine;
+  typedef struct IndexAmRoutine
+  {
+  NodeTag     type;
+
+  /*
+  * Total number of strategies (operators) by which we can traverse/search
+  * this AM.  Zero if AM does not have a fixed set of strategy assignments.
+  */
+  uint16      amstrategies;
+  /* total number of support functions that this AM uses */
+  uint16      amsupport;
+  /* does AM support ORDER BY indexed column's value? */
+  bool        amcanorder;
+  /* does AM support ORDER BY result of an operator on indexed column? */
+  bool        amcanorderbyop;
+  /* does AM support backward scanning? */
+  bool        amcanbackward;
+  /* does AM support UNIQUE indexes? */
+  bool        amcanunique;
+  /* does AM support multi-column indexes? */
+  bool        amcanmulticol;
+  /* does AM require scans to have a constraint on the first index column? */
+  bool        amoptionalkey;
+  /* does AM handle ScalarArrayOpExpr quals? */
+  bool        amsearcharray;
+  /* does AM handle IS NULL/IS NOT NULL quals? */
+  bool        amsearchnulls;
+  /* can index storage data type differ from column data type? */
+  bool        amstorage;
+  /* can an index of this type be clustered on? */
+  bool        amclusterable;
+  /* does AM handle predicate locks? */
+  bool        ampredlocks;
+  /* type of data stored in index, or InvalidOid if variable */
+  Oid         amkeytype;
+
+  /* interface functions */
+  ambuild_function ambuild;
+  ambuildempty_function ambuildempty;
+  aminsert_function aminsert;
+  ambulkdelete_function ambulkdelete;
+  amvacuumcleanup_function amvacuumcleanup;
+  amcanreturn_function amcanreturn;   /* can be NULL */
+  amcostestimate_function amcostestimate;
+  amoptions_function amoptions;
+  amproperty_function amproperty;     /* can be NULL */
+  amvalidate_function amvalidate;
+  amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
+  aminitparallelscan_function aminitparallelscan;    /* can be NULL */
+  ambeginscan_function ambeginscan;
+  amrescan_function amrescan;
+  amparallelrescan_function amparallelrescan;    /* can be NULL */
+  amgettuple_function amgettuple;     /* can be NULL */
+  amgetbitmap_function amgetbitmap;   /* can be NULL */
+  amendscan_function amendscan;
+  ammarkpos_function ammarkpos;       /* can be NULL */
+  amrestrpos_function amrestrpos;     /* can be NULL */
+  } IndexAmRoutine;
 </programlisting>
   </para>
 
@@ -459,6 +462,36 @@ amvalidate (Oid opclassoid);
    invalid.  Problems should be reported with <function>ereport</> messages.
   </para>
 
+  <para>
+<programlisting>
+Size
+amestimateparallelscan (Size index_size);
+</programlisting>
+    Estimate and return the storage space required for a parallel index scan.
+    The <literal>index_size</> parameter indicates the size of the generic
+    parallel index scan information.  The size of the index-type-specific
+    parallel information is added to <literal>index_size</> and the result is
+    returned to the caller.
+  </para>
+
+  <para>
+<programlisting>
+void
+aminitparallelscan (void *target);
+</programlisting>
+   Initialize the parallel index scan state.  It is used to initialize the
+   index-type-specific parallel information, which is stored immediately
+   after the generic parallel information required for parallel index scans.
+   The required state information will be set in <literal>target</>.
+  </para>
+
+   <para>
+     The <function>aminitparallelscan</> and <function>amestimateparallelscan</>
+     functions need only be provided if the access method supports <quote>parallel</>
+     index scans.  If it doesn't, the <structfield>aminitparallelscan</> and
+     <structfield>amestimateparallelscan</> fields in its <structname>IndexAmRoutine</>
+     struct must be set to NULL.
+   </para>
 
   <para>
    The purpose of an index, of course, is to support scans for tuples matching
@@ -511,6 +544,23 @@ amrescan (IndexScanDesc scan,
 
   <para>
 <programlisting>
+void
+amparallelrescan (IndexScanDesc scan);
+</programlisting>
+   Restart the parallel index scan.  It resets the parallel index scan state.
+   It must be called only when the scan is restarted, which is typically
+   required for the inner side of a nestloop join.
+  </para>
+
+  <para>
+   The <function>amparallelrescan</> function need only be provided if the
+   access method supports <quote>parallel</> index scans.  If it doesn't,
+   the <structfield>amparallelrescan</> field in its <structname>IndexAmRoutine</>
+   struct must be set to NULL.
+  </para>
+
+  <para>
+<programlisting>
 boolean
 amgettuple (IndexScanDesc scan,
             ScanDirection direction);
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3de489e..7b57190 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1201,7 +1201,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting in an extension.</entry>
         </row>
         <row>
-         <entry morerows="9"><literal>IPC</></entry>
+         <entry morerows="10"><literal>IPC</></entry>
          <entry><literal>BgWorkerShutdown</></entry>
          <entry>Waiting for background worker to shut down.</entry>
         </row>
@@ -1234,6 +1234,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting for parallel workers to finish computing.</entry>
         </row>
         <row>
+         <entry><literal>ParallelIndexScan</></entry>
+         <entry>Waiting for next block to be available for parallel index scan.</entry>
+        </row>
+        <row>
          <entry><literal>SafeSnapshot</></entry>
          <entry>Waiting for a snapshot for a <literal>READ ONLY DEFERRABLE</> transaction.</entry>
         </row>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1b45a4c..5ff6a0f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -104,8 +104,11 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = brinoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = brinvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = brinbeginscan;
 	amroutine->amrescan = brinrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = bringetbitmap;
 	amroutine->amendscan = brinendscan;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index f07eedc..6f7024e 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,8 +61,11 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = ginoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = ginvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = ginbeginscan;
 	amroutine->amrescan = ginrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = gingetbitmap;
 	amroutine->amendscan = ginendscan;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b8aa9bc..5727ef9 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -81,8 +81,11 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = gistoptions;
 	amroutine->amproperty = gistproperty;
 	amroutine->amvalidate = gistvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = gistbeginscan;
 	amroutine->amrescan = gistrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = gistgettuple;
 	amroutine->amgetbitmap = gistgetbitmap;
 	amroutine->amendscan = gistendscan;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e3b1eef..40ddda5 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -78,8 +78,11 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = hashoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = hashvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = hashbeginscan;
 	amroutine->amrescan = hashrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = hashgettuple;
 	amroutine->amgetbitmap = hashgetbitmap;
 	amroutine->amendscan = hashendscan;
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 65c941d..84ebd72 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -375,7 +375,7 @@ systable_beginscan(Relation heapRelation,
 		}
 
 		sysscan->iscan = index_beginscan(heapRelation, irel,
-										 snapshot, nkeys, 0);
+										 snapshot, NULL, nkeys, 0, false);
 		index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 		sysscan->scan = NULL;
 	}
@@ -577,7 +577,7 @@ systable_beginscan_ordered(Relation heapRelation,
 	}
 
 	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
-									 snapshot, nkeys, 0);
+									 snapshot, NULL, nkeys, 0, false);
 	index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 	sysscan->scan = NULL;
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 54b71cb..72e1b03 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -207,6 +207,70 @@ index_insert(Relation indexRelation,
 }
 
 /*
+ * index_parallelscan_estimate - estimate storage for ParallelIndexScanDesc
+ *
+ * It calls the AM-specific routine to obtain the size of the AM-specific
+ * shared information.
+ */
+Size
+index_parallelscan_estimate(Relation indexrel, Snapshot snapshot)
+{
+	Size		index_size = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
+									  EstimateSnapshotSpace(snapshot));
+
+	/* amestimateparallelscan is optional; assume no-op if not provided by AM */
+	if (indexrel->rd_amroutine->amestimateparallelscan == NULL)
+		return index_size;
+	else
+		return indexrel->rd_amroutine->amestimateparallelscan(index_size);
+}
+
+/*
+ * index_parallelscan_initialize - initialize ParallelIndexScanDesc
+ *
+ * It calls the access-method-specific initialization routine to initialize
+ * the AM-specific information.  Call this just once in the leader process;
+ * then, individual workers attach via index_beginscan_parallel.
+ */
+void
+index_parallelscan_initialize(Relation heaprel, Relation indexrel, Snapshot snapshot, ParallelIndexScanDesc target)
+{
+	Size		offset = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
+								  EstimateSnapshotSpace(snapshot));
+	void	   *amtarget = (char *) ((void *) target) + offset;
+
+	target->ps_relid = RelationGetRelid(heaprel);
+	target->ps_indexid = RelationGetRelid(indexrel);
+	target->ps_offset = offset;
+	SerializeSnapshot(snapshot, target->ps_snapshot_data);
+
+	/* aminitparallelscan is optional; assume no-op if not provided by AM */
+	if (indexrel->rd_amroutine->aminitparallelscan == NULL)
+		return;
+
+	indexrel->rd_amroutine->aminitparallelscan(amtarget);
+}
+
+/*
+ * index_beginscan_parallel - join parallel index scan
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
+						 int norderbys, ParallelIndexScanDesc pscan)
+{
+	Snapshot	snapshot;
+	IndexScanDesc scan;
+
+	Assert(RelationGetRelid(heaprel) == pscan->ps_relid);
+	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
+	RegisterSnapshot(snapshot);
+	scan = index_beginscan(heaprel, indexrel, snapshot, pscan, nkeys, norderbys, true);
+	return scan;
+}
+
+/*
  * index_beginscan - start a scan of an index with amgettuple
  *
  * Caller must be holding suitable locks on the heap and the index.
@@ -214,8 +278,8 @@ index_insert(Relation indexRelation,
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
-				Snapshot snapshot,
-				int nkeys, int norderbys)
+				Snapshot snapshot, ParallelIndexScanDesc pscan,
+				int nkeys, int norderbys, bool temp_snap)
 {
 	IndexScanDesc scan;
 
@@ -227,6 +291,8 @@ index_beginscan(Relation heapRelation,
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->parallel_scan = pscan;
+	scan->xs_temp_snap = temp_snap;
 
 	return scan;
 }
@@ -251,6 +317,8 @@ index_beginscan_bitmap(Relation indexRelation,
 	 * up by RelationGetIndexScan.
 	 */
 	scan->xs_snapshot = snapshot;
+	scan->parallel_scan = NULL;
+	scan->xs_temp_snap = false;
 
 	return scan;
 }
@@ -319,6 +387,22 @@ index_rescan(IndexScanDesc scan,
 }
 
 /* ----------------
+ *		index_parallelrescan  - (re)start a parallel scan of an index
+ * ----------------
+ */
+void
+index_parallelrescan(IndexScanDesc scan)
+{
+	SCAN_CHECKS;
+
+	/* amparallelrescan is optional; assume no-op if not provided by AM */
+	if (scan->indexRelation->rd_amroutine->amparallelrescan == NULL)
+		return;
+
+	scan->indexRelation->rd_amroutine->amparallelrescan(scan);
+}
+
+/* ----------------
  *		index_endscan - end a scan
  * ----------------
  */
@@ -341,6 +425,9 @@ index_endscan(IndexScanDesc scan)
 	/* Release index refcount acquired by index_beginscan */
 	RelationDecrementReferenceCount(scan->indexRelation);
 
+	if (scan->xs_temp_snap)
+		UnregisterSnapshot(scan->xs_snapshot);
+
 	/* Release the scan data structure itself */
 	IndexScanEnd(scan);
 }
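
To make the new indexam.c entry points easier to follow, here is a condensed
sketch of the intended call sequence (illustration only; the shm_toc
bookkeeping and error handling live in the executor support patch below):

    /* Leader, while setting up shared memory for the parallel query: */
    Size        len = index_parallelscan_estimate(indexrel, snapshot);
    ParallelIndexScanDesc pscan = shm_toc_allocate(pcxt->toc, len);

    index_parallelscan_initialize(heaprel, indexrel, snapshot, pscan);

    /* Leader and each worker then join the same scan: */
    IndexScanDesc scan = index_beginscan_parallel(heaprel, indexrel,
                                                  nkeys, norderbys, pscan);
    index_rescan(scan, scankeys, nkeys, NULL, 0);
    /* ... fetch tuples with index_getnext() as usual ... */
    index_endscan(scan);
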
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 128744c..d1d09b8 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,8 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "pgstat.h"
+#include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -62,6 +64,23 @@ typedef struct
 	MemoryContext pagedelcontext;
 } BTVacState;
 
+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+	BlockNumber ps_nextPage;	/* latest or next page to be scanned */
+	uint8		ps_pageStatus;	/* state of scan, see below */
+	int			ps_arrayKeyCount;		/* count indicating number of array
+										 * scan keys processed by parallel
+										 * scan */
+	slock_t		ps_mutex;		/* protects above variables */
+	ConditionVariable cv;		/* used to synchronize parallel scan */
+}	BTParallelScanDescData;
+
+typedef struct BTParallelScanDescData *BTParallelScanDesc;
+
 
 static void btbuildCallback(Relation index,
 				HeapTuple htup,
@@ -110,8 +129,11 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = btoptions;
 	amroutine->amproperty = btproperty;
 	amroutine->amvalidate = btvalidate;
+	amroutine->amestimateparallelscan = btestimateparallelscan;
+	amroutine->aminitparallelscan = btinitparallelscan;
 	amroutine->ambeginscan = btbeginscan;
 	amroutine->amrescan = btrescan;
+	amroutine->amparallelrescan = btparallelrescan;
 	amroutine->amgettuple = btgettuple;
 	amroutine->amgetbitmap = btgetbitmap;
 	amroutine->amendscan = btendscan;
@@ -462,6 +484,171 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 }
 
 /*
+ * btestimateparallelscan - estimate storage for BTParallelScanDescData
+ */
+Size
+btestimateparallelscan(Size index_size)
+{
+	return add_size(index_size, sizeof(BTParallelScanDescData));
+}
+
+/*
+ * btinitparallelscan - initialize BTParallelScanDesc for a parallel btree scan
+ */
+void
+btinitparallelscan(void *target)
+{
+	BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
+
+	SpinLockInit(&bt_target->ps_mutex);
+	bt_target->ps_nextPage = InvalidBlockNumber;
+	bt_target->ps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->ps_arrayKeyCount = 0;
+	ConditionVariableInit(&bt_target->cv);
+}
+
+/*
+ * _bt_parallel_seize() -- returns the next block to be scanned for forward
+ *		scans, or the latest block scanned for backward scans.
+ *
+ *	status - True indicates that the returned block number is either valid
+ *			 (the scan is continuing) or InvalidBlockNumber (the scan has just
+ *			 begun).  False indicates that we have reached the end of the scan
+ *			 for the current scankeys, in which case P_NONE is returned.
+ *
+ * Callers should ignore the return value if *status is false.
+ */
+BlockNumber
+_bt_parallel_seize(IndexScanDesc scan, bool *status)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	uint8		pageStatus;
+	bool		exit_loop = false;
+	BlockNumber nextPage = InvalidBlockNumber;
+	BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
+												(void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+	*status = true;
+	while (1)
+	{
+		SpinLockAcquire(&btscan->ps_mutex);
+		pageStatus = btscan->ps_pageStatus;
+		if (so->arrayKeyCount < btscan->ps_arrayKeyCount)
+			*status = false;
+		else if (pageStatus == BTPARALLEL_DONE)
+			*status = false;
+		else if (pageStatus != BTPARALLEL_ADVANCING)
+		{
+			btscan->ps_pageStatus = BTPARALLEL_ADVANCING;
+			nextPage = btscan->ps_nextPage;
+			exit_loop = true;
+		}
+		SpinLockRelease(&btscan->ps_mutex);
+		if (exit_loop || *status == false)
+			break;
+		ConditionVariableSleep(&btscan->cv, WAIT_EVENT_PARALLEL_INDEX_SCAN);
+	}
+	ConditionVariableCancelSleep();
+
+	/* no more pages to scan */
+	if (*status == false)
+		return P_NONE;
+
+	*status = true;
+	return nextPage;
+}
+
+/*
+ * _bt_parallel_release() -- Advances the parallel scan to allow scan of next
+ *		page
+ *
+ * Updates the shared next-page value so that the parallel scan can move
+ * forward or backward, depending on scan direction, and wakes up one of the
+ * sleeping workers.
+ *
+ * For backward scan, next_page holds the latest page being scanned.
+ * For forward scan, next_page holds the next page to be scanned.
+ */
+void
+_bt_parallel_release(IndexScanDesc scan, BlockNumber next_page)
+{
+	BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
+												(void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+	SpinLockAcquire(&btscan->ps_mutex);
+	btscan->ps_nextPage = next_page;
+	btscan->ps_pageStatus = BTPARALLEL_IDLE;
+	SpinLockRelease(&btscan->ps_mutex);
+	ConditionVariableSignal(&btscan->cv);
+}
+
+/*
+ * _bt_parallel_done() -- Finishes the parallel scan
+ *
+ * This must be called when there are no pages left to scan.  It notifies all
+ * the other workers associated with this scan that the scan has ended.
+ */
+void
+_bt_parallel_done(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTParallelScanDesc btscan;
+	bool		status_changed = false;
+
+	/* Do nothing, for non-parallel scans */
+	if (scan->parallel_scan == NULL)
+		return;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+	/*
+	 * Mark the parallel scan as done for this combination of scan keys no
+	 * more than once; we rely on this state to initiate the next scan when
+	 * there are multiple array keys, see _bt_advance_array_keys.
+	 */
+	SpinLockAcquire(&btscan->ps_mutex);
+	if (so->arrayKeyCount >= btscan->ps_arrayKeyCount &&
+		btscan->ps_pageStatus != BTPARALLEL_DONE)
+	{
+		btscan->ps_pageStatus = BTPARALLEL_DONE;
+		status_changed = true;
+	}
+	SpinLockRelease(&btscan->ps_mutex);
+
+	/* wake up all the workers associated with this parallel scan */
+	if (status_changed)
+		ConditionVariableBroadcast(&btscan->cv);
+}
+
+/*
+ * _bt_parallel_advance_scan() -- Advances the parallel scan
+ *
+ * Updates the count of array keys processed in both the local and the shared
+ * scan state.
+ */
+void
+_bt_parallel_advance_scan(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
+												(void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+	so->arrayKeyCount++;
+	SpinLockAcquire(&btscan->ps_mutex);
+	if (btscan->ps_pageStatus == BTPARALLEL_DONE)
+	{
+		btscan->ps_nextPage = InvalidBlockNumber;
+		btscan->ps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		btscan->ps_arrayKeyCount++;
+	}
+	SpinLockRelease(&btscan->ps_mutex);
+}
+
+/*
  *	btrescan() -- rescan an index relation
  */
 void
@@ -481,6 +668,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	}
 
 	so->markItemIndex = -1;
+	so->arrayKeyCount = 0;
 	BTScanPosUnpinIfPinned(so->markPos);
 	BTScanPosInvalidate(so->markPos);
 
@@ -521,6 +709,31 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 }
 
 /*
+ *	btparallelrescan() -- reset parallel scan
+ */
+void
+btparallelrescan(IndexScanDesc scan)
+{
+	if (scan->parallel_scan)
+	{
+		BTParallelScanDesc btscan;
+
+		btscan = (BTParallelScanDesc) OffsetToPointer(
+												(void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+		/*
+		 * In theory, we don't need to acquire the spinlock here, but being
+		 * consistent with heap_rescan seems like a good idea.
+		 */
+		SpinLockAcquire(&btscan->ps_mutex);
+		btscan->ps_nextPage = InvalidBlockNumber;
+		btscan->ps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		SpinLockRelease(&btscan->ps_mutex);
+	}
+}
+
+/*
  *	btendscan() -- close down a scan
  */
 void
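
For reviewers, the page handoff implemented by the functions above can be
outlined as follows (an illustrative sketch of the forward-scan case,
mirroring the branches in _bt_first/_bt_steppage; not verbatim patch code):

    bool        status;
    BlockNumber blkno;

    blkno = _bt_parallel_seize(scan, &status);  /* may sleep on the cv */

    if (!status)
        ;   /* some participant already finished the scan for these keys */
    else if (blkno == P_NONE)
        _bt_parallel_done(scan);    /* rightmost page reached; wake everyone */
    else if (blkno == InvalidBlockNumber)
        ;   /* scan not started: _bt_first() descends to the first leaf page
             * and then calls _bt_parallel_release() with its right link */
    else
        ;   /* read blkno; _bt_readpage() calls _bt_parallel_release() as
             * soon as the next block number is known, so other participants
             * can advance while this one is still returning tuples */
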
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index ee46023..baa69fc 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,6 +30,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
@@ -544,8 +546,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
+	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	BlockNumber blkno;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -564,6 +568,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (!so->qual_ok)
 		return false;
 
+	/*
+	 * For parallel scans, get the page to scan from shared state.  If the
+	 * scan has not yet started, descend the tree to find the first leaf page
+	 * to scan, keeping the other workers waiting until we have done so.
+	 *
+	 * If the scan has already begun, skip the descent and directly scan the
+	 * page stored in the shared structure, or the page to its left in the
+	 * case of a backward scan.
+	 */
+	if (scan->parallel_scan != NULL)
+	{
+		blkno = _bt_parallel_seize(scan, &status);
+		if (status == false)
+		{
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno == P_NONE)
+		{
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno != InvalidBlockNumber)
+		{
+			if (!_bt_parallel_readpage(scan, blkno, dir))
+				return false;
+			goto readcomplete;
+		}
+	}
+
 	/*----------
 	 * Examine the scan keys to discover where we need to start the scan.
 	 *
@@ -743,7 +779,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * there.
 	 */
 	if (keysCount == 0)
-		return _bt_endpoint(scan, dir);
+	{
+		bool		match;
+
+		match = _bt_endpoint(scan, dir);
+		if (!match)
+		{
+			/* No match, indicate (parallel) scan finished */
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+		}
+
+		return match;
+	}
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -993,6 +1041,14 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * because nothing finer to lock exists.
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
+
+		/*
+		 * mark parallel scan as done, so that all the workers can finish
+		 * their scan
+		 */
+		_bt_parallel_done(scan);
+		BTScanPosInvalidate(so->currPos);
+
 		return false;
 	}
 	else
@@ -1060,6 +1116,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
+readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
@@ -1154,6 +1211,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	page = BufferGetPage(so->currPos.buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/* allow next page to be processed by parallel workers */
+	if (scan->parallel_scan)
+	{
+		if (ScanDirectionIsForward(dir))
+			_bt_parallel_release(scan, opaque->btpo_next);
+		else
+			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+	}
+
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1278,21 +1345,16 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
  * if pinned, we'll drop the pin before moving to next page.  The buffer is
  * not locked on entry.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page.  For success on a scan using a non-MVCC snapshot we hold
- * a pin, but not a read lock, on that page.  If we do not hold the pin, we
- * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
- *
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
+ * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
+ * to InvalidBuffer.  We return TRUE to indicate success.
  */
 static bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Relation	rel;
-	Page		page;
-	BTPageOpaque opaque;
+	BlockNumber blkno = InvalidBlockNumber;
+	bool		status = true;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1319,13 +1381,25 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}
 
-	rel = scan->indexRelation;
-
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
-		/* We must rely on the previously saved nextPage link! */
-		BlockNumber blkno = so->currPos.nextPage;
+
+		/*
+		 * We must rely on the previously saved nextPage link for non-parallel
+		 * scans!
+		 */
+		if (scan->parallel_scan != NULL)
+		{
+			blkno = _bt_parallel_seize(scan, &status);
+			if (status == false)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+			blkno = so->currPos.nextPage;
 
 		/* Remember we left a page with data */
 		so->currPos.moreLeft = true;
@@ -1333,11 +1407,68 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		/* release the previous buffer, if pinned */
 		BTScanPosUnpinIfPinned(so->currPos);
 
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+	else
+	{
+		/* Remember we left a page with data */
+		so->currPos.moreRight = true;
+
+		/* For parallel scans, get the last page scanned */
+		if (scan->parallel_scan != NULL)
+		{
+			blkno = _bt_parallel_seize(scan, &status);
+			if (status == false)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+			BTScanPosUnpinIfPinned(so->currPos);
+		}
+
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+
+	/* Drop the lock, and maybe the pin, on the current page */
+	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+	return true;
+}
+
+/*
+ *	_bt_readnextpage() -- Read next page containing valid data for scan
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page.  The caller is responsible for releasing the lock and
+ * pin on the buffer on success.  We return TRUE to indicate success.
+ *
+ * If there are no more matching records in the given direction, we drop all
+ * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation	rel;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		status = true;
+
+	rel = scan->indexRelation;
+
+	if (ScanDirectionIsForward(dir))
+	{
 		for (;;)
 		{
-			/* if we're at end of scan, give up */
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
 			if (blkno == P_NONE || !so->currPos.moreRight)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1345,10 +1476,10 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			CHECK_FOR_INTERRUPTS();
 			/* step right one page */
 			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
-			/* check for deleted page */
 			page = BufferGetPage(so->currPos.buf);
 			TestForOldSnapshot(scan->xs_snapshot, rel, page);
 			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			/* check for deleted page */
 			if (!P_IGNORE(opaque))
 			{
 				PredicateLockPage(rel, blkno, scan->xs_snapshot);
@@ -1359,14 +1490,30 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			}
 
 			/* nope, keep going */
-			blkno = opaque->btpo_next;
+			if (scan->parallel_scan != NULL)
+			{
+				blkno = _bt_parallel_seize(scan, &status);
+				if (status == false)
+				{
+					_bt_relbuf(rel, so->currPos.buf);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+			}
+			else
+				blkno = opaque->btpo_next;
 			_bt_relbuf(rel, so->currPos.buf);
 		}
 	}
 	else
 	{
-		/* Remember we left a page with data */
-		so->currPos.moreRight = true;
+		/*
+		 * For parallel scans, the current block number has to be retrieved
+		 * from shared state; it is the caller's responsibility to pass the
+		 * correct block number.
+		 */
+		if (blkno != InvalidBlockNumber)
+			so->currPos.currPage = blkno;
 
 		/*
 		 * Walk left to the next page with data.  This is much more complex
@@ -1401,6 +1548,12 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			if (!so->currPos.moreLeft)
 			{
 				_bt_relbuf(rel, so->currPos.buf);
+
+				/*
+				 * mark parallel scan as done, so that all the workers can
+				 * finish their scan
+				 */
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1412,6 +1565,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			/* if we're physically at end of index, return failure */
 			if (so->currPos.buf == InvalidBuffer)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1432,9 +1586,60 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
 					break;
 			}
+
+			/*
+			 * For parallel scans, re-fetch the last page scanned from shared
+			 * state, since by the time we try to fetch the previous page,
+			 * another worker may already have decided to scan it.  We could
+			 * avoid that by calling _bt_parallel_release once we have read
+			 * the current page, but it seems bad to make other workers wait
+			 * until we have read the page.
+			 */
+			if (scan->parallel_scan != NULL)
+			{
+				_bt_relbuf(rel, so->currPos.buf);
+				blkno = _bt_parallel_seize(scan, &status);
+				if (status == false)
+				{
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+				so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			}
 		}
 	}
 
+	return true;
+}
+
+/*
+ *	_bt_parallel_readpage() -- Read current page containing valid data for scan
+ *
+ * On success, the current page is read and the lock, and possibly the pin,
+ * on it is dropped.  We return TRUE to indicate success.
+ */
+static bool
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	/* initialize moreLeft/moreRight appropriately for scan direction */
+	if (ScanDirectionIsForward(dir))
+	{
+		so->currPos.moreLeft = false;
+		so->currPos.moreRight = true;
+	}
+	else
+	{
+		so->currPos.moreLeft = true;
+		so->currPos.moreRight = false;
+	}
+	so->numKilled = 0;			/* just paranoia */
+	so->markItemIndex = -1;		/* ditto */
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
 	/* Drop the lock, and maybe the pin, on the current page */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 063c988..15ea6df 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -590,6 +590,10 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 			break;
 	}
 
+	/* advance parallel scan */
+	if (scan->parallel_scan != NULL)
+		_bt_parallel_advance_scan(scan);
+
 	return found;
 }
 
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d570ae5..4f39e93 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -60,8 +60,11 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = spgoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = spgvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = spgbeginscan;
 	amroutine->amrescan = spgrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = spggettuple;
 	amroutine->amgetbitmap = spggetbitmap;
 	amroutine->amendscan = spgendscan;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index dc1f79f..01b358c 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -903,7 +903,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	if (OldIndex != NULL && !use_sort)
 	{
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0, false);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 009c1b7..f8e662b 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -722,7 +722,7 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, index_natts, 0);
+	index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, index_natts, 0, false);
 	index_rescan(index_scan, scankeys, index_natts, NULL, 0);
 
 	while ((tup = index_getnext(index_scan,
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index a364098..43c610c 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -26,6 +26,7 @@
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
 #include "utils/memutils.h"
+#include "access/relscan.h"
 
 
 /* ----------------------------------------------------------------
@@ -48,6 +49,7 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	 * extract necessary information from index scan node
 	 */
 	scandesc = node->biss_ScanDesc;
+	scandesc->parallel_scan = NULL;
 
 	/*
 	 * If we have runtime keys and they've not already been set up, do it now.
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..45566bd 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -540,9 +540,9 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	 */
 	indexstate->ioss_ScanDesc = index_beginscan(currentRelation,
 												indexstate->ioss_RelationDesc,
-												estate->es_snapshot,
+												estate->es_snapshot, NULL,
 												indexstate->ioss_NumScanKeys,
-											indexstate->ioss_NumOrderByKeys);
+									 indexstate->ioss_NumOrderByKeys, false);
 
 	/* Set it up for index-only scan */
 	indexstate->ioss_ScanDesc->xs_want_itup = true;
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..77d2990 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -1022,9 +1022,9 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	 */
 	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
 											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot,
+											   estate->es_snapshot, NULL,
 											   indexstate->iss_NumScanKeys,
-											 indexstate->iss_NumOrderByKeys);
+									  indexstate->iss_NumOrderByKeys, false);
 
 	/*
 	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a392197..248ceb5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3387,6 +3387,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PARALLEL_FINISH:
 			event_name = "ParallelFinish";
 			break;
+		case WAIT_EVENT_PARALLEL_INDEX_SCAN:
+			event_name = "ParallelIndexScan";
+			break;
 		case WAIT_EVENT_SAFE_SNAPSHOT:
 			event_name = "SafeSnapshot";
 			break;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 56943f2..e058150 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5129,8 +5129,8 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 				 * extreme value has been deleted; that case motivates not
 				 * using SnapshotAny here.
 				 */
-				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty,
-											 1, 0);
+				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty, NULL,
+											 1, 0, false);
 				index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 				/* Fetch first tuple in sortop's direction */
@@ -5161,8 +5161,8 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 			/* If max is requested, and we didn't find the index is empty */
 			if (max && have_data)
 			{
-				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty,
-											 1, 0);
+				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty, NULL,
+											 1, 0, false);
 				index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 				/* Fetch first tuple in reverse direction */
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 1036cca..e777678 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -108,6 +108,12 @@ typedef bool (*amproperty_function) (Oid index_oid, int attno,
 /* validate definition of an opclass for this AM */
 typedef bool (*amvalidate_function) (Oid opclassoid);
 
+/* estimate size of parallel scan descriptor */
+typedef Size (*amestimateparallelscan_function) (Size index_size);
+
+/* prepare for parallel index scan */
+typedef void (*aminitparallelscan_function) (void *target);
+
 /* prepare for index scan */
 typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
 														   int nkeys,
@@ -120,6 +126,9 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 											   ScanKey orderbys,
 											   int norderbys);
 
+/* (re)start parallel index scan */
+typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+
 /* next valid tuple */
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 												 ScanDirection direction);
@@ -189,8 +198,11 @@ typedef struct IndexAmRoutine
 	amoptions_function amoptions;
 	amproperty_function amproperty;		/* can be NULL */
 	amvalidate_function amvalidate;
+	amestimateparallelscan_function amestimateparallelscan;		/* can be NULL */
+	aminitparallelscan_function aminitparallelscan;		/* can be NULL */
 	ambeginscan_function ambeginscan;
 	amrescan_function amrescan;
+	amparallelrescan_function amparallelrescan; /* can be NULL */
 	amgettuple_function amgettuple;		/* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 81907d5..4410ea3 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -83,6 +83,8 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 typedef struct IndexScanDescData *IndexScanDesc;
 typedef struct SysScanDescData *SysScanDesc;
 
+typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
+
 /*
  * Enumeration specifying the type of uniqueness check to perform in
  * index_insert().
@@ -131,16 +133,23 @@ extern bool index_insert(Relation indexRelation,
 			 Relation heapRelation,
 			 IndexUniqueCheck checkUnique);
 
+extern Size index_parallelscan_estimate(Relation indexrel, Snapshot snapshot);
+extern void index_parallelscan_initialize(Relation heaprel, Relation indexrel, Snapshot snapshot, ParallelIndexScanDesc target);
+extern IndexScanDesc index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
+						 int norderbys, ParallelIndexScanDesc pscan);
+
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
-				int nkeys, int norderbys);
+				ParallelIndexScanDesc pscan,
+				int nkeys, int norderbys, bool temp_snap);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 					   Snapshot snapshot,
 					   int nkeys);
 extern void index_rescan(IndexScanDesc scan,
 			 ScanKey keys, int nkeys,
 			 ScanKey orderbys, int norderbys);
+extern void index_parallelrescan(IndexScanDesc scan);
 extern void index_endscan(IndexScanDesc scan);
 extern void index_markpos(IndexScanDesc scan);
 extern void index_restrpos(IndexScanDesc scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..75a4c96 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -609,6 +609,8 @@ typedef struct BTScanOpaqueData
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
 								 * there are any unsatisfiable array keys) */
+	int			arrayKeyCount;	/* count indicating number of array scan keys
+								 * processed */
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
@@ -652,7 +654,25 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
 /*
- * prototypes for functions in nbtree.c (external entry points for btree)
+ * The flags below are used to indicate the state of a parallel scan.
+ *
+ * BTPARALLEL_NOT_INITIALIZED implies that the scan has not yet started.
+ *
+ * BTPARALLEL_ADVANCING implies that some worker (or the leader) is advancing
+ * the scan to a new page; others must wait.
+ *
+ * BTPARALLEL_IDLE implies that no backend is advancing the scan; any
+ * participant can start doing so.
+ *
+ * BTPARALLEL_DONE implies that the scan is complete (including error exit)
+ */
+#define BTPARALLEL_NOT_INITIALIZED 0x01
+#define BTPARALLEL_ADVANCING 0x02
+#define BTPARALLEL_DONE 0x03
+#define BTPARALLEL_IDLE 0x04
+
+/*
+ * prototypes for functions in nbtree.c (external entry points for btree and
+ * functions to maintain state of parallel scan)
  */
 extern Datum bthandler(PG_FUNCTION_ARGS);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
@@ -662,10 +682,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 		 ItemPointer ht_ctid, Relation heapRel,
 		 IndexUniqueCheck checkUnique);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern Size btestimateparallelscan(Size size);
+extern void btinitparallelscan(void *target);
+extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber next_page);
+extern void _bt_parallel_done(IndexScanDesc scan);
+extern void _bt_parallel_advance_scan(IndexScanDesc scan);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		 ScanKey orderbys, int norderbys);
+extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index de98dd6..ca843f9 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -21,6 +21,9 @@
 #include "access/tupdesc.h"
 #include "storage/spin.h"
 
+#define OffsetToPointer(base, offset) \
+	((void *) ((char *) (base) + (offset)))
+
 /*
  * Shared state for parallel heap scan.
  *
@@ -126,8 +129,20 @@ typedef struct IndexScanDescData
 
 	/* state data for traversing HOT chains in index_getnext */
 	bool		xs_continue_hot;	/* T if must keep walking HOT chain */
+	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
+	ParallelIndexScanDesc parallel_scan;		/* parallel index scan
+												 * information */
 }	IndexScanDescData;
 
+/* Generic structure for parallel scans */
+typedef struct ParallelIndexScanDescData
+{
+	Oid			ps_relid;
+	Oid			ps_indexid;
+	Size		ps_offset;		/* Offset in bytes of am specific structure */
+	char		ps_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+} ParallelIndexScanDescData;
+
 /* Struct for heap-or-index scans of system tables */
 typedef struct SysScanDescData
 {
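
The ParallelIndexScanDescData layout above follows the usual pattern of a
generic header with the AM-specific area located via ps_offset.  The
stand-alone toy below shows the same convention with invented names; it is
not patch code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stddef.h>

    #define OffsetToPointer(base, offset) ((void *) ((char *) (base) + (offset)))

    typedef struct GenericDesc
    {
        size_t      ps_offset;          /* where the AM-specific part starts */
        char        ps_snapshot_data[]; /* variable-length serialized snapshot */
    } GenericDesc;

    typedef struct AmSpecific
    {
        unsigned    next_page;
    } AmSpecific;

    int
    main(void)
    {
        size_t      snaplen = 16;       /* pretend-serialized snapshot size */
        size_t      offset = offsetof(GenericDesc, ps_snapshot_data) + snaplen;
        GenericDesc *desc = calloc(1, offset + sizeof(AmSpecific));
        AmSpecific *am = (AmSpecific *) OffsetToPointer(desc, offset);

        desc->ps_offset = offset;
        am->next_page = 42;
        printf("AM area starts at byte %zu, next_page = %u\n", offset, am->next_page);
        free(desc);
        return 0;
    }
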
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0b85b7a..83e3c11 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -784,6 +784,7 @@ typedef enum
 	WAIT_EVENT_MQ_RECEIVE,
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
+	WAIT_EVENT_PARALLEL_INDEX_SCAN,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP
 } WaitEventIPC;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6c6d519..20ceb55 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -165,6 +165,7 @@ BTPageOpaque
 BTPageOpaqueData
 BTPageStat
 BTPageState
+BTParallelScanDesc
 BTScanOpaque
 BTScanOpaqueData
 BTScanPos
@@ -1259,6 +1260,8 @@ OverrideSearchPath
 OverrideStackEntry
 PACE_HEADER
 PACL
+ParallelIndexScanDesc
+ParallelIndexScanDescData
 PATH
 PBOOL
 PCtxtHandle
Attachment: parallel_index_opt_exec_support_v2.patch (application/octet-stream)
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 8d4f9f7..33ea60b 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -110,6 +110,8 @@
   bool        amclusterable;
   /* does AM handle predicate locks? */
   bool        ampredlocks;
+  /* does AM support parallel scan? */
+  bool        amcanparallel;
   /* type of data stored in index, or InvalidOid if variable */
   Oid         amkeytype;
 
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 5ff6a0f..a13b5e7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -92,6 +92,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6f7024e..38dd077 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 5727ef9..d300545 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -69,6 +69,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 40ddda5..304656a 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -66,6 +66,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = INT4OID;
 
 	amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index ae69a8e..f50854d 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -19,6 +19,7 @@
 #include "postgres.h"
 
 #include "access/nbtree.h"
+#include "access/parallel.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -117,6 +118,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
+	amroutine->amcanparallel = true;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 4f39e93..98c5a4e 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -48,6 +48,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = spgbuild;
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9b29f09 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -60,6 +60,7 @@
 
 static bool TargetListSupportsBackwardScan(List *targetlist);
 static bool IndexSupportsBackwardScan(Oid indexid);
+static bool GatherSupportsBackwardScan(Plan *node);
 
 
 /*
@@ -485,7 +486,7 @@ ExecSupportsBackwardScan(Plan *node)
 			return false;
 
 		case T_Gather:
-			return false;
+			return GatherSupportsBackwardScan(node);
 
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
@@ -566,6 +567,25 @@ IndexSupportsBackwardScan(Oid indexid)
 }
 
 /*
+ * GatherSupportsBackwardScan - does a Gather plan support backward scan?
+ *
+ * Returns true if the outer plan node of the Gather supports backward scan.
+ * As of now, backward scan is supported only when the outer node of the
+ * Gather is an index scan.
+ */
+static bool
+GatherSupportsBackwardScan(Plan *node)
+{
+	Plan	   *outer_node = outerPlan(node);
+
+	if (nodeTag(outer_node) == T_IndexScan)
+		return IndexSupportsBackwardScan(((IndexScan *) outer_node)->indexid) &&
+			TargetListSupportsBackwardScan(outer_node->targetlist);
+	else
+		return false;
+}
+
+/*
  * ExecMaterializesOutput - does a plan type materialize its output?
  *
  * Returns true if the plan node type is one that automatically materializes
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f9c8598..3238cc7 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -28,6 +28,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeSeqscan.h"
+#include "executor/nodeIndexscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -193,6 +194,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 				ExecSeqScanEstimate((SeqScanState *) planstate,
 									e->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanEstimate((IndexScanState *) planstate,
+									  e->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanEstimate((ForeignScanState *) planstate,
 										e->pcxt);
@@ -245,6 +250,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
 										 d->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeDSM((IndexScanState *) planstate,
+										   d->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
 											 d->pcxt);
@@ -689,6 +698,9 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 			case T_SeqScanState:
 				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeWorker((IndexScanState *) planstate, toc);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
 												toc);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 77d2990..a7d4c25 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -22,6 +22,9 @@
  *		ExecEndIndexScan		releases all storage.
  *		ExecIndexMarkPos		marks scan position.
  *		ExecIndexRestrPos		restores scan position.
+ *		ExecIndexScanEstimate	estimates DSM space needed for parallel index scan
+ *		ExecIndexScanInitializeDSM initializes DSM for parallel indexscan
+ *		ExecIndexScanInitializeWorker attaches to DSM info in parallel worker
  */
 #include "postgres.h"
 
@@ -515,6 +518,15 @@ ExecIndexScan(IndexScanState *node)
 void
 ExecReScanIndexScan(IndexScanState *node)
 {
+	bool		reset_parallel_scan = true;
+
+	/*
+	 * If we are here just to update the scan keys, then don't reset the
+	 * parallel scan.
+	 */
+	if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+		reset_parallel_scan = false;
+
 	/*
 	 * If we are doing runtime key calculations (ie, any of the index key
 	 * values weren't simple Consts), compute the new key values.  But first,
@@ -540,10 +552,16 @@ ExecReScanIndexScan(IndexScanState *node)
 			reorderqueue_pop(node);
 	}
 
-	/* reset index scan */
-	index_rescan(node->iss_ScanDesc,
-				 node->iss_ScanKeys, node->iss_NumScanKeys,
-				 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+	/* reset (parallel) index scan */
+	if (node->iss_ScanDesc)
+	{
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		if (reset_parallel_scan)
+			index_parallelrescan(node->iss_ScanDesc);
+	}
 	node->iss_ReachedEnd = false;
 
 	ExecScanReScan(&node->ss);
@@ -1018,22 +1036,29 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Initialize scan descriptor.
+	 * For a parallel-aware node, we defer initialization of the scan
+	 * descriptor until the shared memory for parallel execution has been
+	 * initialized.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
-											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot, NULL,
-											   indexstate->iss_NumScanKeys,
+	if (!node->scan.plan.parallel_aware)
+	{
+		/*
+		 * Initialize scan descriptor.
+		 */
+		indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+												indexstate->iss_RelationDesc,
+												   estate->es_snapshot, NULL,
+												 indexstate->iss_NumScanKeys,
 									  indexstate->iss_NumOrderByKeys, false);
 
-	/*
-	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
-	 * index AM.
-	 */
-	if (indexstate->iss_NumRuntimeKeys == 0)
-		index_rescan(indexstate->iss_ScanDesc,
-					 indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
+		/*
+		 * If no run-time keys to calculate, go ahead and pass the scankeys to
+		 * the index AM.
+		 */
+		if (indexstate->iss_NumRuntimeKeys == 0)
+			index_rescan(indexstate->iss_ScanDesc,
+					   indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
 				indexstate->iss_OrderByKeys, indexstate->iss_NumOrderByKeys);
+	}
 
 	/*
 	 * all done.
@@ -1595,3 +1620,91 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 	else if (n_array_keys != 0)
 		elog(ERROR, "ScalarArrayOpExpr index qual found where not allowed");
 }
+
+/* ----------------------------------------------------------------
+ *						Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanEstimate
+ *
+ *		estimates the space required to serialize indexscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanEstimate(IndexScanState *node,
+					  ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+													 estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeDSM
+ *
+ *		Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeDSM(IndexScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+	index_parallelscan_initialize(node->ss.ss_currentRelation,
+								  node->iss_RelationDesc,
+								  estate->es_snapshot,
+								  piscan);
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeWorker
+ *
+ *		Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc)
+{
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
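
A condensed view of the executor wiring added above (illustration only; the
real code is in ExecIndexScanEstimate/InitializeDSM/InitializeWorker):

    /* Leader, once per parallel plan, before launching workers: */
    pscan_len = index_parallelscan_estimate(indexRel, estate->es_snapshot);
    shm_toc_estimate_chunk(&pcxt->estimator, pscan_len);
    shm_toc_estimate_keys(&pcxt->estimator, 1);
    ...
    piscan = shm_toc_allocate(pcxt->toc, pscan_len);
    index_parallelscan_initialize(heapRel, indexRel, estate->es_snapshot, piscan);
    shm_toc_insert(pcxt->toc, plan_node_id, piscan);
    scandesc = index_beginscan_parallel(heapRel, indexRel, nkeys, norderbys, piscan);

    /* Each worker, at startup: */
    piscan = shm_toc_lookup(toc, plan_node_id);
    scandesc = index_beginscan_parallel(heapRel, indexRel, nkeys, norderbys, piscan);

After that every participant runs the normal ExecIndexScan() loop; page
handoff is coordinated by the btree AM through the shared
BTParallelScanDescData.
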
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index e42895d..d1ee784 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -418,6 +418,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	List	   *qpquals;
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
+	Cost		cpu_run_cost = 0;
 	Cost		indexStartupCost;
 	Cost		indexTotalCost;
 	Selectivity indexSelectivity;
@@ -620,11 +621,33 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 
-	run_cost += cpu_per_tuple * tuples_fetched;
+	cpu_run_cost += cpu_per_tuple * tuples_fetched;
 
 	/* tlist eval costs are paid per output row, not per tuple scanned */
 	startup_cost += path->path.pathtarget->cost.startup;
-	run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+	cpu_run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+
+	/* Adjust costing for parallelism, if used. */
+	if (path->path.parallel_workers > 0)
+	{
+		double		parallel_divisor = path->path.parallel_workers;
+		double		leader_contribution;
+
+		/*
+		 * We divide only the cpu cost among workers and the division uses the
+		 * same formula as we use for seq scan.  See cost_seqscan.
+		 */
+		leader_contribution = 1.0 - (0.3 * path->path.parallel_workers);
+		if (leader_contribution > 0)
+			parallel_divisor += leader_contribution;
+
+		path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
+
+		/* The CPU cost is divided among all the workers. */
+		cpu_run_cost /= parallel_divisor;
+	}
+
+	run_cost += cpu_run_cost;
 
 	path->path.startup_cost = startup_cost;
 	path->path.total_cost = startup_cost + run_cost;
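
To see what the divisor above works out to, here is a tiny stand-alone
calculation using the same formula as cost_seqscan (the CPU run cost of 1000
is an arbitrary assumed value):

    #include <stdio.h>

    int
    main(void)
    {
        double      cpu_run_cost = 1000.0;  /* assumed CPU run cost */
        int         workers;

        for (workers = 1; workers <= 4; workers++)
        {
            double      divisor = workers;
            double      leader = 1.0 - (0.3 * workers);  /* leader's share */

            if (leader > 0)
                divisor += leader;

            printf("%d worker(s): divisor = %.1f, per-process cpu cost = %.1f\n",
                   workers, divisor, cpu_run_cost / divisor);
        }
        return 0;
    }

With four or more planned workers the leader is assumed to contribute
nothing, so the divisor becomes just the worker count.
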
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 2952bfb..54a8a2f 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include <limits.h>
 #include <math.h>
 
 #include "access/stratnum.h"
@@ -108,6 +109,8 @@ static bool bms_equal_any(Relids relids, List *relids_list);
 static void get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 				IndexOptInfo *index, IndexClauseSet *clauses,
 				List **bitindexpaths);
+static int index_path_get_workers(PlannerInfo *root, RelOptInfo *rel,
+					   IndexOptInfo *index);
 static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 				  IndexOptInfo *index, IndexClauseSet *clauses,
 				  bool useful_predicate,
@@ -811,6 +814,51 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 }
 
 /*
+ * index_path_get_workers
+ *	  Compute the number of workers to use for a parallel scan of an index
+ */
+static int
+index_path_get_workers(PlannerInfo *root, RelOptInfo *rel, IndexOptInfo *index)
+{
+	int			parallel_workers;
+	int			parallel_threshold;
+
+	/*
+	 * If the index is too small to be worth a parallel scan, just return
+	 * zero workers ... unless it's an inheritance child.  In that
+	 * case, we want to generate a parallel path here anyway.  It might not be
+	 * worthwhile just for this relation, but when combined with all of its
+	 * inheritance siblings it may well pay off.
+	 */
+	if (index->pages < (BlockNumber) min_parallel_relation_size &&
+		rel->reloptkind == RELOPT_BASEREL)
+		return 0;
+
+	/*
+	 * Select the number of workers based on the log of the size of the
+	 * Select the number of workers based on the log of the size of the
+	 * index.  This probably needs to be a good deal more sophisticated,
+	 * min_parallel_relation_size GUC is chosen to prevent overflow here.
+	 */
+	parallel_workers = 1;
+	parallel_threshold = Max(min_parallel_relation_size, 1);
+	while (index->pages >= (BlockNumber) (parallel_threshold * 3))
+	{
+		parallel_workers++;
+		parallel_threshold *= 3;
+		if (parallel_threshold > INT_MAX / 3)
+			break;				/* avoid overflow */
+	}
+
+	/*
+	 * In no case use more than max_parallel_workers_per_gather workers.
+	 */
+	parallel_workers = Min(parallel_workers, max_parallel_workers_per_gather);
+
+	return parallel_workers;
+}
+
+/*
  * build_index_paths
  *	  Given an index and a set of index clauses for it, construct zero
  *	  or more IndexPaths.
@@ -1042,8 +1090,43 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  NoMovementScanDirection,
 								  index_only_scan,
 								  outer_relids,
-								  loop_count);
+								  loop_count,
+								  0);
 		result = lappend(result, ipath);
+
+		/*
+		 * If appropriate, consider parallel index scan.  We don't allow
+		 * parallel index scan for bitmap scans.
+		 */
+		if (index->amcanparallel &&
+			!index_only_scan &&
+			rel->consider_parallel &&
+			outer_relids == NULL &&
+			scantype != ST_BITMAPSCAN)
+		{
+			int			parallel_workers = 0;
+
+			parallel_workers = index_path_get_workers(root, rel, index);
+
+			if (parallel_workers > 0)
+			{
+				ipath = create_index_path(root, index,
+										  index_clauses,
+										  clause_columns,
+										  orderbyclauses,
+										  orderbyclausecols,
+										  useful_pathkeys,
+										  index_is_ordered ?
+										  ForwardScanDirection :
+										  NoMovementScanDirection,
+										  index_only_scan,
+										  outer_relids,
+										  loop_count,
+										  parallel_workers);
+
+				add_partial_path(rel, (Path *) ipath);
+			}
+		}
 	}
 
 	/*
@@ -1066,8 +1149,38 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  BackwardScanDirection,
 									  index_only_scan,
 									  outer_relids,
-									  loop_count);
+									  loop_count,
+									  0);
 			result = lappend(result, ipath);
+
+			/* If appropriate, consider parallel index scan */
+			if (index->amcanparallel &&
+				!index_only_scan &&
+				rel->consider_parallel &&
+				outer_relids == NULL &&
+				scantype != ST_BITMAPSCAN)
+			{
+				int			parallel_workers = 0;
+
+				parallel_workers = index_path_get_workers(root, rel, index);
+
+				if (parallel_workers > 0)
+				{
+					ipath = create_index_path(root, index,
+											  index_clauses,
+											  clause_columns,
+											  NIL,
+											  NIL,
+											  useful_pathkeys,
+											  BackwardScanDirection,
+											  index_only_scan,
+											  outer_relids,
+											  loop_count,
+											  parallel_workers);
+
+					add_partial_path(rel, (Path *) ipath);
+				}
+			}
 		}
 	}
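
The heuristic above grows the worker count with the log (base 3) of the index
size, as for parallel sequential scans.  The stand-alone program below
reproduces it for a few assumed index sizes (ignoring the inheritance-child
exception), with min_parallel_relation_size at its default of 1024 pages (8MB)
and max_parallel_workers_per_gather assumed to be 8:

    #include <stdio.h>
    #include <limits.h>

    #define Max(a,b) ((a) > (b) ? (a) : (b))
    #define Min(a,b) ((a) < (b) ? (a) : (b))

    int
    main(void)
    {
        int         min_parallel_relation_size = 1024;  /* pages */
        int         max_parallel_workers_per_gather = 8;
        unsigned int sizes[] = {512, 1024, 4096, 100000, 10000000};
        int         i;

        for (i = 0; i < 5; i++)
        {
            unsigned int pages = sizes[i];
            int         workers = 0;

            if (pages >= (unsigned int) min_parallel_relation_size)
            {
                int         threshold = Max(min_parallel_relation_size, 1);

                workers = 1;
                while (pages >= (unsigned int) (threshold * 3))
                {
                    workers++;
                    threshold *= 3;
                    if (threshold > INT_MAX / 3)
                        break;      /* avoid overflow */
                }
                workers = Min(workers, max_parallel_workers_per_gather);
            }
            printf("%10u pages -> %d worker(s)\n", pages, workers);
        }
        return 0;
    }
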
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 41dde50..ea18fc1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5300,7 +5300,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	indexScanPath = create_index_path(root, indexInfo,
 									  NIL, NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
-									  NULL, 1.0);
+									  NULL, 1.0, 0);
 
 	return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 6d3ccfd..e08c330 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -744,10 +744,8 @@ add_path_precheck(RelOptInfo *parent_rel,
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
  *	  such paths yet.  Hence, GatherPaths must not be created for a rel until
- *	  we're done creating all partial paths for it.  We do not currently build
- *	  partial indexscan paths, so there is no need for an exception for
- *	  IndexPaths here; for safety, we instead Assert that a path to be freed
- *	  isn't an IndexPath.
+ *	  we're done creating all partial paths for it.  As in add_path, we make
+ *	  an exception for IndexPaths here and do not free them, since they may
+ *	  still be referenced elsewhere.
  */
 void
 add_partial_path(RelOptInfo *parent_rel, Path *new_path)
@@ -826,9 +824,12 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		{
 			parent_rel->partial_pathlist =
 				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
-			/* we should not see IndexPaths here, so always safe to delete */
-			Assert(!IsA(old_path, IndexPath));
-			pfree(old_path);
+
+			/*
+			 * Delete the data pointed-to by the deleted cell, if possible
+			 */
+			if (!IsA(old_path, IndexPath))
+				pfree(old_path);
 			/* p1_prev does not advance */
 		}
 		else
@@ -860,10 +861,9 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	}
 	else
 	{
-		/* we should not see IndexPaths here, so always safe to delete */
-		Assert(!IsA(new_path, IndexPath));
 		/* Reject and recycle the new path */
-		pfree(new_path);
+		if (!IsA(new_path, IndexPath))
+			pfree(new_path);
 	}
 }
 
@@ -1019,7 +1019,8 @@ create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count)
+				  double loop_count,
+				  int parallel_workers)
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
@@ -1031,9 +1032,9 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.pathtarget = rel->reltarget;
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
-	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_aware = parallel_workers > 0 ? true : false;
 	pathnode->path.parallel_safe = rel->consider_parallel;
-	pathnode->path.parallel_workers = 0;
+	pathnode->path.parallel_workers = parallel_workers;
 	pathnode->path.pathkeys = pathkeys;
 
 	/* Convert clauses to indexquals the executor can handle */
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index ad07baa..f23e1b7 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -239,6 +239,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->amoptionalkey = amroutine->amoptionalkey;
 			info->amsearcharray = amroutine->amsearcharray;
 			info->amsearchnulls = amroutine->amsearchnulls;
+			info->amcanparallel = amroutine->amcanparallel;
 			info->amhasgettuple = (amroutine->amgettuple != NULL);
 			info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
 			info->amcostestimate = amroutine->amcostestimate;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index e777678..baca400 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -184,6 +184,8 @@ typedef struct IndexAmRoutine
 	bool		amclusterable;
 	/* does AM handle predicate locks? */
 	bool		ampredlocks;
+	/* does AM support parallel scan? */
+	bool		amcanparallel;
 	/* type of data stored in index, or InvalidOid if variable */
 	Oid			amkeytype;
 
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..e03d591 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -14,6 +14,7 @@
 #ifndef NODEINDEXSCAN_H
 #define NODEINDEXSCAN_H
 
+#include "access/parallel.h"
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
@@ -22,6 +23,9 @@ extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
 extern void ExecReScanIndexScan(IndexScanState *node);
+extern void ExecIndexScanEstimate(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeDSM(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc);
 
 /*
  * These routines are exported to share code with nodeIndexonlyscan.c and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f6f73f3..17d712a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1331,6 +1331,7 @@ typedef struct
  *		SortSupport		   for reordering ORDER BY exprs
  *		OrderByTypByVals   is the datatype of order by expression pass-by-value?
  *		OrderByTypLens	   typlens of the datatypes of order by expressions
+ *		pscan_len		   size of parallel index scan descriptor
  * ----------------
  */
 typedef struct IndexScanState
@@ -1357,6 +1358,9 @@ typedef struct IndexScanState
 	SortSupport iss_SortSupport;
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
+
+	/* This is needed for parallel index scan */
+	Size		iss_PscanLen;
 } IndexScanState;
 
 /* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 3a1255a..7e4b475 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -622,6 +622,7 @@ typedef struct IndexOptInfo
 	bool		amsearchnulls;	/* can AM search for NULL/NOT NULL entries? */
 	bool		amhasgettuple;	/* does AM have amgettuple interface? */
 	bool		amhasgetbitmap; /* does AM have amgetbitmap interface? */
+	bool		amcanparallel;	/* does AM support parallel scan? */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
 	void		(*amcostestimate) ();	/* AM's cost estimator */
 } IndexOptInfo;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 71d9154..206fe0a 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -47,7 +47,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count);
+				  double loop_count,
+				  int parallel_workers);
 extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						Path *bitmapqual,
#13Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#12)
Re: Parallel Index Scans

On Sat, Nov 26, 2016 at 10:35 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Sat, Oct 22, 2016 at 9:07 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Oct 21, 2016 at 10:55 PM, Robert Haas <robertmhaas@gmail.com>

wrote:

I have rebased the patch (parallel_index_scan_v2) based on latest
commit e8ac886c (condition variables). I have removed the usage of
ConditionVariablePrepareToSleep as that is is no longer mandatory. I
have also updated docs for wait event introduced by this patch (thanks
to Dilip for noticing it). There is no change in
parallel_index_opt_exec_support patch, but just attaching here for
easier reference.

Moved to next CF with "needs review" status.

Regards,
Hari Babu
Fujitsu Australia

#14Rafia Sabih
rafia.sabih@enterprisedb.com
In reply to: Haribabu Kommi (#13)
2 attachment(s)
Re: Parallel Index Scans

Hello,
On evaluating parallel index scans on TPC-H benchmark queries, I came
across some interesting results.

For scale factor 20, queries 4, 6 and 14 are giving significant performance
improvements with parallel index:
Q | Head | PI
4 | 14 | 11
6 | 27 | 9
14 | 20 | 12

To confirm that the proposed patch is scalable, I tested it at scale factor
300. There, some queries switched to bitmap index scans instead of parallel
index scans, but other queries still gave significant improvements in
performance:
Q | Head | PI
4 | 207 | 168
14 | 2662 | 1576
15 | 847 | 190

All the performance numbers given above are in seconds. The experimental
setup used in this exercise is as follows:
Server parameter settings:
work_mem = 64 MB,
max_parallel_workers_per_gather = 4,
random_page_cost = seq_page_cost = 0.1 = parallel_tuple_cost,
shared_buffers = 1 GB

Logical schema: Some additional indexes were created to ensure the use of
indexes,
on lineitem table -- l_shipdate, l_returnflag, l_shipmode,
on orders table -- o_comment, o_orderdate, and
on customer table -- c_mktsegment.
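
For reference, the above setup can be reproduced roughly as follows (a
sketch assuming the standard TPC-H table and column names; shared_buffers
has to be set in postgresql.conf and needs a restart):

SET work_mem = '64MB';
SET max_parallel_workers_per_gather = 4;
SET random_page_cost = 0.1;
SET seq_page_cost = 0.1;
SET parallel_tuple_cost = 0.1;

CREATE INDEX ON lineitem (l_shipdate);
CREATE INDEX ON lineitem (l_returnflag);
CREATE INDEX ON lineitem (l_shipmode);
CREATE INDEX ON orders (o_comment);
CREATE INDEX ON orders (o_orderdate);
CREATE INDEX ON customer (c_mktsegment);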

Machine used: IBM Power, 4 socket machine, 512 GB RAM

The main observations about the utility and power of this patch concern the
availability of appropriate indexes and choosing a suitable value of
random_page_cost based on the RAM and DB sizes. E.g. in this experimentation
I ensured a warm-cache environment, hence giving random_page_cost a higher
value than seq_page_cost does not make much sense and would inhibit the use
of indexes. Also, the value of this parameter needs to be calibrated for the
underlying hardware; there is recent work in this direction that gives a
mechanism to do this calibration offline, and it also experimented with
PostgreSQL parameters [1]http://pages.cs.wisc.edu/~wentaowu/papers/prediction-full.pdf.

Please find the attached files to have a look at these results in detail.
The file PI_perf_tpch.ods gives the performance numbers and the graphs for
both scale factors. The attached zip folder gives the explain analyze
output for these queries on both head and with the parallel index patch.

[1]: http://pages.cs.wisc.edu/~wentaowu/papers/prediction-full.pdf


--
Regards,
Rafia Sabih
EnterpriseDB: http://www.enterprisedb.com/

Attachments:

PI_perf_tpch.odsapplication/vnd.oasis.opendocument.spreadsheet; name=PI_perf_tpch.odsDownload
PI_plans.zipapplication/zip; name=PI_plans.zipDownload
#15Anastasia Lubennikova
lubennikovaav@gmail.com
In reply to: Amit Kapila (#12)
Re: Parallel Index Scans

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

Hi, thank you for the patch.
Results are very promising. Do you see any drawbacks of this feature or something that requires more testing?
I'm willing to do a review. I haven't done benchmarks yet, but I've read the patch and here are some
notes and questions about it.

I saw the discussion about parameters in the thread above. And I agree that we'd better concentrate
on the patch itself and add them later if necessary.

1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of xs_temp_snap flag?

+	if (scan->xs_temp_snap)
+		UnregisterSnapshot(scan->xs_snapshot);

I must say that I'm quite new with all this parallel stuff. If you give me a link,
where to read about snapshots for parallel workers, my review will be more helpful.
Anyway, it would be great to have more comments about it in the code.

2. Don't you mind to rename 'amestimateparallelscan' to let's say 'amparallelscan_spacerequired'
or something like this? As far as I understand there is nothing to estimate, we know this size
for sure. I guess that you've chosen this name because of 'heap_parallelscan_estimate'.
But now it looks similar to 'amestimate' which refers to indexscan cost for optimizer.
That leads to the next question.

3. Are there any changes in cost estimation? I didn't find related changes in the patch.
Parallel scan is expected to be faster and optimizer definitely should know that.

4. + uint8 ps_pageStatus; /* state of scan, see below */
There is no description below. I'd make the comment more helpful:
/* state of scan. See possible flags values in nbtree.h */
And why do you call it pageStatus? What does it have to do with page?

5. Comment for _bt_parallel_seize() says:
"False indicates that we have reached the end of scan for
current scankeys and for that we return block number as P_NONE."

What is the reason to check (blkno == P_NONE) after checking (status == false)
in _bt_first() (see code below)? If comment is correct
we'll never reach _bt_parallel_done()

+		blkno = _bt_parallel_seize(scan, &status);
+		if (status == false)
+		{
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno == P_NONE)
+		{
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}

6. To avoid code duplication, I would wrap this into the function

+	/* initialize moreLeft/moreRight appropriately for scan direction */
+	if (ScanDirectionIsForward(dir))
+	{
+		so->currPos.moreLeft = false;
+		so->currPos.moreRight = true;
+	}
+	else
+	{
+		so->currPos.moreLeft = true;
+		so->currPos.moreRight = false;
+	}
+	so->numKilled = 0;			/* just paranoia */
+	so->markItemIndex = -1;		/* ditto */

And after that we can also get rid of _bt_parallel_readpage() which only
bring another level of indirection to the code.

7. Just a couple of typos I've noticed:

* Below flags are used indicate the state of parallel scan.
* Below flags are used TO indicate the state of parallel scan.

* On success, release lock and pin on buffer on success.
* On success release lock and pin on buffer.

8. I didn't find a description of the feature in documentation.
Probably we need to add a paragraph to the "Parallel Query" chapter.

I will send another review of the performance by the end of the week.

The new status of this patch is: Waiting on Author


#16Robert Haas
robertmhaas@gmail.com
In reply to: Anastasia Lubennikova (#15)
Re: Parallel Index Scans

Thanks for reviewing! A few quick thoughts from me since I wrote a
bunch of the design for this patch.

On Wed, Dec 21, 2016 at 10:16 AM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:

1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of xs_temp_snap flag?

+       if (scan->xs_temp_snap)
+               UnregisterSnapshot(scan->xs_snapshot);

I must say that I'm quite new with all this parallel stuff. If you give me a link,
where to read about snapshots for parallel workers, my review will be more helpful.
Anyway, it would be great to have more comments about it in the code.

I suspect it would be better to keep those two things formally
separate, even though they may always be the same right now.

2. Don't you mind to rename 'amestimateparallelscan' to let's say 'amparallelscan_spacerequired'
or something like this? As far as I understand there is nothing to estimate, we know this size
for sure. I guess that you've chosen this name because of 'heap_parallelscan_estimate'.
But now it looks similar to 'amestimate' which refers to indexscan cost for optimizer.
That leads to the next question.

"estimate" is being used this way quite widely now, in places like
ExecParallelEstimate. So if we're going to change the terminology we
should do it broadly.

3. Are there any changes in cost estimation? I didn't find related changes in the patch.
Parallel scan is expected to be faster and optimizer definitely should know that.

Generally the way that's reflected in the optimizer is by having the
parallel scan have a lower row count. See cost_seqscan() for an
example.
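
For illustration, the relevant pattern in cost_seqscan() looks roughly
like this (a simplified sketch, not the exact code):

    if (path->parallel_workers > 0)
    {
        /* workers plus an estimate of the leader's contribution */
        double      parallel_divisor = get_parallel_divisor(path);

        /* CPU cost is shared among the participating processes */
        cpu_run_cost /= parallel_divisor;

        /* each process is expected to return only its share of the rows */
        path->rows = clamp_row_est(path->rows / parallel_divisor);
    }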

In general, you'll probably find a lot of parallels between this patch
and ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92, which is probably a good
thing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#17Amit Kapila
amit.kapila16@gmail.com
In reply to: Anastasia Lubennikova (#15)
Re: Parallel Index Scans

On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

Hi, thank you for the patch.
Results are very promising. Do you see any drawbacks of this feature or something that requires more testing?

I think you can focus on the handling of array scan keys for testing.
In general, one of my colleagues has shown interest in testing this
patch and I think he has tested as well but never posted his findings.
I will request him to share his findings and what kind of tests he has
done, if any.

I'm willing to oo a review.

Thanks, that will be helpful.

I saw the discussion about parameters in the thread above. And I agree that we'd better concentrate
on the patch itself and add them later if necessary.

1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of xs_temp_snap flag?

+       if (scan->xs_temp_snap)
+               UnregisterSnapshot(scan->xs_snapshot);

I agree with what Rober has told in his reply.  We do it the same way for
the heap; refer to heap_endscan().
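
For comparison, the tail of heap_endscan() handles its snapshot the same
way, roughly:

    /* sketch of the corresponding code at the end of heap_endscan() */
    if (scan->rs_temp_snap)
        UnregisterSnapshot(scan->rs_snapshot);

    pfree(scan);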

I must say that I'm quite new with all this parallel stuff. If you give me a link,
where to read about snapshots for parallel workers, my review will be more helpful.

You can read transam/README.parallel. Refer "State Sharing" portion
of README to learn more about it.

Anyway, it would be great to have more comments about it in the code.

We are sharing snapshot to ensure that reads in both master backend
and worker backend can use the same snapshot. There is no harm in
adding comments, but I think it is better to be consistent with
similar heapam code. After reading README.parallel, if you still feel
that we should add more comments in the code, then we can definitely
do that.

2. Don't you mind to rename 'amestimateparallelscan' to let's say 'amparallelscan_spacerequired'
or something like this?

Sure, I am open to other names, but IMHO, let's keep "estimate" in the
name to keep it consistent with other parallel stuff. Refer
execParallel.c to see how widely this word is used.

As far as I understand there is nothing to estimate, we know this size
for sure. I guess that you've chosen this name because of 'heap_parallelscan_estimate'.
But now it looks similar to 'amestimate' which refers to indexscan cost for optimizer.
That leads to the next question.

Do you mean 'amcostestimate'? If you want we can rename it
amparallelscanestimate to be consistent with amcostestimate.

3. Are there any changes in cost estimation?

Yes.

I didn't find related changes in the patch.
Parallel scan is expected to be faster and optimizer definitely should know that.

You can find the relevant changes in
parallel_index_opt_exec_support_v2.patch; refer to cost_index().

4. + uint8 ps_pageStatus; /* state of scan, see below */
There is no desciption below. I'd make the comment more helpful:
/* state of scan. See possible flags values in nbtree.h */

makes sense. Will change.

And why do you call it pageStatus? What does it have to do with page?

During scan this tells us whether next page is available for scan.
Another option could be to name it as scanStatus, but not sure if that
is better. Do you think if we add a comment like "indicates whether
next page is available for scan" for this variable then it will be
clear?
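
For example, the declaration could carry both pieces of information
(just a sketch of the wording):

    uint8       ps_pageStatus;  /* indicates whether next page is
                                 * available for scan; see possible
                                 * flag values in nbtree.h */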

5. Comment for _bt_parallel_seize() says:
"False indicates that we have reached the end of scan for
current scankeys and for that we return block number as P_NONE."

What is the reason to check (blkno == P_NONE) after checking (status == false)
in _bt_first() (see code below)? If comment is correct
we'll never reach _bt_parallel_done()

+               blkno = _bt_parallel_seize(scan, &status);
+               if (status == false)
+               {
+                       BTScanPosInvalidate(so->currPos);
+                       return false;
+               }
+               else if (blkno == P_NONE)
+               {
+                       _bt_parallel_done(scan);
+                       BTScanPosInvalidate(so->currPos);
+                       return false;
+               }

The first time master backend or worker hits last page (calls this
API), it will return P_NONE and after that any worker tries to fetch
next page, it will return status as false. I think we can expand a
comment to explain it clearly. Let me know, if you need more
clarification, I can explain it in detail.
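
As an illustration, the expanded comment might read roughly like this (a
sketch based on the explanation above, not necessarily the final wording):

    /*
     * *status is set to true if the caller has seized the scan; the
     * returned block number is then either a valid page to read next, or
     * P_NONE if this caller is the first one to learn that the scan is
     * finished for the current scankeys (in which case it should call
     * _bt_parallel_done()).  *status is set to false if some other
     * participant has already reached the end of the scan, so the caller
     * should just give up without calling _bt_parallel_done() again.
     */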

6. To avoid code duplication, I would wrap this into the function

+       /* initialize moreLeft/moreRight appropriately for scan direction */
+       if (ScanDirectionIsForward(dir))
+       {
+               so->currPos.moreLeft = false;
+               so->currPos.moreRight = true;
+       }
+       else
+       {
+               so->currPos.moreLeft = true;
+               so->currPos.moreRight = false;
+       }
+       so->numKilled = 0;                      /* just paranoia */
+       so->markItemIndex = -1;         /* ditto */

Okay, I think we can write a separate function (probably inline
function) for above.
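
For illustration, such a helper might look roughly like this (the
function name here is only a placeholder):

    static inline void
    _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
    {
        /* initialize moreLeft/moreRight appropriately for scan direction */
        if (ScanDirectionIsForward(dir))
        {
            so->currPos.moreLeft = false;
            so->currPos.moreRight = true;
        }
        else
        {
            so->currPos.moreLeft = true;
            so->currPos.moreRight = false;
        }
        so->numKilled = 0;          /* just paranoia */
        so->markItemIndex = -1;     /* ditto */
    }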

And after that we can also get rid of _bt_parallel_readpage() which only
bring another level of indirection to the code.

See, this function is responsible for multiple actions like
initializing moreLeft/moreRight positions, reading next page, dropping
the lock/pin. So replicating all these actions in the caller will
make the code in caller less readable as compared to now. Consider
this point and let me know your view on same.

7. Just a couple of typos I've noticed:

* Below flags are used indicate the state of parallel scan.
* Below flags are used TO indicate the state of parallel scan.

* On success, release lock and pin on buffer on success.
* On success release lock and pin on buffer.

Will fix.

8. I didn't find a description of the feature in documentation.
Probably we need to add a paragraph to the "Parallel Query" chapter.

Yes, I am aware of that and I think it makes sense to add it now
rather than waiting until the end.

I will send another review of performance until the end of the week.

Okay, you can refer Rafia's mail above for non-default settings she
has used in her performance tests with TPC-H.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#17)
Re: Parallel Index Scans

On Thu, Dec 22, 2016 at 9:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

Hi, thank you for the patch.
Results are very promising. Do you see any drawbacks of this feature or something that requires more testing?

I think you can focus on the handling of array scan keys for testing.
In general, one of my colleagues has shown interest in testing this
patch and I think he has tested as well but never posted his findings.
I will request him to share his findings and what kind of tests he has
done, if any.

I'm willing to oo a review.

Thanks, that will be helpful.

I saw the discussion about parameters in the thread above. And I agree that we'd better concentrate
on the patch itself and add them later if necessary.

1. Can't we simply use "if (scan->parallel_scan != NULL)" instead of xs_temp_snap flag?

+       if (scan->xs_temp_snap)
+               UnregisterSnapshot(scan->xs_snapshot);

I agree with what Rober has told in his reply.

Typo.
/Rober/Robert Haas

Thanks to Michael Paquier for noticing it and informing me offline.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#19tushar
tushar.ahuja@enterprisedb.com
In reply to: Amit Kapila (#17)
3 attachment(s)
Re: Parallel Index Scans

On 12/22/2016 09:49 AM, Amit Kapila wrote:

I think you can focus on the handling of array scan keys for testing.
In general, one of my colleagues has shown interest in testing this
patch and I think he has tested as well but never posted his findings.
I will request him to share his findings and what kind of tests he has
done, if any.

Sure, we (Prabhat and I) have done some testing for this feature
internally but never published the test scripts on this forum. PFA the
SQL scripts (along with the expected .out files) we have used for
testing, for your ready reference.

In addition, we generated the LCOV (code coverage) report and
compared the files which are changed by the "Parallel index scan" patch.
You can see the numbers for "with patch" vs. "without patch" (the .pdf
file is attached).

--
regards,tushar

Attachments:

lcov_report_compare.pdfapplication/pdf; name=lcov_report_compare.pdfDownload
pis_testcases.sqltext/x-sql; name=pis_testcases.sqlDownload
pis_testcases.outtext/plain; charset=UTF-8; name=pis_testcases.outDownload
#20tushar
tushar.ahuja@enterprisedb.com
In reply to: tushar (#19)
Re: Parallel Index Scans

On 12/22/2016 01:35 PM, tushar wrote:

On 12/22/2016 09:49 AM, Amit Kapila wrote:

I think you can focus on the handling of array scan keys for testing.
In general, one of my colleagues has shown interest in testing this
patch and I think he has tested as well but never posted his findings.
I will request him to share his findings and what kind of tests he has
done, if any.

Sure, We (Prabhat and I) have done some testing for this feature
internally but never published the test-scripts on this forum. PFA the
sql scripts ( along with the expected .out files) we have used for
testing for your ready reference.

In addition we had generated the LCOV (code coverage) report and
compared the files which are changed for the "Parallel index scan" patch.
You can see the numbers for "with patch" V/s "Without patch" (.pdf
file is attached)

In addition to that, we ran sqlsmith against PG v10 + PIS (parallel
index scan) patches and found a crash, but it occurs on plain PG
v10 (without applying any patches) as well.

postgres=# select
70 as c0,
pg_catalog.has_server_privilege(
cast(ref_0.indexdef as text),
cast(cast(coalesce((select name from pg_catalog.pg_settings
limit 1 offset 16)
,
null) as text) as text)) as c1,
pg_catalog.pg_export_snapshot() as c2,
ref_0.indexdef as c3,
ref_0.indexname as c4
from
pg_catalog.pg_indexes as ref_0
where (ref_0.tablespace = ref_0.tablespace)
or (46 = 22)
limit 103;
TRAP: FailedAssertion("!(keylen < 64)", File: "hashfunc.c", Line: 139)
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2016-12-23
11:19:50.627 IST [2314] LOG: server process (PID 2322) was terminated
by signal 6: Aborted
2016-12-23 11:19:50.627 IST [2314] DETAIL: Failed process was running:
select
70 as c0,
pg_catalog.has_server_privilege(
cast(ref_0.indexdef as text),
cast(cast(coalesce((select name from
pg_catalog.pg_settings limit 1 offset 16)
,
null) as text) as text)) as c1,
pg_catalog.pg_export_snapshot() as c2,
ref_0.indexdef as c3,
ref_0.indexname as c4
from
pg_catalog.pg_indexes as ref_0
where (ref_0.tablespace = ref_0.tablespace)
or (46 = 22)
limit 103;
2016-12-23 11:19:50.627 IST [2314] LOG: terminating any other active
server processes
2016-12-23 11:19:50.627 IST [2319] WARNING: terminating connection
because of crash of another server process
2016-12-23 11:19:50.627 IST [2319] DETAIL: The postmaster has commanded
this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly corrupted
shared memory.
2016-12-23 11:19:50.627 IST [2319] HINT: In a moment you should be able
to reconnect to the database and repeat your command.
2016-12-23 11:19:50.629 IST [2323] FATAL: the database system is in
recovery mode
Failed.
!> 2016-12-23 11:19:50.629 IST [2314] LOG: all server processes
terminated; reinitializing
2016-12-23 11:19:50.658 IST [2324] LOG: database system was
interrupted; last known up at 2016-12-23 11:19:47 IST
2016-12-23 11:19:50.810 IST [2324] LOG: database system was not
properly shut down; automatic recovery in progress
2016-12-23 11:19:50.812 IST [2324] LOG: invalid record length at
0/155E408: wanted 24, got 0
2016-12-23 11:19:50.812 IST [2324] LOG: redo is not required
2016-12-23 11:19:50.819 IST [2324] LOG: MultiXact member wraparound
protections are now enabled
2016-12-23 11:19:50.822 IST [2314] LOG: database system is ready to
accept connections
2016-12-23 11:19:50.822 IST [2328] LOG: autovacuum launcher started

--
regards,tushar


#21Robert Haas
robertmhaas@gmail.com
In reply to: tushar (#20)
Re: Parallel Index Scans

On Fri, Dec 23, 2016 at 1:35 AM, tushar <tushar.ahuja@enterprisedb.com> wrote:

In addition to that, we run the sqlsmith against PG v10+PIS (parallel index
scan) patches and found a crash but that is coming on plain PG v10
(without applying any patches) as well

So why are you reporting it here rather than on a separate thread?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#22tushar
tushar.ahuja@enterprisedb.com
In reply to: Robert Haas (#21)
Re: Parallel Index Scans

On 12/23/2016 05:38 PM, Robert Haas wrote:

So why are you reporting it here rather than on a separate thread?

We found it while testing parallel index scan, and later it turned out
to be a crash in general.
Sure, makes sense; will do that.

--
regards,tushar


#23Rahila Syed
rahilasyed90@gmail.com
In reply to: Amit Kapila (#17)
Re: Parallel Index Scans

5. Comment for _bt_parallel_seize() says:
"False indicates that we have reached the end of scan for
current scankeys and for that we return block number as P_NONE."

What is the reason to check (blkno == P_NONE) after checking (status ==

false)

in _bt_first() (see code below)? If comment is correct
we'll never reach _bt_parallel_done()

+               blkno = _bt_parallel_seize(scan, &status);
+               if (status == false)
+               {
+                       BTScanPosInvalidate(so->currPos);
+                       return false;
+               }
+               else if (blkno == P_NONE)
+               {
+                       _bt_parallel_done(scan);
+                       BTScanPosInvalidate(so->currPos);
+                       return false;
+               }

The first time master backend or worker hits last page (calls this
API), it will return P_NONE and after that any worker tries to fetch
next page, it will return status as false. I think we can expand a
comment to explain it clearly. Let me know, if you need more
clarification, I can explain it in detail.

Probably this was confusing because we have not mentioned
that P_NONE can be returned even when status = TRUE and
not just when status is false.

I think the comment above the function can be modified as follows:

+ /*
+ * True indicates that the block number returned is either valid including
+ * P_NONE and scan is continued or block number is invalid and scan has just
+ * begun.

Thank you,
Rahila Syed


#24Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Amit Kapila (#17)
Re: Parallel Index Scans

22.12.2016 07:19, Amit Kapila:

On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

Hi, thank you for the patch.
Results are very promising. Do you see any drawbacks of this feature or something that requires more testing?

I think you can focus on the handling of array scan keys for testing.
In general, one of my colleagues has shown interest in testing this
patch and I think he has tested as well but never posted his findings.
I will request him to share his findings and what kind of tests he has
done, if any.

Please check the code related to buffer locking and pinning once again.
I got the following warning. Here are the steps to reproduce it
(the config is default except for "autovacuum = off").

pgbench -i -s 100 test
pgbench -c 10 -T 120 test

SELECT count(aid) FROM pgbench_accounts
WHERE aid > 1000 AND aid < 900000 AND bid > 800 AND bid < 900;
WARNING: buffer refcount leak: [8297] (rel=base/12289/16459,
blockNum=2469, flags=0x93800000, refcount=1 1)
count
-------
0
(1 row)

postgres=# select 16459::regclass;
regclass
-----------------------
pgbench_accounts_pkey

2. Don't you mind to rename 'amestimateparallelscan' to let's say 'amparallelscan_spacerequired'
or something like this?

Sure, I am open to other names, but IMHO, lets keep "estimate" in the
name to keep it consistent with other parallel stuff. Refer
execParallel.c to see how widely this word is used.

As far as I understand there is nothing to estimate, we know this size
for sure. I guess that you've chosen this name because of 'heap_parallelscan_estimate'.
But now it looks similar to 'amestimate' which refers to indexscan cost for optimizer.
That leads to the next question.

Do you mean 'amcostestimate'? If you want we can rename it
amparallelscanestimate to be consistent with amcostestimate.

I think that 'amparallelscanestimate' seems less ambiguous than
amestimateparallelscan.
But it's up to you. There are enough comments to understand the purpose
of this field.

And why do you call it pageStatus? What does it have to do with page?

During scan this tells us whether next page is available for scan.
Another option could be to name it as scanStatus, but not sure if that
is better. Do you think if we add a comment like "indicates whether
next page is available for scan" for this variable then it will be
clear?

Yes, I think it describes the flag better.

5. Comment for _bt_parallel_seize() says:
"False indicates that we have reached the end of scan for
current scankeys and for that we return block number as P_NONE."

What is the reason to check (blkno == P_NONE) after checking (status == false)
in _bt_first() (see code below)? If comment is correct
we'll never reach _bt_parallel_done()

+               blkno = _bt_parallel_seize(scan, &status);
+               if (status == false)
+               {
+                       BTScanPosInvalidate(so->currPos);
+                       return false;
+               }
+               else if (blkno == P_NONE)
+               {
+                       _bt_parallel_done(scan);
+                       BTScanPosInvalidate(so->currPos);
+                       return false;
+               }

The first time master backend or worker hits last page (calls this
API), it will return P_NONE and after that any worker tries to fetch
next page, it will return status as false. I think we can expand a
comment to explain it clearly. Let me know, if you need more
clarification, I can explain it in detail.

Got it,
I think you can add this explanation to the comment for
_bt_parallel_seize().

6. To avoid code duplication, I would wrap this into the function

+       /* initialize moreLeft/moreRight appropriately for scan direction */
+       if (ScanDirectionIsForward(dir))
+       {
+               so->currPos.moreLeft = false;
+               so->currPos.moreRight = true;
+       }
+       else
+       {
+               so->currPos.moreLeft = true;
+               so->currPos.moreRight = false;
+       }
+       so->numKilled = 0;                      /* just paranoia */
+       so->markItemIndex = -1;         /* ditto */

Okay, I think we can write a separate function (probably inline
function) for above.

And after that we can also get rid of _bt_parallel_readpage() which only
bring another level of indirection to the code.

See, this function is responsible for multiple actions like
initializing moreLeft/moreRight positions, reading next page, dropping
the lock/pin. So replicating all these actions in the caller will
make the code in caller less readable as compared to now. Consider
this point and let me know your view on same.

Thank you for clarification, now I agree with your implementation.
I've just missed that we also handle lock in this function.

Performance results with 2 parallel workers are about 1.5-3 times
better, just like in your tests.
So, no doubt, this feature will be useful.
But I'm trying to find the worst cases for this feature. And I suppose
we should test parallel index scans with
concurrent insertions. The more parallel readers we have, the higher the
concurrency.
I doubt that it can significantly decrease performance, because number
of parallel readers is not that big,
but it is worth testing.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#25Amit Kapila
amit.kapila16@gmail.com
In reply to: Anastasia Lubennikova (#24)
2 attachment(s)
Re: Parallel Index Scans

On Fri, Dec 23, 2016 at 6:42 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

22.12.2016 07:19, Amit Kapila:

On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

Hi, thank you for the patch.
Results are very promising. Do you see any drawbacks of this feature or
something that requires more testing?

I think you can focus on the handling of array scan keys for testing.
In general, one of my colleagues has shown interest in testing this
patch and I think he has tested as well but never posted his findings.
I will request him to share his findings and what kind of tests he has
done, if any.

Check please code related to buffer locking and pinning once again.
I got the warning. Here are the steps to reproduce it:
Except "autovacuum = off" config is default.

pgbench -i -s 100 test
pgbench -c 10 -T 120 test

SELECT count(aid) FROM pgbench_accounts
WHERE aid > 1000 AND aid < 900000 AND bid > 800 AND bid < 900;
WARNING: buffer refcount leak: [8297] (rel=base/12289/16459, blockNum=2469,
flags=0x93800000, refcount=1 1)
count

A similar problem occurred while testing the "parallel index only
scan" patch, and Rafia has included the fix in her patch [1]/messages/by-id/CAOGQiiNx4Ra9A-RyxjrgECownmVJ64EVpVgfN8ACR-MLupGnng@mail.gmail.com, which
ideally should be included in this patch, so I have copied the fix
from her patch. Apart from that, I observed that a similar problem can
happen for backward scans, so I fixed that as well.

2. Don't you mind to rename 'amestimateparallelscan' to let's say
'amparallelscan_spacerequired'
or something like this?

Sure, I am open to other names, but IMHO, lets keep "estimate" in the
name to keep it consistent with other parallel stuff. Refer
execParallel.c to see how widely this word is used.

As far as I understand there is nothing to estimate, we know this size
for sure. I guess that you've chosen this name because of
'heap_parallelscan_estimate'.
But now it looks similar to 'amestimate' which refers to indexscan cost
for optimizer.
That leads to the next question.

Do you mean 'amcostestimate'? If you want we can rename it
amparallelscanestimate to be consistent with amcostestimate.

I think that 'amparallelscanestimate' seems less ambiguous than
amestimateparallelscan.
But it's up to you. There are enough comments to understand the purpose of
this field.

Okay, then let's leave it as it is, because we have aminitparallelscan
which should also be renamed to amparallelscaninit if we rename
amestimateparallelscan.

And why do you call it pageStatus? What does it have to do with page?

During scan this tells us whether next page is available for scan.
Another option could be to name it as scanStatus, but not sure if that
is better. Do you think if we add a comment like "indicates whether
next page is available for scan" for this variable then it will be
clear?

Yes, I think it describes the flag better.

Changed as per above suggestion.

5. Comment for _bt_parallel_seize() says:
"False indicates that we have reached the end of scan for
current scankeys and for that we return block number as P_NONE."

What is the reason to check (blkno == P_NONE) after checking (status ==
false)
in _bt_first() (see code below)? If comment is correct
we'll never reach _bt_parallel_done()

+               blkno = _bt_parallel_seize(scan, &status);
+               if (status == false)
+               {
+                       BTScanPosInvalidate(so->currPos);
+                       return false;
+               }
+               else if (blkno == P_NONE)
+               {
+                       _bt_parallel_done(scan);
+                       BTScanPosInvalidate(so->currPos);
+                       return false;
+               }

The first time master backend or worker hits last page (calls this
API), it will return P_NONE and after that any worker tries to fetch
next page, it will return status as false. I think we can expand a
comment to explain it clearly. Let me know, if you need more
clarification, I can explain it in detail.

Got it,
I think you can add this explanation to the comment for
_bt_parallel_seize().

Expanded the comment as discussed.

6. To avoid code duplication, I would wrap this into the function

+       /* initialize moreLeft/moreRight appropriately for scan direction
*/
+       if (ScanDirectionIsForward(dir))
+       {
+               so->currPos.moreLeft = false;
+               so->currPos.moreRight = true;
+       }
+       else
+       {
+               so->currPos.moreLeft = true;
+               so->currPos.moreRight = false;
+       }
+       so->numKilled = 0;                      /* just paranoia */
+       so->markItemIndex = -1;         /* ditto */

Okay, I think we can write a separate function (probably inline
function) for above.

Added the above code in a separate inline function.

And after that we can also get rid of _bt_parallel_readpage() which only
bring another level of indirection to the code.

See, this function is responsible for multiple actions like
initializing moreLeft/moreRight positions, reading next page, dropping
the lock/pin. So replicating all these actions in the caller will
make the code in caller less readable as compared to now. Consider
this point and let me know your view on same.

Thank you for clarification, now I agree with your implementation.
I've just missed that we also handle lock in this function.

Okay.

7. Just a couple of typos I've noticed:

* Below flags are used indicate the state of parallel scan.
* Below flags are used TO indicate the state of parallel scan.

* On success, release lock and pin on buffer on success.
* On success release lock and pin on buffer.

Fixed.

8. I didn't find a description of the feature in documentation.
Probably we need to add a paragraph to the "Parallel Query" chapter.

Added the description in parallel_index_opt_exec_support_v3.patch.

Performance results with 2 parallel workers are about 1.5-3 times better,
just like in your tests.
So, no doubt, this feature will be useful.

Thanks for the tests.

But I'm trying to find the worst cases for this feature. And I suppose we
should test parallel index scans with
concurrent insertions. More parallel readers we have, higher the
concurrency.
I doubt that it can significantly decrease performance, because number of
parallel readers is not that big,

I am not sure if such a test is meaningful for this patch because
parallelism is generally used for large data reads and in such cases
there are generally not many concurrent writes.

Thanks for your valuable inputs.

[1]: /messages/by-id/CAOGQiiNx4Ra9A-RyxjrgECownmVJ64EVpVgfN8ACR-MLupGnng@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_index_scan_v3.patchapplication/octet-stream; name=parallel_index_scan_v3.patchDownload
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 40f201b..8d4f9f7 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -77,61 +77,64 @@
   <para>
    The structure <structname>IndexAmRoutine</structname> is defined thus:
 <programlisting>
-typedef struct IndexAmRoutine
-{
-    NodeTag     type;
-
-    /*
-     * Total number of strategies (operators) by which we can traverse/search
-     * this AM.  Zero if AM does not have a fixed set of strategy assignments.
-     */
-    uint16      amstrategies;
-    /* total number of support functions that this AM uses */
-    uint16      amsupport;
-    /* does AM support ORDER BY indexed column's value? */
-    bool        amcanorder;
-    /* does AM support ORDER BY result of an operator on indexed column? */
-    bool        amcanorderbyop;
-    /* does AM support backward scanning? */
-    bool        amcanbackward;
-    /* does AM support UNIQUE indexes? */
-    bool        amcanunique;
-    /* does AM support multi-column indexes? */
-    bool        amcanmulticol;
-    /* does AM require scans to have a constraint on the first index column? */
-    bool        amoptionalkey;
-    /* does AM handle ScalarArrayOpExpr quals? */
-    bool        amsearcharray;
-    /* does AM handle IS NULL/IS NOT NULL quals? */
-    bool        amsearchnulls;
-    /* can index storage data type differ from column data type? */
-    bool        amstorage;
-    /* can an index of this type be clustered on? */
-    bool        amclusterable;
-    /* does AM handle predicate locks? */
-    bool        ampredlocks;
-    /* type of data stored in index, or InvalidOid if variable */
-    Oid         amkeytype;
-
-    /* interface functions */
-    ambuild_function ambuild;
-    ambuildempty_function ambuildempty;
-    aminsert_function aminsert;
-    ambulkdelete_function ambulkdelete;
-    amvacuumcleanup_function amvacuumcleanup;
-    amcanreturn_function amcanreturn;   /* can be NULL */
-    amcostestimate_function amcostestimate;
-    amoptions_function amoptions;
-    amproperty_function amproperty;     /* can be NULL */
-    amvalidate_function amvalidate;
-    ambeginscan_function ambeginscan;
-    amrescan_function amrescan;
-    amgettuple_function amgettuple;     /* can be NULL */
-    amgetbitmap_function amgetbitmap;   /* can be NULL */
-    amendscan_function amendscan;
-    ammarkpos_function ammarkpos;       /* can be NULL */
-    amrestrpos_function amrestrpos;     /* can be NULL */
-} IndexAmRoutine;
+  typedef struct IndexAmRoutine
+  {
+  NodeTag     type;
+
+  /*
+  * Total number of strategies (operators) by which we can traverse/search
+  * this AM.  Zero if AM does not have a fixed set of strategy assignments.
+  */
+  uint16      amstrategies;
+  /* total number of support functions that this AM uses */
+  uint16      amsupport;
+  /* does AM support ORDER BY indexed column's value? */
+  bool        amcanorder;
+  /* does AM support ORDER BY result of an operator on indexed column? */
+  bool        amcanorderbyop;
+  /* does AM support backward scanning? */
+  bool        amcanbackward;
+  /* does AM support UNIQUE indexes? */
+  bool        amcanunique;
+  /* does AM support multi-column indexes? */
+  bool        amcanmulticol;
+  /* does AM require scans to have a constraint on the first index column? */
+  bool        amoptionalkey;
+  /* does AM handle ScalarArrayOpExpr quals? */
+  bool        amsearcharray;
+  /* does AM handle IS NULL/IS NOT NULL quals? */
+  bool        amsearchnulls;
+  /* can index storage data type differ from column data type? */
+  bool        amstorage;
+  /* can an index of this type be clustered on? */
+  bool        amclusterable;
+  /* does AM handle predicate locks? */
+  bool        ampredlocks;
+  /* type of data stored in index, or InvalidOid if variable */
+  Oid         amkeytype;
+
+  /* interface functions */
+  ambuild_function ambuild;
+  ambuildempty_function ambuildempty;
+  aminsert_function aminsert;
+  ambulkdelete_function ambulkdelete;
+  amvacuumcleanup_function amvacuumcleanup;
+  amcanreturn_function amcanreturn;   /* can be NULL */
+  amcostestimate_function amcostestimate;
+  amoptions_function amoptions;
+  amproperty_function amproperty;     /* can be NULL */
+  amvalidate_function amvalidate;
+  amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
+  aminitparallelscan_function aminitparallelscan;    /* can be NULL */
+  ambeginscan_function ambeginscan;
+  amrescan_function amrescan;
+  amparallelrescan_function amparallelrescan;    /* can be NULL */
+  amgettuple_function amgettuple;     /* can be NULL */
+  amgetbitmap_function amgetbitmap;   /* can be NULL */
+  amendscan_function amendscan;
+  ammarkpos_function ammarkpos;       /* can be NULL */
+  amrestrpos_function amrestrpos;     /* can be NULL */
+  } IndexAmRoutine;
 </programlisting>
   </para>
 
@@ -459,6 +462,36 @@ amvalidate (Oid opclassoid);
    invalid.  Problems should be reported with <function>ereport</> messages.
   </para>
 
+  <para>
+<programlisting>
+Size
+amestimateparallelscan (Size index_size);
+</programlisting>
+    Estimate and return the storage space required for parallel index scan.
+    The <literal>index_size</> parameter indicates the size of generic parallel
+    index scan information.  The size of index-type-specific parallel information
+    will be added to <literal>index_size</> and result will be returned back to
+    caller.
+  </para>
+
+  <para>
+<programlisting>
+void
+aminitparallelscan (void *target);
+</programlisting>
+   Initialize the parallel index scan state.  It will be used to initialize
+   index-type-specific parallel information which will be stored immediately
+   after generic parallel information required for parallel index scans.  The
+   required state information will be set in <literal>target</>.
+  </para>
+
+   <para>
+     The <function>aminitparallelscan</> and <function>amestimateparallelscan</>
+     functions need only be provided if the access method supports <quote>parallel</>
+     index scans.  If it doesn't, the <structfield>aminitparallelscan</> and
+     <structfield>amestimateparallelscan</> fields in its <structname>IndexAmRoutine</>
+     struct must be set to NULL.
+   </para>
 
   <para>
    The purpose of an index, of course, is to support scans for tuples matching
@@ -511,6 +544,23 @@ amrescan (IndexScanDesc scan,
 
   <para>
 <programlisting>
+void
+amparallelrescan (IndexScanDesc scan);
+</programlisting>
+   Restart the parallel index scan.  It resets the shared parallel index scan
+   state.  It must be called only when the scan is restarted, which is
+   typically needed for the inner side of a nestloop join.
+  </para>
+
+  <para>
+   The <function>amparallelrescan</> function need only be provided if the
+   access method supports <quote>parallel</> index scans.  If it doesn't,
+   the <structfield>amparallelrescan</> field in its <structname>IndexAmRoutine</>
+   struct must be set to NULL.
+  </para>
+
+  <para>
+<programlisting>
 boolean
 amgettuple (IndexScanDesc scan,
             ScanDirection direction);
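For concreteness, here is a minimal sketch (not part of the patch) of how a hypothetical access method could provide the three optional parallel-scan callbacks documented above.  It is modeled on the btree implementation added later in this patch; MyAmParallelScanData, its fields, and the myam* names are invented purely for illustration.

typedef struct MyAmParallelScanData
{
	BlockNumber	ps_nextPage;	/* next block to hand out to a participant */
	slock_t		ps_mutex;		/* protects ps_nextPage */
} MyAmParallelScanData;

static Size
myamestimateparallelscan(Size index_size)
{
	/* generic ParallelIndexScanDescData size plus our shared state */
	return add_size(index_size, sizeof(MyAmParallelScanData));
}

static void
myaminitparallelscan(void *target)
{
	MyAmParallelScanData *state = (MyAmParallelScanData *) target;

	SpinLockInit(&state->ps_mutex);
	state->ps_nextPage = InvalidBlockNumber;
}

static void
myamparallelrescan(IndexScanDesc scan)
{
	MyAmParallelScanData *state = (MyAmParallelScanData *)
		OffsetToPointer((void *) scan->parallel_scan,
						scan->parallel_scan->ps_offset);

	SpinLockAcquire(&state->ps_mutex);
	state->ps_nextPage = InvalidBlockNumber;
	SpinLockRelease(&state->ps_mutex);
}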
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1545f03..42ef91f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1199,7 +1199,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting in an extension.</entry>
         </row>
         <row>
-         <entry morerows="9"><literal>IPC</></entry>
+         <entry morerows="10"><literal>IPC</></entry>
          <entry><literal>BgWorkerShutdown</></entry>
          <entry>Waiting for background worker to shut down.</entry>
         </row>
@@ -1232,6 +1232,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting for parallel workers to finish computing.</entry>
         </row>
         <row>
+         <entry><literal>ParallelIndexScan</></entry>
+         <entry>Waiting for next block to be available for parallel index scan.</entry>
+        </row>
+        <row>
          <entry><literal>SafeSnapshot</></entry>
          <entry>Waiting for a snapshot for a <literal>READ ONLY DEFERRABLE</> transaction.</entry>
         </row>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1b45a4c..5ff6a0f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -104,8 +104,11 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = brinoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = brinvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = brinbeginscan;
 	amroutine->amrescan = brinrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = bringetbitmap;
 	amroutine->amendscan = brinendscan;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index f07eedc..6f7024e 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,8 +61,11 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = ginoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = ginvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = ginbeginscan;
 	amroutine->amrescan = ginrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = gingetbitmap;
 	amroutine->amendscan = ginendscan;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b8aa9bc..5727ef9 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -81,8 +81,11 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = gistoptions;
 	amroutine->amproperty = gistproperty;
 	amroutine->amvalidate = gistvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = gistbeginscan;
 	amroutine->amrescan = gistrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = gistgettuple;
 	amroutine->amgetbitmap = gistgetbitmap;
 	amroutine->amendscan = gistendscan;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 1fa087a..9e5740b 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -78,8 +78,11 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = hashoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = hashvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = hashbeginscan;
 	amroutine->amrescan = hashrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = hashgettuple;
 	amroutine->amgetbitmap = hashgetbitmap;
 	amroutine->amendscan = hashendscan;
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 65c941d..84ebd72 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -375,7 +375,7 @@ systable_beginscan(Relation heapRelation,
 		}
 
 		sysscan->iscan = index_beginscan(heapRelation, irel,
-										 snapshot, nkeys, 0);
+										 snapshot, NULL, nkeys, 0, false);
 		index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 		sysscan->scan = NULL;
 	}
@@ -577,7 +577,7 @@ systable_beginscan_ordered(Relation heapRelation,
 	}
 
 	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
-									 snapshot, nkeys, 0);
+									 snapshot, NULL, nkeys, 0, false);
 	index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 	sysscan->scan = NULL;
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 54b71cb..72e1b03 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -207,6 +207,70 @@ index_insert(Relation indexRelation,
 }
 
 /*
+ * index_parallelscan_estimate - estimate storage for ParallelIndexScanDesc
+ *
+ * It calls the AM-specific routine to obtain the size of the AM-specific
+ * shared information.
+ */
+Size
+index_parallelscan_estimate(Relation indexrel, Snapshot snapshot)
+{
+	Size		index_size = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
+									  EstimateSnapshotSpace(snapshot));
+
+	/* amestimateparallelscan is optional; assume no-op if not provided by AM */
+	if (indexrel->rd_amroutine->amestimateparallelscan == NULL)
+		return index_size;
+	else
+		return indexrel->rd_amroutine->amestimateparallelscan(index_size);
+}
+
+/*
+ * index_parallelscan_initialize - initialize ParallelIndexScanDesc
+ *
+ * It calls the access-method-specific initialization routine to initialize
+ * the AM-specific information.  Call this just once in the leader process;
+ * then, individual workers attach via index_beginscan_parallel.
+ */
+void
+index_parallelscan_initialize(Relation heaprel, Relation indexrel, Snapshot snapshot, ParallelIndexScanDesc target)
+{
+	Size		offset = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
+								  EstimateSnapshotSpace(snapshot));
+	void	   *amtarget = (char *) ((void *) target) + offset;
+
+	target->ps_relid = RelationGetRelid(heaprel);
+	target->ps_indexid = RelationGetRelid(indexrel);
+	target->ps_offset = offset;
+	SerializeSnapshot(snapshot, target->ps_snapshot_data);
+
+	/* aminitparallelscan is optional; assume no-op if not provided by AM */
+	if (indexrel->rd_amroutine->aminitparallelscan == NULL)
+		return;
+
+	indexrel->rd_amroutine->aminitparallelscan(amtarget);
+}
+
+/*
+ * index_beginscan_parallel - join parallel index scan
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
+						 int norderbys, ParallelIndexScanDesc pscan)
+{
+	Snapshot	snapshot;
+	IndexScanDesc scan;
+
+	Assert(RelationGetRelid(heaprel) == pscan->ps_relid);
+	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
+	RegisterSnapshot(snapshot);
+	scan = index_beginscan(heaprel, indexrel, snapshot, pscan, nkeys, norderbys, true);
+	return scan;
+}
+
+/*
  * index_beginscan - start a scan of an index with amgettuple
  *
  * Caller must be holding suitable locks on the heap and the index.
@@ -214,8 +278,8 @@ index_insert(Relation indexRelation,
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
-				Snapshot snapshot,
-				int nkeys, int norderbys)
+				Snapshot snapshot, ParallelIndexScanDesc pscan,
+				int nkeys, int norderbys, bool temp_snap)
 {
 	IndexScanDesc scan;
 
@@ -227,6 +291,8 @@ index_beginscan(Relation heapRelation,
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->parallel_scan = pscan;
+	scan->xs_temp_snap = temp_snap;
 
 	return scan;
 }
@@ -251,6 +317,8 @@ index_beginscan_bitmap(Relation indexRelation,
 	 * up by RelationGetIndexScan.
 	 */
 	scan->xs_snapshot = snapshot;
+	scan->parallel_scan = NULL;
+	scan->xs_temp_snap = false;
 
 	return scan;
 }
@@ -319,6 +387,22 @@ index_rescan(IndexScanDesc scan,
 }
 
 /* ----------------
+ *		index_parallelrescan  - (re)start a parallel scan of an index
+ * ----------------
+ */
+void
+index_parallelrescan(IndexScanDesc scan)
+{
+	SCAN_CHECKS;
+
+	/* amparallelrescan is optional; assume no-op if not provided by AM */
+	if (scan->indexRelation->rd_amroutine->amparallelrescan == NULL)
+		return;
+
+	scan->indexRelation->rd_amroutine->amparallelrescan(scan);
+}
+
+/* ----------------
  *		index_endscan - end a scan
  * ----------------
  */
@@ -341,6 +425,9 @@ index_endscan(IndexScanDesc scan)
 	/* Release index refcount acquired by index_beginscan */
 	RelationDecrementReferenceCount(scan->indexRelation);
 
+	if (scan->xs_temp_snap)
+		UnregisterSnapshot(scan->xs_snapshot);
+
 	/* Release the scan data structure itself */
 	IndexScanEnd(scan);
 }
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a264b92..e749a63 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,8 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "pgstat.h"
+#include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -62,6 +64,25 @@ typedef struct
 	MemoryContext pagedelcontext;
 } BTVacState;
 
+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+	BlockNumber ps_nextPage;	/* latest or next page to be scanned */
+	uint8		ps_pageStatus;	/* indicates whether next page is available
+								 * for scan. see nbtree.h for possible states
+								 * of parallel scan. */
+	int			ps_arrayKeyCount;		/* count indicating number of array
+										 * scan keys processed by parallel
+										 * scan */
+	slock_t		ps_mutex;		/* protects above variables */
+	ConditionVariable cv;		/* used to synchronize parallel scan */
+}	BTParallelScanDescData;
+
+typedef struct BTParallelScanDescData *BTParallelScanDesc;
+
 
 static void btbuildCallback(Relation index,
 				HeapTuple htup,
@@ -110,8 +131,11 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = btoptions;
 	amroutine->amproperty = btproperty;
 	amroutine->amvalidate = btvalidate;
+	amroutine->amestimateparallelscan = btestimateparallelscan;
+	amroutine->aminitparallelscan = btinitparallelscan;
 	amroutine->ambeginscan = btbeginscan;
 	amroutine->amrescan = btrescan;
+	amroutine->amparallelrescan = btparallelrescan;
 	amroutine->amgettuple = btgettuple;
 	amroutine->amgetbitmap = btgetbitmap;
 	amroutine->amendscan = btendscan;
@@ -467,6 +491,175 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 }
 
 /*
+ * btestimateparallelscan - estimate storage for BTParallelScanDescData
+ */
+Size
+btestimateparallelscan(Size index_size)
+{
+	return add_size(index_size, sizeof(BTParallelScanDescData));
+}
+
+/*
+ * btinitparallelscan - initialize BTParallelScanDesc for a parallel btree scan
+ */
+void
+btinitparallelscan(void *target)
+{
+	BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
+
+	SpinLockInit(&bt_target->ps_mutex);
+	bt_target->ps_nextPage = InvalidBlockNumber;
+	bt_target->ps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->ps_arrayKeyCount = 0;
+	ConditionVariableInit(&bt_target->cv);
+}
+
+/*
+ * _bt_parallel_seize() -- returns the next block to be scanned for a forward
+ *		scan, or the latest block scanned for a backward scan.
+ *
+ *	status - True indicates that the returned block number is valid: either
+ *			 the scan is continuing, or it has just begun (the block number
+ *			 is invalid), or it has finished (the block number is P_NONE).
+ *			 False indicates that we have reached the end of the scan for
+ *			 the current scan keys; in that case the block number returned
+ *			 is also P_NONE.
+ *
+ *	The first participant (master backend or worker) to hit the last page
+ *	returns P_NONE with status 'True'; any participant that tries to fetch
+ *	the next page after that gets status 'False'.
+ *
+ * Callers should ignore the return value when status is false.
+ */
+BlockNumber
+_bt_parallel_seize(IndexScanDesc scan, bool *status)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	uint8		pageStatus;
+	bool		exit_loop = false;
+	BlockNumber nextPage = InvalidBlockNumber;
+	BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
+												(void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+	*status = true;
+	while (1)
+	{
+		SpinLockAcquire(&btscan->ps_mutex);
+		pageStatus = btscan->ps_pageStatus;
+		if (so->arrayKeyCount < btscan->ps_arrayKeyCount)
+			*status = false;
+		else if (pageStatus == BTPARALLEL_DONE)
+			*status = false;
+		else if (pageStatus != BTPARALLEL_ADVANCING)
+		{
+			btscan->ps_pageStatus = BTPARALLEL_ADVANCING;
+			nextPage = btscan->ps_nextPage;
+			exit_loop = true;
+		}
+		SpinLockRelease(&btscan->ps_mutex);
+		if (exit_loop || *status == false)
+			break;
+		ConditionVariableSleep(&btscan->cv, WAIT_EVENT_PARALLEL_INDEX_SCAN);
+	}
+	ConditionVariableCancelSleep();
+
+	/* no more pages to scan */
+	if (*status == false)
+		return P_NONE;
+
+	*status = true;
+	return nextPage;
+}
+
+/*
+ * _bt_parallel_release() -- Advances the parallel scan to allow scan of next
+ *		page
+ *
+ * It updates the next page to be scanned, allowing the parallel scan to move
+ * forward or backward depending on the scan direction, and wakes up one of
+ * the sleeping workers.
+ *
+ * For backward scan, next_page holds the latest page being scanned.
+ * For forward scan, next_page holds the next page to be scanned.
+ */
+void
+_bt_parallel_release(IndexScanDesc scan, BlockNumber next_page)
+{
+	BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
+												(void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+	SpinLockAcquire(&btscan->ps_mutex);
+	btscan->ps_nextPage = next_page;
+	btscan->ps_pageStatus = BTPARALLEL_IDLE;
+	SpinLockRelease(&btscan->ps_mutex);
+	ConditionVariableSignal(&btscan->cv);
+}
+
+/*
+ * _bt_parallel_done() -- Finishes the parallel scan
+ *
+ * This must be called when there are no pages left to scan.  It notifies all
+ * the other workers associated with this scan that the parallel scan has
+ * ended.
+ */
+void
+_bt_parallel_done(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTParallelScanDesc btscan;
+	bool		status_changed = false;
+
+	/* Do nothing, for non-parallel scans */
+	if (scan->parallel_scan == NULL)
+		return;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+	/*
+	 * Make sure the parallel scan is marked as done no more than once per
+	 * scan.  We rely on this state to initiate the next scan for multiple
+	 * array keys; see _bt_advance_array_keys.
+	 */
+	SpinLockAcquire(&btscan->ps_mutex);
+	if (so->arrayKeyCount >= btscan->ps_arrayKeyCount &&
+		btscan->ps_pageStatus != BTPARALLEL_DONE)
+	{
+		btscan->ps_pageStatus = BTPARALLEL_DONE;
+		status_changed = true;
+	}
+	SpinLockRelease(&btscan->ps_mutex);
+
+	/* wake up all the workers associated with this parallel scan */
+	if (status_changed)
+		ConditionVariableBroadcast(&btscan->cv);
+}
+
+/*
+ * _bt_parallel_advance_scan() -- Advances the parallel scan to the next set
+ *		of array keys
+ *
+ * It bumps the count of array keys processed, both in the backend-local and
+ * in the shared parallel scan state.
+ */
+void
+_bt_parallel_advance_scan(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
+												(void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+	so->arrayKeyCount++;
+	SpinLockAcquire(&btscan->ps_mutex);
+	if (btscan->ps_pageStatus == BTPARALLEL_DONE)
+	{
+		btscan->ps_nextPage = InvalidBlockNumber;
+		btscan->ps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		btscan->ps_arrayKeyCount++;
+	}
+	SpinLockRelease(&btscan->ps_mutex);
+}
+
+/*
  *	btrescan() -- rescan an index relation
  */
 void
@@ -486,6 +679,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	}
 
 	so->markItemIndex = -1;
+	so->arrayKeyCount = 0;
 	BTScanPosUnpinIfPinned(so->markPos);
 	BTScanPosInvalidate(so->markPos);
 
@@ -526,6 +720,31 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 }
 
 /*
+ *	btparallelrescan() -- reset parallel scan
+ */
+void
+btparallelrescan(IndexScanDesc scan)
+{
+	if (scan->parallel_scan)
+	{
+		BTParallelScanDesc btscan;
+
+		btscan = (BTParallelScanDesc) OffsetToPointer(
+												(void *) scan->parallel_scan,
+											 scan->parallel_scan->ps_offset);
+
+		/*
+		 * Ideally, we don't need to acquire spinlock here, but being
+		 * consistent with heap_rescan seems to be a good idea.
+		 */
+		SpinLockAcquire(&btscan->ps_mutex);
+		btscan->ps_nextPage = InvalidBlockNumber;
+		btscan->ps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		SpinLockRelease(&btscan->ps_mutex);
+	}
+}
+
+/*
  *	btendscan() -- close down a scan
  */
 void
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index ee46023..d9e0a5f 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,9 +30,12 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
 
 /*
@@ -544,8 +547,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
+	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	BlockNumber blkno;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -564,6 +569,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (!so->qual_ok)
 		return false;
 
+	/*
+	 * For parallel scans, get the page to scan from the shared state.  If the
+	 * scan has not yet started, descend the tree ourselves to find the first
+	 * leaf page to scan, keeping the other workers waiting until we have
+	 * reached it.
+	 *
+	 * If the scan has already begun, skip the descent and directly scan the
+	 * page recorded in the shared structure, or the page to its left in the
+	 * case of a backward scan.
+	 */
+	if (scan->parallel_scan != NULL)
+	{
+		blkno = _bt_parallel_seize(scan, &status);
+		if (status == false)
+		{
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno == P_NONE)
+		{
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno != InvalidBlockNumber)
+		{
+			if (!_bt_parallel_readpage(scan, blkno, dir))
+				return false;
+			goto readcomplete;
+		}
+	}
+
 	/*----------
 	 * Examine the scan keys to discover where we need to start the scan.
 	 *
@@ -743,7 +780,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * there.
 	 */
 	if (keysCount == 0)
-		return _bt_endpoint(scan, dir);
+	{
+		bool		match;
+
+		match = _bt_endpoint(scan, dir);
+		if (!match)
+		{
+			/* No match, so indicate that the (parallel) scan is finished */
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+		}
+
+		return match;
+	}
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -993,25 +1042,21 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * because nothing finer to lock exists.
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
+
+		/*
+		 * mark parallel scan as done, so that all the workers can finish
+		 * their scan
+		 */
+		_bt_parallel_done(scan);
+		BTScanPosInvalidate(so->currPos);
+
 		return false;
 	}
 	else
 		PredicateLockPage(rel, BufferGetBlockNumber(buf),
 						  scan->xs_snapshot);
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	Assert(so->markItemIndex == -1);
+	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
@@ -1060,6 +1105,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
+readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
@@ -1154,6 +1200,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	page = BufferGetPage(so->currPos.buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/* allow the next page to be processed by a parallel worker */
+	if (scan->parallel_scan)
+	{
+		if (ScanDirectionIsForward(dir))
+			_bt_parallel_release(scan, opaque->btpo_next);
+		else
+			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+	}
+
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1278,21 +1334,16 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
  * if pinned, we'll drop the pin before moving to next page.  The buffer is
  * not locked on entry.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page.  For success on a scan using a non-MVCC snapshot we hold
- * a pin, but not a read lock, on that page.  If we do not hold the pin, we
- * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
- *
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
+ * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
+ * to InvalidBuffer.  We return TRUE to indicate success.
  */
 static bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Relation	rel;
-	Page		page;
-	BTPageOpaque opaque;
+	BlockNumber blkno = InvalidBlockNumber;
+	bool		status = true;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1319,13 +1370,27 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}
 
-	rel = scan->indexRelation;
-
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
-		/* We must rely on the previously saved nextPage link! */
-		BlockNumber blkno = so->currPos.nextPage;
+
+		/*
+		 * We must rely on the previously saved nextPage link for non-parallel
+		 * scans!
+		 */
+		if (scan->parallel_scan != NULL)
+		{
+			blkno = _bt_parallel_seize(scan, &status);
+			if (status == false)
+			{
+				/* release the previous buffer, if pinned */
+				BTScanPosUnpinIfPinned(so->currPos);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+			blkno = so->currPos.nextPage;
 
 		/* Remember we left a page with data */
 		so->currPos.moreLeft = true;
@@ -1333,11 +1398,68 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		/* release the previous buffer, if pinned */
 		BTScanPosUnpinIfPinned(so->currPos);
 
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+	else
+	{
+		/* Remember we left a page with data */
+		so->currPos.moreRight = true;
+
+		/* For parallel scans, get the last page scanned */
+		if (scan->parallel_scan != NULL)
+		{
+			blkno = _bt_parallel_seize(scan, &status);
+			BTScanPosUnpinIfPinned(so->currPos);
+			if (status == false)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+
+	/* Drop the lock, and maybe the pin, on the current page */
+	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+	return true;
+}
+
+/*
+ *	_bt_readnextpage() -- Read next page containing valid data for scan
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page.  On success, the caller is responsible for releasing the
+ * lock and pin on that buffer.  We return TRUE to indicate success.
+ *
+ * If there are no more matching records in the given direction, we drop all
+ * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation	rel;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		status = true;
+
+	rel = scan->indexRelation;
+
+	if (ScanDirectionIsForward(dir))
+	{
 		for (;;)
 		{
-			/* if we're at end of scan, give up */
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
 			if (blkno == P_NONE || !so->currPos.moreRight)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1345,10 +1467,10 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			CHECK_FOR_INTERRUPTS();
 			/* step right one page */
 			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
-			/* check for deleted page */
 			page = BufferGetPage(so->currPos.buf);
 			TestForOldSnapshot(scan->xs_snapshot, rel, page);
 			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			/* check for deleted page */
 			if (!P_IGNORE(opaque))
 			{
 				PredicateLockPage(rel, blkno, scan->xs_snapshot);
@@ -1359,14 +1481,30 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			}
 
 			/* nope, keep going */
-			blkno = opaque->btpo_next;
+			if (scan->parallel_scan != NULL)
+			{
+				blkno = _bt_parallel_seize(scan, &status);
+				if (status == false)
+				{
+					_bt_relbuf(rel, so->currPos.buf);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+			}
+			else
+				blkno = opaque->btpo_next;
 			_bt_relbuf(rel, so->currPos.buf);
 		}
 	}
 	else
 	{
-		/* Remember we left a page with data */
-		so->currPos.moreRight = true;
+		/*
+		 * For parallel scans, the current block number must have been
+		 * retrieved from the shared state; it is the caller's responsibility
+		 * to pass the correct block number.
+		 */
+		if (blkno != InvalidBlockNumber)
+			so->currPos.currPage = blkno;
 
 		/*
 		 * Walk left to the next page with data.  This is much more complex
@@ -1401,6 +1539,12 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			if (!so->currPos.moreLeft)
 			{
 				_bt_relbuf(rel, so->currPos.buf);
+
+				/*
+				 * mark parallel scan as done, so that all the workers can
+				 * finish their scan
+				 */
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1412,6 +1556,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			/* if we're physically at end of index, return failure */
 			if (so->currPos.buf == InvalidBuffer)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1432,9 +1577,48 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
 					break;
 			}
+
+			/*
+			 * For parallel scans, re-fetch the last page scanned from the
+			 * shared state, since it is quite possible that by the time we
+			 * try to fetch the previous page some other worker has already
+			 * decided to scan it.  We could avoid that by calling
+			 * _bt_parallel_release only after we have read the current page,
+			 * but it would be bad to make the other workers wait until we
+			 * have done so.
+			if (scan->parallel_scan != NULL)
+			{
+				_bt_relbuf(rel, so->currPos.buf);
+				blkno = _bt_parallel_seize(scan, &status);
+				if (status == false)
+				{
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+				so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			}
 		}
 	}
 
+	return true;
+}
+
+/*
+ *	_bt_parallel_readpage() -- Read current page containing valid data for scan
+ *
+ * On success, release lock and pin on buffer.  We return TRUE to indicate
+ * success.
+ */
+static bool
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	_bt_initialize_more_data(so, dir);
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
 	/* Drop the lock, and maybe the pin, on the current page */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
@@ -1712,19 +1896,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/* remember which buffer we have pinned */
 	so->currPos.buf = buf;
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
+	_bt_initialize_more_data(so, dir);
 
 	/*
 	 * Now load data from the first page of the scan.
@@ -1753,3 +1925,25 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 
 	return true;
 }
+
+/*
+ * _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
+ * for scan direction
+ */
+static inline void
+_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
+{
+	/* initialize moreLeft/moreRight appropriately for scan direction */
+	if (ScanDirectionIsForward(dir))
+	{
+		so->currPos.moreLeft = false;
+		so->currPos.moreRight = true;
+	}
+	else
+	{
+		so->currPos.moreLeft = true;
+		so->currPos.moreRight = false;
+	}
+	so->numKilled = 0;			/* just paranoia */
+	so->markItemIndex = -1;		/* ditto */
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 063c988..15ea6df 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -590,6 +590,10 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 			break;
 	}
 
+	/* advance parallel scan */
+	if (scan->parallel_scan != NULL)
+		_bt_parallel_advance_scan(scan);
+
 	return found;
 }
 
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index d570ae5..4f39e93 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -60,8 +60,11 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = spgoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = spgvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = spgbeginscan;
 	amroutine->amrescan = spgrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = spggettuple;
 	amroutine->amgetbitmap = spggetbitmap;
 	amroutine->amendscan = spgendscan;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 2131226..02ca783 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -903,7 +903,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	if (OldIndex != NULL && !use_sort)
 	{
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0, false);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 009c1b7..f8e662b 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -722,7 +722,7 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, index_natts, 0);
+	index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, index_natts, 0, false);
 	index_rescan(index_scan, scankeys, index_natts, NULL, 0);
 
 	while ((tup = index_getnext(index_scan,
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index a364098..43c610c 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -26,6 +26,7 @@
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
 #include "utils/memutils.h"
+#include "access/relscan.h"
 
 
 /* ----------------------------------------------------------------
@@ -48,6 +49,7 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	 * extract necessary information from index scan node
 	 */
 	scandesc = node->biss_ScanDesc;
+	scandesc->parallel_scan = NULL;
 
 	/*
 	 * If we have runtime keys and they've not already been set up, do it now.
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..45566bd 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -540,9 +540,9 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	 */
 	indexstate->ioss_ScanDesc = index_beginscan(currentRelation,
 												indexstate->ioss_RelationDesc,
-												estate->es_snapshot,
+												estate->es_snapshot, NULL,
 												indexstate->ioss_NumScanKeys,
-											indexstate->ioss_NumOrderByKeys);
+									 indexstate->ioss_NumOrderByKeys, false);
 
 	/* Set it up for index-only scan */
 	indexstate->ioss_ScanDesc->xs_want_itup = true;
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..77d2990 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -1022,9 +1022,9 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	 */
 	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
 											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot,
+											   estate->es_snapshot, NULL,
 											   indexstate->iss_NumScanKeys,
-											 indexstate->iss_NumOrderByKeys);
+									  indexstate->iss_NumOrderByKeys, false);
 
 	/*
 	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 61e6a2c..1bbeec6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3386,6 +3386,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PARALLEL_FINISH:
 			event_name = "ParallelFinish";
 			break;
+		case WAIT_EVENT_PARALLEL_INDEX_SCAN:
+			event_name = "ParallelIndexScan";
+			break;
 		case WAIT_EVENT_SAFE_SNAPSHOT:
 			event_name = "SafeSnapshot";
 			break;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 4973396..9385325 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5143,8 +5143,8 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 				 * extreme value has been deleted; that case motivates not
 				 * using SnapshotAny here.
 				 */
-				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty,
-											 1, 0);
+				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty, NULL,
+											 1, 0, false);
 				index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 				/* Fetch first tuple in sortop's direction */
@@ -5175,8 +5175,8 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 			/* If max is requested, and we didn't find the index is empty */
 			if (max && have_data)
 			{
-				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty,
-											 1, 0);
+				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty, NULL,
+											 1, 0, false);
 				index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 				/* Fetch first tuple in reverse direction */
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 1036cca..e777678 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -108,6 +108,12 @@ typedef bool (*amproperty_function) (Oid index_oid, int attno,
 /* validate definition of an opclass for this AM */
 typedef bool (*amvalidate_function) (Oid opclassoid);
 
+/* estimate size of parallel scan descriptor */
+typedef Size (*amestimateparallelscan_function) (Size index_size);
+
+/* prepare for parallel index scan */
+typedef void (*aminitparallelscan_function) (void *target);
+
 /* prepare for index scan */
 typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
 														   int nkeys,
@@ -120,6 +126,9 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 											   ScanKey orderbys,
 											   int norderbys);
 
+/* (re)start parallel index scan */
+typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+
 /* next valid tuple */
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 												 ScanDirection direction);
@@ -189,8 +198,11 @@ typedef struct IndexAmRoutine
 	amoptions_function amoptions;
 	amproperty_function amproperty;		/* can be NULL */
 	amvalidate_function amvalidate;
+	amestimateparallelscan_function amestimateparallelscan;		/* can be NULL */
+	aminitparallelscan_function aminitparallelscan;		/* can be NULL */
 	ambeginscan_function ambeginscan;
 	amrescan_function amrescan;
+	amparallelrescan_function amparallelrescan; /* can be NULL */
 	amgettuple_function amgettuple;		/* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 81907d5..4410ea3 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -83,6 +83,8 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 typedef struct IndexScanDescData *IndexScanDesc;
 typedef struct SysScanDescData *SysScanDesc;
 
+typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
+
 /*
  * Enumeration specifying the type of uniqueness check to perform in
  * index_insert().
@@ -131,16 +133,23 @@ extern bool index_insert(Relation indexRelation,
 			 Relation heapRelation,
 			 IndexUniqueCheck checkUnique);
 
+extern Size index_parallelscan_estimate(Relation indexrel, Snapshot snapshot);
+extern void index_parallelscan_initialize(Relation heaprel, Relation indexrel, Snapshot snapshot, ParallelIndexScanDesc target);
+extern IndexScanDesc index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
+						 int norderbys, ParallelIndexScanDesc pscan);
+
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
-				int nkeys, int norderbys);
+				ParallelIndexScanDesc pscan,
+				int nkeys, int norderbys, bool temp_snap);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 					   Snapshot snapshot,
 					   int nkeys);
 extern void index_rescan(IndexScanDesc scan,
 			 ScanKey keys, int nkeys,
 			 ScanKey orderbys, int norderbys);
+extern void index_parallelrescan(IndexScanDesc scan);
 extern void index_endscan(IndexScanDesc scan);
 extern void index_markpos(IndexScanDesc scan);
 extern void index_restrpos(IndexScanDesc scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index c580f51..1c8aa8a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -609,6 +609,8 @@ typedef struct BTScanOpaqueData
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
 								 * there are any unsatisfiable array keys) */
+	int			arrayKeyCount;	/* count indicating number of array scan keys
+								 * processed */
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
@@ -652,7 +654,25 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
 /*
- * prototypes for functions in nbtree.c (external entry points for btree)
+ * The flags below are used to indicate the state of a parallel scan.
+ *
+ * BTPARALLEL_NOT_INITIALIZED means that the scan has not yet started.
+ *
+ * BTPARALLEL_ADVANCING means that some worker or backend is advancing the
+ * scan to a new page; the others must wait.
+ *
+ * BTPARALLEL_IDLE means that no backend is advancing the scan; anyone can
+ * start doing so.
+ *
+ * BTPARALLEL_DONE means that the scan is complete (including error exit).
+ */
+#define BTPARALLEL_NOT_INITIALIZED 0x01
+#define BTPARALLEL_ADVANCING 0x02
+#define BTPARALLEL_DONE 0x03
+#define BTPARALLEL_IDLE 0x04
+
+/*
+ * prototypes for functions in nbtree.c (external entry points for btree and
+ * functions to maintain state of parallel scan)
  */
 extern Datum bthandler(PG_FUNCTION_ARGS);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
@@ -662,10 +682,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 		 ItemPointer ht_ctid, Relation heapRel,
 		 IndexUniqueCheck checkUnique);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern Size btestimateparallelscan(Size size);
+extern void btinitparallelscan(void *target);
+extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber next_page);
+extern void _bt_parallel_done(IndexScanDesc scan);
+extern void _bt_parallel_advance_scan(IndexScanDesc scan);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		 ScanKey orderbys, int norderbys);
+extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index de98dd6..ca843f9 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -21,6 +21,9 @@
 #include "access/tupdesc.h"
 #include "storage/spin.h"
 
+#define OffsetToPointer(base, offset)\
+((void *)((char *)base + offset))
+
 /*
  * Shared state for parallel heap scan.
  *
@@ -126,8 +129,20 @@ typedef struct IndexScanDescData
 
 	/* state data for traversing HOT chains in index_getnext */
 	bool		xs_continue_hot;	/* T if must keep walking HOT chain */
+	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
+	ParallelIndexScanDesc parallel_scan;		/* parallel index scan
+												 * information */
 }	IndexScanDescData;
 
+/* Generic structure for parallel scans */
+typedef struct ParallelIndexScanDescData
+{
+	Oid			ps_relid;
+	Oid			ps_indexid;
+	Size		ps_offset;		/* Offset in bytes of am specific structure */
+	char		ps_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+} ParallelIndexScanDescData;
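The AM-specific shared state lives immediately after the serialized snapshot, and ps_offset records where it starts.  A sketch of the resulting layout, and of how an AM locates its own state from an attached IndexScanDesc named scan (this mirrors what the btree code in this patch does with OffsetToPointer):

/*
 * Layout of the shared parallel index scan descriptor:
 *
 *   ParallelIndexScanDescData   ps_relid, ps_indexid, ps_offset
 *   ps_snapshot_data[]          serialized snapshot (variable length)
 *   <AM-specific area>          starts ps_offset bytes from the top;
 *                               BTParallelScanDescData in the btree case
 */
ParallelIndexScanDesc pscan = scan->parallel_scan;
BTParallelScanDesc btscan = (BTParallelScanDesc)
	OffsetToPointer((void *) pscan, pscan->ps_offset);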
+
 /* Struct for heap-or-index scans of system tables */
 typedef struct SysScanDescData
 {
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 282f8ae..f038f64 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -784,6 +784,7 @@ typedef enum
 	WAIT_EVENT_MQ_RECEIVE,
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
+	WAIT_EVENT_PARALLEL_INDEX_SCAN,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP
 } WaitEventIPC;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 993880d..0192810 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -161,6 +161,7 @@ BTPageOpaque
 BTPageOpaqueData
 BTPageStat
 BTPageState
+BTParallelScanDesc
 BTScanOpaque
 BTScanOpaqueData
 BTScanPos
@@ -1264,6 +1265,8 @@ OverrideSearchPath
 OverrideStackEntry
 PACE_HEADER
 PACL
+ParallelIndexScanDesc
+ParallelIndexScanDescData
 PATH
 PBOOL
 PCtxtHandle
Attachment: parallel_index_opt_exec_support_v3.patch (application/octet-stream)
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 8d4f9f7..33ea60b 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -110,6 +110,8 @@
   bool        amclusterable;
   /* does AM handle predicate locks? */
   bool        ampredlocks;
+  /* does AM support parallel scan? */
+  bool        amcanparallel;
   /* type of data stored in index, or InvalidOid if variable */
   Oid         amkeytype;
 
diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index 5d4bb21..cfc3435 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -268,15 +268,28 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
   <title>Parallel Scans</title>
 
   <para>
-    Currently, the only type of scan which has been modified to work with
-    parallel query is a sequential scan.  Therefore, the driving table in
-    a parallel plan will always be scanned using a
-    <literal>Parallel Seq Scan</>.  The relation's blocks will be divided
-    among the cooperating processes.  Blocks are handed out one at a
-    time, so that access to the relation remains sequential.  Each process
-    will visit every tuple on the page assigned to it before requesting a new
-    page.
+    Currently, the types of scans that can participate in parallel query are
+    sequential scans and index scans.
   </para>
+
+  <para>
+    In a <literal>Parallel Seq Scan</>, the relation's blocks will be divided
+    among the cooperating processes.  Blocks are handed out one at a time, so
+    that access to the relation remains sequential.  Each process will visit
+    every tuple on the page assigned to it before requesting a new page.
+  </para>
+
+  <para>
+    In a <literal>Parallel Index Scan</>, the driving table is scanned through
+    an index.  Currently, the only type of index that has been modified to
+    work with parallel query is <literal>btree</>; the parallelism operates at
+    the leaf level of the btree.  The first backend (master or worker) to
+    start the scan descends the tree to the first leaf page while the others
+    wait.  From there, leaf blocks are handed out one at a time, much as in a
+    <literal>Parallel Seq Scan</>, until all the blocks have been processed or
+    the scan reaches its end point.
+  </para>
  </sect2>
 
  <sect2 id="parallel-joins">
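In terms of the btree functions added by the first patch, each participant's forward scan loop is roughly the following (a simplified sketch of what _bt_first/_bt_readnextpage do; deleted pages, array keys, backward scans, and the initial descent to the first leaf page are all omitted, and the obvious local variables are assumed to be declared):

for (;;)
{
	blkno = _bt_parallel_seize(scan, &status);	/* may sleep on the CV */
	if (!status)
		break;					/* another participant finished the scan */
	if (blkno == P_NONE)
	{
		_bt_parallel_done(scan);	/* no pages left; wake everyone up */
		break;
	}
	buf = _bt_getbuf(rel, blkno, BT_READ);
	opaque = (BTPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));
	_bt_parallel_release(scan, opaque->btpo_next);	/* hand out the next page */
	/* ... now scan the tuples on this page while others move ahead ... */
}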
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 5ff6a0f..a13b5e7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -92,6 +92,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6f7024e..38dd077 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 5727ef9..d300545 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -69,6 +69,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 9e5740b..ca26c29 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -66,6 +66,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = INT4OID;
 
 	amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e749a63..c7002d1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -19,6 +19,7 @@
 #include "postgres.h"
 
 #include "access/nbtree.h"
+#include "access/parallel.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -119,6 +120,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
+	amroutine->amcanparallel = true;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 4f39e93..98c5a4e 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -48,6 +48,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = spgbuild;
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9b29f09 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -60,6 +60,7 @@
 
 static bool TargetListSupportsBackwardScan(List *targetlist);
 static bool IndexSupportsBackwardScan(Oid indexid);
+static bool GatherSupportsBackwardScan(Plan *node);
 
 
 /*
@@ -485,7 +486,7 @@ ExecSupportsBackwardScan(Plan *node)
 			return false;
 
 		case T_Gather:
-			return false;
+			return GatherSupportsBackwardScan(node);
 
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
@@ -566,6 +567,25 @@ IndexSupportsBackwardScan(Oid indexid)
 }
 
 /*
+ * GatherSupportsBackwardScan - does a Gather plan support backward scan?
+ *
+ * Returns true if the outer plan node of the Gather supports backward scan.
+ * As of now, backward scan is supported only when the outer node of the
+ * Gather is an index scan node.
+ */
+static bool
+GatherSupportsBackwardScan(Plan *node)
+{
+	Plan	   *outer_node = outerPlan(node);
+
+	if (nodeTag(outer_node) == T_IndexScan)
+		return IndexSupportsBackwardScan(((IndexScan *) outer_node)->indexid) &&
+			TargetListSupportsBackwardScan(outer_node->targetlist);
+	else
+		return false;
+}
+
+/*
  * ExecMaterializesOutput - does a plan type materialize its output?
  *
  * Returns true if the plan node type is one that automatically materializes
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index cc42946..972be82 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -28,6 +28,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeSeqscan.h"
+#include "executor/nodeIndexscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -195,6 +196,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 				ExecSeqScanEstimate((SeqScanState *) planstate,
 									e->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanEstimate((IndexScanState *) planstate,
+									  e->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanEstimate((ForeignScanState *) planstate,
 										e->pcxt);
@@ -247,6 +252,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
 										 d->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeDSM((IndexScanState *) planstate,
+										   d->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
 											 d->pcxt);
@@ -724,6 +733,9 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 			case T_SeqScanState:
 				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeWorker((IndexScanState *) planstate, toc);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
 												toc);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 77d2990..a7d4c25 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -22,6 +22,9 @@
  *		ExecEndIndexScan		releases all storage.
  *		ExecIndexMarkPos		marks scan position.
  *		ExecIndexRestrPos		restores scan position.
+ *		ExecIndexScanEstimate	estimates DSM space needed for parallel index scan
+ *		ExecIndexScanInitializeDSM initialize DSM for parallel indexscan
+ *		ExecIndexScanInitializeWorker attach to DSM info in parallel worker
  */
 #include "postgres.h"
 
@@ -515,6 +518,15 @@ ExecIndexScan(IndexScanState *node)
 void
 ExecReScanIndexScan(IndexScanState *node)
 {
+	bool		reset_parallel_scan = true;
+
+	/*
+	 * If we are here just to update the scan keys, then don't reset the
+	 * parallel scan.
+	 */
+	if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+		reset_parallel_scan = false;
+
 	/*
 	 * If we are doing runtime key calculations (ie, any of the index key
 	 * values weren't simple Consts), compute the new key values.  But first,
@@ -540,10 +552,16 @@ ExecReScanIndexScan(IndexScanState *node)
 			reorderqueue_pop(node);
 	}
 
-	/* reset index scan */
-	index_rescan(node->iss_ScanDesc,
-				 node->iss_ScanKeys, node->iss_NumScanKeys,
-				 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+	/* reset (parallel) index scan */
+	if (node->iss_ScanDesc)
+	{
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		if (reset_parallel_scan)
+			index_parallelrescan(node->iss_ScanDesc);
+	}
 	node->iss_ReachedEnd = false;
 
 	ExecScanReScan(&node->ss);
@@ -1018,22 +1036,29 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Initialize scan descriptor.
+	 * For a parallel-aware node, we initialize the scan descriptor after
+	 * initializing the shared memory for parallel execution.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
-											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot, NULL,
-											   indexstate->iss_NumScanKeys,
+	if (!node->scan.plan.parallel_aware)
+	{
+		/*
+		 * Initialize scan descriptor.
+		 */
+		indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+												indexstate->iss_RelationDesc,
+												   estate->es_snapshot, NULL,
+												 indexstate->iss_NumScanKeys,
 									  indexstate->iss_NumOrderByKeys, false);
 
-	/*
-	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
-	 * index AM.
-	 */
-	if (indexstate->iss_NumRuntimeKeys == 0)
-		index_rescan(indexstate->iss_ScanDesc,
-					 indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
+		/*
+		 * If no run-time keys to calculate, go ahead and pass the scankeys to
+		 * the index AM.
+		 */
+		if (indexstate->iss_NumRuntimeKeys == 0)
+			index_rescan(indexstate->iss_ScanDesc,
+					   indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
 				indexstate->iss_OrderByKeys, indexstate->iss_NumOrderByKeys);
+	}
 
 	/*
 	 * all done.
@@ -1595,3 +1620,91 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 	else if (n_array_keys != 0)
 		elog(ERROR, "ScalarArrayOpExpr index qual found where not allowed");
 }
+
+/* ----------------------------------------------------------------
+ *						Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanEstimate
+ *
+ *		estimates the space required to serialize indexscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanEstimate(IndexScanState *node,
+					  ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+													 estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeDSM
+ *
+ *		Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeDSM(IndexScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+	index_parallelscan_initialize(node->ss.ss_currentRelation,
+								  node->iss_RelationDesc,
+								  estate->es_snapshot,
+								  piscan);
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeWorker
+ *
+ *		Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc)
+{
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 415edad..414645e 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -418,6 +418,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	List	   *qpquals;
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
+	Cost		cpu_run_cost = 0;
 	Cost		indexStartupCost;
 	Cost		indexTotalCost;
 	Selectivity indexSelectivity;
@@ -620,11 +621,33 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 
-	run_cost += cpu_per_tuple * tuples_fetched;
+	cpu_run_cost += cpu_per_tuple * tuples_fetched;
 
 	/* tlist eval costs are paid per output row, not per tuple scanned */
 	startup_cost += path->path.pathtarget->cost.startup;
-	run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+	cpu_run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+
+	/* Adjust costing for parallelism, if used. */
+	if (path->path.parallel_workers > 0)
+	{
+		double		parallel_divisor = path->path.parallel_workers;
+		double		leader_contribution;
+
+		/*
+		 * We divide only the cpu cost among workers and the division uses the
+		 * same formula as we use for seq scan.  See cost_seqscan.
+		 */
+		leader_contribution = 1.0 - (0.3 * path->path.parallel_workers);
+		if (leader_contribution > 0)
+			parallel_divisor += leader_contribution;
+
+		path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
+
+		/* The CPU cost is divided among all the workers. */
+		cpu_run_cost /= parallel_divisor;
+	}
+
+	run_cost += cpu_run_cost;
 
 	path->path.startup_cost = startup_cost;
 	path->path.total_cost = startup_cost + run_cost;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 2952bfb..54a8a2f 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include <limits.h>
 #include <math.h>
 
 #include "access/stratnum.h"
@@ -108,6 +109,8 @@ static bool bms_equal_any(Relids relids, List *relids_list);
 static void get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 				IndexOptInfo *index, IndexClauseSet *clauses,
 				List **bitindexpaths);
+static int index_path_get_workers(PlannerInfo *root, RelOptInfo *rel,
+					   IndexOptInfo *index);
 static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 				  IndexOptInfo *index, IndexClauseSet *clauses,
 				  bool useful_predicate,
@@ -811,6 +814,51 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 }
 
 /*
+ * index_path_get_workers
+ *	  Compute the number of workers that should be used for a parallel index scan
+ */
+static int
+index_path_get_workers(PlannerInfo *root, RelOptInfo *rel, IndexOptInfo *index)
+{
+	int			parallel_workers;
+	int			parallel_threshold;
+
+	/*
+	 * If this relation is too small to be worth a parallel scan, just return
+	 * without doing anything ... unless it's an inheritance child. In that
+	 * case, we want to generate a parallel path here anyway.  It might not be
+	 * worthwhile just for this relation, but when combined with all of its
+	 * inheritance siblings it may well pay off.
+	 */
+	if (index->pages < (BlockNumber) min_parallel_relation_size &&
+		rel->reloptkind == RELOPT_BASEREL)
+		return 0;
+
+	/*
+	 * Select the number of workers based on the log of the size of the
+	 * relation.  This probably needs to be a good deal more sophisticated,
+	 * but we need something here for now.  Note that the upper limit of the
+	 * min_parallel_relation_size GUC is chosen to prevent overflow here.
+	 */
+	parallel_workers = 1;
+	parallel_threshold = Max(min_parallel_relation_size, 1);
+	while (index->pages >= (BlockNumber) (parallel_threshold * 3))
+	{
+		parallel_workers++;
+		parallel_threshold *= 3;
+		if (parallel_threshold > INT_MAX / 3)
+			break;				/* avoid overflow */
+	}
+
+	/*
+	 * In no case use more than max_parallel_workers_per_gather workers.
+	 */
+	parallel_workers = Min(parallel_workers, max_parallel_workers_per_gather);
+
+	return parallel_workers;
+}
+
+/*
  * build_index_paths
  *	  Given an index and a set of index clauses for it, construct zero
  *	  or more IndexPaths.
@@ -1042,8 +1090,43 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  NoMovementScanDirection,
 								  index_only_scan,
 								  outer_relids,
-								  loop_count);
+								  loop_count,
+								  0);
 		result = lappend(result, ipath);
+
+		/*
+		 * If appropriate, consider parallel index scan.  We don't allow
+		 * parallel index scan for bitmap scans.
+		 */
+		if (index->amcanparallel &&
+			!index_only_scan &&
+			rel->consider_parallel &&
+			outer_relids == NULL &&
+			scantype != ST_BITMAPSCAN)
+		{
+			int			parallel_workers = 0;
+
+			parallel_workers = index_path_get_workers(root, rel, index);
+
+			if (parallel_workers > 0)
+			{
+				ipath = create_index_path(root, index,
+										  index_clauses,
+										  clause_columns,
+										  orderbyclauses,
+										  orderbyclausecols,
+										  useful_pathkeys,
+										  index_is_ordered ?
+										  ForwardScanDirection :
+										  NoMovementScanDirection,
+										  index_only_scan,
+										  outer_relids,
+										  loop_count,
+										  parallel_workers);
+
+				add_partial_path(rel, (Path *) ipath);
+			}
+		}
 	}
 
 	/*
@@ -1066,8 +1149,38 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  BackwardScanDirection,
 									  index_only_scan,
 									  outer_relids,
-									  loop_count);
+									  loop_count,
+									  0);
 			result = lappend(result, ipath);
+
+			/* If appropriate, consider parallel index scan */
+			if (index->amcanparallel &&
+				!index_only_scan &&
+				rel->consider_parallel &&
+				outer_relids == NULL &&
+				scantype != ST_BITMAPSCAN)
+			{
+				int			parallel_workers = 0;
+
+				parallel_workers = index_path_get_workers(root, rel, index);
+
+				if (parallel_workers > 0)
+				{
+					ipath = create_index_path(root, index,
+											  index_clauses,
+											  clause_columns,
+											  NIL,
+											  NIL,
+											  useful_pathkeys,
+											  BackwardScanDirection,
+											  index_only_scan,
+											  outer_relids,
+											  loop_count,
+											  parallel_workers);
+
+					add_partial_path(rel, (Path *) ipath);
+				}
+			}
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 41dde50..ea18fc1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5300,7 +5300,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	indexScanPath = create_index_path(root, indexInfo,
 									  NIL, NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
-									  NULL, 1.0);
+									  NULL, 1.0, 0);
 
 	return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 6d3ccfd..e08c330 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -744,10 +744,8 @@ add_path_precheck(RelOptInfo *parent_rel,
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
  *	  such paths yet.  Hence, GatherPaths must not be created for a rel until
- *	  we're done creating all partial paths for it.  We do not currently build
- *	  partial indexscan paths, so there is no need for an exception for
- *	  IndexPaths here; for safety, we instead Assert that a path to be freed
- *	  isn't an IndexPath.
+ *	  we're done creating all partial paths for it.  As in add_path, we make
+ *	  an exception for IndexPaths here and do not free them.
  */
 void
 add_partial_path(RelOptInfo *parent_rel, Path *new_path)
@@ -826,9 +824,12 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		{
 			parent_rel->partial_pathlist =
 				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
-			/* we should not see IndexPaths here, so always safe to delete */
-			Assert(!IsA(old_path, IndexPath));
-			pfree(old_path);
+
+			/*
+			 * Delete the data pointed-to by the deleted cell, if possible
+			 */
+			if (!IsA(old_path, IndexPath))
+				pfree(old_path);
 			/* p1_prev does not advance */
 		}
 		else
@@ -860,10 +861,9 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	}
 	else
 	{
-		/* we should not see IndexPaths here, so always safe to delete */
-		Assert(!IsA(new_path, IndexPath));
 		/* Reject and recycle the new path */
-		pfree(new_path);
+		if (!IsA(new_path, IndexPath))
+			pfree(new_path);
 	}
 }
 
@@ -1019,7 +1019,8 @@ create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count)
+				  double loop_count,
+				  int parallel_workers)
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
@@ -1031,9 +1032,9 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.pathtarget = rel->reltarget;
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
-	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_aware = parallel_workers > 0 ? true : false;
 	pathnode->path.parallel_safe = rel->consider_parallel;
-	pathnode->path.parallel_workers = 0;
+	pathnode->path.parallel_workers = parallel_workers;
 	pathnode->path.pathkeys = pathkeys;
 
 	/* Convert clauses to indexquals the executor can handle */
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 72272d9..342ceba 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -241,6 +241,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->amoptionalkey = amroutine->amoptionalkey;
 			info->amsearcharray = amroutine->amsearcharray;
 			info->amsearchnulls = amroutine->amsearchnulls;
+			info->amcanparallel = amroutine->amcanparallel;
 			info->amhasgettuple = (amroutine->amgettuple != NULL);
 			info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
 			info->amcostestimate = amroutine->amcostestimate;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index e777678..baca400 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -184,6 +184,8 @@ typedef struct IndexAmRoutine
 	bool		amclusterable;
 	/* does AM handle predicate locks? */
 	bool		ampredlocks;
+	/* does AM support parallel scan? */
+	bool		amcanparallel;
 	/* type of data stored in index, or InvalidOid if variable */
 	Oid			amkeytype;
 
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..e03d591 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -14,6 +14,7 @@
 #ifndef NODEINDEXSCAN_H
 #define NODEINDEXSCAN_H
 
+#include "access/parallel.h"
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
@@ -22,6 +23,9 @@ extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
 extern void ExecReScanIndexScan(IndexScanState *node);
+extern void ExecIndexScanEstimate(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeDSM(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc);
 
 /*
  * These routines are exported to share code with nodeIndexonlyscan.c and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d43ec56..7714a72 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1353,6 +1353,7 @@ typedef struct
  *		SortSupport		   for reordering ORDER BY exprs
  *		OrderByTypByVals   is the datatype of order by expression pass-by-value?
  *		OrderByTypLens	   typlens of the datatypes of order by expressions
+ *		pscan_len		   size of parallel index scan descriptor
  * ----------------
  */
 typedef struct IndexScanState
@@ -1379,6 +1380,9 @@ typedef struct IndexScanState
 	SortSupport iss_SortSupport;
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
+
+	/* This is needed for parallel index scan */
+	Size		iss_PscanLen;
 } IndexScanState;
 
 /* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 3a1255a..7e4b475 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -622,6 +622,7 @@ typedef struct IndexOptInfo
 	bool		amsearchnulls;	/* can AM search for NULL/NOT NULL entries? */
 	bool		amhasgettuple;	/* does AM have amgettuple interface? */
 	bool		amhasgetbitmap; /* does AM have amgetbitmap interface? */
+	bool		amcanparallel;	/* does AM support parallel scan? */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
 	void		(*amcostestimate) ();	/* AM's cost estimator */
 } IndexOptInfo;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 71d9154..206fe0a 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -47,7 +47,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count);
+				  double loop_count,
+				  int parallel_workers);
 extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						Path *bitmapqual,
#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Rahila Syed (#23)
Re: Parallel Index Scans

On Fri, Dec 23, 2016 at 5:48 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:

5. Comment for _bt_parallel_seize() says:
"False indicates that we have reached the end of scan for
current scankeys and for that we return block number as P_NONE."

What is the reason to check (blkno == P_NONE) after checking (status == false)
in _bt_first() (see code below)? If the comment is correct,
we'll never reach _bt_parallel_done().

+               blkno = _bt_parallel_seize(scan, &status);
+               if (status == false)
+               {
+                       BTScanPosInvalidate(so->currPos);
+                       return false;
+               }
+               else if (blkno == P_NONE)
+               {
+                       _bt_parallel_done(scan);
+                       BTScanPosInvalidate(so->currPos);
+                       return false;
+               }

The first time the master backend or a worker hits the last page (i.e. calls
this API), it will return P_NONE with status 'true'; after that, when any
worker tries to fetch the next page, it will get status 'false'. I think we
can expand the comment to explain this clearly. Let me know if you need more
clarification; I can explain it in detail.
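
Restated in code form, the intended caller-side handling looks roughly like
the following (an illustrative sketch only, modelled on the _bt_first() hunk
quoted above; continue_from_block() and start_first_page() are hypothetical
placeholders, not functions from the patch):

blkno = _bt_parallel_seize(scan, &status);
if (!status)
{
    /* another participant already saw the end for the current scankeys */
    BTScanPosInvalidate(so->currPos);
    return false;
}
else if (blkno == P_NONE)
{
    /* we are the first to hit the end: notify the others and stop */
    _bt_parallel_done(scan);
    BTScanPosInvalidate(so->currPos);
    return false;
}
else if (BlockNumberIsValid(blkno))
{
    /* a concrete page was handed to us; continue the scan from there */
    return continue_from_block(scan, blkno);
}
else
{
    /* InvalidBlockNumber with status true: the scan has only just begun */
    return start_first_page(scan);
}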

Probably this was confusing because we have not mentioned
that P_NONE can be returned even when status = TRUE and
not just when status is false.

I think the comment above the function can be modified as follows:

+ /*
+ * True indicates that the block number returned is either valid including P_NONE
+ * and scan is continued or block number is invalid and scan has just
+ * begun.

I think the modification (including P_NONE and scan is continued)
suggested by you can confuse the reader, because if the returned block
number is P_NONE, then we don't continue the scan. I have used
slightly different words in the patch I have just posted, please check
and see if that looks fine to you.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#27Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Amit Kapila (#25)
Re: Parallel Index Scans

27.12.2016 17:33, Amit Kapila:

On Fri, Dec 23, 2016 at 6:42 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

22.12.2016 07:19, Amit Kapila:

On Wed, Dec 21, 2016 at 8:46 PM, Anastasia Lubennikova
<lubennikovaav@gmail.com> wrote:

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

Hi, thank you for the patch.
Results are very promising. Do you see any drawbacks of this feature or
something that requires more testing?

I think you can focus on the handling of array scan keys for testing.
In general, one of my colleagues has shown interest in testing this
patch and I think he has tested as well but never posted his findings.
I will request him to share his findings and what kind of tests he has
done, if any.

Check please code related to buffer locking and pinning once again.
I got the warning. Here are the steps to reproduce it:
Except "autovacuum = off" config is default.

pgbench -i -s 100 test
pgbench -c 10 -T 120 test

SELECT count(aid) FROM pgbench_accounts
WHERE aid > 1000 AND aid < 900000 AND bid > 800 AND bid < 900;
WARNING: buffer refcount leak: [8297] (rel=base/12289/16459, blockNum=2469,
flags=0x93800000, refcount=1 1)
count

A similar problem occurred while testing the "parallel index only
scan" patch, and Rafia has included the fix in her patch [1], which
ideally should be included in this patch, so I have copied the fix
from her patch. Apart from that, I observed that a similar problem can
happen for backward scans, so I have fixed that as well.

I confirm that this problem is solved.

But I'm trying to find the worst cases for this feature. And I suppose we
should test parallel index scans with concurrent insertions: the more
parallel readers we have, the higher the concurrency.
I doubt that it can significantly decrease performance, because the number
of parallel readers is not that big,

I am not sure if such a test is meaningful for this patch because
parallelism is generally used for large data reads and in such cases
there are generally not many concurrent writes.

I didn't find any case of noticeable performance degradation,
so set status to "Ready for committer".
Thank you for this patch.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#28Robert Haas
robertmhaas@gmail.com
In reply to: Anastasia Lubennikova (#27)
Re: Parallel Index Scans

On Fri, Jan 13, 2017 at 9:28 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I didn't find any case of noticeable performance degradation,
so set status to "Ready for committer".

The very first hunk of doc changes looks like it makes the whitespace
totally wrong - surely it can't be right to have 0-space indents in C
code.

+ The <literal>index_size</> parameter indicate the size of generic parallel

indicate -> indicates
size of generic -> size of the generic

+ index-type-specific parallel information which will be stored immediatedly

Typo.

+   Initialize the parallel index scan state.  It will be used to initialize
+   index-type-specific parallel information which will be stored immediatedly
+   after generic parallel information required for parallel index scans.  The
+   required state information will be set in <literal>target</>.
+  </para>
+
+   <para>
+     The <function>aminitparallelscan</> and <function>amestimateparallelscan</>
+     functions need only be provided if the access method supports <quote>parallel</>
+     index scans.  If it doesn't, the <structfield>aminitparallelscan</> and
+     <structfield>amestimateparallelscan</> fields in its <structname>IndexAmRoutine</>
+     struct must be set to NULL.
+   </para>

Inconsistent indentation. <quote> seems like a strange choice of tag.

+ /* amestimateparallelscan is optional; assume no-op if not provided by AM */

The fact that amestimateparallelscan is optional even when parallel
index scans are supported is undocumented. Similarly for the other
functions, which also seem to be optional but not documented as such.
The code and documentation need to match.

+ void *amtarget = (char *) ((void *) target) + offset;

Passing an unaligned pointer to the AM sounds like a recipe for
crashes on obscure platforms that can't tolerate alignment violations,
and possibly bad performance on others. I'd arrange to MAXALIGN the size
of the generic structure in index_parallelscan_estimate and
index_parallelscan_initialize. Also, why pass the size of the generic
structure to the AM-specific estimate routine, anyway? It can't
legally return a smaller value, and we can add_size() just as well as
the AM-specific code. Wouldn't it make more sense for the AM-specific
code to return the amount of space that is needed for AM-specific
stuff, and let the generic code deal with the generic stuff?
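
In other words, something like the sketch below, where the generic code
MAXALIGNs its own portion of the shared descriptor and add_size()s whatever
the AM reports on top of it (illustration only; the parallel_index_scan_v4.patch
attached downthread implements essentially this in index_parallelscan_estimate):

Size
index_parallelscan_estimate(Relation indexrel, Snapshot snapshot)
{
    Size    nbytes;

    /* generic part: descriptor header plus the serialized snapshot */
    nbytes = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
                      EstimateSnapshotSpace(snapshot));
    nbytes = MAXALIGN(nbytes);  /* AM-specific area starts on an aligned boundary */

    /* AM-specific part; the AM reports only its own space, if any */
    if (indexrel->rd_amroutine->amestimateparallelscan != NULL)
        nbytes = add_size(nbytes,
                          indexrel->rd_amroutine->amestimateparallelscan());

    return nbytes;
}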

+ *    status - True indicates that the block number returned is valid and scan
+ *             is continued or block number is invalid and scan has just begun
+ *             or block number is P_NONE and scan is finished.  False indicates
+ *             that we have reached the end of scan for current scankeys
+ *             and for that we return block number as P_NONE.

It is hard to parse a sentence with that many "and" and "or" clauses,
especially since English, unlike C, does not have strict operator
precedence rules. Perhaps you could reword to make it more clear.
Also, does that survive pgindent with that indentation?

+    BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
+                                                (void *) scan->parallel_scan,
+                                             scan->parallel_scan->ps_offset);

You could avoid these uncomfortable line breaks by declaring the
variable on one line and the initializing it on a separate line.

+        SpinLockAcquire(&btscan->ps_mutex);
+        pageStatus = btscan->ps_pageStatus;
+        if (so->arrayKeyCount < btscan->ps_arrayKeyCount)
+            *status = false;
+        else if (pageStatus == BTPARALLEL_DONE)
+            *status = false;
+        else if (pageStatus != BTPARALLEL_ADVANCING)
+        {
+            btscan->ps_pageStatus = BTPARALLEL_ADVANCING;
+            nextPage = btscan->ps_nextPage;
+            exit_loop = true;
+        }
+        SpinLockRelease(&btscan->ps_mutex);

IMHO, this needs comments.

+ * It updates the value of next page that allows parallel scan to move forward
+ * or backward depending on scan direction.  It wakes up one of the sleeping
+ * workers.

This construction is commonly used in India but sounds awkward to
other English-speakers, or at least to me. You can either drop the
word "it" and just start with the verb "Updates the value of ..." or
you can replace the first instance of "It" with "This function".
Although actually, I think this whole comment needs rewriting. Maybe
something like "Save information about scan position and wake up next
worker to continue scan."

+ * This must be called when there are no pages left to scan. Notify end of
+ * parallel scan to all the other workers associated with this scan.

Suggest: When there are no pages left to scan, this function should be
called to notify other workers. Otherwise, they might wait forever
for the scan to advance to the next page.

+ if (status == false)

if (!status) is usually preferred for bools. (Multiple instances.)

+#define BTPARALLEL_NOT_INITIALIZED 0x01
+#define BTPARALLEL_ADVANCING 0x02
+#define BTPARALLEL_DONE 0x03
+#define BTPARALLEL_IDLE 0x04

Let's change this to an enum. We can keep the names of the members
as-is, just use typedef enum { ... } instead of #defines.
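
That is, roughly the following (the v4 patch attached downthread ends up
calling it PS_State):

typedef enum
{
    BTPARALLEL_NOT_INITIALIZED,
    BTPARALLEL_ADVANCING,
    BTPARALLEL_DONE,
    BTPARALLEL_IDLE
} PS_State;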

+#define OffsetToPointer(base, offset)\
+((void *)((char *)base + offset))

Blech. Aside from the bad formatting, this is an awfully generic
thing to stick into relscan.h. I'm not sure we should have it at all,
but certainly not in this file.

+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+    BlockNumber ps_nextPage;    /* latest or next page to be scanned */
+    uint8        ps_pageStatus;    /* indicates whether next page is available
+                                 * for scan. see nbtree.h for possible states
+                                 * of parallel scan. */
+    int            ps_arrayKeyCount;        /* count indicating number of array
+                                         * scan keys processed by parallel
+                                         * scan */
+    slock_t        ps_mutex;        /* protects above variables */
+    ConditionVariable cv;        /* used to synchronize parallel scan */
+}    BTParallelScanDescData;

Why are the states declared a separate header file from the variable
that uses them? Let's put them all in the same place.

Why do all of these fields except for the last one have a ps_ prefix,
but the last one doesn't?

I assume "ps" stands for "parallel scan" but maybe "btps" would be
better since this is btree-specific.

ps_nextPage sometimes contains something other than the next page, so
maybe we should choose a different name, like ps_currentPage or
ps_scanPage.

This is not a totally complete review - there are some things I have
deeper questions about and need to examine more closely - but let's
get the simple stuff tidied up first.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#29Amit Kapila
amit.kapila16@gmail.com
In reply to: Anastasia Lubennikova (#27)
Re: Parallel Index Scans

On Fri, Jan 13, 2017 at 7:58 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

27.12.2016 17:33, Amit Kapila:

A similar problem occurred while testing the "parallel index only
scan" patch, and Rafia has included the fix in her patch [1], which
ideally should be included in this patch, so I have copied the fix
from her patch. Apart from that, I observed that a similar problem can
happen for backward scans, so I have fixed that as well.

I confirm that this problem is solved.

But I'm trying to find the worst cases for this feature. And I suppose we
should test parallel index scans with concurrent insertions: the more
parallel readers we have, the higher the concurrency.
I doubt that it can significantly decrease performance, because the number
of parallel readers is not that big,

I am not sure if such a test is meaningful for this patch because
parallelism is generally used for large data reads and in such cases
there are generally not many concurrent writes.

I didn't find any case of noticeable performance degradation,
so set status to "Ready for committer".

Thank you for the review!

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#28)
2 attachment(s)
Re: Parallel Index Scans

On Fri, Jan 13, 2017 at 11:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 13, 2017 at 9:28 AM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

I didn't find any case of noticeable performance degradation,
so set status to "Ready for committer".

The very first hunk of doc changes looks like it makes the whitespace
totally wrong - surely it can't be right to have 0-space indents in C
code.

Fixed.

+ The <literal>index_size</> parameter indicate the size of generic parallel

indicate -> indicates
size of generic -> size of the generic

Fixed.

+ index-type-specific parallel information which will be stored immediatedly

Typo.

Fixed.

+   Initialize the parallel index scan state.  It will be used to initialize
+   index-type-specific parallel information which will be stored immediatedly
+   after generic parallel information required for parallel index scans.  The
+   required state information will be set in <literal>target</>.
+  </para>
+
+   <para>
+   The <function>aminitparallelscan</> and <function>amestimateparallelscan</>
+     functions need only be provided if the access method supports <quote>parallel</>
+     index scans.  If it doesn't, the <structfield>aminitparallelscan</> and
+     <structfield>amestimateparallelscan</> fields in its <structname>IndexAmRoutine</>
+     struct must be set to NULL.
+   </para>

Inconsistent indentation.

Fixed.

<quote> seems like a strange choice of tag.

I have seen that <quote> is used in indexam.sgml at multiple places to
refer to "bitmap" and "plain" index scans, so I thought of using the same
for "parallel" index scans.

+ /* amestimateparallelscan is optional; assume no-op if not provided by AM */

The fact that amestimateparallelscan is optional even when parallel
index scans are supported is undocumented.

Okay, I have added that information in docs.

Similarly for the other
functions, which also seem to be optional but not documented as such.
The code and documentation need to match.

All the functions introduced by this patch are documented in
indexam.sgml as optional. Not sure which other place you are
expecting an update.

+ void *amtarget = (char *) ((void *) target) + offset;

Passing an unaligned pointer to the AM sounds like a recipe for
crashes on obscure platforms that can't tolerate alignment violations,
and possibly bad performance on others. I'd arrange MAXALIGN the size
of the generic structure in index_parallelscan_estimate and
index_parallelscan_initialize.

Right, changed as per suggestion.

Also, why pass the size of the generic
structure to the AM-specific estimate routine, anyway? It can't
legally return a smaller value, and we can add_size() just as well as
the AM-specific code. Wouldn't it make more sense for the AM-specific
code to return the amount of space that is needed for AM-specific
stuff, and let the generic code deal with the generic stuff?

Makes sense, so changed accordingly.

+ *    status - True indicates that the block number returned is valid and scan
+ *             is continued or block number is invalid and scan has just begun
+ *             or block number is P_NONE and scan is finished.  False indicates
+ *             that we have reached the end of scan for current scankeys
+ *             and for that we return block number as P_NONE.

It is hard to parse a sentence with that many "and" and "or" clauses,
especially since English, unlike C, does not have strict operator
precedence rules. Perhaps you could reword to make it more clear.

Okay, I have changed the comment.

Also, does that survive pgindent with that indentation?

Yes.

+    BTParallelScanDesc btscan = (BTParallelScanDesc) OffsetToPointer(
+                                                (void *) scan->parallel_scan,
+                                             scan->parallel_scan->ps_offset);

You could avoid these uncomfortable line breaks by declaring the
variable on one line and the initializing it on a separate line.

Okay, changed.

+        SpinLockAcquire(&btscan->ps_mutex);
+        pageStatus = btscan->ps_pageStatus;
+        if (so->arrayKeyCount < btscan->ps_arrayKeyCount)
+            *status = false;
+        else if (pageStatus == BTPARALLEL_DONE)
+            *status = false;
+        else if (pageStatus != BTPARALLEL_ADVANCING)
+        {
+            btscan->ps_pageStatus = BTPARALLEL_ADVANCING;
+            nextPage = btscan->ps_nextPage;
+            exit_loop = true;
+        }
+        SpinLockRelease(&btscan->ps_mutex);

IMHO, this needs comments.

Sure, added a comment.

+ * It updates the value of next page that allows parallel scan to move forward
+ * or backward depending on scan direction.  It wakes up one of the sleeping
+ * workers.

This construction is commonly used in India but sounds awkward to
other English-speakers, or at least to me. You can either drop the
word "it" and just start with the verb "Updates the value of ..." or
you can replace the first instance of "It" with "This function".
Although actually, I think this whole comment needs rewriting. Maybe
something like "Save information about scan position and wake up next
worker to continue scan."

Changed as per suggestion.

+ * This must be called when there are no pages left to scan. Notify end of
+ * parallel scan to all the other workers associated with this scan.

Suggest: When there are no pages left to scan, this function should be
called to notify other workers. Otherwise, they might wait forever
for the scan to advance to the next page.

+ if (status == false)

if (!status) is usually preferred for bools. (Multiple instances.)

+#define BTPARALLEL_NOT_INITIALIZED 0x01
+#define BTPARALLEL_ADVANCING 0x02
+#define BTPARALLEL_DONE 0x03
+#define BTPARALLEL_IDLE 0x04

Let's change this to an enum. We can keep the names of the members
as-is, just use typedef enum { ... } instead of #defines.

Changed as per suggestion.

+#define OffsetToPointer(base, offset)\
+((void *)((char *)base + offset))

Blech. Aside from the bad formatting, this is an awfully generic
thing to stick into relscan.h.

Agreed and moved to c.h where some similar defines are present.

I'm not sure we should have it at all,
but certainly not in this file.

Yeah, but I think there is no harm in keeping it, and maybe we can start
using it in code at other places as well.
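
For reference, the shared-memory area the macro navigates is laid out roughly
as below (an illustration, not part of the patch): the generic descriptor comes
first, and ps_offset records where the MAXALIGN'd AM-specific area begins,
which is what the btree code recovers with OffsetToPointer().

/*
 *   +------------------------------+  <- start of the DSM chunk
 *   | ParallelIndexScanDescData    |     generic fields + serialized snapshot
 *   +------------------------------+  <- MAXALIGN'd boundary, stored in ps_offset
 *   | BTParallelScanDescData       |     btree-specific state (mutex, CV, ...)
 *   +------------------------------+
 */
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
                                              parallel_scan->ps_offset);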

+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+    BlockNumber ps_nextPage;    /* latest or next page to be scanned */
+    uint8        ps_pageStatus;    /* indicates whether next page is available
+                                 * for scan. see nbtree.h for possible states
+                                 * of parallel scan. */
+    int            ps_arrayKeyCount;        /* count indicating number of array
+                                         * scan keys processed by parallel
+                                         * scan */
+    slock_t        ps_mutex;        /* protects above variables */
+    ConditionVariable cv;        /* used to synchronize parallel scan */
+}    BTParallelScanDescData;

Why are the states declared a separate header file from the variable
that uses them? Let's put them all in the same place.

Agreed and changed accordingly.

Why do all of these fields except for the last one have a ps_ prefix,
but the last one doesn't?

No specific reason, so changed as per suggestion.

I assume "ps" stands for "parallel scan" but maybe "btps" would be
better since this is btree-specific.

Changed as per suggestion.

ps_nextPage sometimes contains something other than the next page, so
maybe we should choose a different name, like ps_currentPage or
ps_scanPage.

Changed as per suggestion.

I have also rebased the optimizer/executor support patch
(parallel_index_opt_exec_support_v4.patch) and added a test case in
it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_index_scan_v4.patch (application/octet-stream)
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 40f201b..c8957ce 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -124,8 +124,11 @@ typedef struct IndexAmRoutine
     amoptions_function amoptions;
     amproperty_function amproperty;     /* can be NULL */
     amvalidate_function amvalidate;
+    amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
+    aminitparallelscan_function aminitparallelscan;    /* can be NULL */
     ambeginscan_function ambeginscan;
     amrescan_function amrescan;
+    amparallelrescan_function amparallelrescan;    /* can be NULL */
     amgettuple_function amgettuple;     /* can be NULL */
     amgetbitmap_function amgetbitmap;   /* can be NULL */
     amendscan_function amendscan;
@@ -459,6 +462,35 @@ amvalidate (Oid opclassoid);
    invalid.  Problems should be reported with <function>ereport</> messages.
   </para>
 
+  <para>
+<programlisting>
+Size
+amestimateparallelscan (void);
+</programlisting>
+    Estimate and return the storage space required for parallel index scan.
+    The size of index-type-specific parallel information will be returned back
+    to caller.
+  </para>
+
+  <para>
+<programlisting>
+void
+aminitparallelscan (void *target);
+</programlisting>
+   Initialize the parallel index scan state.  It will be used to initialize
+   index-type-specific parallel information which will be stored immediately
+   after generic parallel information required for parallel index scans.  The
+   required state information will be set in <literal>target</>.
+  </para>
+
+  <para>
+   The <function>aminitparallelscan</> and <function>amestimateparallelscan</>
+   functions need only be provided if the access method supports <quote>parallel</>
+   index scans.  If it doesn't, the <structfield>aminitparallelscan</> and
+   <structfield>amestimateparallelscan</> fields in its <structname>IndexAmRoutine</>
+   struct must be set to NULL.  Note that these fields can be NULL even for
+   <quote>parallel</> index scans.
+  </para>
 
   <para>
    The purpose of an index, of course, is to support scans for tuples matching
@@ -511,6 +543,24 @@ amrescan (IndexScanDesc scan,
 
   <para>
 <programlisting>
+void
+amparallelrescan (IndexScanDesc scan);
+</programlisting>
+   Restart the parallel index scan.  It resets the parallel index scan state.
+   It must be called only during restart of scan which will be typically
+   required for the inner side of nest-loop join.
+  </para>
+
+  <para>
+   The <function>amparallelrescan</> function need only be provided if the
+   access method supports <quote>parallel</> index scans.  If it doesn't,
+   the <structfield>amparallelrescan</> field in its <structname>IndexAmRoutine</>
+   struct must be set to NULL.  Note that this field can be NULL even for
+   <quote>parallel</> index scans.
+  </para>
+
+  <para>
+<programlisting>
 boolean
 amgettuple (IndexScanDesc scan,
             ScanDirection direction);
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1545f03..42ef91f 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1199,7 +1199,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting in an extension.</entry>
         </row>
         <row>
-         <entry morerows="9"><literal>IPC</></entry>
+         <entry morerows="10"><literal>IPC</></entry>
          <entry><literal>BgWorkerShutdown</></entry>
          <entry>Waiting for background worker to shut down.</entry>
         </row>
@@ -1232,6 +1232,10 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting for parallel workers to finish computing.</entry>
         </row>
         <row>
+         <entry><literal>ParallelIndexScan</></entry>
+         <entry>Waiting for next block to be available for parallel index scan.</entry>
+        </row>
+        <row>
          <entry><literal>SafeSnapshot</></entry>
          <entry>Waiting for a snapshot for a <literal>READ ONLY DEFERRABLE</> transaction.</entry>
         </row>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3fce672..2ceb85a 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -104,8 +104,11 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = brinoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = brinvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = brinbeginscan;
 	amroutine->amrescan = brinrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = bringetbitmap;
 	amroutine->amendscan = brinendscan;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 3909638..a80039f 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,8 +61,11 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = ginoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = ginvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = ginbeginscan;
 	amroutine->amrescan = ginrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = gingetbitmap;
 	amroutine->amendscan = ginendscan;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 4092a8b..604bec5 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -81,8 +81,11 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = gistoptions;
 	amroutine->amproperty = gistproperty;
 	amroutine->amvalidate = gistvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = gistbeginscan;
 	amroutine->amrescan = gistrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = gistgettuple;
 	amroutine->amgetbitmap = gistgetbitmap;
 	amroutine->amendscan = gistendscan;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0cbf6b0..74074c6 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -78,8 +78,11 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = hashoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = hashvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = hashbeginscan;
 	amroutine->amrescan = hashrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = hashgettuple;
 	amroutine->amgetbitmap = hashgetbitmap;
 	amroutine->amendscan = hashendscan;
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index c4a393f..2920cc1 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -375,7 +375,7 @@ systable_beginscan(Relation heapRelation,
 		}
 
 		sysscan->iscan = index_beginscan(heapRelation, irel,
-										 snapshot, nkeys, 0);
+										 snapshot, NULL, nkeys, 0, false);
 		index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 		sysscan->scan = NULL;
 	}
@@ -577,7 +577,7 @@ systable_beginscan_ordered(Relation heapRelation,
 	}
 
 	sysscan->iscan = index_beginscan(heapRelation, indexRelation,
-									 snapshot, nkeys, 0);
+									 snapshot, NULL, nkeys, 0, false);
 	index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
 	sysscan->scan = NULL;
 
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4822af9..fc21605 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -207,6 +207,80 @@ index_insert(Relation indexRelation,
 }
 
 /*
+ * index_parallelscan_estimate - estimate storage for ParallelIndexScanDesc
+ *
+ * It calls am specific routine to obtain size of am specific shared
+ * information.
+ */
+Size
+index_parallelscan_estimate(Relation indexrel, Snapshot snapshot)
+{
+	Size		index_size = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
+									  EstimateSnapshotSpace(snapshot));
+
+	index_size = MAXALIGN(index_size);
+
+	/* amestimateparallelscan is optional; assume no-op if not provided by AM */
+	if (indexrel->rd_amroutine->amestimateparallelscan == NULL)
+		return index_size;
+	else
+	{
+		Size		amindex_size = indexrel->rd_amroutine->amestimateparallelscan();
+
+		return add_size(index_size, amindex_size);
+	}
+}
+
+/*
+ * index_parallelscan_initialize - initialize ParallelIndexScanDesc
+ *
+ * It calls access method specific initialization routine to initialize am
+ * specific information.  Call this just once in the leader process; then,
+ * individual workers attach via index_beginscan_parallel.
+ */
+void
+index_parallelscan_initialize(Relation heaprel, Relation indexrel,
+							  Snapshot snapshot, ParallelIndexScanDesc target)
+{
+	Size		offset = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
+								  EstimateSnapshotSpace(snapshot));
+	void	   *amtarget;
+
+	offset = MAXALIGN(offset);
+	amtarget = (char *) ((void *) target) + offset;
+
+	target->ps_relid = RelationGetRelid(heaprel);
+	target->ps_indexid = RelationGetRelid(indexrel);
+	target->ps_offset = offset;
+	SerializeSnapshot(snapshot, target->ps_snapshot_data);
+
+	/* aminitparallelscan is optional; assume no-op if not provided by AM */
+	if (indexrel->rd_amroutine->aminitparallelscan == NULL)
+		return;
+
+	indexrel->rd_amroutine->aminitparallelscan(amtarget);
+}
+
+/*
+ * index_beginscan_parallel - join parallel index scan
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
+						 int norderbys, ParallelIndexScanDesc pscan)
+{
+	Snapshot	snapshot;
+	IndexScanDesc scan;
+
+	Assert(RelationGetRelid(heaprel) == pscan->ps_relid);
+	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
+	RegisterSnapshot(snapshot);
+	scan = index_beginscan(heaprel, indexrel, snapshot, pscan, nkeys, norderbys, true);
+	return scan;
+}
+
+/*
  * index_beginscan - start a scan of an index with amgettuple
  *
  * Caller must be holding suitable locks on the heap and the index.
@@ -214,8 +288,8 @@ index_insert(Relation indexRelation,
 IndexScanDesc
 index_beginscan(Relation heapRelation,
 				Relation indexRelation,
-				Snapshot snapshot,
-				int nkeys, int norderbys)
+				Snapshot snapshot, ParallelIndexScanDesc pscan,
+				int nkeys, int norderbys, bool temp_snap)
 {
 	IndexScanDesc scan;
 
@@ -227,6 +301,8 @@ index_beginscan(Relation heapRelation,
 	 */
 	scan->heapRelation = heapRelation;
 	scan->xs_snapshot = snapshot;
+	scan->parallel_scan = pscan;
+	scan->xs_temp_snap = temp_snap;
 
 	return scan;
 }
@@ -251,6 +327,8 @@ index_beginscan_bitmap(Relation indexRelation,
 	 * up by RelationGetIndexScan.
 	 */
 	scan->xs_snapshot = snapshot;
+	scan->parallel_scan = NULL;
+	scan->xs_temp_snap = false;
 
 	return scan;
 }
@@ -319,6 +397,22 @@ index_rescan(IndexScanDesc scan,
 }
 
 /* ----------------
+ *		index_parallelrescan  - (re)start a parallel scan of an index
+ * ----------------
+ */
+void
+index_parallelrescan(IndexScanDesc scan)
+{
+	SCAN_CHECKS;
+
+	/* amparallelrescan is optional; assume no-op if not provided by AM */
+	if (scan->indexRelation->rd_amroutine->amparallelrescan == NULL)
+		return;
+
+	scan->indexRelation->rd_amroutine->amparallelrescan(scan);
+}
+
+/* ----------------
  *		index_endscan - end a scan
  * ----------------
  */
@@ -341,6 +435,9 @@ index_endscan(IndexScanDesc scan)
 	/* Release index refcount acquired by index_beginscan */
 	RelationDecrementReferenceCount(scan->indexRelation);
 
+	if (scan->xs_temp_snap)
+		UnregisterSnapshot(scan->xs_snapshot);
+
 	/* Release the scan data structure itself */
 	IndexScanEnd(scan);
 }
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c6eed63..5f5e126 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,8 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "pgstat.h"
+#include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -62,6 +64,46 @@ typedef struct
 	MemoryContext pagedelcontext;
 } BTVacState;
 
+/*
+ * Below flags are used to indicate the state of parallel scan.
+ *
+ * BTPARALLEL_NOT_INITIALIZED implies that the scan is not started
+ *
+ * BTPARALLEL_ADVANCING implies one of the worker or backend is advancing the
+ * scan to a new page; others must wait.
+ *
+ * BTPARALLEL_IDLE implies that no backend is advancing the scan; someone can
+ * start doing it
+ *
+ * BTPARALLEL_DONE implies that the scan is complete (including error exit)
+ */
+typedef enum
+{
+	BTPARALLEL_NOT_INITIALIZED,
+	BTPARALLEL_ADVANCING,
+	BTPARALLEL_DONE,
+	BTPARALLEL_IDLE
+} PS_State;
+
+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+	BlockNumber btps_scanPage;	/* latest or next page to be scanned */
+	PS_State	btps_pageStatus;/* indicates whether next page is available
+								 * for scan. see above for possible states of
+								 * parallel scan. */
+	int			btps_arrayKeyCount;		/* count indicating number of array
+										 * scan keys processed by parallel
+										 * scan */
+	slock_t		btps_mutex;		/* protects above variables */
+	ConditionVariable btps_cv;	/* used to synchronize parallel scan */
+} BTParallelScanDescData;
+
+typedef struct BTParallelScanDescData *BTParallelScanDesc;
+
 
 static void btbuildCallback(Relation index,
 				HeapTuple htup,
@@ -110,8 +152,11 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = btoptions;
 	amroutine->amproperty = btproperty;
 	amroutine->amvalidate = btvalidate;
+	amroutine->amestimateparallelscan = btestimateparallelscan;
+	amroutine->aminitparallelscan = btinitparallelscan;
 	amroutine->ambeginscan = btbeginscan;
 	amroutine->amrescan = btrescan;
+	amroutine->amparallelrescan = btparallelrescan;
 	amroutine->amgettuple = btgettuple;
 	amroutine->amgetbitmap = btgetbitmap;
 	amroutine->amendscan = btendscan;
@@ -467,6 +512,192 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 }
 
 /*
+ * btestimateparallelscan - estimate storage for BTParallelScanDescData
+ */
+Size
+btestimateparallelscan(void)
+{
+	return sizeof(BTParallelScanDescData);
+}
+
+/*
+ * btinitparallelscan - Initializing BTParallelScanDesc for parallel btree scan
+ */
+void
+btinitparallelscan(void *target)
+{
+	BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
+
+	SpinLockInit(&bt_target->btps_mutex);
+	bt_target->btps_scanPage = InvalidBlockNumber;
+	bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->btps_arrayKeyCount = 0;
+	ConditionVariableInit(&bt_target->btps_cv);
+}
+
+/*
+ * _bt_parallel_seize() -- returns the next block to be scanned for forward
+ *		scans and latest block scanned for backward scans.
+ *
+ * status - The value of status tells caller whether to continue the scan or
+ * not.  The true value of status indicates either one of the following (a)
+ * the block number returned is valid and the scan can be continued (b) the
+ * block number is invalid and the scan has just begun (c) the block number
+ * is P_NONE and the scan is finished.  The false value indicates that we
+ * have reached the end of scan for current scankeys and for that we return
+ * block number as P_NONE.
+ *
+ * The first time master backend or worker hits last page, it will return
+ * P_NONE and status as 'True', after that any worker tries to fetch next
+ * page, it will return status as 'False'.
+ *
+ * Callers ignore the return value, if the status is false.
+ */
+BlockNumber
+_bt_parallel_seize(IndexScanDesc scan, bool *status)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	PS_State	pageStatus;
+	bool		exit_loop = false;
+	BlockNumber nextPage = InvalidBlockNumber;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	*status = true;
+	while (1)
+	{
+		/*
+		 * Fetch the next block to scan and update the page status so that
+		 * other participants of parallel scan can wait till next page is
+		 * available for scan.  Set the status as false, if scan is finished.
+		 */
+		SpinLockAcquire(&btscan->btps_mutex);
+		pageStatus = btscan->btps_pageStatus;
+
+		/* Check if the scan for current scan keys is finished */
+		if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+			*status = false;
+		else if (pageStatus == BTPARALLEL_DONE)
+			*status = false;
+		else if (pageStatus != BTPARALLEL_ADVANCING)
+		{
+			btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
+			nextPage = btscan->btps_scanPage;
+			exit_loop = true;
+		}
+		SpinLockRelease(&btscan->btps_mutex);
+		if (exit_loop || !*status)
+			break;
+		ConditionVariableSleep(&btscan->btps_cv, WAIT_EVENT_PARALLEL_INDEX_SCAN);
+	}
+	ConditionVariableCancelSleep();
+
+	/* no more pages to scan */
+	if (!*status)
+		return P_NONE;
+
+	*status = true;
+	return nextPage;
+}
+
+/*
+ * _bt_parallel_release() -- Advances the parallel scan to allow scan of next
+ *		page
+ *
+ * Save information about scan position and wake up next worker to continue
+ * scan.
+ *
+ * For backward scan, scan_page holds the latest page being scanned.
+ * For forward scan, scan_page holds the next page to be scanned.
+ */
+void
+_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+{
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = scan_page;
+	btscan->btps_pageStatus = BTPARALLEL_IDLE;
+	SpinLockRelease(&btscan->btps_mutex);
+	ConditionVariableSignal(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_done() -- Finishes the parallel scan
+ *
+ * When there are no pages left to scan, this function should be called to
+ * notify other workers.  Otherwise, they might wait forever for the scan to
+ * advance to the next page.
+ */
+void
+_bt_parallel_done(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+	bool		status_changed = false;
+
+	/* Do nothing, for non-parallel scans */
+	if (parallel_scan == NULL)
+		return;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * Take care to mark the parallel scan as done no more than once per
+	 * scan.  We rely on this state to initiate the next scan when there
+	 * are multiple array keys; see _bt_advance_array_keys.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+		btscan->btps_pageStatus != BTPARALLEL_DONE)
+	{
+		btscan->btps_pageStatus = BTPARALLEL_DONE;
+		status_changed = true;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+
+	/* wake up all the workers associated with this parallel scan */
+	if (status_changed)
+		ConditionVariableBroadcast(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_advance_scan() -- Advance the parallel scan
+ *
+ * Updates the count of array keys processed for both the local and the
+ * shared (parallel) scan state.
+ */
+void
+_bt_parallel_advance_scan(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	so->arrayKeyCount++;
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (btscan->btps_pageStatus == BTPARALLEL_DONE)
+	{
+		btscan->btps_scanPage = InvalidBlockNumber;
+		btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		btscan->btps_arrayKeyCount++;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
  *	btrescan() -- rescan an index relation
  */
 void
@@ -486,6 +717,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	}
 
 	so->markItemIndex = -1;
+	so->arrayKeyCount = 0;
 	BTScanPosUnpinIfPinned(so->markPos);
 	BTScanPosInvalidate(so->markPos);
 
@@ -526,6 +758,33 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 }
 
 /*
+ *	btparallelrescan() -- reset parallel scan
+ */
+void
+btparallelrescan(IndexScanDesc scan)
+{
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+
+	if (parallel_scan)
+	{
+
+		BTParallelScanDesc btscan;
+
+		btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												   parallel_scan->ps_offset);
+
+		/*
+		 * In principle we don't need to acquire the spinlock here, but we do
+		 * so to stay consistent with heap_rescan.
+		 */
+		SpinLockAcquire(&btscan->btps_mutex);
+		btscan->btps_scanPage = InvalidBlockNumber;
+		btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		SpinLockRelease(&btscan->btps_mutex);
+	}
+}
+
+/*
  *	btendscan() -- close down a scan
  */
 void
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 4fba75a..b565e09 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,9 +30,12 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
 
 /*
@@ -544,8 +547,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
+	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	BlockNumber blkno;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -564,6 +569,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (!so->qual_ok)
 		return false;
 
+	/*
+	 * For parallel scans, get the page to scan from the shared state.  If
+	 * the scan has not started yet, proceed to find the first leaf page to
+	 * scan, keeping the other workers waiting until we have descended to the
+	 * appropriate leaf page.
+	 *
+	 * If the scan has already begun, skip the descent and directly scan the
+	 * page stored in the shared structure, or the page to its left in the
+	 * case of a backward scan.
+	 */
+	if (scan->parallel_scan != NULL)
+	{
+		blkno = _bt_parallel_seize(scan, &status);
+		if (!status)
+		{
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno == P_NONE)
+		{
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno != InvalidBlockNumber)
+		{
+			if (!_bt_parallel_readpage(scan, blkno, dir))
+				return false;
+			goto readcomplete;
+		}
+	}
+
 	/*----------
 	 * Examine the scan keys to discover where we need to start the scan.
 	 *
@@ -743,7 +780,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * there.
 	 */
 	if (keysCount == 0)
-		return _bt_endpoint(scan, dir);
+	{
+		bool		match;
+
+		match = _bt_endpoint(scan, dir);
+		if (!match)
+		{
+			/* No match, so indicate that the (parallel) scan is finished */
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+		}
+
+		return match;
+	}
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -993,25 +1042,21 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * because nothing finer to lock exists.
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
+
+		/*
+		 * mark parallel scan as done, so that all the workers can finish
+		 * their scan
+		 */
+		_bt_parallel_done(scan);
+		BTScanPosInvalidate(so->currPos);
+
 		return false;
 	}
 	else
 		PredicateLockPage(rel, BufferGetBlockNumber(buf),
 						  scan->xs_snapshot);
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	Assert(so->markItemIndex == -1);
+	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
@@ -1060,6 +1105,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
+readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
@@ -1154,6 +1200,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	page = BufferGetPage(so->currPos.buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/* allow the next page to be processed by a parallel worker */
+	if (scan->parallel_scan)
+	{
+		if (ScanDirectionIsForward(dir))
+			_bt_parallel_release(scan, opaque->btpo_next);
+		else
+			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+	}
+
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1278,21 +1334,16 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
  * if pinned, we'll drop the pin before moving to next page.  The buffer is
  * not locked on entry.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page.  For success on a scan using a non-MVCC snapshot we hold
- * a pin, but not a read lock, on that page.  If we do not hold the pin, we
- * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
- *
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
+ * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
+ * to InvalidBuffer.  We return TRUE to indicate success.
  */
 static bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Relation	rel;
-	Page		page;
-	BTPageOpaque opaque;
+	BlockNumber blkno = InvalidBlockNumber;
+	bool		status = true;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1319,13 +1370,27 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}
 
-	rel = scan->indexRelation;
-
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
-		/* We must rely on the previously saved nextPage link! */
-		BlockNumber blkno = so->currPos.nextPage;
+
+		/*
+		 * We must rely on the previously saved nextPage link for non-parallel
+		 * scans!
+		 */
+		if (scan->parallel_scan != NULL)
+		{
+			blkno = _bt_parallel_seize(scan, &status);
+			if (!status)
+			{
+				/* release the previous buffer, if pinned */
+				BTScanPosUnpinIfPinned(so->currPos);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+			blkno = so->currPos.nextPage;
 
 		/* Remember we left a page with data */
 		so->currPos.moreLeft = true;
@@ -1333,11 +1398,68 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		/* release the previous buffer, if pinned */
 		BTScanPosUnpinIfPinned(so->currPos);
 
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+	else
+	{
+		/* Remember we left a page with data */
+		so->currPos.moreRight = true;
+
+		/* For parallel scans, get the last page scanned */
+		if (scan->parallel_scan != NULL)
+		{
+			blkno = _bt_parallel_seize(scan, &status);
+			BTScanPosUnpinIfPinned(so->currPos);
+			if (!status)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+
+	/* Drop the lock, and maybe the pin, on the current page */
+	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+	return true;
+}
+
+/*
+ *	_bt_readnextpage() -- Read next page containing valid data for scan
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page.  The caller is responsible for releasing the lock and
+ * the pin on the buffer on success.  We return TRUE to indicate success.
+ *
+ * If there are no more matching records in the given direction, we drop all
+ * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation	rel;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		status = true;
+
+	rel = scan->indexRelation;
+
+	if (ScanDirectionIsForward(dir))
+	{
 		for (;;)
 		{
-			/* if we're at end of scan, give up */
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
 			if (blkno == P_NONE || !so->currPos.moreRight)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1345,10 +1467,10 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			CHECK_FOR_INTERRUPTS();
 			/* step right one page */
 			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
-			/* check for deleted page */
 			page = BufferGetPage(so->currPos.buf);
 			TestForOldSnapshot(scan->xs_snapshot, rel, page);
 			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			/* check for deleted page */
 			if (!P_IGNORE(opaque))
 			{
 				PredicateLockPage(rel, blkno, scan->xs_snapshot);
@@ -1359,14 +1481,30 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			}
 
 			/* nope, keep going */
-			blkno = opaque->btpo_next;
+			if (scan->parallel_scan != NULL)
+			{
+				blkno = _bt_parallel_seize(scan, &status);
+				if (!status)
+				{
+					_bt_relbuf(rel, so->currPos.buf);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+			}
+			else
+				blkno = opaque->btpo_next;
 			_bt_relbuf(rel, so->currPos.buf);
 		}
 	}
 	else
 	{
-		/* Remember we left a page with data */
-		so->currPos.moreRight = true;
+		/*
+		 * For parallel scans, the current block number has to be retrieved
+		 * from the shared state, and it is the caller's responsibility to
+		 * pass the correct block number.
+		 */
+		if (blkno != InvalidBlockNumber)
+			so->currPos.currPage = blkno;
 
 		/*
 		 * Walk left to the next page with data.  This is much more complex
@@ -1401,6 +1539,12 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			if (!so->currPos.moreLeft)
 			{
 				_bt_relbuf(rel, so->currPos.buf);
+
+				/*
+				 * mark parallel scan as done, so that all the workers can
+				 * finish their scan
+				 */
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1412,6 +1556,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			/* if we're physically at end of index, return failure */
 			if (so->currPos.buf == InvalidBuffer)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1432,9 +1577,48 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
 					break;
 			}
+
+			/*
+			 * For parallel scans, get the last page scanned, because by the
+			 * time we try to fetch the previous page, another worker may
+			 * already have decided to scan that same page.  We could avoid
+			 * this by calling _bt_parallel_release only after reading the
+			 * current page, but it seems bad to make the other workers wait
+			 * until we have read it.
+			 */
+			if (scan->parallel_scan != NULL)
+			{
+				_bt_relbuf(rel, so->currPos.buf);
+				blkno = _bt_parallel_seize(scan, &status);
+				if (!status)
+				{
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+				so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			}
 		}
 	}
 
+	return true;
+}
+
+/*
+ *	_bt_parallel_readpage() -- Read current page containing valid data for scan
+ *
+ * On success, we release the lock, and possibly the pin, on the buffer.  We
+ * return TRUE to indicate success.
+ */
+static bool
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	_bt_initialize_more_data(so, dir);
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
 	/* Drop the lock, and maybe the pin, on the current page */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
@@ -1712,19 +1896,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/* remember which buffer we have pinned */
 	so->currPos.buf = buf;
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
+	_bt_initialize_more_data(so, dir);
 
 	/*
 	 * Now load data from the first page of the scan.
@@ -1753,3 +1925,25 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 
 	return true;
 }
+
+/*
+ * _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
+ * for scan direction
+ */
+static inline void
+_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
+{
+	/* initialize moreLeft/moreRight appropriately for scan direction */
+	if (ScanDirectionIsForward(dir))
+	{
+		so->currPos.moreLeft = false;
+		so->currPos.moreRight = true;
+	}
+	else
+	{
+		so->currPos.moreLeft = true;
+		so->currPos.moreRight = false;
+	}
+	so->numKilled = 0;			/* just paranoia */
+	so->markItemIndex = -1;		/* ditto */
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index da0f330..692ced4 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -590,6 +590,10 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 			break;
 	}
 
+	/* advance parallel scan */
+	if (scan->parallel_scan != NULL)
+		_bt_parallel_advance_scan(scan);
+
 	return found;
 }
 
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index b9e4940..6b60abf 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -60,8 +60,11 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = spgoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = spgvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = spgbeginscan;
 	amroutine->amrescan = spgrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = spggettuple;
 	amroutine->amgetbitmap = spggetbitmap;
 	amroutine->amendscan = spgendscan;
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index f9309fc..e7d2988 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -903,7 +903,7 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	if (OldIndex != NULL && !use_sort)
 	{
 		heapScan = NULL;
-		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+		indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, NULL, 0, 0, false);
 		index_rescan(indexScan, NULL, 0, NULL, 0);
 	}
 	else
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 8d119f6..1f17f35 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -722,7 +722,7 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
 retry:
 	conflict = false;
 	found_self = false;
-	index_scan = index_beginscan(heap, index, &DirtySnapshot, index_natts, 0);
+	index_scan = index_beginscan(heap, index, &DirtySnapshot, NULL, index_natts, 0, false);
 	index_rescan(index_scan, scankeys, index_natts, NULL, 0);
 
 	while ((tup = index_getnext(index_scan,
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 4274e9a..7ae3500 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -26,6 +26,7 @@
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
 #include "utils/memutils.h"
+#include "access/relscan.h"
 
 
 /* ----------------------------------------------------------------
@@ -48,6 +49,7 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	 * extract necessary information from index scan node
 	 */
 	scandesc = node->biss_ScanDesc;
+	scandesc->parallel_scan = NULL;
 
 	/*
 	 * If we have runtime keys and they've not already been set up, do it now.
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index ddef3a4..55c27a8 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -540,9 +540,9 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	 */
 	indexstate->ioss_ScanDesc = index_beginscan(currentRelation,
 												indexstate->ioss_RelationDesc,
-												estate->es_snapshot,
+												estate->es_snapshot, NULL,
 												indexstate->ioss_NumScanKeys,
-											indexstate->ioss_NumOrderByKeys);
+									 indexstate->ioss_NumOrderByKeys, false);
 
 	/* Set it up for index-only scan */
 	indexstate->ioss_ScanDesc->xs_want_itup = true;
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 97a6fac..c28a456 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -1022,9 +1022,9 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	 */
 	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
 											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot,
+											   estate->es_snapshot, NULL,
 											   indexstate->iss_NumScanKeys,
-											 indexstate->iss_NumOrderByKeys);
+									  indexstate->iss_NumOrderByKeys, false);
 
 	/*
 	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f37a0bf..bc89a07 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3386,6 +3386,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PARALLEL_FINISH:
 			event_name = "ParallelFinish";
 			break;
+		case WAIT_EVENT_PARALLEL_INDEX_SCAN:
+			event_name = "ParallelIndexScan";
+			break;
 		case WAIT_EVENT_SAFE_SNAPSHOT:
 			event_name = "SafeSnapshot";
 			break;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 301dffa..40657ea 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -5143,8 +5143,8 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 				 * extreme value has been deleted; that case motivates not
 				 * using SnapshotAny here.
 				 */
-				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty,
-											 1, 0);
+				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty, NULL,
+											 1, 0, false);
 				index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 				/* Fetch first tuple in sortop's direction */
@@ -5175,8 +5175,8 @@ get_actual_variable_range(PlannerInfo *root, VariableStatData *vardata,
 			/* If max is requested, and we didn't find the index is empty */
 			if (max && have_data)
 			{
-				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty,
-											 1, 0);
+				index_scan = index_beginscan(heapRel, indexRel, &SnapshotDirty, NULL,
+											 1, 0, false);
 				index_rescan(index_scan, scankeys, 1, NULL, 0);
 
 				/* Fetch first tuple in reverse direction */
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 48b0cc0..d841170 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -108,6 +108,12 @@ typedef bool (*amproperty_function) (Oid index_oid, int attno,
 /* validate definition of an opclass for this AM */
 typedef bool (*amvalidate_function) (Oid opclassoid);
 
+/* estimate size of parallel scan descriptor */
+typedef Size (*amestimateparallelscan_function) (void);
+
+/* prepare for parallel index scan */
+typedef void (*aminitparallelscan_function) (void *target);
+
 /* prepare for index scan */
 typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
 														   int nkeys,
@@ -120,6 +126,9 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 											   ScanKey orderbys,
 											   int norderbys);
 
+/* (re)start parallel index scan */
+typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+
 /* next valid tuple */
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 												 ScanDirection direction);
@@ -189,8 +198,11 @@ typedef struct IndexAmRoutine
 	amoptions_function amoptions;
 	amproperty_function amproperty;		/* can be NULL */
 	amvalidate_function amvalidate;
+	amestimateparallelscan_function amestimateparallelscan;		/* can be NULL */
+	aminitparallelscan_function aminitparallelscan;		/* can be NULL */
 	ambeginscan_function ambeginscan;
 	amrescan_function amrescan;
+	amparallelrescan_function amparallelrescan; /* can be NULL */
 	amgettuple_function amgettuple;		/* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index b2e078a..0f74a7f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -83,6 +83,8 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 typedef struct IndexScanDescData *IndexScanDesc;
 typedef struct SysScanDescData *SysScanDesc;
 
+typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
+
 /*
  * Enumeration specifying the type of uniqueness check to perform in
  * index_insert().
@@ -131,16 +133,23 @@ extern bool index_insert(Relation indexRelation,
 			 Relation heapRelation,
 			 IndexUniqueCheck checkUnique);
 
+extern Size index_parallelscan_estimate(Relation indexrel, Snapshot snapshot);
+extern void index_parallelscan_initialize(Relation heaprel, Relation indexrel, Snapshot snapshot, ParallelIndexScanDesc target);
+extern IndexScanDesc index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
+						 int norderbys, ParallelIndexScanDesc pscan);
+
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
-				int nkeys, int norderbys);
+				ParallelIndexScanDesc pscan,
+				int nkeys, int norderbys, bool temp_snap);
 extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 					   Snapshot snapshot,
 					   int nkeys);
 extern void index_rescan(IndexScanDesc scan,
 			 ScanKey keys, int nkeys,
 			 ScanKey orderbys, int norderbys);
+extern void index_parallelrescan(IndexScanDesc scan);
 extern void index_endscan(IndexScanDesc scan);
 extern void index_markpos(IndexScanDesc scan);
 extern void index_restrpos(IndexScanDesc scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 181f3ac..a277d4c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -609,6 +609,8 @@ typedef struct BTScanOpaqueData
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
 								 * there are any unsatisfiable array keys) */
+	int			arrayKeyCount;	/* count indicating number of array scan keys
+								 * processed */
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
@@ -652,7 +654,8 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
 /*
- * prototypes for functions in nbtree.c (external entry points for btree)
+ * prototypes for functions in nbtree.c (external entry points for btree and
+ * functions to maintain state of parallel scan)
  */
 extern Datum bthandler(PG_FUNCTION_ARGS);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
@@ -662,10 +665,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 		 ItemPointer ht_ctid, Relation heapRel,
 		 IndexUniqueCheck checkUnique);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern Size btestimateparallelscan(void);
+extern void btinitparallelscan(void *target);
+extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+extern void _bt_parallel_done(IndexScanDesc scan);
+extern void _bt_parallel_advance_scan(IndexScanDesc scan);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		 ScanKey orderbys, int norderbys);
+extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8746045..793c46b 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -126,8 +126,20 @@ typedef struct IndexScanDescData
 
 	/* state data for traversing HOT chains in index_getnext */
 	bool		xs_continue_hot;	/* T if must keep walking HOT chain */
+	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
+	ParallelIndexScanDesc parallel_scan;		/* parallel index scan
+												 * information */
 }	IndexScanDescData;
 
+/* Generic structure for parallel scans */
+typedef struct ParallelIndexScanDescData
+{
+	Oid			ps_relid;
+	Oid			ps_indexid;
+	Size		ps_offset;		/* Offset in bytes of am specific structure */
+	char		ps_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+} ParallelIndexScanDescData;
+
 /* Struct for heap-or-index scans of system tables */
 typedef struct SysScanDescData
 {
diff --git a/src/include/c.h b/src/include/c.h
index efbb77f..a2c043a 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -527,6 +527,9 @@ typedef NameData *Name;
 #define PointerIsAligned(pointer, type) \
 		(((uintptr_t)(pointer) % (sizeof (type))) == 0)
 
+#define OffsetToPointer(base, offset) \
+		((void *)((char *) base + offset))
+
 #define OidIsValid(objectId)  ((bool) ((objectId) != InvalidOid))
 
 #define RegProcedureIsValid(p)	OidIsValid(p)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5b37894..de387d7 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -784,6 +784,7 @@ typedef enum
 	WAIT_EVENT_MQ_RECEIVE,
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
+	WAIT_EVENT_PARALLEL_INDEX_SCAN,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP
 } WaitEventIPC;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 993880d..9563fdd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -161,6 +161,8 @@ BTPageOpaque
 BTPageOpaqueData
 BTPageStat
 BTPageState
+BTParallelScanDesc
+BTParallelScanDescData
 BTScanOpaque
 BTScanOpaqueData
 BTScanPos
@@ -1264,6 +1266,8 @@ OverrideSearchPath
 OverrideStackEntry
 PACE_HEADER
 PACL
+ParallelIndexScanDesc
+ParallelIndexScanDescData
 PATH
 PBOOL
 PCtxtHandle
@@ -1433,6 +1437,7 @@ PSQL_COMP_CASE
 PSQL_ECHO
 PSQL_ECHO_HIDDEN
 PSQL_ERROR_ROLLBACK
+PS_State
 PTOKEN_GROUPS
 PTOKEN_USER
 PUTENVPROC
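
To make the shared-state API added in nbtree.c above a little more concrete,
here is a rough sketch of how a forward-scan participant is expected to drive
_bt_parallel_seize, _bt_parallel_release and _bt_parallel_done.  This is an
illustration only (the function name is made up, and error handling, array
keys, P_IGNORE pages, backward scans, and the scan-not-yet-started case in
which _bt_parallel_seize can return InvalidBlockNumber are all omitted); the
real logic lives in _bt_first and _bt_readnextpage in the patch.

static void
parallel_btscan_participant_sketch(IndexScanDesc scan)
{
	Relation	rel = scan->indexRelation;
	bool		status;
	BlockNumber blkno;

	for (;;)
	{
		/* Claim the next leaf block, or sleep until one is published. */
		blkno = _bt_parallel_seize(scan, &status);
		if (!status)
			break;				/* some backend already finished the scan */
		if (blkno == P_NONE)
		{
			/* no pages left; mark the scan done and wake the others */
			_bt_parallel_done(scan);
			break;
		}

		{
			Buffer		buf = _bt_getbuf(rel, blkno, BT_READ);
			Page		page = BufferGetPage(buf);
			BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

			/*
			 * Publish the right link immediately, so that the next waiting
			 * worker can advance while we are still processing this page's
			 * tuples.
			 */
			_bt_parallel_release(scan, opaque->btpo_next);

			/* ... process the matching tuples on this page here ... */

			_bt_relbuf(rel, buf);
		}
	}
}
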
Attachment: parallel_index_opt_exec_support_v4.patch (application/octet-stream)
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index c8957ce..3afb496 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -110,6 +110,8 @@ typedef struct IndexAmRoutine
     bool        amclusterable;
     /* does AM handle predicate locks? */
     bool        ampredlocks;
+    /* does AM support parallel scan? */
+    bool        amcanparallel;
     /* type of data stored in index, or InvalidOid if variable */
     Oid         amkeytype;
 
diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index 5d4bb21..cfc3435 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -268,15 +268,28 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
   <title>Parallel Scans</title>
 
   <para>
-    Currently, the only type of scan which has been modified to work with
-    parallel query is a sequential scan.  Therefore, the driving table in
-    a parallel plan will always be scanned using a
-    <literal>Parallel Seq Scan</>.  The relation's blocks will be divided
-    among the cooperating processes.  Blocks are handed out one at a
-    time, so that access to the relation remains sequential.  Each process
-    will visit every tuple on the page assigned to it before requesting a new
-    page.
+    Currently, the types of scan that work with parallel query are sequential
+    scans and index scans.
   </para>
+   
+  <para>
+    In <literal>Parallel Sequential Scans</>, the driving table in a parallel plan
+    will always be scanned using a <literal>Parallel Seq Scan</>.  The relation's
+    blocks will be divided among the cooperating processes.  Blocks are handed out
+    one at a time, so that access to the relation remains sequential.  Each process
+    will visit every tuple on the page assigned to it before requesting a new page.
+  </para>
+
+  <para>
+    In <literal>Parallel Index Scans</>, the driving table in a parallel plan is
+    scanned using a <literal>Parallel Index Scan</>.  Currently, the only type of
+    index which has been modified to work with parallel query is <literal>btree</>,
+    and the parallelism happens at the leaf level of the btree.  The first backend
+    (either the master or a worker) to start a scan descends to the leaf level while
+    the others wait until it reaches it.  At the leaf level, blocks are handed out
+    one at a time, similar to <literal>Parallel Seq Scan</>, until all the blocks
+    have been processed or the scan reaches its end point.
+  </para>
  </sect2>
 
  <sect2 id="parallel-joins">
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2ceb85a..82578bf 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -92,6 +92,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index a80039f..f6898ef 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 604bec5..09b3a9d 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -69,6 +69,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 74074c6..cfd2bb3 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -66,6 +66,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = INT4OID;
 
 	amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 5f5e126..9f1d4a2 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -19,6 +19,7 @@
 #include "postgres.h"
 
 #include "access/nbtree.h"
+#include "access/parallel.h"
 #include "access/relscan.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
@@ -140,6 +141,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
+	amroutine->amcanparallel = true;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 6b60abf..4c3de2d 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -48,6 +48,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = spgbuild;
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 3ea3697..3bd7913 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -60,6 +60,7 @@
 
 static bool TargetListSupportsBackwardScan(List *targetlist);
 static bool IndexSupportsBackwardScan(Oid indexid);
+static bool GatherSupportsBackwardScan(Plan *node);
 
 
 /*
@@ -485,7 +486,7 @@ ExecSupportsBackwardScan(Plan *node)
 			return false;
 
 		case T_Gather:
-			return false;
+			return GatherSupportsBackwardScan(node);
 
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
@@ -566,6 +567,25 @@ IndexSupportsBackwardScan(Oid indexid)
 }
 
 /*
+ * GatherSupportsBackwardScan - does a gather plan support a backward scan?
+ *
+ * Returns true if the outer plan node of the Gather supports a backward
+ * scan.  As of now, we support a backward scan only if the outer node of
+ * the Gather is an index scan node.
+ */
+static bool
+GatherSupportsBackwardScan(Plan *node)
+{
+	Plan	   *outer_node = outerPlan(node);
+
+	if (nodeTag(outer_node) == T_IndexScan)
+		return IndexSupportsBackwardScan(((IndexScan *) outer_node)->indexid) &&
+			TargetListSupportsBackwardScan(outer_node->targetlist);
+	else
+		return false;
+}
+
+/*
  * ExecMaterializesOutput - does a plan type materialize its output?
  *
  * Returns true if the plan node type is one that automatically materializes
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e01fe6d..cb91cc3 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -28,6 +28,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeSeqscan.h"
+#include "executor/nodeIndexscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -197,6 +198,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 				ExecSeqScanEstimate((SeqScanState *) planstate,
 									e->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanEstimate((IndexScanState *) planstate,
+									  e->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanEstimate((ForeignScanState *) planstate,
 										e->pcxt);
@@ -249,6 +254,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
 										 d->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeDSM((IndexScanState *) planstate,
+										   d->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
 											 d->pcxt);
@@ -725,6 +734,9 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 			case T_SeqScanState:
 				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeWorker((IndexScanState *) planstate, toc);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
 												toc);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index c28a456..b7399f4 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -22,6 +22,9 @@
  *		ExecEndIndexScan		releases all storage.
  *		ExecIndexMarkPos		marks scan position.
  *		ExecIndexRestrPos		restores scan position.
+ *		ExecIndexScanEstimate	estimates DSM space needed for parallel index scan
+ *		ExecIndexScanInitializeDSM	initializes DSM for parallel index scan
+ *		ExecIndexScanInitializeWorker	attaches to DSM info in parallel worker
  */
 #include "postgres.h"
 
@@ -515,6 +518,15 @@ ExecIndexScan(IndexScanState *node)
 void
 ExecReScanIndexScan(IndexScanState *node)
 {
+	bool		reset_parallel_scan = true;
+
+	/*
+	 * If we are here just to update the scan keys, then don't reset the
+	 * parallel scan.
+	 */
+	if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+		reset_parallel_scan = false;
+
 	/*
 	 * If we are doing runtime key calculations (ie, any of the index key
 	 * values weren't simple Consts), compute the new key values.  But first,
@@ -540,10 +552,16 @@ ExecReScanIndexScan(IndexScanState *node)
 			reorderqueue_pop(node);
 	}
 
-	/* reset index scan */
-	index_rescan(node->iss_ScanDesc,
-				 node->iss_ScanKeys, node->iss_NumScanKeys,
-				 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+	/* reset (parallel) index scan */
+	if (node->iss_ScanDesc)
+	{
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		if (reset_parallel_scan)
+			index_parallelrescan(node->iss_ScanDesc);
+	}
 	node->iss_ReachedEnd = false;
 
 	ExecScanReScan(&node->ss);
@@ -1018,22 +1036,29 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Initialize scan descriptor.
+	 * For a parallel-aware node, we initialize the scan descriptor after
+	 * initializing the shared memory for parallel execution.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
-											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot, NULL,
-											   indexstate->iss_NumScanKeys,
+	if (!node->scan.plan.parallel_aware)
+	{
+		/*
+		 * Initialize scan descriptor.
+		 */
+		indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+												indexstate->iss_RelationDesc,
+												   estate->es_snapshot, NULL,
+												 indexstate->iss_NumScanKeys,
 									  indexstate->iss_NumOrderByKeys, false);
 
-	/*
-	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
-	 * index AM.
-	 */
-	if (indexstate->iss_NumRuntimeKeys == 0)
-		index_rescan(indexstate->iss_ScanDesc,
-					 indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
+		/*
+		 * If no run-time keys to calculate, go ahead and pass the scankeys to
+		 * the index AM.
+		 */
+		if (indexstate->iss_NumRuntimeKeys == 0)
+			index_rescan(indexstate->iss_ScanDesc,
+					   indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
 				indexstate->iss_OrderByKeys, indexstate->iss_NumOrderByKeys);
+	}
 
 	/*
 	 * all done.
@@ -1595,3 +1620,91 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 	else if (n_array_keys != 0)
 		elog(ERROR, "ScalarArrayOpExpr index qual found where not allowed");
 }
+
+/* ----------------------------------------------------------------
+ *						Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanEstimate
+ *
+ *		estimates the space required to serialize indexscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanEstimate(IndexScanState *node,
+					  ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+													 estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeDSM
+ *
+ *		Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeDSM(IndexScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+	index_parallelscan_initialize(node->ss.ss_currentRelation,
+								  node->iss_RelationDesc,
+								  estate->es_snapshot,
+								  piscan);
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeWorker
+ *
+ *		Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc)
+{
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 458f139..f4253d3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -400,6 +400,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	List	   *qpquals;
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
+	Cost		cpu_run_cost = 0;
 	Cost		indexStartupCost;
 	Cost		indexTotalCost;
 	Selectivity indexSelectivity;
@@ -602,11 +603,24 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 
-	run_cost += cpu_per_tuple * tuples_fetched;
+	cpu_run_cost += cpu_per_tuple * tuples_fetched;
 
 	/* tlist eval costs are paid per output row, not per tuple scanned */
 	startup_cost += path->path.pathtarget->cost.startup;
-	run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+	cpu_run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+
+	/* Adjust costing for parallelism, if used. */
+	if (path->path.parallel_workers > 0)
+	{
+		double		parallel_divisor = get_parallel_divisor(&path->path);
+
+		path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
+
+		/* The CPU cost is divided among all the workers. */
+		cpu_run_cost /= parallel_divisor;
+	}
+
+	run_cost += cpu_run_cost;
 
 	path->path.startup_cost = startup_cost;
 	path->path.total_cost = startup_cost + run_cost;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 0a5c050..6c81dba 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -15,6 +15,7 @@
  */
 #include "postgres.h"
 
+#include <limits.h>
 #include <math.h>
 
 #include "access/stratnum.h"
@@ -108,6 +109,8 @@ static bool bms_equal_any(Relids relids, List *relids_list);
 static void get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 				IndexOptInfo *index, IndexClauseSet *clauses,
 				List **bitindexpaths);
+static int index_path_get_workers(PlannerInfo *root, RelOptInfo *rel,
+					   IndexOptInfo *index);
 static List *build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 				  IndexOptInfo *index, IndexClauseSet *clauses,
 				  bool useful_predicate,
@@ -811,6 +814,51 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 }
 
 /*
+ * index_path_get_workers
+ *	  Compute the number of workers to use for a parallel index scan
+ */
+static int
+index_path_get_workers(PlannerInfo *root, RelOptInfo *rel, IndexOptInfo *index)
+{
+	int			parallel_workers;
+	int			parallel_threshold;
+
+	/*
+	 * If this relation is too small to be worth a parallel scan, just return
+	 * zero ... unless it's an inheritance child.  In that case, we still
+	 * want to allow a parallel path to be generated.  It might not be
+	 * worthwhile just for this relation, but when combined with all of its
+	 * inheritance siblings it may well pay off.
+	 */
+	if (index->pages < (BlockNumber) min_parallel_relation_size &&
+		rel->reloptkind == RELOPT_BASEREL)
+		return 0;
+
+	/*
+	 * Select the number of workers based on the log of the size of the
+	 * relation.  This probably needs to be a good deal more sophisticated,
+	 * but we need something here for now.  Note that the upper limit of the
+	 * min_parallel_relation_size GUC is chosen to prevent overflow here.
+	 */
+	parallel_workers = 1;
+	parallel_threshold = Max(min_parallel_relation_size, 1);
+	while (index->pages >= (BlockNumber) (parallel_threshold * 3))
+	{
+		parallel_workers++;
+		parallel_threshold *= 3;
+		if (parallel_threshold > INT_MAX / 3)
+			break;				/* avoid overflow */
+	}
+
+	/*
+	 * In no case use more than max_parallel_workers_per_gather workers.
+	 */
+	parallel_workers = Min(parallel_workers, max_parallel_workers_per_gather);
+
+	return parallel_workers;
+}
+
+/*
  * build_index_paths
  *	  Given an index and a set of index clauses for it, construct zero
  *	  or more IndexPaths.
@@ -1042,8 +1090,43 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  NoMovementScanDirection,
 								  index_only_scan,
 								  outer_relids,
-								  loop_count);
+								  loop_count,
+								  0);
 		result = lappend(result, ipath);
+
+		/*
+		 * If appropriate, consider parallel index scan.  We don't allow
+		 * parallel index scan for bitmap scans.
+		 */
+		if (index->amcanparallel &&
+			!index_only_scan &&
+			rel->consider_parallel &&
+			outer_relids == NULL &&
+			scantype != ST_BITMAPSCAN)
+		{
+			int			parallel_workers = 0;
+
+			parallel_workers = index_path_get_workers(root, rel, index);
+
+			if (parallel_workers > 0)
+			{
+				ipath = create_index_path(root, index,
+										  index_clauses,
+										  clause_columns,
+										  orderbyclauses,
+										  orderbyclausecols,
+										  useful_pathkeys,
+										  index_is_ordered ?
+										  ForwardScanDirection :
+										  NoMovementScanDirection,
+										  index_only_scan,
+										  outer_relids,
+										  loop_count,
+										  parallel_workers);
+
+				add_partial_path(rel, (Path *) ipath);
+			}
+		}
 	}
 
 	/*
@@ -1066,8 +1149,38 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  BackwardScanDirection,
 									  index_only_scan,
 									  outer_relids,
-									  loop_count);
+									  loop_count,
+									  0);
 			result = lappend(result, ipath);
+
+			/* If appropriate, consider parallel index scan */
+			if (index->amcanparallel &&
+				!index_only_scan &&
+				rel->consider_parallel &&
+				outer_relids == NULL &&
+				scantype != ST_BITMAPSCAN)
+			{
+				int			parallel_workers = 0;
+
+				parallel_workers = index_path_get_workers(root, rel, index);
+
+				if (parallel_workers > 0)
+				{
+					ipath = create_index_path(root, index,
+											  index_clauses,
+											  clause_columns,
+											  NIL,
+											  NIL,
+											  useful_pathkeys,
+											  BackwardScanDirection,
+											  index_only_scan,
+											  outer_relids,
+											  loop_count,
+											  parallel_workers);
+
+					add_partial_path(rel, (Path *) ipath);
+				}
+			}
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f936710..69e25dc 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5297,7 +5297,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	indexScanPath = create_index_path(root, indexInfo,
 									  NIL, NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
-									  NULL, 1.0);
+									  NULL, 1.0, 0);
 
 	return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 3b7c56d..08313b8 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -744,10 +744,8 @@ add_path_precheck(RelOptInfo *parent_rel,
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
  *	  such paths yet.  Hence, GatherPaths must not be created for a rel until
- *	  we're done creating all partial paths for it.  We do not currently build
- *	  partial indexscan paths, so there is no need for an exception for
- *	  IndexPaths here; for safety, we instead Assert that a path to be freed
- *	  isn't an IndexPath.
+ *	  we're done creating all partial paths for it.  As with add_path, we
+ *	  make an exception for IndexPaths here and do not pfree them.
  */
 void
 add_partial_path(RelOptInfo *parent_rel, Path *new_path)
@@ -826,9 +824,12 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		{
 			parent_rel->partial_pathlist =
 				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
-			/* we should not see IndexPaths here, so always safe to delete */
-			Assert(!IsA(old_path, IndexPath));
-			pfree(old_path);
+
+			/*
+			 * Delete the data pointed-to by the deleted cell, if possible
+			 */
+			if (!IsA(old_path, IndexPath))
+				pfree(old_path);
 			/* p1_prev does not advance */
 		}
 		else
@@ -860,10 +861,9 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	}
 	else
 	{
-		/* we should not see IndexPaths here, so always safe to delete */
-		Assert(!IsA(new_path, IndexPath));
 		/* Reject and recycle the new path */
-		pfree(new_path);
+		if (!IsA(new_path, IndexPath))
+			pfree(new_path);
 	}
 }
 
@@ -1019,7 +1019,8 @@ create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count)
+				  double loop_count,
+				  int parallel_workers)
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
@@ -1031,9 +1032,9 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.pathtarget = rel->reltarget;
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
-	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_aware = (parallel_workers > 0);
 	pathnode->path.parallel_safe = rel->consider_parallel;
-	pathnode->path.parallel_workers = 0;
+	pathnode->path.parallel_workers = parallel_workers;
 	pathnode->path.pathkeys = pathkeys;
 
 	/* Convert clauses to indexquals the executor can handle */
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 7836e6b..4ed2705 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -241,6 +241,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->amoptionalkey = amroutine->amoptionalkey;
 			info->amsearcharray = amroutine->amsearcharray;
 			info->amsearchnulls = amroutine->amsearchnulls;
+			info->amcanparallel = amroutine->amcanparallel;
 			info->amhasgettuple = (amroutine->amgettuple != NULL);
 			info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
 			info->amcostestimate = amroutine->amcostestimate;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index d841170..feb0c8b 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -184,6 +184,8 @@ typedef struct IndexAmRoutine
 	bool		amclusterable;
 	/* does AM handle predicate locks? */
 	bool		ampredlocks;
+	/* does AM support parallel scan? */
+	bool		amcanparallel;
 	/* type of data stored in index, or InvalidOid if variable */
 	Oid			amkeytype;
 
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 46d6f45..ea3f3a5 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -14,6 +14,7 @@
 #ifndef NODEINDEXSCAN_H
 #define NODEINDEXSCAN_H
 
+#include "access/parallel.h"
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
@@ -22,6 +23,9 @@ extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
 extern void ExecReScanIndexScan(IndexScanState *node);
+extern void ExecIndexScanEstimate(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeDSM(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc);
 
 /*
  * These routines are exported to share code with nodeIndexonlyscan.c and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ce13bf7..58ec9d2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1352,6 +1352,7 @@ typedef struct
  *		SortSupport		   for reordering ORDER BY exprs
  *		OrderByTypByVals   is the datatype of order by expression pass-by-value?
  *		OrderByTypLens	   typlens of the datatypes of order by expressions
+ *		pscan_len		   size of parallel index scan descriptor
  * ----------------
  */
 typedef struct IndexScanState
@@ -1378,6 +1379,9 @@ typedef struct IndexScanState
 	SortSupport iss_SortSupport;
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
+
+	/* This is needed for parallel index scan */
+	Size		iss_PscanLen;
 } IndexScanState;
 
 /* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index e1d31c7..a70b358 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -622,6 +622,7 @@ typedef struct IndexOptInfo
 	bool		amsearchnulls;	/* can AM search for NULL/NOT NULL entries? */
 	bool		amhasgettuple;	/* does AM have amgettuple interface? */
 	bool		amhasgetbitmap; /* does AM have amgetbitmap interface? */
+	bool		amcanparallel;	/* does AM support parallel scan? */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
 	void		(*amcostestimate) ();	/* AM's cost estimator */
 } IndexOptInfo;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index d16f879..5edac67 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -47,7 +47,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count);
+				  double loop_count,
+				  int parallel_workers);
 extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						Path *bitmapqual,
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 18e21b7..140fb3c 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -99,6 +99,29 @@ explain (costs off)
    ->  Index Only Scan using tenk1_unique1 on tenk1
 (3 rows)
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+                             QUERY PLAN                             
+--------------------------------------------------------------------
+ Finalize Aggregate
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial Aggregate
+               ->  Parallel Index Scan using tenk1_hundred on tenk1
+                     Index Cond: (hundred > 1)
+(6 rows)
+
+select  count((unique1)) from tenk1 where hundred > 1;
+ count 
+-------
+  9800
+(1 row)
+
+reset enable_seqscan;
+reset enable_bitmapscan;
 set force_parallel_mode=1;
 explain (costs off)
   select stringu1::int2 from tenk1 where unique1 = 1;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 8b4090f..d7dfd28 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -39,6 +39,17 @@ explain (costs off)
 	select  sum(parallel_restricted(unique1)) from tenk1
 	group by(parallel_restricted(unique1));
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+select  count((unique1)) from tenk1 where hundred > 1;
+
+reset enable_seqscan;
+reset enable_bitmapscan;
+
 set force_parallel_mode=1;
 
 explain (costs off)
#31Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#30)
Re: Parallel Index Scans

On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Fixed.

Thanks for the update. Some more comments:

It shouldn't be necessary for MultiExecBitmapIndexScan to modify the
IndexScanDesc. That seems really broken. If a parallel scan isn't
supported here (and I'm sure that's the case right now) then no such
IndexScanDesc should be getting created.

WAIT_EVENT_PARALLEL_INDEX_SCAN is in fact btree-specific. There's no
guarantee that any other AMs that implement parallel index scans will
use that wait event, and they might have others instead. I would make
it a lot more specific, like WAIT_EVENT_BTREE_PAGENUMBER. (Waiting
for the page number needed to continue a parallel btree scan to become
available.)

Why do we need separate AM support for index_parallelrescan()? Why
can't index_rescan() cover that case? If the AM-specific method is
getting the IndexScanDesc, it can see for itself whether it is a
parallel scan.

I'd rename PS_State to BTPS_State, to match the other renamings.

If we're going to update all of the AMs to set the new support
functions to NULL, we should also update contrib/bloom.

index_parallelscan_estimate has multiple lines that go over 80
characters for no really good reason. Separate the initialization of
index_scan from the variable declaration. Do the same for
amindex_size. Also, you don't really need to encase the end of the
function in an "else" block when the "if" block is guaranteed to
return.

Several function header comments still use the style where the first
word of the description is "It". Say "this function" or similar the
first time, instead of "it". Then when you say "it" later, it's clear
that it refers back to where you said "this function".

index_parallelscan_initialize also has a line more than 80 characters
that looks easy to fix by splitting the declaration from the
initialization.

I think it's a bad idea to add a ParallelIndexScanDesc argument to
index_beginscan(). That function is used in lots of places, and
somebody might think that they are allowed to actually pass a non-NULL
value there, which they aren't: they must go through
index_beginscan_parallel. I think that the additional argument should
only be added to index_beginscan_internal, and
index_beginscan_parallel should remain unchanged. Either that, or get
rid of index_beginscan_parallel altogether and have everyone use
index_beginscan directly, and put the snapshot-restore logic in that
function.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#32Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#30)
Re: Parallel Index Scans

On Mon, Jan 16, 2017 at 11:11 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Changed as per suggestion.

I have also rebased the optimizer/executor support patch
(parallel_index_opt_exec_support_v4.patch) and added a test case in
it.

Thanks for the patch. Here are comments found during review.

parallel_index_scan_v4.patch:

+ amtarget = (char *) ((void *) target) + offset;

The above calculation can be moved after the NULL check?

+ * index_beginscan_parallel - join parallel index scan

The name and the description don't sync properly; any better description?

+ BTPARALLEL_DONE,
+ BTPARALLEL_IDLE
+} PS_State;

The order of above two enum values can be changed according to their use.

+ /* Check if the scan for current scan keys is finished */
+ if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+ *status = false;

I didn't clearly understand in which scenario the arrayKeyCount is less
than btps_arrayKeyCount?

+BlockNumber
+_bt_parallel_seize(IndexScanDesc scan, bool *status)

The return value of the above function is validated only in the _bt_first
function, but in other cases it is not. From the function description,
it is possible to return P_NONE for the workers as well, with status as
true. I feel it is better to handle the P_NONE case internally, so that
callers just check the status. Am I missing anything?

+extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber
scan_page);

Any better names for the above functions, as these functions will provide/set
the next page that needs to be read?

parallel_index_opt_exec_support_v4.patch:

+#include "access/parallel.h"

Why is it required to be included in nbtree.c? I didn't find
any related code changes in the patch.

+ /* reset (parallel) index scan */
+ if (node->iss_ScanDesc)
+ {

Why is this if check required? There is an assert check in later function
calls.

Regards,
Hari Babu
Fujitsu Australia

#33Rahila Syed
rahilasyed90@gmail.com
In reply to: Haribabu Kommi (#32)
Re: Parallel Index Scans
+ /* Check if the scan for current scan keys is finished */
+ if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+ *status = false;

I didn't clearly understand, in which scenario the arrayKeyCount is less
than btps_arrayKeyCount?

Consider the following array scan keys:

select * from test2 where j=ANY(ARRAY[1000,2000,3000]);

By the time the current worker has finished reading the heap tuples
corresponding to array key 1000 (arrayKeyCount = 0), some other worker might
have advanced the scan to array key 2000 (btps_arrayKeyCount = 1). In this
case, when the current worker fetches the next page to scan, it must advance
its own scan keys before scanning the next page of the parallel scan.
I hope this helps.
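
For concreteness, a minimal hypothetical setup that exercises this case
(the table definition, index name, and row count are assumptions; whether a
parallel index scan is actually chosen depends on the index size and the
usual parallel-query settings):

create table test2 (i int, j int);
create index test2_j_idx on test2 (j);
insert into test2 select g, g % 5000 from generate_series(1, 1000000) g;
analyze test2;

-- each array element is scanned as a separate pass over the index; in a
-- parallel scan the workers coordinate the current element through the
-- shared btps_arrayKeyCount, so a worker whose local arrayKeyCount lags
-- behind must advance its own scan keys before it is handed another page
set enable_seqscan to off;
set enable_bitmapscan to off;
explain (costs off)
	select * from test2 where j = ANY(ARRAY[1000, 2000, 3000]);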

+BlockNumber
+_bt_parallel_seize(IndexScanDesc scan, bool *status)

The return value of the above function is validated only in _bt_first
function, but in other cases it is not.

In other cases it is validated in _bt_readnextpage() which is called after
_bt_parallel_seize().

From the function description,
it is possible to return P_NONE for the workers also with status as
true. I feel it is better to handle the P_NONE case internally only
so that callers just check for the status. Am I missing anything?

In the case of the next block being P_NONE and status true, the code
calls _bt_parallel_done() to notify other workers, followed by
BTScanPosInvalidate(). A similar check for block = P_NONE also
happens in the existing code; see the following in _bt_readnextpage():

if (blkno == P_NONE || !so->currPos.moreRight)
{
	_bt_parallel_done(scan);
	BTScanPosInvalidate(so->currPos);
	return false;
}
So to keep it consistent with the existing code, the check
is kept outside _bt_parallel_seize().

Thank you,
Rahila Syed

#34Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#31)
Re: Parallel Index Scans

On Tue, Jan 17, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

WAIT_EVENT_PARALLEL_INDEX_SCAN is in fact btree-specific. There's no
guarantee that any other AMs that implement parallel index scans will
use that wait event, and they might have others instead. I would make
it a lot more specific, like WAIT_EVENT_BTREE_PAGENUMBER. (Waiting
for the page number needed to continue a parallel btree scan to become
available.)

WAIT_EVENT_BTREE_PAGENUMBER - NUMBER sounds slightly inconvenient. How
about just WAIT_EVENT_BTREE_PAGE? We can keep the description as
suggested by you?

Why do we need separate AM support for index_parallelrescan()? Why
can't index_rescan() cover that case?

The reason is that sometimes index_rescan is called when we just have to
update runtime scan keys in the index and we don't want to reset the
parallel scan for that. Refer to the ExecReScanIndexScan() changes in the
parallel_index_opt_exec_support_v4 patch. Rescan is called from the
following place during an index scan:

ExecIndexScan(IndexScanState *node)
{
	/*
	 * If we have runtime keys and they've not already been set up, do it now.
	 */
	if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
		ExecReScan((PlanState *) node);

If the AM-specific method is
getting the IndexScanDesc, it can see for itself whether it is a
parallel scan.

I think if we want to do it that way, then we need to pass some additional
information related to runtime scan keys to the index_rescan method and
then probably to the AM-specific rescan method. That sounds scary.

I think it's a bad idea to add a ParallelIndexScanDesc argument to
index_beginscan(). That function is used in lots of places, and
somebody might think that they are allowed to actually pass a non-NULL
value there, which they aren't: they must go through
index_beginscan_parallel. I think that the additional argument should
only be added to index_beginscan_internal, and
index_beginscan_parallel should remain unchanged.

If we go that way, then we need to set a few parameters like heapRelation
and xs_snapshot in index_beginscan_parallel, as we are doing in
index_beginscan. Does going that way sound better to you?

Either that, or get
rid of index_beginscan_parallel altogether and have everyone use
index_beginscan directly, and put the snapshot-restore logic in that
function.

I think there is value in retaining index_beginscan_parallel, as it is
analogous to heap_beginscan_parallel.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#35Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#32)
Re: Parallel Index Scans

On Wed, Jan 18, 2017 at 6:25 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Mon, Jan 16, 2017 at 11:11 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

+ * index_beginscan_parallel - join parallel index scan

The name and the description doesn't sync properly, any better description?

This can be called by both the worker and the leader of a parallel index
scan. What problem do you see with it? heap_beginscan_parallel has a
similar description, so I am not sure changing it here alone makes sense.

+extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber
scan_page);

Any better names for the above functions, as these function will provide/set
the next page that needs to be read.

These functions also set the state of the scan. IIRC, these names were
agreed upon between Robert and Rahila as well (suggested offlist by
Robert). I am open to changing them if you or others have any better
suggestions.

+ /* reset (parallel) index scan */
+ if (node->iss_ScanDesc)
+ {

Why this if check required? There is an assert check in later function
calls.

This is required because we don't initialize the scan descriptor for
parallel-aware nodes during ExecInitIndexScan. It gets initialized
later, at execution time, when we initialize the DSM. Now, it is
quite possible that a Gather node occurs on the inner side of a join,
in which case Rescan can be called before execution even starts. This
is the reason why we have a similar check in ExecReScanSeqScan, which
was added for parallel sequential scans (f0661c4e). Does that answer
your question?
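
To illustrate only (the table, column, and index names below are
assumptions, and whether the planner actually chooses this shape depends
entirely on costs and settings), the problematic case is a plan where a
Gather with a parallel index scan sits on the inner side of a nested loop:

-- hypothetical example; small_tab/big_tab and their columns are made up
set enable_hashjoin to off;
set enable_mergejoin to off;
explain (costs off)
	select count(*)
	from small_tab s join big_tab b on s.category <> b.category
	where b.val between 1 and 100000;
-- if the inner side becomes Gather -> Parallel Index Scan on big_tab, the
-- executor rescans that subtree for every outer row, and the first rescan
-- reaches ExecReScanIndexScan before the DSM (and hence iss_ScanDesc) has
-- been set up, which is why the NULL check is needed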

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#36Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#34)
Re: Parallel Index Scans

On Wed, Jan 18, 2017 at 8:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Jan 17, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
WAIT_EVENT_PARALLEL_INDEX_SCAN is in fact btree-specific. There's no
guarantee that any other AMs that implement parallel index scans will
use that wait event, and they might have others instead. I would make
it a lot more specific, like WAIT_EVENT_BTREE_PAGENUMBER. (Waiting
for the page number needed to continue a parallel btree scan to become
available.)

WAIT_EVENT_BTREE_PAGENUMBER - NUMBER sounds slightly inconvenient. How
about just WAIT_EVENT_BTREE_PAGE? We can keep the description as
suggested by you?

Sure.

Why do we need separate AM support for index_parallelrescan()? Why
can't index_rescan() cover that case?

The reason is that sometimes index_rescan is called when we have to
just update runtime scankeys in index and we don't want to reset
parallel scan for that. Refer ExecReScanIndexScan() changes in patch
parallel_index_opt_exec_support_v4. Rescan is called from below place
during index scan.

Hmm, tricky. OK, I'll think about that some more.

I think it's a bad idea to add a ParallelIndexScanDesc argument to
index_beginscan(). That function is used in lots of places, and
somebody might think that they are allowed to actually pass a non-NULL
value there, which they aren't: they must go through
index_beginscan_parallel. I think that the additional argument should
only be added to index_beginscan_internal, and
index_beginscan_parallel should remain unchanged.

If we go that way then we need to set a few parameters like heapRelation
and xs_snapshot in index_beginscan_parallel as we are doing in
index_beginscan. Does going that way sound better to you?

It's pretty minor code duplication; I don't think it's an issue.

I think there is value in retaining index_beginscan_parallel as that
is parallel to heap_beginscan_parallel.

OK.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#37Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#30)
Re: Parallel Index Scans

On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Fixed.

With respect to the second patch
(parallel_index_opt_exec_support_v4.patch), I'm hoping it can use the
new function from Dilip's bitmap heap scan patch set. See commit
716c7d4b242f0a64ad8ac4dc48c6fed6557ba12c.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#38Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Rahila Syed (#33)
Re: Parallel Index Scans

On Wed, Jan 18, 2017 at 6:55 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:

+ /* Check if the scan for current scan keys is finished */
+ if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+ *status = false;

I didn't clearly understand, in which scenario the arrayKeyCount is less
than btps_arrayKeyCount?

Consider following array scan keys

select * from test2 where j=ANY(ARRAY[1000,2000,3000]);

By the time the current worker has finished reading heap tuples
corresponding to array key 1000 (arrayKeyCount = 0), some other worker might
have advanced the scan to the array key 2000 (btps_arrayKeyCount = 1). In
this case when the current worker fetches next page to scan, it must advance
its scan keys before scanning the next page of parallel scan.
I hope this helps.

Thanks for the details.
One worker increments both arrayKeyCount and btps_arrayKeyCount. As
btps_arrayKeyCount is present in shared memory, the other worker sees the
update and hits the above condition.

+BlockNumber
+_bt_parallel_seize(IndexScanDesc scan, bool *status)

The return value of the above function is validated only in _bt_first
function, but in other cases it is not.

In other cases it is validated in _bt_readnextpage() which is called after
_bt_parallel_seize().

From the function description,
it is possible to return P_NONE for the workers also with status as
true. I feel it is better to handle the P_NONE case internally only
so that callers just check for the status. Am I missing anything?

In case of the next block being P_NONE and status true, the code
calls _bt_parallel_done() to notify other workers followed by
BTScanPosInvalidate(). Similar check for block = P_NONE also
happens in existing code. See following in _bt_readnextpage(),

if (blkno == P_NONE || !so->currPos.moreRight)
{
	_bt_parallel_done(scan);
	BTScanPosInvalidate(so->currPos);
	return false;
}
So to keep it consistent with the existing code, the check
is kept outside _bt_parallel_seize().

Thanks. Got it.

Regards,
Hari Babu
Fujitsu Australia

#39Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#31)
2 attachment(s)
Re: Parallel Index Scans

On Tue, Jan 17, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Fixed.

Thanks for the update. Some more comments:

It shouldn't be necessary for MultiExecBitmapIndexScan to modify the
IndexScanDesc. That seems really broken. If a parallel scan isn't
supported here (and I'm sure that's the case right now) then no such
IndexScanDesc should be getting created.

Fixed.

WAIT_EVENT_PARALLEL_INDEX_SCAN is in fact btree-specific. There's no
guarantee that any other AMs that implement parallel index scans will
use that wait event, and they might have others instead. I would make
it a lot more specific, like WAIT_EVENT_BTREE_PAGENUMBER. (Waiting
for the page number needed to continue a parallel btree scan to become
available.)

Changed as per discussion.

Why do we need separate AM support for index_parallelrescan()? Why
can't index_rescan() cover that case? If the AM-specific method is
getting the IndexScanDesc, it can see for itself whether it is a
parallel scan.

Left as is, based on yesterday's discussion.

I'd rename PS_State to BTPS_State, to match the other renamings.

If we're going to update all of the AMs to set the new support
functions to NULL, we should also update contrib/bloom.

index_parallelscan_estimate has multiple lines that go over 80
characters for no really good reason. Separate the initialization of
index_scan from the variable declaration. Do the same for
amindex_size. Also, you don't really need to encase the end of the
function in an "else" block when the "if" block is guaranteed to
return.

Several function header comments still use the style where the first
word of the description is "It". Say "this function" or similar the
first time, instead of "it". Then when you say "it" later, it's clear
that it refers back to where you said "this function".

index_parallelscan_initialize also has a line more than 80 characters
that looks easy to fix by splitting the declaration from the
initialization.

Fixed all the above.

I think it's a bad idea to add a ParallelIndexScanDesc argument to
index_beginscan(). That function is used in lots of places, and
somebody might think that they are allowed to actually pass a non-NULL
value there, which they aren't: they must go through
index_beginscan_parallel. I think that the additional argument should
only be added to index_beginscan_internal, and
index_beginscan_parallel should remain unchanged. Either that, or get
rid of index_beginscan_parallel altogether and have everyone use
index_beginscan directly, and put the snapshot-restore logic in that
function.

Changed as per yesterday's discussion.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_index_scan_v5.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 06077af..f5ee7bc 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -131,8 +131,11 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = bloptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = blvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = blbeginscan;
 	amroutine->amrescan = blrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = blgetbitmap;
 	amroutine->amendscan = blendscan;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 40f201b..df71c06 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -124,8 +124,11 @@ typedef struct IndexAmRoutine
     amoptions_function amoptions;
     amproperty_function amproperty;     /* can be NULL */
     amvalidate_function amvalidate;
+    amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
+    aminitparallelscan_function aminitparallelscan;    /* can be NULL */
     ambeginscan_function ambeginscan;
     amrescan_function amrescan;
+    amparallelrescan_function amparallelrescan;    /* can be NULL */
     amgettuple_function amgettuple;     /* can be NULL */
     amgetbitmap_function amgetbitmap;   /* can be NULL */
     amendscan_function amendscan;
@@ -459,6 +462,35 @@ amvalidate (Oid opclassoid);
    invalid.  Problems should be reported with <function>ereport</> messages.
   </para>
 
+  <para>
+<programlisting>
+Size
+amestimateparallelscan (void);
+</programlisting>
+   Estimate and return the storage space required for parallel index scan.
+   The size of index-type-specific parallel information will be returned back
+   to caller.
+  </para>
+
+  <para>
+<programlisting>
+void
+aminitparallelscan (void *target);
+</programlisting>
+   Initialize the parallel index scan state.  This function will be used to
+   initialize index-type-specific parallel information which will be stored
+   immediately after generic parallel information required for parallel index
+   scans. The required state information will be set in <literal>target</>.
+  </para>
+
+  <para>
+   The <function>aminitparallelscan</> and <function>amestimateparallelscan</>
+   functions need only be provided if the access method supports <quote>parallel</>
+   index scans.  If it doesn't, the <structfield>aminitparallelscan</> and
+   <structfield>amestimateparallelscan</> fields in its <structname>IndexAmRoutine</>
+   struct must be set to NULL.  Note that these fields can be NULL even for
+   <quote>parallel</> index scans.
+  </para>
 
   <para>
    The purpose of an index, of course, is to support scans for tuples matching
@@ -511,6 +543,24 @@ amrescan (IndexScanDesc scan,
 
   <para>
 <programlisting>
+void
+amparallelrescan (IndexScanDesc scan);
+</programlisting>
+   Restart the parallel index scan.  This function  resets the parallel index
+   scan state.  It must be called only during restart of scan which will be
+   typically required for the inner side of nest-loop join.
+  </para>
+
+  <para>
+   The <function>amparallelrescan</> function need only be provided if the
+   access method supports <quote>parallel</> index scans.  If it doesn't,
+   the <structfield>amparallelrescan</> field in its <structname>IndexAmRoutine</>
+   struct must be set to NULL.  Note that this field can be NULL even for
+   <quote>parallel</> index scans.
+  </para>
+
+  <para>
+<programlisting>
 boolean
 amgettuple (IndexScanDesc scan,
             ScanDirection direction);
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 1545f03..8451eb4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1199,7 +1199,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting in an extension.</entry>
         </row>
         <row>
-         <entry morerows="9"><literal>IPC</></entry>
+         <entry morerows="10"><literal>IPC</></entry>
          <entry><literal>BgWorkerShutdown</></entry>
          <entry>Waiting for background worker to shut down.</entry>
         </row>
@@ -1232,6 +1232,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting for parallel workers to finish computing.</entry>
         </row>
         <row>
+         <entry><literal>ParallelBtreePage</></entry>
+         <entry>Waiting for the page number needed to continue a parallel btree scan
+         to become available.</entry>
+        </row>
+        <row>
          <entry><literal>SafeSnapshot</></entry>
          <entry>Waiting for a snapshot for a <literal>READ ONLY DEFERRABLE</> transaction.</entry>
         </row>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d60ddd2..826f3cc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -105,8 +105,11 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = brinoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = brinvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = brinbeginscan;
 	amroutine->amrescan = brinrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = bringetbitmap;
 	amroutine->amendscan = brinendscan;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 3909638..a80039f 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,8 +61,11 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = ginoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = ginvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = ginbeginscan;
 	amroutine->amrescan = ginrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = gingetbitmap;
 	amroutine->amendscan = ginendscan;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 597056a..54f9dc4 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -82,8 +82,11 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = gistoptions;
 	amroutine->amproperty = gistproperty;
 	amroutine->amvalidate = gistvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = gistbeginscan;
 	amroutine->amrescan = gistrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = gistgettuple;
 	amroutine->amgetbitmap = gistgetbitmap;
 	amroutine->amendscan = gistendscan;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index a64a9b9..17fd125 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,8 +79,11 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = hashoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = hashvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = hashbeginscan;
 	amroutine->amrescan = hashrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = hashgettuple;
 	amroutine->amgetbitmap = hashgetbitmap;
 	amroutine->amendscan = hashendscan;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4822af9..a19bb49 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -120,7 +120,8 @@ do { \
 } while(0)
 
 static IndexScanDesc index_beginscan_internal(Relation indexRelation,
-						 int nkeys, int norderbys, Snapshot snapshot);
+						 int nkeys, int norderbys, Snapshot snapshot,
+						 ParallelIndexScanDesc pscan, bool temp_snap);
 
 
 /* ----------------------------------------------------------------
@@ -207,6 +208,89 @@ index_insert(Relation indexRelation,
 }
 
 /*
+ * index_parallelscan_estimate - estimate storage for ParallelIndexScanDesc
+ *
+ * This function calls am specific routine to obtain size of am specific
+ * shared information.
+ */
+Size
+index_parallelscan_estimate(Relation indexrel, Snapshot snapshot)
+{
+	Size		index_size;
+	Size		amindex_size;
+
+	index_size = offsetof(ParallelIndexScanDescData, ps_snapshot_data);
+	index_size = add_size(index_size, EstimateSnapshotSpace(snapshot));
+	index_size = MAXALIGN(index_size);
+
+	/* amestimateparallelscan is optional; assume no-op if not provided by AM */
+	if (indexrel->rd_amroutine->amestimateparallelscan == NULL)
+		return index_size;
+
+	amindex_size = indexrel->rd_amroutine->amestimateparallelscan();
+	return add_size(index_size, amindex_size);
+}
+
+/*
+ * index_parallelscan_initialize - initialize ParallelIndexScanDesc
+ *
+ * This function calls access method specific initialization routine to
+ * initialize am specific information.  Call this just once in the leader
+ * process; then, individual workers attach via index_beginscan_parallel.
+ */
+void
+index_parallelscan_initialize(Relation heaprel, Relation indexrel,
+							  Snapshot snapshot, ParallelIndexScanDesc target)
+{
+	Size		offset;
+	void	   *amtarget;
+
+	offset = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
+					  EstimateSnapshotSpace(snapshot));
+	offset = MAXALIGN(offset);
+
+	target->ps_relid = RelationGetRelid(heaprel);
+	target->ps_indexid = RelationGetRelid(indexrel);
+	target->ps_offset = offset;
+	SerializeSnapshot(snapshot, target->ps_snapshot_data);
+
+	/* aminitparallelscan is optional; assume no-op if not provided by AM */
+	if (indexrel->rd_amroutine->aminitparallelscan == NULL)
+		return;
+
+	amtarget = (char *) ((void *) target) + offset;
+	indexrel->rd_amroutine->aminitparallelscan(amtarget);
+}
+
+/*
+ * index_beginscan_parallel - join parallel index scan
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
+						 int norderbys, ParallelIndexScanDesc pscan)
+{
+	Snapshot	snapshot;
+	IndexScanDesc scan;
+
+	Assert(RelationGetRelid(heaprel) == pscan->ps_relid);
+	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
+	RegisterSnapshot(snapshot);
+	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
+									pscan, true);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by RelationGetIndexScan.
+	 */
+	scan->heapRelation = heaprel;
+	scan->xs_snapshot = snapshot;
+
+	return scan;
+}
+
+/*
  * index_beginscan - start a scan of an index with amgettuple
  *
  * Caller must be holding suitable locks on the heap and the index.
@@ -219,7 +303,7 @@ index_beginscan(Relation heapRelation,
 {
 	IndexScanDesc scan;
 
-	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot);
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -244,7 +328,7 @@ index_beginscan_bitmap(Relation indexRelation,
 {
 	IndexScanDesc scan;
 
-	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot);
+	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -260,8 +344,11 @@ index_beginscan_bitmap(Relation indexRelation,
  */
 static IndexScanDesc
 index_beginscan_internal(Relation indexRelation,
-						 int nkeys, int norderbys, Snapshot snapshot)
+						 int nkeys, int norderbys, Snapshot snapshot,
+						 ParallelIndexScanDesc pscan, bool temp_snap)
 {
+	IndexScanDesc scan;
+
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(ambeginscan);
 
@@ -276,8 +363,13 @@ index_beginscan_internal(Relation indexRelation,
 	/*
 	 * Tell the AM to open a scan.
 	 */
-	return indexRelation->rd_amroutine->ambeginscan(indexRelation, nkeys,
+	scan = indexRelation->rd_amroutine->ambeginscan(indexRelation, nkeys,
 													norderbys);
+	/* Initialize information for parallel scan. */
+	scan->parallel_scan = pscan;
+	scan->xs_temp_snap = temp_snap;
+
+	return scan;
 }
 
 /* ----------------
@@ -319,6 +411,22 @@ index_rescan(IndexScanDesc scan,
 }
 
 /* ----------------
+ *		index_parallelrescan  - (re)start a parallel scan of an index
+ * ----------------
+ */
+void
+index_parallelrescan(IndexScanDesc scan)
+{
+	SCAN_CHECKS;
+
+	/* amparallelrescan is optional; assume no-op if not provided by AM */
+	if (scan->indexRelation->rd_amroutine->amparallelrescan == NULL)
+		return;
+
+	scan->indexRelation->rd_amroutine->amparallelrescan(scan);
+}
+
+/* ----------------
  *		index_endscan - end a scan
  * ----------------
  */
@@ -341,6 +449,9 @@ index_endscan(IndexScanDesc scan)
 	/* Release index refcount acquired by index_beginscan */
 	RelationDecrementReferenceCount(scan->indexRelation);
 
+	if (scan->xs_temp_snap)
+		UnregisterSnapshot(scan->xs_snapshot);
+
 	/* Release the scan data structure itself */
 	IndexScanEnd(scan);
 }
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1bb1acf..fad5f1b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,8 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "pgstat.h"
+#include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -63,6 +65,46 @@ typedef struct
 	MemoryContext pagedelcontext;
 } BTVacState;
 
+/*
+ * Below flags are used to indicate the state of parallel scan.
+ *
+ * BTPARALLEL_NOT_INITIALIZED implies that the scan is not started
+ *
+ * BTPARALLEL_ADVANCING implies one of the worker or backend is advancing the
+ * scan to a new page; others must wait.
+ *
+ * BTPARALLEL_IDLE implies that no backend is advancing the scan; someone can
+ * start doing it
+ *
+ * BTPARALLEL_DONE implies that the scan is complete (including error exit)
+ */
+typedef enum
+{
+	BTPARALLEL_NOT_INITIALIZED,
+	BTPARALLEL_ADVANCING,
+	BTPARALLEL_IDLE,
+	BTPARALLEL_DONE
+} BTPS_State;
+
+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+	BlockNumber btps_scanPage;	/* latest or next page to be scanned */
+	BTPS_State	btps_pageStatus;/* indicates whether next page is available
+								 * for scan. see above for possible states of
+								 * parallel scan. */
+	int			btps_arrayKeyCount;		/* count indicating number of array
+										 * scan keys processed by parallel
+										 * scan */
+	slock_t		btps_mutex;		/* protects above variables */
+	ConditionVariable btps_cv;	/* used to synchronize parallel scan */
+} BTParallelScanDescData;
+
+typedef struct BTParallelScanDescData *BTParallelScanDesc;
+
 
 static void btbuildCallback(Relation index,
 				HeapTuple htup,
@@ -111,8 +153,11 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = btoptions;
 	amroutine->amproperty = btproperty;
 	amroutine->amvalidate = btvalidate;
+	amroutine->amestimateparallelscan = btestimateparallelscan;
+	amroutine->aminitparallelscan = btinitparallelscan;
 	amroutine->ambeginscan = btbeginscan;
 	amroutine->amrescan = btrescan;
+	amroutine->amparallelrescan = btparallelrescan;
 	amroutine->amgettuple = btgettuple;
 	amroutine->amgetbitmap = btgetbitmap;
 	amroutine->amendscan = btendscan;
@@ -468,6 +513,192 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
 }
 
 /*
+ * btestimateparallelscan - estimate storage for BTParallelScanDescData
+ */
+Size
+btestimateparallelscan(void)
+{
+	return sizeof(BTParallelScanDescData);
+}
+
+/*
+ * btinitparallelscan - Initializing BTParallelScanDesc for parallel btree scan
+ */
+void
+btinitparallelscan(void *target)
+{
+	BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
+
+	SpinLockInit(&bt_target->btps_mutex);
+	bt_target->btps_scanPage = InvalidBlockNumber;
+	bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->btps_arrayKeyCount = 0;
+	ConditionVariableInit(&bt_target->btps_cv);
+}
+
+/*
+ * _bt_parallel_seize() -- returns the next block to be scanned for forward
+ *		scans and latest block scanned for backward scans.
+ *
+ * status - The value of status tells caller whether to continue the scan or
+ * not.  The true value of status indicates either one of the following (a)
+ * the block number returned is valid and the scan can be continued (b) the
+ * block number is invalid and the scan has just begun (c) the block number
+ * is P_NONE and the scan is finished.  The false value indicates that we
+ * have reached the end of scan for current scankeys and for that we return
+ * block number as P_NONE.
+ *
+ * The first time master backend or worker hits last page, it will return
+ * P_NONE and status as 'True', after that any worker tries to fetch next
+ * page, it will return status as 'False'.
+ *
+ * Callers ignore the return value, if the status is false.
+ */
+BlockNumber
+_bt_parallel_seize(IndexScanDesc scan, bool *status)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTPS_State	pageStatus;
+	bool		exit_loop = false;
+	BlockNumber nextPage = InvalidBlockNumber;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	*status = true;
+	while (1)
+	{
+		/*
+		 * Fetch the next block to scan and update the page status so that
+		 * other participants of parallel scan can wait till next page is
+		 * available for scan.  Set the status as false, if scan is finished.
+		 */
+		SpinLockAcquire(&btscan->btps_mutex);
+		pageStatus = btscan->btps_pageStatus;
+
+		/* Check if the scan for current scan keys is finished */
+		if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+			*status = false;
+		else if (pageStatus == BTPARALLEL_DONE)
+			*status = false;
+		else if (pageStatus != BTPARALLEL_ADVANCING)
+		{
+			btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
+			nextPage = btscan->btps_scanPage;
+			exit_loop = true;
+		}
+		SpinLockRelease(&btscan->btps_mutex);
+		if (exit_loop || !*status)
+			break;
+		ConditionVariableSleep(&btscan->btps_cv, WAIT_EVENT_BTREE_PAGE);
+	}
+	ConditionVariableCancelSleep();
+
+	/* no more pages to scan */
+	if (!*status)
+		return P_NONE;
+
+	*status = true;
+	return nextPage;
+}
+
+/*
+ * _bt_parallel_release() -- Advances the parallel scan to allow scan of next
+ *		page
+ *
+ * Save information about scan position and wake up next worker to continue
+ * scan.
+ *
+ * For backward scan, scan_page holds the latest page being scanned.
+ * For forward scan, scan_page holds the next page to be scanned.
+ */
+void
+_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+{
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = scan_page;
+	btscan->btps_pageStatus = BTPARALLEL_IDLE;
+	SpinLockRelease(&btscan->btps_mutex);
+	ConditionVariableSignal(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_done() -- Finishes the parallel scan
+ *
+ * When there are no pages left to scan, this function should be called to
+ * notify other workers.  Otherwise, they might wait forever for the scan to
+ * advance to the next page.
+ */
+void
+_bt_parallel_done(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+	bool		status_changed = false;
+
+	/* Do nothing, for non-parallel scans */
+	if (parallel_scan == NULL)
+		return;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * Ensure to mark parallel scan as done no more than once for single scan.
+	 * We rely on this state to initiate the next scan for multiple array
+	 * keys, see _bt_advance_array_keys.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+		btscan->btps_pageStatus != BTPARALLEL_DONE)
+	{
+		btscan->btps_pageStatus = BTPARALLEL_DONE;
+		status_changed = true;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+
+	/* wake up all the workers associated with this parallel scan */
+	if (status_changed)
+		ConditionVariableBroadcast(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_advance_scan() -- Advances the parallel scan
+ *
+ * Updates the count of array keys processed for both local and parallel
+ * scans.
+ */
+void
+_bt_parallel_advance_scan(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	so->arrayKeyCount++;
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (btscan->btps_pageStatus == BTPARALLEL_DONE)
+	{
+		btscan->btps_scanPage = InvalidBlockNumber;
+		btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		btscan->btps_arrayKeyCount++;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
  *	btrescan() -- rescan an index relation
  */
 void
@@ -487,6 +718,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	}
 
 	so->markItemIndex = -1;
+	so->arrayKeyCount = 0;
 	BTScanPosUnpinIfPinned(so->markPos);
 	BTScanPosInvalidate(so->markPos);
 
@@ -527,6 +759,33 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 }
 
 /*
+ *	btparallelrescan() -- reset parallel scan
+ */
+void
+btparallelrescan(IndexScanDesc scan)
+{
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+
+	if (parallel_scan)
+	{
+
+		BTParallelScanDesc btscan;
+
+		btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												   parallel_scan->ps_offset);
+
+		/*
+		 * Ideally, we don't need to acquire spinlock here, but being
+		 * consistent with heap_rescan seems to be a good idea.
+		 */
+		SpinLockAcquire(&btscan->btps_mutex);
+		btscan->btps_scanPage = InvalidBlockNumber;
+		btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		SpinLockRelease(&btscan->btps_mutex);
+	}
+}
+
+/*
  *	btendscan() -- close down a scan
  */
 void
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 4fba75a..b565e09 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,9 +30,12 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
 
 /*
@@ -544,8 +547,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
+	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	BlockNumber blkno;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -564,6 +569,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (!so->qual_ok)
 		return false;
 
+	/*
+	 * For parallel scans, get the page from shared state. If scan has not
+	 * started, proceed to find out first leaf page to scan by keeping other
+	 * workers waiting until we have descended to appropriate leaf page to be
+	 * scanned for matching tuples.
+	 *
+	 * If the scan has already begun, skip finding the first leaf page and
+	 * directly scanning the page stored in shared structure or the page to
+	 * its left in case of backward scan.
+	 */
+	if (scan->parallel_scan != NULL)
+	{
+		blkno = _bt_parallel_seize(scan, &status);
+		if (!status)
+		{
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno == P_NONE)
+		{
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno != InvalidBlockNumber)
+		{
+			if (!_bt_parallel_readpage(scan, blkno, dir))
+				return false;
+			goto readcomplete;
+		}
+	}
+
 	/*----------
 	 * Examine the scan keys to discover where we need to start the scan.
 	 *
@@ -743,7 +780,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * there.
 	 */
 	if (keysCount == 0)
-		return _bt_endpoint(scan, dir);
+	{
+		bool		match;
+
+		match = _bt_endpoint(scan, dir);
+		if (!match)
+		{
+			/* No match , indicate (parallel) scan finished */
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+		}
+
+		return match;
+	}
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -993,25 +1042,21 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * because nothing finer to lock exists.
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
+
+		/*
+		 * mark parallel scan as done, so that all the workers can finish
+		 * their scan
+		 */
+		_bt_parallel_done(scan);
+		BTScanPosInvalidate(so->currPos);
+
 		return false;
 	}
 	else
 		PredicateLockPage(rel, BufferGetBlockNumber(buf),
 						  scan->xs_snapshot);
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	Assert(so->markItemIndex == -1);
+	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
@@ -1060,6 +1105,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
+readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
@@ -1154,6 +1200,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	page = BufferGetPage(so->currPos.buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/* allow next page be processed by parallel worker */
+	if (scan->parallel_scan)
+	{
+		if (ScanDirectionIsForward(dir))
+			_bt_parallel_release(scan, opaque->btpo_next);
+		else
+			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+	}
+
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1278,21 +1334,16 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
  * if pinned, we'll drop the pin before moving to next page.  The buffer is
  * not locked on entry.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page.  For success on a scan using a non-MVCC snapshot we hold
- * a pin, but not a read lock, on that page.  If we do not hold the pin, we
- * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
- *
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
+ * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
+ * to InvalidBuffer.  We return TRUE to indicate success.
  */
 static bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Relation	rel;
-	Page		page;
-	BTPageOpaque opaque;
+	BlockNumber blkno = InvalidBlockNumber;
+	bool		status = true;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1319,13 +1370,27 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}
 
-	rel = scan->indexRelation;
-
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
-		/* We must rely on the previously saved nextPage link! */
-		BlockNumber blkno = so->currPos.nextPage;
+
+		/*
+		 * We must rely on the previously saved nextPage link for non-parallel
+		 * scans!
+		 */
+		if (scan->parallel_scan != NULL)
+		{
+			blkno = _bt_parallel_seize(scan, &status);
+			if (!status)
+			{
+				/* release the previous buffer, if pinned */
+				BTScanPosUnpinIfPinned(so->currPos);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+			blkno = so->currPos.nextPage;
 
 		/* Remember we left a page with data */
 		so->currPos.moreLeft = true;
@@ -1333,11 +1398,68 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		/* release the previous buffer, if pinned */
 		BTScanPosUnpinIfPinned(so->currPos);
 
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+	else
+	{
+		/* Remember we left a page with data */
+		so->currPos.moreRight = true;
+
+		/* For parallel scans, get the last page scanned */
+		if (scan->parallel_scan != NULL)
+		{
+			blkno = _bt_parallel_seize(scan, &status);
+			BTScanPosUnpinIfPinned(so->currPos);
+			if (!status)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+
+	/* Drop the lock, and maybe the pin, on the current page */
+	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+	return true;
+}
+
+/*
+ *	_bt_readnextpage() -- Read next page containing valid data for scan
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page.  Caller is responsible to release lock and pin on
+ * buffer on success.  We return TRUE to indicate success.
+ *
+ * If there are no more matching records in the given direction, we drop all
+ * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation	rel;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		status = true;
+
+	rel = scan->indexRelation;
+
+	if (ScanDirectionIsForward(dir))
+	{
 		for (;;)
 		{
-			/* if we're at end of scan, give up */
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
 			if (blkno == P_NONE || !so->currPos.moreRight)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1345,10 +1467,10 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			CHECK_FOR_INTERRUPTS();
 			/* step right one page */
 			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
-			/* check for deleted page */
 			page = BufferGetPage(so->currPos.buf);
 			TestForOldSnapshot(scan->xs_snapshot, rel, page);
 			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			/* check for deleted page */
 			if (!P_IGNORE(opaque))
 			{
 				PredicateLockPage(rel, blkno, scan->xs_snapshot);
@@ -1359,14 +1481,30 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			}
 
 			/* nope, keep going */
-			blkno = opaque->btpo_next;
+			if (scan->parallel_scan != NULL)
+			{
+				blkno = _bt_parallel_seize(scan, &status);
+				if (!status)
+				{
+					_bt_relbuf(rel, so->currPos.buf);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+			}
+			else
+				blkno = opaque->btpo_next;
 			_bt_relbuf(rel, so->currPos.buf);
 		}
 	}
 	else
 	{
-		/* Remember we left a page with data */
-		so->currPos.moreRight = true;
+		/*
+		 * for parallel scans, current block number needs to be retrieved from
+		 * shared state and it is the responsibility of caller to pass the
+		 * correct block number.
+		 */
+		if (blkno != InvalidBlockNumber)
+			so->currPos.currPage = blkno;
 
 		/*
 		 * Walk left to the next page with data.  This is much more complex
@@ -1401,6 +1539,12 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			if (!so->currPos.moreLeft)
 			{
 				_bt_relbuf(rel, so->currPos.buf);
+
+				/*
+				 * mark parallel scan as done, so that all the workers can
+				 * finish their scan
+				 */
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1412,6 +1556,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			/* if we're physically at end of index, return failure */
 			if (so->currPos.buf == InvalidBuffer)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1432,9 +1577,48 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
 					break;
 			}
+
+			/*
+			 * For parallel scans, get the last page scanned, as it is quite
+			 * possible that by the time we try to fetch the previous page,
+			 * another worker has already decided to scan that same page.  We
+			 * could avoid that by doing _bt_parallel_release once we have
+			 * read the current page, but it is bad to make other workers
+			 * wait until we have read the page.
+			 */
+			if (scan->parallel_scan != NULL)
+			{
+				_bt_relbuf(rel, so->currPos.buf);
+				blkno = _bt_parallel_seize(scan, &status);
+				if (!status)
+				{
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+				so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			}
 		}
 	}
 
+	return true;
+}
+
+/*
+ *	_bt_parallel_readpage() -- Read current page containing valid data for scan
+ *
+ * On success, release lock and pin on buffer.  We return TRUE to indicate
+ * success.
+ */
+static bool
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	_bt_initialize_more_data(so, dir);
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
 	/* Drop the lock, and maybe the pin, on the current page */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
@@ -1712,19 +1896,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/* remember which buffer we have pinned */
 	so->currPos.buf = buf;
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
+	_bt_initialize_more_data(so, dir);
 
 	/*
 	 * Now load data from the first page of the scan.
@@ -1753,3 +1925,25 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 
 	return true;
 }
+
+/*
+ * _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
+ * for scan direction
+ */
+static inline void
+_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
+{
+	/* initialize moreLeft/moreRight appropriately for scan direction */
+	if (ScanDirectionIsForward(dir))
+	{
+		so->currPos.moreLeft = false;
+		so->currPos.moreRight = true;
+	}
+	else
+	{
+		so->currPos.moreLeft = true;
+		so->currPos.moreRight = false;
+	}
+	so->numKilled = 0;			/* just paranoia */
+	so->markItemIndex = -1;		/* ditto */
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index da0f330..692ced4 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -590,6 +590,10 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 			break;
 	}
 
+	/* advance parallel scan */
+	if (scan->parallel_scan != NULL)
+		_bt_parallel_advance_scan(scan);
+
 	return found;
 }
 
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index ca4b0bd..1729797 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -61,8 +61,11 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = spgoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = spgvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = spgbeginscan;
 	amroutine->amrescan = spgrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = spggettuple;
 	amroutine->amgetbitmap = spggetbitmap;
 	amroutine->amendscan = spgendscan;
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f37a0bf..090711a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3386,6 +3386,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PARALLEL_FINISH:
 			event_name = "ParallelFinish";
 			break;
+		case WAIT_EVENT_BTREE_PAGE:
+			event_name = "ParallelBtreePage";
+			break;
 		case WAIT_EVENT_SAFE_SNAPSHOT:
 			event_name = "SafeSnapshot";
 			break;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6a5f279..18259ad 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -108,6 +108,12 @@ typedef bool (*amproperty_function) (Oid index_oid, int attno,
 /* validate definition of an opclass for this AM */
 typedef bool (*amvalidate_function) (Oid opclassoid);
 
+/* estimate size of parallel scan descriptor */
+typedef Size (*amestimateparallelscan_function) (void);
+
+/* prepare for parallel index scan */
+typedef void (*aminitparallelscan_function) (void *target);
+
 /* prepare for index scan */
 typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
 														   int nkeys,
@@ -120,6 +126,9 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 											   ScanKey orderbys,
 											   int norderbys);
 
+/* (re)start parallel index scan */
+typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+
 /* next valid tuple */
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 												 ScanDirection direction);
@@ -189,8 +198,11 @@ typedef struct IndexAmRoutine
 	amoptions_function amoptions;
 	amproperty_function amproperty;		/* can be NULL */
 	amvalidate_function amvalidate;
+	amestimateparallelscan_function amestimateparallelscan;		/* can be NULL */
+	aminitparallelscan_function aminitparallelscan;		/* can be NULL */
 	ambeginscan_function ambeginscan;
 	amrescan_function amrescan;
+	amparallelrescan_function amparallelrescan; /* can be NULL */
 	amgettuple_function amgettuple;		/* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index b2e078a..d2258f6 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -83,6 +83,8 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 typedef struct IndexScanDescData *IndexScanDesc;
 typedef struct SysScanDescData *SysScanDesc;
 
+typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
+
 /*
  * Enumeration specifying the type of uniqueness check to perform in
  * index_insert().
@@ -131,6 +133,13 @@ extern bool index_insert(Relation indexRelation,
 			 Relation heapRelation,
 			 IndexUniqueCheck checkUnique);
 
+extern Size index_parallelscan_estimate(Relation indexrel, Snapshot snapshot);
+extern void index_parallelscan_initialize(Relation heaprel, Relation indexrel,
+							Snapshot snapshot, ParallelIndexScanDesc target);
+extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
+						 Relation indexrel, int nkeys, int norderbys,
+						 ParallelIndexScanDesc pscan);
+
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
@@ -141,6 +150,7 @@ extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 extern void index_rescan(IndexScanDesc scan,
 			 ScanKey keys, int nkeys,
 			 ScanKey orderbys, int norderbys);
+extern void index_parallelrescan(IndexScanDesc scan);
 extern void index_endscan(IndexScanDesc scan);
 extern void index_markpos(IndexScanDesc scan);
 extern void index_restrpos(IndexScanDesc scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 011a72e..547c1cf 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -609,6 +609,8 @@ typedef struct BTScanOpaqueData
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
 								 * there are any unsatisfiable array keys) */
+	int			arrayKeyCount;	/* count indicating number of array scan keys
+								 * processed */
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
@@ -652,7 +654,8 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
 /*
- * prototypes for functions in nbtree.c (external entry points for btree)
+ * prototypes for functions in nbtree.c (external entry points for btree and
+ * functions to maintain state of parallel scan)
  */
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 		struct IndexInfo *indexInfo);
@@ -661,10 +664,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 		 ItemPointer ht_ctid, Relation heapRel,
 		 IndexUniqueCheck checkUnique);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern Size btestimateparallelscan(void);
+extern void btinitparallelscan(void *target);
+extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+extern void _bt_parallel_done(IndexScanDesc scan);
+extern void _bt_parallel_advance_scan(IndexScanDesc scan);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		 ScanKey orderbys, int norderbys);
+extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8746045..793c46b 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -126,8 +126,20 @@ typedef struct IndexScanDescData
 
 	/* state data for traversing HOT chains in index_getnext */
 	bool		xs_continue_hot;	/* T if must keep walking HOT chain */
+	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
+	ParallelIndexScanDesc parallel_scan;		/* parallel index scan
+												 * information */
 }	IndexScanDescData;
 
+/* Generic structure for parallel scans */
+typedef struct ParallelIndexScanDescData
+{
+	Oid			ps_relid;
+	Oid			ps_indexid;
+	Size		ps_offset;		/* Offset in bytes of am specific structure */
+	char		ps_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+} ParallelIndexScanDescData;
+
 /* Struct for heap-or-index scans of system tables */
 typedef struct SysScanDescData
 {
diff --git a/src/include/c.h b/src/include/c.h
index efbb77f..a2c043a 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -527,6 +527,9 @@ typedef NameData *Name;
 #define PointerIsAligned(pointer, type) \
 		(((uintptr_t)(pointer) % (sizeof (type))) == 0)
 
+#define OffsetToPointer(base, offset) \
+		((void *)((char *) base + offset))
+
 #define OidIsValid(objectId)  ((bool) ((objectId) != InvalidOid))
 
 #define RegProcedureIsValid(p)	OidIsValid(p)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5b37894..dfd095f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -784,6 +784,7 @@ typedef enum
 	WAIT_EVENT_MQ_RECEIVE,
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
+	WAIT_EVENT_BTREE_PAGE,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP
 } WaitEventIPC;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 993880d..9f876ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -161,6 +161,9 @@ BTPageOpaque
 BTPageOpaqueData
 BTPageStat
 BTPageState
+BTParallelScanDesc
+BTParallelScanDescData
+BTPS_State
 BTScanOpaque
 BTScanOpaqueData
 BTScanPos
@@ -1264,6 +1267,8 @@ OverrideSearchPath
 OverrideStackEntry
 PACE_HEADER
 PACL
+ParallelIndexScanDesc
+ParallelIndexScanDescData
 PATH
 PBOOL
 PCtxtHandle
parallel_index_opt_exec_support_v5.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index f5ee7bc..f79ecfc 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -119,6 +119,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = blbuild;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index df71c06..4d17248 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -110,6 +110,8 @@ typedef struct IndexAmRoutine
     bool        amclusterable;
     /* does AM handle predicate locks? */
     bool        ampredlocks;
+    /* does AM support parallel scan? */
+    bool        amcanparallel;
     /* type of data stored in index, or InvalidOid if variable */
     Oid         amkeytype;
 
diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index 5d4bb21..cfc3435 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -268,15 +268,28 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
   <title>Parallel Scans</title>
 
   <para>
-    Currently, the only type of scan which has been modified to work with
-    parallel query is a sequential scan.  Therefore, the driving table in
-    a parallel plan will always be scanned using a
-    <literal>Parallel Seq Scan</>.  The relation's blocks will be divided
-    among the cooperating processes.  Blocks are handed out one at a
-    time, so that access to the relation remains sequential.  Each process
-    will visit every tuple on the page assigned to it before requesting a new
-    page.
+    Currently, the types of scan that have been modified to work with parallel
+    query are sequential scans and index scans.
   </para>
+   
+  <para>
+    In <literal>Parallel Sequential Scans</>, the driving table in a parallel plan
+    will always be scanned using a <literal>Parallel Seq Scan</>.  The relation's
+    blocks will be divided among the cooperating processes.  Blocks are handed out
+    one at a time, so that access to the relation remains sequential.  Each process
+    will visit every tuple on the page assigned to it before requesting a new page.
+  </para>
+
+  <para>
+    In <literal>Parallel Index Scans</>, the driving table in a parallel plan
+    will be scanned using a <literal>Parallel Index Scan</>.  Currently, the only
+    type of index which has been modified to work with parallel query is
+    <literal>btree</>.  Parallelism is performed at the leaf level of the
+    <literal>btree</>.  The first backend (either the master or a worker) to start
+    a scan will descend to the leaf level, and the others will wait until it gets
+    there.  At the leaf level, blocks are handed out one at a time, similar to a
+    <literal>Parallel Seq Scan</>, until all the blocks have been scanned or the
+    scan has reached its end point.
+  </para>
  </sect2>
 
  <sect2 id="parallel-joins">
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 826f3cc..b9ddcea 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -93,6 +93,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index a80039f..f6898ef 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 54f9dc4..d4978b8 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -70,6 +70,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 17fd125..f3debe2 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -67,6 +67,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = INT4OID;
 
 	amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index fad5f1b..28c9193 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -141,6 +141,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
+	amroutine->amcanparallel = true;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 1729797..ae25adf 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -49,6 +49,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = spgbuild;
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index b52cfaa..32c0796 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -61,6 +61,7 @@
 
 static bool TargetListSupportsBackwardScan(List *targetlist);
 static bool IndexSupportsBackwardScan(Oid indexid);
+static bool GatherSupportsBackwardScan(Plan *node);
 
 
 /*
@@ -490,7 +491,7 @@ ExecSupportsBackwardScan(Plan *node)
 			return false;
 
 		case T_Gather:
-			return false;
+			return GatherSupportsBackwardScan(node);
 
 		case T_IndexScan:
 			return IndexSupportsBackwardScan(((IndexScan *) node)->indexid) &&
@@ -571,6 +572,25 @@ IndexSupportsBackwardScan(Oid indexid)
 }
 
 /*
+ * GatherSupportsBackwardScan - does a gather plan support backward scan?
+ *
+ * Returns true if the outer plan node of the gather supports backward scan.
+ * As of now, we can support a backward scan only if the outer node of the
+ * gather is an index scan node.
+ */
+static bool
+GatherSupportsBackwardScan(Plan *node)
+{
+	Plan	   *outer_node = outerPlan(node);
+
+	if (nodeTag(outer_node) == T_IndexScan)
+		return IndexSupportsBackwardScan(((IndexScan *) outer_node)->indexid) &&
+			TargetListSupportsBackwardScan(outer_node->targetlist);
+	else
+		return false;
+}
+
+/*
  * ExecMaterializesOutput - does a plan type materialize its output?
  *
  * Returns true if the plan node type is one that automatically materializes
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e01fe6d..cb91cc3 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -28,6 +28,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeSeqscan.h"
+#include "executor/nodeIndexscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -197,6 +198,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 				ExecSeqScanEstimate((SeqScanState *) planstate,
 									e->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanEstimate((IndexScanState *) planstate,
+									  e->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanEstimate((ForeignScanState *) planstate,
 										e->pcxt);
@@ -249,6 +254,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
 										 d->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeDSM((IndexScanState *) planstate,
+										   d->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
 											 d->pcxt);
@@ -725,6 +734,9 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 			case T_SeqScanState:
 				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeWorker((IndexScanState *) planstate, toc);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
 												toc);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 97a6fac..2842cf9 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -22,6 +22,9 @@
  *		ExecEndIndexScan		releases all storage.
  *		ExecIndexMarkPos		marks scan position.
  *		ExecIndexRestrPos		restores scan position.
+ *		ExecIndexScanEstimate	estimates DSM space needed for parallel index scan
+ *		ExecIndexScanInitializeDSM initializes DSM for parallel index scan
+ *		ExecIndexScanInitializeWorker attaches to DSM info in parallel worker
  */
 #include "postgres.h"
 
@@ -515,6 +518,15 @@ ExecIndexScan(IndexScanState *node)
 void
 ExecReScanIndexScan(IndexScanState *node)
 {
+	bool		reset_parallel_scan = true;
+
+	/*
+	 * If we are here just to update the scan keys, then don't reset the
+	 * parallel scan.
+	 */
+	if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+		reset_parallel_scan = false;
+
 	/*
 	 * If we are doing runtime key calculations (ie, any of the index key
 	 * values weren't simple Consts), compute the new key values.  But first,
@@ -540,10 +552,16 @@ ExecReScanIndexScan(IndexScanState *node)
 			reorderqueue_pop(node);
 	}
 
-	/* reset index scan */
-	index_rescan(node->iss_ScanDesc,
-				 node->iss_ScanKeys, node->iss_NumScanKeys,
-				 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+	/* reset (parallel) index scan */
+	if (node->iss_ScanDesc)
+	{
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		if (reset_parallel_scan)
+			index_parallelrescan(node->iss_ScanDesc);
+	}
 	node->iss_ReachedEnd = false;
 
 	ExecScanReScan(&node->ss);
@@ -1018,22 +1036,29 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Initialize scan descriptor.
+	 * For a parallel-aware node, we initialize the scan descriptor after
+	 * initializing the shared memory for parallel execution.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
-											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot,
-											   indexstate->iss_NumScanKeys,
+	if (!node->scan.plan.parallel_aware)
+	{
+		/*
+		 * Initialize scan descriptor.
+		 */
+		indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+												indexstate->iss_RelationDesc,
+												   estate->es_snapshot,
+												 indexstate->iss_NumScanKeys,
 											 indexstate->iss_NumOrderByKeys);
 
-	/*
-	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
-	 * index AM.
-	 */
-	if (indexstate->iss_NumRuntimeKeys == 0)
-		index_rescan(indexstate->iss_ScanDesc,
-					 indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
+		/*
+		 * If no run-time keys to calculate, go ahead and pass the scankeys to
+		 * the index AM.
+		 */
+		if (indexstate->iss_NumRuntimeKeys == 0)
+			index_rescan(indexstate->iss_ScanDesc,
+					   indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
 				indexstate->iss_OrderByKeys, indexstate->iss_NumOrderByKeys);
+	}
 
 	/*
 	 * all done.
@@ -1595,3 +1620,91 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 	else if (n_array_keys != 0)
 		elog(ERROR, "ScalarArrayOpExpr index qual found where not allowed");
 }
+
+/* ----------------------------------------------------------------
+ *						Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanEstimate
+ *
+ *		estimates the space required to serialize indexscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanEstimate(IndexScanState *node,
+					  ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+													 estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeDSM
+ *
+ *		Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeDSM(IndexScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+	index_parallelscan_initialize(node->ss.ss_currentRelation,
+								  node->iss_RelationDesc,
+								  estate->es_snapshot,
+								  piscan);
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeWorker
+ *
+ *		Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc)
+{
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 7c017fe..b37ed7e 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -126,7 +126,6 @@ static void subquery_push_qual(Query *subquery,
 static void recurse_push_qual(Node *setOp, Query *topquery,
 				  RangeTblEntry *rte, Index rti, Node *qual);
 static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
-static int	compute_parallel_worker(RelOptInfo *rel, BlockNumber pages);
 
 
 /*
@@ -2872,7 +2871,7 @@ remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
  * relation.  "pages" is the number of pages from the relation that we
  * expect to scan.
  */
-static int
+int
 compute_parallel_worker(RelOptInfo *rel, BlockNumber pages)
 {
 	int			parallel_workers;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 458f139..f4253d3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -400,6 +400,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	List	   *qpquals;
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
+	Cost		cpu_run_cost = 0;
 	Cost		indexStartupCost;
 	Cost		indexTotalCost;
 	Selectivity indexSelectivity;
@@ -602,11 +603,24 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 
-	run_cost += cpu_per_tuple * tuples_fetched;
+	cpu_run_cost += cpu_per_tuple * tuples_fetched;
 
 	/* tlist eval costs are paid per output row, not per tuple scanned */
 	startup_cost += path->path.pathtarget->cost.startup;
-	run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+	cpu_run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+
+	/* Adjust costing for parallelism, if used. */
+	if (path->path.parallel_workers > 0)
+	{
+		double		parallel_divisor = get_parallel_divisor(&path->path);
+
+		path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
+
+		/* The CPU cost is divided among all the workers. */
+		cpu_run_cost /= parallel_divisor;
+	}
+
+	run_cost += cpu_run_cost;
 
 	path->path.startup_cost = startup_cost;
 	path->path.total_cost = startup_cost + run_cost;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 5283468..1905917 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -1042,8 +1042,43 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  NoMovementScanDirection,
 								  index_only_scan,
 								  outer_relids,
-								  loop_count);
+								  loop_count,
+								  0);
 		result = lappend(result, ipath);
+
+		/*
+		 * If appropriate, consider parallel index scan.  We don't allow
+		 * parallel index scan for bitmap scans.
+		 */
+		if (index->amcanparallel &&
+			!index_only_scan &&
+			rel->consider_parallel &&
+			outer_relids == NULL &&
+			scantype != ST_BITMAPSCAN)
+		{
+			int			parallel_workers = 0;
+
+			parallel_workers = compute_parallel_worker(rel, index->pages);
+
+			if (parallel_workers > 0)
+			{
+				ipath = create_index_path(root, index,
+										  index_clauses,
+										  clause_columns,
+										  orderbyclauses,
+										  orderbyclausecols,
+										  useful_pathkeys,
+										  index_is_ordered ?
+										  ForwardScanDirection :
+										  NoMovementScanDirection,
+										  index_only_scan,
+										  outer_relids,
+										  loop_count,
+										  parallel_workers);
+
+				add_partial_path(rel, (Path *) ipath);
+			}
+		}
 	}
 
 	/*
@@ -1066,8 +1101,38 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  BackwardScanDirection,
 									  index_only_scan,
 									  outer_relids,
-									  loop_count);
+									  loop_count,
+									  0);
 			result = lappend(result, ipath);
+
+			/* If appropriate, consider parallel index scan */
+			if (index->amcanparallel &&
+				!index_only_scan &&
+				rel->consider_parallel &&
+				outer_relids == NULL &&
+				scantype != ST_BITMAPSCAN)
+			{
+				int			parallel_workers = 0;
+
+				parallel_workers = compute_parallel_worker(rel, index->pages);
+
+				if (parallel_workers > 0)
+				{
+					ipath = create_index_path(root, index,
+											  index_clauses,
+											  clause_columns,
+											  NIL,
+											  NIL,
+											  useful_pathkeys,
+											  BackwardScanDirection,
+											  index_only_scan,
+											  outer_relids,
+											  loop_count,
+											  parallel_workers);
+
+					add_partial_path(rel, (Path *) ipath);
+				}
+			}
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4b5902f..fd25749 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5393,7 +5393,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	indexScanPath = create_index_path(root, indexInfo,
 									  NIL, NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
-									  NULL, 1.0);
+									  NULL, 1.0, 0);
 
 	return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f440875..cdf5523 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -744,10 +744,8 @@ add_path_precheck(RelOptInfo *parent_rel,
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
  *	  such paths yet.  Hence, GatherPaths must not be created for a rel until
- *	  we're done creating all partial paths for it.  We do not currently build
- *	  partial indexscan paths, so there is no need for an exception for
- *	  IndexPaths here; for safety, we instead Assert that a path to be freed
- *	  isn't an IndexPath.
+ *	  we're done creating all partial paths for it.  As with add_path, we make
+ *	  an exception for IndexPaths here as well.
  */
 void
 add_partial_path(RelOptInfo *parent_rel, Path *new_path)
@@ -826,9 +824,12 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		{
 			parent_rel->partial_pathlist =
 				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
-			/* we should not see IndexPaths here, so always safe to delete */
-			Assert(!IsA(old_path, IndexPath));
-			pfree(old_path);
+
+			/*
+			 * Delete the data pointed-to by the deleted cell, if possible
+			 */
+			if (!IsA(old_path, IndexPath))
+				pfree(old_path);
 			/* p1_prev does not advance */
 		}
 		else
@@ -860,10 +861,9 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	}
 	else
 	{
-		/* we should not see IndexPaths here, so always safe to delete */
-		Assert(!IsA(new_path, IndexPath));
 		/* Reject and recycle the new path */
-		pfree(new_path);
+		if (!IsA(new_path, IndexPath))
+			pfree(new_path);
 	}
 }
 
@@ -1019,7 +1019,8 @@ create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count)
+				  double loop_count,
+				  int parallel_workers)
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
@@ -1031,9 +1032,9 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.pathtarget = rel->reltarget;
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
-	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_aware = (parallel_workers > 0);
 	pathnode->path.parallel_safe = rel->consider_parallel;
-	pathnode->path.parallel_workers = 0;
+	pathnode->path.parallel_workers = parallel_workers;
 	pathnode->path.pathkeys = pathkeys;
 
 	/* Convert clauses to indexquals the executor can handle */
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 7836e6b..4ed2705 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -241,6 +241,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->amoptionalkey = amroutine->amoptionalkey;
 			info->amsearcharray = amroutine->amsearcharray;
 			info->amsearchnulls = amroutine->amsearchnulls;
+			info->amcanparallel = amroutine->amcanparallel;
 			info->amhasgettuple = (amroutine->amgettuple != NULL);
 			info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
 			info->amcostestimate = amroutine->amcostestimate;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 18259ad..7fc8bd9 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -184,6 +184,8 @@ typedef struct IndexAmRoutine
 	bool		amclusterable;
 	/* does AM handle predicate locks? */
 	bool		ampredlocks;
+	/* does AM support parallel scan? */
+	bool		amcanparallel;
 	/* type of data stored in index, or InvalidOid if variable */
 	Oid			amkeytype;
 
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 46d6f45..ea3f3a5 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -14,6 +14,7 @@
 #ifndef NODEINDEXSCAN_H
 #define NODEINDEXSCAN_H
 
+#include "access/parallel.h"
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
@@ -22,6 +23,9 @@ extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
 extern void ExecReScanIndexScan(IndexScanState *node);
+extern void ExecIndexScanEstimate(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeDSM(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc);
 
 /*
  * These routines are exported to share code with nodeIndexonlyscan.c and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1da1e1f..b565f28 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1370,6 +1370,7 @@ typedef struct
  *		SortSupport		   for reordering ORDER BY exprs
  *		OrderByTypByVals   is the datatype of order by expression pass-by-value?
  *		OrderByTypLens	   typlens of the datatypes of order by expressions
+ *		pscan_len		   size of parallel index scan descriptor
  * ----------------
  */
 typedef struct IndexScanState
@@ -1396,6 +1397,9 @@ typedef struct IndexScanState
 	SortSupport iss_SortSupport;
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
+
+	/* This is needed for parallel index scan */
+	Size		iss_PscanLen;
 } IndexScanState;
 
 /* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 643be54..f7ac6f6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -629,6 +629,7 @@ typedef struct IndexOptInfo
 	bool		amsearchnulls;	/* can AM search for NULL/NOT NULL entries? */
 	bool		amhasgettuple;	/* does AM have amgettuple interface? */
 	bool		amhasgetbitmap; /* does AM have amgetbitmap interface? */
+	bool		amcanparallel;	/* does AM support parallel scan? */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
 	void		(*amcostestimate) ();	/* AM's cost estimator */
 } IndexOptInfo;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7b41317..c9884f2 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -47,7 +47,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count);
+				  double loop_count,
+				  int parallel_workers);
 extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						Path *bitmapqual,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 81a9be7..fabf314 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -53,6 +53,7 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 					 List *initial_rels);
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel);
+extern int	compute_parallel_worker(RelOptInfo *rel, BlockNumber pages);
 
 #ifdef OPTIMIZER_DEBUG
 extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 18e21b7..140fb3c 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -99,6 +99,29 @@ explain (costs off)
    ->  Index Only Scan using tenk1_unique1 on tenk1
 (3 rows)
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+                             QUERY PLAN                             
+--------------------------------------------------------------------
+ Finalize Aggregate
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial Aggregate
+               ->  Parallel Index Scan using tenk1_hundred on tenk1
+                     Index Cond: (hundred > 1)
+(6 rows)
+
+select  count((unique1)) from tenk1 where hundred > 1;
+ count 
+-------
+  9800
+(1 row)
+
+reset enable_seqscan;
+reset enable_bitmapscan;
 set force_parallel_mode=1;
 explain (costs off)
   select stringu1::int2 from tenk1 where unique1 = 1;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 8b4090f..d7dfd28 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -39,6 +39,17 @@ explain (costs off)
 	select  sum(parallel_restricted(unique1)) from tenk1
 	group by(parallel_restricted(unique1));
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+select  count((unique1)) from tenk1 where hundred > 1;
+
+reset enable_seqscan;
+reset enable_bitmapscan;
+
 set force_parallel_mode=1;
 
 explain (costs off)
#40Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#32)
Re: Parallel Index Scans

On Wed, Jan 18, 2017 at 6:25 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Mon, Jan 16, 2017 at 11:11 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

Changed as per suggestion.

I have also rebased the optimizer/executor support patch
(parallel_index_opt_exec_support_v4.patch) and added a test case in
it.

Thanks for the patch. Here are comments found during review.

parallel_index_scan_v4.patch:

+ amtarget = (char *) ((void *) target) + offset;

The above calculation can be moved after the NULL check?

+ * index_beginscan_parallel - join parallel index scan

The name and the description don't match; any better description?

+ BTPARALLEL_DONE,
+ BTPARALLEL_IDLE
+} PS_State;

The order of the above two enum values can be changed according to their use.

Changed code as per your suggestion.

parallel_index_opt_exec_support_v4.patch:

+#include "access/parallel.h"

Why is it required to be included in nbtree.c? I didn't find
any code changes in the patch.

Removed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#41Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#37)
Re: Parallel Index Scans

On Thu, Jan 19, 2017 at 12:28 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 16, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Fixed.

With respect to the second patch
(parallel_index_opt_exec_support_v4.patch), I'm hoping it can use the
new function from Dilip's bitmap heap scan patch set. See commit
716c7d4b242f0a64ad8ac4dc48c6fed6557ba12c.

The updated patch uses the function from the above commit.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#42Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#39)
Re: Parallel Index Scans

On Thu, Jan 19, 2017 at 4:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Fixed.

I think that the whole idea of GatherSupportsBackwardScan is wrong.
Gather doesn't support a backwards scan under any circumstances,
whether the underlying node does or not. You can read the tuples
once, in order, and you can't back up. That's what supporting a
backward scan means: you can back up and then move forward again.
It's there to support cursor operations.

Also note this comment in ExecSupportsBackwardsScan, which seems just
as relevant to parallel index scans as anything else:

	/*
	 * Parallel-aware nodes return a subset of the tuples in each worker, and
	 * in general we can't expect to have enough bookkeeping state to know
	 * which ones we returned in this worker as opposed to some other worker.
	 */
	if (node->parallel_aware)
		return false;

If all of that were no issue, the considerations in
TargetListSupportsBackwardScan could be a problem, too. But I think
there shouldn't be any issue having Gather just continue to return
false.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#43Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#34)
1 attachment(s)
Re: Parallel Index Scans

On Wed, Jan 18, 2017 at 8:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Why do we need separate AM support for index_parallelrescan()? Why
can't index_rescan() cover that case?

The reason is that sometimes index_rescan is called when we just have
to update the runtime scankeys in the index, and we don't want to reset
the parallel scan for that.

Why not? I see the code, but the comments don't seem to offer any
justification for that. And it seems wrong to me. If the scan keys
change, surely that only makes sense if we restart the scan. You
can't just blindly continue the same scan if the keys have changed,
can you?

I think the reason that this isn't breaking for you is that it's
difficult or impossible to get a parallel index scan someplace where
the keys would change at runtime. Normally, the parallel scan is on
the driving relation, and so there are no runtime keys. We currently
have no way for a parallel scan to occur on the inner side of a nested
loop unless there's an intervening Gather node - and in that case the
keys can't change without relaunching the workers. It's hard to see
how it could work otherwise. For example, suppose you had something
like this:

Gather
-> Nested Loop
   -> Parallel Seq Scan on a
   -> Hash Join
      -> Seq Scan on b
      -> Parallel Shared Hash
         -> Parallel Index Scan on c
              Index Cond: c.x = a.x

Well, the problem here is that there's nothing to ensure that various
workers who are cooperating to build the hash table all have the same
value for a.x, nor is there any thing to ensure that they'll all get
done with the shared hash table at the same time. So this is just
chaos. I think we have to disallow this kind of plan as nonsensical.
Given that, I'm not sure a reset of the runtime keys can ever really
happen. Have you investigated this?

I extracted the generic portions of this infrastructure (i.e. not the
btree-specific stuff) and spent some time working on it today. The
big thing I fixed was the documentation, which you added in a fairly
illogical part of the file. You had all of the new functions for
supporting parallel scans in a section that explicitly says it's for
mandatory functions, and which precedes the section on regular
non-parallel scans. I moved the documentation for all three new
methods to the same place, added some explanation of parallel scans in
general, and rewrote the descriptions for the individual functions to
be more clear. Also, in indexam.c, I adjusted things to use
RELATION_CHECKS in a couple of places, did some work on comments and
coding style, and fixed a place that should have used the new
OffsetToPointer macro but instead hand-rolled the thing with the casts
backwards. Adding an integer to a value of type "void *" does not
work on all compilers. The patch I ended up with is attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel-generic-index-scan.patch
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 06077af..f5ee7bc 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -131,8 +131,11 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = bloptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = blvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = blbeginscan;
 	amroutine->amrescan = blrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = blgetbitmap;
 	amroutine->amendscan = blendscan;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 40f201b..e8ec8ea 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -124,8 +124,11 @@ typedef struct IndexAmRoutine
     amoptions_function amoptions;
     amproperty_function amproperty;     /* can be NULL */
     amvalidate_function amvalidate;
+    amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
+    aminitparallelscan_function aminitparallelscan;    /* can be NULL */
     ambeginscan_function ambeginscan;
     amrescan_function amrescan;
+    amparallelrescan_function amparallelrescan;    /* can be NULL */
     amgettuple_function amgettuple;     /* can be NULL */
     amgetbitmap_function amgetbitmap;   /* can be NULL */
     amendscan_function amendscan;
@@ -624,6 +627,68 @@ amrestrpos (IndexScanDesc scan);
    the <structfield>amrestrpos</> field in its <structname>IndexAmRoutine</>
    struct may be set to NULL.
   </para>
+
+  <para>
+   In addition to supporting ordinary index scans, some types of index
+   may wish to support <firstterm>parallel index scans</>, which allow
+   multiple backends to cooperate in performing an index scan.  The
+   index access method should arrange things so that each cooperating
+   process returns a subset of the tuples that would be performed by
+   an ordinary, non-parallel index scan, but in such a way that the
+   union of those subsets is equal to the set of tuples that would be
+   returned by an ordinary, non-parallel index scan.  Furthermore, while
+   there need not be any global ordering of tuples returned by a parallel
+   scan, the ordering of that subset of tuples returned within each
+   cooperating backend must match the requested ordering.  The following
+   functions may be implemented to support parallel index scans:
+  </para>
+
+  <para>
+<programlisting>
+Size
+amestimateparallelscan (void);
+</programlisting>
+   Estimate and return the number of bytes of dynamic shared memory which
+   the access method will need to perform a parallel scan.  (This number
+   is in addition to, not in lieu of, the amount of space needed for
+   AM-independent data in <structname>ParallelIndexScanDescData</>.)
+  </para>
+
+  <para>
+   It is not necessary to implement this function for access methods which
+   do not support parallel scans or for which the number of additional bytes
+   of storage required is zero.
+  </para>
+
+  <para>
+<programlisting>
+void
+aminitparallelscan (void *target);
+</programlisting>
+   This function will be called to initialize dynamic shared memory at the
+   beginning of a parallel scan.  <parameter>target</> will point to at least
+   the number of bytes previously returned by
+   <function>amestimateparallelscan</>, and this function may use that
+   amount of space to store whatever data it wishes.
+  </para>
+
+  <para>
+   It is not necessary to implement this function for access methods which
+   do not support parallel scans or in cases where the shared memory space
+   required needs no initialization.
+  </para>
+
+  <para>
+<programlisting>
+void
+amparallelrescan (IndexScanDesc scan);
+</programlisting>
+   This function, if implemented, will be called when a parallel index scan
+   must be restarted.  It should reset any shared state set up by
+   <function>aminitparallelscan</> such that the scan will be restarted from
+   the beginning.
+  </para>
+
  </sect1>
 
  <sect1 id="index-scanning">
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d60ddd2..826f3cc 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -105,8 +105,11 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = brinoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = brinvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = brinbeginscan;
 	amroutine->amrescan = brinrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = bringetbitmap;
 	amroutine->amendscan = brinendscan;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 3909638..a80039f 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -61,8 +61,11 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = ginoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = ginvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = ginbeginscan;
 	amroutine->amrescan = ginrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = NULL;
 	amroutine->amgetbitmap = gingetbitmap;
 	amroutine->amendscan = ginendscan;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 597056a..54f9dc4 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -82,8 +82,11 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = gistoptions;
 	amroutine->amproperty = gistproperty;
 	amroutine->amvalidate = gistvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = gistbeginscan;
 	amroutine->amrescan = gistrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = gistgettuple;
 	amroutine->amgetbitmap = gistgetbitmap;
 	amroutine->amendscan = gistendscan;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index a64a9b9..17fd125 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -79,8 +79,11 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = hashoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = hashvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = hashbeginscan;
 	amroutine->amrescan = hashrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = hashgettuple;
 	amroutine->amgetbitmap = hashgetbitmap;
 	amroutine->amendscan = hashendscan;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4822af9..f8c95ee 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -120,7 +120,8 @@ do { \
 } while(0)
 
 static IndexScanDesc index_beginscan_internal(Relation indexRelation,
-						 int nkeys, int norderbys, Snapshot snapshot);
+						 int nkeys, int norderbys, Snapshot snapshot,
+						 ParallelIndexScanDesc pscan, bool temp_snap);
 
 
 /* ----------------------------------------------------------------
@@ -207,6 +208,101 @@ index_insert(Relation indexRelation,
 }
 
 /*
+ * index_parallelscan_estimate - estimate shared memory for parallel scan
+ *
+ * Currently, we don't pass any information to the AM-specific estimator,
+ * so it can probably only return a constant.  In the future, we might need
+ * to pass more information.
+ */
+Size
+index_parallelscan_estimate(Relation indexRelation, Snapshot snapshot)
+{
+	Size		nbytes;
+
+	RELATION_CHECKS;
+
+	nbytes = offsetof(ParallelIndexScanDescData, ps_snapshot_data);
+	nbytes = add_size(nbytes, EstimateSnapshotSpace(snapshot));
+	nbytes = MAXALIGN(nbytes);
+
+	/*
+	 * If amestimateparallelscan is not provided, assume there is no
+	 * AM-specific data needed.  (It's hard to believe that could work, but
+	 * it's easy enough to cater to it here.)
+	 */
+	if (indexRelation->rd_amroutine->amestimateparallelscan != NULL)
+		nbytes = add_size(nbytes,
+					  indexRelation->rd_amroutine->amestimateparallelscan());
+
+	return nbytes;
+}
+
+/*
+ * index_parallelscan_initialize - initialize parallel scan
+ *
+ * We initialize both the ParallelIndexScanDesc proper and the AM-specific
+ * information which follows it.
+ *
+ * This function calls the access-method-specific initialization routine to
+ * initialize AM-specific information.  Call this just once in the leader
+ * process; then, individual workers attach via index_beginscan_parallel.
+ */
+void
+index_parallelscan_initialize(Relation heapRelation, Relation indexRelation,
+							  Snapshot snapshot, ParallelIndexScanDesc target)
+{
+	Size		offset;
+
+	RELATION_CHECKS;
+
+	offset = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
+					  EstimateSnapshotSpace(snapshot));
+	offset = MAXALIGN(offset);
+
+	target->ps_relid = RelationGetRelid(heapRelation);
+	target->ps_indexid = RelationGetRelid(indexRelation);
+	target->ps_offset = offset;
+	SerializeSnapshot(snapshot, target->ps_snapshot_data);
+
+	/* aminitparallelscan is optional; assume no-op if not provided by AM */
+	if (indexRelation->rd_amroutine->aminitparallelscan != NULL)
+	{
+		void	   *amtarget;
+
+		amtarget = OffsetToPointer(target, offset);
+		indexRelation->rd_amroutine->aminitparallelscan(amtarget);
+	}
+}
+
+/*
+ * index_beginscan_parallel - join parallel index scan
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
+						 int norderbys, ParallelIndexScanDesc pscan)
+{
+	Snapshot	snapshot;
+	IndexScanDesc scan;
+
+	Assert(RelationGetRelid(heaprel) == pscan->ps_relid);
+	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
+	RegisterSnapshot(snapshot);
+	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
+									pscan, true);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by index_beginscan_internal.
+	 */
+	scan->heapRelation = heaprel;
+	scan->xs_snapshot = snapshot;
+
+	return scan;
+}
+
+/*
  * index_beginscan - start a scan of an index with amgettuple
  *
  * Caller must be holding suitable locks on the heap and the index.
@@ -219,7 +315,7 @@ index_beginscan(Relation heapRelation,
 {
 	IndexScanDesc scan;
 
-	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot);
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -244,7 +340,7 @@ index_beginscan_bitmap(Relation indexRelation,
 {
 	IndexScanDesc scan;
 
-	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot);
+	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -260,8 +356,11 @@ index_beginscan_bitmap(Relation indexRelation,
  */
 static IndexScanDesc
 index_beginscan_internal(Relation indexRelation,
-						 int nkeys, int norderbys, Snapshot snapshot)
+						 int nkeys, int norderbys, Snapshot snapshot,
+						 ParallelIndexScanDesc pscan, bool temp_snap)
 {
+	IndexScanDesc scan;
+
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(ambeginscan);
 
@@ -276,8 +375,13 @@ index_beginscan_internal(Relation indexRelation,
 	/*
 	 * Tell the AM to open a scan.
 	 */
-	return indexRelation->rd_amroutine->ambeginscan(indexRelation, nkeys,
+	scan = indexRelation->rd_amroutine->ambeginscan(indexRelation, nkeys,
 													norderbys);
+	/* Initialize information for parallel scan. */
+	scan->parallel_scan = pscan;
+	scan->xs_temp_snap = temp_snap;
+
+	return scan;
 }
 
 /* ----------------
@@ -319,6 +423,20 @@ index_rescan(IndexScanDesc scan,
 }
 
 /* ----------------
+ *		index_parallelrescan  - (re)start a parallel scan of an index
+ * ----------------
+ */
+void
+index_parallelrescan(IndexScanDesc scan)
+{
+	SCAN_CHECKS;
+
+	/* amparallelrescan is optional; assume no-op if not provided by AM */
+	if (scan->indexRelation->rd_amroutine->amparallelrescan != NULL)
+		scan->indexRelation->rd_amroutine->amparallelrescan(scan);
+}
+
+/* ----------------
  *		index_endscan - end a scan
  * ----------------
  */
@@ -341,6 +459,9 @@ index_endscan(IndexScanDesc scan)
 	/* Release index refcount acquired by index_beginscan */
 	RelationDecrementReferenceCount(scan->indexRelation);
 
+	if (scan->xs_temp_snap)
+		UnregisterSnapshot(scan->xs_snapshot);
+
 	/* Release the scan data structure itself */
 	IndexScanEnd(scan);
 }
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1bb1acf..480a3a0 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -111,8 +111,11 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = btoptions;
 	amroutine->amproperty = btproperty;
 	amroutine->amvalidate = btvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = btbeginscan;
 	amroutine->amrescan = btrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = btgettuple;
 	amroutine->amgetbitmap = btgetbitmap;
 	amroutine->amendscan = btendscan;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index ca4b0bd..1729797 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -61,8 +61,11 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amoptions = spgoptions;
 	amroutine->amproperty = NULL;
 	amroutine->amvalidate = spgvalidate;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
 	amroutine->ambeginscan = spgbeginscan;
 	amroutine->amrescan = spgrescan;
+	amroutine->amparallelrescan = NULL;
 	amroutine->amgettuple = spggettuple;
 	amroutine->amgetbitmap = spggetbitmap;
 	amroutine->amendscan = spgendscan;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6a5f279..18259ad 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -108,6 +108,12 @@ typedef bool (*amproperty_function) (Oid index_oid, int attno,
 /* validate definition of an opclass for this AM */
 typedef bool (*amvalidate_function) (Oid opclassoid);
 
+/* estimate size of parallel scan descriptor */
+typedef Size (*amestimateparallelscan_function) (void);
+
+/* prepare for parallel index scan */
+typedef void (*aminitparallelscan_function) (void *target);
+
 /* prepare for index scan */
 typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
 														   int nkeys,
@@ -120,6 +126,9 @@ typedef void (*amrescan_function) (IndexScanDesc scan,
 											   ScanKey orderbys,
 											   int norderbys);
 
+/* (re)start parallel index scan */
+typedef void (*amparallelrescan_function) (IndexScanDesc scan);
+
 /* next valid tuple */
 typedef bool (*amgettuple_function) (IndexScanDesc scan,
 												 ScanDirection direction);
@@ -189,8 +198,11 @@ typedef struct IndexAmRoutine
 	amoptions_function amoptions;
 	amproperty_function amproperty;		/* can be NULL */
 	amvalidate_function amvalidate;
+	amestimateparallelscan_function amestimateparallelscan;		/* can be NULL */
+	aminitparallelscan_function aminitparallelscan;		/* can be NULL */
 	ambeginscan_function ambeginscan;
 	amrescan_function amrescan;
+	amparallelrescan_function amparallelrescan; /* can be NULL */
 	amgettuple_function amgettuple;		/* can be NULL */
 	amgetbitmap_function amgetbitmap;	/* can be NULL */
 	amendscan_function amendscan;
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index b2e078a..d2258f6 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -83,6 +83,8 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 typedef struct IndexScanDescData *IndexScanDesc;
 typedef struct SysScanDescData *SysScanDesc;
 
+typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
+
 /*
  * Enumeration specifying the type of uniqueness check to perform in
  * index_insert().
@@ -131,6 +133,13 @@ extern bool index_insert(Relation indexRelation,
 			 Relation heapRelation,
 			 IndexUniqueCheck checkUnique);
 
+extern Size index_parallelscan_estimate(Relation indexrel, Snapshot snapshot);
+extern void index_parallelscan_initialize(Relation heaprel, Relation indexrel,
+							Snapshot snapshot, ParallelIndexScanDesc target);
+extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
+						 Relation indexrel, int nkeys, int norderbys,
+						 ParallelIndexScanDesc pscan);
+
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
@@ -141,6 +150,7 @@ extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 extern void index_rescan(IndexScanDesc scan,
 			 ScanKey keys, int nkeys,
 			 ScanKey orderbys, int norderbys);
+extern void index_parallelrescan(IndexScanDesc scan);
 extern void index_endscan(IndexScanDesc scan);
 extern void index_markpos(IndexScanDesc scan);
 extern void index_restrpos(IndexScanDesc scan);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8746045..ce3ca8d 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -93,6 +93,7 @@ typedef struct IndexScanDescData
 	ScanKey		keyData;		/* array of index qualifier descriptors */
 	ScanKey		orderByData;	/* array of ordering op descriptors */
 	bool		xs_want_itup;	/* caller requests index tuples */
+	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
 
 	/* signaling to index AM about killing index tuples */
 	bool		kill_prior_tuple;		/* last-returned tuple is dead */
@@ -126,8 +127,20 @@ typedef struct IndexScanDescData
 
 	/* state data for traversing HOT chains in index_getnext */
 	bool		xs_continue_hot;	/* T if must keep walking HOT chain */
+
+	/* parallel index scan information, in shared memory */
+	ParallelIndexScanDesc parallel_scan;
 }	IndexScanDescData;
 
+/* Generic structure for parallel scans */
+typedef struct ParallelIndexScanDescData
+{
+	Oid			ps_relid;
+	Oid			ps_indexid;
+	Size		ps_offset;		/* Offset in bytes of am specific structure */
+	char		ps_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+} ParallelIndexScanDescData;
+
 /* Struct for heap-or-index scans of system tables */
 typedef struct SysScanDescData
 {
diff --git a/src/include/c.h b/src/include/c.h
index efbb77f..a2c043a 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -527,6 +527,9 @@ typedef NameData *Name;
 #define PointerIsAligned(pointer, type) \
 		(((uintptr_t)(pointer) % (sizeof (type))) == 0)
 
+#define OffsetToPointer(base, offset) \
+		((void *)((char *) base + offset))
+
 #define OidIsValid(objectId)  ((bool) ((objectId) != InvalidOid))
 
 #define RegProcedureIsValid(p)	OidIsValid(p)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 993880d..c4235ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1264,6 +1264,8 @@ OverrideSearchPath
 OverrideStackEntry
 PACE_HEADER
 PACL
+ParallelIndexScanDesc
+ParallelIndexScanDescData
 PATH
 PBOOL
 PCtxtHandle
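
To orient readers, the entry points added in indexam.c above are meant to be used in a leader-and-workers pattern, much like the parallel heap scan machinery. The fragment below is only a rough sketch of that flow based on the function comments in the patch; the shm_toc key name and the surrounding executor glue are illustrative and not part of this patch.

/*
 * Illustrative sketch only; the real executor changes live in the
 * separate executor-support patch.  PARALLEL_KEY_INDEX_SCAN is a
 * made-up shm_toc key for this example.
 */

/* Leader, while sizing and filling the DSM segment: */
Size		sz = index_parallelscan_estimate(indexrel, snapshot);
ParallelIndexScanDesc piscan =
	(ParallelIndexScanDesc) shm_toc_allocate(pcxt->toc, sz);

index_parallelscan_initialize(heaprel, indexrel, snapshot, piscan);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INDEX_SCAN, piscan);

/* Leader and each worker, to join the scan: */
IndexScanDesc scan = index_beginscan_parallel(heaprel, indexrel,
											  nkeys, norderbys, piscan);
index_rescan(scan, scankeys, nkeys, orderbys, norderbys);
/* ... fetch tuples with index_getnext() as usual, then index_endscan() ... */

/* For a genuine rescan, the shared state is reset before scanning again: */
index_parallelrescan(scan);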
#44Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#35)
Re: Parallel Index Scans

On Thu, Jan 19, 2017 at 1:18 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Wed, Jan 18, 2017 at 6:25 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Mon, Jan 16, 2017 at 11:11 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

+ * index_beginscan_parallel - join parallel index scan

The name and the description don't match properly; any better
description?

This can be called by both the workers and the leader of a parallel index
scan. What problem do you see with it? heap_beginscan_parallel has a
similar description, so I am not sure changing it here alone makes sense.

Ok.

+extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);

+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);

Any better names for the above functions, as these functions will
provide/set the next page that needs to be read?

These functions also set the state of the scan. IIRC, these names were
agreed between Robert and Rahila as well (suggested offlist by Robert).
I am open to changing them if you or others have any better suggestions.

I didn't find any better names other than the following:

_bt_get_next_parallel_page
_bt_set_next_parallel_page
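
Whatever names are chosen, the declarations quoted above suggest the following usage pattern for a cooperating backend. This is only a sketch built on the signatures shown here, assuming the return value is the next block to scan and *status reports whether the parallel scan is still active; the page-handling helpers are hypothetical.

/* Sketch only: not the actual nbtree code from the patch. */
static void
scan_leaf_pages_in_parallel(IndexScanDesc scan)
{
	bool		status;
	BlockNumber blkno;

	for (;;)
	{
		/* Claim the next leaf page (may wait on the condition variable). */
		blkno = _bt_parallel_seize(scan, &status);
		if (!status || !BlockNumberIsValid(blkno))
			break;				/* parallel scan is done */

		/*
		 * Read the claimed page and immediately publish its successor, so
		 * the next waiting participant can proceed while this backend is
		 * still returning tuples from its own page.
		 */
		read_leaf_page(scan, blkno);	/* hypothetical helper */
		_bt_parallel_release(scan, right_sibling(scan, blkno));	/* right_sibling: hypothetical */

		emit_matching_tuples(scan, blkno);	/* hypothetical helper */
	}
}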

+ /* reset (parallel) index scan */
+ if (node->iss_ScanDesc)
+ {

Why is this if check required? There is an assert check in later
function calls.

This is required because we don't initialize the scan descriptor for
parallel-aware nodes during ExecInitIndexScan. It gets initialized
later, at execution time, when we initialize the DSM. Now, it is quite
possible that a Gather node occurs on the inner side of a join, in which
case Rescan can be called before execution even starts. This is the
reason why we have a similar check in ExecReScanSeqScan, which was added
as part of parallel sequential scans (f0661c4e). Does that answer
your question?

Thanks for the details, got it.

Regards,
Hari Babu
Fujitsu Australia

#45Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#42)
Re: Parallel Index Scans

On Fri, Jan 20, 2017 at 12:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Jan 19, 2017 at 4:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Fixed.

If all of that were no issue, the considerations in
TargetListSupportsBackwardScan could be a problem, too. But I think
there shouldn't be any issue having Gather just continue to return
false.

You are right. I had added that code under the assumption that if the
underlying node (in this case the index scan) can support a backward
scan, Gather can also support it. I forgot/missed that
ExecSupportsBackwardScan is there to support cursor operations. Will fix
it in the next version of the patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#46Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#43)
Re: Parallel Index Scans

On Fri, Jan 20, 2017 at 3:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jan 18, 2017 at 8:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Why do we need separate AM support for index_parallelrescan()? Why
can't index_rescan() cover that case?

The reason is that sometimes index_rescan is called when we just have to
update the runtime scankeys in the index, and we don't want to reset the
parallel scan for that.

Why not? I see the code, but the comments don't seem to offer any
justification for that. And it seems wrong to me. If the scan keys
change, surely that only makes sense if we restart the scan. You
can't just blindly continue the same scan if the keys have changed,
can you?

Sure, if the scan keys have changed then we can't continue, but this is
the case where the runtime keys are being initialized for the first time.

if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)

In the above check, the second part (!node->iss_RuntimeKeysReady)
ensures that it is the first-time initialization. Now, let me give you
an example of what can go wrong if we allow resetting the parallel scan
in this case. Consider a query like select * from t1 where c1 <
parallel_index(10); here, if we allow resetting the parallel scan
descriptor during the first-time initialization of the runtime keys, it
can easily corrupt the parallel scan state. Suppose the leader has taken
the lead and is scanning some page while a worker reaches
ExecReScanIndexScan() to initialize its keys; if the worker resets the
parallel scan, it will corrupt the shared scan state.

If you agree with the above explanation, then I will expand the
comments in the next update.
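
To make the distinction concrete, here is a simplified sketch of the logic being described for ExecReScanIndexScan. The field names follow the snippets quoted in this thread; the runtime-key evaluation helper is a placeholder, and the actual patch may structure this differently.

/* Simplified sketch, not the actual patch code. */
static void
sketch_ExecReScanIndexScan(IndexScanState *node)
{
	bool		reset_parallel_scan = true;

	/*
	 * If we are here only to install the runtime keys for the first time,
	 * other participants may already be scanning, so the shared parallel
	 * state must be left alone.  Otherwise this is a real rescan.
	 */
	if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
		reset_parallel_scan = false;

	if (node->iss_NumRuntimeKeys != 0)
		evaluate_runtime_keys(node);	/* placeholder for key evaluation */
	node->iss_RuntimeKeysReady = true;

	/*
	 * For parallel-aware nodes the scan descriptor is created only once the
	 * DSM is set up, so it may not exist yet if a rescan arrives before
	 * execution starts (e.g. a Gather on the inner side of a join).
	 */
	if (node->iss_ScanDesc)
	{
		index_rescan(node->iss_ScanDesc,
					 node->iss_ScanKeys, node->iss_NumScanKeys,
					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);

		if (reset_parallel_scan && node->iss_ScanDesc->parallel_scan)
			index_parallelrescan(node->iss_ScanDesc);
	}
}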

Just in case you want to debug the above query, below is the schema
and necessary steps.

create or replace function parallel_index(a integer) returns integer
as $$
begin
return a + 1;
end;
$$ language plpgsql STABLE PARALLEL SAFE;

create table t1(c1 int, c2 char(20));
insert into t1 values(generate_series(1,300000),'aaa');
create index idx_t1 on t1 (c1);

set max_parallel_workers_per_gather=1;
set parallel_setup_cost=0;
set parallel_tuple_cost=0;
set min_parallel_relation_size=0;
set enable_bitmapscan=off;
set enable_seqscan=off;

select * from t1 where c1 < parallel_index(1000);

I think the reason that this isn't breaking for you is that it's
difficult or impossible to get a parallel index scan someplace where
the keys would change at runtime. Normally, the parallel scan is on
the driving relation, and so there are no runtime keys. We currently
have no way for a parallel scan to occur on the inner side of a nested
loop unless there's an intervening Gather node - and in that case the
keys can't change without relaunching the workers. It's hard to see
how it could work otherwise. For example, suppose you had something
like this:

Gather
  -> Nested Loop
       -> Parallel Seq Scan on a
       -> Hash Join
            -> Seq Scan on b
            -> Parallel Shared Hash
                 -> Parallel Index Scan on c
                      Index Cond: c.x = a.x

Well, the problem here is that there's nothing to ensure that various
workers who are cooperating to build the hash table all have the same
value for a.x, nor is there anything to ensure that they'll all get
done with the shared hash table at the same time. So this is just
chaos. I think we have to disallow this kind of plan as nonsensical.
Given that, I'm not sure a reset of the runtime keys can ever really
happen. Have you investigated this?

Having parallelism on the right side under Gather will only be possible
after Robert's parallel hash patch, so maybe some investigation is
needed when we review that patch. Let me know if you want me to
investigate anything more after my explanation above.

I extracted the generic portions of this infrastructure (i.e. not the
btree-specific stuff) and spent some time working on it today. The
big thing I fixed was the documentation, which you added in a fairly
illogical part of the file.

Hmm, it is not illogical. All the functions are described in the same
order as they are declared in the IndexAmRoutine structure, and I have
followed that order. I think both amestimateparallelscan and
aminitparallelscan should be added one paragraph down, in the part which
says (The purpose of an index .. The scan-related functions that an
index access method must or may provide are:).

You had all of the new functions for
supporting parallel scans in a section that explicitly says it's for
mandatory functions, and which precedes the section on regular
non-parallel scans.

I think that section should say "must or may provide" instead of "must
provide" as the functions amcanreturn and amproperty are optional and
are described in that section.

I moved the documentation for all three new
methods to the same place, added some explanation of parallel scans in
general, and rewrote the descriptions for the individual functions to
be more clear.

I think the place where you have added these new functions breaks the
existing convention, which is to describe them in the order they are
declared in IndexAmRoutine. Apart from that, the extracted patch looks
good to me.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#47Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#46)
Re: Parallel Index Scans

On Fri, Jan 20, 2017 at 9:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Sure, if the scan keys have changed then we can't continue, but this is
the case where the runtime keys are being initialized for the first time.

if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)

In the above check, the second part (!node->iss_RuntimeKeysReady)
ensures that it is the first-time initialization. Now, let me give you
an example of what can go wrong if we allow resetting the parallel scan
in this case. Consider a query like select * from t1 where c1 <
parallel_index(10); here, if we allow resetting the parallel scan
descriptor during the first-time initialization of the runtime keys, it
can easily corrupt the parallel scan state. Suppose the leader has taken
the lead and is scanning some page while a worker reaches
ExecReScanIndexScan() to initialize its keys; if the worker resets the
parallel scan, it will corrupt the shared scan state.

Hmm, I see. So the problem if I understand it correctly is that every
participating process needs to update the backend-private state for
the runtime keys and only one of those processes can update the shared
state. But in the case of a "real" rescan, even the shared state
needs to be reset. OK, makes sense.

Why does btparallelrescan cater to the case where scan->parallel_scan
== NULL? I would assume it should never get called in that case.
Also, I think ExecReScanIndexScan needs significantly better comments.
After some thought I see what it's doing - mostly anyway - but I was
quite confused at first. I still don't completely understand why it
needs this if-test:

+       /* reset (parallel) index scan */
+       if (node->iss_ScanDesc)
+       {

I extracted the generic portions of this infrastructure (i.e. not the
btree-specific stuff) and spent some time working on it today. The
big thing I fixed was the documentation, which you added in a fairly
illogical part of the file.

Hmm, it is not illogical. All the functions are described in the same
order as they are declared in IndexAmRoutine structure and I have
followed the same.

I see. Sorry, I didn't realize that was what you were going for.

I think both amestimateparallelscan and
aminitparallelscan should be added one para down which says (The
purpose of an index .. The scan-related functions that an index access
method must or may provide are:).

I think it's a good idea to put all three of those functions together
in the listing, similar to what we did in
69d34408e5e7adcef8ef2f4e9c4f2919637e9a06 for FDWs. After all they are
closely related in purpose, and it may be easiest to understand if
they are next to each other in the listing. I suggest that we move
them to the end in IndexAmRoutine similar to the way FdwRoutine was
done; in other words, my idea is to make the structure consistent with
the way that I revised the documentation instead of making the
documentation consistent with the order you picked for the structure
members. What I like about that is that it gives a good opportunity
to include some general remarks on parallel index scans in a central
place, as I did in that patch. Also, it makes it easier for people
who care about parallel index scans to find all of the related things
(since they are together) and for people who don't care about them to
ignore it all (for the same reason). What do you think about that
approach?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#48Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#47)
1 attachment(s)
Re: Parallel Index Scans

On Sat, Jan 21, 2017 at 1:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jan 20, 2017 at 9:29 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Sure, if the scan keys have changed then we can't continue, but this is
the case where the runtime keys are being initialized for the first time.

if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)

In the above check, the second part (!node->iss_RuntimeKeysReady)
ensures that it is the first-time initialization. Now, let me give you
an example of what can go wrong if we allow resetting the parallel scan
in this case. Consider a query like select * from t1 where c1 <
parallel_index(10); here, if we allow resetting the parallel scan
descriptor during the first-time initialization of the runtime keys, it
can easily corrupt the parallel scan state. Suppose the leader has taken
the lead and is scanning some page while a worker reaches
ExecReScanIndexScan() to initialize its keys; if the worker resets the
parallel scan, it will corrupt the shared scan state.

Hmm, I see. So the problem if I understand it correctly is that every
participating process needs to update the backend-private state for
the runtime keys and only one of those processes can update the shared
state. But in the case of a "real" rescan, even the shared state
needs to be reset. OK, makes sense.

Exactly.

Why does btparallelrescan cater to the case where scan->parallel_scan
== NULL? I would assume it should never get called in that case.

Okay, will modify the patch accordingly.

Also, I think ExecReScanIndexScan needs significantly better comments.
After some thought I see what it's doing - mostly anyway - but I was
quite confused at first. I still don't completely understand why it
needs this if-test:

+       /* reset (parallel) index scan */
+       if (node->iss_ScanDesc)
+       {

I have mentioned the reason towards the end of the e-mail [1]/messages/by-id/CAA4eK1+nBiCxtxcNuzpaiN+nrRrRB5YDgoaqb3hyn=YUxL-+Ow@mail.gmail.com
(refer to the line "This is required because ..").  Basically, this is
required to make plans like the one below work sanely.

Nested Loop
  -> Seq Scan on a
  -> Gather
       -> Parallel Index Scan on b
            Index Cond: b.x = 15

I understand that such plans don't make much sense, but we do support
them and I have seen somewhat similar plan getting select in TPC-H
benchmark Let me know if this needs more explanation.

I think it's a good idea to put all three of those functions together
in the listing, similar to what we did in
69d34408e5e7adcef8ef2f4e9c4f2919637e9a06 for FDWs. After all they are
closely related in purpose, and it may be easiest to understand if
they are next to each other in the listing. I suggest that we move
them to the end in IndexAmRoutine similar to the way FdwRoutine was
done; in other words, my idea is to make the structure consistent with
the way that I revised the documentation instead of making the
documentation consistent with the order you picked for the structure
members. What I like about that is that it gives a good opportunity
to include some general remarks on parallel index scans in a central
place, as I did in that patch. Also, it makes it easier for people
who care about parallel index scans to find all of the related things
(since they are together) and for people who don't care about them to
ignore it all (for the same reason). What do you think about that
approach?

Sounds sensible. Updated patch based on that approach is attached. I
will rebase the remaining work based on this patch and send them
separately.

[1]: /messages/by-id/CAA4eK1+nBiCxtxcNuzpaiN+nrRrRB5YDgoaqb3hyn=YUxL-+Ow@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel-generic-index-scan.1.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 06077af..858798d 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -138,6 +138,9 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = blendscan;
 	amroutine->ammarkpos = NULL;
 	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 40f201b..5d8e557 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -131,6 +131,11 @@ typedef struct IndexAmRoutine
     amendscan_function amendscan;
     ammarkpos_function ammarkpos;       /* can be NULL */
     amrestrpos_function amrestrpos;     /* can be NULL */
+
+    /* interface functions to support parallel index scans */
+    amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
+    aminitparallelscan_function aminitparallelscan;    /* can be NULL */
+    amparallelrescan_function amparallelrescan;    /* can be NULL */
 } IndexAmRoutine;
 </programlisting>
   </para>
@@ -624,6 +629,68 @@ amrestrpos (IndexScanDesc scan);
    the <structfield>amrestrpos</> field in its <structname>IndexAmRoutine</>
    struct may be set to NULL.
   </para>
+
+  <para>
+   In addition to supporting ordinary index scans, some types of index
+   may wish to support <firstterm>parallel index scans</>, which allow
+   multiple backends to cooperate in performing an index scan.  The
+   index access method should arrange things so that each cooperating
+   process returns a subset of the tuples that would be returned by
+   an ordinary, non-parallel index scan, but in such a way that the
+   union of those subsets is equal to the set of tuples that would be
+   returned by an ordinary, non-parallel index scan.  Furthermore, while
+   there need not be any global ordering of tuples returned by a parallel
+   scan, the ordering of that subset of tuples returned within each
+   cooperating backend must match the requested ordering.  The following
+   functions may be implemented to support parallel index scans:
+  </para>
+
+  <para>
+<programlisting>
+Size
+amestimateparallelscan (void);
+</programlisting>
+   Estimate and return the number of bytes of dynamic shared memory which
+   the access method will need to perform a parallel scan.  (This number
+   is in addition to, not in lieu of, the amount of space needed for
+   AM-independent data in <structname>ParallelIndexScanDescData</>.)
+  </para>
+
+  <para>
+   It is not necessary to implement this function for access methods which
+   do not support parallel scans or for which the number of additional bytes
+   of storage required is zero.
+  </para>
+
+  <para>
+<programlisting>
+void
+aminitparallelscan (void *target);
+</programlisting>
+   This function will be called to initialize dynamic shared memory at the
+   beginning of a parallel scan.  <parameter>target</> will point to at least
+   the number of bytes previously returned by
+   <function>amestimateparallelscan</>, and this function may use that
+   amount of space to store whatever data it wishes.
+  </para>
+
+  <para>
+   It is not necessary to implement this function for access methods which
+   do not support parallel scans or in cases where the shared memory space
+   required needs no initialization.
+  </para>
+
+  <para>
+<programlisting>
+void
+amparallelrescan (IndexScanDesc scan);
+</programlisting>
+   This function, if implemented, will be called when a parallel index scan
+   must be restarted.  It should reset any shared state set up by
+   <function>aminitparallelscan</> such that the scan will be restarted from
+   the beginning.
+  </para>
+
  </sect1>
 
  <sect1 id="index-scanning">
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d60ddd2..b2afdb7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -112,6 +112,9 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = brinendscan;
 	amroutine->ammarkpos = NULL;
 	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 3909638..02d920b 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -68,6 +68,9 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = ginendscan;
 	amroutine->ammarkpos = NULL;
 	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 597056a..c2247ad 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -89,6 +89,9 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = gistendscan;
 	amroutine->ammarkpos = NULL;
 	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index a64a9b9..ec8ed33 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -86,6 +86,9 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = hashendscan;
 	amroutine->ammarkpos = NULL;
 	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4822af9..ba27c1e 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -20,6 +20,10 @@
  *		index_insert	- insert an index tuple into a relation
  *		index_markpos	- mark a scan position
  *		index_restrpos	- restore a scan position
+ *		index_parallelscan_estimate - estimate shared memory for parallel scan
+ *		index_parallelscan_initialize - initialize parallel scan
+ *		index_parallelrescan  - (re)start a parallel scan of an index
+ *		index_beginscan_parallel - join parallel index scan
  *		index_getnext_tid	- get the next TID from a scan
  *		index_fetch_heap		- get the scan's next heap tuple
  *		index_getnext	- get the next heap tuple from a scan
@@ -120,7 +124,8 @@ do { \
 } while(0)
 
 static IndexScanDesc index_beginscan_internal(Relation indexRelation,
-						 int nkeys, int norderbys, Snapshot snapshot);
+						 int nkeys, int norderbys, Snapshot snapshot,
+						 ParallelIndexScanDesc pscan, bool temp_snap);
 
 
 /* ----------------------------------------------------------------
@@ -219,7 +224,7 @@ index_beginscan(Relation heapRelation,
 {
 	IndexScanDesc scan;
 
-	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot);
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -244,7 +249,7 @@ index_beginscan_bitmap(Relation indexRelation,
 {
 	IndexScanDesc scan;
 
-	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot);
+	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -260,8 +265,11 @@ index_beginscan_bitmap(Relation indexRelation,
  */
 static IndexScanDesc
 index_beginscan_internal(Relation indexRelation,
-						 int nkeys, int norderbys, Snapshot snapshot)
+						 int nkeys, int norderbys, Snapshot snapshot,
+						 ParallelIndexScanDesc pscan, bool temp_snap)
 {
+	IndexScanDesc scan;
+
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(ambeginscan);
 
@@ -276,8 +284,13 @@ index_beginscan_internal(Relation indexRelation,
 	/*
 	 * Tell the AM to open a scan.
 	 */
-	return indexRelation->rd_amroutine->ambeginscan(indexRelation, nkeys,
+	scan = indexRelation->rd_amroutine->ambeginscan(indexRelation, nkeys,
 													norderbys);
+	/* Initialize information for parallel scan. */
+	scan->parallel_scan = pscan;
+	scan->xs_temp_snap = temp_snap;
+
+	return scan;
 }
 
 /* ----------------
@@ -341,6 +354,9 @@ index_endscan(IndexScanDesc scan)
 	/* Release index refcount acquired by index_beginscan */
 	RelationDecrementReferenceCount(scan->indexRelation);
 
+	if (scan->xs_temp_snap)
+		UnregisterSnapshot(scan->xs_snapshot);
+
 	/* Release the scan data structure itself */
 	IndexScanEnd(scan);
 }
@@ -389,6 +405,115 @@ index_restrpos(IndexScanDesc scan)
 	scan->indexRelation->rd_amroutine->amrestrpos(scan);
 }
 
+/*
+ * index_parallelscan_estimate - estimate shared memory for parallel scan
+ *
+ * Currently, we don't pass any information to the AM-specific estimator,
+ * so it can probably only return a constant.  In the future, we might need
+ * to pass more information.
+ */
+Size
+index_parallelscan_estimate(Relation indexRelation, Snapshot snapshot)
+{
+	Size		nbytes;
+
+	RELATION_CHECKS;
+
+	nbytes = offsetof(ParallelIndexScanDescData, ps_snapshot_data);
+	nbytes = add_size(nbytes, EstimateSnapshotSpace(snapshot));
+	nbytes = MAXALIGN(nbytes);
+
+	/*
+	 * If amestimateparallelscan is not provided, assume there is no
+	 * AM-specific data needed.  (It's hard to believe that could work, but
+	 * it's easy enough to cater to it here.)
+	 */
+	if (indexRelation->rd_amroutine->amestimateparallelscan != NULL)
+		nbytes = add_size(nbytes,
+					  indexRelation->rd_amroutine->amestimateparallelscan());
+
+	return nbytes;
+}
+
+/*
+ * index_parallelscan_initialize - initialize parallel scan
+ *
+ * We initialize both the ParallelIndexScanDesc proper and the AM-specific
+ * information which follows it.
+ *
+ * This function calls access method specific initialization routine to
+ * initialize am specific information.  Call this just once in the leader
+ * process; then, individual workers attach via index_beginscan_parallel.
+ */
+void
+index_parallelscan_initialize(Relation heapRelation, Relation indexRelation,
+							  Snapshot snapshot, ParallelIndexScanDesc target)
+{
+	Size		offset;
+
+	RELATION_CHECKS;
+
+	offset = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
+					  EstimateSnapshotSpace(snapshot));
+	offset = MAXALIGN(offset);
+
+	target->ps_relid = RelationGetRelid(heapRelation);
+	target->ps_indexid = RelationGetRelid(indexRelation);
+	target->ps_offset = offset;
+	SerializeSnapshot(snapshot, target->ps_snapshot_data);
+
+	/* aminitparallelscan is optional; assume no-op if not provided by AM */
+	if (indexRelation->rd_amroutine->aminitparallelscan != NULL)
+	{
+		void	   *amtarget;
+
+		amtarget = OffsetToPointer(target, offset);
+		indexRelation->rd_amroutine->aminitparallelscan(amtarget);
+	}
+}
+
+/* ----------------
+ *		index_parallelrescan  - (re)start a parallel scan of an index
+ * ----------------
+ */
+void
+index_parallelrescan(IndexScanDesc scan)
+{
+	SCAN_CHECKS;
+
+	/* amparallelrescan is optional; assume no-op if not provided by AM */
+	if (scan->indexRelation->rd_amroutine->amparallelrescan != NULL)
+		scan->indexRelation->rd_amroutine->amparallelrescan(scan);
+}
+
+/*
+ * index_beginscan_parallel - join parallel index scan
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
+						 int norderbys, ParallelIndexScanDesc pscan)
+{
+	Snapshot	snapshot;
+	IndexScanDesc scan;
+
+	Assert(RelationGetRelid(heaprel) == pscan->ps_relid);
+	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
+	RegisterSnapshot(snapshot);
+	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
+									pscan, true);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by index_beginscan_internal.
+	 */
+	scan->heapRelation = heaprel;
+	scan->xs_snapshot = snapshot;
+
+	return scan;
+}
+
 /* ----------------
  * index_getnext_tid - get the next TID from a scan
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1bb1acf..469e7ab 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -118,6 +118,9 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = btendscan;
 	amroutine->ammarkpos = btmarkpos;
 	amroutine->amrestrpos = btrestrpos;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index ca4b0bd..78846be 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -68,6 +68,9 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = spgendscan;
 	amroutine->ammarkpos = NULL;
 	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6a5f279..e91e41d 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -137,6 +137,18 @@ typedef void (*ammarkpos_function) (IndexScanDesc scan);
 /* restore marked scan position */
 typedef void (*amrestrpos_function) (IndexScanDesc scan);
 
+/*
+ * Callback function signatures - for parallel index scans.
+ */
+
+/* estimate size of parallel scan descriptor */
+typedef Size (*amestimateparallelscan_function) (void);
+
+/* prepare for parallel index scan */
+typedef void (*aminitparallelscan_function) (void *target);
+
+/* (re)start parallel index scan */
+typedef void (*amparallelrescan_function) (IndexScanDesc scan);
 
 /*
  * API struct for an index AM.  Note this must be stored in a single palloc'd
@@ -196,6 +208,11 @@ typedef struct IndexAmRoutine
 	amendscan_function amendscan;
 	ammarkpos_function ammarkpos;		/* can be NULL */
 	amrestrpos_function amrestrpos;		/* can be NULL */
+
+	/* interface functions to support parallel index scans */
+	amestimateparallelscan_function amestimateparallelscan;		/* can be NULL */
+	aminitparallelscan_function aminitparallelscan;		/* can be NULL */
+	amparallelrescan_function amparallelrescan; /* can be NULL */
 } IndexAmRoutine;
 
 
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index b2e078a..d2258f6 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -83,6 +83,8 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 typedef struct IndexScanDescData *IndexScanDesc;
 typedef struct SysScanDescData *SysScanDesc;
 
+typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
+
 /*
  * Enumeration specifying the type of uniqueness check to perform in
  * index_insert().
@@ -131,6 +133,13 @@ extern bool index_insert(Relation indexRelation,
 			 Relation heapRelation,
 			 IndexUniqueCheck checkUnique);
 
+extern Size index_parallelscan_estimate(Relation indexrel, Snapshot snapshot);
+extern void index_parallelscan_initialize(Relation heaprel, Relation indexrel,
+							Snapshot snapshot, ParallelIndexScanDesc target);
+extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
+						 Relation indexrel, int nkeys, int norderbys,
+						 ParallelIndexScanDesc pscan);
+
 extern IndexScanDesc index_beginscan(Relation heapRelation,
 				Relation indexRelation,
 				Snapshot snapshot,
@@ -141,6 +150,7 @@ extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
 extern void index_rescan(IndexScanDesc scan,
 			 ScanKey keys, int nkeys,
 			 ScanKey orderbys, int norderbys);
+extern void index_parallelrescan(IndexScanDesc scan);
 extern void index_endscan(IndexScanDesc scan);
 extern void index_markpos(IndexScanDesc scan);
 extern void index_restrpos(IndexScanDesc scan);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8746045..ce3ca8d 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -93,6 +93,7 @@ typedef struct IndexScanDescData
 	ScanKey		keyData;		/* array of index qualifier descriptors */
 	ScanKey		orderByData;	/* array of ordering op descriptors */
 	bool		xs_want_itup;	/* caller requests index tuples */
+	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
 
 	/* signaling to index AM about killing index tuples */
 	bool		kill_prior_tuple;		/* last-returned tuple is dead */
@@ -126,8 +127,20 @@ typedef struct IndexScanDescData
 
 	/* state data for traversing HOT chains in index_getnext */
 	bool		xs_continue_hot;	/* T if must keep walking HOT chain */
+
+	/* parallel index scan information, in shared memory */
+	ParallelIndexScanDesc parallel_scan;
 }	IndexScanDescData;
 
+/* Generic structure for parallel scans */
+typedef struct ParallelIndexScanDescData
+{
+	Oid			ps_relid;
+	Oid			ps_indexid;
+	Size		ps_offset;		/* Offset in bytes of am specific structure */
+	char		ps_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+} ParallelIndexScanDescData;
+
 /* Struct for heap-or-index scans of system tables */
 typedef struct SysScanDescData
 {
diff --git a/src/include/c.h b/src/include/c.h
index efbb77f..a2c043a 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -527,6 +527,9 @@ typedef NameData *Name;
 #define PointerIsAligned(pointer, type) \
 		(((uintptr_t)(pointer) % (sizeof (type))) == 0)
 
+#define OffsetToPointer(base, offset) \
+		((void *)((char *) base + offset))
+
 #define OidIsValid(objectId)  ((bool) ((objectId) != InvalidOid))
 
 #define RegProcedureIsValid(p)	OidIsValid(p)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 993880d..c4235ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1264,6 +1264,8 @@ OverrideSearchPath
 OverrideStackEntry
 PACE_HEADER
 PACL
+ParallelIndexScanDesc
+ParallelIndexScanDescData
 PATH
 PBOOL
 PCtxtHandle
#49Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#48)
3 attachment(s)
Re: Parallel Index Scans

On Sat, Jan 21, 2017 at 12:23 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Jan 21, 2017 at 1:15 AM, Robert Haas <robertmhaas@gmail.com> wrote:

I think it's a good idea to put all three of those functions together
in the listing, similar to what we did in
69d34408e5e7adcef8ef2f4e9c4f2919637e9a06 for FDWs. After all they are
closely related in purpose, and it may be easiest to understand if
they are next to each other in the listing. I suggest that we move
them to the end in IndexAmRoutine similar to the way FdwRoutine was
done; in other words, my idea is to make the structure consistent with
the way that I revised the documentation instead of making the
documentation consistent with the order you picked for the structure
members. What I like about that is that it gives a good opportunity
to include some general remarks on parallel index scans in a central
place, as I did in that patch. Also, it makes it easier for people
who care about parallel index scans to find all of the related things
(since they are together) and for people who don't care about them to
ignore it all (for the same reason). What do you think about that
approach?

Sounds sensible. Updated patch based on that approach is attached.

In spite of being careful, I missed reorganizing the functions in
genam.h; that is now done in the attached patch.

I
will rebase the remaining work based on this patch and send them
separately.

Rebased patches are attached. I have addressed a few review comments in
these patches.
parallel_index_scan_v6 - Changed the function btparallelrescan so that
it always expects a valid parallel scan descriptor (a sketch of the new
expectation follows below).
parallel_index_opt_exec_support_v6 - Removed the usage of
GatherSupportsBackwardScan. Expanded the comments in
ExecReScanIndexScan.
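
For the first change, a rough sketch of the revised expectation is shown below. The generic descriptor layout (parallel_scan, ps_offset, OffsetToPointer) comes from the infrastructure patch; the btree-specific reset helper is hypothetical.

/* Sketch only: shows the new precondition, not the actual v6 code. */
void
btparallelrescan(IndexScanDesc scan)
{
	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
	void	   *btshared;

	/* Callers are now expected to reach here only for a parallel scan. */
	Assert(parallel_scan != NULL);

	/* The AM-specific shared area follows the generic descriptor. */
	btshared = OffsetToPointer(parallel_scan, parallel_scan->ps_offset);

	/*
	 * Reset that shared area so the scan restarts from the beginning; the
	 * concrete btree fields are not shown in this message.
	 */
	reset_btree_shared_state(btshared);		/* hypothetical helper */
}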

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel-generic-index-scan.2.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 06077af..858798d 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -138,6 +138,9 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = blendscan;
 	amroutine->ammarkpos = NULL;
 	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 40f201b..5d8e557 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -131,6 +131,11 @@ typedef struct IndexAmRoutine
     amendscan_function amendscan;
     ammarkpos_function ammarkpos;       /* can be NULL */
     amrestrpos_function amrestrpos;     /* can be NULL */
+
+    /* interface functions to support parallel index scans */
+    amestimateparallelscan_function amestimateparallelscan;    /* can be NULL */
+    aminitparallelscan_function aminitparallelscan;    /* can be NULL */
+    amparallelrescan_function amparallelrescan;    /* can be NULL */
 } IndexAmRoutine;
 </programlisting>
   </para>
@@ -624,6 +629,68 @@ amrestrpos (IndexScanDesc scan);
    the <structfield>amrestrpos</> field in its <structname>IndexAmRoutine</>
    struct may be set to NULL.
   </para>
+
+  <para>
+   In addition to supporting ordinary index scans, some types of index
+   may wish to support <firstterm>parallel index scans</>, which allow
+   multiple backends to cooperate in performing an index scan.  The
+   index access method should arrange things so that each cooperating
+   process returns a subset of the tuples that would be returned by
+   an ordinary, non-parallel index scan, but in such a way that the
+   union of those subsets is equal to the set of tuples that would be
+   returned by an ordinary, non-parallel index scan.  Furthermore, while
+   there need not be any global ordering of tuples returned by a parallel
+   scan, the ordering of that subset of tuples returned within each
+   cooperating backend must match the requested ordering.  The following
+   functions may be implemented to support parallel index scans:
+  </para>
+
+  <para>
+<programlisting>
+Size
+amestimateparallelscan (void);
+</programlisting>
+   Estimate and return the number of bytes of dynamic shared memory which
+   the access method will need to perform a parallel scan.  (This number
+   is in addition to, not in lieu of, the amount of space needed for
+   AM-independent data in <structname>ParallelIndexScanDescData</>.)
+  </para>
+
+  <para>
+   It is not necessary to implement this function for access methods which
+   do not support parallel scans or for which the number of additional bytes
+   of storage required is zero.
+  </para>
+
+  <para>
+<programlisting>
+void
+aminitparallelscan (void *target);
+</programlisting>
+   This function will be called to initialize dynamic shared memory at the
+   beginning of a parallel scan.  <parameter>target</> will point to at least
+   the number of bytes previously returned by
+   <function>amestimateparallelscan</>, and this function may use that
+   amount of space to store whatever data it wishes.
+  </para>
+
+  <para>
+   It is not necessary to implement this function for access methods which
+   do not support parallel scans or in cases where the shared memory space
+   required needs no initialization.
+  </para>
+
+  <para>
+<programlisting>
+void
+amparallelrescan (IndexScanDesc scan);
+</programlisting>
+   This function, if implemented, will be called when a parallel index scan
+   must be restarted.  It should reset any shared state set up by
+   <function>aminitparallelscan</> such that the scan will be restarted from
+   the beginning.
+  </para>
+
  </sect1>
 
  <sect1 id="index-scanning">
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index d60ddd2..b2afdb7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -112,6 +112,9 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = brinendscan;
 	amroutine->ammarkpos = NULL;
 	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 3909638..02d920b 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -68,6 +68,9 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = ginendscan;
 	amroutine->ammarkpos = NULL;
 	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 597056a..c2247ad 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -89,6 +89,9 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = gistendscan;
 	amroutine->ammarkpos = NULL;
 	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index a64a9b9..ec8ed33 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -86,6 +86,9 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = hashendscan;
 	amroutine->ammarkpos = NULL;
 	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 4822af9..ba27c1e 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -20,6 +20,10 @@
  *		index_insert	- insert an index tuple into a relation
  *		index_markpos	- mark a scan position
  *		index_restrpos	- restore a scan position
+ *		index_parallelscan_estimate - estimate shared memory for parallel scan
+ *		index_parallelscan_initialize - initialize parallel scan
+ *		index_parallelrescan  - (re)start a parallel scan of an index
+ *		index_beginscan_parallel - join parallel index scan
  *		index_getnext_tid	- get the next TID from a scan
  *		index_fetch_heap		- get the scan's next heap tuple
  *		index_getnext	- get the next heap tuple from a scan
@@ -120,7 +124,8 @@ do { \
 } while(0)
 
 static IndexScanDesc index_beginscan_internal(Relation indexRelation,
-						 int nkeys, int norderbys, Snapshot snapshot);
+						 int nkeys, int norderbys, Snapshot snapshot,
+						 ParallelIndexScanDesc pscan, bool temp_snap);
 
 
 /* ----------------------------------------------------------------
@@ -219,7 +224,7 @@ index_beginscan(Relation heapRelation,
 {
 	IndexScanDesc scan;
 
-	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot);
+	scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -244,7 +249,7 @@ index_beginscan_bitmap(Relation indexRelation,
 {
 	IndexScanDesc scan;
 
-	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot);
+	scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
 
 	/*
 	 * Save additional parameters into the scandesc.  Everything else was set
@@ -260,8 +265,11 @@ index_beginscan_bitmap(Relation indexRelation,
  */
 static IndexScanDesc
 index_beginscan_internal(Relation indexRelation,
-						 int nkeys, int norderbys, Snapshot snapshot)
+						 int nkeys, int norderbys, Snapshot snapshot,
+						 ParallelIndexScanDesc pscan, bool temp_snap)
 {
+	IndexScanDesc scan;
+
 	RELATION_CHECKS;
 	CHECK_REL_PROCEDURE(ambeginscan);
 
@@ -276,8 +284,13 @@ index_beginscan_internal(Relation indexRelation,
 	/*
 	 * Tell the AM to open a scan.
 	 */
-	return indexRelation->rd_amroutine->ambeginscan(indexRelation, nkeys,
+	scan = indexRelation->rd_amroutine->ambeginscan(indexRelation, nkeys,
 													norderbys);
+	/* Initialize information for parallel scan. */
+	scan->parallel_scan = pscan;
+	scan->xs_temp_snap = temp_snap;
+
+	return scan;
 }
 
 /* ----------------
@@ -341,6 +354,9 @@ index_endscan(IndexScanDesc scan)
 	/* Release index refcount acquired by index_beginscan */
 	RelationDecrementReferenceCount(scan->indexRelation);
 
+	if (scan->xs_temp_snap)
+		UnregisterSnapshot(scan->xs_snapshot);
+
 	/* Release the scan data structure itself */
 	IndexScanEnd(scan);
 }
@@ -389,6 +405,115 @@ index_restrpos(IndexScanDesc scan)
 	scan->indexRelation->rd_amroutine->amrestrpos(scan);
 }
 
+/*
+ * index_parallelscan_estimate - estimate shared memory for parallel scan
+ *
+ * Currently, we don't pass any information to the AM-specific estimator,
+ * so it can probably only return a constant.  In the future, we might need
+ * to pass more information.
+ */
+Size
+index_parallelscan_estimate(Relation indexRelation, Snapshot snapshot)
+{
+	Size		nbytes;
+
+	RELATION_CHECKS;
+
+	nbytes = offsetof(ParallelIndexScanDescData, ps_snapshot_data);
+	nbytes = add_size(nbytes, EstimateSnapshotSpace(snapshot));
+	nbytes = MAXALIGN(nbytes);
+
+	/*
+	 * If amestimateparallelscan is not provided, assume there is no
+	 * AM-specific data needed.  (It's hard to believe that could work, but
+	 * it's easy enough to cater to it here.)
+	 */
+	if (indexRelation->rd_amroutine->amestimateparallelscan != NULL)
+		nbytes = add_size(nbytes,
+					  indexRelation->rd_amroutine->amestimateparallelscan());
+
+	return nbytes;
+}
+
+/*
+ * index_parallelscan_initialize - initialize parallel scan
+ *
+ * We initialize both the ParallelIndexScanDesc proper and the AM-specific
+ * information which follows it.
+ *
+ * This function calls the access method's initialization routine to set up
+ * the AM-specific information.  Call this just once in the leader
+ * process; then, individual workers attach via index_beginscan_parallel.
+ */
+void
+index_parallelscan_initialize(Relation heapRelation, Relation indexRelation,
+							  Snapshot snapshot, ParallelIndexScanDesc target)
+{
+	Size		offset;
+
+	RELATION_CHECKS;
+
+	offset = add_size(offsetof(ParallelIndexScanDescData, ps_snapshot_data),
+					  EstimateSnapshotSpace(snapshot));
+	offset = MAXALIGN(offset);
+
+	target->ps_relid = RelationGetRelid(heapRelation);
+	target->ps_indexid = RelationGetRelid(indexRelation);
+	target->ps_offset = offset;
+	SerializeSnapshot(snapshot, target->ps_snapshot_data);
+
+	/* aminitparallelscan is optional; assume no-op if not provided by AM */
+	if (indexRelation->rd_amroutine->aminitparallelscan != NULL)
+	{
+		void	   *amtarget;
+
+		amtarget = OffsetToPointer(target, offset);
+		indexRelation->rd_amroutine->aminitparallelscan(amtarget);
+	}
+}
+
+/* ----------------
+ *		index_parallelrescan  - (re)start a parallel scan of an index
+ * ----------------
+ */
+void
+index_parallelrescan(IndexScanDesc scan)
+{
+	SCAN_CHECKS;
+
+	/* amparallelrescan is optional; assume no-op if not provided by AM */
+	if (scan->indexRelation->rd_amroutine->amparallelrescan != NULL)
+		scan->indexRelation->rd_amroutine->amparallelrescan(scan);
+}
+
+/*
+ * index_beginscan_parallel - join parallel index scan
+ *
+ * Caller must be holding suitable locks on the heap and the index.
+ */
+IndexScanDesc
+index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
+						 int norderbys, ParallelIndexScanDesc pscan)
+{
+	Snapshot	snapshot;
+	IndexScanDesc scan;
+
+	Assert(RelationGetRelid(heaprel) == pscan->ps_relid);
+	snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
+	RegisterSnapshot(snapshot);
+	scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
+									pscan, true);
+
+	/*
+	 * Save additional parameters into the scandesc.  Everything else was set
+	 * up by index_beginscan_internal.
+	 */
+	scan->heapRelation = heaprel;
+	scan->xs_snapshot = snapshot;
+
+	return scan;
+}
+
 /* ----------------
  * index_getnext_tid - get the next TID from a scan
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1bb1acf..469e7ab 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -118,6 +118,9 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = btendscan;
 	amroutine->ammarkpos = btmarkpos;
 	amroutine->amrestrpos = btrestrpos;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index ca4b0bd..78846be 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -68,6 +68,9 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = spgendscan;
 	amroutine->ammarkpos = NULL;
 	amroutine->amrestrpos = NULL;
+	amroutine->amestimateparallelscan = NULL;
+	amroutine->aminitparallelscan = NULL;
+	amroutine->amparallelrescan = NULL;
 
 	PG_RETURN_POINTER(amroutine);
 }
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6a5f279..e91e41d 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -137,6 +137,18 @@ typedef void (*ammarkpos_function) (IndexScanDesc scan);
 /* restore marked scan position */
 typedef void (*amrestrpos_function) (IndexScanDesc scan);
 
+/*
+ * Callback function signatures - for parallel index scans.
+ */
+
+/* estimate size of parallel scan descriptor */
+typedef Size (*amestimateparallelscan_function) (void);
+
+/* prepare for parallel index scan */
+typedef void (*aminitparallelscan_function) (void *target);
+
+/* (re)start parallel index scan */
+typedef void (*amparallelrescan_function) (IndexScanDesc scan);
 
 /*
  * API struct for an index AM.  Note this must be stored in a single palloc'd
@@ -196,6 +208,11 @@ typedef struct IndexAmRoutine
 	amendscan_function amendscan;
 	ammarkpos_function ammarkpos;		/* can be NULL */
 	amrestrpos_function amrestrpos;		/* can be NULL */
+
+	/* interface functions to support parallel index scans */
+	amestimateparallelscan_function amestimateparallelscan;		/* can be NULL */
+	aminitparallelscan_function aminitparallelscan;		/* can be NULL */
+	amparallelrescan_function amparallelrescan; /* can be NULL */
 } IndexAmRoutine;
 
 
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index b2e078a..51466b9 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -83,6 +83,8 @@ typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 typedef struct IndexScanDescData *IndexScanDesc;
 typedef struct SysScanDescData *SysScanDesc;
 
+typedef struct ParallelIndexScanDescData *ParallelIndexScanDesc;
+
 /*
  * Enumeration specifying the type of uniqueness check to perform in
  * index_insert().
@@ -144,6 +146,13 @@ extern void index_rescan(IndexScanDesc scan,
 extern void index_endscan(IndexScanDesc scan);
 extern void index_markpos(IndexScanDesc scan);
 extern void index_restrpos(IndexScanDesc scan);
+extern Size index_parallelscan_estimate(Relation indexrel, Snapshot snapshot);
+extern void index_parallelscan_initialize(Relation heaprel, Relation indexrel,
+							Snapshot snapshot, ParallelIndexScanDesc target);
+extern void index_parallelrescan(IndexScanDesc scan);
+extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
+						 Relation indexrel, int nkeys, int norderbys,
+						 ParallelIndexScanDesc pscan);
 extern ItemPointer index_getnext_tid(IndexScanDesc scan,
 				  ScanDirection direction);
 extern HeapTuple index_fetch_heap(IndexScanDesc scan);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 8746045..ce3ca8d 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -93,6 +93,7 @@ typedef struct IndexScanDescData
 	ScanKey		keyData;		/* array of index qualifier descriptors */
 	ScanKey		orderByData;	/* array of ordering op descriptors */
 	bool		xs_want_itup;	/* caller requests index tuples */
+	bool		xs_temp_snap;	/* unregister snapshot at scan end? */
 
 	/* signaling to index AM about killing index tuples */
 	bool		kill_prior_tuple;		/* last-returned tuple is dead */
@@ -126,8 +127,20 @@ typedef struct IndexScanDescData
 
 	/* state data for traversing HOT chains in index_getnext */
 	bool		xs_continue_hot;	/* T if must keep walking HOT chain */
+
+	/* parallel index scan information, in shared memory */
+	ParallelIndexScanDesc parallel_scan;
 }	IndexScanDescData;
 
+/* Generic structure for parallel scans */
+typedef struct ParallelIndexScanDescData
+{
+	Oid			ps_relid;
+	Oid			ps_indexid;
+	Size		ps_offset;		/* Offset in bytes of am specific structure */
+	char		ps_snapshot_data[FLEXIBLE_ARRAY_MEMBER];
+} ParallelIndexScanDescData;
+
 /* Struct for heap-or-index scans of system tables */
 typedef struct SysScanDescData
 {
diff --git a/src/include/c.h b/src/include/c.h
index efbb77f..a2c043a 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -527,6 +527,9 @@ typedef NameData *Name;
 #define PointerIsAligned(pointer, type) \
 		(((uintptr_t)(pointer) % (sizeof (type))) == 0)
 
+#define OffsetToPointer(base, offset) \
+		((void *)((char *) base + offset))
+
 #define OidIsValid(objectId)  ((bool) ((objectId) != InvalidOid))
 
 #define RegProcedureIsValid(p)	OidIsValid(p)
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 993880d..c4235ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1264,6 +1264,8 @@ OverrideSearchPath
 OverrideStackEntry
 PACE_HEADER
 PACL
+ParallelIndexScanDesc
+ParallelIndexScanDescData
 PATH
 PBOOL
 PCtxtHandle
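
To make the new hooks concrete, here is a minimal sketch (not part of the attached patches) of how an index AM might fill in the three optional IndexAmRoutine callbacks added above.  The MyAm names and the shared-state layout are purely illustrative, and the code assumes the usual PostgreSQL headers (access/amapi.h, access/relscan.h, storage/spin.h):

/*
 * Hypothetical example only (not part of the patches): an AM-specific
 * shared-state struct and the three optional parallel-scan callbacks.
 */
typedef struct MyAmParallelScanData
{
	slock_t		mps_mutex;		/* protects mps_nextPage */
	BlockNumber mps_nextPage;	/* next page a participant should take */
} MyAmParallelScanData;

/* amestimateparallelscan: reserve space for the AM-specific shared state */
static Size
myamestimateparallelscan(void)
{
	return sizeof(MyAmParallelScanData);
}

/* aminitparallelscan: set up the shared state, called once in the leader */
static void
myaminitparallelscan(void *target)
{
	MyAmParallelScanData *shared = (MyAmParallelScanData *) target;

	SpinLockInit(&shared->mps_mutex);
	shared->mps_nextPage = InvalidBlockNumber;
}

/* amparallelrescan: reset the shared state for a rescan */
static void
myamparallelrescan(IndexScanDesc scan)
{
	ParallelIndexScanDesc pscan = scan->parallel_scan;
	MyAmParallelScanData *shared = (MyAmParallelScanData *)
		OffsetToPointer(pscan, pscan->ps_offset);

	SpinLockAcquire(&shared->mps_mutex);
	shared->mps_nextPage = InvalidBlockNumber;
	SpinLockRelease(&shared->mps_mutex);
}

/* ... and, in the AM's handler function: */
	amroutine->amestimateparallelscan = myamestimateparallelscan;
	amroutine->aminitparallelscan = myaminitparallelscan;
	amroutine->amparallelrescan = myamparallelrescan;

index_parallelscan_initialize() places the AM-specific area ps_offset bytes past the generic ParallelIndexScanDescData, which is why amparallelrescan locates it with OffsetToPointer().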
Attachment: parallel_index_scan_v6.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 01fad38..638a9ab 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1207,7 +1207,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting in an extension.</entry>
         </row>
         <row>
-         <entry morerows="9"><literal>IPC</></entry>
+         <entry morerows="10"><literal>IPC</></entry>
          <entry><literal>BgWorkerShutdown</></entry>
          <entry>Waiting for background worker to shut down.</entry>
         </row>
@@ -1240,6 +1240,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting for parallel workers to finish computing.</entry>
         </row>
         <row>
+         <entry><literal>ParallelBtreePage</></entry>
+         <entry>Waiting for the page number needed to continue a parallel btree scan
+         to become available.</entry>
+        </row>
+        <row>
          <entry><literal>SafeSnapshot</></entry>
          <entry>Waiting for a snapshot for a <literal>READ ONLY DEFERRABLE</> transaction.</entry>
         </row>
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index ba27c1e..28de5be 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -481,6 +481,9 @@ index_parallelrescan(IndexScanDesc scan)
 {
 	SCAN_CHECKS;
 
+	if (!scan->parallel_scan)
+		return;
+
 	/* amparallelrescan is optional; assume no-op if not provided by AM */
 	if (scan->indexRelation->rd_amroutine->amparallelrescan != NULL)
 		scan->indexRelation->rd_amroutine->amparallelrescan(scan);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 469e7ab..a1de4a2 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,8 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "pgstat.h"
+#include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -63,6 +65,46 @@ typedef struct
 	MemoryContext pagedelcontext;
 } BTVacState;
 
+/*
+ * The flags below are used to indicate the state of a parallel scan.
+ *
+ * BTPARALLEL_NOT_INITIALIZED implies that the scan has not been started.
+ *
+ * BTPARALLEL_ADVANCING implies that some process (worker or leader) is
+ * advancing the scan to a new page; others must wait.
+ *
+ * BTPARALLEL_IDLE implies that no backend is advancing the scan; someone can
+ * start doing it.
+ *
+ * BTPARALLEL_DONE implies that the scan is complete (including error exit).
+ */
+typedef enum
+{
+	BTPARALLEL_NOT_INITIALIZED,
+	BTPARALLEL_ADVANCING,
+	BTPARALLEL_IDLE,
+	BTPARALLEL_DONE
+} BTPS_State;
+
+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+	BlockNumber btps_scanPage;	/* latest or next page to be scanned */
+	BTPS_State	btps_pageStatus;	/* indicates whether next page is
+									 * available for scan.  See above for
+									 * possible states of parallel scan. */
+	int			btps_arrayKeyCount;		/* count indicating number of array
+										 * scan keys processed by parallel
+										 * scan */
+	slock_t		btps_mutex;		/* protects above variables */
+	ConditionVariable btps_cv;	/* used to synchronize parallel scan */
+} BTParallelScanDescData;
+
+typedef struct BTParallelScanDescData *BTParallelScanDesc;
+
 
 static void btbuildCallback(Relation index,
 				HeapTuple htup,
@@ -118,9 +160,9 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = btendscan;
 	amroutine->ammarkpos = btmarkpos;
 	amroutine->amrestrpos = btrestrpos;
-	amroutine->amestimateparallelscan = NULL;
-	amroutine->aminitparallelscan = NULL;
-	amroutine->amparallelrescan = NULL;
+	amroutine->amestimateparallelscan = btestimateparallelscan;
+	amroutine->aminitparallelscan = btinitparallelscan;
+	amroutine->amparallelrescan = btparallelrescan;
 
 	PG_RETURN_POINTER(amroutine);
 }
@@ -490,6 +532,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	}
 
 	so->markItemIndex = -1;
+	so->arrayKeyCount = 0;
 	BTScanPosUnpinIfPinned(so->markPos);
 	BTScanPosInvalidate(so->markPos);
 
@@ -652,6 +695,217 @@ btrestrpos(IndexScanDesc scan)
 }
 
 /*
+ * btestimateparallelscan - estimate storage for BTParallelScanDescData
+ */
+Size
+btestimateparallelscan(void)
+{
+	return sizeof(BTParallelScanDescData);
+}
+
+/*
+ * btinitparallelscan - initialize BTParallelScanDesc for parallel btree scan
+ */
+void
+btinitparallelscan(void *target)
+{
+	BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
+
+	SpinLockInit(&bt_target->btps_mutex);
+	bt_target->btps_scanPage = InvalidBlockNumber;
+	bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->btps_arrayKeyCount = 0;
+	ConditionVariableInit(&bt_target->btps_cv);
+}
+
+/*
+ *	btparallelrescan() -- reset parallel scan
+ */
+void
+btparallelrescan(IndexScanDesc scan)
+{
+	BTParallelScanDesc btscan;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+
+	Assert(parallel_scan);
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * Ideally, we wouldn't need to acquire the spinlock here, but being
+	 * consistent with heap_rescan seems to be a good idea.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = InvalidBlockNumber;
+	btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	btscan->btps_arrayKeyCount = 0;
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
+ * _bt_parallel_seize() -- returns the next block to be scanned for forward
+ *		scans and the latest block scanned for backward scans.
+ *
+ * status - tells the caller whether to continue the scan.  A true value
+ * indicates one of the following: (a) the returned block number is valid
+ * and the scan can be continued, (b) the block number is invalid and the
+ * scan has just begun, or (c) the block number is P_NONE and the scan is
+ * finished.  A false value indicates that we have reached the end of the
+ * scan for the current scankeys, in which case we return P_NONE as the
+ * block number.
+ *
+ * The first time the leader or a worker hits the last page, it gets back
+ * P_NONE with status true; any participant that then tries to fetch the
+ * next page gets status false.
+ *
+ * Callers should ignore the return value when status is false.
+ */
+BlockNumber
+_bt_parallel_seize(IndexScanDesc scan, bool *status)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTPS_State	pageStatus;
+	bool		exit_loop = false;
+	BlockNumber nextPage = InvalidBlockNumber;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	*status = true;
+	while (1)
+	{
+		/*
+		 * Fetch the next block to scan and update the page status so that
+		 * other participants of the parallel scan wait until the next page
+		 * is available for scan.  Set status to false if the scan is finished.
+		 */
+		SpinLockAcquire(&btscan->btps_mutex);
+		pageStatus = btscan->btps_pageStatus;
+
+		/* Check if the scan for current scan keys is finished */
+		if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+			*status = false;
+		else if (pageStatus == BTPARALLEL_DONE)
+			*status = false;
+		else if (pageStatus != BTPARALLEL_ADVANCING)
+		{
+			btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
+			nextPage = btscan->btps_scanPage;
+			exit_loop = true;
+		}
+		SpinLockRelease(&btscan->btps_mutex);
+		if (exit_loop || !*status)
+			break;
+		ConditionVariableSleep(&btscan->btps_cv, WAIT_EVENT_BTREE_PAGE);
+	}
+	ConditionVariableCancelSleep();
+
+	/* no more pages to scan */
+	if (!*status)
+		return P_NONE;
+
+	*status = true;
+	return nextPage;
+}
+
+/*
+ * _bt_parallel_release() -- Advances the parallel scan to allow scan of next
+ *		page
+ *
+ * Save information about scan position and wake up next worker to continue
+ * scan.
+ *
+ * For backward scan, scan_page holds the latest page being scanned.
+ * For forward scan, scan_page holds the next page to be scanned.
+ */
+void
+_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+{
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = scan_page;
+	btscan->btps_pageStatus = BTPARALLEL_IDLE;
+	SpinLockRelease(&btscan->btps_mutex);
+	ConditionVariableSignal(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_done() -- Finishes the parallel scan
+ *
+ * When there are no pages left to scan, this function should be called to
+ * notify other workers.  Otherwise, they might wait forever for the scan to
+ * advance to the next page.
+ */
+void
+_bt_parallel_done(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+	bool		status_changed = false;
+
+	/* Do nothing, for non-parallel scans */
+	if (parallel_scan == NULL)
+		return;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * Mark the parallel scan as done, but no more than once per scan.  We
+	 * rely on this state to initiate the next scan when there are multiple
+	 * array keys; see _bt_advance_array_keys.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+		btscan->btps_pageStatus != BTPARALLEL_DONE)
+	{
+		btscan->btps_pageStatus = BTPARALLEL_DONE;
+		status_changed = true;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+
+	/* wake up all the workers associated with this parallel scan */
+	if (status_changed)
+		ConditionVariableBroadcast(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_advance_scan() -- Advances the parallel scan
+ *
+ * Updates the count of array keys processed for both local and parallel
+ * scans.
+ */
+void
+_bt_parallel_advance_scan(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	so->arrayKeyCount++;
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (btscan->btps_pageStatus == BTPARALLEL_DONE)
+	{
+		btscan->btps_scanPage = InvalidBlockNumber;
+		btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		btscan->btps_arrayKeyCount++;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 4fba75a..b565e09 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,9 +30,12 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
 
 /*
@@ -544,8 +547,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
+	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	BlockNumber blkno;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -564,6 +569,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (!so->qual_ok)
 		return false;
 
+	/*
+	 * For parallel scans, get the page from shared state.  If the scan has
+	 * not started, proceed to find the first leaf page to scan, keeping the
+	 * other workers waiting until we have descended to the appropriate leaf
+	 * page for matching tuples.
+	 *
+	 * If the scan has already begun, skip finding the first leaf page and
+	 * directly scan the page stored in the shared structure, or the page to
+	 * its left in the case of a backward scan.
+	 */
+	if (scan->parallel_scan != NULL)
+	{
+		blkno = _bt_parallel_seize(scan, &status);
+		if (!status)
+		{
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno == P_NONE)
+		{
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno != InvalidBlockNumber)
+		{
+			if (!_bt_parallel_readpage(scan, blkno, dir))
+				return false;
+			goto readcomplete;
+		}
+	}
+
 	/*----------
 	 * Examine the scan keys to discover where we need to start the scan.
 	 *
@@ -743,7 +780,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * there.
 	 */
 	if (keysCount == 0)
-		return _bt_endpoint(scan, dir);
+	{
+		bool		match;
+
+		match = _bt_endpoint(scan, dir);
+		if (!match)
+		{
+			/* No match, indicate (parallel) scan finished */
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+		}
+
+		return match;
+	}
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -993,25 +1042,21 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * because nothing finer to lock exists.
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
+
+		/*
+		 * mark parallel scan as done, so that all the workers can finish
+		 * their scan
+		 */
+		_bt_parallel_done(scan);
+		BTScanPosInvalidate(so->currPos);
+
 		return false;
 	}
 	else
 		PredicateLockPage(rel, BufferGetBlockNumber(buf),
 						  scan->xs_snapshot);
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	Assert(so->markItemIndex == -1);
+	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
@@ -1060,6 +1105,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
+readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
@@ -1154,6 +1200,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	page = BufferGetPage(so->currPos.buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/* allow next page to be processed by parallel worker */
+	if (scan->parallel_scan)
+	{
+		if (ScanDirectionIsForward(dir))
+			_bt_parallel_release(scan, opaque->btpo_next);
+		else
+			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+	}
+
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1278,21 +1334,16 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
  * if pinned, we'll drop the pin before moving to next page.  The buffer is
  * not locked on entry.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page.  For success on a scan using a non-MVCC snapshot we hold
- * a pin, but not a read lock, on that page.  If we do not hold the pin, we
- * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
- *
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
+ * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
+ * to InvalidBuffer.  We return TRUE to indicate success.
  */
 static bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Relation	rel;
-	Page		page;
-	BTPageOpaque opaque;
+	BlockNumber blkno = InvalidBlockNumber;
+	bool		status = true;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1319,13 +1370,27 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}
 
-	rel = scan->indexRelation;
-
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
-		/* We must rely on the previously saved nextPage link! */
-		BlockNumber blkno = so->currPos.nextPage;
+
+		/*
+		 * We must rely on the previously saved nextPage link for non-parallel
+		 * scans!
+		 */
+		if (scan->parallel_scan != NULL)
+		{
+			blkno = _bt_parallel_seize(scan, &status);
+			if (!status)
+			{
+				/* release the previous buffer, if pinned */
+				BTScanPosUnpinIfPinned(so->currPos);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+			blkno = so->currPos.nextPage;
 
 		/* Remember we left a page with data */
 		so->currPos.moreLeft = true;
@@ -1333,11 +1398,68 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		/* release the previous buffer, if pinned */
 		BTScanPosUnpinIfPinned(so->currPos);
 
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+	else
+	{
+		/* Remember we left a page with data */
+		so->currPos.moreRight = true;
+
+		/* For parallel scans, get the last page scanned */
+		if (scan->parallel_scan != NULL)
+		{
+			blkno = _bt_parallel_seize(scan, &status);
+			BTScanPosUnpinIfPinned(so->currPos);
+			if (!status)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+
+	/* Drop the lock, and maybe the pin, on the current page */
+	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+	return true;
+}
+
+/*
+ *	_bt_readnextpage() -- Read next page containing valid data for scan
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page.  The caller is responsible for releasing the lock and
+ * pin on the buffer on success.  We return TRUE to indicate success.
+ *
+ * If there are no more matching records in the given direction, we drop all
+ * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation	rel;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		status = true;
+
+	rel = scan->indexRelation;
+
+	if (ScanDirectionIsForward(dir))
+	{
 		for (;;)
 		{
-			/* if we're at end of scan, give up */
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
 			if (blkno == P_NONE || !so->currPos.moreRight)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1345,10 +1467,10 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			CHECK_FOR_INTERRUPTS();
 			/* step right one page */
 			so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
-			/* check for deleted page */
 			page = BufferGetPage(so->currPos.buf);
 			TestForOldSnapshot(scan->xs_snapshot, rel, page);
 			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			/* check for deleted page */
 			if (!P_IGNORE(opaque))
 			{
 				PredicateLockPage(rel, blkno, scan->xs_snapshot);
@@ -1359,14 +1481,30 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			}
 
 			/* nope, keep going */
-			blkno = opaque->btpo_next;
+			if (scan->parallel_scan != NULL)
+			{
+				blkno = _bt_parallel_seize(scan, &status);
+				if (!status)
+				{
+					_bt_relbuf(rel, so->currPos.buf);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+			}
+			else
+				blkno = opaque->btpo_next;
 			_bt_relbuf(rel, so->currPos.buf);
 		}
 	}
 	else
 	{
-		/* Remember we left a page with data */
-		so->currPos.moreRight = true;
+		/*
+		 * For parallel scans, the current block number is retrieved from
+		 * shared state by the caller, which is responsible for passing the
+		 * correct block number.
+		 */
+		if (blkno != InvalidBlockNumber)
+			so->currPos.currPage = blkno;
 
 		/*
 		 * Walk left to the next page with data.  This is much more complex
@@ -1401,6 +1539,12 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			if (!so->currPos.moreLeft)
 			{
 				_bt_relbuf(rel, so->currPos.buf);
+
+				/*
+				 * mark parallel scan as done, so that all the workers can
+				 * finish their scan
+				 */
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1412,6 +1556,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			/* if we're physically at end of index, return failure */
 			if (so->currPos.buf == InvalidBuffer)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1432,9 +1577,48 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
 					break;
 			}
+
+			/*
+			 * For parallel scans, get the last page scanned, as it is quite
+			 * possible that by the time we try to fetch the previous page,
+			 * another worker has already decided to scan it.  We could avoid
+			 * that by calling _bt_parallel_release once we have read the
+			 * current page, but it would be bad to make other workers wait
+			 * until we have read the page.
+			 */
+			if (scan->parallel_scan != NULL)
+			{
+				_bt_relbuf(rel, so->currPos.buf);
+				blkno = _bt_parallel_seize(scan, &status);
+				if (!status)
+				{
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+				so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			}
 		}
 	}
 
+	return true;
+}
+
+/*
+ *	_bt_parallel_readpage() -- Read current page containing valid data for scan
+ *
+ * On success, release lock and pin on buffer.  We return TRUE to indicate
+ * success.
+ */
+static bool
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	_bt_initialize_more_data(so, dir);
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
 	/* Drop the lock, and maybe the pin, on the current page */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
@@ -1712,19 +1896,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/* remember which buffer we have pinned */
 	so->currPos.buf = buf;
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
+	_bt_initialize_more_data(so, dir);
 
 	/*
 	 * Now load data from the first page of the scan.
@@ -1753,3 +1925,25 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 
 	return true;
 }
+
+/*
+ * _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
+ * for scan direction
+ */
+static inline void
+_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
+{
+	/* initialize moreLeft/moreRight appropriately for scan direction */
+	if (ScanDirectionIsForward(dir))
+	{
+		so->currPos.moreLeft = false;
+		so->currPos.moreRight = true;
+	}
+	else
+	{
+		so->currPos.moreLeft = true;
+		so->currPos.moreRight = false;
+	}
+	so->numKilled = 0;			/* just paranoia */
+	so->markItemIndex = -1;		/* ditto */
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index da0f330..692ced4 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -590,6 +590,10 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 			break;
 	}
 
+	/* advance parallel scan */
+	if (scan->parallel_scan != NULL)
+		_bt_parallel_advance_scan(scan);
+
 	return found;
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..92af6ec 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3392,6 +3392,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PARALLEL_FINISH:
 			event_name = "ParallelFinish";
 			break;
+		case WAIT_EVENT_BTREE_PAGE:
+			event_name = "ParallelBtreePage";
+			break;
 		case WAIT_EVENT_SAFE_SNAPSHOT:
 			event_name = "SafeSnapshot";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 011a72e..547c1cf 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -609,6 +609,8 @@ typedef struct BTScanOpaqueData
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
 								 * there are any unsatisfiable array keys) */
+	int			arrayKeyCount;	/* count indicating number of array scan keys
+								 * processed */
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
@@ -652,7 +654,8 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
 /*
- * prototypes for functions in nbtree.c (external entry points for btree)
+ * prototypes for functions in nbtree.c (external entry points for btree and
+ * functions to maintain state of parallel scan)
  */
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 		struct IndexInfo *indexInfo);
@@ -661,10 +664,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 		 ItemPointer ht_ctid, Relation heapRel,
 		 IndexUniqueCheck checkUnique);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern Size btestimateparallelscan(void);
+extern void btinitparallelscan(void *target);
+extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool *status);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+extern void _bt_parallel_done(IndexScanDesc scan);
+extern void _bt_parallel_advance_scan(IndexScanDesc scan);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		 ScanKey orderbys, int norderbys);
+extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..915cc85 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -786,6 +786,7 @@ typedef enum
 	WAIT_EVENT_MQ_RECEIVE,
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
+	WAIT_EVENT_BTREE_PAGE,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP
 } WaitEventIPC;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c4235ae..9f876ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -161,6 +161,9 @@ BTPageOpaque
 BTPageOpaqueData
 BTPageStat
 BTPageState
+BTParallelScanDesc
+BTParallelScanDescData
+BTPS_State
 BTScanOpaque
 BTScanOpaqueData
 BTScanPos
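
To summarize the page-handoff protocol implemented above in nbtree.c and nbtsearch.c, here is a simplified sketch (illustrative only, not code from the patch) of what a single participant of a forward parallel scan does per page; backward scans, array-key advancement, and error paths are omitted:

/*
 * Illustrative sketch only (not from the patch): one iteration of a forward
 * parallel scan participant, assuming the scan has already been started
 * (_bt_first handles the initial descent to a leaf page).
 */
static bool
scan_one_parallel_page(IndexScanDesc scan, ScanDirection dir)
{
	bool		status;
	BlockNumber blkno;

	/*
	 * Seize the scan: sleep on the condition variable while another
	 * participant is in BTPARALLEL_ADVANCING state, then take the block
	 * number it published via _bt_parallel_release().
	 */
	blkno = _bt_parallel_seize(scan, &status);
	if (!status)
		return false;			/* scan already marked done for these keys */

	if (blkno == P_NONE)
	{
		/* no pages left; wake everyone so they stop waiting */
		_bt_parallel_done(scan);
		return false;
	}

	/*
	 * Read the page.  Inside _bt_readpage(), the right link is published
	 * with _bt_parallel_release() before the tuples are examined, so the
	 * next participant can proceed while we are still processing this page.
	 */
	return _bt_readnextpage(scan, blkno, dir);
}

The design keeps only one participant in the BTPARALLEL_ADVANCING state at a time, but because _bt_parallel_release() is called before the tuples on the claimed page are examined, page handoff overlaps with tuple processing across workers.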
Attachment: parallel_index_opt_exec_support_v6.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 858798d..f2eda67 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -119,6 +119,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = blbuild;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 5d8e557..623be4e 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -110,6 +110,8 @@ typedef struct IndexAmRoutine
     bool        amclusterable;
     /* does AM handle predicate locks? */
     bool        ampredlocks;
+    /* does AM support parallel scan? */
+    bool        amcanparallel;
     /* type of data stored in index, or InvalidOid if variable */
     Oid         amkeytype;
 
diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index 5d4bb21..cfc3435 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -268,15 +268,28 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
   <title>Parallel Scans</title>
 
   <para>
-    Currently, the only type of scan which has been modified to work with
-    parallel query is a sequential scan.  Therefore, the driving table in
-    a parallel plan will always be scanned using a
-    <literal>Parallel Seq Scan</>.  The relation's blocks will be divided
-    among the cooperating processes.  Blocks are handed out one at a
-    time, so that access to the relation remains sequential.  Each process
-    will visit every tuple on the page assigned to it before requesting a new
-    page.
+    Currently, the types of scans that work with parallel query are sequential
+    scans and index scans.
   </para>
+
+  <para>
+    When the driving table of a parallel plan is scanned sequentially, a
+    <literal>Parallel Seq Scan</> node is used.  The relation's
+    blocks will be divided among the cooperating processes.  Blocks are handed out
+    one at a time, so that access to the relation remains sequential.  Each process
+    will visit every tuple on the page assigned to it before requesting a new page.
+  </para>
+
+  <para>
+    When the driving table of a parallel plan is scanned via an index, a
+    <literal>Parallel Index Scan</> node is used.  Currently, the only
+    type of index which has been modified to work with parallel query is
+    <literal>btree</>.  Parallelism is performed at the leaf level of the <literal>btree</>.
+    The first backend (either the leader or a worker) to start a scan descends to a
+    leaf page while the others wait until it gets there.  At the leaf level, blocks are
+    handed out one at a time, as in a <literal>Parallel Seq Scan</>, until all blocks
+    have been processed or the scan has reached its end point.
+  </para>
  </sect2>
 
  <sect2 id="parallel-joins">
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index b2afdb7..9ae1c39 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -93,6 +93,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 02d920b..4feb524 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index c2247ad..b2c9389 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -70,6 +70,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index ec8ed33..1944452 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -67,6 +67,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = INT4OID;
 
 	amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a1de4a2..7f0f0e2 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -141,6 +141,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
+	amroutine->amcanparallel = true;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 78846be..e57ac49 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -49,6 +49,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = spgbuild;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e01fe6d..cb91cc3 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -28,6 +28,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeSeqscan.h"
+#include "executor/nodeIndexscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -197,6 +198,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 				ExecSeqScanEstimate((SeqScanState *) planstate,
 									e->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanEstimate((IndexScanState *) planstate,
+									  e->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanEstimate((ForeignScanState *) planstate,
 										e->pcxt);
@@ -249,6 +254,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
 										 d->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeDSM((IndexScanState *) planstate,
+										   d->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
 											 d->pcxt);
@@ -725,6 +734,9 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 			case T_SeqScanState:
 				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeWorker((IndexScanState *) planstate, toc);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
 												toc);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 5734550..7bb594c 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -22,6 +22,9 @@
  *		ExecEndIndexScan		releases all storage.
  *		ExecIndexMarkPos		marks scan position.
  *		ExecIndexRestrPos		restores scan position.
+ *		ExecIndexScanEstimate	estimates DSM space needed for parallel index scan
+ *		ExecIndexScanInitializeDSM initializes DSM for parallel index scan
+ *		ExecIndexScanInitializeWorker attaches to DSM info in parallel worker
  */
 #include "postgres.h"
 
@@ -514,6 +517,18 @@ ExecIndexScan(IndexScanState *node)
 void
 ExecReScanIndexScan(IndexScanState *node)
 {
+	bool		reset_parallel_scan = true;
+
+	/*
+	 * If we are here just to update the scan keys, then don't reset parallel
+	 * scan.  We don't want each of the participating processes in the parallel
+	 * scan to update the shared parallel scan state at the start of the scan.
+	 * It is quite possible that one of the participants has already begun
+	 * scanning the index when another has yet to start it.
+	 */
+	if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+		reset_parallel_scan = false;
+
 	/*
 	 * If we are doing runtime key calculations (ie, any of the index key
 	 * values weren't simple Consts), compute the new key values.  But first,
@@ -539,10 +554,21 @@ ExecReScanIndexScan(IndexScanState *node)
 			reorderqueue_pop(node);
 	}
 
-	/* reset index scan */
-	index_rescan(node->iss_ScanDesc,
-				 node->iss_ScanKeys, node->iss_NumScanKeys,
-				 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+	/*
+	 * Reset (parallel) index scan.  For parallel-aware nodes, the scan
+	 * descriptor is initialized during actual execution of the node and we
+	 * can reach here before that (e.g., during execution of a nested loop
+	 * join).  So, avoid updating the scan descriptor in that case.
+	 */
+	if (node->iss_ScanDesc)
+	{
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		if (reset_parallel_scan)
+			index_parallelrescan(node->iss_ScanDesc);
+	}
 	node->iss_ReachedEnd = false;
 
 	ExecScanReScan(&node->ss);
@@ -1013,22 +1039,29 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Initialize scan descriptor.
+	 * For a parallel-aware node, we initialize the scan descriptor after
+	 * initializing the shared memory for parallel execution.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
-											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot,
-											   indexstate->iss_NumScanKeys,
+	if (!node->scan.plan.parallel_aware)
+	{
+		/*
+		 * Initialize scan descriptor.
+		 */
+		indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+												indexstate->iss_RelationDesc,
+												   estate->es_snapshot,
+												 indexstate->iss_NumScanKeys,
 											 indexstate->iss_NumOrderByKeys);
 
-	/*
-	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
-	 * index AM.
-	 */
-	if (indexstate->iss_NumRuntimeKeys == 0)
-		index_rescan(indexstate->iss_ScanDesc,
-					 indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
+		/*
+		 * If no run-time keys to calculate, go ahead and pass the scankeys to
+		 * the index AM.
+		 */
+		if (indexstate->iss_NumRuntimeKeys == 0)
+			index_rescan(indexstate->iss_ScanDesc,
+					   indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
 				indexstate->iss_OrderByKeys, indexstate->iss_NumOrderByKeys);
+	}
 
 	/*
 	 * all done.
@@ -1590,3 +1623,91 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 	else if (n_array_keys != 0)
 		elog(ERROR, "ScalarArrayOpExpr index qual found where not allowed");
 }
+
+/* ----------------------------------------------------------------
+ *						Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanEstimate
+ *
+ *		estimates the space required to serialize indexscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanEstimate(IndexScanState *node,
+					  ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+													 estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeDSM
+ *
+ *		Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeDSM(IndexScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+	index_parallelscan_initialize(node->ss.ss_currentRelation,
+								  node->iss_RelationDesc,
+								  estate->es_snapshot,
+								  piscan);
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeWorker
+ *
+ *		Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc)
+{
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5c18987..20934fd 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -126,7 +126,6 @@ static void subquery_push_qual(Query *subquery,
 static void recurse_push_qual(Node *setOp, Query *topquery,
 				  RangeTblEntry *rte, Index rti, Node *qual);
 static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
-static int	compute_parallel_worker(RelOptInfo *rel, BlockNumber pages);
 
 
 /*
@@ -2879,7 +2878,7 @@ remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
  * relation.  "pages" is the number of pages from the relation that we
  * expect to scan.
  */
-static int
+int
 compute_parallel_worker(RelOptInfo *rel, BlockNumber pages)
 {
 	int			parallel_workers;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 458f139..f4253d3 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -400,6 +400,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	List	   *qpquals;
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
+	Cost		cpu_run_cost = 0;
 	Cost		indexStartupCost;
 	Cost		indexTotalCost;
 	Selectivity indexSelectivity;
@@ -602,11 +603,24 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 
-	run_cost += cpu_per_tuple * tuples_fetched;
+	cpu_run_cost += cpu_per_tuple * tuples_fetched;
 
 	/* tlist eval costs are paid per output row, not per tuple scanned */
 	startup_cost += path->path.pathtarget->cost.startup;
-	run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+	cpu_run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+
+	/* Adjust costing for parallelism, if used. */
+	if (path->path.parallel_workers > 0)
+	{
+		double		parallel_divisor = get_parallel_divisor(&path->path);
+
+		path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
+
+		/* The CPU cost is divided among all the workers. */
+		cpu_run_cost /= parallel_divisor;
+	}
+
+	run_cost += cpu_run_cost;
 
 	path->path.startup_cost = startup_cost;
 	path->path.total_cost = startup_cost + run_cost;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 5283468..1905917 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -1042,8 +1042,43 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  NoMovementScanDirection,
 								  index_only_scan,
 								  outer_relids,
-								  loop_count);
+								  loop_count,
+								  0);
 		result = lappend(result, ipath);
+
+		/*
+		 * If appropriate, consider parallel index scan.  We don't allow
+		 * parallel index scan for bitmap scans.
+		 */
+		if (index->amcanparallel &&
+			!index_only_scan &&
+			rel->consider_parallel &&
+			outer_relids == NULL &&
+			scantype != ST_BITMAPSCAN)
+		{
+			int			parallel_workers = 0;
+
+			parallel_workers = compute_parallel_worker(rel, index->pages);
+
+			if (parallel_workers > 0)
+			{
+				ipath = create_index_path(root, index,
+										  index_clauses,
+										  clause_columns,
+										  orderbyclauses,
+										  orderbyclausecols,
+										  useful_pathkeys,
+										  index_is_ordered ?
+										  ForwardScanDirection :
+										  NoMovementScanDirection,
+										  index_only_scan,
+										  outer_relids,
+										  loop_count,
+										  parallel_workers);
+
+				add_partial_path(rel, (Path *) ipath);
+			}
+		}
 	}
 
 	/*
@@ -1066,8 +1101,38 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  BackwardScanDirection,
 									  index_only_scan,
 									  outer_relids,
-									  loop_count);
+									  loop_count,
+									  0);
 			result = lappend(result, ipath);
+
+			/* If appropriate, consider parallel index scan */
+			if (index->amcanparallel &&
+				!index_only_scan &&
+				rel->consider_parallel &&
+				outer_relids == NULL &&
+				scantype != ST_BITMAPSCAN)
+			{
+				int			parallel_workers = 0;
+
+				parallel_workers = compute_parallel_worker(rel, index->pages);
+
+				if (parallel_workers > 0)
+				{
+					ipath = create_index_path(root, index,
+											  index_clauses,
+											  clause_columns,
+											  NIL,
+											  NIL,
+											  useful_pathkeys,
+											  BackwardScanDirection,
+											  index_only_scan,
+											  outer_relids,
+											  loop_count,
+											  parallel_workers);
+
+					add_partial_path(rel, (Path *) ipath);
+				}
+			}
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4b5902f..fd25749 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5393,7 +5393,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	indexScanPath = create_index_path(root, indexInfo,
 									  NIL, NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
-									  NULL, 1.0);
+									  NULL, 1.0, 0);
 
 	return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f440875..cdf5523 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -744,10 +744,8 @@ add_path_precheck(RelOptInfo *parent_rel,
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
  *	  such paths yet.  Hence, GatherPaths must not be created for a rel until
- *	  we're done creating all partial paths for it.  We do not currently build
- *	  partial indexscan paths, so there is no need for an exception for
- *	  IndexPaths here; for safety, we instead Assert that a path to be freed
- *	  isn't an IndexPath.
+ *	  we're done creating all partial paths for it.  As in add_path, we make
+ *	  an exception for IndexPaths here and do not free them.
  */
 void
 add_partial_path(RelOptInfo *parent_rel, Path *new_path)
@@ -826,9 +824,12 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		{
 			parent_rel->partial_pathlist =
 				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
-			/* we should not see IndexPaths here, so always safe to delete */
-			Assert(!IsA(old_path, IndexPath));
-			pfree(old_path);
+
+			/*
+			 * Delete the data pointed to by the deleted cell, if possible.
+			 */
+			if (!IsA(old_path, IndexPath))
+				pfree(old_path);
 			/* p1_prev does not advance */
 		}
 		else
@@ -860,10 +861,9 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	}
 	else
 	{
-		/* we should not see IndexPaths here, so always safe to delete */
-		Assert(!IsA(new_path, IndexPath));
 		/* Reject and recycle the new path */
-		pfree(new_path);
+		if (!IsA(new_path, IndexPath))
+			pfree(new_path);
 	}
 }
 
@@ -1019,7 +1019,8 @@ create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count)
+				  double loop_count,
+				  int parallel_workers)
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
@@ -1031,9 +1032,9 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.pathtarget = rel->reltarget;
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
-	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_aware = (parallel_workers > 0);
 	pathnode->path.parallel_safe = rel->consider_parallel;
-	pathnode->path.parallel_workers = 0;
+	pathnode->path.parallel_workers = parallel_workers;
 	pathnode->path.pathkeys = pathkeys;
 
 	/* Convert clauses to indexquals the executor can handle */
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 7836e6b..4ed2705 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -241,6 +241,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->amoptionalkey = amroutine->amoptionalkey;
 			info->amsearcharray = amroutine->amsearcharray;
 			info->amsearchnulls = amroutine->amsearchnulls;
+			info->amcanparallel = amroutine->amcanparallel;
 			info->amhasgettuple = (amroutine->amgettuple != NULL);
 			info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
 			info->amcostestimate = amroutine->amcostestimate;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index e91e41d..ed62ac1 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -187,6 +187,8 @@ typedef struct IndexAmRoutine
 	bool		amclusterable;
 	/* does AM handle predicate locks? */
 	bool		ampredlocks;
+	/* does AM support parallel scan? */
+	bool		amcanparallel;
 	/* type of data stored in index, or InvalidOid if variable */
 	Oid			amkeytype;
 
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 46d6f45..ea3f3a5 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -14,6 +14,7 @@
 #ifndef NODEINDEXSCAN_H
 #define NODEINDEXSCAN_H
 
+#include "access/parallel.h"
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
@@ -22,6 +23,9 @@ extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
 extern void ExecReScanIndexScan(IndexScanState *node);
+extern void ExecIndexScanEstimate(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeDSM(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc);
 
 /*
  * These routines are exported to share code with nodeIndexonlyscan.c and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f9bcdd6..0048aa9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1359,6 +1359,7 @@ typedef struct
  *		SortSupport		   for reordering ORDER BY exprs
  *		OrderByTypByVals   is the datatype of order by expression pass-by-value?
  *		OrderByTypLens	   typlens of the datatypes of order by expressions
+ *		pscan_len		   size of parallel index scan descriptor
  * ----------------
  */
 typedef struct IndexScanState
@@ -1385,6 +1386,9 @@ typedef struct IndexScanState
 	SortSupport iss_SortSupport;
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
+
+	/* This is needed for parallel index scan */
+	Size		iss_PscanLen;
 } IndexScanState;
 
 /* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 643be54..f7ac6f6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -629,6 +629,7 @@ typedef struct IndexOptInfo
 	bool		amsearchnulls;	/* can AM search for NULL/NOT NULL entries? */
 	bool		amhasgettuple;	/* does AM have amgettuple interface? */
 	bool		amhasgetbitmap; /* does AM have amgetbitmap interface? */
+	bool		amcanparallel;	/* does AM support parallel scan? */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
 	void		(*amcostestimate) ();	/* AM's cost estimator */
 } IndexOptInfo;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7b41317..c9884f2 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -47,7 +47,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count);
+				  double loop_count,
+				  int parallel_workers);
 extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						Path *bitmapqual,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 81a9be7..fabf314 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -53,6 +53,7 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 					 List *initial_rels);
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel);
+extern int	compute_parallel_worker(RelOptInfo *rel, BlockNumber pages);
 
 #ifdef OPTIMIZER_DEBUG
 extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 18e21b7..140fb3c 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -99,6 +99,29 @@ explain (costs off)
    ->  Index Only Scan using tenk1_unique1 on tenk1
 (3 rows)
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+                             QUERY PLAN                             
+--------------------------------------------------------------------
+ Finalize Aggregate
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial Aggregate
+               ->  Parallel Index Scan using tenk1_hundred on tenk1
+                     Index Cond: (hundred > 1)
+(6 rows)
+
+select  count((unique1)) from tenk1 where hundred > 1;
+ count 
+-------
+  9800
+(1 row)
+
+reset enable_seqscan;
+reset enable_bitmapscan;
 set force_parallel_mode=1;
 explain (costs off)
   select stringu1::int2 from tenk1 where unique1 = 1;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 8b4090f..d7dfd28 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -39,6 +39,17 @@ explain (costs off)
 	select  sum(parallel_restricted(unique1)) from tenk1
 	group by(parallel_restricted(unique1));
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+select  count((unique1)) from tenk1 where hundred > 1;
+
+reset enable_seqscan;
+reset enable_bitmapscan;
+
 set force_parallel_mode=1;
 
 explain (costs off)
#50Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#44)
Re: Parallel Index Scans

On Fri, Jan 20, 2017 at 7:29 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Thu, Jan 19, 2017 at 1:18 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

+extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool
*status);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber
scan_page);

Any better names for the above functions, as these functions will
provide/set the next page that needs to be read?

These functions also set the state of the scan. IIRC, these names were
agreed upon by Robert and Rahila as well (suggested offlist by
Robert). I am open to change if you or others have any better
suggestions.

I didn't find any better names other than the following,

_bt_get_next_parallel_page
_bt_set_next_parallel_page

I am not sure using *_next_* here will convey the message, because for
backward scans we set the last page. I am open to changing the names
of the functions if the committer and/or others prefer the names
suggested by you.
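
To make the discussion concrete, here is a minimal sketch of how a
participating backend might use this pair of functions for a forward
scan.  The signatures are taken from the declarations quoted above; the
loop structure and the helpers read_leaf_page(), right_sibling_of() and
process_tuples_on() are purely illustrative, not the patch's actual code,
and scan is the usual IndexScanDesc from the surrounding context.

	/*
	 * Illustrative sketch only: one participant's view of a forward
	 * parallel btree scan using the seize/release pair quoted above.
	 */
	for (;;)
	{
		bool		status;
		BlockNumber blkno;
		Page		page;

		/* ask the shared scan state for the next leaf page to process */
		blkno = _bt_parallel_seize(scan, &status);
		if (!status)
			break;				/* no more pages to hand out */

		page = read_leaf_page(scan, blkno); /* hypothetical helper */

		/*
		 * Publish the page that should be read next (the right sibling
		 * for a forward scan) so that a waiting participant can be woken
		 * up and proceed while we are still returning tuples from this
		 * page.
		 */
		_bt_parallel_release(scan, right_sibling_of(page)); /* hypothetical */

		process_tuples_on(scan, page);		/* hypothetical helper */
	}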

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#51Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#49)
Re: Parallel Index Scans

On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

In spite of being careful, I missed reorganizing the functions in
genam.h which I have done in attached patch.

Cool. Committed parallel-generic-index-scan.2.patch. Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#52Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Amit Kapila (#50)
Re: Parallel Index Scans

On Mon, Jan 23, 2017 at 5:07 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Jan 20, 2017 at 7:29 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Thu, Jan 19, 2017 at 1:18 AM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

+extern BlockNumber _bt_parallel_seize(IndexScanDesc scan, bool
*status);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber
scan_page);

Any better names for the above functions, as these functions will
provide/set the next page that needs to be read?

These functions also set the state of the scan. IIRC, these names were
agreed upon by Robert and Rahila as well (suggested offlist by
Robert). I am open to change if you or others have any better
suggestions.

I didn't find any better names other than the following,

_bt_get_next_parallel_page
_bt_set_next_parallel_page

I am not sure using *_next_* here will convey the message, because for
backward scans we set the last page. I am open to changing the names
of the functions if the committer and/or others prefer the names
suggested by you.

OK. I am fine with it.
I don't have any other comments on the patch.

Regards,
Hari Babu
Fujitsu Australia

#53Amit Kapila
amit.kapila16@gmail.com
In reply to: Haribabu Kommi (#52)
Re: Parallel Index Scans

On Fri, Jan 27, 2017 at 6:53 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

On Mon, Jan 23, 2017 at 5:07 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Fri, Jan 20, 2017 at 7:29 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:

I didn't find any better names other than the following,

_bt_get_next_parallel_page
_bt_set_next_parallel_page

I am not sure using *_next_* here will convey the message, because for
backward scans we set the last page. I am open to changing the names
of the functions if the committer and/or others prefer the names
suggested by you.

OK. I am fine with it.
I don't have any other comments on the patch.

Thanks for the review.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#54Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#49)
Re: Parallel Index Scans

On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

parallel_index_opt_exec_support_v6 - Removed the usage of
GatherSupportsBackwardScan. Expanded the comments in
ExecReScanIndexScan.

I looked through this and in general it looks reasonable to me.
However, I did notice one thing that I think is wrong. In the
parallel bitmap heap scan patch, the second argument to
compute_parallel_worker() is the number of pages that the parallel
scan is expected to fetch from the heap. In this patch, it's the
total number of pages in the index. The former seems to me to be
better, because the point of having a threshold relation size for
parallelism is that we don't want to use a lot of workers to scan a
small number of pages -- the distribution of work won't be even, and
the potential savings are limited. If we've got a big index but are
using a very selective qual to pull out only one or a small number of
rows on a single page or a small handful of pages, we shouldn't
generate a parallel path for that.

Now, against that theory, the GUC that controls the behavior of
compute_parallel_worker() is called min_parallel_relation_size, which
might make you think that the decision is supposed to be based on the
whole size of some relation. But I think that just means we need to
rename the GUC to something like min_parallel_scan_size. Possibly we
also ought to consider reducing the default value somewhat, because it
seems like both sequential and index scans can benefit even when
scanning less than 8MB.
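
To make that concrete, here is a simplified sketch of the shape of that
heuristic.  It is not a verbatim copy of compute_parallel_worker(); it
assumes the usual PostgreSQL headers plus the existing
min_parallel_relation_size and max_parallel_workers_per_gather GUC
variables, and it ignores the per-relation parallel_workers reloption.

	/*
	 * Simplified sketch: no workers below the threshold, then one more
	 * worker each time the page count triples, capped by the GUC limit.
	 */
	static int
	sketch_compute_parallel_worker(BlockNumber pages)
	{
		int			parallel_workers;
		int			parallel_threshold;

		if (pages < (BlockNumber) min_parallel_relation_size)
			return 0;

		parallel_workers = 1;
		parallel_threshold = Max(min_parallel_relation_size, 1);

		while (pages >= (BlockNumber) (parallel_threshold * 3))
		{
			parallel_workers++;
			parallel_threshold *= 3;
			if (parallel_threshold > INT_MAX / 3)
				break;			/* avoid integer overflow */
		}

		return Min(parallel_workers, max_parallel_workers_per_gather);
	}

Renaming the GUC as suggested above would not change this shape, only
what the threshold is called.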

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#55Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#54)
2 attachment(s)
Re: Parallel Index Scans

On Sat, Jan 28, 2017 at 1:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

parallel_index_opt_exec_support_v6 - Removed the usage of
GatherSupportsBackwardScan. Expanded the comments in
ExecReScanIndexScan.

I looked through this and in general it looks reasonable to me.
However, I did notice one thing that I think is wrong. In the
parallel bitmap heap scan patch, the second argument to
compute_parallel_worker() is the number of pages that the parallel
scan is expected to fetch from the heap. In this patch, it's the
total number of pages in the index. The former seems to me to be
better, because the point of having a threshold relation size for
parallelism is that we don't want to use a lot of workers to scan a
small number of pages -- the distribution of work won't be even, and
the potential savings are limited. If we've got a big index but are
using a very selective qual to pull out only one or a small number of
rows on a single page or a small handful of pages, we shouldn't
generate a parallel path for that.

Agreed that it makes sense to consider only the number of pages to
scan when computing the number of parallel workers. I think for an
index scan we should consider both the index and heap pages that need
to be scanned (the costing of an index scan considers both index and
heap pages). I think considering heap pages matters more when the
finally selected rows are scattered across heap pages or we need to
apply a filter on the rows after fetching them from the heap. OTOH, we
could consider just the pages in the index, as that is where the
parallelism mainly works. In the attached patch
(parallel_index_opt_exec_support_v7.patch), I have considered both
index and heap pages; let me know if you think some other way is
better. I have also prepared a separate independent patch
(compute_index_pages_v1) on HEAD to compute the index pages, which can
be used by the parallel index scan. There is no change in the
parallel_index_scan (parallel btree scan) patch, so I am not attaching
its new version.
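
Schematically, the page estimate that the attached patch feeds into
compute_parallel_worker() looks like the snippet below.  See
compute_index_and_heap_pages() and create_partial_index_path() in the
attachments for the real code; this sketch omits the nestloop loop_count
adjustment and assumes the planner variables root, index, baserel, rel,
indexquals and indexSelectivity from the surrounding context.

	double		index_pages;
	double		heap_pages;
	double		tuples_fetched;
	double		pages_for_workers;
	int			parallel_workers;

	/* leaf pages expected to be visited in the index */
	index_pages = ceil(indexSelectivity * index->pages);

	/* heap pages, adjusted for caching via the Mackert-Lohman formula */
	tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);
	heap_pages = index_pages_fetched(tuples_fetched, baserel->pages,
									 (double) index->pages, root);

	/* cap at the index size, since that is where the parallelism happens */
	pages_for_workers = Min(index_pages + heap_pages, index->pages);

	parallel_workers = compute_parallel_worker(rel, pages_for_workers);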

Now, against that theory, the GUC that controls the behavior of
compute_parallel_worker() is called min_parallel_relation_size, which
might make you think that the decision is supposed to be based on the
whole size of some relation. But I think that just means we need to
rename the GUC to something like min_parallel_scan_size. Possibly we
also ought to consider reducing the default value somewhat, because it
seems like both sequential and index scans can benefit even when
scanning less than 8MB.

Agreed, but let's consider it separately.

The order in which the patches need to be applied:
compute_index_pages_v1.patch, parallel_index_scan_v6.patch[1]/messages/by-id/CAA4eK1J=LSBpDx7i_izGJxGVUryqPe-2SKT02De-PrQvywiMxw@mail.gmail.com,
parallel_index_opt_exec_support_v7.patch

[1]: /messages/by-id/CAA4eK1J=LSBpDx7i_izGJxGVUryqPe-2SKT02De-PrQvywiMxw@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

compute_index_pages_v1.patch (application/octet-stream)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a43daa7..ef125b4 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4785,6 +4785,179 @@ get_parallel_divisor(Path *path)
 }
 
 /*
+ * compute_index_pages
+ *
+ * compute number of pages fetched from index in index scan.
+ */
+double
+compute_index_pages(PlannerInfo *root, IndexOptInfo *index,
+					List *indexQuals, int loop_count, double numIndexTuples,
+					double *random_page_cost, double *sa_scans,
+					double *indexTuples, double *indexPages,
+					Selectivity *indexSel, Cost *cost)
+{
+	List	   *selectivityQuals;
+	double		pages_fetched;
+	double		num_sa_scans;
+	double		num_outer_scans;
+	double		num_scans;
+	double		spc_random_page_cost;
+	double		numIndexPages;
+	Selectivity indexSelectivity;
+	Cost		indexTotalCost;
+	ListCell   *l;
+
+	/*
+	 * If the index is partial, AND the index predicate with the explicitly
+	 * given indexquals to produce a more accurate idea of the index
+	 * selectivity.
+	 */
+	selectivityQuals = add_predicate_to_quals(index, indexQuals);
+
+	/*
+	 * Check for ScalarArrayOpExpr index quals, and estimate the number of
+	 * index scans that will be performed.
+	 */
+	num_sa_scans = 1;
+	foreach(l, indexQuals)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+
+		if (IsA(rinfo->clause, ScalarArrayOpExpr))
+		{
+			ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
+			int			alength = estimate_array_length(lsecond(saop->args));
+
+			if (alength > 1)
+				num_sa_scans *= alength;
+		}
+	}
+
+	/* Estimate the fraction of main-table tuples that will be visited */
+	indexSelectivity = clauselist_selectivity(root, selectivityQuals,
+											  index->rel->relid,
+											  JOIN_INNER,
+											  NULL);
+
+	/*
+	 * If caller didn't give us an estimate, estimate the number of index
+	 * tuples that will be visited.  We do it in this rather peculiar-looking
+	 * way in order to get the right answer for partial indexes.
+	 */
+	if (numIndexTuples <= 0.0)
+	{
+		numIndexTuples = indexSelectivity * index->rel->tuples;
+
+		/*
+		 * The above calculation counts all the tuples visited across all
+		 * scans induced by ScalarArrayOpExpr nodes.  We want to consider the
+		 * average per-indexscan number, so adjust.  This is a handy place to
+		 * round to integer, too.  (If caller supplied tuple estimate, it's
+		 * responsible for handling these considerations.)
+		 */
+		numIndexTuples = rint(numIndexTuples / num_sa_scans);
+	}
+
+	/*
+	 * We can bound the number of tuples by the index size in any case. Also,
+	 * always estimate at least one tuple is touched, even when
+	 * indexSelectivity estimate is tiny.
+	 */
+	if (numIndexTuples > index->tuples)
+		numIndexTuples = index->tuples;
+	if (numIndexTuples < 1.0)
+		numIndexTuples = 1.0;
+
+	/*
+	 * Estimate the number of index pages that will be retrieved.
+	 *
+	 * We use the simplistic method of taking a pro-rata fraction of the total
+	 * number of index pages.  In effect, this counts only leaf pages and not
+	 * any overhead such as index metapage or upper tree levels.
+	 *
+	 * In practice access to upper index levels is often nearly free because
+	 * those tend to stay in cache under load; moreover, the cost involved is
+	 * highly dependent on index type.  We therefore ignore such costs here
+	 * and leave it to the caller to add a suitable charge if needed.
+	 */
+	if (index->pages > 1 && index->tuples > 1)
+		numIndexPages = ceil(numIndexTuples * index->pages / index->tuples);
+	else
+		numIndexPages = 1.0;
+
+	/* fetch estimated page cost for tablespace containing index */
+	get_tablespace_page_costs(index->reltablespace,
+							  &spc_random_page_cost,
+							  NULL);
+
+	/*
+	 * Now compute the disk access costs.
+	 *
+	 * The above calculations are all per-index-scan.  However, if we are in a
+	 * nestloop inner scan, we can expect the scan to be repeated (with
+	 * different search keys) for each row of the outer relation.  Likewise,
+	 * ScalarArrayOpExpr quals result in multiple index scans.  This creates
+	 * the potential for cache effects to reduce the number of disk page
+	 * fetches needed.  We want to estimate the average per-scan I/O cost in
+	 * the presence of caching.
+	 *
+	 * We use the Mackert-Lohman formula (see costsize.c for details) to
+	 * estimate the total number of page fetches that occur.  While this
+	 * wasn't what it was designed for, it seems a reasonable model anyway.
+	 * Note that we are counting pages not tuples anymore, so we take N = T =
+	 * index size, as if there were one "tuple" per page.
+	 */
+	num_outer_scans = loop_count;
+	num_scans = num_sa_scans * num_outer_scans;
+
+	if (num_scans > 1)
+	{
+		/* total page fetches ignoring cache effects */
+		pages_fetched = numIndexPages * num_scans;
+
+		/* use Mackert and Lohman formula to adjust for cache effects */
+		pages_fetched = index_pages_fetched(pages_fetched,
+											index->pages,
+											(double) index->pages,
+											root);
+
+		/*
+		 * Now compute the total disk access cost, and then report a pro-rated
+		 * share for each outer scan.  (Don't pro-rate for ScalarArrayOpExpr,
+		 * since that's internal to the indexscan.)
+		 */
+		indexTotalCost = (pages_fetched * spc_random_page_cost)
+			/ num_outer_scans;
+	}
+	else
+	{
+		pages_fetched = numIndexPages;
+
+		/*
+		 * For a single index scan, we just charge spc_random_page_cost per
+		 * page touched.
+		 */
+		indexTotalCost = numIndexPages * spc_random_page_cost;
+	}
+
+	/* fill the output parameters */
+	if (random_page_cost)
+		*random_page_cost = spc_random_page_cost;
+	if (sa_scans)
+		*sa_scans = num_sa_scans;
+	if (indexTuples)
+		*indexTuples = numIndexTuples;
+	if (indexSel)
+		*indexSel = indexSelectivity;
+	if (indexPages)
+		*indexPages = numIndexPages;
+	if (cost)
+		*cost = indexTotalCost;
+
+	return pages_fetched;
+}
+
+/*
  * compute_bitmap_pages
  *
  * compute number of pages fetched from heap in bitmap heap scan.
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 5283468..5d22496 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -863,6 +863,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	IndexPath  *ipath;
 	List	   *index_clauses;
 	List	   *clause_columns;
+	List	   *indexquals;
+	List	   *indexqualcols;
 	Relids		outer_relids;
 	double		loop_count;
 	List	   *orderbyclauses;
@@ -1014,6 +1016,11 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 		orderbyclausecols = NIL;
 	}
 
+	/* Convert clauses to indexquals the executor can handle */
+	expand_indexqual_conditions(index, index_clauses, clause_columns,
+								&indexquals, &indexqualcols);
+
+
 	/*
 	 * 3. Check if an index-only scan is possible.  If we're not building
 	 * plain indexscans, this isn't relevant since bitmap scans don't support
@@ -1034,6 +1041,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 		ipath = create_index_path(root, index,
 								  index_clauses,
 								  clause_columns,
+								  indexquals,
+								  indexqualcols,
 								  orderbyclauses,
 								  orderbyclausecols,
 								  useful_pathkeys,
@@ -1060,6 +1069,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 			ipath = create_index_path(root, index,
 									  index_clauses,
 									  clause_columns,
+									  indexquals,
+									  indexqualcols,
 									  NIL,
 									  NIL,
 									  useful_pathkeys,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 4b5902f..bf6550b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5391,7 +5391,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of index scan */
 	indexScanPath = create_index_path(root, indexInfo,
-									  NIL, NIL, NIL, NIL, NIL,
+									  NIL, NIL, NIL, NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
 									  NULL, 1.0);
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f440875..8cd80db 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1013,6 +1013,8 @@ create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
 				  List *indexclausecols,
+				  List *indexquals,
+				  List *indexqualcols,
 				  List *indexorderbys,
 				  List *indexorderbycols,
 				  List *pathkeys,
@@ -1023,8 +1025,6 @@ create_index_path(PlannerInfo *root,
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
-	List	   *indexquals,
-			   *indexqualcols;
 
 	pathnode->path.pathtype = indexonly ? T_IndexOnlyScan : T_IndexScan;
 	pathnode->path.parent = rel;
@@ -1036,10 +1036,6 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.parallel_workers = 0;
 	pathnode->path.pathkeys = pathkeys;
 
-	/* Convert clauses to indexquals the executor can handle */
-	expand_indexqual_conditions(index, indexclauses, indexclausecols,
-								&indexquals, &indexqualcols);
-
 	/* Fill in the pathnode */
 	pathnode->indexinfo = index;
 	pathnode->indexclauses = indexclauses;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fa32e9e..6762bf9 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -205,7 +205,6 @@ static Selectivity regex_selectivity(const char *patt, int pattlen,
 static Datum string_to_datum(const char *str, Oid datatype);
 static Const *string_to_const(const char *str, Oid datatype);
 static Const *string_to_bytea_const(const char *str, size_t str_len);
-static List *add_predicate_to_quals(IndexOptInfo *index, List *indexQuals);
 
 
 /*
@@ -6245,146 +6244,13 @@ genericcostestimate(PlannerInfo *root,
 	double		numIndexTuples;
 	double		spc_random_page_cost;
 	double		num_sa_scans;
-	double		num_outer_scans;
-	double		num_scans;
 	double		qual_op_cost;
 	double		qual_arg_cost;
-	List	   *selectivityQuals;
-	ListCell   *l;
-
-	/*
-	 * If the index is partial, AND the index predicate with the explicitly
-	 * given indexquals to produce a more accurate idea of the index
-	 * selectivity.
-	 */
-	selectivityQuals = add_predicate_to_quals(index, indexQuals);
-
-	/*
-	 * Check for ScalarArrayOpExpr index quals, and estimate the number of
-	 * index scans that will be performed.
-	 */
-	num_sa_scans = 1;
-	foreach(l, indexQuals)
-	{
-		RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
-
-		if (IsA(rinfo->clause, ScalarArrayOpExpr))
-		{
-			ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
-			int			alength = estimate_array_length(lsecond(saop->args));
-
-			if (alength > 1)
-				num_sa_scans *= alength;
-		}
-	}
-
-	/* Estimate the fraction of main-table tuples that will be visited */
-	indexSelectivity = clauselist_selectivity(root, selectivityQuals,
-											  index->rel->relid,
-											  JOIN_INNER,
-											  NULL);
-
-	/*
-	 * If caller didn't give us an estimate, estimate the number of index
-	 * tuples that will be visited.  We do it in this rather peculiar-looking
-	 * way in order to get the right answer for partial indexes.
-	 */
-	numIndexTuples = costs->numIndexTuples;
-	if (numIndexTuples <= 0.0)
-	{
-		numIndexTuples = indexSelectivity * index->rel->tuples;
-
-		/*
-		 * The above calculation counts all the tuples visited across all
-		 * scans induced by ScalarArrayOpExpr nodes.  We want to consider the
-		 * average per-indexscan number, so adjust.  This is a handy place to
-		 * round to integer, too.  (If caller supplied tuple estimate, it's
-		 * responsible for handling these considerations.)
-		 */
-		numIndexTuples = rint(numIndexTuples / num_sa_scans);
-	}
-
-	/*
-	 * We can bound the number of tuples by the index size in any case. Also,
-	 * always estimate at least one tuple is touched, even when
-	 * indexSelectivity estimate is tiny.
-	 */
-	if (numIndexTuples > index->tuples)
-		numIndexTuples = index->tuples;
-	if (numIndexTuples < 1.0)
-		numIndexTuples = 1.0;
-
-	/*
-	 * Estimate the number of index pages that will be retrieved.
-	 *
-	 * We use the simplistic method of taking a pro-rata fraction of the total
-	 * number of index pages.  In effect, this counts only leaf pages and not
-	 * any overhead such as index metapage or upper tree levels.
-	 *
-	 * In practice access to upper index levels is often nearly free because
-	 * those tend to stay in cache under load; moreover, the cost involved is
-	 * highly dependent on index type.  We therefore ignore such costs here
-	 * and leave it to the caller to add a suitable charge if needed.
-	 */
-	if (index->pages > 1 && index->tuples > 1)
-		numIndexPages = ceil(numIndexTuples * index->pages / index->tuples);
-	else
-		numIndexPages = 1.0;
 
-	/* fetch estimated page cost for tablespace containing index */
-	get_tablespace_page_costs(index->reltablespace,
-							  &spc_random_page_cost,
-							  NULL);
-
-	/*
-	 * Now compute the disk access costs.
-	 *
-	 * The above calculations are all per-index-scan.  However, if we are in a
-	 * nestloop inner scan, we can expect the scan to be repeated (with
-	 * different search keys) for each row of the outer relation.  Likewise,
-	 * ScalarArrayOpExpr quals result in multiple index scans.  This creates
-	 * the potential for cache effects to reduce the number of disk page
-	 * fetches needed.  We want to estimate the average per-scan I/O cost in
-	 * the presence of caching.
-	 *
-	 * We use the Mackert-Lohman formula (see costsize.c for details) to
-	 * estimate the total number of page fetches that occur.  While this
-	 * wasn't what it was designed for, it seems a reasonable model anyway.
-	 * Note that we are counting pages not tuples anymore, so we take N = T =
-	 * index size, as if there were one "tuple" per page.
-	 */
-	num_outer_scans = loop_count;
-	num_scans = num_sa_scans * num_outer_scans;
-
-	if (num_scans > 1)
-	{
-		double		pages_fetched;
-
-		/* total page fetches ignoring cache effects */
-		pages_fetched = numIndexPages * num_scans;
-
-		/* use Mackert and Lohman formula to adjust for cache effects */
-		pages_fetched = index_pages_fetched(pages_fetched,
-											index->pages,
-											(double) index->pages,
-											root);
-
-		/*
-		 * Now compute the total disk access cost, and then report a pro-rated
-		 * share for each outer scan.  (Don't pro-rate for ScalarArrayOpExpr,
-		 * since that's internal to the indexscan.)
-		 */
-		indexTotalCost = (pages_fetched * spc_random_page_cost)
-			/ num_outer_scans;
-	}
-	else
-	{
-		/*
-		 * For a single index scan, we just charge spc_random_page_cost per
-		 * page touched.
-		 */
-		indexTotalCost = numIndexPages * spc_random_page_cost;
-	}
+	(void) compute_index_pages(root, index, indexQuals, loop_count,
+							   costs->numIndexTuples, &spc_random_page_cost,
+							   &num_sa_scans, &numIndexTuples, &numIndexPages,
+							   &indexSelectivity, &indexTotalCost);
 
 	/*
 	 * CPU cost: any complex expressions in the indexquals will need to be
@@ -6446,7 +6312,7 @@ genericcostestimate(PlannerInfo *root,
  * predicate_implied_by() and clauselist_selectivity(), but might be
  * problematic if the result were passed to other things.
  */
-static List *
+List *
 add_predicate_to_quals(IndexOptInfo *index, List *indexQuals)
 {
 	List	   *predExtraQuals = NIL;
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 0e68264..8525917 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -183,6 +183,10 @@ extern void set_cte_size_estimates(PlannerInfo *root, RelOptInfo *rel,
 					   double cte_rows);
 extern void set_foreign_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern PathTarget *set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target);
+extern double compute_index_pages(PlannerInfo *root, IndexOptInfo *index,
+					List *indexclauses, int loop_count, double numIndexTuples,
+			 double *random_page_cost, double *sa_scans, double *indexTuples,
+					double *indexPages, Selectivity *indexSel, Cost *cost);
 extern double compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel,
 				  Path *bitmapqual, int loop_count, Cost *cost, double *tuple);
 
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7b41317..44e143c 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -41,6 +41,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
 				  List *indexclausecols,
+				  List *indexquals,
+				  List *indexqualcols,
 				  List *indexorderbys,
 				  List *indexorderbycols,
 				  List *pathkeys,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 9f9d2dc..b1ffca3 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -212,6 +212,7 @@ extern void genericcostestimate(PlannerInfo *root, IndexPath *path,
 					double loop_count,
 					List *qinfos,
 					GenericCosts *costs);
+extern List *add_predicate_to_quals(IndexOptInfo *index, List *indexQuals);
 
 /* Functions in array_selfuncs.c */
 
parallel_index_opt_exec_support_v7.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 858798d..f2eda67 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -119,6 +119,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = blbuild;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 5d8e557..623be4e 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -110,6 +110,8 @@ typedef struct IndexAmRoutine
     bool        amclusterable;
     /* does AM handle predicate locks? */
     bool        ampredlocks;
+    /* does AM support parallel scan? */
+    bool        amcanparallel;
     /* type of data stored in index, or InvalidOid if variable */
     Oid         amkeytype;
 
diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index 5d4bb21..cfc3435 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -268,15 +268,28 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
   <title>Parallel Scans</title>
 
   <para>
-    Currently, the only type of scan which has been modified to work with
-    parallel query is a sequential scan.  Therefore, the driving table in
-    a parallel plan will always be scanned using a
-    <literal>Parallel Seq Scan</>.  The relation's blocks will be divided
-    among the cooperating processes.  Blocks are handed out one at a
-    time, so that access to the relation remains sequential.  Each process
-    will visit every tuple on the page assigned to it before requesting a new
-    page.
+    Currently, the types of scans that work with parallel query are sequential
+    scans and index scans.
   </para>
+   
+  <para>
+    In <literal>Parallel Sequential Scans</>, the driving table in a parallel plan
+    will always be scanned using a <literal>Parallel Seq Scan</>.  The relation's
+    blocks will be divided among the cooperating processes.  Blocks are handed out
+    one at a time, so that access to the relation remains sequential.  Each process
+    will visit every tuple on the page assigned to it before requesting a new page.
+  </para>
+
+  <para>
+    In <literal>Parallel Index Scans</>, the driving table in a parallel plan will
+    always be scanned using a <literal>Parallel Index Scan</>.  Currently, the only
+    type of index which has been modified to work with parallel query is
+    <literal>btree</>.  The parallelism is performed at the leaf level of the
+    <literal>btree</>.  The first backend (either the leader or a worker) to start
+    the scan descends to a leaf page, and the others wait until it has reached the
+    leaf level.  At the leaf level, blocks are handed out one at a time, much as in
+    a <literal>Parallel Seq Scan</>, until all blocks have been scanned or the scan
+    reaches its end point.
+   </para>
  </sect2>
 
  <sect2 id="parallel-joins">
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index b2afdb7..9ae1c39 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -93,6 +93,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 02d920b..4feb524 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index c2247ad..b2c9389 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -70,6 +70,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index ec8ed33..1944452 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -67,6 +67,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = INT4OID;
 
 	amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a1de4a2..7f0f0e2 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -141,6 +141,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
+	amroutine->amcanparallel = true;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 78846be..e57ac49 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -49,6 +49,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = spgbuild;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e01fe6d..cb91cc3 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -28,6 +28,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeSeqscan.h"
+#include "executor/nodeIndexscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -197,6 +198,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 				ExecSeqScanEstimate((SeqScanState *) planstate,
 									e->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanEstimate((IndexScanState *) planstate,
+									  e->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanEstimate((ForeignScanState *) planstate,
 										e->pcxt);
@@ -249,6 +254,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
 										 d->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeDSM((IndexScanState *) planstate,
+										   d->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
 											 d->pcxt);
@@ -725,6 +734,9 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 			case T_SeqScanState:
 				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeWorker((IndexScanState *) planstate, toc);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
 												toc);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 5734550..7bb594c 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -22,6 +22,9 @@
  *		ExecEndIndexScan		releases all storage.
  *		ExecIndexMarkPos		marks scan position.
  *		ExecIndexRestrPos		restores scan position.
+ *		ExecIndexScanEstimate	estimates DSM space needed for parallel index scan
+ *		ExecIndexScanInitializeDSM initialize DSM for parallel indexscan
+ *		ExecIndexScanInitializeWorker attach to DSM info in parallel worker
  */
 #include "postgres.h"
 
@@ -514,6 +517,18 @@ ExecIndexScan(IndexScanState *node)
 void
 ExecReScanIndexScan(IndexScanState *node)
 {
+	bool		reset_parallel_scan = true;
+
+	/*
+	 * If we are here to just update the scan keys, then don't reset parallel
+	 * scan.  We don't want each of the participating processes in the parallel
+	 * scan to update the shared parallel scan state at the start of the scan.
+	 * It is quite possible that one of the participants has already begun
+	 * scanning the index when another has yet to start it.
+	 */
+	if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+		reset_parallel_scan = false;
+
 	/*
 	 * If we are doing runtime key calculations (ie, any of the index key
 	 * values weren't simple Consts), compute the new key values.  But first,
@@ -539,10 +554,21 @@ ExecReScanIndexScan(IndexScanState *node)
 			reorderqueue_pop(node);
 	}
 
-	/* reset index scan */
-	index_rescan(node->iss_ScanDesc,
-				 node->iss_ScanKeys, node->iss_NumScanKeys,
-				 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+	/*
+	 * Reset the (parallel) index scan.  For parallel-aware nodes, the scan
+	 * descriptor is initialized during actual execution of the node and we
+	 * can reach here before that (e.g., during execution of a nestloop
+	 * join).  So, avoid updating the scan descriptor in that case.
+	 */
+	if (node->iss_ScanDesc)
+	{
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		if (reset_parallel_scan)
+			index_parallelrescan(node->iss_ScanDesc);
+	}
 	node->iss_ReachedEnd = false;
 
 	ExecScanReScan(&node->ss);
@@ -1013,22 +1039,29 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Initialize scan descriptor.
+	 * For a parallel-aware node, we initialize the scan descriptor after
+	 * initializing the shared memory for parallel execution.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
-											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot,
-											   indexstate->iss_NumScanKeys,
+	if (!node->scan.plan.parallel_aware)
+	{
+		/*
+		 * Initialize scan descriptor.
+		 */
+		indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+												indexstate->iss_RelationDesc,
+												   estate->es_snapshot,
+												 indexstate->iss_NumScanKeys,
 											 indexstate->iss_NumOrderByKeys);
 
-	/*
-	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
-	 * index AM.
-	 */
-	if (indexstate->iss_NumRuntimeKeys == 0)
-		index_rescan(indexstate->iss_ScanDesc,
-					 indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
+		/*
+		 * If no run-time keys to calculate, go ahead and pass the scankeys to
+		 * the index AM.
+		 */
+		if (indexstate->iss_NumRuntimeKeys == 0)
+			index_rescan(indexstate->iss_ScanDesc,
+					   indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
 				indexstate->iss_OrderByKeys, indexstate->iss_NumOrderByKeys);
+	}
 
 	/*
 	 * all done.
@@ -1590,3 +1623,91 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 	else if (n_array_keys != 0)
 		elog(ERROR, "ScalarArrayOpExpr index qual found where not allowed");
 }
+
+/* ----------------------------------------------------------------
+ *						Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanEstimate
+ *
+ *		estimates the space required to serialize indexscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanEstimate(IndexScanState *node,
+					  ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+													 estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeDSM
+ *
+ *		Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeDSM(IndexScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+	index_parallelscan_initialize(node->ss.ss_currentRelation,
+								  node->iss_RelationDesc,
+								  estate->es_snapshot,
+								  piscan);
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeWorker
+ *
+ *		Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc)
+{
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5c18987..20934fd 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -126,7 +126,6 @@ static void subquery_push_qual(Query *subquery,
 static void recurse_push_qual(Node *setOp, Query *topquery,
 				  RangeTblEntry *rte, Index rti, Node *qual);
 static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
-static int	compute_parallel_worker(RelOptInfo *rel, BlockNumber pages);
 
 
 /*
@@ -2879,7 +2878,7 @@ remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
  * relation.  "pages" is the number of pages from the relation that we
  * expect to scan.
  */
-static int
+int
 compute_parallel_worker(RelOptInfo *rel, BlockNumber pages)
 {
 	int			parallel_workers;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ef125b4..1561dee 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -400,6 +400,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	List	   *qpquals;
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
+	Cost		cpu_run_cost = 0;
 	Cost		indexStartupCost;
 	Cost		indexTotalCost;
 	Selectivity indexSelectivity;
@@ -602,11 +603,24 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 
-	run_cost += cpu_per_tuple * tuples_fetched;
+	cpu_run_cost += cpu_per_tuple * tuples_fetched;
 
 	/* tlist eval costs are paid per output row, not per tuple scanned */
 	startup_cost += path->path.pathtarget->cost.startup;
-	run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+	cpu_run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+
+	/* Adjust costing for parallelism, if used. */
+	if (path->path.parallel_workers > 0)
+	{
+		double		parallel_divisor = get_parallel_divisor(&path->path);
+
+		path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
+
+		/* The CPU cost is divided among all the workers. */
+		cpu_run_cost /= parallel_divisor;
+	}
+
+	run_cost += cpu_run_cost;
 
 	path->path.startup_cost = startup_cost;
 	path->path.total_cost = startup_cost + run_cost;
@@ -4958,6 +4972,43 @@ compute_index_pages(PlannerInfo *root, IndexOptInfo *index,
 }
 
 /*
+ * compute_index_and_heap_pages
+ *
+ * compute number of pages fetched from index and heap in index scan.
+ */
+double
+compute_index_and_heap_pages(PlannerInfo *root, IndexOptInfo *index,
+							 List *indexQuals, int loop_count)
+{
+	RelOptInfo *baserel = index->rel;
+	Selectivity indexSelectivity;
+	double		index_pages;
+	double		heap_pages;
+	double		tuples_fetched;
+
+	index_pages = compute_index_pages(root, index, indexQuals, loop_count,
+									  0.0, NULL, NULL, NULL, NULL,
+									  &indexSelectivity, NULL);
+
+	/* estimate number of main-table tuples fetched */
+	tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);
+
+	if (loop_count > 1)
+		tuples_fetched *= loop_count;
+
+	/*
+	 * apply the Mackert and Lohman formula to compute number of pages
+	 * required to be fetched from heap.
+	 */
+	heap_pages = index_pages_fetched(tuples_fetched,
+									 baserel->pages,
+									 (double) index->pages,
+									 root);
+
+	return index_pages + heap_pages;
+}
+
+/*
  * compute_bitmap_pages
  *
  * compute number of pages fetched from heap in bitmap heap scan.
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 5d22496..d713c62 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -813,7 +813,7 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 /*
  * build_index_paths
  *	  Given an index and a set of index clauses for it, construct zero
- *	  or more IndexPaths.
+ *	  or more IndexPaths. It also constructs zero or more partial IndexPaths.
  *
  * We return a list of paths because (1) this routine checks some cases
  * that should cause us to not generate any IndexPath, and (2) in some
@@ -1051,8 +1051,33 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  NoMovementScanDirection,
 								  index_only_scan,
 								  outer_relids,
-								  loop_count);
+								  loop_count,
+								  0);
 		result = lappend(result, ipath);
+
+		/*
+		 * If appropriate, consider parallel index scan.  We don't allow
+		 * parallel index scan for bitmap or index only scans.
+		 */
+		if (index->amcanparallel &&
+			!index_only_scan &&
+			rel->consider_parallel &&
+			outer_relids == NULL &&
+			scantype != ST_BITMAPSCAN)
+			create_partial_index_path(root, index,
+									  index_clauses,
+									  clause_columns,
+									  indexquals,
+									  indexqualcols,
+									  orderbyclauses,
+									  orderbyclausecols,
+									  useful_pathkeys,
+									  index_is_ordered ?
+									  ForwardScanDirection :
+									  NoMovementScanDirection,
+									  index_only_scan,
+									  outer_relids,
+									  loop_count);
 	}
 
 	/*
@@ -1077,8 +1102,28 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  BackwardScanDirection,
 									  index_only_scan,
 									  outer_relids,
-									  loop_count);
+									  loop_count,
+									  0);
 			result = lappend(result, ipath);
+
+			/* If appropriate, consider parallel index scan */
+			if (index->amcanparallel &&
+				!index_only_scan &&
+				rel->consider_parallel &&
+				outer_relids == NULL &&
+				scantype != ST_BITMAPSCAN)
+				create_partial_index_path(root, index,
+										  index_clauses,
+										  clause_columns,
+										  indexquals,
+										  indexqualcols,
+										  NIL,
+										  NIL,
+										  useful_pathkeys,
+										  BackwardScanDirection,
+										  index_only_scan,
+										  outer_relids,
+										  loop_count);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index bf6550b..bd08c55 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5393,7 +5393,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	indexScanPath = create_index_path(root, indexInfo,
 									  NIL, NIL, NIL, NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
-									  NULL, 1.0);
+									  NULL, 1.0, 0);
 
 	return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8cd80db..26249a3 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -744,10 +744,8 @@ add_path_precheck(RelOptInfo *parent_rel,
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
  *	  such paths yet.  Hence, GatherPaths must not be created for a rel until
- *	  we're done creating all partial paths for it.  We do not currently build
- *	  partial indexscan paths, so there is no need for an exception for
- *	  IndexPaths here; for safety, we instead Assert that a path to be freed
- *	  isn't an IndexPath.
+ *	  we're done creating all partial paths for it.  As with add_path, we make
+ *	  an exception for IndexPaths here as well.
  */
 void
 add_partial_path(RelOptInfo *parent_rel, Path *new_path)
@@ -826,9 +824,12 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		{
 			parent_rel->partial_pathlist =
 				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
-			/* we should not see IndexPaths here, so always safe to delete */
-			Assert(!IsA(old_path, IndexPath));
-			pfree(old_path);
+
+			/*
+			 * Delete the data pointed-to by the deleted cell, if possible
+			 */
+			if (!IsA(old_path, IndexPath))
+				pfree(old_path);
 			/* p1_prev does not advance */
 		}
 		else
@@ -860,10 +861,9 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	}
 	else
 	{
-		/* we should not see IndexPaths here, so always safe to delete */
-		Assert(!IsA(new_path, IndexPath));
 		/* Reject and recycle the new path */
-		pfree(new_path);
+		if (!IsA(new_path, IndexPath))
+			pfree(new_path);
 	}
 }
 
@@ -1021,7 +1021,8 @@ create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count)
+				  double loop_count,
+				  int parallel_workers)
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
@@ -1031,9 +1032,9 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.pathtarget = rel->reltarget;
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
-	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_aware = parallel_workers > 0 ? true : false;
 	pathnode->path.parallel_safe = rel->consider_parallel;
-	pathnode->path.parallel_workers = 0;
+	pathnode->path.parallel_workers = parallel_workers;
 	pathnode->path.pathkeys = pathkeys;
 
 	/* Fill in the pathnode */
@@ -1051,6 +1052,63 @@ create_index_path(PlannerInfo *root,
 }
 
 /*
+ * create_partial_index_path
+ *	  Build partial access path for a parallel index scan.
+ */
+void
+create_partial_index_path(PlannerInfo *root,
+						  IndexOptInfo *index,
+						  List *indexclauses,
+						  List *indexclausecols,
+						  List *indexquals,
+						  List *indexqualcols,
+						  List *indexorderbys,
+						  List *indexorderbycols,
+						  List *pathkeys,
+						  ScanDirection indexscandir,
+						  bool indexonly,
+						  Relids required_outer,
+						  double loop_count)
+{
+	double		pages_fetched;
+	int			parallel_workers = 0;
+	RelOptInfo *rel = index->rel;
+	IndexPath  *ipath;
+
+	/*
+	 * Use both index and heap pages (that can be scanned) with a maximum of
+	 * total index pages to compute parallel workers required.  This will work
+	 * for now, but perhaps someday, we will come up with something better.
+	 */
+	pages_fetched = compute_index_and_heap_pages(root, index,
+												 indexquals,
+												 loop_count);
+	if (pages_fetched > index->pages)
+		pages_fetched = index->pages;
+
+	parallel_workers = compute_parallel_worker(rel, pages_fetched);
+
+	if (parallel_workers <= 0)
+		return;
+
+	ipath = create_index_path(root, index,
+							  indexclauses,
+							  indexclausecols,
+							  indexquals,
+							  indexqualcols,
+							  indexorderbys,
+							  indexorderbycols,
+							  pathkeys,
+							  indexscandir,
+							  indexonly,
+							  required_outer,
+							  loop_count,
+							  parallel_workers);
+
+	add_partial_path(rel, (Path *) ipath);
+}
+
+/*
  * create_bitmap_heap_path
  *	  Creates a path node for a bitmap scan.
  *
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 7836e6b..4ed2705 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -241,6 +241,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->amoptionalkey = amroutine->amoptionalkey;
 			info->amsearcharray = amroutine->amsearcharray;
 			info->amsearchnulls = amroutine->amsearchnulls;
+			info->amcanparallel = amroutine->amcanparallel;
 			info->amhasgettuple = (amroutine->amgettuple != NULL);
 			info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
 			info->amcostestimate = amroutine->amcostestimate;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index e91e41d..ed62ac1 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -187,6 +187,8 @@ typedef struct IndexAmRoutine
 	bool		amclusterable;
 	/* does AM handle predicate locks? */
 	bool		ampredlocks;
+	/* does AM support parallel scan? */
+	bool		amcanparallel;
 	/* type of data stored in index, or InvalidOid if variable */
 	Oid			amkeytype;
 
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 46d6f45..ea3f3a5 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -14,6 +14,7 @@
 #ifndef NODEINDEXSCAN_H
 #define NODEINDEXSCAN_H
 
+#include "access/parallel.h"
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
@@ -22,6 +23,9 @@ extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
 extern void ExecReScanIndexScan(IndexScanState *node);
+extern void ExecIndexScanEstimate(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeDSM(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc);
 
 /*
  * These routines are exported to share code with nodeIndexonlyscan.c and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f9bcdd6..0048aa9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1359,6 +1359,7 @@ typedef struct
  *		SortSupport		   for reordering ORDER BY exprs
  *		OrderByTypByVals   is the datatype of order by expression pass-by-value?
  *		OrderByTypLens	   typlens of the datatypes of order by expressions
+ *		pscan_len		   size of parallel index scan descriptor
  * ----------------
  */
 typedef struct IndexScanState
@@ -1385,6 +1386,9 @@ typedef struct IndexScanState
 	SortSupport iss_SortSupport;
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
+
+	/* This is needed for parallel index scan */
+	Size		iss_PscanLen;
 } IndexScanState;
 
 /* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 643be54..f7ac6f6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -629,6 +629,7 @@ typedef struct IndexOptInfo
 	bool		amsearchnulls;	/* can AM search for NULL/NOT NULL entries? */
 	bool		amhasgettuple;	/* does AM have amgettuple interface? */
 	bool		amhasgetbitmap; /* does AM have amgetbitmap interface? */
+	bool		amcanparallel;	/* does AM support parallel scan? */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
 	void		(*amcostestimate) ();	/* AM's cost estimator */
 } IndexOptInfo;
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 8525917..4f63858 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -187,6 +187,8 @@ extern double compute_index_pages(PlannerInfo *root, IndexOptInfo *index,
 					List *indexclauses, int loop_count, double numIndexTuples,
 			 double *random_page_cost, double *sa_scans, double *indexTuples,
 					double *indexPages, Selectivity *indexSel, Cost *cost);
+extern double compute_index_and_heap_pages(PlannerInfo *root, IndexOptInfo *index,
+							 List *indexQuals, int loop_count);
 extern double compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel,
 				  Path *bitmapqual, int loop_count, Cost *cost, double *tuple);
 
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 44e143c..e2fd53a 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -49,7 +49,21 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count);
+				  double loop_count,
+				  int parallel_workers);
+extern void create_partial_index_path(PlannerInfo *root,
+						  IndexOptInfo *index,
+						  List *indexclauses,
+						  List *indexclausecols,
+						  List *indexquals,
+						  List *indexqualcols,
+						  List *indexorderbys,
+						  List *indexorderbycols,
+						  List *pathkeys,
+						  ScanDirection indexscandir,
+						  bool indexonly,
+						  Relids required_outer,
+						  double loop_count);
 extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						Path *bitmapqual,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 81a9be7..fabf314 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -53,6 +53,7 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 					 List *initial_rels);
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel);
+extern int	compute_parallel_worker(RelOptInfo *rel, BlockNumber pages);
 
 #ifdef OPTIMIZER_DEBUG
 extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 18e21b7..140fb3c 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -99,6 +99,29 @@ explain (costs off)
    ->  Index Only Scan using tenk1_unique1 on tenk1
 (3 rows)
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+                             QUERY PLAN                             
+--------------------------------------------------------------------
+ Finalize Aggregate
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial Aggregate
+               ->  Parallel Index Scan using tenk1_hundred on tenk1
+                     Index Cond: (hundred > 1)
+(6 rows)
+
+select  count((unique1)) from tenk1 where hundred > 1;
+ count 
+-------
+  9800
+(1 row)
+
+reset enable_seqscan;
+reset enable_bitmapscan;
 set force_parallel_mode=1;
 explain (costs off)
   select stringu1::int2 from tenk1 where unique1 = 1;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 8b4090f..d7dfd28 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -39,6 +39,17 @@ explain (costs off)
 	select  sum(parallel_restricted(unique1)) from tenk1
 	group by(parallel_restricted(unique1));
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+select  count((unique1)) from tenk1 where hundred > 1;
+
+reset enable_seqscan;
+reset enable_bitmapscan;
+
 set force_parallel_mode=1;
 
 explain (costs off)
#56Rahila Syed
rahilasyed90@gmail.com
In reply to: Amit Kapila (#55)
Re: Parallel Index Scans

Hello,

Agreed, that it makes sense to consider only the number of pages to
scan for computation of parallel workers. I think for index scan we
should consider both index and heap pages that need to be scanned
(costing of index scan considers both index and heap pages). I think
where considering heap pages matters more is when the finally selected
rows are scattered across heap pages or we need to apply a filter on
rows after fetching from the heap. OTOH, we can consider just pages
in the index as that is where the parallelism mainly works

IMO, considering just index pages will give a better estimate of the work to
be done in parallel, as the amount of work/number of pages divided amongst
workers is independent of the number of heap pages scanned.
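
For reference, the page count that drives the worker computation in the v7
patch's create_partial_index_path() boils down to roughly the sketch below;
the alternative discussed above would feed only index->pages into
compute_parallel_worker():

	/* estimate index + heap pages the scan will touch, capped at the index size */
	pages_fetched = compute_index_and_heap_pages(root, index,
												 indexquals, loop_count);
	if (pages_fetched > index->pages)
		pages_fetched = index->pages;

	parallel_workers = compute_parallel_worker(rel, pages_fetched);
	if (parallel_workers <= 0)
		return;					/* no partial index path is generated */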

Thank you,
Rahila Syed

On Mon, Jan 30, 2017 at 2:52 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:

On Sat, Jan 28, 2017 at 1:34 AM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

parallel_index_opt_exec_support_v6 - Removed the usage of
GatherSupportsBackwardScan. Expanded the comments in
ExecReScanIndexScan.

I looked through this and in general it looks reasonable to me.
However, I did notice one thing that I think is wrong. In the
parallel bitmap heap scan patch, the second argument to
compute_parallel_worker() is the number of pages that the parallel
scan is expected to fetch from the heap. In this patch, it's the
total number of pages in the index. The former seems to me to be
better, because the point of having a threshold relation size for
parallelism is that we don't want to use a lot of workers to scan a
small number of pages -- the distribution of work won't be even, and
the potential savings are limited. If we've got a big index but are
using a very selective qual to pull out only one or a small number of
rows on a single page or a small handful of pages, we shouldn't
generate a parallel path for that.

Agreed, that it makes sense to consider only the number of pages to
scan for computation of parallel workers. I think for index scan we
should consider both index and heap pages that need to be scanned
(costing of index scan considers both index and heap pages). I think
where considering heap pages matters more is when the finally selected
rows are scattered across heap pages or we need to apply a filter on
rows after fetching from the heap. OTOH, we can consider just pages
in the index as that is where the parallelism mainly works. In the
attached patch (parallel_index_opt_exec_support_v7.patch), I have
considered both index and heap pages; let me know if you think some
other way is better.  I have also prepared a separate independent
patch (compute_index_pages_v1) on HEAD to compute index pages which
can be used by parallel index scan. There is no change in
parallel_index_scan (parallel btree scan) patch, so I am not attaching
its new version.

Now, against that theory, the GUC that controls the behavior of
compute_parallel_worker() is called min_parallel_relation_size, which
might make you think that the decision is supposed to be based on the
whole size of some relation. But I think that just means we need to
rename the GUC to something like min_parallel_scan_size. Possibly we
also ought consider reducing the default value somewhat, because it
seems like both sequential and index scans can benefit even when
scanning less than 8MB.

Agreed, but let's consider it separately.

The order in which patches need to be applied:
compute_index_pages_v1.patch, parallel_index_scan_v6.patch[1],
parallel_index_opt_exec_support_v7.patch

[1] - /messages/by-id/CAA4eK1J%253DLSBpDx7i_izGJxGVUryqPe-2SKT02De-PrQvywiMxw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#57Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#51)
Re: Parallel Index Scans

On Tue, Jan 24, 2017 at 4:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

In spite of being careful, I missed reorganizing the functions in
genam.h which I have done in attached patch.

Cool. Committed parallel-generic-index-scan.2.patch. Thanks.

Reviewing parallel_index_scan_v6.patch:

I think it might be better to swap the return value and the out
parameter for _bt_parallel_seize; that is, return a bool, and have
callers ignore the value of the out parameter (e.g. *pageno).

I think _bt_parallel_advance_scan should be renamed something that
includes the word "keys", like _bt_parallel_advance_array_keys.

The hunk in indexam.c looks like a leftover that can be omitted.

+/*
+ * Below flags are used to indicate the state of parallel scan.

They aren't really flags any more; they're members of an enum. I
think you could just leave this sentence out entirely and start right
in with descriptions of the individual values. But maybe all of those
descriptions should end in a period (instead of one ending in a period
but not the other three) since they read like complete sentences.

+ * btinitparallelscan - Initializing BTParallelScanDesc for parallel btree scan

Initializing -> initialize

+ * btparallelrescan() -- reset parallel scan

Previous two prototypes have one dash, this one has two. Make it
consistent, please.

+     * Ideally, we don't need to acquire spinlock here, but being
+     * consistent with heap_rescan seems to be a good idea.

How about: In theory, we don't need to acquire the spinlock here,
because there shouldn't be any other workers running at this point,
but we do so for consistency.

+ * _bt_parallel_seize() -- returns the next block to be scanned for forward
+ *      scans and latest block scanned for backward scans.

I think the description should be more like "Begin the process of
advancing the scan to a new page. Other scans must wait until we call
bt_parallel_release() or bt_parallel_done()." And likewise
_bt_parallel_release() should say something like "Complete the process
of advancing the scan to a new page. We now have the new value for
btps_scanPage; some other backend can now begin advancing the scan."
And _bt_parallel_done should say something like "Mark the parallel
scan as complete."

I am a bit mystified about how this manages to work with array keys.
_bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
_bt_parallel_advance_scan() won't do anything unless btps_pageStatus
is already BTPARALLEL_DONE. It seems like one of those two things has
to be wrong.

_bt_readpage's header comment should be updated to note that, in the
case of a parallel scan, _bt_parallel_seize should have been called
prior to entering this function, and _bt_parallel_release will be
called prior to return (or this could alternatively be phrased in
terms of btps_pageStatus on entry/exit).

_bt_readnextpage isn't very clear about the meaning of its blkno
argument. It looks like it always has to be valid when scanning
forward, but only in the parallel case when scanning backward? That
is a little odd. Another, possibly related thing that is odd is that
when _bt_steppage() finds no tuples and decides to advance to a new
page again, there's a very short comment in the forward case and a
very long comment in the backward case:

/* nope, keep going */
vs.
/*
* For parallel scans, get the last page scanned as it is quite
* possible that by the time we try to fetch previous page, other
* worker has also decided to scan that previous page. We could
* avoid that by doing _bt_parallel_release once we have read the
* current page, but it is bad to make other workers wait till we
* read the page.
*/

Now it seems to me that these cases are symmetric and the issues are
identical. It's basically that, while we were judging whether the
current page has useful contents, some other process could have
advanced the scan (since _bt_readpage un-seized it).

-            /* check for deleted page */
             page = BufferGetPage(so->currPos.buf);
             TestForOldSnapshot(scan->xs_snapshot, rel, page);
             opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+            /* check for deleted page */

This is an independent change; committed.

What kind of testing has been done to ensure that this doesn't have
concurrency bugs? What's the highest degree of parallelism that's
been tested? Did that test include array keys?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#58Amit Kapila
amit.kapila16@gmail.com
In reply to: Rahila Syed (#56)
Re: Parallel Index Scans

On Tue, Jan 31, 2017 at 5:53 PM, Rahila Syed <rahilasyed90@gmail.com> wrote:

Hello,

Agreed, that it makes sense to consider only the number of pages to
scan for computation of parallel workers. I think for index scan we
should consider both index and heap pages that need to be scanned
(costing of index scan considers both index and heap pages). I think
where considering heap pages matters more is when the finally selected
rows are scattered across heap pages or we need to apply a filter on
rows after fetching from the heap. OTOH, we can consider just pages
in the index as that is where the parallelism mainly works

IMO, considering just index pages will give a better estimate of the work to
be done in parallel, as the amount of work/number of pages divided amongst
workers is independent of the number of heap pages scanned.

Yeah, I understand that point and I can see there is a strong argument
for doing it that way, but let's wait and see what others, including
Robert, have to say about this point.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#59Rahila Syed
rahilasyed90@gmail.com
In reply to: Robert Haas (#57)
Re: Parallel Index Scans

Hello Robert,

I am a bit mystified about how this manages to work with array keys.
_bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
_bt_parallel_advance_scan() won't do anything unless btps_pageStatus
is already BTPARALLEL_DONE. It seems like one of those two things has
to be wrong.

btps_pageStatus is set to BTPARALLEL_DONE only by the first worker that is
performing the scan for the latest array key and has encountered the end of
the scan.  This is ensured by the following check in _bt_parallel_done():

if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
btscan->btps_pageStatus != BTPARALLEL_DONE)

Thus, BTPARALLEL_DONE marks the end of the scan only for the latest array key.
This ensures that when any worker reaches _bt_advance_array_keys(), it advances
the latest scan, which is tracked by btscan->btps_arrayKeyCount, only when that
scan has ended, as determined by the if (btps_pageStatus == BTPARALLEL_DONE)
check in _bt_advance_array_keys().  Otherwise, the worker just advances its
local so->arrayKeyCount.
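
A condensed sketch of that interplay, abridged from the attached v7 patch
(error handling and the condition-variable broadcast are elided):

/* _bt_parallel_done(): only the scan for the latest array key is marked done */
SpinLockAcquire(&btscan->btps_mutex);
if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
	btscan->btps_pageStatus != BTPARALLEL_DONE)
	btscan->btps_pageStatus = BTPARALLEL_DONE;
SpinLockRelease(&btscan->btps_mutex);

/*
 * _bt_parallel_advance_array_keys(): every worker advances its local count,
 * but the shared count advances only once the latest key's scan is DONE.
 */
so->arrayKeyCount++;
SpinLockAcquire(&btscan->btps_mutex);
if (btscan->btps_pageStatus == BTPARALLEL_DONE)
{
	btscan->btps_scanPage = InvalidBlockNumber;
	btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
	btscan->btps_arrayKeyCount++;
}
SpinLockRelease(&btscan->btps_mutex);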

I hope this provides some clarification.

Thank you,
Rahila Syed

On Wed, Feb 1, 2017 at 3:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jan 24, 2017 at 4:51 PM, Robert Haas <robertmhaas@gmail.com>
wrote:

On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

In spite of being careful, I missed reorganizing the functions in
genam.h which I have done in attached patch.

Cool. Committed parallel-generic-index-scan.2.patch. Thanks.

Reviewing parallel_index_scan_v6.patch:

I think it might be better to swap the return value and the out
parameter for _bt_parallel_seize; that is, return a bool, and have
callers ignore the value of the out parameter (e.g. *pageno).

I think _bt_parallel_advance_scan should be renamed something that
includes the word "keys", like _bt_parallel_advance_array_keys.

The hunk in indexam.c looks like a leftover that can be omitted.

+/*
+ * Below flags are used to indicate the state of parallel scan.

They aren't really flags any more; they're members of an enum. I
think you could just leave this sentence out entirely and start right
in with descriptions of the individual values. But maybe all of those
descriptions should end in a period (instead of one ending in a period
but not the other three) since they read like complete sentences.

+ * btinitparallelscan - Initializing BTParallelScanDesc for parallel
btree scan

Initializing -> initialize

+ * btparallelrescan() -- reset parallel scan

Previous two prototypes have one dash, this one has two. Make it
consistent, please.

+     * Ideally, we don't need to acquire spinlock here, but being
+     * consistent with heap_rescan seems to be a good idea.

How about: In theory, we don't need to acquire the spinlock here,
because there shouldn't be any other workers running at this point,
but we do so for consistency.

+ * _bt_parallel_seize() -- returns the next block to be scanned for
forward
+ *      scans and latest block scanned for backward scans.

I think the description should be more like "Begin the process of
advancing the scan to a new page. Other scans must wait until we call
bt_parallel_release() or bt_parallel_done()." And likewise
_bt_parallel_release() should say something like "Complete the process
of advancing the scan to a new page. We now have the new value for
btps_scanPage; some other backend can now begin advancing the scan."
And _bt_parallel_done should say something like "Mark the parallel
scan as complete."

I am a bit mystified about how this manages to work with array keys.
_bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
_bt_parallel_advance_scan() won't do anything unless btps_pageStatus
is already BTPARALLEL_DONE. It seems like one of those two things has
to be wrong.

_bt_readpage's header comment should be updated to note that, in the
case of a parallel scan, _bt_parallel_seize should have been called
prior to entering this function, and _bt_parallel_release will be
called prior to return (or this could alternatively be phrased in
terms of btps_pageStatus on entry/exit).

_bt_readnextpage isn't very clear about the meaning of its blkno
argument. It looks like it always has to be valid when scanning
forward, but only in the parallel case when scanning backward? That
is a little odd. Another, possibly related thing that is odd is that
when _bt_steppage() finds no tuples and decides to advance to a new
page again, there's a very short comment in the forward case and a
very long comment in the backward case:

/* nope, keep going */
vs.
/*
* For parallel scans, get the last page scanned as it is quite
* possible that by the time we try to fetch previous page,
other
* worker has also decided to scan that previous page. We
could
* avoid that by doing _bt_parallel_release once we have read
the
* current page, but it is bad to make other workers wait till
we
* read the page.
*/

Now it seems to me that these cases are symmetric and the issues are
identical. It's basically that, while we were judging whether the
current page has useful contents, some other process could have
advanced the scan (since _bt_readpage un-seized it).

-            /* check for deleted page */
page = BufferGetPage(so->currPos.buf);
TestForOldSnapshot(scan->xs_snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+            /* check for deleted page */

This is an independent change; committed.

What kind of testing has been done to ensure that this doesn't have
concurrency bugs? What's the highest degree of parallelism that's
been tested? Did that test include array keys?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#60Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#57)
1 attachment(s)
Re: Parallel Index Scans

On Wed, Feb 1, 2017 at 3:52 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Jan 24, 2017 at 4:51 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 23, 2017 at 1:03 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

In spite of being careful, I missed reorganizing the functions in
genam.h which I have done in attached patch.

Cool. Committed parallel-generic-index-scan.2.patch. Thanks.

Reviewing parallel_index_scan_v6.patch:

I think it might be better to swap the return value and the out
parameter for _bt_parallel_seize; that is, return a bool, and have
callers ignore the value of the out parameter (e.g. *pageno).

makes sense, so changed accordingly.

I think _bt_parallel_advance_scan should be renamed something that
includes the word "keys", like _bt_parallel_advance_array_keys.

Changed as per suggestion.

The hunk in indexam.c looks like a leftover that can be omitted.

It is not a leftover hunk.  Earlier, the patch had the same check in
btparallelrescan, but based on your comment up thread [1]/messages/by-id/CA+TgmoZv0+cLUV7fZRo76_PB9cfu5mBCVmoXKmaqrc7F30nJzw@mail.gmail.com (Why does
btparallelrescan cater to the case where scan->parallel_scan == NULL?),
it has been moved to indexam.c.

+/*
+ * Below flags are used to indicate the state of parallel scan.

They aren't really flags any more; they're members of an enum. I
think you could just leave this sentence out entirely and start right
in with descriptions of the individual values. But maybe all of those
descriptions should end in a period (instead of one ending in a period
but not the other three) since they read like complete sentences.

+ * btinitparallelscan - Initializing BTParallelScanDesc for parallel btree scan

Initializing -> initialize

+ * btparallelrescan() -- reset parallel scan

Previous two prototypes have one dash, this one has two. Make it
consistent, please.

+     * Ideally, we don't need to acquire spinlock here, but being
+     * consistent with heap_rescan seems to be a good idea.

How about: In theory, we don't need to acquire the spinlock here,
because there shouldn't be any other workers running at this point,
but we do so for consistency.

+ * _bt_parallel_seize() -- returns the next block to be scanned for forward
+ *      scans and latest block scanned for backward scans.

I think the description should be more like "Begin the process of
advancing the scan to a new page. Other scans must wait until we call
bt_parallel_release() or bt_parallel_done()." And likewise
_bt_parallel_release() should say something like "Complete the process
of advancing the scan to a new page. We now have the new value for
btps_scanPage; some other backend can now begin advancing the scan."
And _bt_parallel_done should say something like "Mark the parallel
scan as complete."

Changed the code as per your above suggestions.
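
For reference, the resulting caller pattern (simplified from _bt_steppage()
and _bt_readnextpage() in the attached patch) is roughly:

if (scan->parallel_scan != NULL)
{
	/* begin advancing the scan; other participants wait on the CV */
	if (!_bt_parallel_seize(scan, &blkno))
	{
		BTScanPosUnpinIfPinned(so->currPos);
		BTScanPosInvalidate(so->currPos);
		return false;			/* scan already finished elsewhere */
	}
}
else
	blkno = so->currPos.nextPage;

/* _bt_readpage() publishes the next page via _bt_parallel_release() */

if (blkno == P_NONE || !so->currPos.moreRight)
{
	_bt_parallel_done(scan);	/* no more pages; wake all waiters */
	BTScanPosInvalidate(so->currPos);
	return false;
}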

I am a bit mystified about how this manages to work with array keys.
_bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
_bt_parallel_advance_scan() won't do anything unless btps_pageStatus
is already BTPARALLEL_DONE.

This is just to ensure that btps_arrayKeyCount is advanced once and
btps_pageStatus is changed to BTPARALLEL_DONE once per array element.
So it goes something like, if we have array with values [1,2,3], then
all the workers will complete the scan with key 1 and one of them will
mark btps_pageStatus as BTPARALLEL_DONE and then first one to hit
_bt_parallel_advance_scan will increment the value of
btps_arrayKeyCount, then the same will happen for keys 2 and 3.  It is
quite possible that by the time one of the participants advances its
local key, the scan for that key is already finished, and we handle
that situation in _bt_parallel_seize() with the below check:

if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
*status = false;

I think Rahila has also mentioned something along these lines; let us
know if we are missing something here.  Do you want more comments added
in the code to explain this handling, and if yes, then where (on top of
the function _bt_parallel_advance_scan)?

It seems like one of those two things has
to be wrong.

_bt_readpage's header comment should be updated to note that, in the
case of a parallel scan, _bt_parallel_seize should have been called
prior to entering this function, and _bt_parallel_release will be
called prior to return (or this could alternatively be phrased in
terms of btps_pageStatus on entry/exit).

Changed as per suggestion.

_bt_readnextpage isn't very clear about the meaning of its blkno
argument. It looks like it always has to be valid when scanning
forward, but only in the parallel case when scanning backward?

It can only be invalid for a non-parallel backward scan, and in that
case the appropriate value for so->currPos will be set.  Refer to
_bt_steppage().  I have updated the comments.

That
is a little odd. Another, possibly related thing that is odd is that
when _bt_steppage() finds no tuples and decides to advance to a new
page again, there's a very short comment in the forward case and a
very long comment in the backward case:

/* nope, keep going */
vs.
/*
* For parallel scans, get the last page scanned as it is quite
* possible that by the time we try to fetch previous page, other
* worker has also decided to scan that previous page. We could
* avoid that by doing _bt_parallel_release once we have read the
* current page, but it is bad to make other workers wait till we
* read the page.
*/

Now it seems to me that these cases are symmetric and the issues are
identical. It's basically that, while we were judging whether the
current page has useful contents, some other process could have
advanced the scan (since _bt_readpage un-seized it).

Yeah, but the reason for the difference in comments is that for
non-parallel backward scans there is no code at that place to move to
the previous page; it basically relies on the next call to _bt_walk_left(),
whereas for parallel scans we can't simply rely on _bt_walk_left().
I have slightly modified the comments for the backward scan case; see if
that looks better, and if not, suggest what you think is better.
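
Concretely, the parallel branch in the backward-scan loop now re-seizes the
scan instead of relying on _bt_walk_left(), along these lines (abridged from
the attached v7 patch):

/*
 * For parallel scans, get the last page scanned, as another worker may
 * already have moved on to (and claimed) the previous page.
 */
if (scan->parallel_scan != NULL)
{
	_bt_relbuf(rel, so->currPos.buf);
	if (!_bt_parallel_seize(scan, &blkno))
	{
		BTScanPosInvalidate(so->currPos);
		return false;
	}
	so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
}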

-            /* check for deleted page */
page = BufferGetPage(so->currPos.buf);
TestForOldSnapshot(scan->xs_snapshot, rel, page);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+            /* check for deleted page */

This is an independent change; committed.

Thanks.

What kind of testing has been done to ensure that this doesn't have
concurrency bugs?

We used parallel index scans on large tables (both forward and backward
scans).  These tests have been done by Tushar, and you can find the
detailed report up thread [2]/messages/by-id/1d6353a0-63cb-65d9-a70c-0913899d5b06@enterprisedb.com.  Apart from that, the patch has been
tested with TPC-H queries at various scale factors; parallel index scan
is used in multiple of those queries, and we have verified the results
as well.  The TPC-H tests have been done by Rafia.

Tushar has done some further extensive testing of this patch.  Tushar,
can you please share your test results?

What's the highest degree of parallelism that's
been tested?

7

Did that test include array keys?

Yes.

Note - The order in which patches need to be applied:
compute_index_pages_v1.patch, parallel_index_scan_v7.patch,
parallel_index_opt_exec_support_v7.patch. The first and third patches
are posted up thread [3]/messages/by-id/CAA4eK1KowGSYYVpd2qPpaPPA5R90r++QwDFbrRECTE9H_HvpOg@mail.gmail.com.

[1]: /messages/by-id/CA+TgmoZv0+cLUV7fZRo76_PB9cfu5mBCVmoXKmaqrc7F30nJzw@mail.gmail.com
[2]: /messages/by-id/1d6353a0-63cb-65d9-a70c-0913899d5b06@enterprisedb.com
[3]: /messages/by-id/CAA4eK1KowGSYYVpd2qPpaPPA5R90r++QwDFbrRECTE9H_HvpOg@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_index_scan_v7.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 01fad38..638a9ab 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1207,7 +1207,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting in an extension.</entry>
         </row>
         <row>
-         <entry morerows="9"><literal>IPC</></entry>
+         <entry morerows="10"><literal>IPC</></entry>
          <entry><literal>BgWorkerShutdown</></entry>
          <entry>Waiting for background worker to shut down.</entry>
         </row>
@@ -1240,6 +1240,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting for parallel workers to finish computing.</entry>
         </row>
         <row>
+         <entry><literal>ParallelBtreePage</></entry>
+         <entry>Waiting for the page number needed to continue a parallel btree scan
+         to become available.</entry>
+        </row>
+        <row>
          <entry><literal>SafeSnapshot</></entry>
          <entry>Waiting for a snapshot for a <literal>READ ONLY DEFERRABLE</> transaction.</entry>
         </row>
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index ba27c1e..28de5be 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -481,6 +481,9 @@ index_parallelrescan(IndexScanDesc scan)
 {
 	SCAN_CHECKS;
 
+	if (!scan->parallel_scan)
+		return;
+
 	/* amparallelrescan is optional; assume no-op if not provided by AM */
 	if (scan->indexRelation->rd_amroutine->amparallelrescan != NULL)
 		scan->indexRelation->rd_amroutine->amparallelrescan(scan);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 469e7ab..5809700 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,8 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "pgstat.h"
+#include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -63,6 +65,44 @@ typedef struct
 	MemoryContext pagedelcontext;
 } BTVacState;
 
+/*
+ * BTPARALLEL_NOT_INITIALIZED implies that the scan is not started.
+ *
+ * BTPARALLEL_ADVANCING implies one of the worker or backend is advancing the
+ * scan to a new page; others must wait.
+ *
+ * BTPARALLEL_IDLE implies that no backend is advancing the scan; someone can
+ * start doing it.
+ *
+ * BTPARALLEL_DONE implies that the scan is complete (including error exit).
+ */
+typedef enum
+{
+	BTPARALLEL_NOT_INITIALIZED,
+	BTPARALLEL_ADVANCING,
+	BTPARALLEL_IDLE,
+	BTPARALLEL_DONE
+} BTPS_State;
+
+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+	BlockNumber btps_scanPage;	/* latest or next page to be scanned */
+	BTPS_State	btps_pageStatus;/* indicates whether next page is available
+								 * for scan. see above for possible states of
+								 * parallel scan. */
+	int			btps_arrayKeyCount;		/* count indicating number of array
+										 * scan keys processed by parallel
+										 * scan */
+	slock_t		btps_mutex;		/* protects above variables */
+	ConditionVariable btps_cv;	/* used to synchronize parallel scan */
+} BTParallelScanDescData;
+
+typedef struct BTParallelScanDescData *BTParallelScanDesc;
+
 
 static void btbuildCallback(Relation index,
 				HeapTuple htup,
@@ -118,9 +158,9 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = btendscan;
 	amroutine->ammarkpos = btmarkpos;
 	amroutine->amrestrpos = btrestrpos;
-	amroutine->amestimateparallelscan = NULL;
-	amroutine->aminitparallelscan = NULL;
-	amroutine->amparallelrescan = NULL;
+	amroutine->amestimateparallelscan = btestimateparallelscan;
+	amroutine->aminitparallelscan = btinitparallelscan;
+	amroutine->amparallelrescan = btparallelrescan;
 
 	PG_RETURN_POINTER(amroutine);
 }
@@ -490,6 +530,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	}
 
 	so->markItemIndex = -1;
+	so->arrayKeyCount = 0;
 	BTScanPosUnpinIfPinned(so->markPos);
 	BTScanPosInvalidate(so->markPos);
 
@@ -652,6 +693,215 @@ btrestrpos(IndexScanDesc scan)
 }
 
 /*
+ * btestimateparallelscan -- estimate storage for BTParallelScanDescData
+ */
+Size
+btestimateparallelscan(void)
+{
+	return sizeof(BTParallelScanDescData);
+}
+
+/*
+ * btinitparallelscan -- initialize BTParallelScanDesc for parallel btree scan
+ */
+void
+btinitparallelscan(void *target)
+{
+	BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
+
+	SpinLockInit(&bt_target->btps_mutex);
+	bt_target->btps_scanPage = InvalidBlockNumber;
+	bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->btps_arrayKeyCount = 0;
+	ConditionVariableInit(&bt_target->btps_cv);
+}
+
+/*
+ *	btparallelrescan() -- reset parallel scan
+ */
+void
+btparallelrescan(IndexScanDesc scan)
+{
+	BTParallelScanDesc btscan;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+
+	Assert(parallel_scan);
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * In theory, we don't need to acquire the spinlock here, because there
+	 * shouldn't be any other workers running at this point, but we do so for
+	 * consistency.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = InvalidBlockNumber;
+	btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	btscan->btps_arrayKeyCount = 0;
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
+ * _bt_parallel_seize() -- Begin the process of advancing the scan to a new
+ *		page.  Other scans must wait until we call bt_parallel_release() or
+ *		bt_parallel_done().
+ *
+ * The return value tells caller whether to continue the scan or not.  The
+ * true value indicates either one of the following (a) the block number
+ * returned is valid and the scan can be continued (b) the block number is
+ * invalid and the scan has just begun (c) the block number is P_NONE and the
+ * scan is finished.  The false value indicates that we have reached the end
+ * of scan for current scankeys and for that we return block number as P_NONE.
+ *
+ * The first time the master backend or a worker hits the last page, it will
+ * return P_NONE with status 'true'; after that, when any worker tries to
+ * fetch the next page, it will get status 'false'.
+ *
+ * Callers ignore the value of pageno, if false is returned.
+ */
+bool
+_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTPS_State	pageStatus;
+	bool		exit_loop = false;
+	bool		status = true;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	while (1)
+	{
+		/*
+		 * Fetch the next block to scan and update the page status so that
+		 * other participants of parallel scan can wait till next page is
+		 * available for scan.  Set the status as false, if scan is finished.
+		 */
+		SpinLockAcquire(&btscan->btps_mutex);
+		pageStatus = btscan->btps_pageStatus;
+
+		/* Check if the scan for current scan keys is finished */
+		if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+			status = false;
+		else if (pageStatus == BTPARALLEL_DONE)
+			status = false;
+		else if (pageStatus != BTPARALLEL_ADVANCING)
+		{
+			btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
+			*pageno = btscan->btps_scanPage;
+			exit_loop = true;
+		}
+		SpinLockRelease(&btscan->btps_mutex);
+		if (exit_loop || !status)
+			break;
+		ConditionVariableSleep(&btscan->btps_cv, WAIT_EVENT_BTREE_PAGE);
+	}
+	ConditionVariableCancelSleep();
+
+	/* no more pages to scan */
+	if (!status)
+		*pageno = P_NONE;
+
+	return status;
+}
+
+/*
+ * _bt_parallel_release() -- Complete the process of advancing the scan to a
+ *		new page.  We now have the new value btps_scanPage; some other backend
+ *		can now begin advancing the scan.
+ *
+ * scan_page - For backward scans, it holds the latest page being scanned
+ *		and for forward scans, it holds the next page to be scanned.
+ */
+void
+_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+{
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = scan_page;
+	btscan->btps_pageStatus = BTPARALLEL_IDLE;
+	SpinLockRelease(&btscan->btps_mutex);
+	ConditionVariableSignal(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_done() -- Mark the parallel scan as complete.
+ *
+ * When there are no pages left to scan, this function should be called to
+ * notify other workers.  Otherwise, they might wait forever for the scan to
+ * advance to the next page.
+ */
+void
+_bt_parallel_done(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+	bool		status_changed = false;
+
+	/* Do nothing, for non-parallel scans */
+	if (parallel_scan == NULL)
+		return;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * Ensure to mark parallel scan as done no more than once for single scan.
+	 * We rely on this state to initiate the next scan for multiple array
+	 * keys, see _bt_advance_array_keys.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+		btscan->btps_pageStatus != BTPARALLEL_DONE)
+	{
+		btscan->btps_pageStatus = BTPARALLEL_DONE;
+		status_changed = true;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+
+	/* wake up all the workers associated with this parallel scan */
+	if (status_changed)
+		ConditionVariableBroadcast(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_advance_array_keys() -- Advances the parallel scan for array
+ *			keys.
+ *
+ * Updates the count of array keys processed for both local and parallel
+ * scans.
+ */
+void
+_bt_parallel_advance_array_keys(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	so->arrayKeyCount++;
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (btscan->btps_pageStatus == BTPARALLEL_DONE)
+	{
+		btscan->btps_scanPage = InvalidBlockNumber;
+		btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		btscan->btps_arrayKeyCount++;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index b6459d2..a211c35 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,9 +30,12 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
 
 /*
@@ -544,8 +547,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
+	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	BlockNumber blkno;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -564,6 +569,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (!so->qual_ok)
 		return false;
 
+	/*
+	 * For parallel scans, get the page from shared state. If scan has not
+	 * started, proceed to find out first leaf page to scan by keeping other
+	 * workers waiting until we have descended to appropriate leaf page to be
+	 * scanned for matching tuples.
+	 *
+	 * If the scan has already begun, skip finding the first leaf page and
+	 * directly scan the page stored in the shared structure, or the page to
+	 * its left in the case of a backward scan.
+	 */
+	if (scan->parallel_scan != NULL)
+	{
+		status = _bt_parallel_seize(scan, &blkno);
+		if (!status)
+		{
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno == P_NONE)
+		{
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno != InvalidBlockNumber)
+		{
+			if (!_bt_parallel_readpage(scan, blkno, dir))
+				return false;
+			goto readcomplete;
+		}
+	}
+
 	/*----------
 	 * Examine the scan keys to discover where we need to start the scan.
 	 *
@@ -743,7 +780,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * there.
 	 */
 	if (keysCount == 0)
-		return _bt_endpoint(scan, dir);
+	{
+		bool		match;
+
+		match = _bt_endpoint(scan, dir);
+		if (!match)
+		{
+			/* No match, indicate (parallel) scan finished */
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+		}
+
+		return match;
+	}
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -993,25 +1042,21 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * because nothing finer to lock exists.
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
+
+		/*
+		 * mark parallel scan as done, so that all the workers can finish
+		 * their scan
+		 */
+		_bt_parallel_done(scan);
+		BTScanPosInvalidate(so->currPos);
+
 		return false;
 	}
 	else
 		PredicateLockPage(rel, BufferGetBlockNumber(buf),
 						  scan->xs_snapshot);
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	Assert(so->markItemIndex == -1);
+	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
@@ -1060,6 +1105,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
+readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
@@ -1132,6 +1178,10 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction.
  *
+ * In the case of parallel scans, caller must have called _bt_parallel_seize
+ * and _bt_parallel_release will be called in this function to advance the
+ * parallel scan.
+ *
  * Returns true if any matching items found on the page, false if none.
  */
 static bool
@@ -1154,6 +1204,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	page = BufferGetPage(so->currPos.buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/* allow next page to be processed by parallel worker */
+	if (scan->parallel_scan)
+	{
+		if (ScanDirectionIsForward(dir))
+			_bt_parallel_release(scan, opaque->btpo_next);
+		else
+			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+	}
+
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1278,21 +1338,16 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
  * if pinned, we'll drop the pin before moving to next page.  The buffer is
  * not locked on entry.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page.  For success on a scan using a non-MVCC snapshot we hold
- * a pin, but not a read lock, on that page.  If we do not hold the pin, we
- * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
- *
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
+ * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
+ * to InvalidBuffer.  We return TRUE to indicate success.
  */
 static bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Relation	rel;
-	Page		page;
-	BTPageOpaque opaque;
+	BlockNumber blkno = InvalidBlockNumber;
+	bool		status = true;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1319,13 +1374,27 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}
 
-	rel = scan->indexRelation;
-
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
-		/* We must rely on the previously saved nextPage link! */
-		BlockNumber blkno = so->currPos.nextPage;
+
+		/*
+		 * We must rely on the previously saved nextPage link for non-parallel
+		 * scans!
+		 */
+		if (scan->parallel_scan != NULL)
+		{
+			status = _bt_parallel_seize(scan, &blkno);
+			if (!status)
+			{
+				/* release the previous buffer, if pinned */
+				BTScanPosUnpinIfPinned(so->currPos);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+			blkno = so->currPos.nextPage;
 
 		/* Remember we left a page with data */
 		so->currPos.moreLeft = true;
@@ -1333,11 +1402,72 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		/* release the previous buffer, if pinned */
 		BTScanPosUnpinIfPinned(so->currPos);
 
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+	else
+	{
+		/* Remember we left a page with data */
+		so->currPos.moreRight = true;
+
+		/* For parallel scans, get the last page scanned */
+		if (scan->parallel_scan != NULL)
+		{
+			status = _bt_parallel_seize(scan, &blkno);
+			BTScanPosUnpinIfPinned(so->currPos);
+			if (!status)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+
+	/* Drop the lock, and maybe the pin, on the current page */
+	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+	return true;
+}
+
+/*
+ *	_bt_readnextpage() -- Read next page containing valid data for scan
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page.  Caller is responsible to release lock and pin on
+ * buffer on success.  We return TRUE to indicate success.
+ *
+ * If there are no more matching records in the given direction, we drop all
+ * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ *
+ * Callers pass a valid blkno for forward scans and for parallel backward
+ * scans.  For non-parallel backward scans, blkno can be invalid, in which
+ * case so->currPos is expected to contain a valid value.
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation	rel;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		status = true;
+
+	rel = scan->indexRelation;
+
+	if (ScanDirectionIsForward(dir))
+	{
 		for (;;)
 		{
-			/* if we're at end of scan, give up */
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
 			if (blkno == P_NONE || !so->currPos.moreRight)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1359,14 +1489,30 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			}
 
 			/* nope, keep going */
-			blkno = opaque->btpo_next;
+			if (scan->parallel_scan != NULL)
+			{
+				status = _bt_parallel_seize(scan, &blkno);
+				if (!status)
+				{
+					_bt_relbuf(rel, so->currPos.buf);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+			}
+			else
+				blkno = opaque->btpo_next;
 			_bt_relbuf(rel, so->currPos.buf);
 		}
 	}
 	else
 	{
-		/* Remember we left a page with data */
-		so->currPos.moreRight = true;
+		/*
+		 * for parallel scans, current block number needs to be retrieved from
+		 * shared state and it is the responsibility of caller to pass the
+		 * correct block number.
+		 */
+		if (blkno != InvalidBlockNumber)
+			so->currPos.currPage = blkno;
 
 		/*
 		 * Walk left to the next page with data.  This is much more complex
@@ -1401,6 +1547,12 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			if (!so->currPos.moreLeft)
 			{
 				_bt_relbuf(rel, so->currPos.buf);
+
+				/*
+				 * mark parallel scan as done, so that all the workers can
+				 * finish their scan
+				 */
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1412,6 +1564,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			/* if we're physically at end of index, return failure */
 			if (so->currPos.buf == InvalidBuffer)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1432,9 +1585,46 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
 					break;
 			}
+
+			/*
+			 * For parallel scans, get the last page scanned as it is quite
+			 * possible that by the time we try to fetch previous page, other
+			 * worker has also decided to scan that previous page.  So we
+			 * can't rely on _bt_walk_left call.
+			 */
+			if (scan->parallel_scan != NULL)
+			{
+				_bt_relbuf(rel, so->currPos.buf);
+				status = _bt_parallel_seize(scan, &blkno);
+				if (!status)
+				{
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+				so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			}
 		}
 	}
 
+	return true;
+}
+
+/*
+ *	_bt_parallel_readpage() -- Read current page containing valid data for scan
+ *
+ * On success, release lock and pin on buffer.  We return TRUE to indicate
+ * success.
+ */
+static bool
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	_bt_initialize_more_data(so, dir);
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
 	/* Drop the lock, and maybe the pin, on the current page */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
@@ -1712,19 +1902,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/* remember which buffer we have pinned */
 	so->currPos.buf = buf;
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
+	_bt_initialize_more_data(so, dir);
 
 	/*
 	 * Now load data from the first page of the scan.
@@ -1753,3 +1931,25 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 
 	return true;
 }
+
+/*
+ * _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
+ * for scan direction
+ */
+static inline void
+_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
+{
+	/* initialize moreLeft/moreRight appropriately for scan direction */
+	if (ScanDirectionIsForward(dir))
+	{
+		so->currPos.moreLeft = false;
+		so->currPos.moreRight = true;
+	}
+	else
+	{
+		so->currPos.moreLeft = true;
+		so->currPos.moreRight = false;
+	}
+	so->numKilled = 0;			/* just paranoia */
+	so->markItemIndex = -1;		/* ditto */
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index da0f330..5b259a3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -590,6 +590,10 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 			break;
 	}
 
+	/* advance parallel scan */
+	if (scan->parallel_scan != NULL)
+		_bt_parallel_advance_array_keys(scan);
+
 	return found;
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..92af6ec 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3392,6 +3392,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PARALLEL_FINISH:
 			event_name = "ParallelFinish";
 			break;
+		case WAIT_EVENT_BTREE_PAGE:
+			event_name = "ParallelBtreePage";
+			break;
 		case WAIT_EVENT_SAFE_SNAPSHOT:
 			event_name = "SafeSnapshot";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 011a72e..60ca1d5 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -609,6 +609,8 @@ typedef struct BTScanOpaqueData
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
 								 * there are any unsatisfiable array keys) */
+	int			arrayKeyCount;	/* count indicating number of array scan keys
+								 * processed */
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
@@ -652,7 +654,8 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
 /*
- * prototypes for functions in nbtree.c (external entry points for btree)
+ * prototypes for functions in nbtree.c (external entry points for btree and
+ * functions to maintain state of parallel scan)
  */
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 		struct IndexInfo *indexInfo);
@@ -661,10 +664,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 		 ItemPointer ht_ctid, Relation heapRel,
 		 IndexUniqueCheck checkUnique);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern Size btestimateparallelscan(void);
+extern void btinitparallelscan(void *target);
+extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+extern void _bt_parallel_done(IndexScanDesc scan);
+extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		 ScanKey orderbys, int norderbys);
+extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..915cc85 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -786,6 +786,7 @@ typedef enum
 	WAIT_EVENT_MQ_RECEIVE,
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
+	WAIT_EVENT_BTREE_PAGE,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP
 } WaitEventIPC;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c4235ae..9f876ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -161,6 +161,9 @@ BTPageOpaque
 BTPageOpaqueData
 BTPageStat
 BTPageState
+BTParallelScanDesc
+BTParallelScanDescData
+BTPS_State
 BTScanOpaque
 BTScanOpaqueData
 BTScanPos
#61tushar
tushar.ahuja@enterprisedb.com
In reply to: Amit Kapila (#60)
Re: Parallel Index Scans

On 02/01/2017 06:50 PM, Amit Kapila wrote:

Used large table parallel index scans (both forward and backward
scans). These tests have been done by Tushar and you can find
detailed report up thread [2]. Apart from that, the patch has been
tested with TPC-H queries at various scale factors and it is being
used in multiple queries and we have verified the results of same as
well. TPC-H tests have been done by Rafia.

Tushar has done some further extensive test of this patch. Tushar,
can you please share your test results?

Yes, we have:
0) Tested on a high-end machine with the following configuration

[edb@ip-10-0-38-61 pg_log]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E7-8880 v3 @ 2.30GHz

[edb@ip-10-0-38-61 pg_log]$ df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 961G 60K 961G 1% /dev
tmpfs 961G 556K 961G 1% /dev/shm
/dev/xvda1 197G 156G 42G 80% /

[edb@ip-10-0-38-61 pg_log]$ free
total used free shared buffers cached
Mem: 2014742800 170971292 1843771508 142668 166128 162463396
-/+ buffers/cache: 8341768 2006401032
Swap: 0 0 0

1) Executed the test cases with multiple clients for concurrency (e.g. ran
our testcase.sql file against 4 different psql terminals of the same server
simultaneously); we made an effort to execute the same set of tests via
different terminals against the same server.
2) We checked the count(*) of the query before and after disabling/enabling
max_parallel_workers_per_gather to make sure the end result (output) is
consistent.
3) We were able to get 14 parallel workers (the highest degree of
parallelism) in our case.

pgbench with scale factor = 10,000 (149 GB of data in the database;
100 million rows inserted) on an Amazon instance (128 cores, 4 nodes).

We were able to see 14 workers launched out of 14 workers planned
for the query below:

postgres=# \di+ pgbench_accounts_pkey
                                List of relations
 Schema |         Name          | Type  | Owner |      Table       | Size  | Description
--------+-----------------------+-------+-------+------------------+-------+-------------
 public | pgbench_accounts_pkey | index | edb   | pgbench_accounts | 21 GB |
(1 row)

index size is now 21 GB

postgres=# explain analyse verbose select * from pgbench_accounts where
aid <50000000 and bid <=1 ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Gather (cost=0.57..1745380.10 rows=4691 width=97) (actual
time=0.546..2316.118 rows=100000 loops=1)
Output: aid, bid, abalance, filler
Workers Planned: 14
Workers Launched: 14
-> Parallel Index Scan using pgbench_accounts_pkey on
public.pgbench_accounts (cost=0.57..1745380.10 rows=335 width=97)
(actual time=0.081..2253.234 rows=6667 loops=15)
Output: aid, bid, abalance, filler
Index Cond: (pgbench_accounts.aid < 50000000)
Filter: (pgbench_accounts.bid <= 1)
Rows Removed by Filter: 3326667
Worker 0: actual time=0.069..2251.456 rows=7036 loops=1
Worker 1: actual time=0.070..2256.772 rows=6588 loops=1
Worker 2: actual time=0.071..2257.164 rows=6954 loops=1
Worker 3: actual time=0.079..2255.166 rows=6222 loops=1
Worker 4: actual time=0.063..2254.814 rows=6588 loops=1
Worker 5: actual time=0.091..2253.872 rows=6588 loops=1
Worker 6: actual time=0.093..2254.237 rows=6222 loops=1
Worker 7: actual time=0.068..2254.749 rows=7320 loops=1
Worker 8: actual time=0.060..2253.953 rows=6588 loops=1
Worker 9: actual time=0.127..2253.546 rows=8052 loops=1
Worker 10: actual time=0.091..2252.737 rows=7686 loops=1
Worker 11: actual time=0.087..2252.056 rows=7320 loops=1
Worker 12: actual time=0.091..2252.600 rows=7320 loops=1
Worker 13: actual time=0.057..2252.341 rows=7686 loops=1
Planning time: 0.165 ms
Execution time: 2357.132 ms
(25 rows)

Even for array keys, where the index size is only in the MB range, we were
able to see 9 workers launched out of 9 workers planned:

postgres=# set enable_bitmapscan =0;
SET
postgres=# set enable_seqscan =0;
SET
postgres=# \di+ ary_idx
List of relations
Schema | Name | Type | Owner | Table | Size | Description
--------+---------+-------+-------+---------+-------+-------------
public | ary_idx | index | edb | ary_tab | 56 MB |
(1 row)

postgres=# explain analyze verbose select count(1) from ary_tab where
ARRAY[7,8,9,10]=c2 and c1 = 'four';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=47083.83..47083.84 rows=1 width=8) (actual
time=141.766..141.767 rows=1 loops=1)
Output: count(1)
-> Gather (cost=47083.80..47083.81 rows=9 width=8) (actual
time=141.547..141.753 rows=10 loops=1)
Output: (PARTIAL count(1))
Workers Planned: 9
Workers Launched: 9
-> Partial Aggregate (cost=47083.80..47083.81 rows=1
width=8) (actual time=136.679..136.679 rows=1 loops=10)
Output: PARTIAL count(1)
Worker 0: actual time=135.215..135.215 rows=1 loops=1
Worker 1: actual time=136.158..136.158 rows=1 loops=1
Worker 2: actual time=136.348..136.349 rows=1 loops=1
Worker 3: actual time=136.564..136.565 rows=1 loops=1
Worker 4: actual time=135.759..135.760 rows=1 loops=1
Worker 5: actual time=136.405..136.405 rows=1 loops=1
Worker 6: actual time=136.158..136.158 rows=1 loops=1
Worker 7: actual time=136.319..136.319 rows=1 loops=1
Worker 8: actual time=136.597..136.597 rows=1 loops=1
-> Parallel Index Scan using ary_idx on public.ary_tab
(cost=0.42..47083.79 rows=4 width=0) (actual time=122.557..136.673
rows=5 loops=10)
Index Cond: ('{7,8,9,10}'::integer[] = ary_tab.c2)
Filter: (ary_tab.c1 = 'four'::text)
Rows Removed by Filter: 100000
Worker 0: actual time=135.211..135.211 rows=0 loops=1
Worker 1: actual time=136.153..136.153 rows=0 loops=1
Worker 2: actual time=136.342..136.342 rows=0 loops=1
Worker 3: actual time=136.559..136.559 rows=0 loops=1
Worker 4: actual time=135.756..135.756 rows=0 loops=1
Worker 5: actual time=136.402..136.402 rows=0 loops=1
Worker 6: actual time=136.150..136.150 rows=0 loops=1
Worker 7: actual time=136.314..136.314 rows=0 loops=1
Worker 8: actual time=136.592..136.592 rows=0 loops=1
Planning time: 0.813 ms
Execution time: 145.881 ms
(32 rows)

4) The LCOV/SQL report for the same can be found at
/messages/by-id/1d6353a0-63cb-65d9-a70c-0913899d5b06@enterprisedb.com

--
regards,tushar
EnterpriseDB https://www.enterprisedb.com/
The Enterprise PostgreSQL Company

#63Michael Paquier
michael.paquier@gmail.com
In reply to: Amit Kapila (#60)
Re: Parallel Index Scans

On Wed, Feb 1, 2017 at 10:20 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

makes sense, so changed accordingly.

I have moved this patch to CF 2017-03.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#64Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#58)
Re: Parallel Index Scans

On Wed, Feb 1, 2017 at 12:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, I understand that point and I can see there is strong argument
to do that way, but let's wait and see what others including Robert
have to say about this point.

It seems to me that you can make an argument for any point of view.
In a parallel sequential scan, the smallest unit of work that can be
given to one worker is one heap page; in a parallel index scan, it's
one index page. By that logic, as Rahila says, we ought to do this
based on the number of index pages. On the other hand, it's weird to
use the same GUC to measure index pages at some times and heap pages
at other times, and it could result in failing to engage parallelism
where we really should do so, or using an excessively small number of
workers. An index scan that hits 25 index pages could hit 1000 heap
pages; if it's OK to use a parallel sequential scan for a table with
1000 heap pages, why is it not OK to use a parallel index scan to scan
1000 heap pages? I can't think of any reason.

On balance, I'm somewhat inclined to think that we ought to base
everything on heap pages, so that we're always measuring in the same
units. That's what Dilip's patch for parallel bitmap heap scan does,
and I think it's a reasonable choice. However, for parallel index
scan, we might want to also cap the number of workers to, say,
index_pages/10, just so we don't pick an index scan that's going to
result in a very lopsided work distribution.
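
For illustration, the capping idea could look roughly like this (a sketch
only, not the patch: compute_parallel_worker is the existing heap-page-based
computation, heap_pages/index_pages stand for the estimated page counts, and
10 is just the divisor suggested above):

	/*
	 * Sketch: compute workers from heap pages as today, then clamp by the
	 * index size so a small index is not split among too many workers.
	 */
	parallel_workers = compute_parallel_worker(rel, heap_pages);
	if (index_pages > 0)
		parallel_workers = Min(parallel_workers,
							   (int) (index_pages / 10));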

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#65Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#64)
Re: Parallel Index Scans

On Sat, Feb 4, 2017 at 5:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 1, 2017 at 12:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Yeah, I understand that point and I can see there is strong argument
to do that way, but let's wait and see what others including Robert
have to say about this point.

It seems to me that you can make an argument for any point of view.
In a parallel sequential scan, the smallest unit of work that can be
given to one worker is one heap page; in a parallel index scan, it's
one index page. By that logic, as Rahila says, we ought to do this
based on the number of index pages. On the other hand, it's weird to
use the same GUC to measure index pages at some times and heap pages
at other times, and it could result in failing to engage parallelism
where we really should do so, or using an excessively small number of
workers. An index scan that hits 25 index pages could hit 1000 heap
pages; if it's OK to use a parallel sequential scan for a table with
1000 heap pages, why is it not OK to use a parallel index scan to scan
1000 heap pages? I can't think of any reason.

I think one difference is that if we want to scan 1000 heap pages with a
parallel index scan, the cost of scanning the index is additional compared
to a parallel sequential scan.

On balance, I'm somewhat inclined to think that we ought to base
everything on heap pages, so that we're always measuring in the same
units. That's what Dilip's patch for parallel bitmap heap scan does,
and I think it's a reasonable choice. However, for parallel index
scan, we might want to also cap the number of workers to, say,
index_pages/10, just so we don't pick an index scan that's going to
result in a very lopsided work distribution.

I guess in the above context you mean heap_pages or index_pages that
are expected to be *fetched* during index scan.

Yet another thought is that for parallel index scans we use
index_pages_fetched, but with either a different GUC
(min_parallel_index_rel_size) having a relatively lower default value
(say equal to min_parallel_relation_size/4 = 2MB), or by directly using
min_parallel_relation_size/4 as the threshold for parallel index scans.
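
Roughly, that alternative would amount to a check like the following in the
worker computation (a sketch only; the GUC name and the /4 factor are just
suggestions at this point, and index_pages stands for the estimated number
of index pages to be fetched):

	/* index portion too small to be worth a parallel index scan */
	if (index_pages < (BlockNumber) (min_parallel_relation_size / 4))
		return 0;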

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#66Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#65)
4 attachment(s)
Re: Parallel Index Scans

On Sat, Feb 4, 2017 at 7:14 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sat, Feb 4, 2017 at 5:54 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 1, 2017 at 12:58 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

On balance, I'm somewhat inclined to think that we ought to base
everything on heap pages, so that we're always measuring in the same
units. That's what Dilip's patch for parallel bitmap heap scan does,
and I think it's a reasonable choice. However, for parallel index
scan, we might want to also cap the number of workers to, say,
index_pages/10, just so we don't pick an index scan that's going to
result in a very lopsided work distribution.

I guess in the above context you mean heap_pages or index_pages that
are expected to be *fetched* during index scan.

Yet another thought is that for parallel index scan we use
index_pages_fetched, but use either a different GUC
(min_parallel_index_rel_size) with a relatively lower default value
(say equal to min_parallel_relation_size/4 = 2MB) or directly use
min_parallel_relation_size/4 for parallel index scans.

I had some offlist discussion with Robert about the above point and we
feel that keeping only heap pages for the parallel computation might not
be future-proof, as for parallel index-only scans there might not be
any heap pages. So, it is better to use a separate GUC for parallel
index (only) scans. We can have two GUCs,
min_parallel_table_scan_size (8MB) and min_parallel_index_scan_size
(512kB), for computing parallel scans. Parallel sequential scans
and parallel bitmap heap scans can use min_parallel_table_scan_size as
the threshold to compute parallel workers, as we are doing now. For
parallel index scans, both min_parallel_table_scan_size and
min_parallel_index_scan_size can be used as thresholds: we can
compute parallel workers based on both the heap_pages to be scanned and
the index_pages to be scanned, and then keep the minimum of those. This
will help us to engage parallel index scans when the index pages are
below the threshold but there are many heap pages to be scanned, and
will also keep a maximum cap on the number of workers based
on the index scan size.
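
In code form, the computation described above is roughly the following (a
condensed sketch of what compute_parallel_worker does in the attached
guc_parallel_index_scan_v1.patch; the helper name is made up, and the actual
patch additionally handles the parallel_workers reloption and the
heap_pages == 0 / index_pages == 0 cases):

static int
sketch_compute_parallel_workers(BlockNumber heap_pages, BlockNumber index_pages)
{
	int			heap_workers = 1;
	int			index_workers = 1;
	int			threshold;

	/* too small on both counts: no parallelism */
	if (heap_pages < (BlockNumber) min_parallel_table_scan_size &&
		index_pages < (BlockNumber) min_parallel_index_scan_size)
		return 0;

	/* one extra worker for every tripling of the heap size */
	threshold = Max(min_parallel_table_scan_size, 1);
	while (heap_pages >= (BlockNumber) (threshold * 3))
	{
		heap_workers++;
		threshold *= 3;
		if (threshold > INT_MAX / 3)
			break;				/* avoid overflow */
	}

	/* same calculation against the index size */
	threshold = Max(min_parallel_index_scan_size, 1);
	while (index_pages >= (BlockNumber) (threshold * 3))
	{
		index_workers++;
		threshold *= 3;
		if (threshold > INT_MAX / 3)
			break;				/* avoid overflow */
	}

	/* keep the minimum, so a small index caps the worker count */
	return Min(heap_workers, index_workers);
}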

guc_parallel_index_scan_v1.patch - Changes the name of the existing
min_parallel_relation_size to min_parallel_table_scan_size and adds a
new GUC min_parallel_index_scan_size with a default value of 512kB.
This patch also adjusts the computation in compute_parallel_worker to
use both GUCs.

compute_index_pages_v2.patch - This patch extracts the computation
of index pages to be scanned into a separate function and uses it in
the existing code. You will notice that I have pulled up the logic of
converting clauses to indexquals from create_index_path to
build_index_paths, as that is required to compute the number of index
and heap pages to be scanned by the scan in patch
parallel_index_opt_exec_support_v8.patch. This doesn't impact any
existing functionality.

parallel_index_scan_v7.patch - The patch to parallelize btree scans; nothing
is changed from the previous version (just rebased on the latest HEAD).

parallel_index_opt_exec_support_v8.patch - This contains the changes to
compute parallel workers using both the heap and index pages that need to
be scanned.

guc_parallel_index_scan_v1.patch and compute_index_pages_v2.patch are
independent patches. Both are required by the parallel index scan
patches.

The current set of patches handles all the reported comments.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

guc_parallel_index_scan_v1.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index dc63d7d..b4b7f0b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3835,20 +3835,34 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-min-parallel-relation-size" xreflabel="min_parallel_relation_size">
-      <term><varname>min_parallel_relation_size</varname> (<type>integer</type>)
+     <varlistentry id="guc-min-parallel-table-scan-size" xreflabel="min_parallel_table_scan_size">
+      <term><varname>min_parallel_table_scan_size</varname> (<type>integer</type>)
       <indexterm>
-       <primary><varname>min_parallel_relation_size</> configuration parameter</primary>
+       <primary><varname>min_parallel_table_scan_size</> configuration parameter</primary>
       </indexterm>
       </term>
       <listitem>
        <para>
-        Sets the minimum size of relations to be considered for parallel scan.
+        Sets the minimum size of tables to be considered for parallel scan.
         The default is 8 megabytes (<literal>8MB</>).
        </para>
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-min-parallel-index-scan-size" xreflabel="min_parallel_index_scan_size">
+      <term><varname>min_parallel_index_scan_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>min_parallel_index_scan_size</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the minimum size of indexes to be considered for parallel scan.
+        The default is 512 kilobytes (<literal>512kB</>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-effective-cache-size" xreflabel="effective_cache_size">
       <term><varname>effective_cache_size</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5c18987..85505c5 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -57,7 +57,8 @@ typedef struct pushdown_safety_info
 /* These parameters are set by GUC */
 bool		enable_geqo = false;	/* just in case GUC doesn't set it */
 int			geqo_threshold;
-int			min_parallel_relation_size;
+int			min_parallel_table_scan_size;
+int			min_parallel_index_scan_size;
 
 /* Hook for plugins to get control in set_rel_pathlist() */
 set_rel_pathlist_hook_type set_rel_pathlist_hook = NULL;
@@ -126,7 +127,8 @@ static void subquery_push_qual(Query *subquery,
 static void recurse_push_qual(Node *setOp, Query *topquery,
 				  RangeTblEntry *rte, Index rti, Node *qual);
 static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
-static int	compute_parallel_worker(RelOptInfo *rel, BlockNumber pages);
+static int compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
+						BlockNumber index_pages);
 
 
 /*
@@ -679,7 +681,7 @@ create_plain_partial_paths(PlannerInfo *root, RelOptInfo *rel)
 {
 	int			parallel_workers;
 
-	parallel_workers = compute_parallel_worker(rel, rel->pages);
+	parallel_workers = compute_parallel_worker(rel, rel->pages, 0);
 
 	/* If any limit was set to zero, the user doesn't want a parallel scan. */
 	if (parallel_workers <= 0)
@@ -2876,13 +2878,20 @@ remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
 
 /*
  * Compute the number of parallel workers that should be used to scan a
- * relation.  "pages" is the number of pages from the relation that we
- * expect to scan.
+ * relation.  We compute the parallel workers based on the size of the heap to
+ * be scanned and the size of the index to be scanned, then choose a minimum
+ * of those.
+ *
+ * "heap_pages" is the number of pages from the table that we expect to scan.
+ * "index_pages" is the number of pages from the index that we expect to scan.
  */
 static int
-compute_parallel_worker(RelOptInfo *rel, BlockNumber pages)
+compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
+						BlockNumber index_pages)
 {
-	int			parallel_workers;
+	int			parallel_workers = 0;
+	int			heap_parallel_workers = 1;
+	int			index_parallel_workers = 1;
 
 	/*
 	 * If the user has set the parallel_workers reloption, use that; otherwise
@@ -2892,7 +2901,8 @@ compute_parallel_worker(RelOptInfo *rel, BlockNumber pages)
 		parallel_workers = rel->rel_parallel_workers;
 	else
 	{
-		int			parallel_threshold;
+		int			heap_parallel_threshold;
+		int			index_parallel_threshold;
 
 		/*
 		 * If this relation is too small to be worth a parallel scan, just
@@ -2901,25 +2911,48 @@ compute_parallel_worker(RelOptInfo *rel, BlockNumber pages)
 		 * might not be worthwhile just for this relation, but when combined
 		 * with all of its inheritance siblings it may well pay off.
 		 */
-		if (pages < (BlockNumber) min_parallel_relation_size &&
+		if (heap_pages < (BlockNumber) min_parallel_table_scan_size &&
+			index_pages < (BlockNumber) min_parallel_index_scan_size &&
 			rel->reloptkind == RELOPT_BASEREL)
 			return 0;
 
-		/*
-		 * Select the number of workers based on the log of the size of the
-		 * relation.  This probably needs to be a good deal more
-		 * sophisticated, but we need something here for now.  Note that the
-		 * upper limit of the min_parallel_relation_size GUC is chosen to
-		 * prevent overflow here.
-		 */
-		parallel_workers = 1;
-		parallel_threshold = Max(min_parallel_relation_size, 1);
-		while (pages >= (BlockNumber) (parallel_threshold * 3))
+		if (heap_pages > 0)
+		{
+			/*
+			 * Select the number of workers based on the log of the size of
+			 * the relation.  This probably needs to be a good deal more
+			 * sophisticated, but we need something here for now.  Note that
+			 * the upper limit of the min_parallel_table_scan_size GUC is
+			 * chosen to prevent overflow here.
+			 */
+			heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
+			while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
+			{
+				heap_parallel_workers++;
+				heap_parallel_threshold *= 3;
+				if (heap_parallel_threshold > INT_MAX / 3)
+					break;		/* avoid overflow */
+			}
+
+			parallel_workers = heap_parallel_workers;
+		}
+
+		if (index_pages > 0)
 		{
-			parallel_workers++;
-			parallel_threshold *= 3;
-			if (parallel_threshold > INT_MAX / 3)
-				break;			/* avoid overflow */
+			/* same calculation as for heap_pages above */
+			index_parallel_threshold = Max(min_parallel_index_scan_size, 1);
+			while (index_pages >= (BlockNumber) (index_parallel_threshold * 3))
+			{
+				index_parallel_workers++;
+				index_parallel_threshold *= 3;
+				if (index_parallel_threshold > INT_MAX / 3)
+					break;		/* avoid overflow */
+			}
+
+			if (parallel_workers > 0)
+				parallel_workers = Min(parallel_workers, index_parallel_workers);
+			else
+				parallel_workers = index_parallel_workers;
 		}
 	}
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de85eca..0df45b8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2776,17 +2776,28 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
-		{"min_parallel_relation_size", PGC_USERSET, QUERY_TUNING_COST,
-			gettext_noop("Sets the minimum size of relations to be considered for parallel scan."),
+		{"min_parallel_table_scan_size", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the minimum size of tables to be considered for parallel scan."),
 			NULL,
 			GUC_UNIT_BLOCKS,
 		},
-		&min_parallel_relation_size,
+		&min_parallel_table_scan_size,
 		(8 * 1024 * 1024) / BLCKSZ, 0, INT_MAX / 3,
 		NULL, NULL, NULL
 	},
 
 	{
+		{"min_parallel_index_scan_size", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the minimum size of indexes to be considered for parallel scan."),
+			NULL,
+			GUC_UNIT_BLOCKS,
+		},
+		&min_parallel_index_scan_size,
+		(512 * 1024) / BLCKSZ, 0, INT_MAX / 3,
+		NULL, NULL, NULL
+	},
+
+	{
 		/* Can't be set in postgresql.conf */
 		{"server_version_num", PGC_INTERNAL, PRESET_OPTIONS,
 			gettext_noop("Shows the server version as an integer."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 661b0fa..157d775 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -300,7 +300,8 @@
 #cpu_operator_cost = 0.0025		# same scale as above
 #parallel_tuple_cost = 0.1		# same scale as above
 #parallel_setup_cost = 1000.0	# same scale as above
-#min_parallel_relation_size = 8MB
+#min_parallel_table_scan_size = 8MB
+#min_parallel_index_scan_size = 512kB
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 81a9be7..81e7a42 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -22,7 +22,8 @@
  */
 extern bool enable_geqo;
 extern int	geqo_threshold;
-extern int	min_parallel_relation_size;
+extern int	min_parallel_table_scan_size;
+extern int	min_parallel_index_scan_size;
 
 /* Hook for plugins to get control in set_rel_pathlist() */
 typedef void (*set_rel_pathlist_hook_type) (PlannerInfo *root,
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 18e21b7..b967b45 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -9,7 +9,7 @@ begin isolation level repeatable read;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
-set min_parallel_relation_size=0;
+set min_parallel_table_scan_size=0;
 set max_parallel_workers_per_gather=4;
 explain (costs off)
   select count(*) from a_star;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 8b4090f..e3c32de 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -12,7 +12,7 @@ begin isolation level repeatable read;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
-set min_parallel_relation_size=0;
+set min_parallel_table_scan_size=0;
 set max_parallel_workers_per_gather=4;
 
 explain (costs off)
compute_index_pages_v2.patch (application/octet-stream)
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a43daa7..ef125b4 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -4785,6 +4785,179 @@ get_parallel_divisor(Path *path)
 }
 
 /*
+ * compute_index_pages
+ *
+ * compute number of pages fetched from index in index scan.
+ */
+double
+compute_index_pages(PlannerInfo *root, IndexOptInfo *index,
+					List *indexQuals, int loop_count, double numIndexTuples,
+					double *random_page_cost, double *sa_scans,
+					double *indexTuples, double *indexPages,
+					Selectivity *indexSel, Cost *cost)
+{
+	List	   *selectivityQuals;
+	double		pages_fetched;
+	double		num_sa_scans;
+	double		num_outer_scans;
+	double		num_scans;
+	double		spc_random_page_cost;
+	double		numIndexPages;
+	Selectivity indexSelectivity;
+	Cost		indexTotalCost;
+	ListCell   *l;
+
+	/*
+	 * If the index is partial, AND the index predicate with the explicitly
+	 * given indexquals to produce a more accurate idea of the index
+	 * selectivity.
+	 */
+	selectivityQuals = add_predicate_to_quals(index, indexQuals);
+
+	/*
+	 * Check for ScalarArrayOpExpr index quals, and estimate the number of
+	 * index scans that will be performed.
+	 */
+	num_sa_scans = 1;
+	foreach(l, indexQuals)
+	{
+		RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
+
+		if (IsA(rinfo->clause, ScalarArrayOpExpr))
+		{
+			ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
+			int			alength = estimate_array_length(lsecond(saop->args));
+
+			if (alength > 1)
+				num_sa_scans *= alength;
+		}
+	}
+
+	/* Estimate the fraction of main-table tuples that will be visited */
+	indexSelectivity = clauselist_selectivity(root, selectivityQuals,
+											  index->rel->relid,
+											  JOIN_INNER,
+											  NULL);
+
+	/*
+	 * If caller didn't give us an estimate, estimate the number of index
+	 * tuples that will be visited.  We do it in this rather peculiar-looking
+	 * way in order to get the right answer for partial indexes.
+	 */
+	if (numIndexTuples <= 0.0)
+	{
+		numIndexTuples = indexSelectivity * index->rel->tuples;
+
+		/*
+		 * The above calculation counts all the tuples visited across all
+		 * scans induced by ScalarArrayOpExpr nodes.  We want to consider the
+		 * average per-indexscan number, so adjust.  This is a handy place to
+		 * round to integer, too.  (If caller supplied tuple estimate, it's
+		 * responsible for handling these considerations.)
+		 */
+		numIndexTuples = rint(numIndexTuples / num_sa_scans);
+	}
+
+	/*
+	 * We can bound the number of tuples by the index size in any case. Also,
+	 * always estimate at least one tuple is touched, even when
+	 * indexSelectivity estimate is tiny.
+	 */
+	if (numIndexTuples > index->tuples)
+		numIndexTuples = index->tuples;
+	if (numIndexTuples < 1.0)
+		numIndexTuples = 1.0;
+
+	/*
+	 * Estimate the number of index pages that will be retrieved.
+	 *
+	 * We use the simplistic method of taking a pro-rata fraction of the total
+	 * number of index pages.  In effect, this counts only leaf pages and not
+	 * any overhead such as index metapage or upper tree levels.
+	 *
+	 * In practice access to upper index levels is often nearly free because
+	 * those tend to stay in cache under load; moreover, the cost involved is
+	 * highly dependent on index type.  We therefore ignore such costs here
+	 * and leave it to the caller to add a suitable charge if needed.
+	 */
+	if (index->pages > 1 && index->tuples > 1)
+		numIndexPages = ceil(numIndexTuples * index->pages / index->tuples);
+	else
+		numIndexPages = 1.0;
+
+	/* fetch estimated page cost for tablespace containing index */
+	get_tablespace_page_costs(index->reltablespace,
+							  &spc_random_page_cost,
+							  NULL);
+
+	/*
+	 * Now compute the disk access costs.
+	 *
+	 * The above calculations are all per-index-scan.  However, if we are in a
+	 * nestloop inner scan, we can expect the scan to be repeated (with
+	 * different search keys) for each row of the outer relation.  Likewise,
+	 * ScalarArrayOpExpr quals result in multiple index scans.  This creates
+	 * the potential for cache effects to reduce the number of disk page
+	 * fetches needed.  We want to estimate the average per-scan I/O cost in
+	 * the presence of caching.
+	 *
+	 * We use the Mackert-Lohman formula (see costsize.c for details) to
+	 * estimate the total number of page fetches that occur.  While this
+	 * wasn't what it was designed for, it seems a reasonable model anyway.
+	 * Note that we are counting pages not tuples anymore, so we take N = T =
+	 * index size, as if there were one "tuple" per page.
+	 */
+	num_outer_scans = loop_count;
+	num_scans = num_sa_scans * num_outer_scans;
+
+	if (num_scans > 1)
+	{
+		/* total page fetches ignoring cache effects */
+		pages_fetched = numIndexPages * num_scans;
+
+		/* use Mackert and Lohman formula to adjust for cache effects */
+		pages_fetched = index_pages_fetched(pages_fetched,
+											index->pages,
+											(double) index->pages,
+											root);
+
+		/*
+		 * Now compute the total disk access cost, and then report a pro-rated
+		 * share for each outer scan.  (Don't pro-rate for ScalarArrayOpExpr,
+		 * since that's internal to the indexscan.)
+		 */
+		indexTotalCost = (pages_fetched * spc_random_page_cost)
+			/ num_outer_scans;
+	}
+	else
+	{
+		pages_fetched = numIndexPages;
+
+		/*
+		 * For a single index scan, we just charge spc_random_page_cost per
+		 * page touched.
+		 */
+		indexTotalCost = numIndexPages * spc_random_page_cost;
+	}
+
+	/* fill the output parameters */
+	if (random_page_cost)
+		*random_page_cost = spc_random_page_cost;
+	if (sa_scans)
+		*sa_scans = num_sa_scans;
+	if (indexTuples)
+		*indexTuples = numIndexTuples;
+	if (indexSel)
+		*indexSel = indexSelectivity;
+	if (indexPages)
+		*indexPages = numIndexPages;
+	if (cost)
+		*cost = indexTotalCost;
+
+	return pages_fetched;
+}
+
+/*
  * compute_bitmap_pages
  *
  * compute number of pages fetched from heap in bitmap heap scan.
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 5283468..5d22496 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -863,6 +863,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 	IndexPath  *ipath;
 	List	   *index_clauses;
 	List	   *clause_columns;
+	List	   *indexquals;
+	List	   *indexqualcols;
 	Relids		outer_relids;
 	double		loop_count;
 	List	   *orderbyclauses;
@@ -1014,6 +1016,11 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 		orderbyclausecols = NIL;
 	}
 
+	/* Convert clauses to indexquals the executor can handle */
+	expand_indexqual_conditions(index, index_clauses, clause_columns,
+								&indexquals, &indexqualcols);
+
+
 	/*
 	 * 3. Check if an index-only scan is possible.  If we're not building
 	 * plain indexscans, this isn't relevant since bitmap scans don't support
@@ -1034,6 +1041,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 		ipath = create_index_path(root, index,
 								  index_clauses,
 								  clause_columns,
+								  indexquals,
+								  indexqualcols,
 								  orderbyclauses,
 								  orderbyclausecols,
 								  useful_pathkeys,
@@ -1060,6 +1069,8 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 			ipath = create_index_path(root, index,
 									  index_clauses,
 									  clause_columns,
+									  indexquals,
+									  indexqualcols,
 									  NIL,
 									  NIL,
 									  useful_pathkeys,
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 881742f..f5366d8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5377,7 +5377,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 
 	/* Estimate the cost of index scan */
 	indexScanPath = create_index_path(root, indexInfo,
-									  NIL, NIL, NIL, NIL, NIL,
+									  NIL, NIL, NIL, NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
 									  NULL, 1.0);
 
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f440875..8cd80db 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1013,6 +1013,8 @@ create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
 				  List *indexclausecols,
+				  List *indexquals,
+				  List *indexqualcols,
 				  List *indexorderbys,
 				  List *indexorderbycols,
 				  List *pathkeys,
@@ -1023,8 +1025,6 @@ create_index_path(PlannerInfo *root,
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
-	List	   *indexquals,
-			   *indexqualcols;
 
 	pathnode->path.pathtype = indexonly ? T_IndexOnlyScan : T_IndexScan;
 	pathnode->path.parent = rel;
@@ -1036,10 +1036,6 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.parallel_workers = 0;
 	pathnode->path.pathkeys = pathkeys;
 
-	/* Convert clauses to indexquals the executor can handle */
-	expand_indexqual_conditions(index, indexclauses, indexclausecols,
-								&indexquals, &indexqualcols);
-
 	/* Fill in the pathnode */
 	pathnode->indexinfo = index;
 	pathnode->indexclauses = indexclauses;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fa32e9e..6762bf9 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -205,7 +205,6 @@ static Selectivity regex_selectivity(const char *patt, int pattlen,
 static Datum string_to_datum(const char *str, Oid datatype);
 static Const *string_to_const(const char *str, Oid datatype);
 static Const *string_to_bytea_const(const char *str, size_t str_len);
-static List *add_predicate_to_quals(IndexOptInfo *index, List *indexQuals);
 
 
 /*
@@ -6245,146 +6244,13 @@ genericcostestimate(PlannerInfo *root,
 	double		numIndexTuples;
 	double		spc_random_page_cost;
 	double		num_sa_scans;
-	double		num_outer_scans;
-	double		num_scans;
 	double		qual_op_cost;
 	double		qual_arg_cost;
-	List	   *selectivityQuals;
-	ListCell   *l;
-
-	/*
-	 * If the index is partial, AND the index predicate with the explicitly
-	 * given indexquals to produce a more accurate idea of the index
-	 * selectivity.
-	 */
-	selectivityQuals = add_predicate_to_quals(index, indexQuals);
-
-	/*
-	 * Check for ScalarArrayOpExpr index quals, and estimate the number of
-	 * index scans that will be performed.
-	 */
-	num_sa_scans = 1;
-	foreach(l, indexQuals)
-	{
-		RestrictInfo *rinfo = (RestrictInfo *) lfirst(l);
-
-		if (IsA(rinfo->clause, ScalarArrayOpExpr))
-		{
-			ScalarArrayOpExpr *saop = (ScalarArrayOpExpr *) rinfo->clause;
-			int			alength = estimate_array_length(lsecond(saop->args));
-
-			if (alength > 1)
-				num_sa_scans *= alength;
-		}
-	}
-
-	/* Estimate the fraction of main-table tuples that will be visited */
-	indexSelectivity = clauselist_selectivity(root, selectivityQuals,
-											  index->rel->relid,
-											  JOIN_INNER,
-											  NULL);
-
-	/*
-	 * If caller didn't give us an estimate, estimate the number of index
-	 * tuples that will be visited.  We do it in this rather peculiar-looking
-	 * way in order to get the right answer for partial indexes.
-	 */
-	numIndexTuples = costs->numIndexTuples;
-	if (numIndexTuples <= 0.0)
-	{
-		numIndexTuples = indexSelectivity * index->rel->tuples;
-
-		/*
-		 * The above calculation counts all the tuples visited across all
-		 * scans induced by ScalarArrayOpExpr nodes.  We want to consider the
-		 * average per-indexscan number, so adjust.  This is a handy place to
-		 * round to integer, too.  (If caller supplied tuple estimate, it's
-		 * responsible for handling these considerations.)
-		 */
-		numIndexTuples = rint(numIndexTuples / num_sa_scans);
-	}
-
-	/*
-	 * We can bound the number of tuples by the index size in any case. Also,
-	 * always estimate at least one tuple is touched, even when
-	 * indexSelectivity estimate is tiny.
-	 */
-	if (numIndexTuples > index->tuples)
-		numIndexTuples = index->tuples;
-	if (numIndexTuples < 1.0)
-		numIndexTuples = 1.0;
-
-	/*
-	 * Estimate the number of index pages that will be retrieved.
-	 *
-	 * We use the simplistic method of taking a pro-rata fraction of the total
-	 * number of index pages.  In effect, this counts only leaf pages and not
-	 * any overhead such as index metapage or upper tree levels.
-	 *
-	 * In practice access to upper index levels is often nearly free because
-	 * those tend to stay in cache under load; moreover, the cost involved is
-	 * highly dependent on index type.  We therefore ignore such costs here
-	 * and leave it to the caller to add a suitable charge if needed.
-	 */
-	if (index->pages > 1 && index->tuples > 1)
-		numIndexPages = ceil(numIndexTuples * index->pages / index->tuples);
-	else
-		numIndexPages = 1.0;
 
-	/* fetch estimated page cost for tablespace containing index */
-	get_tablespace_page_costs(index->reltablespace,
-							  &spc_random_page_cost,
-							  NULL);
-
-	/*
-	 * Now compute the disk access costs.
-	 *
-	 * The above calculations are all per-index-scan.  However, if we are in a
-	 * nestloop inner scan, we can expect the scan to be repeated (with
-	 * different search keys) for each row of the outer relation.  Likewise,
-	 * ScalarArrayOpExpr quals result in multiple index scans.  This creates
-	 * the potential for cache effects to reduce the number of disk page
-	 * fetches needed.  We want to estimate the average per-scan I/O cost in
-	 * the presence of caching.
-	 *
-	 * We use the Mackert-Lohman formula (see costsize.c for details) to
-	 * estimate the total number of page fetches that occur.  While this
-	 * wasn't what it was designed for, it seems a reasonable model anyway.
-	 * Note that we are counting pages not tuples anymore, so we take N = T =
-	 * index size, as if there were one "tuple" per page.
-	 */
-	num_outer_scans = loop_count;
-	num_scans = num_sa_scans * num_outer_scans;
-
-	if (num_scans > 1)
-	{
-		double		pages_fetched;
-
-		/* total page fetches ignoring cache effects */
-		pages_fetched = numIndexPages * num_scans;
-
-		/* use Mackert and Lohman formula to adjust for cache effects */
-		pages_fetched = index_pages_fetched(pages_fetched,
-											index->pages,
-											(double) index->pages,
-											root);
-
-		/*
-		 * Now compute the total disk access cost, and then report a pro-rated
-		 * share for each outer scan.  (Don't pro-rate for ScalarArrayOpExpr,
-		 * since that's internal to the indexscan.)
-		 */
-		indexTotalCost = (pages_fetched * spc_random_page_cost)
-			/ num_outer_scans;
-	}
-	else
-	{
-		/*
-		 * For a single index scan, we just charge spc_random_page_cost per
-		 * page touched.
-		 */
-		indexTotalCost = numIndexPages * spc_random_page_cost;
-	}
+	(void) compute_index_pages(root, index, indexQuals, loop_count,
+							   costs->numIndexTuples, &spc_random_page_cost,
+							   &num_sa_scans, &numIndexTuples, &numIndexPages,
+							   &indexSelectivity, &indexTotalCost);
 
 	/*
 	 * CPU cost: any complex expressions in the indexquals will need to be
@@ -6446,7 +6312,7 @@ genericcostestimate(PlannerInfo *root,
  * predicate_implied_by() and clauselist_selectivity(), but might be
  * problematic if the result were passed to other things.
  */
-static List *
+List *
 add_predicate_to_quals(IndexOptInfo *index, List *indexQuals)
 {
 	List	   *predExtraQuals = NIL;
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 0e68264..8525917 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -183,6 +183,10 @@ extern void set_cte_size_estimates(PlannerInfo *root, RelOptInfo *rel,
 					   double cte_rows);
 extern void set_foreign_size_estimates(PlannerInfo *root, RelOptInfo *rel);
 extern PathTarget *set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target);
+extern double compute_index_pages(PlannerInfo *root, IndexOptInfo *index,
+					List *indexclauses, int loop_count, double numIndexTuples,
+			 double *random_page_cost, double *sa_scans, double *indexTuples,
+					double *indexPages, Selectivity *indexSel, Cost *cost);
 extern double compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel,
 				  Path *bitmapqual, int loop_count, Cost *cost, double *tuple);
 
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7b41317..44e143c 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -41,6 +41,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 				  IndexOptInfo *index,
 				  List *indexclauses,
 				  List *indexclausecols,
+				  List *indexquals,
+				  List *indexqualcols,
 				  List *indexorderbys,
 				  List *indexorderbycols,
 				  List *pathkeys,
diff --git a/src/include/utils/selfuncs.h b/src/include/utils/selfuncs.h
index 9f9d2dc..b1ffca3 100644
--- a/src/include/utils/selfuncs.h
+++ b/src/include/utils/selfuncs.h
@@ -212,6 +212,7 @@ extern void genericcostestimate(PlannerInfo *root, IndexPath *path,
 					double loop_count,
 					List *qinfos,
 					GenericCosts *costs);
+extern List *add_predicate_to_quals(IndexOptInfo *index, List *indexQuals);
 
 /* Functions in array_selfuncs.c */
 
parallel_index_scan_v7.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 01fad38..638a9ab 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1207,7 +1207,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting in an extension.</entry>
         </row>
         <row>
-         <entry morerows="9"><literal>IPC</></entry>
+         <entry morerows="10"><literal>IPC</></entry>
          <entry><literal>BgWorkerShutdown</></entry>
          <entry>Waiting for background worker to shut down.</entry>
         </row>
@@ -1240,6 +1240,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting for parallel workers to finish computing.</entry>
         </row>
         <row>
+         <entry><literal>ParallelBtreePage</></entry>
+         <entry>Waiting for the page number needed to continue a parallel btree scan
+         to become available.</entry>
+        </row>
+        <row>
          <entry><literal>SafeSnapshot</></entry>
          <entry>Waiting for a snapshot for a <literal>READ ONLY DEFERRABLE</> transaction.</entry>
         </row>
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index ba27c1e..28de5be 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -481,6 +481,9 @@ index_parallelrescan(IndexScanDesc scan)
 {
 	SCAN_CHECKS;
 
+	if (!scan->parallel_scan)
+		return;
+
 	/* amparallelrescan is optional; assume no-op if not provided by AM */
 	if (scan->indexRelation->rd_amroutine->amparallelrescan != NULL)
 		scan->indexRelation->rd_amroutine->amparallelrescan(scan);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 469e7ab..5809700 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,8 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "pgstat.h"
+#include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -63,6 +65,44 @@ typedef struct
 	MemoryContext pagedelcontext;
 } BTVacState;
 
+/*
+ * BTPARALLEL_NOT_INITIALIZED implies that the scan has not yet started.
+ *
+ * BTPARALLEL_ADVANCING implies that the master backend or one of the workers
+ * is advancing the scan to a new page; others must wait.
+ *
+ * BTPARALLEL_IDLE implies that no backend is advancing the scan; any backend
+ * may start doing so.
+ *
+ * BTPARALLEL_DONE implies that the scan is complete (including error exit).
+ */
+typedef enum
+{
+	BTPARALLEL_NOT_INITIALIZED,
+	BTPARALLEL_ADVANCING,
+	BTPARALLEL_IDLE,
+	BTPARALLEL_DONE
+} BTPS_State;
+
+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+	BlockNumber btps_scanPage;	/* latest or next page to be scanned */
+	BTPS_State	btps_pageStatus;/* indicates whether next page is available
+								 * for scan. see above for possible states of
+								 * parallel scan. */
+	int			btps_arrayKeyCount;		/* count indicating number of array
+										 * scan keys processed by parallel
+										 * scan */
+	slock_t		btps_mutex;		/* protects above variables */
+	ConditionVariable btps_cv;	/* used to synchronize parallel scan */
+} BTParallelScanDescData;
+
+typedef struct BTParallelScanDescData *BTParallelScanDesc;
+
 
 static void btbuildCallback(Relation index,
 				HeapTuple htup,
@@ -118,9 +158,9 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = btendscan;
 	amroutine->ammarkpos = btmarkpos;
 	amroutine->amrestrpos = btrestrpos;
-	amroutine->amestimateparallelscan = NULL;
-	amroutine->aminitparallelscan = NULL;
-	amroutine->amparallelrescan = NULL;
+	amroutine->amestimateparallelscan = btestimateparallelscan;
+	amroutine->aminitparallelscan = btinitparallelscan;
+	amroutine->amparallelrescan = btparallelrescan;
 
 	PG_RETURN_POINTER(amroutine);
 }
@@ -490,6 +530,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	}
 
 	so->markItemIndex = -1;
+	so->arrayKeyCount = 0;
 	BTScanPosUnpinIfPinned(so->markPos);
 	BTScanPosInvalidate(so->markPos);
 
@@ -652,6 +693,215 @@ btrestrpos(IndexScanDesc scan)
 }
 
 /*
+ * btestimateparallelscan -- estimate storage for BTParallelScanDescData
+ */
+Size
+btestimateparallelscan(void)
+{
+	return sizeof(BTParallelScanDescData);
+}
+
+/*
+ * btinitparallelscan -- initialize BTParallelScanDesc for parallel btree scan
+ */
+void
+btinitparallelscan(void *target)
+{
+	BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
+
+	SpinLockInit(&bt_target->btps_mutex);
+	bt_target->btps_scanPage = InvalidBlockNumber;
+	bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->btps_arrayKeyCount = 0;
+	ConditionVariableInit(&bt_target->btps_cv);
+}
+
+/*
+ *	btparallelrescan() -- reset parallel scan
+ */
+void
+btparallelrescan(IndexScanDesc scan)
+{
+	BTParallelScanDesc btscan;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+
+	Assert(parallel_scan);
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * In theory, we don't need to acquire the spinlock here, because there
+	 * shouldn't be any other workers running at this point, but we do so for
+	 * consistency.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = InvalidBlockNumber;
+	btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	btscan->btps_arrayKeyCount = 0;
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
+ * _bt_parallel_seize() -- Begin the process of advancing the scan to a new
+ *		page.  Other scans must wait until we call _bt_parallel_release()
+ *		or _bt_parallel_done().
+ *
+ * The return value tells the caller whether to continue the scan.  A true
+ * value indicates one of the following: (a) the returned block number is
+ * valid and the scan can be continued, (b) the block number is invalid and
+ * the scan has only just begun, or (c) the block number is P_NONE and the
+ * scan is finished.  A false value indicates that we have already reached
+ * the end of the scan for the current scan keys; in that case the block
+ * number is set to P_NONE.
+ *
+ * The first time the master backend or a worker hits the last page, it gets
+ * P_NONE with a true return value; after that, any worker that tries to
+ * fetch the next page gets a false return value.
+ *
+ * Callers should ignore the value of *pageno if false is returned.
+ */
+bool
+_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTPS_State	pageStatus;
+	bool		exit_loop = false;
+	bool		status = true;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	while (1)
+	{
+		/*
+		 * Fetch the next block to scan and update the page status so that
+		 * other participants of the parallel scan wait until the next page
+		 * is available.  Set status to false if the scan is finished.
+		 */
+		SpinLockAcquire(&btscan->btps_mutex);
+		pageStatus = btscan->btps_pageStatus;
+
+		/* Check if the scan for current scan keys is finished */
+		if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+			status = false;
+		else if (pageStatus == BTPARALLEL_DONE)
+			status = false;
+		else if (pageStatus != BTPARALLEL_ADVANCING)
+		{
+			btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
+			*pageno = btscan->btps_scanPage;
+			exit_loop = true;
+		}
+		SpinLockRelease(&btscan->btps_mutex);
+		if (exit_loop || !status)
+			break;
+		ConditionVariableSleep(&btscan->btps_cv, WAIT_EVENT_BTREE_PAGE);
+	}
+	ConditionVariableCancelSleep();
+
+	/* no more pages to scan */
+	if (!status)
+		*pageno = P_NONE;
+
+	return status;
+}
+
+/*
+ * _bt_parallel_release() -- Complete the process of advancing the scan to a
+ *		new page.  We now have the new value btps_scanPage; some other backend
+ *		can now begin advancing the scan.
+ *
+ * scan_page - For backward scans, it holds the latest page being scanned
+ *		and for forward scans, it holds the next page to be scanned.
+ */
+void
+_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+{
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = scan_page;
+	btscan->btps_pageStatus = BTPARALLEL_IDLE;
+	SpinLockRelease(&btscan->btps_mutex);
+	ConditionVariableSignal(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_done() -- Mark the parallel scan as complete.
+ *
+ * When there are no pages left to scan, this function should be called to
+ * notify other workers.  Otherwise, they might wait forever for the scan to
+ * advance to the next page.
+ */
+void
+_bt_parallel_done(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+	bool		status_changed = false;
+
+	/* Do nothing for non-parallel scans */
+	if (parallel_scan == NULL)
+		return;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * Mark the parallel scan as done no more than once per scan.  We rely on
+	 * this state to initiate the next scan for multiple array keys; see
+	 * _bt_advance_array_keys.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+		btscan->btps_pageStatus != BTPARALLEL_DONE)
+	{
+		btscan->btps_pageStatus = BTPARALLEL_DONE;
+		status_changed = true;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+
+	/* wake up all the workers associated with this parallel scan */
+	if (status_changed)
+		ConditionVariableBroadcast(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_advance_array_keys() -- Advances the parallel scan for array
+ *			keys.
+ *
+ * Updates the count of array keys processed for both local and parallel
+ * scans.
+ */
+void
+_bt_parallel_advance_array_keys(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	so->arrayKeyCount++;
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (btscan->btps_pageStatus == BTPARALLEL_DONE)
+	{
+		btscan->btps_scanPage = InvalidBlockNumber;
+		btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		btscan->btps_arrayKeyCount++;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index b6459d2..a211c35 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,9 +30,12 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
 
 /*
@@ -544,8 +547,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
+	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	BlockNumber blkno;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -564,6 +569,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (!so->qual_ok)
 		return false;
 
+	/*
+	 * For parallel scans, get the starting page from shared state.  If the
+	 * scan has not started, we descend the tree to find the first leaf page
+	 * that needs to be scanned for matching tuples, while the other workers
+	 * wait until we have done so.
+	 *
+	 * If the scan has already begun, skip finding the first leaf page and
+	 * directly read the page stored in the shared structure (or, for a
+	 * backward scan, the page to its left).
+	 */
+	if (scan->parallel_scan != NULL)
+	{
+		status = _bt_parallel_seize(scan, &blkno);
+		if (!status)
+		{
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno == P_NONE)
+		{
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno != InvalidBlockNumber)
+		{
+			if (!_bt_parallel_readpage(scan, blkno, dir))
+				return false;
+			goto readcomplete;
+		}
+	}
+
 	/*----------
 	 * Examine the scan keys to discover where we need to start the scan.
 	 *
@@ -743,7 +780,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * there.
 	 */
 	if (keysCount == 0)
-		return _bt_endpoint(scan, dir);
+	{
+		bool		match;
+
+		match = _bt_endpoint(scan, dir);
+		if (!match)
+		{
+			/* No match, so indicate (parallel) scan finished */
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+		}
+
+		return match;
+	}
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -993,25 +1042,21 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * because nothing finer to lock exists.
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
+
+		/*
+		 * mark parallel scan as done, so that all the workers can finish
+		 * their scan
+		 */
+		_bt_parallel_done(scan);
+		BTScanPosInvalidate(so->currPos);
+
 		return false;
 	}
 	else
 		PredicateLockPage(rel, BufferGetBlockNumber(buf),
 						  scan->xs_snapshot);
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	Assert(so->markItemIndex == -1);
+	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
@@ -1060,6 +1105,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
+readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
@@ -1132,6 +1178,10 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction.
  *
+ * In the case of a parallel scan, the caller must have called
+ * _bt_parallel_seize already; this function will call _bt_parallel_release
+ * to allow the parallel scan to advance.
+ *
  * Returns true if any matching items found on the page, false if none.
  */
 static bool
@@ -1154,6 +1204,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	page = BufferGetPage(so->currPos.buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/* allow the next page to be processed by a parallel worker */
+	if (scan->parallel_scan)
+	{
+		if (ScanDirectionIsForward(dir))
+			_bt_parallel_release(scan, opaque->btpo_next);
+		else
+			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+	}
+
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1278,21 +1338,16 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
  * if pinned, we'll drop the pin before moving to next page.  The buffer is
  * not locked on entry.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page.  For success on a scan using a non-MVCC snapshot we hold
- * a pin, but not a read lock, on that page.  If we do not hold the pin, we
- * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
- *
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
+ * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
+ * to InvalidBuffer.  We return TRUE to indicate success.
  */
 static bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Relation	rel;
-	Page		page;
-	BTPageOpaque opaque;
+	BlockNumber blkno = InvalidBlockNumber;
+	bool		status = true;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1319,13 +1374,27 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}
 
-	rel = scan->indexRelation;
-
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
-		/* We must rely on the previously saved nextPage link! */
-		BlockNumber blkno = so->currPos.nextPage;
+
+		/*
+		 * We must rely on the previously saved nextPage link for non-parallel
+		 * scans!
+		 */
+		if (scan->parallel_scan != NULL)
+		{
+			status = _bt_parallel_seize(scan, &blkno);
+			if (!status)
+			{
+				/* release the previous buffer, if pinned */
+				BTScanPosUnpinIfPinned(so->currPos);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+			blkno = so->currPos.nextPage;
 
 		/* Remember we left a page with data */
 		so->currPos.moreLeft = true;
@@ -1333,11 +1402,72 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		/* release the previous buffer, if pinned */
 		BTScanPosUnpinIfPinned(so->currPos);
 
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+	else
+	{
+		/* Remember we left a page with data */
+		so->currPos.moreRight = true;
+
+		/* For parallel scans, get the last page scanned */
+		if (scan->parallel_scan != NULL)
+		{
+			status = _bt_parallel_seize(scan, &blkno);
+			BTScanPosUnpinIfPinned(so->currPos);
+			if (!status)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+
+	/* Drop the lock, and maybe the pin, on the current page */
+	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+	return true;
+}
+
+/*
+ *	_bt_readnextpage() -- Read next page containing valid data for scan
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page.  Caller is responsible to release lock and pin on
+ * buffer on success.  We return TRUE to indicate success.
+ *
+ * If there are no more matching records in the given direction, we drop all
+ * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ *
+ * Callers normally pass a valid blkno.  The exception is non-parallel
+ * backward scans, which pass InvalidBlockNumber; in that case so->currPos is
+ * expected to already contain a valid position to walk left from.
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation	rel;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		status = true;
+
+	rel = scan->indexRelation;
+
+	if (ScanDirectionIsForward(dir))
+	{
 		for (;;)
 		{
-			/* if we're at end of scan, give up */
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
 			if (blkno == P_NONE || !so->currPos.moreRight)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1359,14 +1489,30 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			}
 
 			/* nope, keep going */
-			blkno = opaque->btpo_next;
+			if (scan->parallel_scan != NULL)
+			{
+				status = _bt_parallel_seize(scan, &blkno);
+				if (!status)
+				{
+					_bt_relbuf(rel, so->currPos.buf);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+			}
+			else
+				blkno = opaque->btpo_next;
 			_bt_relbuf(rel, so->currPos.buf);
 		}
 	}
 	else
 	{
-		/* Remember we left a page with data */
-		so->currPos.moreRight = true;
+		/*
+		 * For parallel scans, the current block number has to be retrieved
+		 * from shared state; it is the caller's responsibility to pass the
+		 * correct block number here.
+		 */
+		if (blkno != InvalidBlockNumber)
+			so->currPos.currPage = blkno;
 
 		/*
 		 * Walk left to the next page with data.  This is much more complex
@@ -1401,6 +1547,12 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			if (!so->currPos.moreLeft)
 			{
 				_bt_relbuf(rel, so->currPos.buf);
+
+				/*
+				 * mark parallel scan as done, so that all the workers can
+				 * finish their scan
+				 */
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1412,6 +1564,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			/* if we're physically at end of index, return failure */
 			if (so->currPos.buf == InvalidBuffer)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1432,9 +1585,46 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
 					break;
 			}
+
+			/*
+			 * For parallel scans, re-fetch the last page scanned from shared
+			 * state: by the time we try to fetch the previous page, another
+			 * worker may already have claimed it, so we can't rely on the
+			 * result of _bt_walk_left alone.
+			 */
+			if (scan->parallel_scan != NULL)
+			{
+				_bt_relbuf(rel, so->currPos.buf);
+				status = _bt_parallel_seize(scan, &blkno);
+				if (!status)
+				{
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+				so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			}
 		}
 	}
 
+	return true;
+}
+
+/*
+ *	_bt_parallel_readpage() -- Read current page containing valid data for scan
+ *
+ * On success, release lock and pin on buffer.  We return TRUE to indicate
+ * success.
+ */
+static bool
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	_bt_initialize_more_data(so, dir);
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
 	/* Drop the lock, and maybe the pin, on the current page */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
@@ -1712,19 +1902,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/* remember which buffer we have pinned */
 	so->currPos.buf = buf;
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
+	_bt_initialize_more_data(so, dir);
 
 	/*
 	 * Now load data from the first page of the scan.
@@ -1753,3 +1931,25 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 
 	return true;
 }
+
+/*
+ * _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
+ * for scan direction
+ */
+static inline void
+_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
+{
+	/* initialize moreLeft/moreRight appropriately for scan direction */
+	if (ScanDirectionIsForward(dir))
+	{
+		so->currPos.moreLeft = false;
+		so->currPos.moreRight = true;
+	}
+	else
+	{
+		so->currPos.moreLeft = true;
+		so->currPos.moreRight = false;
+	}
+	so->numKilled = 0;			/* just paranoia */
+	so->markItemIndex = -1;		/* ditto */
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index da0f330..5b259a3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -590,6 +590,10 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 			break;
 	}
 
+	/* advance parallel scan */
+	if (scan->parallel_scan != NULL)
+		_bt_parallel_advance_array_keys(scan);
+
 	return found;
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..92af6ec 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3392,6 +3392,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PARALLEL_FINISH:
 			event_name = "ParallelFinish";
 			break;
+		case WAIT_EVENT_BTREE_PAGE:
+			event_name = "ParallelBtreePage";
+			break;
 		case WAIT_EVENT_SAFE_SNAPSHOT:
 			event_name = "SafeSnapshot";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 011a72e..60ca1d5 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -609,6 +609,8 @@ typedef struct BTScanOpaqueData
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
 								 * there are any unsatisfiable array keys) */
+	int			arrayKeyCount;	/* count indicating number of array scan keys
+								 * processed */
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
@@ -652,7 +654,8 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
 /*
- * prototypes for functions in nbtree.c (external entry points for btree)
+ * prototypes for functions in nbtree.c (external entry points for btree and
+ * functions to maintain state of parallel scan)
  */
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 		struct IndexInfo *indexInfo);
@@ -661,10 +664,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 		 ItemPointer ht_ctid, Relation heapRel,
 		 IndexUniqueCheck checkUnique);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern Size btestimateparallelscan(void);
+extern void btinitparallelscan(void *target);
+extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+extern void _bt_parallel_done(IndexScanDesc scan);
+extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		 ScanKey orderbys, int norderbys);
+extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..915cc85 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -786,6 +786,7 @@ typedef enum
 	WAIT_EVENT_MQ_RECEIVE,
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
+	WAIT_EVENT_BTREE_PAGE,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP
 } WaitEventIPC;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c4235ae..9f876ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -161,6 +161,9 @@ BTPageOpaque
 BTPageOpaqueData
 BTPageStat
 BTPageState
+BTParallelScanDesc
+BTParallelScanDescData
+BTPS_State
 BTScanOpaque
 BTScanOpaqueData
 BTScanPos
Attachment: parallel_index_opt_exec_support_v8.patch (application/octet-stream)
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 858798d..f2eda67 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -119,6 +119,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = blbuild;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 5d8e557..623be4e 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -110,6 +110,8 @@ typedef struct IndexAmRoutine
     bool        amclusterable;
     /* does AM handle predicate locks? */
     bool        ampredlocks;
+    /* does AM support parallel scan? */
+    bool        amcanparallel;
     /* type of data stored in index, or InvalidOid if variable */
     Oid         amkeytype;
 
diff --git a/doc/src/sgml/parallel.sgml b/doc/src/sgml/parallel.sgml
index 5d4bb21..cfc3435 100644
--- a/doc/src/sgml/parallel.sgml
+++ b/doc/src/sgml/parallel.sgml
@@ -268,15 +268,28 @@ EXPLAIN SELECT * FROM pgbench_accounts WHERE filler LIKE '%x%';
   <title>Parallel Scans</title>
 
   <para>
-    Currently, the only type of scan which has been modified to work with
-    parallel query is a sequential scan.  Therefore, the driving table in
-    a parallel plan will always be scanned using a
-    <literal>Parallel Seq Scan</>.  The relation's blocks will be divided
-    among the cooperating processes.  Blocks are handed out one at a
-    time, so that access to the relation remains sequential.  Each process
-    will visit every tuple on the page assigned to it before requesting a new
-    page.
+    Currently, the types of scan that have been modified to work with
+    parallel query are sequential scans and index scans.
   </para>
+   
+  <para>
+    In <literal>Parallel Sequential Scans</>, the driving table in a parallel plan
+    will always be scanned using a <literal>Parallel Seq Scan</>.  The relation's
+    blocks will be divided among the cooperating processes.  Blocks are handed out
+    one at a time, so that access to the relation remains sequential.  Each process
+    will visit every tuple on the page assigned to it before requesting a new page.
+  </para>
+
+  <para>
+    In a <literal>Parallel Index Scan</>, the driving table of a parallel plan
+    is scanned through its index, with the index's leaf blocks divided among
+    the cooperating processes.  Currently, the only type of index which has
+    been modified to work with parallel query is <literal>btree</>; the
+    parallelism happens at the leaf level of the <literal>btree</>.  The first
+    backend (either the master or a worker backend) to start the scan descends
+    to the leaf level while the others wait for it to get there.  At the leaf
+    level, blocks are handed out one at a time, much as in a
+    <literal>Parallel Seq Scan</>, until no blocks remain or the scan reaches
+    its end point.
+  </para>
  </sect2>
 
  <sect2 id="parallel-joins">
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index b2afdb7..9ae1c39 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -93,6 +93,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 02d920b..4feb524 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index c2247ad..b2c9389 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -70,6 +70,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 97ad22a..06e80ef 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -67,6 +67,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = INT4OID;
 
 	amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 5809700..4f4ba2a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -139,6 +139,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
+	amroutine->amcanparallel = true;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 78846be..e57ac49 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -49,6 +49,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = spgbuild;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index fe87c9a..03ed1ca 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -28,6 +28,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeSeqscan.h"
+#include "executor/nodeIndexscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -197,6 +198,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 				ExecSeqScanEstimate((SeqScanState *) planstate,
 									e->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanEstimate((IndexScanState *) planstate,
+									  e->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanEstimate((ForeignScanState *) planstate,
 										e->pcxt);
@@ -249,6 +254,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
 										 d->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeDSM((IndexScanState *) planstate,
+										   d->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
 											 d->pcxt);
@@ -725,6 +734,9 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 			case T_SeqScanState:
 				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeWorker((IndexScanState *) planstate, toc);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
 												toc);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 5734550..7bb594c 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -22,6 +22,9 @@
  *		ExecEndIndexScan		releases all storage.
  *		ExecIndexMarkPos		marks scan position.
  *		ExecIndexRestrPos		restores scan position.
+ *		ExecIndexScanEstimate	estimates DSM space needed for parallel index scan
+ *		ExecIndexScanInitializeDSM initialize DSM for parallel indexscan
+ *		ExecIndexScanInitializeWorker attach to DSM info in parallel worker
  */
 #include "postgres.h"
 
@@ -514,6 +517,18 @@ ExecIndexScan(IndexScanState *node)
 void
 ExecReScanIndexScan(IndexScanState *node)
 {
+	bool		reset_parallel_scan = true;
+
+	/*
+	 * If we are here to just update the scan keys, then don't reset parallel
+	 * scan.  We don't want each of the participating processes in the parallel
+	 * scan to update the shared parallel scan state at the start of the scan.
+	 * It is quite possible that one of the participants has already begun
+	 * scanning the index when another has yet to start it.
+	 */
+	if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+		reset_parallel_scan = false;
+
 	/*
 	 * If we are doing runtime key calculations (ie, any of the index key
 	 * values weren't simple Consts), compute the new key values.  But first,
@@ -539,10 +554,21 @@ ExecReScanIndexScan(IndexScanState *node)
 			reorderqueue_pop(node);
 	}
 
-	/* reset index scan */
-	index_rescan(node->iss_ScanDesc,
-				 node->iss_ScanKeys, node->iss_NumScanKeys,
-				 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+	/*
+	 * Reset (parallel) index scan.  For parallel-aware nodes, the scan
+	 * descriptor is initialized during actual execution of the node and we
+	 * can reach here before that (e.g. during execution of a nested loop
+	 * join); in that case, avoid updating the scan descriptor.
+	 */
+	if (node->iss_ScanDesc)
+	{
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		if (reset_parallel_scan)
+			index_parallelrescan(node->iss_ScanDesc);
+	}
 	node->iss_ReachedEnd = false;
 
 	ExecScanReScan(&node->ss);
@@ -1013,22 +1039,29 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Initialize scan descriptor.
+	 * For a parallel-aware node, we initialize the scan descriptor after
+	 * initializing the shared memory for parallel execution.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
-											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot,
-											   indexstate->iss_NumScanKeys,
+	if (!node->scan.plan.parallel_aware)
+	{
+		/*
+		 * Initialize scan descriptor.
+		 */
+		indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+												indexstate->iss_RelationDesc,
+												   estate->es_snapshot,
+												 indexstate->iss_NumScanKeys,
 											 indexstate->iss_NumOrderByKeys);
 
-	/*
-	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
-	 * index AM.
-	 */
-	if (indexstate->iss_NumRuntimeKeys == 0)
-		index_rescan(indexstate->iss_ScanDesc,
-					 indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
+		/*
+		 * If no run-time keys to calculate, go ahead and pass the scankeys to
+		 * the index AM.
+		 */
+		if (indexstate->iss_NumRuntimeKeys == 0)
+			index_rescan(indexstate->iss_ScanDesc,
+					   indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
 				indexstate->iss_OrderByKeys, indexstate->iss_NumOrderByKeys);
+	}
 
 	/*
 	 * all done.
@@ -1590,3 +1623,91 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 	else if (n_array_keys != 0)
 		elog(ERROR, "ScalarArrayOpExpr index qual found where not allowed");
 }
+
+/* ----------------------------------------------------------------
+ *						Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanEstimate
+ *
+ *		estimates the space required to serialize indexscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanEstimate(IndexScanState *node,
+					  ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+													 estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeDSM
+ *
+ *		Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeDSM(IndexScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+	index_parallelscan_initialize(node->ss.ss_currentRelation,
+								  node->iss_RelationDesc,
+								  estate->es_snapshot,
+								  piscan);
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeWorker
+ *
+ *		Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc)
+{
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 85505c5..eeacf81 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -127,8 +127,6 @@ static void subquery_push_qual(Query *subquery,
 static void recurse_push_qual(Node *setOp, Query *topquery,
 				  RangeTblEntry *rte, Index rti, Node *qual);
 static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
-static int compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
-						BlockNumber index_pages);
 
 
 /*
@@ -2885,7 +2883,7 @@ remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
  * "heap_pages" is the number of pages from the table that we expect to scan.
  * "index_pages" is the number of pages from the index that we expect to scan.
  */
-static int
+int
 compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
 						BlockNumber index_pages)
 {
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index ef125b4..b73a467 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -400,6 +400,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	List	   *qpquals;
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
+	Cost		cpu_run_cost = 0;
 	Cost		indexStartupCost;
 	Cost		indexTotalCost;
 	Selectivity indexSelectivity;
@@ -602,11 +603,24 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 
-	run_cost += cpu_per_tuple * tuples_fetched;
+	cpu_run_cost += cpu_per_tuple * tuples_fetched;
 
 	/* tlist eval costs are paid per output row, not per tuple scanned */
 	startup_cost += path->path.pathtarget->cost.startup;
-	run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+	cpu_run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+
+	/* Adjust costing for parallelism, if used. */
+	if (path->path.parallel_workers > 0)
+	{
+		double		parallel_divisor = get_parallel_divisor(&path->path);
+
+		path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
+
+		/* The CPU cost is divided among all the workers. */
+		cpu_run_cost /= parallel_divisor;
+	}
+
+	run_cost += cpu_run_cost;
 
 	path->path.startup_cost = startup_cost;
 	path->path.total_cost = startup_cost + run_cost;
@@ -4958,6 +4972,43 @@ compute_index_pages(PlannerInfo *root, IndexOptInfo *index,
 }
 
 /*
+ * compute_index_and_heap_pages
+ *
+ * compute and return the number of pages fetched from index and heap in index
+ * scan.
+ */
+void
+compute_index_and_heap_pages(PlannerInfo *root, IndexOptInfo *index,
+							 List *indexQuals, int loop_count,
+							 double *index_pages, double *heap_pages)
+{
+	RelOptInfo *baserel = index->rel;
+	Selectivity indexSelectivity;
+	double		tuples_fetched;
+
+	*index_pages = compute_index_pages(root, index, indexQuals, loop_count,
+									   0.0, NULL, NULL, NULL, NULL,
+									   &indexSelectivity, NULL);
+
+	/* estimate number of main-table tuples fetched */
+	tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);
+
+	if (loop_count > 1)
+		tuples_fetched *= loop_count;
+
+	/*
+	 * apply the Mackert and Lohman formula to compute number of pages
+	 * required to be fetched from heap.
+	 */
+	*heap_pages = index_pages_fetched(tuples_fetched,
+									  baserel->pages,
+									  (double) index->pages,
+									  root);
+
+	return;
+}
+
+/*
  * compute_bitmap_pages
  *
  * compute number of pages fetched from heap in bitmap heap scan.
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 5d22496..d713c62 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -813,7 +813,7 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 /*
  * build_index_paths
  *	  Given an index and a set of index clauses for it, construct zero
- *	  or more IndexPaths.
+ *	  or more IndexPaths. It also constructs zero or more partial IndexPaths.
  *
  * We return a list of paths because (1) this routine checks some cases
  * that should cause us to not generate any IndexPath, and (2) in some
@@ -1051,8 +1051,33 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  NoMovementScanDirection,
 								  index_only_scan,
 								  outer_relids,
-								  loop_count);
+								  loop_count,
+								  0);
 		result = lappend(result, ipath);
+
+		/*
+		 * If appropriate, consider parallel index scan.  We don't allow
+		 * parallel index scan for bitmap or index only scans.
+		 */
+		if (index->amcanparallel &&
+			!index_only_scan &&
+			rel->consider_parallel &&
+			outer_relids == NULL &&
+			scantype != ST_BITMAPSCAN)
+			create_partial_index_path(root, index,
+									  index_clauses,
+									  clause_columns,
+									  indexquals,
+									  indexqualcols,
+									  orderbyclauses,
+									  orderbyclausecols,
+									  useful_pathkeys,
+									  index_is_ordered ?
+									  ForwardScanDirection :
+									  NoMovementScanDirection,
+									  index_only_scan,
+									  outer_relids,
+									  loop_count);
 	}
 
 	/*
@@ -1077,8 +1102,28 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  BackwardScanDirection,
 									  index_only_scan,
 									  outer_relids,
-									  loop_count);
+									  loop_count,
+									  0);
 			result = lappend(result, ipath);
+
+			/* If appropriate, consider parallel index scan */
+			if (index->amcanparallel &&
+				!index_only_scan &&
+				rel->consider_parallel &&
+				outer_relids == NULL &&
+				scantype != ST_BITMAPSCAN)
+				create_partial_index_path(root, index,
+										  index_clauses,
+										  clause_columns,
+										  indexquals,
+										  indexqualcols,
+										  NIL,
+										  NIL,
+										  useful_pathkeys,
+										  BackwardScanDirection,
+										  index_only_scan,
+										  outer_relids,
+										  loop_count);
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f5366d8..965f261 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5379,7 +5379,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	indexScanPath = create_index_path(root, indexInfo,
 									  NIL, NIL, NIL, NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
-									  NULL, 1.0);
+									  NULL, 1.0, 0);
 
 	return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8cd80db..920fa0f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -744,10 +744,8 @@ add_path_precheck(RelOptInfo *parent_rel,
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
  *	  such paths yet.  Hence, GatherPaths must not be created for a rel until
- *	  we're done creating all partial paths for it.  We do not currently build
- *	  partial indexscan paths, so there is no need for an exception for
- *	  IndexPaths here; for safety, we instead Assert that a path to be freed
- *	  isn't an IndexPath.
+ *	  we're done creating all partial paths for it.  As in add_path, we make
+ *	  an exception for IndexPaths here as well.
  */
 void
 add_partial_path(RelOptInfo *parent_rel, Path *new_path)
@@ -826,9 +824,12 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		{
 			parent_rel->partial_pathlist =
 				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
-			/* we should not see IndexPaths here, so always safe to delete */
-			Assert(!IsA(old_path, IndexPath));
-			pfree(old_path);
+
+			/*
+			 * Delete the data pointed-to by the deleted cell, if possible
+			 */
+			if (!IsA(old_path, IndexPath))
+				pfree(old_path);
 			/* p1_prev does not advance */
 		}
 		else
@@ -860,10 +861,9 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	}
 	else
 	{
-		/* we should not see IndexPaths here, so always safe to delete */
-		Assert(!IsA(new_path, IndexPath));
 		/* Reject and recycle the new path */
-		pfree(new_path);
+		if (!IsA(new_path, IndexPath))
+			pfree(new_path);
 	}
 }
 
@@ -1021,7 +1021,8 @@ create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count)
+				  double loop_count,
+				  int parallel_workers)
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
@@ -1031,9 +1032,9 @@ create_index_path(PlannerInfo *root,
 	pathnode->path.pathtarget = rel->reltarget;
 	pathnode->path.param_info = get_baserel_parampathinfo(root, rel,
 														  required_outer);
-	pathnode->path.parallel_aware = false;
+	pathnode->path.parallel_aware = (parallel_workers > 0);
 	pathnode->path.parallel_safe = rel->consider_parallel;
-	pathnode->path.parallel_workers = 0;
+	pathnode->path.parallel_workers = parallel_workers;
 	pathnode->path.pathkeys = pathkeys;
 
 	/* Fill in the pathnode */
@@ -1051,6 +1052,63 @@ create_index_path(PlannerInfo *root,
 }
 
 /*
+ * create_partial_index_path
+ *	  Build partial access path for a parallel index scan.
+ */
+void
+create_partial_index_path(PlannerInfo *root,
+						  IndexOptInfo *index,
+						  List *indexclauses,
+						  List *indexclausecols,
+						  List *indexquals,
+						  List *indexqualcols,
+						  List *indexorderbys,
+						  List *indexorderbycols,
+						  List *pathkeys,
+						  ScanDirection indexscandir,
+						  bool indexonly,
+						  Relids required_outer,
+						  double loop_count)
+{
+	double		index_pages_fetched = 0.0;
+	double		heap_pages_fetched = 0.0;
+	int			parallel_workers = 0;
+	RelOptInfo *rel = index->rel;
+	IndexPath  *ipath;
+
+	/*
+	 * Use both the index and heap pages that can be scanned, with the total
+	 * number of index pages acting as a cap, to compute the number of
+	 * parallel workers required.  This will work for now, but perhaps someday
+	 * we will come up with something better.
+	 */
+	compute_index_and_heap_pages(root, index, indexquals, loop_count,
+								 &index_pages_fetched, &heap_pages_fetched);
+
+	parallel_workers = compute_parallel_worker(rel,
+											(BlockNumber) heap_pages_fetched,
+										  (BlockNumber) index_pages_fetched);
+
+	if (parallel_workers <= 0)
+		return;
+
+	ipath = create_index_path(root, index,
+							  indexclauses,
+							  indexclausecols,
+							  indexquals,
+							  indexqualcols,
+							  indexorderbys,
+							  indexorderbycols,
+							  pathkeys,
+							  indexscandir,
+							  indexonly,
+							  required_outer,
+							  loop_count,
+							  parallel_workers);
+
+	add_partial_path(rel, (Path *) ipath);
+}
+
+/*
  * create_bitmap_heap_path
  *	  Creates a path node for a bitmap scan.
  *
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 7836e6b..4ed2705 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -241,6 +241,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->amoptionalkey = amroutine->amoptionalkey;
 			info->amsearcharray = amroutine->amsearcharray;
 			info->amsearchnulls = amroutine->amsearchnulls;
+			info->amcanparallel = amroutine->amcanparallel;
 			info->amhasgettuple = (amroutine->amgettuple != NULL);
 			info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
 			info->amcostestimate = amroutine->amcostestimate;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index e91e41d..ed62ac1 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -187,6 +187,8 @@ typedef struct IndexAmRoutine
 	bool		amclusterable;
 	/* does AM handle predicate locks? */
 	bool		ampredlocks;
+	/* does AM support parallel scan? */
+	bool		amcanparallel;
 	/* type of data stored in index, or InvalidOid if variable */
 	Oid			amkeytype;
 
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 46d6f45..ea3f3a5 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -14,6 +14,7 @@
 #ifndef NODEINDEXSCAN_H
 #define NODEINDEXSCAN_H
 
+#include "access/parallel.h"
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
@@ -22,6 +23,9 @@ extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
 extern void ExecReScanIndexScan(IndexScanState *node);
+extern void ExecIndexScanEstimate(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeDSM(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc);
 
 /*
  * These routines are exported to share code with nodeIndexonlyscan.c and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f9bcdd6..0048aa9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1359,6 +1359,7 @@ typedef struct
  *		SortSupport		   for reordering ORDER BY exprs
  *		OrderByTypByVals   is the datatype of order by expression pass-by-value?
  *		OrderByTypLens	   typlens of the datatypes of order by expressions
+ *		pscan_len		   size of parallel index scan descriptor
  * ----------------
  */
 typedef struct IndexScanState
@@ -1385,6 +1386,9 @@ typedef struct IndexScanState
 	SortSupport iss_SortSupport;
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
+
+	/* This is needed for parallel index scan */
+	Size		iss_PscanLen;
 } IndexScanState;
 
 /* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 643be54..f7ac6f6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -629,6 +629,7 @@ typedef struct IndexOptInfo
 	bool		amsearchnulls;	/* can AM search for NULL/NOT NULL entries? */
 	bool		amhasgettuple;	/* does AM have amgettuple interface? */
 	bool		amhasgetbitmap; /* does AM have amgetbitmap interface? */
+	bool		amcanparallel;	/* does AM support parallel scan? */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
 	void		(*amcostestimate) ();	/* AM's cost estimator */
 } IndexOptInfo;
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 8525917..c42c512 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -187,6 +187,9 @@ extern double compute_index_pages(PlannerInfo *root, IndexOptInfo *index,
 					List *indexclauses, int loop_count, double numIndexTuples,
 			 double *random_page_cost, double *sa_scans, double *indexTuples,
 					double *indexPages, Selectivity *indexSel, Cost *cost);
+extern void compute_index_and_heap_pages(PlannerInfo *root, IndexOptInfo *index,
+					   List *indexQuals, int loop_count, double *index_pages,
+							 double *heap_pages);
 extern double compute_bitmap_pages(PlannerInfo *root, RelOptInfo *baserel,
 				  Path *bitmapqual, int loop_count, Cost *cost, double *tuple);
 
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 44e143c..e2fd53a 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -49,7 +49,21 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count);
+				  double loop_count,
+				  int parallel_workers);
+extern void create_partial_index_path(PlannerInfo *root,
+						  IndexOptInfo *index,
+						  List *indexclauses,
+						  List *indexclausecols,
+						  List *indexquals,
+						  List *indexqualcols,
+						  List *indexorderbys,
+						  List *indexorderbycols,
+						  List *pathkeys,
+						  ScanDirection indexscandir,
+						  bool indexonly,
+						  Relids required_outer,
+						  double loop_count);
 extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						Path *bitmapqual,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 81e7a42..ebda308 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 					 List *initial_rels);
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel);
+extern int compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
+						BlockNumber index_pages);
 
 #ifdef OPTIMIZER_DEBUG
 extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index b967b45..37d9e86 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -99,6 +99,29 @@ explain (costs off)
    ->  Index Only Scan using tenk1_unique1 on tenk1
 (3 rows)
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+                             QUERY PLAN                             
+--------------------------------------------------------------------
+ Finalize Aggregate
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial Aggregate
+               ->  Parallel Index Scan using tenk1_hundred on tenk1
+                     Index Cond: (hundred > 1)
+(6 rows)
+
+select  count((unique1)) from tenk1 where hundred > 1;
+ count 
+-------
+  9800
+(1 row)
+
+reset enable_seqscan;
+reset enable_bitmapscan;
 set force_parallel_mode=1;
 explain (costs off)
   select stringu1::int2 from tenk1 where unique1 = 1;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index e3c32de..6102981 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -39,6 +39,17 @@ explain (costs off)
 	select  sum(parallel_restricted(unique1)) from tenk1
 	group by(parallel_restricted(unique1));
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+select  count((unique1)) from tenk1 where hundred > 1;
+
+reset enable_seqscan;
+reset enable_bitmapscan;
+
 set force_parallel_mode=1;
 
 explain (costs off)
#67Peter Geoghegan
pg@bowt.ie
In reply to: Amit Kapila (#66)
Re: Parallel Index Scans

On Wed, Feb 8, 2017 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I had some offlist discussion with Robert about the above point and we
feel that keeping only heap pages for the parallel computation might not
be future proof, as for parallel index-only scans there might not be
any heap pages. So, it is better to use a separate GUC for parallel
index (only) scans. We can have two GUCs, min_parallel_table_scan_size
(8MB) and min_parallel_index_scan_size (512kB), for computing parallel
scans. The parallel sequential scan and parallel bitmap heap scans can
use min_parallel_table_scan_size as a threshold to compute parallel
workers, as we are doing now. For parallel index scans, both
min_parallel_table_scan_size and min_parallel_index_scan_size can be
used as thresholds; we can compute parallel workers based both on the
heap pages to be scanned and on the index pages to be scanned, and then
keep the minimum of those. This will help us engage parallel index
scans when the index pages are below the threshold but there are many
heap pages to be scanned, and it will also allow keeping a maximum cap
on the number of workers based on the index scan size.
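
To make the sizing rule described above concrete, here is a minimal,
self-contained sketch (illustrative only, not the patch's code): the
worker count is derived separately from the heap pages and the index
pages, growing by one each time the page count triples past the
corresponding threshold (the same geometric heuristic the existing
parallel seq scan sizing uses), and the smaller of the two counts is
kept. The function names and example threshold values below are
assumptions standing in for min_parallel_table_scan_size and
min_parallel_index_scan_size.

/*
 * Illustrative sketch only -- not the patch's code.  Worker counts grow by
 * one each time the page count triples past a threshold, and the heap-based
 * and index-based counts are combined by taking the minimum, as described
 * above.  Thresholds stand in for the proposed GUCs.
 */
#include <stdio.h>

static int
workers_for(long pages, long threshold_pages)
{
	int			workers = 0;

	while (pages >= threshold_pages * 3)
	{
		workers++;
		threshold_pages *= 3;
	}
	return workers;
}

static int
sketch_compute_parallel_worker(long heap_pages, long index_pages,
							   long table_threshold, long index_threshold,
							   int max_workers)
{
	int			heap_workers = workers_for(heap_pages, table_threshold);
	int			index_workers = workers_for(index_pages, index_threshold);
	int			workers = (heap_workers < index_workers) ? heap_workers : index_workers;

	return (workers > max_workers) ? max_workers : workers;
}

int
main(void)
{
	/* 8MB table threshold = 1024 pages, 512kB index threshold = 64 pages */
	printf("workers = %d\n",
		   sketch_compute_parallel_worker(50000, 700, 1024, 64, 8));
	return 0;
}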

What about parallel CREATE INDEX? The patch currently uses
min_parallel_relation_size as an input into the optimizer's custom
cost model. I had wondered if that made sense. Note that another such
input is the projected size of the final index. That's the thing that
increases at logarithmic intervals as there is a linear increase in
the number of workers assigned to the operation (so it's not the size
of the underlying table).

--
Peter Geoghegan


#68Amit Kapila
amit.kapila16@gmail.com
In reply to: Peter Geoghegan (#67)
Re: Parallel Index Scans

On Thu, Feb 9, 2017 at 12:08 PM, Peter Geoghegan <pg@bowt.ie> wrote:

On Wed, Feb 8, 2017 at 10:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I had some offlist discussion with Robert about the above point and we
feel that keeping only heap pages for the parallel computation might not
be future proof, as for parallel index-only scans there might not be
any heap pages. So, it is better to use a separate GUC for parallel
index (only) scans. We can have two GUCs, min_parallel_table_scan_size
(8MB) and min_parallel_index_scan_size (512kB), for computing parallel
scans. The parallel sequential scan and parallel bitmap heap scans can
use min_parallel_table_scan_size as a threshold to compute parallel
workers, as we are doing now. For parallel index scans, both
min_parallel_table_scan_size and min_parallel_index_scan_size can be
used as thresholds; we can compute parallel workers based both on the
heap pages to be scanned and on the index pages to be scanned, and then
keep the minimum of those. This will help us engage parallel index
scans when the index pages are below the threshold but there are many
heap pages to be scanned, and it will also allow keeping a maximum cap
on the number of workers based on the index scan size.

What about parallel CREATE INDEX? The patch currently uses
min_parallel_relation_size as an input into the optimizer's custom
cost model. I had wondered if that made sense. Note that another such
input is the projected size of the final index.

If the projected index size is available, then I think CREATE INDEX can
also use a somewhat similar formula, where we cap the maximum number of
workers based on the size of the index. Now, I am not sure whether the
threshold values of the GUCs used for the scan are realistic for the
CREATE INDEX operation.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#69Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#68)
Re: Parallel Index Scans

On Thu, Feb 9, 2017 at 5:34 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

What about parallel CREATE INDEX? The patch currently uses
min_parallel_relation_size as an input into the optimizer's custom
cost model. I had wondered if that made sense. Note that another such
input is the projected size of the final index.

If the projected index size is available, then I think CREATE INDEX can
also use a somewhat similar formula, where we cap the maximum number of
workers based on the size of the index. Now, I am not sure whether the
threshold values of the GUCs used for the scan are realistic for the
CREATE INDEX operation.

I think that would be an abuse of the GUC, because the idea of the
existing GUC - and the new one we're proposing to create here - has
always been about the amount of data being fed into the parallel
operation. In the case of CREATE INDEX, the resulting index is an
output, not an input. So if I were Peter and wanted to reuse the
existing GUCs, I'd reuse the one for the table size, because that's
what is being scanned. No index is going to get scanned.

Of course, it's possible that the sensible amount of parallelism for
CREATE INDEX is higher or lower than for other sequential scans, so
that might not be the right thing to do. It might need its own knob,
or some other refinement.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#70Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#60)
Re: Parallel Index Scans

On Wed, Feb 1, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The hunk in indexam.c looks like a leftover that can be omitted.

It is not a leftover hunk. Earlier, the patch had the same check in
btparallelrescan, but based on your comment upthread [1] (Why does
btparallelrescan cater to the case where scan->parallel_scan == NULL?),
this has been moved to indexam.c.

That's not my point. The point is, if it's not a parallel scan,
index_parallelrescan() should never get called in the first place. So
therefore it shouldn't need to worry about scan->parallel_scan being
NULL. It seems possibly reasonable to put something like
Assert(scan->parallel_scan != NULL) in there, but arranging to return
without doing anything in that case seems like it just masks possible
bugs in the calling code. And really I'm not sure we even need the
Assert.

I am a bit mystified about how this manages to work with array keys.
_bt_parallel_done() won't set btps_pageStatus to BTPARALLEL_DONE
unless so->arrayKeyCount >= btscan->btps_arrayKeyCount, but
_bt_parallel_advance_scan() won't do anything unless btps_pageStatus
is already BTPARALLEL_DONE.

This is just to ensure that btps_arrayKeyCount is advanced once and
btps_pageStatus is changed to BTPARALLEL_DONE once per array element.
So it goes something like this: if we have an array with values
[1,2,3], then all the workers will complete the scan with key 1, one of
them will mark btps_pageStatus as BTPARALLEL_DONE, and then the first
one to hit _bt_parallel_advance_scan will increment the value of
btps_arrayKeyCount; the same will then happen for keys 2 and 3. It is
quite possible that by the time one of the participants advances its
local key, the scan for that key is already finished, and we handle
that situation in _bt_parallel_seize() with the check below:

if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
*status = false;

I think Rahila has also mentioned something along these lines; let us
know if we are missing something here. Do you want to add more comments
in the code to explain this handling, and if yes, then where (on top of
the function _bt_parallel_advance_scan)?
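
As a rough illustration of the handshake described above, here is a
simplified sketch (not the patch's code): locking is omitted, and the
names are stand-ins for btps_arrayKeyCount, btps_pageStatus and the
workers' local arrayKeyCount; in the patch these transitions happen
under btps_mutex.

typedef struct SharedArrayScan
{
	int			shared_key_count;	/* analogue of btps_arrayKeyCount */
	int			scan_done;			/* analogue of BTPARALLEL_DONE */
} SharedArrayScan;

/* Called when a worker finds no more pages for its current array key. */
static void
mark_scan_done(SharedArrayScan *shared, int local_key_count)
{
	/* only a worker still on the current key may mark it done, and only once */
	if (local_key_count >= shared->shared_key_count && !shared->scan_done)
		shared->scan_done = 1;
}

/* Called when a worker advances to its next array key. */
static void
advance_array_key(SharedArrayScan *shared, int *local_key_count)
{
	(*local_key_count)++;			/* every worker advances its own count */
	if (shared->scan_done)
	{
		/* only the first worker to get here advances the shared count */
		shared->scan_done = 0;
		shared->shared_key_count++;
	}
}

/* A worker whose local count lags behind sees that its key is already done. */
static int
key_already_finished(const SharedArrayScan *shared, int local_key_count)
{
	return local_key_count < shared->shared_key_count;
}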

You know, I just misread this code. Both of you are right, and I am wrong.

That
is a little odd. Another, possibly related thing that is odd is that
when _bt_steppage() finds no tuples and decides to advance to a new
page again, there's a very short comment in the forward case and a
very long comment in the backward case:

/* nope, keep going */
vs.
/*
* For parallel scans, get the last page scanned as it is quite
* possible that by the time we try to fetch previous page, other
* worker has also decided to scan that previous page. We could
* avoid that by doing _bt_parallel_release once we have read the
* current page, but it is bad to make other workers wait till we
* read the page.
*/

Now it seems to me that these cases are symmetric and the issues are
identical. It's basically that, while we were judging whether the
current page has useful contents, some other process could have
advanced the scan (since _bt_readpage un-seized it).

Yeah, but the reason for the difference in comments is that for
non-parallel backward scans there is no code at that place to move to
the previous page; it basically relies on the next call to
_bt_walk_left(), whereas for parallel scans we can't simply rely on
_bt_walk_left(). I have slightly modified the comments for the backward
scan case; see if that looks better, and if not, suggest what you think
is better.

Why can't we rely on _bt_walk_left?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#71Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#66)
Re: Parallel Index Scans

On Thu, Feb 9, 2017 at 1:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

compute_index_pages_v2.patch - This patch extracts the computation of
index pages to be scanned into a separate function and uses it in the
existing code. You will notice that I have pulled up the logic of
converting clauses to indexquals from create_index_path to
build_index_paths, as that is required to compute the number of index
and heap pages to be scanned by the scan in the patch
parallel_index_opt_exec_support_v8.patch. This doesn't impact any
existing functionality.

This design presupposes that every AM that might ever want to do
parallel scans is happy with genericcostestimate()'s method of
computing the number of index pages that will be fetched. However,
that might not be true for every possible AM. In fact, it's already
not true for BRIN, which always reads the whole index. Now, since
we're only concerned with btree at the moment, nothing would be
immediately broken by this approach but it seems we're just setting
ourselves up for future pain.

I have what I think might be a better idea: let's make
amcostestimate() responsible for returning a suggested parallel degree
for the path via an additional out parameter. cost_index() can then
reduce that value if it seems like not enough heap pages will be
fetched to justify the return value, or it can override it completely
if parallel_degree is set for the relation. Then we don't need to run
this logic twice to compute the same value, and we don't violate the
AM abstraction layer.

BTW, the changes to add_partial_path() aren't needed, because an
IndexPath only gets reused if you stick a Bitmap{Heap,And,Or}Path on
top of it, and that won't be the case with this or any other pending
patch. If we get the ability to have a Parallel Bitmap Heap Scan that
takes a parallel index scan rather than a standard index scan as
input, then we'll need something like this. But for now it's probably
best to just update the comments and remove the Assert().

I think you can also leave out the documentation changes from these
patches. I'll do some general rewriting of the parallel query section
once we know exactly what capabilities we'll have in this release; I
think that will work better than trying to update them a bit at a time
for each patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#72Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#70)
Re: Parallel Index Scans

On Fri, Feb 10, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 1, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The hunk in indexam.c looks like a leftover that can be omitted.

It is not a leftover hunk. Earlier, the patch had the same check in
btparallelrescan, but based on your comment upthread [1] (Why does
btparallelrescan cater to the case where scan->parallel_scan == NULL?),
this has been moved to indexam.c.

That's not my point. The point is, if it's not a parallel scan,
index_parallelrescan() should never get called in the first place. So
therefore it shouldn't need to worry about scan->parallel_scan being
NULL. It seems possibly reasonable to put something like
Assert(scan->parallel_scan != NULL) in there, but arranging to return
without doing anything in that case seems like it just masks possible
bugs in the calling code. And really I'm not sure we even need the
Assert.

This is just a safety check, so probably we can change it to Assert.

That
is a little odd. Another, possibly related thing that is odd is that
when _bt_steppage() finds no tuples and decides to advance to a new
page again, there's a very short comment in the forward case and a
very long comment in the backward case:

/* nope, keep going */
vs.
/*
* For parallel scans, get the last page scanned as it is quite
* possible that by the time we try to fetch previous page, other
* worker has also decided to scan that previous page. We could
* avoid that by doing _bt_parallel_release once we have read the
* current page, but it is bad to make other workers wait till we
* read the page.
*/

Now it seems to me that these cases are symmetric and the issues are
identical. It's basically that, while we were judging whether the
current page has useful contents, some other process could have
advanced the scan (since _bt_readpage un-seized it).

Yeah, but the reason for the difference in comments is that for
non-parallel backward scans there is no code at that place to move to
the previous page; it basically relies on the next call to
_bt_walk_left(), whereas for parallel scans we can't simply rely on
_bt_walk_left(). I have slightly modified the comments for the backward
scan case; see if that looks better, and if not, suggest what you think
is better.

Why can't we rely on _bt_walk_left?

The reason is mentioned in the comments, but let me try to explain with
an example. When you reach that point in the code, it means that either
the current page (assume page number 10) doesn't contain any matching
items or it is a half-dead page, both of which indicate that we have to
move to the previous page. Now, before checking whether the current
page contains matching items, we signal the parallel machinery (via
_bt_parallel_release) to allow workers to read the previous page
(assume the previous page number is 9). So it is quite possible that,
after deciding that the current page (page 10) doesn't contain any
matching tuples, if we directly move to the previous page (in this case
page 9) by using _bt_walk_left, some other worker has already read page
9. In short, if we directly use _bt_walk_left(), then we are prone to
returning some of the values twice, as multiple workers can read the
same page.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#73Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#72)
Re: Parallel Index Scans

On Fri, Feb 10, 2017 at 11:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Why can't we rely on _bt_walk_left?

The reason is mentioned in the comments, but let me try to explain with
an example. When you reach that point in the code, it means that either
the current page (assume page number 10) doesn't contain any matching
items or it is a half-dead page, both of which indicate that we have to
move to the previous page. Now, before checking whether the current
page contains matching items, we signal the parallel machinery (via
_bt_parallel_release) to allow workers to read the previous page
(assume the previous page number is 9). So it is quite possible that,
after deciding that the current page (page 10) doesn't contain any
matching tuples, if we directly move to the previous page (in this case
page 9) by using _bt_walk_left, some other worker has already read page
9. In short, if we directly use _bt_walk_left(), then we are prone to
returning some of the values twice, as multiple workers can read the
same page.

But ... the entire point of the seize-and-release stuff is to avoid
this problem. You're supposed to seize the scan, read the current page,
walk left, store the page you find in the scan, and then release the
scan. The entire point of that stuff is that when somebody's advancing
the scan to the next page, everybody else waits for them to get done.
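
For reference, the forward-scan handoff that both of these messages are
describing looks roughly like the following sketch.
_bt_parallel_seize(), _bt_parallel_release() and _bt_parallel_done()
are the patch's entry points; read_block(), next_block_of() and
process_matches() are hypothetical helpers, the initial descent to the
first leaf page is omitted, and error handling is left out.

/*
 * Sketch of the seize/release handoff for a forward parallel btree scan.
 * Seizing gives this worker the sole right to advance the scan; the next
 * page is published with _bt_parallel_release() *before* the current page
 * is examined, so other workers are not blocked while we look for matches.
 * Helper functions here are hypothetical, for illustration only.
 */
static void
parallel_forward_scan_sketch(IndexScanDesc scan)
{
	BlockNumber blkno;

	while (_bt_parallel_seize(scan, &blkno))
	{
		Buffer		buf;

		/*
		 * blkno == InvalidBlockNumber (scan not yet started) is handled by
		 * _bt_first in the patch and is omitted here.
		 */
		if (blkno == P_NONE)
		{
			/* no pages left; mark the scan done so waiters can finish */
			_bt_parallel_done(scan);
			break;
		}

		buf = read_block(scan, blkno);

		/* hand the next page to whoever seizes the scan next */
		_bt_parallel_release(scan, next_block_of(buf));

		process_matches(scan, buf);
	}
}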

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#74Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#73)
Re: Parallel Index Scans

On Sat, Feb 11, 2017 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Feb 10, 2017 at 11:22 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Why can't we rely on _bt_walk_left?

The reason is mentioned in comments, but let me try to explain with
some example. When you reach that point of code, it means that either
the current page (assume page number is 10) doesn't contain any
matching items or it is a half-dead page, both of which indicates that
we have to move to the previous page. Now, before checking if the
current page contains matching items, we signal parallel machinery
(via _bt_parallel_release) to allow workers to read the previous page
(assume previous page number is 9). So it is quite possible that
after deciding that current page (page number 10) doesn't contain any
matching tuples if we directly move to the previous page (in this case
it will be 9) by using _bt_walk_left, some other worker would have
read page 9. In short, if we directly use _bt_walk_left(), then we
are prone to returning some of the values twice as multiple workers
can read the same page.

But ... the entire point of the seize-and-release stuff is to avoid
this problem. You're supposed to seize the scan, read the current page,
walk left, store the page you find in the scan, and then release the
scan.

Exactly, and that is what is done in the patch. Basically, if we find
that the current page is half-dead or doesn't contain any matching
items, we release the current buffer, seize the scan, read the current
page, walk left, and so on. I am slightly confused here, because it
seems both of us agree on what the right thing to do is, and as far as
I can tell that is how it is implemented. Are you just checking whether
I have implemented it as discussed, or do you see a problem with the
way it is implemented?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#75Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#70)
2 attachment(s)
Re: Parallel Index Scans

On Fri, Feb 10, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 1, 2017 at 8:20 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:

The hunk in indexam.c looks like a leftover that can be omitted.

It is not a leftover hunk. Earlier, the patch had the same check in
btparallelrescan, but based on your comment upthread [1] (Why does
btparallelrescan cater to the case where scan->parallel_scan == NULL?),
this has been moved to indexam.c.

That's not my point. The point is, if it's not a parallel scan,
index_parallelrescan() should never get called in the first place. So
therefore it shouldn't need to worry about scan->parallel_scan being
NULL. It seems possibly reasonable to put something like
Assert(scan->parallel_scan != NULL) in there, but arranging to return
without doing anything in that case seems like it just masks possible
bugs in the calling code. And really I'm not sure we even need the
Assert.

I have removed the check from index_parallelrescan() and ensured in
the caller that it is called only when the parallel_scan descriptor is
present.

Comments from another e-mail:

This design presupposes that every AM that might ever want to do
parallel scans is happy with genericcostestimate()'s method of
computing the number of index pages that will be fetched. However,
that might not be true for every possible AM. In fact, it's already
not true for BRIN, which always reads the whole index. Now, since
we're only concerned with btree at the moment, nothing would be
immediately broken by this approach but it seems we're just setting
ourselves up for future pain.

I think that to make it future-proof, we could add a generic page
estimation function. However, I have tried to implement it based on
your suggestion below; see if that looks better to you, otherwise we
can introduce a generic page estimation API.

I have what I think might be a better idea: let's make
amcostestimate() responsible for returning a suggested parallel degree
for the path via an additional out parameter. cost_index() can then
reduce that value if it seems like not enough heap pages will be
fetched to justify the return value, or it can override it completely
if parallel_degree is set for the relation.

I see the value of your idea, but I think it might be better to return
the number of IndexPages required to be scanned from amcostestimate()
and then use the already computed value of heap_pages in cost_index to
determine the number of workers. This will keep the calculation simple
and avoid overriding the return value. Now, the returned value of index
pages will only be used when a partial path is being constructed, but I
think that is okay, because we are not doing any extra calculation to
compute the number of index pages fetched. Another argument could be
that the number of index pages to be used for parallelism can be
different from the number of pages to be scanned, or whatever is used
in the cost computation, but I think that is also easy to change later
when we create partial paths for indexes other than btree. I have
implemented the above idea in the attached patch
(parallel_index_opt_exec_support_v9.patch).
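
A condensed sketch of how that could be wired up in cost_index()
follows (an assumption based on the description above, not a copy of
the patch; the variable names and surrounding context are illustrative):

	/*
	 * Illustrative fragment, not the patch's code: the AM's cost estimator
	 * additionally reports how many index pages it expects to fetch, and
	 * cost_index() combines that with the already-computed heap page count
	 * to size the partial path.
	 */
	double		index_pages;

	amcostestimate(root, path, loop_count,
				   &indexStartupCost, &indexTotalCost,
				   &indexSelectivity, &indexCorrelation,
				   &index_pages);

	if (path->path.parallel_aware)
		path->path.parallel_workers =
			compute_parallel_worker(baserel,
									(BlockNumber) heap_pages,
									(BlockNumber) index_pages);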

Then we don't need to run
this logic twice to compute the same value, and we don't violate the
AM abstraction layer.

We can avoid computing the same value twice; however, with your
suggested approach we have to do all the additional work even for the
cases where employing parallel workers is not beneficial, so I am not
sure there is a net gain.

BTW, the changes to add_partial_path() aren't needed, because an
IndexPath only gets reused if you stick a Bitmap{Heap,And,Or}Path on
top of it, and that won't be the case with this or any other pending
patch. If we get the ability to have a Parallel Bitmap Heap Scan that
takes a parallel index scan rather than a standard index scan as
input, then we'll need something like this. But for now it's probably
best to just update the comments and remove the Assert().

Right, changed as per suggestion.

I think you can also leave out the documentation changes from these
patches. I'll do some general rewriting of the parallel query section
once we know exactly what capabilities we'll have in this release; I
think that will work better than trying to update them a bit at a time
for each patch.

Okay, removed the documentation part.

Patches to be used: guc_parallel_index_scan_v1.patch [1]/messages/by-id/CAA4eK1+TnM4pXQbvn7OXqam+k_HZqb0ROZUMxOiL6DWJYCyYow@mail.gmail.com,
parallel_index_scan_v8.patch, parallel_index_opt_exec_support_v9.patch

[1]: /messages/by-id/CAA4eK1+TnM4pXQbvn7OXqam+k_HZqb0ROZUMxOiL6DWJYCyYow@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_index_scan_v8.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5b67def..3b6a6dc 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1207,7 +1207,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting in an extension.</entry>
         </row>
         <row>
-         <entry morerows="9"><literal>IPC</></entry>
+         <entry morerows="10"><literal>IPC</></entry>
          <entry><literal>BgWorkerShutdown</></entry>
          <entry>Waiting for background worker to shut down.</entry>
         </row>
@@ -1240,6 +1240,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting for parallel workers to finish computing.</entry>
         </row>
         <row>
+         <entry><literal>ParallelBtreePage</></entry>
+         <entry>Waiting for the page number needed to continue a parallel btree scan
+         to become available.</entry>
+        </row>
+        <row>
          <entry><literal>SafeSnapshot</></entry>
          <entry>Waiting for a snapshot for a <literal>READ ONLY DEFERRABLE</> transaction.</entry>
         </row>
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 945e563..b707fe2 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,8 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "pgstat.h"
+#include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -63,6 +65,44 @@ typedef struct
 	MemoryContext pagedelcontext;
 } BTVacState;
 
+/*
+ * BTPARALLEL_NOT_INITIALIZED implies that the scan is not started.
+ *
+ * BTPARALLEL_ADVANCING implies one of the worker or backend is advancing the
+ * scan to a new page; others must wait.
+ *
+ * BTPARALLEL_IDLE implies that no backend is advancing the scan; someone can
+ * start doing it.
+ *
+ * BTPARALLEL_DONE implies that the scan is complete (including error exit).
+ */
+typedef enum
+{
+	BTPARALLEL_NOT_INITIALIZED,
+	BTPARALLEL_ADVANCING,
+	BTPARALLEL_IDLE,
+	BTPARALLEL_DONE
+} BTPS_State;
+
+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+	BlockNumber btps_scanPage;	/* latest or next page to be scanned */
+	BTPS_State	btps_pageStatus;/* indicates whether next page is available
+								 * for scan. see above for possible states of
+								 * parallel scan. */
+	int			btps_arrayKeyCount;		/* count indicating number of array
+										 * scan keys processed by parallel
+										 * scan */
+	slock_t		btps_mutex;		/* protects above variables */
+	ConditionVariable btps_cv;	/* used to synchronize parallel scan */
+} BTParallelScanDescData;
+
+typedef struct BTParallelScanDescData *BTParallelScanDesc;
+
 
 static void btbuildCallback(Relation index,
 				HeapTuple htup,
@@ -118,9 +158,9 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = btendscan;
 	amroutine->ammarkpos = btmarkpos;
 	amroutine->amrestrpos = btrestrpos;
-	amroutine->amestimateparallelscan = NULL;
-	amroutine->aminitparallelscan = NULL;
-	amroutine->amparallelrescan = NULL;
+	amroutine->amestimateparallelscan = btestimateparallelscan;
+	amroutine->aminitparallelscan = btinitparallelscan;
+	amroutine->amparallelrescan = btparallelrescan;
 
 	PG_RETURN_POINTER(amroutine);
 }
@@ -491,6 +531,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	}
 
 	so->markItemIndex = -1;
+	so->arrayKeyCount = 0;
 	BTScanPosUnpinIfPinned(so->markPos);
 	BTScanPosInvalidate(so->markPos);
 
@@ -653,6 +694,215 @@ btrestrpos(IndexScanDesc scan)
 }
 
 /*
+ * btestimateparallelscan -- estimate storage for BTParallelScanDescData
+ */
+Size
+btestimateparallelscan(void)
+{
+	return sizeof(BTParallelScanDescData);
+}
+
+/*
+ * btinitparallelscan -- initialize BTParallelScanDesc for parallel btree scan
+ */
+void
+btinitparallelscan(void *target)
+{
+	BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
+
+	SpinLockInit(&bt_target->btps_mutex);
+	bt_target->btps_scanPage = InvalidBlockNumber;
+	bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->btps_arrayKeyCount = 0;
+	ConditionVariableInit(&bt_target->btps_cv);
+}
+
+/*
+ *	btparallelrescan() -- reset parallel scan
+ */
+void
+btparallelrescan(IndexScanDesc scan)
+{
+	BTParallelScanDesc btscan;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+
+	Assert(parallel_scan);
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * In theory, we don't need to acquire the spinlock here, because there
+	 * shouldn't be any other workers running at this point, but we do so for
+	 * consistency.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = InvalidBlockNumber;
+	btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	btscan->btps_arrayKeyCount = 0;
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
+ * _bt_parallel_seize() -- Begin the process of advancing the scan to a new
+ *		page.  Other scans must wait until we call bt_parallel_release() or
+ *		bt_parallel_done().
+ *
+ * The return value tells caller whether to continue the scan or not.  The
+ * true value indicates either one of the following (a) the block number
+ * returned is valid and the scan can be continued (b) the block number is
+ * invalid and the scan has just begun (c) the block number is P_NONE and the
+ * scan is finished.  The false value indicates that we have reached the end
+ * of scan for current scankeys and for that we return block number as P_NONE.
+ *
+ * The first time master backend or worker hits last page, it will return
+ * P_NONE and status as 'True', after that any worker tries to fetch next
+ * page, it will return status as 'False'.
+ *
+ * Callers ignore the value of pageno, if false is returned.
+ */
+bool
+_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTPS_State	pageStatus;
+	bool		exit_loop = false;
+	bool		status = true;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	while (1)
+	{
+		/*
+		 * Fetch the next block to scan and update the page status so that
+		 * other participants of parallel scan can wait till next page is
+		 * available for scan.  Set the status as false, if scan is finished.
+		 */
+		SpinLockAcquire(&btscan->btps_mutex);
+		pageStatus = btscan->btps_pageStatus;
+
+		/* Check if the scan for current scan keys is finished */
+		if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+			status = false;
+		else if (pageStatus == BTPARALLEL_DONE)
+			status = false;
+		else if (pageStatus != BTPARALLEL_ADVANCING)
+		{
+			btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
+			*pageno = btscan->btps_scanPage;
+			exit_loop = true;
+		}
+		SpinLockRelease(&btscan->btps_mutex);
+		if (exit_loop || !status)
+			break;
+		ConditionVariableSleep(&btscan->btps_cv, WAIT_EVENT_BTREE_PAGE);
+	}
+	ConditionVariableCancelSleep();
+
+	/* no more pages to scan */
+	if (!status)
+		*pageno = P_NONE;
+
+	return status;
+}
+
+/*
+ * _bt_parallel_release() -- Complete the process of advancing the scan to a
+ *		new page.  We now have the new value btps_scanPage; some other backend
+ *		can now begin advancing the scan.
+ *
+ * scan_page - For backward scans, it holds the latest page being scanned
+ *		and for forward scans, it holds the next page to be scanned.
+ */
+void
+_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+{
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = scan_page;
+	btscan->btps_pageStatus = BTPARALLEL_IDLE;
+	SpinLockRelease(&btscan->btps_mutex);
+	ConditionVariableSignal(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_done() -- Mark the parallel scan as complete.
+ *
+ * When there are no pages left to scan, this function should be called to
+ * notify other workers.  Otherwise, they might wait forever for the scan to
+ * advance to the next page.
+ */
+void
+_bt_parallel_done(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+	bool		status_changed = false;
+
+	/* Do nothing, for non-parallel scans */
+	if (parallel_scan == NULL)
+		return;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * Ensure to mark parallel scan as done no more than once for single scan.
+	 * We rely on this state to initiate the next scan for multiple array
+	 * keys, see _bt_advance_array_keys.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+		btscan->btps_pageStatus != BTPARALLEL_DONE)
+	{
+		btscan->btps_pageStatus = BTPARALLEL_DONE;
+		status_changed = true;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+
+	/* wake up all the workers associated with this parallel scan */
+	if (status_changed)
+		ConditionVariableBroadcast(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_advance_array_keys() -- Advances the parallel scan for array
+ *			keys.
+ *
+ * Updates the count of array keys processed for both local and parallel
+ * scans.
+ */
+void
+_bt_parallel_advance_array_keys(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	so->arrayKeyCount++;
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (btscan->btps_pageStatus == BTPARALLEL_DONE)
+	{
+		btscan->btps_scanPage = InvalidBlockNumber;
+		btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		btscan->btps_arrayKeyCount++;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index b6459d2..a211c35 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,9 +30,12 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
 
 /*
@@ -544,8 +547,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
+	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	BlockNumber blkno;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -564,6 +569,38 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (!so->qual_ok)
 		return false;
 
+	/*
+	 * For parallel scans, get the page from shared state. If scan has not
+	 * started, proceed to find out first leaf page to scan by keeping other
+	 * workers waiting until we have descended to appropriate leaf page to be
+	 * scanned for matching tuples.
+	 *
+	 * If the scan has already begun, skip finding the first leaf page and
+	 * directly scanning the page stored in shared structure or the page to
+	 * its left in case of backward scan.
+	 */
+	if (scan->parallel_scan != NULL)
+	{
+		status = _bt_parallel_seize(scan, &blkno);
+		if (!status)
+		{
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno == P_NONE)
+		{
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+			return false;
+		}
+		else if (blkno != InvalidBlockNumber)
+		{
+			if (!_bt_parallel_readpage(scan, blkno, dir))
+				return false;
+			goto readcomplete;
+		}
+	}
+
 	/*----------
 	 * Examine the scan keys to discover where we need to start the scan.
 	 *
@@ -743,7 +780,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * there.
 	 */
 	if (keysCount == 0)
-		return _bt_endpoint(scan, dir);
+	{
+		bool		match;
+
+		match = _bt_endpoint(scan, dir);
+		if (!match)
+		{
+			/* No match , indicate (parallel) scan finished */
+			_bt_parallel_done(scan);
+			BTScanPosInvalidate(so->currPos);
+		}
+
+		return match;
+	}
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -993,25 +1042,21 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * because nothing finer to lock exists.
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
+
+		/*
+		 * mark parallel scan as done, so that all the workers can finish
+		 * their scan
+		 */
+		_bt_parallel_done(scan);
+		BTScanPosInvalidate(so->currPos);
+
 		return false;
 	}
 	else
 		PredicateLockPage(rel, BufferGetBlockNumber(buf),
 						  scan->xs_snapshot);
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	Assert(so->markItemIndex == -1);
+	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
@@ -1060,6 +1105,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
+readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
@@ -1132,6 +1178,10 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction.
  *
+ * In the case of parallel scans, caller must have called _bt_parallel_seize
+ * and _bt_parallel_release will be called in this function to advance the
+ * parallel scan.
+ *
  * Returns true if any matching items found on the page, false if none.
  */
 static bool
@@ -1154,6 +1204,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	page = BufferGetPage(so->currPos.buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/* allow next page be processed by parallel worker */
+	if (scan->parallel_scan)
+	{
+		if (ScanDirectionIsForward(dir))
+			_bt_parallel_release(scan, opaque->btpo_next);
+		else
+			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+	}
+
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1278,21 +1338,16 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
  * if pinned, we'll drop the pin before moving to next page.  The buffer is
  * not locked on entry.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page.  For success on a scan using a non-MVCC snapshot we hold
- * a pin, but not a read lock, on that page.  If we do not hold the pin, we
- * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
- *
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
+ * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
+ * to InvalidBuffer.  We return TRUE to indicate success.
  */
 static bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Relation	rel;
-	Page		page;
-	BTPageOpaque opaque;
+	BlockNumber blkno = InvalidBlockNumber;
+	bool		status = true;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1319,13 +1374,27 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}
 
-	rel = scan->indexRelation;
-
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
-		/* We must rely on the previously saved nextPage link! */
-		BlockNumber blkno = so->currPos.nextPage;
+
+		/*
+		 * We must rely on the previously saved nextPage link for non-parallel
+		 * scans!
+		 */
+		if (scan->parallel_scan != NULL)
+		{
+			status = _bt_parallel_seize(scan, &blkno);
+			if (!status)
+			{
+				/* release the previous buffer, if pinned */
+				BTScanPosUnpinIfPinned(so->currPos);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+			blkno = so->currPos.nextPage;
 
 		/* Remember we left a page with data */
 		so->currPos.moreLeft = true;
@@ -1333,11 +1402,72 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		/* release the previous buffer, if pinned */
 		BTScanPosUnpinIfPinned(so->currPos);
 
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+	else
+	{
+		/* Remember we left a page with data */
+		so->currPos.moreRight = true;
+
+		/* For parallel scans, get the last page scanned */
+		if (scan->parallel_scan != NULL)
+		{
+			status = _bt_parallel_seize(scan, &blkno);
+			BTScanPosUnpinIfPinned(so->currPos);
+			if (!status)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+
+		if (!_bt_readnextpage(scan, blkno, dir))
+			return false;
+	}
+
+	/* Drop the lock, and maybe the pin, on the current page */
+	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+
+	return true;
+}
+
+/*
+ *	_bt_readnextpage() -- Read next page containing valid data for scan
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page.  Caller is responsible to release lock and pin on
+ * buffer on success.  We return TRUE to indicate success.
+ *
+ * If there are no more matching records in the given direction, we drop all
+ * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ *
+ * Callers pass valid blkno for forward and backward scans with an exception
+ * for backward scans in which case so->currPos is expected to contain a valid
+ * value.
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation	rel;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		status = true;
+
+	rel = scan->indexRelation;
+
+	if (ScanDirectionIsForward(dir))
+	{
 		for (;;)
 		{
-			/* if we're at end of scan, give up */
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
 			if (blkno == P_NONE || !so->currPos.moreRight)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1359,14 +1489,30 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			}
 
 			/* nope, keep going */
-			blkno = opaque->btpo_next;
+			if (scan->parallel_scan != NULL)
+			{
+				status = _bt_parallel_seize(scan, &blkno);
+				if (!status)
+				{
+					_bt_relbuf(rel, so->currPos.buf);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+			}
+			else
+				blkno = opaque->btpo_next;
 			_bt_relbuf(rel, so->currPos.buf);
 		}
 	}
 	else
 	{
-		/* Remember we left a page with data */
-		so->currPos.moreRight = true;
+		/*
+		 * for parallel scans, current block number needs to be retrieved from
+		 * shared state and it is the responsibility of caller to pass the
+		 * correct block number.
+		 */
+		if (blkno != InvalidBlockNumber)
+			so->currPos.currPage = blkno;
 
 		/*
 		 * Walk left to the next page with data.  This is much more complex
@@ -1401,6 +1547,12 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			if (!so->currPos.moreLeft)
 			{
 				_bt_relbuf(rel, so->currPos.buf);
+
+				/*
+				 * mark parallel scan as done, so that all the workers can
+				 * finish their scan
+				 */
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1412,6 +1564,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			/* if we're physically at end of index, return failure */
 			if (so->currPos.buf == InvalidBuffer)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1432,9 +1585,46 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
 					break;
 			}
+
+			/*
+			 * For parallel scans, get the last page scanned as it is quite
+			 * possible that by the time we try to fetch previous page, other
+			 * worker has also decided to scan that previous page.  So we
+			 * can't rely on _bt_walk_left call.
+			 */
+			if (scan->parallel_scan != NULL)
+			{
+				_bt_relbuf(rel, so->currPos.buf);
+				status = _bt_parallel_seize(scan, &blkno);
+				if (!status)
+				{
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+				so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			}
 		}
 	}
 
+	return true;
+}
+
+/*
+ *	_bt_parallel_readpage() -- Read current page containing valid data for scan
+ *
+ * On success, release lock and pin on buffer.  We return TRUE to indicate
+ * success.
+ */
+static bool
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	_bt_initialize_more_data(so, dir);
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
 	/* Drop the lock, and maybe the pin, on the current page */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
@@ -1712,19 +1902,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/* remember which buffer we have pinned */
 	so->currPos.buf = buf;
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
+	_bt_initialize_more_data(so, dir);
 
 	/*
 	 * Now load data from the first page of the scan.
@@ -1753,3 +1931,25 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 
 	return true;
 }
+
+/*
+ * _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
+ * for scan direction
+ */
+static inline void
+_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
+{
+	/* initialize moreLeft/moreRight appropriately for scan direction */
+	if (ScanDirectionIsForward(dir))
+	{
+		so->currPos.moreLeft = false;
+		so->currPos.moreRight = true;
+	}
+	else
+	{
+		so->currPos.moreLeft = true;
+		so->currPos.moreRight = false;
+	}
+	so->numKilled = 0;			/* just paranoia */
+	so->markItemIndex = -1;		/* ditto */
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index da0f330..5b259a3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -590,6 +590,10 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 			break;
 	}
 
+	/* advance parallel scan */
+	if (scan->parallel_scan != NULL)
+		_bt_parallel_advance_array_keys(scan);
+
 	return found;
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..92af6ec 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3392,6 +3392,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PARALLEL_FINISH:
 			event_name = "ParallelFinish";
 			break;
+		case WAIT_EVENT_BTREE_PAGE:
+			event_name = "ParallelBtreePage";
+			break;
 		case WAIT_EVENT_SAFE_SNAPSHOT:
 			event_name = "SafeSnapshot";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 344ef99..4c01a2f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -609,6 +609,8 @@ typedef struct BTScanOpaqueData
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
 								 * there are any unsatisfiable array keys) */
+	int			arrayKeyCount;	/* count indicating number of array scan keys
+								 * processed */
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
@@ -652,7 +654,8 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
 /*
- * prototypes for functions in nbtree.c (external entry points for btree)
+ * prototypes for functions in nbtree.c (external entry points for btree and
+ * functions to maintain state of parallel scan)
  */
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 		struct IndexInfo *indexInfo);
@@ -662,10 +665,17 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 		 IndexUniqueCheck checkUnique,
 		 struct IndexInfo *indexInfo);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern Size btestimateparallelscan(void);
+extern void btinitparallelscan(void *target);
+extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+extern void _bt_parallel_done(IndexScanDesc scan);
+extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		 ScanKey orderbys, int norderbys);
+extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..915cc85 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -786,6 +786,7 @@ typedef enum
 	WAIT_EVENT_MQ_RECEIVE,
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
+	WAIT_EVENT_BTREE_PAGE,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP
 } WaitEventIPC;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c4235ae..9f876ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -161,6 +161,9 @@ BTPageOpaque
 BTPageOpaqueData
 BTPageStat
 BTPageState
+BTParallelScanDesc
+BTParallelScanDescData
+BTPS_State
 BTScanOpaque
 BTScanOpaqueData
 BTScanPos
parallel_index_opt_exec_support_v9.patch (application/octet-stream)
diff --git a/contrib/bloom/blcost.c b/contrib/bloom/blcost.c
index 98a2228..ba39f62 100644
--- a/contrib/bloom/blcost.c
+++ b/contrib/bloom/blcost.c
@@ -24,7 +24,8 @@
 void
 blcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 			   Cost *indexStartupCost, Cost *indexTotalCost,
-			   Selectivity *indexSelectivity, double *indexCorrelation)
+			   Selectivity *indexSelectivity, double *indexCorrelation,
+			   double *indexPages)
 {
 	IndexOptInfo *index = path->indexinfo;
 	List	   *qinfos;
@@ -45,4 +46,5 @@ blcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexTotalCost = costs.indexTotalCost;
 	*indexSelectivity = costs.indexSelectivity;
 	*indexCorrelation = costs.indexCorrelation;
+	*indexPages = costs.numIndexPages;
 }
diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index 39d8d05..0cfe49a 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -208,6 +208,6 @@ extern bytea *bloptions(Datum reloptions, bool validate);
 extern void blcostestimate(PlannerInfo *root, IndexPath *path,
 			   double loop_count, Cost *indexStartupCost,
 			   Cost *indexTotalCost, Selectivity *indexSelectivity,
-			   double *indexCorrelation);
+			   double *indexCorrelation, double *indexPages);
 
 #endif
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 858798d..f2eda67 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -119,6 +119,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = blbuild;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 9afd7f6..401b115 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -110,6 +110,8 @@ typedef struct IndexAmRoutine
     bool        amclusterable;
     /* does AM handle predicate locks? */
     bool        ampredlocks;
+    /* does AM support parallel scan? */
+    bool        amcanparallel;
     /* type of data stored in index, or InvalidOid if variable */
     Oid         amkeytype;
 
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 4ff046b..b22563b 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -93,6 +93,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 02d920b..4feb524 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -49,6 +49,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 96ead53..6593771 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -71,6 +71,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index bca77a8..24510e7 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -67,6 +67,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = INT4OID;
 
 	amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b707fe2..fb554d6 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -139,6 +139,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
+	amroutine->amcanparallel = true;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 78846be..e57ac49 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -49,6 +49,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = spgbuild;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index fe87c9a..03ed1ca 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -28,6 +28,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeSeqscan.h"
+#include "executor/nodeIndexscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -197,6 +198,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 				ExecSeqScanEstimate((SeqScanState *) planstate,
 									e->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanEstimate((IndexScanState *) planstate,
+									  e->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanEstimate((ForeignScanState *) planstate,
 										e->pcxt);
@@ -249,6 +254,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
 										 d->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeDSM((IndexScanState *) planstate,
+										   d->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
 											 d->pcxt);
@@ -725,6 +734,9 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 			case T_SeqScanState:
 				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeWorker((IndexScanState *) planstate, toc);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
 												toc);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 5734550..0a9dfdb 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -22,6 +22,9 @@
  *		ExecEndIndexScan		releases all storage.
  *		ExecIndexMarkPos		marks scan position.
  *		ExecIndexRestrPos		restores scan position.
+ *		ExecIndexScanEstimate	estimates DSM space needed for parallel index scan
+ *		ExecIndexScanInitializeDSM initialize DSM for parallel indexscan
+ *		ExecIndexScanInitializeWorker attach to DSM info in parallel worker
  */
 #include "postgres.h"
 
@@ -514,6 +517,18 @@ ExecIndexScan(IndexScanState *node)
 void
 ExecReScanIndexScan(IndexScanState *node)
 {
+	bool		reset_parallel_scan = true;
+
+	/*
+	 * If we are here to just update the scan keys, then don't reset parallel
+	 * scan.  We don't want each of the participating processes in the parallel
+	 * scan to update the shared parallel scan state at the start of the scan.
+	 * It is quite possible that one of the participants has already begun
+	 * scanning the index when another has yet to start it.
+	 */
+	if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+		reset_parallel_scan = false;
+
 	/*
 	 * If we are doing runtime key calculations (ie, any of the index key
 	 * values weren't simple Consts), compute the new key values.  But first,
@@ -539,10 +554,21 @@ ExecReScanIndexScan(IndexScanState *node)
 			reorderqueue_pop(node);
 	}
 
-	/* reset index scan */
-	index_rescan(node->iss_ScanDesc,
-				 node->iss_ScanKeys, node->iss_NumScanKeys,
-				 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+	/*
+	 * Reset (parallel) index scan.  For parallel-aware nodes, the scan
+	 * descriptor is initialized during actual execution of node and we can
+	 * reach here before that (ex. during execution of nest loop join).  So,
+	 * avoid updating the scan descriptor at that time.
+	 */
+	if (node->iss_ScanDesc)
+	{
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		if (reset_parallel_scan && node->iss_ScanDesc->parallel_scan)
+			index_parallelrescan(node->iss_ScanDesc);
+	}
 	node->iss_ReachedEnd = false;
 
 	ExecScanReScan(&node->ss);
@@ -1013,22 +1039,29 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Initialize scan descriptor.
+	 * for parallel-aware node, we initialize the scan descriptor after
+	 * initializing the shared memory for parallel execution.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
-											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot,
-											   indexstate->iss_NumScanKeys,
+	if (!node->scan.plan.parallel_aware)
+	{
+		/*
+		 * Initialize scan descriptor.
+		 */
+		indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+												indexstate->iss_RelationDesc,
+												   estate->es_snapshot,
+												 indexstate->iss_NumScanKeys,
 											 indexstate->iss_NumOrderByKeys);
 
-	/*
-	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
-	 * index AM.
-	 */
-	if (indexstate->iss_NumRuntimeKeys == 0)
-		index_rescan(indexstate->iss_ScanDesc,
-					 indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
+		/*
+		 * If no run-time keys to calculate, go ahead and pass the scankeys to
+		 * the index AM.
+		 */
+		if (indexstate->iss_NumRuntimeKeys == 0)
+			index_rescan(indexstate->iss_ScanDesc,
+					   indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
 				indexstate->iss_OrderByKeys, indexstate->iss_NumOrderByKeys);
+	}
 
 	/*
 	 * all done.
@@ -1590,3 +1623,91 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 	else if (n_array_keys != 0)
 		elog(ERROR, "ScalarArrayOpExpr index qual found where not allowed");
 }
+
+/* ----------------------------------------------------------------
+ *						Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanEstimate
+ *
+ *		estimates the space required to serialize indexscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanEstimate(IndexScanState *node,
+					  ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+													 estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeDSM
+ *
+ *		Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeDSM(IndexScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+	index_parallelscan_initialize(node->ss.ss_currentRelation,
+								  node->iss_RelationDesc,
+								  estate->es_snapshot,
+								  piscan);
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeWorker
+ *
+ *		Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc)
+{
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 85505c5..eeacf81 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -127,8 +127,6 @@ static void subquery_push_qual(Query *subquery,
 static void recurse_push_qual(Node *setOp, Query *topquery,
 				  RangeTblEntry *rte, Index rti, Node *qual);
 static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
-static int compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
-						BlockNumber index_pages);
 
 
 /*
@@ -2885,7 +2883,7 @@ remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
  * "heap_pages" is the number of pages from the table that we expect to scan.
  * "index_pages" is the number of pages from the index that we expect to scan.
  */
-static int
+int
 compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
 						BlockNumber index_pages)
 {
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a43daa7..d01630f 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -391,7 +391,8 @@ cost_gather(GatherPath *path, PlannerInfo *root,
  * we have to fetch from the table, so they don't reduce the scan cost.
  */
 void
-cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
+cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
+		   bool partial_path)
 {
 	IndexOptInfo *index = path->indexinfo;
 	RelOptInfo *baserel = index->rel;
@@ -400,6 +401,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	List	   *qpquals;
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
+	Cost		cpu_run_cost = 0;
 	Cost		indexStartupCost;
 	Cost		indexTotalCost;
 	Selectivity indexSelectivity;
@@ -413,6 +415,8 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	Cost		cpu_per_tuple;
 	double		tuples_fetched;
 	double		pages_fetched;
+	double		rand_heap_pages;
+	double		index_pages;
 
 	/* Should only be applied to base relations */
 	Assert(IsA(baserel, RelOptInfo) &&
@@ -459,7 +463,8 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	amcostestimate = (amcostestimate_function) index->amcostestimate;
 	amcostestimate(root, path, loop_count,
 				   &indexStartupCost, &indexTotalCost,
-				   &indexSelectivity, &indexCorrelation);
+				   &indexSelectivity, &indexCorrelation,
+				   &index_pages);
 
 	/*
 	 * Save amcostestimate's results for possible use in bitmap scan planning.
@@ -526,6 +531,8 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 		if (indexonly)
 			pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
 
+		rand_heap_pages = pages_fetched;
+
 		max_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
 
 		/*
@@ -564,6 +571,8 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 		if (indexonly)
 			pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
 
+		rand_heap_pages = pages_fetched;
+
 		/* max_IO_cost is for the perfectly uncorrelated case (csquared=0) */
 		max_IO_cost = pages_fetched * spc_random_page_cost;
 
@@ -583,6 +592,29 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 			min_IO_cost = 0;
 	}
 
+	if (partial_path)
+	{
+		/*
+		 * Estimate the number of parallel workers required to scan index. Use
+		 * the number of heap pages computed considering heap fetches won't be
+		 * sequential as for parallel scans the pages are accessed in random
+		 * order.
+		 */
+		path->path.parallel_workers = compute_parallel_worker(baserel,
+											   (BlockNumber) rand_heap_pages,
+												  (BlockNumber) index_pages);
+
+		/*
+		 * Fall out if workers can't be assigned for parallel scan, because in
+		 * such a case this path will be rejected.  So there is no benefit in
+		 * doing extra computation.
+		 */
+		if (path->path.parallel_workers <= 0)
+			return;
+
+		path->path.parallel_aware = true;
+	}
+
 	/*
 	 * Now interpolate based on estimated index order correlation to get total
 	 * disk I/O cost for main table accesses.
@@ -602,11 +634,24 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 
-	run_cost += cpu_per_tuple * tuples_fetched;
+	cpu_run_cost += cpu_per_tuple * tuples_fetched;
 
 	/* tlist eval costs are paid per output row, not per tuple scanned */
 	startup_cost += path->path.pathtarget->cost.startup;
-	run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+	cpu_run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+
+	/* Adjust costing for parallelism, if used. */
+	if (path->path.parallel_workers > 0)
+	{
+		double		parallel_divisor = get_parallel_divisor(&path->path);
+
+		path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
+
+		/* The CPU cost is divided among all the workers. */
+		cpu_run_cost /= parallel_divisor;
+	}
+
+	run_cost += cpu_run_cost;
 
 	path->path.startup_cost = startup_cost;
 	path->path.total_cost = startup_cost + run_cost;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 5283468..0e9cc0a 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -813,7 +813,7 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 /*
  * build_index_paths
  *	  Given an index and a set of index clauses for it, construct zero
- *	  or more IndexPaths.
+ *	  or more IndexPaths. It also constructs zero or more partial IndexPaths.
  *
  * We return a list of paths because (1) this routine checks some cases
  * that should cause us to not generate any IndexPath, and (2) in some
@@ -1042,8 +1042,43 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  NoMovementScanDirection,
 								  index_only_scan,
 								  outer_relids,
-								  loop_count);
+								  loop_count,
+								  false);
 		result = lappend(result, ipath);
+
+		/*
+		 * If appropriate, consider parallel index scan.  We don't allow
+		 * parallel index scan for bitmap or index only scans.
+		 */
+		if (index->amcanparallel &&
+			!index_only_scan &&
+			rel->consider_parallel &&
+			outer_relids == NULL &&
+			scantype != ST_BITMAPSCAN)
+		{
+			ipath = create_index_path(root, index,
+									  index_clauses,
+									  clause_columns,
+									  orderbyclauses,
+									  orderbyclausecols,
+									  useful_pathkeys,
+									  index_is_ordered ?
+									  ForwardScanDirection :
+									  NoMovementScanDirection,
+									  index_only_scan,
+									  outer_relids,
+									  loop_count,
+									  true);
+
+			/*
+			 * consider path for parallelism only when it is beneficial to
+			 * engage workers to scan the underlying relation.
+			 */
+			if (ipath->path.parallel_workers > 0)
+				add_partial_path(rel, (Path *) ipath);
+			else
+				pfree(ipath);
+		}
 	}
 
 	/*
@@ -1066,8 +1101,34 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  BackwardScanDirection,
 									  index_only_scan,
 									  outer_relids,
-									  loop_count);
+									  loop_count,
+									  false);
 			result = lappend(result, ipath);
+
+			/* If appropriate, consider parallel index scan */
+			if (index->amcanparallel &&
+				!index_only_scan &&
+				rel->consider_parallel &&
+				outer_relids == NULL &&
+				scantype != ST_BITMAPSCAN)
+			{
+				ipath = create_index_path(root, index,
+										  index_clauses,
+										  clause_columns,
+										  NIL,
+										  NIL,
+										  useful_pathkeys,
+										  BackwardScanDirection,
+										  index_only_scan,
+										  outer_relids,
+										  loop_count,
+										  true);
+
+				if (ipath->path.parallel_workers > 0)
+					add_partial_path(rel, (Path *) ipath);
+				else
+					pfree(ipath);
+			}
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 881742f..486bd28 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5379,7 +5379,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	indexScanPath = create_index_path(root, indexInfo,
 									  NIL, NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
-									  NULL, 1.0);
+									  NULL, 1.0, false);
 
 	return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f440875..c36522d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -744,10 +744,9 @@ add_path_precheck(RelOptInfo *parent_rel,
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
  *	  such paths yet.  Hence, GatherPaths must not be created for a rel until
- *	  we're done creating all partial paths for it.  We do not currently build
- *	  partial indexscan paths, so there is no need for an exception for
- *	  IndexPaths here; for safety, we instead Assert that a path to be freed
- *	  isn't an IndexPath.
+ *	  we're done creating all partial paths for it.  Unlike add_path, we don't
+ *	  take an exception for IndexPaths as partial index paths won't be
+ *	  referenced by partial BitmapHeapPaths.
  */
 void
 add_partial_path(RelOptInfo *parent_rel, Path *new_path)
@@ -826,8 +825,6 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		{
 			parent_rel->partial_pathlist =
 				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
-			/* we should not see IndexPaths here, so always safe to delete */
-			Assert(!IsA(old_path, IndexPath));
 			pfree(old_path);
 			/* p1_prev does not advance */
 		}
@@ -860,8 +857,6 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	}
 	else
 	{
-		/* we should not see IndexPaths here, so always safe to delete */
-		Assert(!IsA(new_path, IndexPath));
 		/* Reject and recycle the new path */
 		pfree(new_path);
 	}
@@ -1005,6 +1000,7 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
  * 'required_outer' is the set of outer relids for a parameterized path.
  * 'loop_count' is the number of repetitions of the indexscan to factor into
  *		estimates of caching behavior.
+ * 'partialpath' is true if the parallel scan is expected.
  *
  * Returns the new path node.
  */
@@ -1019,7 +1015,8 @@ create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count)
+				  double loop_count,
+				  bool partialpath)
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
@@ -1049,7 +1046,7 @@ create_index_path(PlannerInfo *root,
 	pathnode->indexorderbycols = indexorderbycols;
 	pathnode->indexscandir = indexscandir;
 
-	cost_index(pathnode, root, loop_count);
+	cost_index(pathnode, root, loop_count, partialpath);
 
 	return pathnode;
 }
@@ -3247,7 +3244,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				memcpy(newpath, ipath, sizeof(IndexPath));
 				newpath->path.param_info =
 					get_baserel_parampathinfo(root, rel, required_outer);
-				cost_index(newpath, root, loop_count);
+				cost_index(newpath, root, loop_count, false);
 				return (Path *) newpath;
 			}
 		case T_BitmapHeapScan:
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 7836e6b..4ed2705 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -241,6 +241,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->amoptionalkey = amroutine->amoptionalkey;
 			info->amsearcharray = amroutine->amsearcharray;
 			info->amsearchnulls = amroutine->amsearchnulls;
+			info->amcanparallel = amroutine->amcanparallel;
 			info->amhasgettuple = (amroutine->amgettuple != NULL);
 			info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
 			info->amcostestimate = amroutine->amcostestimate;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fa32e9e..d14f0f9 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6471,7 +6471,8 @@ add_predicate_to_quals(IndexOptInfo *index, List *indexQuals)
 void
 btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 			   Cost *indexStartupCost, Cost *indexTotalCost,
-			   Selectivity *indexSelectivity, double *indexCorrelation)
+			   Selectivity *indexSelectivity, double *indexCorrelation,
+			   double *indexPages)
 {
 	IndexOptInfo *index = path->indexinfo;
 	List	   *qinfos;
@@ -6761,12 +6762,14 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexTotalCost = costs.indexTotalCost;
 	*indexSelectivity = costs.indexSelectivity;
 	*indexCorrelation = costs.indexCorrelation;
+	*indexPages = costs.numIndexPages;
 }
 
 void
 hashcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				 Cost *indexStartupCost, Cost *indexTotalCost,
-				 Selectivity *indexSelectivity, double *indexCorrelation)
+				 Selectivity *indexSelectivity, double *indexCorrelation,
+				 double *indexPages)
 {
 	List	   *qinfos;
 	GenericCosts costs;
@@ -6807,12 +6810,14 @@ hashcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexTotalCost = costs.indexTotalCost;
 	*indexSelectivity = costs.indexSelectivity;
 	*indexCorrelation = costs.indexCorrelation;
+	*indexPages = costs.numIndexPages;
 }
 
 void
 gistcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				 Cost *indexStartupCost, Cost *indexTotalCost,
-				 Selectivity *indexSelectivity, double *indexCorrelation)
+				 Selectivity *indexSelectivity, double *indexCorrelation,
+				 double *indexPages)
 {
 	IndexOptInfo *index = path->indexinfo;
 	List	   *qinfos;
@@ -6866,12 +6871,14 @@ gistcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexTotalCost = costs.indexTotalCost;
 	*indexSelectivity = costs.indexSelectivity;
 	*indexCorrelation = costs.indexCorrelation;
+	*indexPages = costs.numIndexPages;
 }
 
 void
 spgcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				Cost *indexStartupCost, Cost *indexTotalCost,
-				Selectivity *indexSelectivity, double *indexCorrelation)
+				Selectivity *indexSelectivity, double *indexCorrelation,
+				double *indexPages)
 {
 	IndexOptInfo *index = path->indexinfo;
 	List	   *qinfos;
@@ -6925,6 +6932,7 @@ spgcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexTotalCost = costs.indexTotalCost;
 	*indexSelectivity = costs.indexSelectivity;
 	*indexCorrelation = costs.indexCorrelation;
+	*indexPages = costs.numIndexPages;
 }
 
 
@@ -7222,7 +7230,8 @@ gincost_scalararrayopexpr(PlannerInfo *root,
 void
 gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				Cost *indexStartupCost, Cost *indexTotalCost,
-				Selectivity *indexSelectivity, double *indexCorrelation)
+				Selectivity *indexSelectivity, double *indexCorrelation,
+				double *indexPages)
 {
 	IndexOptInfo *index = path->indexinfo;
 	List	   *indexQuals = path->indexquals;
@@ -7537,6 +7546,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexStartupCost += qual_arg_cost;
 	*indexTotalCost += qual_arg_cost;
 	*indexTotalCost += (numTuples * *indexSelectivity) * (cpu_index_tuple_cost + qual_op_cost);
+	*indexPages = dataPagesFetched;
 }
 
 /*
@@ -7545,7 +7555,8 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 void
 brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				 Cost *indexStartupCost, Cost *indexTotalCost,
-				 Selectivity *indexSelectivity, double *indexCorrelation)
+				 Selectivity *indexSelectivity, double *indexCorrelation,
+				 double *indexPages)
 {
 	IndexOptInfo *index = path->indexinfo;
 	List	   *indexQuals = path->indexquals;
@@ -7597,6 +7608,7 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexStartupCost += qual_arg_cost;
 	*indexTotalCost += qual_arg_cost;
 	*indexTotalCost += (numTuples * *indexSelectivity) * (cpu_index_tuple_cost + qual_op_cost);
+	*indexPages = index->pages;
 
 	/* XXX what about pages_per_range? */
 }
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index b0730bf..f919cf8 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -95,7 +95,8 @@ typedef void (*amcostestimate_function) (struct PlannerInfo *root,
 													 Cost *indexStartupCost,
 													 Cost *indexTotalCost,
 											   Selectivity *indexSelectivity,
-												   double *indexCorrelation);
+													 double *indexCorrelation,
+													 double *indexPages);
 
 /* parse index reloptions */
 typedef bytea *(*amoptions_function) (Datum reloptions,
@@ -188,6 +189,8 @@ typedef struct IndexAmRoutine
 	bool		amclusterable;
 	/* does AM handle predicate locks? */
 	bool		ampredlocks;
+	/* does AM support parallel scan? */
+	bool		amcanparallel;
 	/* type of data stored in index, or InvalidOid if variable */
 	Oid			amkeytype;
 
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 46d6f45..ea3f3a5 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -14,6 +14,7 @@
 #ifndef NODEINDEXSCAN_H
 #define NODEINDEXSCAN_H
 
+#include "access/parallel.h"
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
@@ -22,6 +23,9 @@ extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
 extern void ExecReScanIndexScan(IndexScanState *node);
+extern void ExecIndexScanEstimate(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeDSM(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc);
 
 /*
  * These routines are exported to share code with nodeIndexonlyscan.c and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 42c6c58..b3a3c22 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1363,6 +1363,7 @@ typedef struct
  *		SortSupport		   for reordering ORDER BY exprs
  *		OrderByTypByVals   is the datatype of order by expression pass-by-value?
  *		OrderByTypLens	   typlens of the datatypes of order by expressions
+ *		pscan_len		   size of parallel index scan descriptor
  * ----------------
  */
 typedef struct IndexScanState
@@ -1389,6 +1390,9 @@ typedef struct IndexScanState
 	SortSupport iss_SortSupport;
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
+
+	/* This is needed for parallel index scan */
+	Size		iss_PscanLen;
 } IndexScanState;
 
 /* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 643be54..f7ac6f6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -629,6 +629,7 @@ typedef struct IndexOptInfo
 	bool		amsearchnulls;	/* can AM search for NULL/NOT NULL entries? */
 	bool		amhasgettuple;	/* does AM have amgettuple interface? */
 	bool		amhasgetbitmap; /* does AM have amgetbitmap interface? */
+	bool		amcanparallel;	/* does AM support parallel scan? */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
 	void		(*amcostestimate) ();	/* AM's cost estimator */
 } IndexOptInfo;
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 0e68264..72200fa 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -76,7 +76,7 @@ extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
-		   double loop_count);
+		   double loop_count, bool partial_path);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 					  ParamPathInfo *param_info,
 					  Path *bitmapqual, double loop_count);
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7b41317..4362c7a 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -47,7 +47,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count);
+				  double loop_count,
+				  bool partialpath);
 extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						Path *bitmapqual,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 81e7a42..ebda308 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 					 List *initial_rels);
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel);
+extern int compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
+						BlockNumber index_pages);
 
 #ifdef OPTIMIZER_DEBUG
 extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index d317242..17d165c 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -28,41 +28,47 @@ extern void brincostestimate(struct PlannerInfo *root,
 				 Cost *indexStartupCost,
 				 Cost *indexTotalCost,
 				 Selectivity *indexSelectivity,
-				 double *indexCorrelation);
+				 double *indexCorrelation,
+				 double *indexPages);
 extern void btcostestimate(struct PlannerInfo *root,
 			   struct IndexPath *path,
 			   double loop_count,
 			   Cost *indexStartupCost,
 			   Cost *indexTotalCost,
 			   Selectivity *indexSelectivity,
-			   double *indexCorrelation);
+			   double *indexCorrelation,
+			   double *indexPages);
 extern void hashcostestimate(struct PlannerInfo *root,
 				 struct IndexPath *path,
 				 double loop_count,
 				 Cost *indexStartupCost,
 				 Cost *indexTotalCost,
 				 Selectivity *indexSelectivity,
-				 double *indexCorrelation);
+				 double *indexCorrelation,
+				 double *indexPages);
 extern void gistcostestimate(struct PlannerInfo *root,
 				 struct IndexPath *path,
 				 double loop_count,
 				 Cost *indexStartupCost,
 				 Cost *indexTotalCost,
 				 Selectivity *indexSelectivity,
-				 double *indexCorrelation);
+				 double *indexCorrelation,
+				 double *indexPages);
 extern void spgcostestimate(struct PlannerInfo *root,
 				struct IndexPath *path,
 				double loop_count,
 				Cost *indexStartupCost,
 				Cost *indexTotalCost,
 				Selectivity *indexSelectivity,
-				double *indexCorrelation);
+				double *indexCorrelation,
+				double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 				struct IndexPath *path,
 				double loop_count,
 				Cost *indexStartupCost,
 				Cost *indexTotalCost,
 				Selectivity *indexSelectivity,
-				double *indexCorrelation);
+				double *indexCorrelation,
+				double *indexPages);
 
 #endif   /* INDEX_SELFUNCS_H */
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index b967b45..37d9e86 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -99,6 +99,29 @@ explain (costs off)
    ->  Index Only Scan using tenk1_unique1 on tenk1
 (3 rows)
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+                             QUERY PLAN                             
+--------------------------------------------------------------------
+ Finalize Aggregate
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial Aggregate
+               ->  Parallel Index Scan using tenk1_hundred on tenk1
+                     Index Cond: (hundred > 1)
+(6 rows)
+
+select  count((unique1)) from tenk1 where hundred > 1;
+ count 
+-------
+  9800
+(1 row)
+
+reset enable_seqscan;
+reset enable_bitmapscan;
 set force_parallel_mode=1;
 explain (costs off)
   select stringu1::int2 from tenk1 where unique1 = 1;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index e3c32de..6102981 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -39,6 +39,17 @@ explain (costs off)
 	select  sum(parallel_restricted(unique1)) from tenk1
 	group by(parallel_restricted(unique1));
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+select  count((unique1)) from tenk1 where hundred > 1;
+
+reset enable_seqscan;
+reset enable_bitmapscan;
+
 set force_parallel_mode=1;
 
 explain (costs off)
#76Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#74)
Re: Parallel Index Scans

On Sat, Feb 11, 2017 at 6:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Why can't we rely on _bt_walk_left?

The reason is mentioned in the comments, but let me try to explain with
an example. When you reach that point in the code, it means that either
the current page (assume page number 10) doesn't contain any matching
items or it is a half-dead page, both of which indicate that we have to
move to the previous page. Now, before checking whether the current
page contains matching items, we signal the parallel machinery (via
_bt_parallel_release) to allow workers to read the previous page
(assume its page number is 9). So it is quite possible that, after
deciding that the current page (page 10) doesn't contain any matching
tuples, if we move directly to the previous page (page 9) using
_bt_walk_left, some other worker has already read page 9. In short, if
we use _bt_walk_left() directly, we are prone to returning some of the
values twice, because multiple workers can read the same page.

But ... the entire point of the seize-and-release stuff is to avoid
this problem. You're supposed to seize the scan, read the current
page, walk left, store the page you find in the scan, and then release
the scan.

Exactly, and that is what is done in the patch. Basically, if we find
that the current page is half-dead or doesn't contain any matching
items, we release the current buffer, seize the scan, read the current
page, walk left, and so on. I am slightly confused here because it
seems both of us agree on the right thing to do, and as far as I can
tell that is how it is implemented. Are you just confirming whether I
have implemented it as discussed, or do you see a problem with the way
it is implemented?

Well, before, I thought you said that relying entirely on
_bt_walk_left couldn't work because then two people might end up
running it at the same time, and that would cause problems. But if
you can only run _bt_walk_left while you've got the scan seized, then
that can't happen. Evidently I'm missing something here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#77Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#76)
Re: Parallel Index Scans

On Mon, Feb 13, 2017 at 5:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Feb 11, 2017 at 6:35 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Why can't we rely on _bt_walk_left?

The reason is mentioned in the comments, but let me try to explain with
an example. When you reach that point in the code, it means that either
the current page (assume page number 10) doesn't contain any matching
items or it is a half-dead page, both of which indicate that we have to
move to the previous page. Now, before checking whether the current
page contains matching items, we signal the parallel machinery (via
_bt_parallel_release) to allow workers to read the previous page
(assume its page number is 9). So it is quite possible that, after
deciding that the current page (page 10) doesn't contain any matching
tuples, if we move directly to the previous page (page 9) using
_bt_walk_left, some other worker has already read page 9. In short, if
we use _bt_walk_left() directly, we are prone to returning some of the
values twice, because multiple workers can read the same page.

But ... the entire point of the seize-and-release stuff is to avoid
this problem. You're supposed to seize the scan, read the current
page, walk left, store the page you find in the scan, and then release
the scan.

Exactly, and that is what is done in the patch. Basically, if we find
that the current page is half-dead or doesn't contain any matching
items, we release the current buffer, seize the scan, read the current
page, walk left, and so on. I am slightly confused here because it
seems both of us agree on the right thing to do, and as far as I can
tell that is how it is implemented. Are you just confirming whether I
have implemented it as discussed, or do you see a problem with the way
it is implemented?

Well, before, I thought you said that relying entirely on
_bt_walk_left couldn't work because then two people might end up
running it at the same time, and that would cause problems. But if
you can only run _bt_walk_left while you've got the scan seized, then
that can't happen. Evidently I'm missing something here.

I think the comment at that place is not as clear as it should be. So
how about changing it as below:

Existing comment:
--------------------------
/*
* For parallel scans, get the last page scanned as it is quite
* possible that by the time we try to fetch previous page, other
* worker has also decided to scan that previous page. So we
* can't rely on _bt_walk_left call.
*/

Modified comment:
--------------------------
/*
* For parallel scans, it is quite possible that by the time we try to fetch
* the previous page, another worker has also decided to scan that
* previous page. So to avoid that we need to get the last page scanned
* from shared scan descriptor before calling _bt_walk_left.
*/

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#78Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#77)
Re: Parallel Index Scans

On Mon, Feb 13, 2017 at 9:04 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think the comment at that place is not as clear as it should be. So
how about changing it as below:

Existing comment:
--------------------------
/*
* For parallel scans, get the last page scanned as it is quite
* possible that by the time we try to fetch previous page, other
* worker has also decided to scan that previous page. So we
* can't rely on _bt_walk_left call.
*/

Modified comment:
--------------------------
/*
* For parallel scans, it is quite possible that by the time we try to fetch
* the previous page, another worker has also decided to scan that
* previous page. So to avoid that we need to get the last page scanned
* from shared scan descriptor before calling _bt_walk_left.
*/

That sounds way better.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#79Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#78)
1 attachment(s)
Re: Parallel Index Scans

On Tue, Feb 14, 2017 at 12:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:

That sounds way better.

Here's an updated patch. Please review my changes, which include:

* Various comment updates.
* _bt_parallel_seize now unconditionally sets *pageno to P_NONE at the
beginning, instead of doing it conditionally at the end. This seems
cleaner to me.
* I removed various BTScanPosInvalidate calls from _bt_first in places
where they followed calls to _bt_parallel_done, because I can't see
how the scan position could be valid at that point; note that
_bt_first asserts that it is invalid on entry.
* I added a _bt_parallel_done() call to _bt_first where it apparently
returned without releasing the scan; search for SK_ROW_MEMBER. Maybe
there's something I'm missing that makes this unnecessary, but if so
there should probably be a comment here.
* I wasn't happy with the strange calling convention where
_bt_readnextpage usually gets a valid block number but not for
non-parallel backward scans. I had a stab at fixing that so that the
block number is always valid, but I'm not entirely sure I've got the
logic right. Please see what you think.
* I repositioned the function prototypes you added to nbtree.h to
separate the public and non-public interfaces.

I can't easily test this because your second patch doesn't apply, so
I'd appreciate it if you could have a stab at checking whether I've
broken anything in this revision. Also, it would be good if you could
rebase the second patch.

I think this is pretty close to committable at this point. Whether or
not I broke anything in this revision, I don't think there's too much
left to be done here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

parallel_index_scan_v9.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5b67def..3b6a6dc 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1207,7 +1207,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting in an extension.</entry>
         </row>
         <row>
-         <entry morerows="9"><literal>IPC</></entry>
+         <entry morerows="10"><literal>IPC</></entry>
          <entry><literal>BgWorkerShutdown</></entry>
          <entry>Waiting for background worker to shut down.</entry>
         </row>
@@ -1240,6 +1240,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting for parallel workers to finish computing.</entry>
         </row>
         <row>
+         <entry><literal>ParallelBtreePage</></entry>
+         <entry>Waiting for the page number needed to continue a parallel btree scan
+         to become available.</entry>
+        </row>
+        <row>
          <entry><literal>SafeSnapshot</></entry>
          <entry>Waiting for a snapshot for a <literal>READ ONLY DEFERRABLE</> transaction.</entry>
         </row>
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 945e563..a90c801 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,8 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "pgstat.h"
+#include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -63,6 +65,45 @@ typedef struct
 	MemoryContext pagedelcontext;
 } BTVacState;
 
+/*
+ * BTPARALLEL_NOT_INITIALIZED indicates that the scan has not started.
+ *
+ * BTPARALLEL_ADVANCING indicates that some process is advancing the scan to
+ * a new page; others must wait.
+ *
+ * BTPARALLEL_IDLE indicates that no backend is currently advancing the scan
+ * to a new page; some process can start doing that.
+ *
+ * BTPARALLEL_DONE implies that the scan is complete (including error exit).
+ * We reach this state once for every distinct combination of array keys.
+ */
+typedef enum
+{
+	BTPARALLEL_NOT_INITIALIZED,
+	BTPARALLEL_ADVANCING,
+	BTPARALLEL_IDLE,
+	BTPARALLEL_DONE
+} BTPS_State;
+
+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+	BlockNumber btps_scanPage;	/* latest or next page to be scanned */
+	BTPS_State	btps_pageStatus;/* indicates whether next page is available
+								 * for scan. see above for possible states of
+								 * parallel scan. */
+	int			btps_arrayKeyCount;		/* count indicating number of array
+										 * scan keys processed by parallel
+										 * scan */
+	slock_t		btps_mutex;		/* protects above variables */
+	ConditionVariable btps_cv;	/* used to synchronize parallel scan */
+} BTParallelScanDescData;
+
+typedef struct BTParallelScanDescData *BTParallelScanDesc;
+
 
 static void btbuildCallback(Relation index,
 				HeapTuple htup,
@@ -118,9 +159,9 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = btendscan;
 	amroutine->ammarkpos = btmarkpos;
 	amroutine->amrestrpos = btrestrpos;
-	amroutine->amestimateparallelscan = NULL;
-	amroutine->aminitparallelscan = NULL;
-	amroutine->amparallelrescan = NULL;
+	amroutine->amestimateparallelscan = btestimateparallelscan;
+	amroutine->aminitparallelscan = btinitparallelscan;
+	amroutine->amparallelrescan = btparallelrescan;
 
 	PG_RETURN_POINTER(amroutine);
 }
@@ -491,6 +532,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	}
 
 	so->markItemIndex = -1;
+	so->arrayKeyCount = 0;
 	BTScanPosUnpinIfPinned(so->markPos);
 	BTScanPosInvalidate(so->markPos);
 
@@ -653,6 +695,217 @@ btrestrpos(IndexScanDesc scan)
 }
 
 /*
+ * btestimateparallelscan -- estimate storage for BTParallelScanDescData
+ */
+Size
+btestimateparallelscan(void)
+{
+	return sizeof(BTParallelScanDescData);
+}
+
+/*
+ * btinitparallelscan -- initialize BTParallelScanDesc for parallel btree scan
+ */
+void
+btinitparallelscan(void *target)
+{
+	BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
+
+	SpinLockInit(&bt_target->btps_mutex);
+	bt_target->btps_scanPage = InvalidBlockNumber;
+	bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->btps_arrayKeyCount = 0;
+	ConditionVariableInit(&bt_target->btps_cv);
+}
+
+/*
+ *	btparallelrescan() -- reset parallel scan
+ */
+void
+btparallelrescan(IndexScanDesc scan)
+{
+	BTParallelScanDesc btscan;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+
+	Assert(parallel_scan);
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * In theory, we don't need to acquire the spinlock here, because there
+	 * shouldn't be any other workers running at this point, but we do so for
+	 * consistency.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = InvalidBlockNumber;
+	btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	btscan->btps_arrayKeyCount = 0;
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
+ * _bt_parallel_seize() -- Begin the process of advancing the scan to a new
+ *		page.  Other scans must wait until we call _bt_parallel_release() or
+ *		_bt_parallel_done().
+ *
+ * The return value is true if we successfully seized the scan and false
+ * if we did not.  The latter case occurs if no pages remain for the current
+ * set of scankeys.
+ *
+ * If the return value is true, *pageno returns the next or current page
+ * of the scan (depending on the scan direction).  An invalid block number
+ * means the scan hasn't yet started, and P_NONE means we've reached the end.
+ * The first time a participating process reaches the last page, it will return
+ * true and set *pageno to P_NONE; after that, further attempts to seize the
+ * scan will return false.
+ *
+ * Callers should ignore the value of pageno if the return value is false.
+ */
+bool
+_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTPS_State	pageStatus;
+	bool		exit_loop = false;
+	bool		status = true;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	*pageno = P_NONE;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	while (1)
+	{
+		SpinLockAcquire(&btscan->btps_mutex);
+		pageStatus = btscan->btps_pageStatus;
+
+		if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+		{
+			/* Parallel scan has already advanced to a new set of scankeys. */
+			status = false;
+		}
+		else if (pageStatus == BTPARALLEL_DONE)
+		{
+			/*
+			 * We're done with this set of scankeys, but have not yet advanced
+			 * to the next set.
+			 */
+			status = false;
+		}
+		else if (pageStatus != BTPARALLEL_ADVANCING)
+		{
+			/*
+			 * We have successfully seized control of the scan for the purpose
+			 * of advancing it to a new page!
+			 */
+			btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
+			*pageno = btscan->btps_scanPage;
+			exit_loop = true;
+		}
+		SpinLockRelease(&btscan->btps_mutex);
+		if (exit_loop || !status)
+			break;
+		ConditionVariableSleep(&btscan->btps_cv, WAIT_EVENT_BTREE_PAGE);
+	}
+	ConditionVariableCancelSleep();
+
+	return status;
+}
+
+/*
+ * _bt_parallel_release() -- Complete the process of advancing the scan to a
+ *		new page.  We now have the new value btps_scanPage; some other backend
+ *		can now begin advancing the scan.
+ */
+void
+_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+{
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = scan_page;
+	btscan->btps_pageStatus = BTPARALLEL_IDLE;
+	SpinLockRelease(&btscan->btps_mutex);
+	ConditionVariableSignal(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_done() -- Mark the parallel scan as complete.
+ *
+ * When there are no pages left to scan, this function should be called to
+ * notify other workers.  Otherwise, they might wait forever for the scan to
+ * advance to the next page.
+ */
+void
+_bt_parallel_done(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+	bool		status_changed = false;
+
+	/* Do nothing, for non-parallel scans */
+	if (parallel_scan == NULL)
+		return;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * Mark the parallel scan as done for this combination of scan keys,
+	 * unless some other process already did so.  See also
+	 * _bt_advance_array_keys.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+		btscan->btps_pageStatus != BTPARALLEL_DONE)
+	{
+		btscan->btps_pageStatus = BTPARALLEL_DONE;
+		status_changed = true;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+
+	/* wake up all the workers associated with this parallel scan */
+	if (status_changed)
+		ConditionVariableBroadcast(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_advance_array_keys() -- Advances the parallel scan for array
+ *			keys.
+ *
+ * Updates the count of array keys processed for both local and parallel
+ * scans.
+ */
+void
+_bt_parallel_advance_array_keys(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	so->arrayKeyCount++;
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (btscan->btps_pageStatus == BTPARALLEL_DONE)
+	{
+		btscan->btps_scanPage = InvalidBlockNumber;
+		btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		btscan->btps_arrayKeyCount++;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index b6459d2..2f32b2e 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,9 +30,13 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
+					  ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
 
 /*
@@ -544,8 +548,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
+	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	BlockNumber blkno;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -564,6 +570,30 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (!so->qual_ok)
 		return false;
 
+	/*
+	 * For parallel scans, get the starting page from shared state. If the
+	 * scan has not started, proceed to find out first leaf page in the usual
+	 * way while keeping other participating processes waiting.  If the scan
+	 * has already begun, use the page number from the shared structure.
+	 */
+	if (scan->parallel_scan != NULL)
+	{
+		status = _bt_parallel_seize(scan, &blkno);
+		if (!status)
+			return false;
+		else if (blkno == P_NONE)
+		{
+			_bt_parallel_done(scan);
+			return false;
+		}
+		else if (blkno != InvalidBlockNumber)
+		{
+			if (!_bt_parallel_readpage(scan, blkno, dir))
+				return false;
+			goto readcomplete;
+		}
+	}
+
 	/*----------
 	 * Examine the scan keys to discover where we need to start the scan.
 	 *
@@ -743,7 +773,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * there.
 	 */
 	if (keysCount == 0)
-		return _bt_endpoint(scan, dir);
+	{
+		bool		match;
+
+		match = _bt_endpoint(scan, dir);
+
+		if (!match)
+		{
+			/* No match, so mark (parallel) scan finished */
+			_bt_parallel_done(scan);
+		}
+
+		return match;
+	}
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -773,7 +815,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 
 			Assert(subkey->sk_flags & SK_ROW_MEMBER);
 			if (subkey->sk_flags & SK_ISNULL)
+			{
+				_bt_parallel_done(scan);
 				return false;
+			}
 			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
@@ -993,25 +1038,21 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * because nothing finer to lock exists.
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
+
+		/*
+		 * mark parallel scan as done, so that all the workers can finish
+		 * their scan
+		 */
+		_bt_parallel_done(scan);
+		BTScanPosInvalidate(so->currPos);
+
 		return false;
 	}
 	else
 		PredicateLockPage(rel, BufferGetBlockNumber(buf),
 						  scan->xs_snapshot);
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	Assert(so->markItemIndex == -1);
+	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
@@ -1060,6 +1101,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
+readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
@@ -1132,6 +1174,10 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction.
  *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
  * Returns true if any matching items found on the page, false if none.
  */
 static bool
@@ -1154,6 +1200,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	page = BufferGetPage(so->currPos.buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/* allow next page to be processed by parallel worker */
+	if (scan->parallel_scan)
+	{
+		if (ScanDirectionIsForward(dir))
+			_bt_parallel_release(scan, opaque->btpo_next);
+		else
+			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+	}
+
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1278,21 +1334,16 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
  * if pinned, we'll drop the pin before moving to next page.  The buffer is
  * not locked on entry.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page.  For success on a scan using a non-MVCC snapshot we hold
- * a pin, but not a read lock, on that page.  If we do not hold the pin, we
- * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
- *
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
+ * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
+ * to InvalidBuffer.  We return TRUE to indicate success.
  */
 static bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Relation	rel;
-	Page		page;
-	BTPageOpaque opaque;
+	BlockNumber blkno = InvalidBlockNumber;
+	bool		status = true;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1319,25 +1370,103 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}
 
-	rel = scan->indexRelation;
-
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
-		/* We must rely on the previously saved nextPage link! */
-		BlockNumber blkno = so->currPos.nextPage;
+		if (scan->parallel_scan != NULL)
+		{
+			/*
+			 * Seize the scan to get the next block number; if the scan has
+			 * ended already, bail out.
+			 */
+			status = _bt_parallel_seize(scan, &blkno);
+			if (!status)
+			{
+				/* release the previous buffer, if pinned */
+				BTScanPosUnpinIfPinned(so->currPos);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+		{
+			/* Not parallel, so use the previously-saved nextPage link. */
+			blkno = so->currPos.nextPage;
+		}
 
 		/* Remember we left a page with data */
 		so->currPos.moreLeft = true;
 
 		/* release the previous buffer, if pinned */
 		BTScanPosUnpinIfPinned(so->currPos);
+	}
+	else
+	{
+		/* Remember we left a page with data */
+		so->currPos.moreRight = true;
+
+		if (scan->parallel_scan != NULL)
+		{
+			/*
+			 * Seize the scan to get the current block number; if the scan has
+			 * ended already, bail out.
+			 */
+			status = _bt_parallel_seize(scan, &blkno);
+			BTScanPosUnpinIfPinned(so->currPos);
+			if (!status)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+		{
+			/* Not parallel, so just use our own notion of the current page */
+			blkno = so->currPos.currPage;
+		}
+	}
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
+	/* Drop the lock, and maybe the pin, on the current page */
+	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
+	return true;
+}
+
+/*
+ *	_bt_readnextpage() -- Read next page containing valid data for scan
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page.  Caller is responsible for releasing the lock and pin
+ * on the buffer on success.  We return TRUE to indicate success.
+ *
+ * If there are no more matching records in the given direction, we drop all
+ * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation	rel;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		status = true;
+
+	rel = scan->indexRelation;
+
+	if (ScanDirectionIsForward(dir))
+	{
 		for (;;)
 		{
-			/* if we're at end of scan, give up */
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
 			if (blkno == P_NONE || !so->currPos.moreRight)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1359,14 +1488,32 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			}
 
 			/* nope, keep going */
-			blkno = opaque->btpo_next;
+			if (scan->parallel_scan != NULL)
+			{
+				status = _bt_parallel_seize(scan, &blkno);
+				if (!status)
+				{
+					_bt_relbuf(rel, so->currPos.buf);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+			}
+			else
+				blkno = opaque->btpo_next;
 			_bt_relbuf(rel, so->currPos.buf);
 		}
 	}
 	else
 	{
-		/* Remember we left a page with data */
-		so->currPos.moreRight = true;
+		/*
+		 * Should only happen in parallel cases, when some other backend
+		 * advanced the scan.
+		 */
+		if (so->currPos.currPage != blkno)
+		{
+			BTScanPosUnpinIfPinned(so->currPos);
+			so->currPos.currPage = blkno;
+		}
 
 		/*
 		 * Walk left to the next page with data.  This is much more complex
@@ -1401,6 +1548,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			if (!so->currPos.moreLeft)
 			{
 				_bt_relbuf(rel, so->currPos.buf);
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1412,6 +1560,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			/* if we're physically at end of index, return failure */
 			if (so->currPos.buf == InvalidBuffer)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1432,9 +1581,46 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
 					break;
 			}
+
+			/*
+			 * For parallel scans, get the last page scanned as it is quite
+			 * possible that by the time we try to seize the scan, some other
+			 * worker has already advanced the scan to a different page.  We
+			 * must continue based on the latest page scanned by any worker.
+			 */
+			if (scan->parallel_scan != NULL)
+			{
+				_bt_relbuf(rel, so->currPos.buf);
+				status = _bt_parallel_seize(scan, &blkno);
+				if (!status)
+				{
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+				so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			}
 		}
 	}
 
+	return true;
+}
+
+/*
+ *	_bt_parallel_readpage() -- Read current page containing valid data for scan
+ *
+ * On success, release lock and maybe pin on buffer.  We return TRUE to
+ * indicate success.
+ */
+static bool
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	_bt_initialize_more_data(so, dir);
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
 	/* Drop the lock, and maybe the pin, on the current page */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
@@ -1712,19 +1898,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/* remember which buffer we have pinned */
 	so->currPos.buf = buf;
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
+	_bt_initialize_more_data(so, dir);
 
 	/*
 	 * Now load data from the first page of the scan.
@@ -1753,3 +1927,25 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 
 	return true;
 }
+
+/*
+ * _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
+ * for scan direction
+ */
+static inline void
+_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
+{
+	/* initialize moreLeft/moreRight appropriately for scan direction */
+	if (ScanDirectionIsForward(dir))
+	{
+		so->currPos.moreLeft = false;
+		so->currPos.moreRight = true;
+	}
+	else
+	{
+		so->currPos.moreLeft = true;
+		so->currPos.moreRight = false;
+	}
+	so->numKilled = 0;			/* just paranoia */
+	so->markItemIndex = -1;		/* ditto */
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index da0f330..5b259a3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -590,6 +590,10 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 			break;
 	}
 
+	/* advance parallel scan */
+	if (scan->parallel_scan != NULL)
+		_bt_parallel_advance_array_keys(scan);
+
 	return found;
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..92af6ec 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3392,6 +3392,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PARALLEL_FINISH:
 			event_name = "ParallelFinish";
 			break;
+		case WAIT_EVENT_BTREE_PAGE:
+			event_name = "ParallelBtreePage";
+			break;
 		case WAIT_EVENT_SAFE_SNAPSHOT:
 			event_name = "SafeSnapshot";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 344ef99..898475a 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -609,6 +609,8 @@ typedef struct BTScanOpaqueData
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
 								 * there are any unsatisfiable array keys) */
+	int			arrayKeyCount;	/* count indicating number of array scan keys
+								 * processed */
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
@@ -652,7 +654,7 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
 /*
- * prototypes for functions in nbtree.c (external entry points for btree)
+ * external entry points for btree, in nbtree.c
  */
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 		struct IndexInfo *indexInfo);
@@ -662,10 +664,13 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 		 IndexUniqueCheck checkUnique,
 		 struct IndexInfo *indexInfo);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern Size btestimateparallelscan(void);
+extern void btinitparallelscan(void *target);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		 ScanKey orderbys, int norderbys);
+extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
@@ -678,6 +683,14 @@ extern IndexBulkDeleteResult *btvacuumcleanup(IndexVacuumInfo *info,
 extern bool btcanreturn(Relation index, int attno);
 
 /*
+ * prototypes for internal functions in nbtree.c
+ */
+extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+extern void _bt_parallel_done(IndexScanDesc scan);
+extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+
+/*
  * prototypes for functions in nbtinsert.c
  */
 extern bool _bt_doinsert(Relation rel, IndexTuple itup,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..915cc85 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -786,6 +786,7 @@ typedef enum
 	WAIT_EVENT_MQ_RECEIVE,
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
+	WAIT_EVENT_BTREE_PAGE,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP
 } WaitEventIPC;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c4235ae..9f876ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -161,6 +161,9 @@ BTPageOpaque
 BTPageOpaqueData
 BTPageStat
 BTPageState
+BTParallelScanDesc
+BTParallelScanDescData
+BTPS_State
 BTScanOpaque
 BTScanOpaqueData
 BTScanPos
#80Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#79)
3 attachment(s)
Re: Parallel Index Scans

On Wed, Feb 15, 2017 at 2:04 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Feb 14, 2017 at 12:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:

That sounds way better.

Here's an updated patch. Please review my changes, which include:

* Various comment updates.

1.
+ * BTPARALLEL_IDLE indicates that no backend is currently advancing the scan
+ * to a new page; some process can start doing that.
+ *
+ * BTPARALLEL_DONE implies that the scan is complete (including error exit).

/implies/indicates, to be consistent with other explanations.

2.
+ * of the scan (depending on thes can direction). An invalid block number

/thes can/the scan

I have modified the patch to include above two changes.

3.
+ else if (pageStatus == BTPARALLEL_DONE)
+ {
+ /*
+ * We're done with this set of scankeys, but have not yet advanced
+ * to the next set.
+ */
+ status = false;
+ }

Here the second part of the comment (but have not yet advanced ...) seems
slightly misleading because this state has nothing to do with the
advancement of scan keys.

I have not changed this because I am not sure what you have in mind.
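
To make the "set of scankeys" wording concrete, the case in question is
an IN / = ANY qual on an indexed column: each array element is scanned
as its own set of scankeys, and _bt_parallel_advance_array_keys is what
moves the shared state from BTPARALLEL_DONE back to
BTPARALLEL_NOT_INITIALIZED between elements.  A rough illustration
(table and index names below are made up):

explain (costs off)
select count(*) from t1 where c1 = any ('{10, 200, 3000}'::int[]);

With a parallel plan, the descent to the leaf level is repeated once per
array element, and the participating processes re-synchronize through
the shared state each time.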

* _bt_parallel_seize now unconditionally sets *pageno to P_NONE at the
beginning, instead of doing it conditionally at the end. This seems
cleaner to me.
* I removed various BTScanPosInvalidate calls from _bt_first in places
where they followed calls to _bt_parallel_done, because I can't see
how the scan position could be valid at that point; note that
_bt_first asserts that it is invalid on entry.
* I added a _bt_parallel_done() call to _bt_first where it apparently
returned without releasing the scan; search for SK_ROW_MEMBER. Maybe
there's something I'm missing that makes this unnecessary, but if so
there should probably be a comment here.
* I wasn't happy with the strange calling convention where
_bt_readnextpage usually gets a valid block number but not for
non-parallel backward scans. I had a stab at fixing that so that the
block number is always valid, but I'm not entirely sure I've got the
logic right. Please see what you think.

Looks good to me.

* I repositioned the function prototypes you added to nbtree.h to
separate the public and non-public interfaces.

I have verified all your changes and they look good to me.

I can't easily test this because your second patch doesn't apply,

I have tried it and it works for me on the latest code, except for one
test output file which could have been excluded. I wonder whether you
are first applying the GUC related patch [1] before applying the
optimizer support related patch. In any case, to avoid confusion, I am
attaching all three patches with this e-mail.

so
I'd appreciate it if you could have a stab at checking whether I've
broken anything in this revision. Also, it would be good if you could
rebase the second patch.

I have rebased the optimizer/executor support related patch.
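
As a quick sanity check after applying all three patches in that order,
something along these lines should show a Parallel Index Scan beneath a
Gather.  This is just a sketch: the table, index, and data are made up,
and the cost settings mirror what select_parallel.sql uses to coax the
planner into parallel plans.

create table t1 (c1 int, c2 char(3));
insert into t1 select i, 'aaa' from generate_series(1, 1000000) i;
create index idx_t1 on t1 (c1);
analyze t1;

set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;
set min_parallel_table_scan_size = 0;
set min_parallel_index_scan_size = 0;
set max_parallel_workers_per_gather = 4;

explain (costs off)
select count(*) from t1 where c1 between 1000 and 2000 and c2 = 'aaa';

The exact plan shape will of course depend on the data and statistics,
but with the thresholds zeroed out the parallel path should win here.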

[1]: /messages/by-id/CAA4eK1+TnM4pXQbvn7OXqam+k_HZqb0ROZUMxOiL6DWJYCyYow@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

parallel_index_scan_v10.patch (application/octet-stream)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 5b67def..3b6a6dc 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1207,7 +1207,7 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting in an extension.</entry>
         </row>
         <row>
-         <entry morerows="9"><literal>IPC</></entry>
+         <entry morerows="10"><literal>IPC</></entry>
          <entry><literal>BgWorkerShutdown</></entry>
          <entry>Waiting for background worker to shut down.</entry>
         </row>
@@ -1240,6 +1240,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry>Waiting for parallel workers to finish computing.</entry>
         </row>
         <row>
+         <entry><literal>ParallelBtreePage</></entry>
+         <entry>Waiting for the page number needed to continue a parallel btree scan
+         to become available.</entry>
+        </row>
+        <row>
          <entry><literal>SafeSnapshot</></entry>
          <entry>Waiting for a snapshot for a <literal>READ ONLY DEFERRABLE</> transaction.</entry>
         </row>
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 945e563..a5c3c08 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,8 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "pgstat.h"
+#include "storage/condition_variable.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -63,6 +65,45 @@ typedef struct
 	MemoryContext pagedelcontext;
 } BTVacState;
 
+/*
+ * BTPARALLEL_NOT_INITIALIZED indicates that the scan has not started.
+ *
+ * BTPARALLEL_ADVANCING indicates that some process is advancing the scan to
+ * a new page; others must wait.
+ *
+ * BTPARALLEL_IDLE indicates that no backend is currently advancing the scan
+ * to a new page; some process can start doing that.
+ *
+ * BTPARALLEL_DONE indicates that the scan is complete (including error exit).
+ * We reach this state once for every distinct combination of array keys.
+ */
+typedef enum
+{
+	BTPARALLEL_NOT_INITIALIZED,
+	BTPARALLEL_ADVANCING,
+	BTPARALLEL_IDLE,
+	BTPARALLEL_DONE
+} BTPS_State;
+
+/*
+ * BTParallelScanDescData contains btree specific shared information required
+ * for parallel scan.
+ */
+typedef struct BTParallelScanDescData
+{
+	BlockNumber btps_scanPage;	/* latest or next page to be scanned */
+	BTPS_State	btps_pageStatus;/* indicates whether next page is available
+								 * for scan. see above for possible states of
+								 * parallel scan. */
+	int			btps_arrayKeyCount;		/* count indicating number of array
+										 * scan keys processed by parallel
+										 * scan */
+	slock_t		btps_mutex;		/* protects above variables */
+	ConditionVariable btps_cv;	/* used to synchronize parallel scan */
+} BTParallelScanDescData;
+
+typedef struct BTParallelScanDescData *BTParallelScanDesc;
+
 
 static void btbuildCallback(Relation index,
 				HeapTuple htup,
@@ -118,9 +159,9 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amendscan = btendscan;
 	amroutine->ammarkpos = btmarkpos;
 	amroutine->amrestrpos = btrestrpos;
-	amroutine->amestimateparallelscan = NULL;
-	amroutine->aminitparallelscan = NULL;
-	amroutine->amparallelrescan = NULL;
+	amroutine->amestimateparallelscan = btestimateparallelscan;
+	amroutine->aminitparallelscan = btinitparallelscan;
+	amroutine->amparallelrescan = btparallelrescan;
 
 	PG_RETURN_POINTER(amroutine);
 }
@@ -491,6 +532,7 @@ btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	}
 
 	so->markItemIndex = -1;
+	so->arrayKeyCount = 0;
 	BTScanPosUnpinIfPinned(so->markPos);
 	BTScanPosInvalidate(so->markPos);
 
@@ -653,6 +695,217 @@ btrestrpos(IndexScanDesc scan)
 }
 
 /*
+ * btestimateparallelscan -- estimate storage for BTParallelScanDescData
+ */
+Size
+btestimateparallelscan(void)
+{
+	return sizeof(BTParallelScanDescData);
+}
+
+/*
+ * btinitparallelscan -- initialize BTParallelScanDesc for parallel btree scan
+ */
+void
+btinitparallelscan(void *target)
+{
+	BTParallelScanDesc bt_target = (BTParallelScanDesc) target;
+
+	SpinLockInit(&bt_target->btps_mutex);
+	bt_target->btps_scanPage = InvalidBlockNumber;
+	bt_target->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	bt_target->btps_arrayKeyCount = 0;
+	ConditionVariableInit(&bt_target->btps_cv);
+}
+
+/*
+ *	btparallelrescan() -- reset parallel scan
+ */
+void
+btparallelrescan(IndexScanDesc scan)
+{
+	BTParallelScanDesc btscan;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+
+	Assert(parallel_scan);
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * In theory, we don't need to acquire the spinlock here, because there
+	 * shouldn't be any other workers running at this point, but we do so for
+	 * consistency.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = InvalidBlockNumber;
+	btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+	btscan->btps_arrayKeyCount = 0;
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
+ * _bt_parallel_seize() -- Begin the process of advancing the scan to a new
+ *		page.  Other scans must wait until we call _bt_parallel_release() or
+ *		_bt_parallel_done().
+ *
+ * The return value is true if we successfully seized the scan and false
+ * if we did not.  The latter case occurs if no pages remain for the current
+ * set of scankeys.
+ *
+ * If the return value is true, *pageno returns the next or current page
+ * of the scan (depending on the scan direction).  An invalid block number
+ * means the scan hasn't yet started, and P_NONE means we've reached the end.
+ * The first time a participating process reaches the last page, it will return
+ * true and set *pageno to P_NONE; after that, further attempts to seize the
+ * scan will return false.
+ *
+ * Callers should ignore the value of pageno if the return value is false.
+ */
+bool
+_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	BTPS_State	pageStatus;
+	bool		exit_loop = false;
+	bool		status = true;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	*pageno = P_NONE;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	while (1)
+	{
+		SpinLockAcquire(&btscan->btps_mutex);
+		pageStatus = btscan->btps_pageStatus;
+
+		if (so->arrayKeyCount < btscan->btps_arrayKeyCount)
+		{
+			/* Parallel scan has already advanced to a new set of scankeys. */
+			status = false;
+		}
+		else if (pageStatus == BTPARALLEL_DONE)
+		{
+			/*
+			 * We're done with this set of scankeys, but have not yet advanced
+			 * to the next set.
+			 */
+			status = false;
+		}
+		else if (pageStatus != BTPARALLEL_ADVANCING)
+		{
+			/*
+			 * We have successfully seized control of the scan for the purpose
+			 * of advancing it to a new page!
+			 */
+			btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
+			*pageno = btscan->btps_scanPage;
+			exit_loop = true;
+		}
+		SpinLockRelease(&btscan->btps_mutex);
+		if (exit_loop || !status)
+			break;
+		ConditionVariableSleep(&btscan->btps_cv, WAIT_EVENT_BTREE_PAGE);
+	}
+	ConditionVariableCancelSleep();
+
+	return status;
+}
+
+/*
+ * _bt_parallel_release() -- Complete the process of advancing the scan to a
+ *		new page.  We now have the new value btps_scanPage; some other backend
+ *		can now begin advancing the scan.
+ */
+void
+_bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
+{
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	SpinLockAcquire(&btscan->btps_mutex);
+	btscan->btps_scanPage = scan_page;
+	btscan->btps_pageStatus = BTPARALLEL_IDLE;
+	SpinLockRelease(&btscan->btps_mutex);
+	ConditionVariableSignal(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_done() -- Mark the parallel scan as complete.
+ *
+ * When there are no pages left to scan, this function should be called to
+ * notify other workers.  Otherwise, they might wait forever for the scan to
+ * advance to the next page.
+ */
+void
+_bt_parallel_done(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+	bool		status_changed = false;
+
+	/* Do nothing, for non-parallel scans */
+	if (parallel_scan == NULL)
+		return;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	/*
+	 * Mark the parallel scan as done for this combination of scan keys,
+	 * unless some other process already did so.  See also
+	 * _bt_advance_array_keys.
+	 */
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (so->arrayKeyCount >= btscan->btps_arrayKeyCount &&
+		btscan->btps_pageStatus != BTPARALLEL_DONE)
+	{
+		btscan->btps_pageStatus = BTPARALLEL_DONE;
+		status_changed = true;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+
+	/* wake up all the workers associated with this parallel scan */
+	if (status_changed)
+		ConditionVariableBroadcast(&btscan->btps_cv);
+}
+
+/*
+ * _bt_parallel_advance_array_keys() -- Advances the parallel scan for array
+ *			keys.
+ *
+ * Updates the count of array keys processed for both local and parallel
+ * scans.
+ */
+void
+_bt_parallel_advance_array_keys(IndexScanDesc scan)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
+	BTParallelScanDesc btscan;
+
+	btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
+												  parallel_scan->ps_offset);
+
+	so->arrayKeyCount++;
+	SpinLockAcquire(&btscan->btps_mutex);
+	if (btscan->btps_pageStatus == BTPARALLEL_DONE)
+	{
+		btscan->btps_scanPage = InvalidBlockNumber;
+		btscan->btps_pageStatus = BTPARALLEL_NOT_INITIALIZED;
+		btscan->btps_arrayKeyCount++;
+	}
+	SpinLockRelease(&btscan->btps_mutex);
+}
+
+/*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index b6459d2..2f32b2e 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -30,9 +30,13 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 			 OffsetNumber offnum, IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
+static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
+					  ScanDirection dir);
 static Buffer _bt_walk_left(Relation rel, Buffer buf, Snapshot snapshot);
 static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
 
 
 /*
@@ -544,8 +548,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	ScanKeyData notnullkeys[INDEX_MAX_KEYS];
 	int			keysCount = 0;
 	int			i;
+	bool		status = true;
 	StrategyNumber strat_total;
 	BTScanPosItem *currItem;
+	BlockNumber blkno;
 
 	Assert(!BTScanPosIsValid(so->currPos));
 
@@ -564,6 +570,30 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	if (!so->qual_ok)
 		return false;
 
+	/*
+	 * For parallel scans, get the starting page from shared state. If the
+	 * scan has not started, proceed to find out first leaf page in the usual
+	 * way while keeping other participating processes waiting.  If the scan
+	 * has already begun, use the page number from the shared structure.
+	 */
+	if (scan->parallel_scan != NULL)
+	{
+		status = _bt_parallel_seize(scan, &blkno);
+		if (!status)
+			return false;
+		else if (blkno == P_NONE)
+		{
+			_bt_parallel_done(scan);
+			return false;
+		}
+		else if (blkno != InvalidBlockNumber)
+		{
+			if (!_bt_parallel_readpage(scan, blkno, dir))
+				return false;
+			goto readcomplete;
+		}
+	}
+
 	/*----------
 	 * Examine the scan keys to discover where we need to start the scan.
 	 *
@@ -743,7 +773,19 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * there.
 	 */
 	if (keysCount == 0)
-		return _bt_endpoint(scan, dir);
+	{
+		bool		match;
+
+		match = _bt_endpoint(scan, dir);
+
+		if (!match)
+		{
+			/* No match, so mark (parallel) scan finished */
+			_bt_parallel_done(scan);
+		}
+
+		return match;
+	}
 
 	/*
 	 * We want to start the scan somewhere within the index.  Set up an
@@ -773,7 +815,10 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 
 			Assert(subkey->sk_flags & SK_ROW_MEMBER);
 			if (subkey->sk_flags & SK_ISNULL)
+			{
+				_bt_parallel_done(scan);
 				return false;
+			}
 			memcpy(scankeys + i, subkey, sizeof(ScanKeyData));
 
 			/*
@@ -993,25 +1038,21 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		 * because nothing finer to lock exists.
 		 */
 		PredicateLockRelation(rel, scan->xs_snapshot);
+
+		/*
+		 * mark parallel scan as done, so that all the workers can finish
+		 * their scan
+		 */
+		_bt_parallel_done(scan);
+		BTScanPosInvalidate(so->currPos);
+
 		return false;
 	}
 	else
 		PredicateLockPage(rel, BufferGetBlockNumber(buf),
 						  scan->xs_snapshot);
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	Assert(so->markItemIndex == -1);
+	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
 	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
@@ -1060,6 +1101,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 		_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 	}
 
+readcomplete:
 	/* OK, itemIndex says what to return */
 	currItem = &so->currPos.items[so->currPos.itemIndex];
 	scan->xs_ctup.t_self = currItem->heapTid;
@@ -1132,6 +1174,10 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
  * moreLeft or moreRight (as appropriate) is cleared if _bt_checkkeys reports
  * that there can be no more matching tuples in the current scan direction.
  *
+ * In the case of a parallel scan, caller must have called _bt_parallel_seize
+ * prior to calling this function; this function will invoke
+ * _bt_parallel_release before returning.
+ *
  * Returns true if any matching items found on the page, false if none.
  */
 static bool
@@ -1154,6 +1200,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	page = BufferGetPage(so->currPos.buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	/* allow next page to be processed by parallel worker */
+	if (scan->parallel_scan)
+	{
+		if (ScanDirectionIsForward(dir))
+			_bt_parallel_release(scan, opaque->btpo_next);
+		else
+			_bt_parallel_release(scan, BufferGetBlockNumber(so->currPos.buf));
+	}
+
 	minoff = P_FIRSTDATAKEY(opaque);
 	maxoff = PageGetMaxOffsetNumber(page);
 
@@ -1278,21 +1334,16 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
  * if pinned, we'll drop the pin before moving to next page.  The buffer is
  * not locked on entry.
  *
- * On success exit, so->currPos is updated to contain data from the next
- * interesting page.  For success on a scan using a non-MVCC snapshot we hold
- * a pin, but not a read lock, on that page.  If we do not hold the pin, we
- * set so->currPos.buf to InvalidBuffer.  We return TRUE to indicate success.
- *
- * If there are no more matching records in the given direction, we drop all
- * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ * For success on a scan using a non-MVCC snapshot we hold a pin, but not a
+ * read lock, on that page.  If we do not hold the pin, we set so->currPos.buf
+ * to InvalidBuffer.  We return TRUE to indicate success.
  */
 static bool
 _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 {
 	BTScanOpaque so = (BTScanOpaque) scan->opaque;
-	Relation	rel;
-	Page		page;
-	BTPageOpaque opaque;
+	BlockNumber blkno = InvalidBlockNumber;
+	bool		status = true;
 
 	Assert(BTScanPosIsValid(so->currPos));
 
@@ -1319,25 +1370,103 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 		so->markItemIndex = -1;
 	}
 
-	rel = scan->indexRelation;
-
 	if (ScanDirectionIsForward(dir))
 	{
 		/* Walk right to the next page with data */
-		/* We must rely on the previously saved nextPage link! */
-		BlockNumber blkno = so->currPos.nextPage;
+		if (scan->parallel_scan != NULL)
+		{
+			/*
+			 * Seize the scan to get the next block number; if the scan has
+			 * ended already, bail out.
+			 */
+			status = _bt_parallel_seize(scan, &blkno);
+			if (!status)
+			{
+				/* release the previous buffer, if pinned */
+				BTScanPosUnpinIfPinned(so->currPos);
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+		{
+			/* Not parallel, so use the previously-saved nextPage link. */
+			blkno = so->currPos.nextPage;
+		}
 
 		/* Remember we left a page with data */
 		so->currPos.moreLeft = true;
 
 		/* release the previous buffer, if pinned */
 		BTScanPosUnpinIfPinned(so->currPos);
+	}
+	else
+	{
+		/* Remember we left a page with data */
+		so->currPos.moreRight = true;
+
+		if (scan->parallel_scan != NULL)
+		{
+			/*
+			 * Seize the scan to get the current block number; if the scan has
+			 * ended already, bail out.
+			 */
+			status = _bt_parallel_seize(scan, &blkno);
+			BTScanPosUnpinIfPinned(so->currPos);
+			if (!status)
+			{
+				BTScanPosInvalidate(so->currPos);
+				return false;
+			}
+		}
+		else
+		{
+			/* Not parallel, so just use our own notion of the current page */
+			blkno = so->currPos.currPage;
+		}
+	}
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
+	/* Drop the lock, and maybe the pin, on the current page */
+	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
+	return true;
+}
+
+/*
+ *	_bt_readnextpage() -- Read next page containing valid data for scan
+ *
+ * On success exit, so->currPos is updated to contain data from the next
+ * interesting page.  Caller is responsible for releasing the lock and pin
+ * on the buffer on success.  We return TRUE to indicate success.
+ *
+ * If there are no more matching records in the given direction, we drop all
+ * locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
+ */
+static bool
+_bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+	Relation	rel;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		status = true;
+
+	rel = scan->indexRelation;
+
+	if (ScanDirectionIsForward(dir))
+	{
 		for (;;)
 		{
-			/* if we're at end of scan, give up */
+			/*
+			 * if we're at end of scan, give up and mark parallel scan as
+			 * done, so that all the workers can finish their scan
+			 */
 			if (blkno == P_NONE || !so->currPos.moreRight)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1359,14 +1488,32 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			}
 
 			/* nope, keep going */
-			blkno = opaque->btpo_next;
+			if (scan->parallel_scan != NULL)
+			{
+				status = _bt_parallel_seize(scan, &blkno);
+				if (!status)
+				{
+					_bt_relbuf(rel, so->currPos.buf);
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+			}
+			else
+				blkno = opaque->btpo_next;
 			_bt_relbuf(rel, so->currPos.buf);
 		}
 	}
 	else
 	{
-		/* Remember we left a page with data */
-		so->currPos.moreRight = true;
+		/*
+		 * Should only happen in parallel cases, when some other backend
+		 * advanced the scan.
+		 */
+		if (so->currPos.currPage != blkno)
+		{
+			BTScanPosUnpinIfPinned(so->currPos);
+			so->currPos.currPage = blkno;
+		}
 
 		/*
 		 * Walk left to the next page with data.  This is much more complex
@@ -1401,6 +1548,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			if (!so->currPos.moreLeft)
 			{
 				_bt_relbuf(rel, so->currPos.buf);
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1412,6 +1560,7 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 			/* if we're physically at end of index, return failure */
 			if (so->currPos.buf == InvalidBuffer)
 			{
+				_bt_parallel_done(scan);
 				BTScanPosInvalidate(so->currPos);
 				return false;
 			}
@@ -1432,9 +1581,46 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
 				if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
 					break;
 			}
+
+			/*
+			 * For parallel scans, get the last page scanned as it is quite
+			 * possible that by the time we try to seize the scan, some other
+			 * worker has already advanced the scan to a different page.  We
+			 * must continue based on the latest page scanned by any worker.
+			 */
+			if (scan->parallel_scan != NULL)
+			{
+				_bt_relbuf(rel, so->currPos.buf);
+				status = _bt_parallel_seize(scan, &blkno);
+				if (!status)
+				{
+					BTScanPosInvalidate(so->currPos);
+					return false;
+				}
+				so->currPos.buf = _bt_getbuf(rel, blkno, BT_READ);
+			}
 		}
 	}
 
+	return true;
+}
+
+/*
+ *	_bt_parallel_readpage() -- Read current page containing valid data for scan
+ *
+ * On success, release lock and maybe pin on buffer.  We return TRUE to
+ * indicate success.
+ */
+static bool
+_bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir)
+{
+	BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+	_bt_initialize_more_data(so, dir);
+
+	if (!_bt_readnextpage(scan, blkno, dir))
+		return false;
+
 	/* Drop the lock, and maybe the pin, on the current page */
 	_bt_drop_lock_and_maybe_pin(scan, &so->currPos);
 
@@ -1712,19 +1898,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 	/* remember which buffer we have pinned */
 	so->currPos.buf = buf;
 
-	/* initialize moreLeft/moreRight appropriately for scan direction */
-	if (ScanDirectionIsForward(dir))
-	{
-		so->currPos.moreLeft = false;
-		so->currPos.moreRight = true;
-	}
-	else
-	{
-		so->currPos.moreLeft = true;
-		so->currPos.moreRight = false;
-	}
-	so->numKilled = 0;			/* just paranoia */
-	so->markItemIndex = -1;		/* ditto */
+	_bt_initialize_more_data(so, dir);
 
 	/*
 	 * Now load data from the first page of the scan.
@@ -1753,3 +1927,25 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
 
 	return true;
 }
+
+/*
+ * _bt_initialize_more_data() -- initialize moreLeft/moreRight appropriately
+ * for scan direction
+ */
+static inline void
+_bt_initialize_more_data(BTScanOpaque so, ScanDirection dir)
+{
+	/* initialize moreLeft/moreRight appropriately for scan direction */
+	if (ScanDirectionIsForward(dir))
+	{
+		so->currPos.moreLeft = false;
+		so->currPos.moreRight = true;
+	}
+	else
+	{
+		so->currPos.moreLeft = true;
+		so->currPos.moreRight = false;
+	}
+	so->numKilled = 0;			/* just paranoia */
+	so->markItemIndex = -1;		/* ditto */
+}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index da0f330..5b259a3 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -590,6 +590,10 @@ _bt_advance_array_keys(IndexScanDesc scan, ScanDirection dir)
 			break;
 	}
 
+	/* advance parallel scan */
+	if (scan->parallel_scan != NULL)
+		_bt_parallel_advance_array_keys(scan);
+
 	return found;
 }
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..92af6ec 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3392,6 +3392,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_PARALLEL_FINISH:
 			event_name = "ParallelFinish";
 			break;
+		case WAIT_EVENT_BTREE_PAGE:
+			event_name = "ParallelBtreePage";
+			break;
 		case WAIT_EVENT_SAFE_SNAPSHOT:
 			event_name = "SafeSnapshot";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 25a1dc8..6289ffa 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -383,6 +383,8 @@ typedef struct BTScanOpaqueData
 	ScanKey		arrayKeyData;	/* modified copy of scan->keyData */
 	int			numArrayKeys;	/* number of equality-type array keys (-1 if
 								 * there are any unsatisfiable array keys) */
+	int			arrayKeyCount;	/* count indicating number of array scan keys
+								 * processed */
 	BTArrayKeyInfo *arrayKeys;	/* info about each equality-type array key */
 	MemoryContext arrayContext; /* scan-lifespan context for array data */
 
@@ -426,7 +428,7 @@ typedef BTScanOpaqueData *BTScanOpaque;
 #define SK_BT_NULLS_FIRST	(INDOPTION_NULLS_FIRST << SK_BT_INDOPTION_SHIFT)
 
 /*
- * prototypes for functions in nbtree.c (external entry points for btree)
+ * external entry points for btree, in nbtree.c
  */
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 		struct IndexInfo *indexInfo);
@@ -436,10 +438,13 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
 		 IndexUniqueCheck checkUnique,
 		 struct IndexInfo *indexInfo);
 extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern Size btestimateparallelscan(void);
+extern void btinitparallelscan(void *target);
 extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
 extern int64 btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
 extern void btrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 		 ScanKey orderbys, int norderbys);
+extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
@@ -452,6 +457,14 @@ extern IndexBulkDeleteResult *btvacuumcleanup(IndexVacuumInfo *info,
 extern bool btcanreturn(Relation index, int attno);
 
 /*
+ * prototypes for internal functions in nbtree.c
+ */
+extern bool _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno);
+extern void _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page);
+extern void _bt_parallel_done(IndexScanDesc scan);
+extern void _bt_parallel_advance_array_keys(IndexScanDesc scan);
+
+/*
  * prototypes for functions in nbtinsert.c
  */
 extern bool _bt_doinsert(Relation rel, IndexTuple itup,
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..915cc85 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -786,6 +786,7 @@ typedef enum
 	WAIT_EVENT_MQ_RECEIVE,
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
+	WAIT_EVENT_BTREE_PAGE,
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP
 } WaitEventIPC;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c4235ae..9f876ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -161,6 +161,9 @@ BTPageOpaque
 BTPageOpaqueData
 BTPageStat
 BTPageState
+BTParallelScanDesc
+BTParallelScanDescData
+BTPS_State
 BTScanOpaque
 BTScanOpaqueData
 BTScanPos
guc_parallel_index_scan_v1.patch (application/octet-stream)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index dc63d7d..b4b7f0b 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -3835,20 +3835,34 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
-     <varlistentry id="guc-min-parallel-relation-size" xreflabel="min_parallel_relation_size">
-      <term><varname>min_parallel_relation_size</varname> (<type>integer</type>)
+     <varlistentry id="guc-min-parallel-table-scan-size" xreflabel="min_parallel_table_scan_size">
+      <term><varname>min_parallel_table_scan_size</varname> (<type>integer</type>)
       <indexterm>
-       <primary><varname>min_parallel_relation_size</> configuration parameter</primary>
+       <primary><varname>min_parallel_table_scan_size</> configuration parameter</primary>
       </indexterm>
       </term>
       <listitem>
        <para>
-        Sets the minimum size of relations to be considered for parallel scan.
+        Sets the minimum size of tables to be considered for parallel scan.
         The default is 8 megabytes (<literal>8MB</>).
        </para>
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-min-parallel-index-scan-size" xreflabel="min_parallel_index_scan_size">
+      <term><varname>min_parallel_index_scan_size</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>min_parallel_index_scan_size</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Sets the minimum size of indexes to be considered for parallel scan.
+        The default is 512 kilobytes (<literal>512kB</>).
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-effective-cache-size" xreflabel="effective_cache_size">
       <term><varname>effective_cache_size</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 5c18987..85505c5 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -57,7 +57,8 @@ typedef struct pushdown_safety_info
 /* These parameters are set by GUC */
 bool		enable_geqo = false;	/* just in case GUC doesn't set it */
 int			geqo_threshold;
-int			min_parallel_relation_size;
+int			min_parallel_table_scan_size;
+int			min_parallel_index_scan_size;
 
 /* Hook for plugins to get control in set_rel_pathlist() */
 set_rel_pathlist_hook_type set_rel_pathlist_hook = NULL;
@@ -126,7 +127,8 @@ static void subquery_push_qual(Query *subquery,
 static void recurse_push_qual(Node *setOp, Query *topquery,
 				  RangeTblEntry *rte, Index rti, Node *qual);
 static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
-static int	compute_parallel_worker(RelOptInfo *rel, BlockNumber pages);
+static int compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
+						BlockNumber index_pages);
 
 
 /*
@@ -679,7 +681,7 @@ create_plain_partial_paths(PlannerInfo *root, RelOptInfo *rel)
 {
 	int			parallel_workers;
 
-	parallel_workers = compute_parallel_worker(rel, rel->pages);
+	parallel_workers = compute_parallel_worker(rel, rel->pages, 0);
 
 	/* If any limit was set to zero, the user doesn't want a parallel scan. */
 	if (parallel_workers <= 0)
@@ -2876,13 +2878,20 @@ remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
 
 /*
  * Compute the number of parallel workers that should be used to scan a
- * relation.  "pages" is the number of pages from the relation that we
- * expect to scan.
+ * relation.  We compute the parallel workers based on the size of the heap to
+ * be scanned and the size of the index to be scanned, then choose the
+ * minimum of the two.
+ *
+ * "heap_pages" is the number of pages from the table that we expect to scan.
+ * "index_pages" is the number of pages from the index that we expect to scan.
  */
 static int
-compute_parallel_worker(RelOptInfo *rel, BlockNumber pages)
+compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
+						BlockNumber index_pages)
 {
-	int			parallel_workers;
+	int			parallel_workers = 0;
+	int			heap_parallel_workers = 1;
+	int			index_parallel_workers = 1;
 
 	/*
 	 * If the user has set the parallel_workers reloption, use that; otherwise
@@ -2892,7 +2901,8 @@ compute_parallel_worker(RelOptInfo *rel, BlockNumber pages)
 		parallel_workers = rel->rel_parallel_workers;
 	else
 	{
-		int			parallel_threshold;
+		int			heap_parallel_threshold;
+		int			index_parallel_threshold;
 
 		/*
 		 * If this relation is too small to be worth a parallel scan, just
@@ -2901,25 +2911,48 @@ compute_parallel_worker(RelOptInfo *rel, BlockNumber pages)
 		 * might not be worthwhile just for this relation, but when combined
 		 * with all of its inheritance siblings it may well pay off.
 		 */
-		if (pages < (BlockNumber) min_parallel_relation_size &&
+		if (heap_pages < (BlockNumber) min_parallel_table_scan_size &&
+			index_pages < (BlockNumber) min_parallel_index_scan_size &&
 			rel->reloptkind == RELOPT_BASEREL)
 			return 0;
 
-		/*
-		 * Select the number of workers based on the log of the size of the
-		 * relation.  This probably needs to be a good deal more
-		 * sophisticated, but we need something here for now.  Note that the
-		 * upper limit of the min_parallel_relation_size GUC is chosen to
-		 * prevent overflow here.
-		 */
-		parallel_workers = 1;
-		parallel_threshold = Max(min_parallel_relation_size, 1);
-		while (pages >= (BlockNumber) (parallel_threshold * 3))
+		if (heap_pages > 0)
+		{
+			/*
+			 * Select the number of workers based on the log of the size of
+			 * the relation.  This probably needs to be a good deal more
+			 * sophisticated, but we need something here for now.  Note that
+			 * the upper limit of the min_parallel_table_scan_size GUC is
+			 * chosen to prevent overflow here.
+			 */
+			heap_parallel_threshold = Max(min_parallel_table_scan_size, 1);
+			while (heap_pages >= (BlockNumber) (heap_parallel_threshold * 3))
+			{
+				heap_parallel_workers++;
+				heap_parallel_threshold *= 3;
+				if (heap_parallel_threshold > INT_MAX / 3)
+					break;		/* avoid overflow */
+			}
+
+			parallel_workers = heap_parallel_workers;
+		}
+
+		if (index_pages > 0)
 		{
-			parallel_workers++;
-			parallel_threshold *= 3;
-			if (parallel_threshold > INT_MAX / 3)
-				break;			/* avoid overflow */
+			/* same calculation as for heap_pages above */
+			index_parallel_threshold = Max(min_parallel_index_scan_size, 1);
+			while (index_pages >= (BlockNumber) (index_parallel_threshold * 3))
+			{
+				index_parallel_workers++;
+				index_parallel_threshold *= 3;
+				if (index_parallel_threshold > INT_MAX / 3)
+					break;		/* avoid overflow */
+			}
+
+			if (parallel_workers > 0)
+				parallel_workers = Min(parallel_workers, index_parallel_workers);
+			else
+				parallel_workers = index_parallel_workers;
 		}
 	}
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de85eca..0df45b8 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2776,17 +2776,28 @@ static struct config_int ConfigureNamesInt[] =
 	},
 
 	{
-		{"min_parallel_relation_size", PGC_USERSET, QUERY_TUNING_COST,
-			gettext_noop("Sets the minimum size of relations to be considered for parallel scan."),
+		{"min_parallel_table_scan_size", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the minimum size of tables to be considered for parallel scan."),
 			NULL,
 			GUC_UNIT_BLOCKS,
 		},
-		&min_parallel_relation_size,
+		&min_parallel_table_scan_size,
 		(8 * 1024 * 1024) / BLCKSZ, 0, INT_MAX / 3,
 		NULL, NULL, NULL
 	},
 
 	{
+		{"min_parallel_index_scan_size", PGC_USERSET, QUERY_TUNING_COST,
+			gettext_noop("Sets the minimum size of indexes to be considered for parallel scan."),
+			NULL,
+			GUC_UNIT_BLOCKS,
+		},
+		&min_parallel_index_scan_size,
+		(512 * 1024) / BLCKSZ, 0, INT_MAX / 3,
+		NULL, NULL, NULL
+	},
+
+	{
 		/* Can't be set in postgresql.conf */
 		{"server_version_num", PGC_INTERNAL, PRESET_OPTIONS,
 			gettext_noop("Shows the server version as an integer."),
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 661b0fa..157d775 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -300,7 +300,8 @@
 #cpu_operator_cost = 0.0025		# same scale as above
 #parallel_tuple_cost = 0.1		# same scale as above
 #parallel_setup_cost = 1000.0	# same scale as above
-#min_parallel_relation_size = 8MB
+#min_parallel_table_scan_size = 8MB
+#min_parallel_index_scan_size = 512kB
 #effective_cache_size = 4GB
 
 # - Genetic Query Optimizer -
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 81a9be7..81e7a42 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -22,7 +22,8 @@
  */
 extern bool enable_geqo;
 extern int	geqo_threshold;
-extern int	min_parallel_relation_size;
+extern int	min_parallel_table_scan_size;
+extern int	min_parallel_index_scan_size;
 
 /* Hook for plugins to get control in set_rel_pathlist() */
 typedef void (*set_rel_pathlist_hook_type) (PlannerInfo *root,
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 18e21b7..b967b45 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -9,7 +9,7 @@ begin isolation level repeatable read;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
-set min_parallel_relation_size=0;
+set min_parallel_table_scan_size=0;
 set max_parallel_workers_per_gather=4;
 explain (costs off)
   select count(*) from a_star;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index 8b4090f..e3c32de 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -12,7 +12,7 @@ begin isolation level repeatable read;
 -- encourage use of parallel plans
 set parallel_setup_cost=0;
 set parallel_tuple_cost=0;
-set min_parallel_relation_size=0;
+set min_parallel_table_scan_size=0;
 set max_parallel_workers_per_gather=4;
 
 explain (costs off)
parallel_index_opt_exec_support_v10.patch (application/octet-stream)
diff --git a/contrib/bloom/blcost.c b/contrib/bloom/blcost.c
index 98a2228..ba39f62 100644
--- a/contrib/bloom/blcost.c
+++ b/contrib/bloom/blcost.c
@@ -24,7 +24,8 @@
 void
 blcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 			   Cost *indexStartupCost, Cost *indexTotalCost,
-			   Selectivity *indexSelectivity, double *indexCorrelation)
+			   Selectivity *indexSelectivity, double *indexCorrelation,
+			   double *indexPages)
 {
 	IndexOptInfo *index = path->indexinfo;
 	List	   *qinfos;
@@ -45,4 +46,5 @@ blcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexTotalCost = costs.indexTotalCost;
 	*indexSelectivity = costs.indexSelectivity;
 	*indexCorrelation = costs.indexCorrelation;
+	*indexPages = costs.numIndexPages;
 }
diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index 39d8d05..0cfe49a 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -208,6 +208,6 @@ extern bytea *bloptions(Datum reloptions, bool validate);
 extern void blcostestimate(PlannerInfo *root, IndexPath *path,
 			   double loop_count, Cost *indexStartupCost,
 			   Cost *indexTotalCost, Selectivity *indexSelectivity,
-			   double *indexCorrelation);
+			   double *indexCorrelation, double *indexPages);
 
 #endif
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 858798d..f2eda67 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -119,6 +119,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = blbuild;
diff --git a/doc/src/sgml/indexam.sgml b/doc/src/sgml/indexam.sgml
index 9afd7f6..401b115 100644
--- a/doc/src/sgml/indexam.sgml
+++ b/doc/src/sgml/indexam.sgml
@@ -110,6 +110,8 @@ typedef struct IndexAmRoutine
     bool        amclusterable;
     /* does AM handle predicate locks? */
     bool        ampredlocks;
+    /* does AM support parallel scan? */
+    bool        amcanparallel;
     /* type of data stored in index, or InvalidOid if variable */
     Oid         amkeytype;
 
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 4ff046b..b22563b 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -93,6 +93,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = brinbuild;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index a98d4fc..d03d59d 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -50,6 +50,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = ginbuild;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 96ead53..6593771 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -71,6 +71,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = true;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = gistbuild;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index bca77a8..24510e7 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -67,6 +67,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = INT4OID;
 
 	amroutine->ambuild = hashbuild;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a5c3c08..f6df2f3 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -140,6 +140,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
+	amroutine->amcanparallel = true;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = btbuild;
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 78846be..e57ac49 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -49,6 +49,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->amstorage = false;
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = false;
+	amroutine->amcanparallel = false;
 	amroutine->amkeytype = InvalidOid;
 
 	amroutine->ambuild = spgbuild;
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 784dbaf..98d4f1e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -28,6 +28,7 @@
 #include "executor/nodeCustom.h"
 #include "executor/nodeForeignscan.h"
 #include "executor/nodeSeqscan.h"
+#include "executor/nodeIndexscan.h"
 #include "executor/tqueue.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/planmain.h"
@@ -197,6 +198,10 @@ ExecParallelEstimate(PlanState *planstate, ExecParallelEstimateContext *e)
 				ExecSeqScanEstimate((SeqScanState *) planstate,
 									e->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanEstimate((IndexScanState *) planstate,
+									  e->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanEstimate((ForeignScanState *) planstate,
 										e->pcxt);
@@ -249,6 +254,10 @@ ExecParallelInitializeDSM(PlanState *planstate,
 				ExecSeqScanInitializeDSM((SeqScanState *) planstate,
 										 d->pcxt);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeDSM((IndexScanState *) planstate,
+										   d->pcxt);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeDSM((ForeignScanState *) planstate,
 											 d->pcxt);
@@ -725,6 +734,9 @@ ExecParallelInitializeWorker(PlanState *planstate, shm_toc *toc)
 			case T_SeqScanState:
 				ExecSeqScanInitializeWorker((SeqScanState *) planstate, toc);
 				break;
+			case T_IndexScanState:
+				ExecIndexScanInitializeWorker((IndexScanState *) planstate, toc);
+				break;
 			case T_ForeignScanState:
 				ExecForeignScanInitializeWorker((ForeignScanState *) planstate,
 												toc);
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 5734550..0a9dfdb 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -22,6 +22,9 @@
  *		ExecEndIndexScan		releases all storage.
  *		ExecIndexMarkPos		marks scan position.
  *		ExecIndexRestrPos		restores scan position.
+ *		ExecIndexScanEstimate	estimates DSM space needed for parallel index scan
+ *		ExecIndexScanInitializeDSM initialize DSM for parallel indexscan
+ *		ExecIndexScanInitializeWorker attach to DSM info in parallel worker
  */
 #include "postgres.h"
 
@@ -514,6 +517,18 @@ ExecIndexScan(IndexScanState *node)
 void
 ExecReScanIndexScan(IndexScanState *node)
 {
+	bool		reset_parallel_scan = true;
+
+	/*
+	 * If we are here to just update the scan keys, then don't reset parallel
+	 * scan.  We don't want each of the participating process in the parallel
+	 * scan to update the shared parallel scan state at the start of the scan.
+	 * It is quite possible that one of the participants has already begun
+	 * scanning the index when another has yet to start it.
+	 */
+	if (node->iss_NumRuntimeKeys != 0 && !node->iss_RuntimeKeysReady)
+		reset_parallel_scan = false;
+
 	/*
 	 * If we are doing runtime key calculations (ie, any of the index key
 	 * values weren't simple Consts), compute the new key values.  But first,
@@ -539,10 +554,21 @@ ExecReScanIndexScan(IndexScanState *node)
 			reorderqueue_pop(node);
 	}
 
-	/* reset index scan */
-	index_rescan(node->iss_ScanDesc,
-				 node->iss_ScanKeys, node->iss_NumScanKeys,
-				 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+	/*
+	 * Reset (parallel) index scan.  For parallel-aware nodes, the scan
+	 * descriptor is initialized during actual execution of node and we can
+	 * reach here before that (ex. during execution of nest loop join).  So,
+	 * avoid updating the scan descriptor at that time.
+	 */
+	if (node->iss_ScanDesc)
+	{
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+
+		if (reset_parallel_scan && node->iss_ScanDesc->parallel_scan)
+			index_parallelrescan(node->iss_ScanDesc);
+	}
 	node->iss_ReachedEnd = false;
 
 	ExecScanReScan(&node->ss);
@@ -1013,22 +1039,29 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	}
 
 	/*
-	 * Initialize scan descriptor.
+	 * for parallel-aware node, we initialize the scan descriptor after
+	 * initializing the shared memory for parallel execution.
 	 */
-	indexstate->iss_ScanDesc = index_beginscan(currentRelation,
-											   indexstate->iss_RelationDesc,
-											   estate->es_snapshot,
-											   indexstate->iss_NumScanKeys,
+	if (!node->scan.plan.parallel_aware)
+	{
+		/*
+		 * Initialize scan descriptor.
+		 */
+		indexstate->iss_ScanDesc = index_beginscan(currentRelation,
+												indexstate->iss_RelationDesc,
+												   estate->es_snapshot,
+												 indexstate->iss_NumScanKeys,
 											 indexstate->iss_NumOrderByKeys);
 
-	/*
-	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
-	 * index AM.
-	 */
-	if (indexstate->iss_NumRuntimeKeys == 0)
-		index_rescan(indexstate->iss_ScanDesc,
-					 indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
+		/*
+		 * If no run-time keys to calculate, go ahead and pass the scankeys to
+		 * the index AM.
+		 */
+		if (indexstate->iss_NumRuntimeKeys == 0)
+			index_rescan(indexstate->iss_ScanDesc,
+					   indexstate->iss_ScanKeys, indexstate->iss_NumScanKeys,
 				indexstate->iss_OrderByKeys, indexstate->iss_NumOrderByKeys);
+	}
 
 	/*
 	 * all done.
@@ -1590,3 +1623,91 @@ ExecIndexBuildScanKeys(PlanState *planstate, Relation index,
 	else if (n_array_keys != 0)
 		elog(ERROR, "ScalarArrayOpExpr index qual found where not allowed");
 }
+
+/* ----------------------------------------------------------------
+ *						Parallel Scan Support
+ * ----------------------------------------------------------------
+ */
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanEstimate
+ *
+ *		estimates the space required to serialize indexscan node.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanEstimate(IndexScanState *node,
+					  ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+
+	node->iss_PscanLen = index_parallelscan_estimate(node->iss_RelationDesc,
+													 estate->es_snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, node->iss_PscanLen);
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeDSM
+ *
+ *		Set up a parallel index scan descriptor.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeDSM(IndexScanState *node,
+						   ParallelContext *pcxt)
+{
+	EState	   *estate = node->ss.ps.state;
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
+	index_parallelscan_initialize(node->ss.ss_currentRelation,
+								  node->iss_RelationDesc,
+								  estate->es_snapshot,
+								  piscan);
+	shm_toc_insert(pcxt->toc, node->ss.ps.plan->plan_node_id, piscan);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecIndexScanInitializeWorker
+ *
+ *		Copy relevant information from TOC into planstate.
+ * ----------------------------------------------------------------
+ */
+void
+ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc)
+{
+	ParallelIndexScanDesc piscan;
+
+	piscan = shm_toc_lookup(toc, node->ss.ps.plan->plan_node_id);
+	node->iss_ScanDesc =
+		index_beginscan_parallel(node->ss.ss_currentRelation,
+								 node->iss_RelationDesc,
+								 node->iss_NumScanKeys,
+								 node->iss_NumOrderByKeys,
+								 piscan);
+
+	/*
+	 * If no run-time keys to calculate, go ahead and pass the scankeys to the
+	 * index AM.
+	 */
+	if (node->iss_NumRuntimeKeys == 0)
+		index_rescan(node->iss_ScanDesc,
+					 node->iss_ScanKeys, node->iss_NumScanKeys,
+					 node->iss_OrderByKeys, node->iss_NumOrderByKeys);
+}
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 85505c5..eeacf81 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -127,8 +127,6 @@ static void subquery_push_qual(Query *subquery,
 static void recurse_push_qual(Node *setOp, Query *topquery,
 				  RangeTblEntry *rte, Index rti, Node *qual);
 static void remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel);
-static int compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
-						BlockNumber index_pages);
 
 
 /*
@@ -2885,7 +2883,7 @@ remove_unused_subquery_outputs(Query *subquery, RelOptInfo *rel)
  * "heap_pages" is the number of pages from the table that we expect to scan.
  * "index_pages" is the number of pages from the index that we expect to scan.
  */
-static int
+int
 compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
 						BlockNumber index_pages)
 {
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index a43daa7..d01630f 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -391,7 +391,8 @@ cost_gather(GatherPath *path, PlannerInfo *root,
  * we have to fetch from the table, so they don't reduce the scan cost.
  */
 void
-cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
+cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
+		   bool partial_path)
 {
 	IndexOptInfo *index = path->indexinfo;
 	RelOptInfo *baserel = index->rel;
@@ -400,6 +401,7 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	List	   *qpquals;
 	Cost		startup_cost = 0;
 	Cost		run_cost = 0;
+	Cost		cpu_run_cost = 0;
 	Cost		indexStartupCost;
 	Cost		indexTotalCost;
 	Selectivity indexSelectivity;
@@ -413,6 +415,8 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	Cost		cpu_per_tuple;
 	double		tuples_fetched;
 	double		pages_fetched;
+	double		rand_heap_pages;
+	double		index_pages;
 
 	/* Should only be applied to base relations */
 	Assert(IsA(baserel, RelOptInfo) &&
@@ -459,7 +463,8 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	amcostestimate = (amcostestimate_function) index->amcostestimate;
 	amcostestimate(root, path, loop_count,
 				   &indexStartupCost, &indexTotalCost,
-				   &indexSelectivity, &indexCorrelation);
+				   &indexSelectivity, &indexCorrelation,
+				   &index_pages);
 
 	/*
 	 * Save amcostestimate's results for possible use in bitmap scan planning.
@@ -526,6 +531,8 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 		if (indexonly)
 			pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
 
+		rand_heap_pages = pages_fetched;
+
 		max_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
 
 		/*
@@ -564,6 +571,8 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 		if (indexonly)
 			pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
 
+		rand_heap_pages = pages_fetched;
+
 		/* max_IO_cost is for the perfectly uncorrelated case (csquared=0) */
 		max_IO_cost = pages_fetched * spc_random_page_cost;
 
@@ -583,6 +592,29 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 			min_IO_cost = 0;
 	}
 
+	if (partial_path)
+	{
+		/*
+		 * Estimate the number of parallel workers required to scan index. Use
+		 * the number of heap pages computed considering heap fetches won't be
+		 * sequential as for parallel scans the pages are accessed in random
+		 * order.
+		 */
+		path->path.parallel_workers = compute_parallel_worker(baserel,
+											   (BlockNumber) rand_heap_pages,
+												  (BlockNumber) index_pages);
+
+		/*
+		 * Fall out if workers can't be assigned for parallel scan, because in
+		 * such a case this path will be rejected.  So there is no benefit in
+		 * doing extra computation.
+		 */
+		if (path->path.parallel_workers <= 0)
+			return;
+
+		path->path.parallel_aware = true;
+	}
+
 	/*
 	 * Now interpolate based on estimated index order correlation to get total
 	 * disk I/O cost for main table accesses.
@@ -602,11 +634,24 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count)
 	startup_cost += qpqual_cost.startup;
 	cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
 
-	run_cost += cpu_per_tuple * tuples_fetched;
+	cpu_run_cost += cpu_per_tuple * tuples_fetched;
 
 	/* tlist eval costs are paid per output row, not per tuple scanned */
 	startup_cost += path->path.pathtarget->cost.startup;
-	run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+	cpu_run_cost += path->path.pathtarget->cost.per_tuple * path->path.rows;
+
+	/* Adjust costing for parallelism, if used. */
+	if (path->path.parallel_workers > 0)
+	{
+		double		parallel_divisor = get_parallel_divisor(&path->path);
+
+		path->path.rows = clamp_row_est(path->path.rows / parallel_divisor);
+
+		/* The CPU cost is divided among all the workers. */
+		cpu_run_cost /= parallel_divisor;
+	}
+
+	run_cost += cpu_run_cost;
 
 	path->path.startup_cost = startup_cost;
 	path->path.total_cost = startup_cost + run_cost;
diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 5283468..0e9cc0a 100644
--- a/src/backend/optimizer/path/indxpath.c
+++ b/src/backend/optimizer/path/indxpath.c
@@ -813,7 +813,7 @@ get_index_paths(PlannerInfo *root, RelOptInfo *rel,
 /*
  * build_index_paths
  *	  Given an index and a set of index clauses for it, construct zero
- *	  or more IndexPaths.
+ *	  or more IndexPaths. It also constructs zero or more partial IndexPaths.
  *
  * We return a list of paths because (1) this routine checks some cases
  * that should cause us to not generate any IndexPath, and (2) in some
@@ -1042,8 +1042,43 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 								  NoMovementScanDirection,
 								  index_only_scan,
 								  outer_relids,
-								  loop_count);
+								  loop_count,
+								  false);
 		result = lappend(result, ipath);
+
+		/*
+		 * If appropriate, consider parallel index scan.  We don't allow
+		 * parallel index scan for bitmap or index only scans.
+		 */
+		if (index->amcanparallel &&
+			!index_only_scan &&
+			rel->consider_parallel &&
+			outer_relids == NULL &&
+			scantype != ST_BITMAPSCAN)
+		{
+			ipath = create_index_path(root, index,
+									  index_clauses,
+									  clause_columns,
+									  orderbyclauses,
+									  orderbyclausecols,
+									  useful_pathkeys,
+									  index_is_ordered ?
+									  ForwardScanDirection :
+									  NoMovementScanDirection,
+									  index_only_scan,
+									  outer_relids,
+									  loop_count,
+									  true);
+
+			/*
+			 * consider path for parallelism only when it is beneficial to
+			 * engage workers to scan the underlying relation.
+			 */
+			if (ipath->path.parallel_workers > 0)
+				add_partial_path(rel, (Path *) ipath);
+			else
+				pfree(ipath);
+		}
 	}
 
 	/*
@@ -1066,8 +1101,34 @@ build_index_paths(PlannerInfo *root, RelOptInfo *rel,
 									  BackwardScanDirection,
 									  index_only_scan,
 									  outer_relids,
-									  loop_count);
+									  loop_count,
+									  false);
 			result = lappend(result, ipath);
+
+			/* If appropriate, consider parallel index scan */
+			if (index->amcanparallel &&
+				!index_only_scan &&
+				rel->consider_parallel &&
+				outer_relids == NULL &&
+				scantype != ST_BITMAPSCAN)
+			{
+				ipath = create_index_path(root, index,
+										  index_clauses,
+										  clause_columns,
+										  NIL,
+										  NIL,
+										  useful_pathkeys,
+										  BackwardScanDirection,
+										  index_only_scan,
+										  outer_relids,
+										  loop_count,
+										  true);
+
+				if (ipath->path.parallel_workers > 0)
+					add_partial_path(rel, (Path *) ipath);
+				else
+					pfree(ipath);
+			}
 		}
 	}
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index abb4f12..3d33d46 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -5333,7 +5333,7 @@ plan_cluster_use_sort(Oid tableOid, Oid indexOid)
 	indexScanPath = create_index_path(root, indexInfo,
 									  NIL, NIL, NIL, NIL, NIL,
 									  ForwardScanDirection, false,
-									  NULL, 1.0);
+									  NULL, 1.0, false);
 
 	return (seqScanAndSortPath.total_cost < indexScanPath->path.total_cost);
 }
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index f440875..c36522d 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -744,10 +744,9 @@ add_path_precheck(RelOptInfo *parent_rel,
  *	  As with add_path, we pfree paths that are found to be dominated by
  *	  another partial path; this requires that there be no other references to
  *	  such paths yet.  Hence, GatherPaths must not be created for a rel until
- *	  we're done creating all partial paths for it.  We do not currently build
- *	  partial indexscan paths, so there is no need for an exception for
- *	  IndexPaths here; for safety, we instead Assert that a path to be freed
- *	  isn't an IndexPath.
+ *	  we're done creating all partial paths for it.  Unlike add_path, we don't
+ *	  take an exception for IndexPaths as partial index paths won't be
+ *	  referenced by partial BitmapHeapPaths.
  */
 void
 add_partial_path(RelOptInfo *parent_rel, Path *new_path)
@@ -826,8 +825,6 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 		{
 			parent_rel->partial_pathlist =
 				list_delete_cell(parent_rel->partial_pathlist, p1, p1_prev);
-			/* we should not see IndexPaths here, so always safe to delete */
-			Assert(!IsA(old_path, IndexPath));
 			pfree(old_path);
 			/* p1_prev does not advance */
 		}
@@ -860,8 +857,6 @@ add_partial_path(RelOptInfo *parent_rel, Path *new_path)
 	}
 	else
 	{
-		/* we should not see IndexPaths here, so always safe to delete */
-		Assert(!IsA(new_path, IndexPath));
 		/* Reject and recycle the new path */
 		pfree(new_path);
 	}
@@ -1005,6 +1000,7 @@ create_samplescan_path(PlannerInfo *root, RelOptInfo *rel, Relids required_outer
  * 'required_outer' is the set of outer relids for a parameterized path.
  * 'loop_count' is the number of repetitions of the indexscan to factor into
  *		estimates of caching behavior.
+ * 'partialpath' is true if the parallel scan is expected.
  *
  * Returns the new path node.
  */
@@ -1019,7 +1015,8 @@ create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count)
+				  double loop_count,
+				  bool partialpath)
 {
 	IndexPath  *pathnode = makeNode(IndexPath);
 	RelOptInfo *rel = index->rel;
@@ -1049,7 +1046,7 @@ create_index_path(PlannerInfo *root,
 	pathnode->indexorderbycols = indexorderbycols;
 	pathnode->indexscandir = indexscandir;
 
-	cost_index(pathnode, root, loop_count);
+	cost_index(pathnode, root, loop_count, partialpath);
 
 	return pathnode;
 }
@@ -3247,7 +3244,7 @@ reparameterize_path(PlannerInfo *root, Path *path,
 				memcpy(newpath, ipath, sizeof(IndexPath));
 				newpath->path.param_info =
 					get_baserel_parampathinfo(root, rel, required_outer);
-				cost_index(newpath, root, loop_count);
+				cost_index(newpath, root, loop_count, false);
 				return (Path *) newpath;
 			}
 		case T_BitmapHeapScan:
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 7836e6b..4ed2705 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -241,6 +241,7 @@ get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
 			info->amoptionalkey = amroutine->amoptionalkey;
 			info->amsearcharray = amroutine->amsearcharray;
 			info->amsearchnulls = amroutine->amsearchnulls;
+			info->amcanparallel = amroutine->amcanparallel;
 			info->amhasgettuple = (amroutine->amgettuple != NULL);
 			info->amhasgetbitmap = (amroutine->amgetbitmap != NULL);
 			info->amcostestimate = amroutine->amcostestimate;
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index fa32e9e..d14f0f9 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6471,7 +6471,8 @@ add_predicate_to_quals(IndexOptInfo *index, List *indexQuals)
 void
 btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 			   Cost *indexStartupCost, Cost *indexTotalCost,
-			   Selectivity *indexSelectivity, double *indexCorrelation)
+			   Selectivity *indexSelectivity, double *indexCorrelation,
+			   double *indexPages)
 {
 	IndexOptInfo *index = path->indexinfo;
 	List	   *qinfos;
@@ -6761,12 +6762,14 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexTotalCost = costs.indexTotalCost;
 	*indexSelectivity = costs.indexSelectivity;
 	*indexCorrelation = costs.indexCorrelation;
+	*indexPages = costs.numIndexPages;
 }
 
 void
 hashcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				 Cost *indexStartupCost, Cost *indexTotalCost,
-				 Selectivity *indexSelectivity, double *indexCorrelation)
+				 Selectivity *indexSelectivity, double *indexCorrelation,
+				 double *indexPages)
 {
 	List	   *qinfos;
 	GenericCosts costs;
@@ -6807,12 +6810,14 @@ hashcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexTotalCost = costs.indexTotalCost;
 	*indexSelectivity = costs.indexSelectivity;
 	*indexCorrelation = costs.indexCorrelation;
+	*indexPages = costs.numIndexPages;
 }
 
 void
 gistcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				 Cost *indexStartupCost, Cost *indexTotalCost,
-				 Selectivity *indexSelectivity, double *indexCorrelation)
+				 Selectivity *indexSelectivity, double *indexCorrelation,
+				 double *indexPages)
 {
 	IndexOptInfo *index = path->indexinfo;
 	List	   *qinfos;
@@ -6866,12 +6871,14 @@ gistcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexTotalCost = costs.indexTotalCost;
 	*indexSelectivity = costs.indexSelectivity;
 	*indexCorrelation = costs.indexCorrelation;
+	*indexPages = costs.numIndexPages;
 }
 
 void
 spgcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				Cost *indexStartupCost, Cost *indexTotalCost,
-				Selectivity *indexSelectivity, double *indexCorrelation)
+				Selectivity *indexSelectivity, double *indexCorrelation,
+				double *indexPages)
 {
 	IndexOptInfo *index = path->indexinfo;
 	List	   *qinfos;
@@ -6925,6 +6932,7 @@ spgcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexTotalCost = costs.indexTotalCost;
 	*indexSelectivity = costs.indexSelectivity;
 	*indexCorrelation = costs.indexCorrelation;
+	*indexPages = costs.numIndexPages;
 }
 
 
@@ -7222,7 +7230,8 @@ gincost_scalararrayopexpr(PlannerInfo *root,
 void
 gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				Cost *indexStartupCost, Cost *indexTotalCost,
-				Selectivity *indexSelectivity, double *indexCorrelation)
+				Selectivity *indexSelectivity, double *indexCorrelation,
+				double *indexPages)
 {
 	IndexOptInfo *index = path->indexinfo;
 	List	   *indexQuals = path->indexquals;
@@ -7537,6 +7546,7 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexStartupCost += qual_arg_cost;
 	*indexTotalCost += qual_arg_cost;
 	*indexTotalCost += (numTuples * *indexSelectivity) * (cpu_index_tuple_cost + qual_op_cost);
+	*indexPages = dataPagesFetched;
 }
 
 /*
@@ -7545,7 +7555,8 @@ gincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 void
 brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 				 Cost *indexStartupCost, Cost *indexTotalCost,
-				 Selectivity *indexSelectivity, double *indexCorrelation)
+				 Selectivity *indexSelectivity, double *indexCorrelation,
+				 double *indexPages)
 {
 	IndexOptInfo *index = path->indexinfo;
 	List	   *indexQuals = path->indexquals;
@@ -7597,6 +7608,7 @@ brincostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
 	*indexStartupCost += qual_arg_cost;
 	*indexTotalCost += qual_arg_cost;
 	*indexTotalCost += (numTuples * *indexSelectivity) * (cpu_index_tuple_cost + qual_op_cost);
+	*indexPages = index->pages;
 
 	/* XXX what about pages_per_range? */
 }
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index b0730bf..f919cf8 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -95,7 +95,8 @@ typedef void (*amcostestimate_function) (struct PlannerInfo *root,
 													 Cost *indexStartupCost,
 													 Cost *indexTotalCost,
 											   Selectivity *indexSelectivity,
-												   double *indexCorrelation);
+													 double *indexCorrelation,
+													 double *indexPages);
 
 /* parse index reloptions */
 typedef bytea *(*amoptions_function) (Datum reloptions,
@@ -188,6 +189,8 @@ typedef struct IndexAmRoutine
 	bool		amclusterable;
 	/* does AM handle predicate locks? */
 	bool		ampredlocks;
+	/* does AM support parallel scan? */
+	bool		amcanparallel;
 	/* type of data stored in index, or InvalidOid if variable */
 	Oid			amkeytype;
 
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 46d6f45..ea3f3a5 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -14,6 +14,7 @@
 #ifndef NODEINDEXSCAN_H
 #define NODEINDEXSCAN_H
 
+#include "access/parallel.h"
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
@@ -22,6 +23,9 @@ extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
 extern void ExecReScanIndexScan(IndexScanState *node);
+extern void ExecIndexScanEstimate(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeDSM(IndexScanState *node, ParallelContext *pcxt);
+extern void ExecIndexScanInitializeWorker(IndexScanState *node, shm_toc *toc);
 
 /*
  * These routines are exported to share code with nodeIndexonlyscan.c and
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 42c6c58..b3a3c22 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1363,6 +1363,7 @@ typedef struct
  *		SortSupport		   for reordering ORDER BY exprs
  *		OrderByTypByVals   is the datatype of order by expression pass-by-value?
  *		OrderByTypLens	   typlens of the datatypes of order by expressions
+ *		pscan_len		   size of parallel index scan descriptor
  * ----------------
  */
 typedef struct IndexScanState
@@ -1389,6 +1390,9 @@ typedef struct IndexScanState
 	SortSupport iss_SortSupport;
 	bool	   *iss_OrderByTypByVals;
 	int16	   *iss_OrderByTypLens;
+
+	/* This is needed for parallel index scan */
+	Size		iss_PscanLen;
 } IndexScanState;
 
 /* ----------------
diff --git a/src/include/nodes/relation.h b/src/include/nodes/relation.h
index 643be54..f7ac6f6 100644
--- a/src/include/nodes/relation.h
+++ b/src/include/nodes/relation.h
@@ -629,6 +629,7 @@ typedef struct IndexOptInfo
 	bool		amsearchnulls;	/* can AM search for NULL/NOT NULL entries? */
 	bool		amhasgettuple;	/* does AM have amgettuple interface? */
 	bool		amhasgetbitmap; /* does AM have amgetbitmap interface? */
+	bool		amcanparallel;	/* does AM support parallel scan? */
 	/* Rather than include amapi.h here, we declare amcostestimate like this */
 	void		(*amcostestimate) ();	/* AM's cost estimator */
 } IndexOptInfo;
diff --git a/src/include/optimizer/cost.h b/src/include/optimizer/cost.h
index 0e68264..72200fa 100644
--- a/src/include/optimizer/cost.h
+++ b/src/include/optimizer/cost.h
@@ -76,7 +76,7 @@ extern void cost_seqscan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 extern void cost_samplescan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 				ParamPathInfo *param_info);
 extern void cost_index(IndexPath *path, PlannerInfo *root,
-		   double loop_count);
+		   double loop_count, bool partial_path);
 extern void cost_bitmap_heap_scan(Path *path, PlannerInfo *root, RelOptInfo *baserel,
 					  ParamPathInfo *param_info,
 					  Path *bitmapqual, double loop_count);
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 7b41317..4362c7a 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -47,7 +47,8 @@ extern IndexPath *create_index_path(PlannerInfo *root,
 				  ScanDirection indexscandir,
 				  bool indexonly,
 				  Relids required_outer,
-				  double loop_count);
+				  double loop_count,
+				  bool partialpath);
 extern BitmapHeapPath *create_bitmap_heap_path(PlannerInfo *root,
 						RelOptInfo *rel,
 						Path *bitmapqual,
diff --git a/src/include/optimizer/paths.h b/src/include/optimizer/paths.h
index 81e7a42..ebda308 100644
--- a/src/include/optimizer/paths.h
+++ b/src/include/optimizer/paths.h
@@ -54,6 +54,8 @@ extern RelOptInfo *standard_join_search(PlannerInfo *root, int levels_needed,
 					 List *initial_rels);
 
 extern void generate_gather_paths(PlannerInfo *root, RelOptInfo *rel);
+extern int compute_parallel_worker(RelOptInfo *rel, BlockNumber heap_pages,
+						BlockNumber index_pages);
 
 #ifdef OPTIMIZER_DEBUG
 extern void debug_print_rel(PlannerInfo *root, RelOptInfo *rel);
diff --git a/src/include/utils/index_selfuncs.h b/src/include/utils/index_selfuncs.h
index d317242..17d165c 100644
--- a/src/include/utils/index_selfuncs.h
+++ b/src/include/utils/index_selfuncs.h
@@ -28,41 +28,47 @@ extern void brincostestimate(struct PlannerInfo *root,
 				 Cost *indexStartupCost,
 				 Cost *indexTotalCost,
 				 Selectivity *indexSelectivity,
-				 double *indexCorrelation);
+				 double *indexCorrelation,
+				 double *indexPages);
 extern void btcostestimate(struct PlannerInfo *root,
 			   struct IndexPath *path,
 			   double loop_count,
 			   Cost *indexStartupCost,
 			   Cost *indexTotalCost,
 			   Selectivity *indexSelectivity,
-			   double *indexCorrelation);
+			   double *indexCorrelation,
+			   double *indexPages);
 extern void hashcostestimate(struct PlannerInfo *root,
 				 struct IndexPath *path,
 				 double loop_count,
 				 Cost *indexStartupCost,
 				 Cost *indexTotalCost,
 				 Selectivity *indexSelectivity,
-				 double *indexCorrelation);
+				 double *indexCorrelation,
+				 double *indexPages);
 extern void gistcostestimate(struct PlannerInfo *root,
 				 struct IndexPath *path,
 				 double loop_count,
 				 Cost *indexStartupCost,
 				 Cost *indexTotalCost,
 				 Selectivity *indexSelectivity,
-				 double *indexCorrelation);
+				 double *indexCorrelation,
+				 double *indexPages);
 extern void spgcostestimate(struct PlannerInfo *root,
 				struct IndexPath *path,
 				double loop_count,
 				Cost *indexStartupCost,
 				Cost *indexTotalCost,
 				Selectivity *indexSelectivity,
-				double *indexCorrelation);
+				double *indexCorrelation,
+				double *indexPages);
 extern void gincostestimate(struct PlannerInfo *root,
 				struct IndexPath *path,
 				double loop_count,
 				Cost *indexStartupCost,
 				Cost *indexTotalCost,
 				Selectivity *indexSelectivity,
-				double *indexCorrelation);
+				double *indexCorrelation,
+				double *indexPages);
 
 #endif   /* INDEX_SELFUNCS_H */
diff --git a/src/test/regress/expected/select_parallel.out b/src/test/regress/expected/select_parallel.out
index 3692d4f..48fb80e 100644
--- a/src/test/regress/expected/select_parallel.out
+++ b/src/test/regress/expected/select_parallel.out
@@ -125,6 +125,29 @@ select count(*) from tenk1 where (two, four) not in
 (1 row)
 
 alter table tenk2 reset (parallel_workers);
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+                             QUERY PLAN                             
+--------------------------------------------------------------------
+ Finalize Aggregate
+   ->  Gather
+         Workers Planned: 4
+         ->  Partial Aggregate
+               ->  Parallel Index Scan using tenk1_hundred on tenk1
+                     Index Cond: (hundred > 1)
+(6 rows)
+
+select  count((unique1)) from tenk1 where hundred > 1;
+ count 
+-------
+  9800
+(1 row)
+
+reset enable_seqscan;
+reset enable_bitmapscan;
 set force_parallel_mode=1;
 explain (costs off)
   select stringu1::int2 from tenk1 where unique1 = 1;
diff --git a/src/test/regress/sql/select_parallel.sql b/src/test/regress/sql/select_parallel.sql
index f4f9dd5..f5bc4d1 100644
--- a/src/test/regress/sql/select_parallel.sql
+++ b/src/test/regress/sql/select_parallel.sql
@@ -48,6 +48,17 @@ select count(*) from tenk1 where (two, four) not in
 	(select hundred, thousand from tenk2 where thousand > 100);
 alter table tenk2 reset (parallel_workers);
 
+-- test parallel index scans.
+set enable_seqscan to off;
+set enable_bitmapscan to off;
+
+explain (costs off)
+	select  count((unique1)) from tenk1 where hundred > 1;
+select  count((unique1)) from tenk1 where hundred > 1;
+
+reset enable_seqscan;
+reset enable_bitmapscan;
+
 set force_parallel_mode=1;
 
 explain (costs off)
#81Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#80)
Re: Parallel Index Scans

On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Here the second part of the comment (but have not yet advanced ..) seems
>> to be slightly misleading because this state has nothing to do with
>> the advancement of scan keys.
>
> I have not changed this because I am not sure what you have in mind.

OK, I rewrote that to be (hopefully) more clear.

> I have verified all your changes and they look good to me.

Cool. Committed. I also changed the wait event to be BtreePage in
the docs + pg_stat_activity, and moved it into alphabetical order in
the switch and the enum.
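
For anyone who wants to check that from the monitoring side, a worker
waiting for the next btree page should show up in pg_stat_activity
roughly like this (illustrative query only, not something the patch adds):

    SELECT pid, wait_event_type, wait_event
    FROM pg_stat_activity
    WHERE wait_event = 'BtreePage';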

>> I can't easily test this because your second patch doesn't apply,
>
> I have tried and it works for me on the latest code except for one test
> output file which could have been excluded. I wonder whether you are
> first applying the GUC related patch [1] before applying the optimizer
> support related patch. In any case, to avoid confusion I am attaching
> all three patches with this e-mail.

Oh, duh. I forgot about the prerequisite patch. Sorry.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#82Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#80)
Re: Parallel Index Scans

On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> support related patch. In any case, to avoid confusion I am attaching
> all three patches with this e-mail.

Committed guc_parallel_index_scan_v1.patch, with changes to the
documentation and GUC descriptions.
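
For reference, the renamed and new settings from that patch can be set
per session like this (the values shown are just the defaults from
postgresql.conf.sample):

    SET min_parallel_table_scan_size = '8MB';
    SET min_parallel_index_scan_size = '512kB';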

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#83Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#82)
Re: Parallel Index Scans

On Wed, Feb 15, 2017 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> support related patch. In any case, to avoid confusion I am attaching
>> all three patches with this e-mail.
>
> Committed guc_parallel_index_scan_v1.patch, with changes to the
> documentation and GUC descriptions.

And committed parallel_index_opt_exec_support_v10.patch as well, with
a couple of minor tweaks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#84Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#83)
Re: Parallel Index Scans

On Thu, Feb 16, 2017 at 12:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Feb 15, 2017 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> support related patch. In any case, to avoid confusion I am attaching
>>> all three patches with this e-mail.
>>
>> Committed guc_parallel_index_scan_v1.patch, with changes to the
>> documentation and GUC descriptions.
>
> And committed parallel_index_opt_exec_support_v10.patch as well, with
> a couple of minor tweaks.

Thanks a lot! I think this is a big step forward for parallelism in
PostgreSQL. Now, we have another way to drive parallel scans and I
hope many more queries can use parallelism.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#85Michael Banck
michael.banck@credativ.de
In reply to: Amit Kapila (#84)
Re: Parallel Index Scans

Hi,

On Thu, Feb 16, 2017 at 08:14:28AM +0530, Amit Kapila wrote:
> On Thu, Feb 16, 2017 at 12:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Feb 15, 2017 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>> support related patch. In any case, to avoid confusion I am attaching
>>>> all three patches with this e-mail.
>>>
>>> Committed guc_parallel_index_scan_v1.patch, with changes to the
>>> documentation and GUC descriptions.
>>
>> And committed parallel_index_opt_exec_support_v10.patch as well, with
>> a couple of minor tweaks.
>
> Thanks a lot! I think this is a big step forward for parallelism in
> PostgreSQL. Now, we have another way to drive parallel scans and I
> hope many more queries can use parallelism.

Shouldn't chapter 15.3 "Parallel Plans" in the documentation be updated
for this as well, or is this going to be taken care of as a batch at
the end of the development cycle, pending other parallelization
features being added?

Michael

--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer


#86Amit Kapila
amit.kapila16@gmail.com
In reply to: Michael Banck (#85)
Re: Parallel Index Scans

On Mon, Mar 6, 2017 at 4:57 PM, Michael Banck <michael.banck@credativ.de> wrote:
> Hi,
>
> On Thu, Feb 16, 2017 at 08:14:28AM +0530, Amit Kapila wrote:
>> On Thu, Feb 16, 2017 at 12:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Wed, Feb 15, 2017 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>> support related patch. In any case, to avoid confusion I am attaching
>>>>> all three patches with this e-mail.
>>>>
>>>> Committed guc_parallel_index_scan_v1.patch, with changes to the
>>>> documentation and GUC descriptions.
>>>
>>> And committed parallel_index_opt_exec_support_v10.patch as well, with
>>> a couple of minor tweaks.
>>
>> Thanks a lot! I think this is a big step forward for parallelism in
>> PostgreSQL. Now, we have another way to drive parallel scans and I
>> hope many more queries can use parallelism.
>
> Shouldn't chapter 15.3 "Parallel Plans" in the documentation be updated
> for this as well, or is this going to be taken care of as a batch at
> the end of the development cycle, pending other parallelization
> features being added?

Robert mentioned upthread that it is better to update it once at the
end rather than with each feature.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#87Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#86)
Re: Parallel Index Scans

On Mon, Mar 6, 2017 at 6:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Mar 6, 2017 at 4:57 PM, Michael Banck <michael.banck@credativ.de> wrote:
>> Hi,
>>
>> On Thu, Feb 16, 2017 at 08:14:28AM +0530, Amit Kapila wrote:
>>> On Thu, Feb 16, 2017 at 12:27 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> On Wed, Feb 15, 2017 at 1:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>>> On Wed, Feb 15, 2017 at 7:11 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>>>>> support related patch. In any case, to avoid confusion I am attaching
>>>>>> all three patches with this e-mail.
>>>>>
>>>>> Committed guc_parallel_index_scan_v1.patch, with changes to the
>>>>> documentation and GUC descriptions.
>>>>
>>>> And committed parallel_index_opt_exec_support_v10.patch as well, with
>>>> a couple of minor tweaks.
>>>
>>> Thanks a lot! I think this is a big step forward for parallelism in
>>> PostgreSQL. Now, we have another way to drive parallel scans and I
>>> hope many more queries can use parallelism.
>>
>> Shouldn't chapter 15.3 "Parallel Plans" in the documentation be updated
>> for this as well, or is this going to be taken care of as a batch at
>> the end of the development cycle, pending other parallelization
>> features being added?
>
> Robert mentioned upthread that it is better to update it once at the
> end rather than with each feature.

I was going to do it after index and index-only scans and parallel
bitmap heap scan were committed, but I've been holding off on
committing parallel bitmap heap scan waiting for Andres to fix the
simplehash regressions, so maybe I should just go do an update now and
another one later once that goes in.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#88Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#87)
Re: Parallel Index Scans

On Mon, Mar 6, 2017 at 6:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Mar 6, 2017 at 6:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I was going to do it after index and index-only scans and parallel
> bitmap heap scan were committed, but I've been holding off on
> committing parallel bitmap heap scan waiting for Andres to fix the
> simplehash regressions, so maybe I should just go do an update now and
> another one later once that goes in.

As you wish, but one point to note is that as of now parallelism for
index scans can be influenced by the table-level parameter
parallel_workers. It sounds slightly awkward, but if we want to keep
it that way, then maybe we can update the docs to indicate the same.
Another option is to have a separate parameter for index scans.
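
For example (illustrative only, using the regression test table tenk1),
the same table-level knob that already controls parallel seq scans also
drives the worker count chosen for a parallel index scan on that table:

    SET enable_seqscan = off;
    SET enable_bitmapscan = off;
    ALTER TABLE tenk1 SET (parallel_workers = 2);
    EXPLAIN (COSTS OFF)
      SELECT count(unique1) FROM tenk1 WHERE hundred > 1;
    ALTER TABLE tenk1 RESET (parallel_workers);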

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


#89Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: Amit Kapila (#88)
Re: Parallel Index Scans

On 07/03/17 02:46, Amit Kapila wrote:
> On Mon, Mar 6, 2017 at 6:49 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Mar 6, 2017 at 6:33 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I was going to do it after index and index-only scans and parallel
>> bitmap heap scan were committed, but I've been holding off on
>> committing parallel bitmap heap scan waiting for Andres to fix the
>> simplehash regressions, so maybe I should just go do an update now and
>> another one later once that goes in.
>
> As you wish, but one point to note is that as of now parallelism for
> index scans can be influenced by the table-level parameter
> parallel_workers. It sounds slightly awkward, but if we want to keep
> it that way, then maybe we can update the docs to indicate the same.
> Another option is to have a separate parameter for index scans.

My immediate gut feeling was to have separate parameters.

On thinking about it, I think they serve different use cases. I don't
think of workers when I think of Index scans, and I suspect I'd find
more reasons to keep them separate if I looked deeper.

Cheers,
Gavin
