WIP: parallel GiST index builds

Started by Tomas Vondra · over 1 year ago · 24 messages
#1 Tomas Vondra
tomas.vondra@enterprisedb.com
1 attachment(s)

Hi,

After looking into parallel builds for BRIN and GIN indexes, I was
wondering if there's a way to do parallel builds for GiST too. I knew
next to nothing about how GiST works, but I gave it a shot and here's
what I have - the attached patch allows parallel GiST builds for the
"unsorted" case (i.e. when the opclass does not include sortsupport),
and does not support buffered builds.

unsorted builds only
--------------------

Addressing only the unsorted case may seem a bit weird, but I did it
this way for two reasons - parallel sort is a solved problem, and adding
it to the patch seems quite straightforward. It's what btree does, for
example. But I also wasn't very sure how common sorted builds are - we
do have sortsupport for points, but I had no idea if the PostGIS
opclasses define sorting etc. My guess was "no", but I've been told
that's no longer true, so I guess sorted builds are more widely
applicable than I thought.
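For context, the sorted path is chosen in gistbuild() only when every
key column's opclass provides the sortsupport procedure (support
function 11). Roughly paraphrased (names shortened, not the exact code):

    /* paraphrased from gistbuild() - use the sorted build only when
     * all key columns have a GIST_SORTSUPPORT_PROC */
    bool    hassortsupport = true;

    for (int i = 0; i < IndexRelationGetNumberOfKeyAttributes(index); i++)
    {
        if (!OidIsValid(index_getprocid(index, i + 1,
                                        GIST_SORTSUPPORT_PROC)))
        {
            hassortsupport = false;
            break;
        }
    }

    /* this patch only parallelizes the !hassortsupport path */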

In any case, I'm not in a rush to parallelize sorted builds. It can be
added later, as an improvement, IMHO. In fact, it's a well isolated part
of the patch, which might make it a good choice for someone looking for
an idea for their first patch ...

buffered builds
---------------

The lack of support for buffered builds is a very different thing. The
basic idea is that we don't push the index entries all the way to the
leaf pages right away, but accumulate them in buffers half-way through.
This combines writes and reduces random I/O, which is nice.

Unfortunately, the way it's implemented does not work with parallel
builds at all - all the state is in private memory, and it assumes the
worker is the only possible backend that can split the page (at which
point the buffers need to be split too, etc.). But for parallel builds
this is obviously not true.

I'm not saying parallel builds can't do similar buffering, but it
requires moving the buffers into shared memory, and introducing locking
to coordinate accesses to the buffers. (Or perhaps it might be enough to
only "notify" the workers about page splits, with buffers still in
private memory?). Anyway, it seems far too complicated for v1.

In fact, I'm not sure the buffering is entirely necessary - maybe the
increase in the amount of RAM makes this less of an issue? If the index
fit into shared buffers (or at least page cache), maybe the amount of
extra I/O is not that bad? I'm sure there may be cases really affected
by this, but maybe it's OK to tell people to disable parallel builds in
those cases?

gistGetFakeLSN
--------------

One more thing - GiST disables WAL-logging during the build, and only
logs it once at the end. For serial builds this is fine, because there
are no concurrent splits, and so we don't need to rely on page LSNs to
detect these cases (in fact, it uses a bogus value).

But for parallel builds this would not work - we need page LSNs that
actually change, otherwise we'd miss page splits, and the index build
would either fail or produce a broken index. But the existing is_build
flag affects both things, so I had to introduce a new "is_parallel" flag
which only affects the page LSN part, using the gistGetFakeLSN()
function, previously used only for unlogged indexes.

This means we'll produce WAL during the index build (because
gistGetFakeLSN() writes a trivial message into WAL). Compared to the
serial builds this produces maybe 25-75% more WAL, but it's an order of
magnitude less than with "full" WAL logging (is_build=false).

For example, a serial build of a 5GB index needs ~5GB of WAL. A parallel
build may need ~7GB, while a parallel build with "full" logging would
use 50GB. I think this is a reasonable trade-off.

There's one "strange" thing, though - the amount of WAL decreases as
the number of parallel workers increases. Consider for example an index
on a numeric field, where the index is ~9GB, but the amount of WAL
changes like this (0 workers means a serial build):

parallel workers      0      1      3      5      7
WAL (GB)            5.7    9.2    7.6    7.0    6.8

The explanation for this is fairly simple (AFAIK) - gistGetFakeLSN
determines if it needs to actually assign a new LSN (and write stuff to
WAL) by comparing the last LSN assigned (in a given worker) to the
current insert LSN. But the current insert LSN might have been updated
by some other worker, in which case we simply use that. Which means that
multiple workers may use the same fake LSN, and the likelihood increases
with the number of workers - and this is consistent with the observed
behavior of the WAL decreasing as the number of workers increases
(because more workers use the same LSN).
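
To make that concrete, the relevant part of gistGetFakeLSN() for
permanent relations looks like this (simplified; note lastlsn is
backend-local, while the insert LSN is global):

    static XLogRecPtr lastlsn = InvalidXLogRecPtr;
    XLogRecPtr  currlsn = GetXLogInsertRecPtr();

    /* no need for an actual record if we already have a distinct LSN */
    if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
        currlsn = gistXLogAssignLSN();  /* writes the trivial message */

    lastlsn = currlsn;
    return currlsn;

So whenever one worker writes the trivial WAL record, that advances the
insert LSN for all the other workers, which then skip writing a record
of their own and just reuse the LSN.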

I'm not entirely sure if this is OK or a problem. I was worried two
workers might end up using the same LSN for the same page, leading to
other workers not noticing the split. But after a week of pretty
intensive stress testing, I haven't seen a single such failure ...

If this turns out to be a problem, the fix is IMHO quite simple - it
should be enough to force gistGetFakeLSN() to produce a new fake LSN
every time when is_parallel=true.
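
That is, in the snippet above the condition would become something like
this (untested sketch, not part of the attached patch):

    /* always emit a new record (and thus a unique LSN) when parallel */
    if (is_parallel ||
        (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn))
        currlsn = gistXLogAssignLSN();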

performance
-----------

Obviously, the primary goal of the patch is to speed up the builds, so
does it actually do that? For indexes of different sizes I got these
timings (in seconds):

scale    type          0       1       3       5       7
------------------------------------------------------------------
small    inet         13       7       4       4       2
         numeric     239     122      67      46      36
         oid          15       8       5       3       2
         text         71      35      19      13      10
medium   inet        207     111      59      42      32
         numeric    3409    1714     885     618     490
         oid         214     114      60      43      33
         text        940     479     247     180     134
large    inet       2167    1459     865     632     468
         numeric   38125   20256   10454    7487    5846
         oid        2647    1490     808     594     475
         text      10987    6298    3376    2462    1961

Here small is ~100-200MB index, medium is 1-2GB and large 10-20GB index,
depending on the data type.

The raw duration is not particularly easy to interpret, so let's look at
the "actual speedup" which is calculated as

(serial duration) / (parallel duration)
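
For example, the large numeric index with 3 workers gives
38125 / 10454 = ~3.6.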

and the table looks like this:

scale    type         1      3      5      7
--------------------------------------------------------------
small    inet       1.9    3.3    3.3    6.5
         numeric    2.0    3.6    5.2    6.6
         oid        1.9    3.0    5.0    7.5
         text       2.0    3.7    5.5    7.1
medium   inet       1.9    3.5    4.9    6.5
         numeric    2.0    3.9    5.5    7.0
         oid        1.9    3.6    5.0    6.5
         text       2.0    3.8    5.2    7.0
large    inet       1.5    2.5    3.4    4.6
         numeric    1.9    3.6    5.1    6.5
         oid        1.8    3.3    4.5    5.6
         text       1.7    3.3    4.5    5.6

Ideally (if the build scaled linearly with the number of workers), we'd
get the number of workers + 1 (because the leader participates too).
Obviously, it's not that great - for example, for text on the large
data set we get 3.3 with 3 workers instead of 4.0, and 5.6 instead of
8.0 with 7 workers.

But I think those numbers are actually pretty good - I'd definitely not
complain if my index builds got 5x faster.

But those are synthetic tests on random data, using the btree_gist
opclasses. It'd be interesting if people could do their own testing on
real-world data sets.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v20240607-0001-WIP-parallel-GiST-build.patch (text/x-patch)
From f684556a910f566a8c6a6ea0dd588173ab94a245 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Sun, 26 May 2024 21:44:27 +0200
Subject: [PATCH v20240607] WIP parallel GiST build

Implements parallel GiST index build for the unsorted case. The build
simply starts parallel workers that insert values into the index the
usual way (as if there were multiple clients doing INSERT).

The basic infrastructure is copied from parallel BRIN builds (and also
from the nearby parallel GIN build). There's nothing particularly
special or interesting, except for the gistBuildParallelCallback()
callback. The two significant changes in the callback are:

1) disabling buffering

Buffered builds assume the worker is the only backend that can split
index pages etc. With a serial build that is trivially true, but with
parallel workers this leads to confusion.

In principle this is solvable by moving the buffers into shared memory
and coordinating the workers (locking etc.). But the patch does not do
that yet - it's clearly non-trivial, and I'm not really convinced it's
worth it.

2) generating "proper" fake LSNs

The serial builds disable all WAL-logging for the index, until the very
end when the whole index is WAL-logged. This however also means we don't
set page LSNs on the index pages - but page LSNs are used to detect
concurrent changes to the index structure (e.g. page splits). For serial
builds this does not matter, because only the build worker can modify
the index, so it just sets the same LSN "1" for all pages. Both of these
(disabling WAL-logging and using bogus page LSNs) are controlled by the
same is_build flag.

Having the same page LSN does not work for parallel builds, as it would
mean workers won't notice splits done by other workers, etc.

One option would be to set is_build=false, which enables WAL-logging, as
if during regular inserts, and also assigns proper page LSNs. But we
don't want to WAL-log everything, that's unnecessary. We want to only
start WAL-logging the index once the build completes, just like for
serial builds. And only do the fake LSNs, as for unlogged indexes etc.

So this introduces a separate flag is_parallel, which forces generating
the "proper" fake LSN. But we can still do is_build=true, and only log
the index at the end of the build.
---
 src/backend/access/gist/gist.c        |  37 +-
 src/backend/access/gist/gistbuild.c   | 713 +++++++++++++++++++++++++-
 src/backend/access/gist/gistutil.c    |  10 +-
 src/backend/access/gist/gistvacuum.c  |   6 +-
 src/backend/access/transam/parallel.c |   4 +
 src/include/access/gist_private.h     |  12 +-
 src/tools/pgindent/typedefs.list      |   2 +
 7 files changed, 739 insertions(+), 45 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index ed4ffa63a77..f5f56fb2503 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -75,7 +75,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = true;
 	amroutine->amusemaintenanceworkmem = false;
 	amroutine->amsummarizing = false;
@@ -182,7 +182,7 @@ gistinsert(Relation r, Datum *values, bool *isnull,
 						 values, isnull, true /* size is currently bogus */ );
 	itup->t_tid = *ht_ctid;
 
-	gistdoinsert(r, itup, 0, giststate, heapRel, false);
+	gistdoinsert(r, itup, 0, giststate, heapRel, false, false);
 
 	/* cleanup */
 	MemoryContextSwitchTo(oldCxt);
@@ -230,7 +230,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 				List **splitinfo,
 				bool markfollowright,
 				Relation heapRel,
-				bool is_build)
+				bool is_build,
+				bool is_parallel)
 {
 	BlockNumber blkno = BufferGetBlockNumber(buffer);
 	Page		page = BufferGetPage(buffer);
@@ -501,9 +502,17 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * smaller than any real or fake unlogged LSN that might be generated
 		 * later. (There can't be any concurrent scans during index build, so
 		 * we don't need to be able to detect concurrent splits yet.)
+		 *
+		 * However, with a parallel index build, we need to assign valid LSN,
+		 * as it's used to detect concurrent index modifications.
 		 */
 		if (is_build)
-			recptr = GistBuildLSN;
+		{
+			if (is_parallel)
+				recptr = gistGetFakeLSN(rel, is_parallel);
+			else
+				recptr = GistBuildLSN;
+		}
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -511,7 +520,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = gistGetFakeLSN(rel, false);
 		}
 
 		for (ptr = dist; ptr; ptr = ptr->next)
@@ -570,7 +579,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			MarkBufferDirty(leftchildbuf);
 
 		if (is_build)
-			recptr = GistBuildLSN;
+		{
+			if (is_parallel)
+				recptr = gistGetFakeLSN(rel, is_parallel);
+			else
+				recptr = GistBuildLSN;
+		}
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -589,7 +603,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 										leftchildbuf);
 			}
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = gistGetFakeLSN(rel, false);
 		}
 		PageSetLSN(page, recptr);
 
@@ -632,7 +646,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
  */
 void
 gistdoinsert(Relation r, IndexTuple itup, Size freespace,
-			 GISTSTATE *giststate, Relation heapRel, bool is_build)
+			 GISTSTATE *giststate, Relation heapRel, bool is_build,
+			 bool is_parallel)
 {
 	ItemId		iid;
 	IndexTuple	idxtuple;
@@ -646,6 +661,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	state.r = r;
 	state.heapRel = heapRel;
 	state.is_build = is_build;
+	state.is_parallel = is_parallel;
 
 	/* Start from the root */
 	firststack.blkno = GIST_ROOT_BLKNO;
@@ -1303,7 +1319,8 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 							   &splitinfo,
 							   true,
 							   state->heapRel,
-							   state->is_build);
+							   state->is_build,
+							   state->is_parallel);
 
 	/*
 	 * Before recursing up in case the page was split, release locks on the
@@ -1722,7 +1739,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			PageSetLSN(page, recptr);
 		}
 		else
-			PageSetLSN(page, gistGetFakeLSN(rel));
+			PageSetLSN(page, gistGetFakeLSN(rel, false));
 
 		END_CRIT_SECTION();
 	}
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index ba06df30faf..c8fa67beebb 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -36,18 +36,28 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
-
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
 
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIST_SHARED		UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000004)
+
 /* Step of index tuples for check whether to switch to buffering build mode */
 #define BUFFERING_MODE_SWITCH_CHECK_STEP 256
 
@@ -78,6 +88,106 @@ typedef enum
 	GIST_BUFFERING_ACTIVE,		/* in buffering build mode */
 } GistBuildMode;
 
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GISTShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 *
+	 * XXX nparticipants is the number of workers we expect to participate in
+	 * the build, possibly including the leader process.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			nparticipants;
+
+	/* Parameters determined by the leader, passed to the workers. */
+	GistBuildMode buildMode;
+	int			freespace;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can finish
+	 * building the index.
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields before heapdesc.
+	 *
+	 * These fields contain status information of interest to GIST index
+	 * builds that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GISTShared;
+
+/*
+ * Return pointer to a GISTShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGistShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GISTShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GISTLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipants is the exact number of worker processes successfully
+	 * launched, plus one leader process if it participates as a worker (only
+	 * DISABLE_LEADER_PARTICIPATION builds avoid leader participating as a
+	 * worker).
+	 *
+	 * XXX Seems a bit redundant with nparticipants in GISTShared. Although
+	 * that is the expected number, this is what we actually got.
+	 */
+	int			nparticipants;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GISTShared is the shared state for entire build. snapshot is the
+	 * snapshot used by the scan iff an MVCC snapshot is required.
+	 */
+	GISTShared *gistshared;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GISTLeader;
+
 /* Working state for gistbuild and its callback */
 typedef struct
 {
@@ -100,6 +210,14 @@ typedef struct
 	GISTBuildBuffers *gfbb;
 	HTAB	   *parentMap;
 
+	/*
+	 * gist_leader is only present when a parallel index build is performed,
+	 * and only in the leader process. (Actually, only the leader process has
+	 * a GISTBuildState.)
+	 */
+	bool		is_parallel;
+	GISTLeader *gist_leader;
+
 	/*
 	 * Extra data structures used during a sorting build.
 	 */
@@ -148,6 +266,12 @@ static void gistBuildCallback(Relation index,
 							  bool *isnull,
 							  bool tupleIsAlive,
 							  void *state);
+static void gistBuildParallelCallback(Relation index,
+									  ItemPointer tid,
+									  Datum *values,
+									  bool *isnull,
+									  bool tupleIsAlive,
+									  void *state);
 static void gistBufferingBuildInsert(GISTBuildState *buildstate,
 									 IndexTuple itup);
 static bool gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
@@ -171,6 +295,18 @@ static void gistMemorizeAllDownlinks(GISTBuildState *buildstate,
 									 Buffer parentbuf);
 static BlockNumber gistGetParent(GISTBuildState *buildstate, BlockNumber child);
 
+/* parallel index builds */
+static void _gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
+								 bool isconcurrent, int request);
+static void _gist_end_parallel(GISTLeader *gistleader, GISTBuildState *state);
+static Size _gist_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gist_parallel_heapscan(GISTBuildState *buildstate);
+static void _gist_leader_participate_as_worker(GISTBuildState *buildstate,
+											   Relation heap, Relation index);
+static void _gist_parallel_scan_and_build(GISTBuildState *buildstate,
+										  GISTShared *gistshared,
+										  Relation heap, Relation index,
+										  int workmem, bool progress);
 
 /*
  * Main entry point to GiST index build.
@@ -199,6 +335,10 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.sortstate = NULL;
 	buildstate.giststate = initGISTstate(index);
 
+	/* assume serial build */
+	buildstate.is_parallel = false;
+	buildstate.gist_leader = NULL;
+
 	/*
 	 * Create a temporary memory context that is reset once for each tuple
 	 * processed.  (Note: we don't bother to make this a child of the
@@ -309,37 +449,79 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 		END_CRIT_SECTION();
 
-		/* Scan the table, inserting all the tuples to the index. */
-		reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
-										   gistBuildCallback,
-										   (void *) &buildstate, NULL);
-
 		/*
-		 * If buffering was used, flush out all the tuples that are still in
-		 * the buffers.
+		 * Attempt to launch parallel worker scan when required
+		 *
+		 * XXX plan_create_index_workers makes the number of workers dependent
+		 * on maintenance_work_mem, requiring 32MB for each worker. That makes
+		 * sense for btree, but maybe not for GIST (at least when not using
+		 * buffering)? So maybe make that somehow less strict, optionally?
 		 */
-		if (buildstate.buildMode == GIST_BUFFERING_ACTIVE)
-		{
-			elog(DEBUG1, "all tuples processed, emptying buffers");
-			gistEmptyAllBuffers(&buildstate);
-			gistFreeBuildBuffers(buildstate.gfbb);
-		}
+		if (indexInfo->ii_ParallelWorkers > 0)
+			_gist_begin_parallel(&buildstate, heap,
+								 index, indexInfo->ii_Concurrent,
+								 indexInfo->ii_ParallelWorkers);
 
 		/*
-		 * We didn't write WAL records as we built the index, so if
-		 * WAL-logging is required, write all pages to the WAL now.
+		 * If parallel build requested and at least one worker process was
+		 * successfully launched, set up coordination state, wait for workers
+		 * to complete and end the parallel build.
+		 *
+		 * In serial mode, simply scan the table and build the index one index
+		 * tuple at a time.
 		 */
-		if (RelationNeedsWAL(index))
+		if (buildstate.gist_leader)
 		{
-			log_newpage_range(index, MAIN_FORKNUM,
-							  0, RelationGetNumberOfBlocks(index),
-							  true);
+			/* scan the relation and wait for parallel workers to finish */
+			reltuples = _gist_parallel_heapscan(&buildstate);
+
+			_gist_end_parallel(buildstate.gist_leader, &buildstate);
+
+			/*
+			 * We didn't write WAL records as we built the index, so if WAL-logging is
+			 * required, write all pages to the WAL now.
+			 */
+			if (RelationNeedsWAL(index))
+			{
+				log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+			}
 		}
-	}
+		else
+		{
+			/* Scan the table, inserting all the tuples to the index. */
+			reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+											   gistBuildCallback,
+											   (void *) &buildstate, NULL);
 
-	/* okay, all heap tuples are indexed */
-	MemoryContextSwitchTo(oldcxt);
-	MemoryContextDelete(buildstate.giststate->tempCxt);
+			/*
+			 * If buffering was used, flush out all the tuples that are still
+			 * in the buffers.
+			 */
+			if (buildstate.buildMode == GIST_BUFFERING_ACTIVE)
+			{
+				elog(DEBUG1, "all tuples processed, emptying buffers");
+				gistEmptyAllBuffers(&buildstate);
+				gistFreeBuildBuffers(buildstate.gfbb);
+			}
+
+			/*
+			 * We didn't write WAL records as we built the index, so if
+			 * WAL-logging is required, write all pages to the WAL now.
+			 */
+			if (RelationNeedsWAL(index))
+			{
+				log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+			}
+
+			/* okay, all heap tuples are indexed */
+			MemoryContextSwitchTo(oldcxt);
+			MemoryContextDelete(buildstate.giststate->tempCxt);
+		}
+	}
 
 	freeGISTstate(buildstate.giststate);
 
@@ -861,7 +1043,7 @@ gistBuildCallback(Relation index,
 		 * locked, we call gistdoinsert directly.
 		 */
 		gistdoinsert(index, itup, buildstate->freespace,
-					 buildstate->giststate, buildstate->heaprel, true);
+					 buildstate->giststate, buildstate->heaprel, true, false);
 	}
 
 	MemoryContextSwitchTo(oldCtx);
@@ -900,6 +1082,48 @@ gistBuildCallback(Relation index,
 	}
 }
 
+/*
+ * Per-tuple callback for table_index_build_scan.
+ *
+ * XXX Almost the same as gistBuildCallback, but with is_build=false when
+ * calling gistdoinsert. Otherwise we get assert failures due to workers
+ * modifying the index concurrently.
+ */
+static void
+gistBuildParallelCallback(Relation index,
+						  ItemPointer tid,
+						  Datum *values,
+						  bool *isnull,
+						  bool tupleIsAlive,
+						  void *state)
+{
+	GISTBuildState *buildstate = (GISTBuildState *) state;
+	IndexTuple	itup;
+	MemoryContext oldCtx;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->giststate->tempCxt);
+
+	/* form an index tuple and point it at the heap tuple */
+	itup = gistFormTuple(buildstate->giststate, index,
+						 values, isnull,
+						 true);
+	itup->t_tid = *tid;
+
+	/* Update tuple count and total size. */
+	buildstate->indtuples += 1;
+	buildstate->indtuplesSize += IndexTupleSize(itup);
+
+	/*
+	 * There's no buffers (yet). Since we already have the index relation
+	 * locked, we call gistdoinsert directly.
+	 */
+	gistdoinsert(index, itup, buildstate->freespace,
+				 buildstate->giststate, buildstate->heaprel, true, true);
+
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(buildstate->giststate->tempCxt);
+}
+
 /*
  * Insert function for buffering index build.
  */
@@ -1068,7 +1292,8 @@ gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer, int level,
 							   InvalidBuffer,
 							   &splitinfo,
 							   false,
-							   buildstate->heaprel, true);
+							   buildstate->heaprel, true,
+							   buildstate->is_parallel);
 
 	/*
 	 * If this is a root split, update the root path item kept in memory. This
@@ -1577,3 +1802,439 @@ gistGetParent(GISTBuildState *buildstate, BlockNumber child)
 
 	return entry->parentblkno;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's gistLeader, which caller must use to shut down parallel
+ * mode by passing it to _gist_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
+					 bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			nparticipants;
+	Snapshot	snapshot;
+	Size		estgistshared;
+	GISTShared *gistshared;
+	GISTLeader *gistleader = (GISTLeader *) palloc0(sizeof(GISTLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of GIST
+	 * index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gist_parallel_build_main",
+								 request);
+
+	nparticipants = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIST_SHARED workspace.
+	 */
+	estgistshared = _gist_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estgistshared);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	gistshared = (GISTShared *) shm_toc_allocate(pcxt->toc, estgistshared);
+	/* Initialize immutable state */
+	gistshared->heaprelid = RelationGetRelid(heap);
+	gistshared->indexrelid = RelationGetRelid(index);
+	gistshared->isconcurrent = isconcurrent;
+	gistshared->nparticipants = nparticipants;
+
+	/* */
+	gistshared->buildMode = buildstate->buildMode;
+	gistshared->freespace = buildstate->freespace;
+
+	ConditionVariableInit(&gistshared->workersdonecv);
+	SpinLockInit(&gistshared->mutex);
+
+	/* Initialize mutable state */
+	gistshared->nparticipantsdone = 0;
+	gistshared->reltuples = 0.0;
+	gistshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGistShared(gistshared),
+								  snapshot);
+
+	/* Store shared state, for which we reserved space. */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIST_SHARED, gistshared);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	gistleader->pcxt = pcxt;
+	gistleader->nparticipants = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		gistleader->nparticipants++;
+	gistleader->gistshared = gistshared;
+	gistleader->snapshot = snapshot;
+	gistleader->walusage = walusage;
+	gistleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gist_end_parallel(gistleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->is_parallel = true;
+	buildstate->gist_leader = gistleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gist_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gist_end_parallel(GISTLeader *gistleader, GISTBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(gistleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < gistleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&gistleader->bufferusage[i], &gistleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(gistleader->snapshot))
+		UnregisterSnapshot(gistleader->snapshot);
+	DestroyParallelContext(gistleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gist_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe needs to flush data if GIST_BUFFERING_ACTIVE, a bit like in
+ * the serial build?
+ */
+static double
+_gist_parallel_heapscan(GISTBuildState *state)
+{
+	GISTShared *gistshared = state->gist_leader->gistshared;
+	int			nparticipants;
+
+	nparticipants = state->gist_leader->nparticipants;
+	for (;;)
+	{
+		SpinLockAcquire(&gistshared->mutex);
+		if (gistshared->nparticipantsdone == nparticipants)
+		{
+			/* copy the data into leader state */
+			state->indtuples = gistshared->indtuples;
+
+			SpinLockRelease(&gistshared->mutex);
+			break;
+		}
+		SpinLockRelease(&gistshared->mutex);
+
+		ConditionVariableSleep(&gistshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->indtuples;
+}
+
+
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gist index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gist_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GISTShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gist_leader_participate_as_worker(GISTBuildState *buildstate,
+								   Relation heap, Relation index)
+{
+	GISTLeader *gistleader = buildstate->gist_leader;
+	int			workmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	workmem = maintenance_work_mem / gistleader->nparticipants;
+
+	/* Perform work common to all participants */
+	_gist_parallel_scan_and_build(buildstate, gistleader->gistshared,
+								  heap, index, workmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel scan and insert.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gist_parallel_scan_and_build(GISTBuildState *state,
+							  GISTShared *gistshared,
+							  Relation heap, Relation index,
+							  int workmem, bool progress)
+{
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+	MemoryContext oldcxt = CurrentMemoryContext;
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = gistshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGistShared(gistshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+									   gistBuildParallelCallback, state, scan);
+
+	/*
+	 * If buffering was used, flush out all the tuples that are still in the
+	 * buffers.
+	 */
+	if (state->buildMode == GIST_BUFFERING_ACTIVE)
+	{
+		elog(DEBUG1, "all tuples processed, emptying buffers");
+		gistEmptyAllBuffers(state);
+		gistFreeBuildBuffers(state->gfbb);
+	}
+
+	/* okay, all heap tuples are indexed */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextDelete(state->giststate->tempCxt);
+
+	/* FIXME Do we need to do something else with active buffering? */
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&gistshared->mutex);
+	gistshared->nparticipantsdone++;
+	gistshared->reltuples += reltuples;
+	gistshared->indtuples += state->indtuples;
+	SpinLockRelease(&gistshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&gistshared->workersdonecv);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gist_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GISTShared *gistshared;
+	GISTBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			workmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up GIST shared state */
+	gistshared = shm_toc_lookup(toc, PARALLEL_KEY_GIST_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!gistshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(gistshared->heaprelid, heapLockmode);
+	indexRel = index_open(gistshared->indexrelid, indexLockmode);
+
+	buildstate.indexrel = indexRel;
+	buildstate.heaprel = heapRel;
+	buildstate.sortstate = NULL;
+	buildstate.giststate = initGISTstate(indexRel);
+
+	buildstate.is_parallel = true;
+	buildstate.gist_leader = NULL;
+
+	/*
+	 * Create a temporary memory context that is reset once for each tuple
+	 * processed.  (Note: we don't bother to make this a child of the
+	 * giststate's scanCxt, so we have to delete it separately at the end.)
+	 */
+	buildstate.giststate->tempCxt = createTempGistContext();
+
+	/* FIXME */
+	buildstate.buildMode = gistshared->buildMode;
+	buildstate.freespace = gistshared->freespace;
+
+	buildstate.indtuples = 0;
+	buildstate.indtuplesSize = 0;
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	workmem = maintenance_work_mem / gistshared->nparticipants;
+
+	_gist_parallel_scan_and_build(&buildstate, gistshared,
+								  heapRel, indexRel, workmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 78e98d68b15..733d5849317 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1012,7 +1012,7 @@ gistproperty(Oid index_oid, int attno,
  * purpose.
  */
 XLogRecPtr
-gistGetFakeLSN(Relation rel)
+gistGetFakeLSN(Relation rel, bool is_parallel)
 {
 	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
 	{
@@ -1035,8 +1035,12 @@ gistGetFakeLSN(Relation rel)
 		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
 		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
 
-		/* Shouldn't be called for WAL-logging relations */
-		Assert(!RelationNeedsWAL(rel));
+		/*
+		 * Shouldn't be called for WAL-logging relations, but parallel
+		 * builds are an exception - we need the fake LSN to detect
+		 * concurrent changes.
+		 */
+		Assert(is_parallel || !RelationNeedsWAL(rel));
 
 		/* No need for an actual record if we already have a distinct LSN */
 		if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 24fb94f473e..082804e9c7d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -181,7 +181,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(rel))
 		vstate.startNSN = GetInsertRecPtr();
 	else
-		vstate.startNSN = gistGetFakeLSN(rel);
+		vstate.startNSN = gistGetFakeLSN(rel, false);
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -376,7 +376,7 @@ restart:
 				PageSetLSN(page, recptr);
 			}
 			else
-				PageSetLSN(page, gistGetFakeLSN(rel));
+				PageSetLSN(page, gistGetFakeLSN(rel, false));
 
 			END_CRIT_SECTION();
 
@@ -664,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(info->index))
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
-		recptr = gistGetFakeLSN(info->index);
+		recptr = gistGetFakeLSN(info->index, false);
 	PageSetLSN(parentPage, recptr);
 	PageSetLSN(leafPage, recptr);
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb54..7e09fc79c30 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gist_private.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gist_parallel_build_main", _gist_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 7b8749c8db0..d5b22bc1018 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -20,6 +20,7 @@
 #include "lib/pairingheap.h"
 #include "storage/bufmgr.h"
 #include "storage/buffile.h"
+#include "storage/shm_toc.h"
 #include "utils/hsearch.h"
 #include "access/genam.h"
 
@@ -254,6 +255,7 @@ typedef struct
 	Relation	heapRel;
 	Size		freespace;		/* free space to be left */
 	bool		is_build;
+	bool		is_parallel;
 
 	GISTInsertStack *stack;
 } GISTInsertState;
@@ -413,7 +415,8 @@ extern void gistdoinsert(Relation r,
 						 Size freespace,
 						 GISTSTATE *giststate,
 						 Relation heapRel,
-						 bool is_build);
+						 bool is_build,
+						 bool is_parallel);
 
 /* A List of these is returned from gistplacetopage() in *splitinfo */
 typedef struct
@@ -430,7 +433,8 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 							List **splitinfo,
 							bool markfollowright,
 							Relation heapRel,
-							bool is_build);
+							bool is_build,
+							bool is_parallel);
 
 extern SplitPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 								  int len, GISTSTATE *giststate);
@@ -531,7 +535,7 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 							 GISTENTRY *entry2, bool isnull2,
 							 Datum *dst, bool *dstisnull);
 
-extern XLogRecPtr gistGetFakeLSN(Relation rel);
+extern XLogRecPtr gistGetFakeLSN(Relation rel, bool is_parallel);
 
 /* gistvacuum.c */
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
@@ -568,4 +572,6 @@ extern void gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb,
 											List *splitinfo);
 extern void gistUnloadNodeBuffers(GISTBuildBuffers *gfbb);
 
+extern void _gist_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIST_PRIVATE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 4f57078d133..a4985e98585 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -971,6 +971,7 @@ GISTInsertStack
 GISTInsertState
 GISTIntArrayBigOptions
 GISTIntArrayOptions
+GISTLeader
 GISTNodeBuffer
 GISTNodeBufferPage
 GISTPageOpaque
@@ -981,6 +982,7 @@ GISTScanOpaque
 GISTScanOpaqueData
 GISTSearchHeapItem
 GISTSearchItem
+GISTShared
 GISTTYPE
 GIST_SPLITVEC
 GMReaderTupleBuffer
-- 
2.45.0

#2 Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#1)
3 attachment(s)
Re: WIP: parallel GiST index builds

Hi,

I've done a number of experiments with the GiST parallel builds, both
with the sorted and unsorted cases, so let me share some of the results
and conclusions from that.

In the first post I did some benchmarks using btree_gist, but that
seemed not very realistic - there certainly are much more widely used
GiST indexes in the GIS world. So this time I used OpenStreetMap, loaded
using osm2pgsql, with two dataset sizes:

- small - "north america" (121GB without indexes)
- large - "planet" (688GB without indexes)

And then I created indexes using either gist_geometry_ops_2d (with sort)
or gist_geometry_ops_nd (no sorting).

On 6/7/24 19:41, Tomas Vondra wrote:

> Hi,
>
> After looking into parallel builds for BRIN and GIN indexes, I was
> wondering if there's a way to do parallel builds for GiST too. I knew
> next to nothing about how GiST works, but I gave it a shot and here's
> what I have - the attached patch allows parallel GiST builds for the
> "unsorted" case (i.e. when the opclass does not include sortsupport),
> and does not support buffered builds.
>
> unsorted builds only
> --------------------
>
> Addressing only the unsorted case may seem a bit weird, but I did it
> this way for two reasons - parallel sort is a solved problem, and adding
> it to the patch seems quite straightforward. It's what btree does, for
> example. But I also wasn't very sure how common sorted builds are - we
> do have sortsupport for points, but I had no idea if the PostGIS
> opclasses define sorting etc. My guess was "no", but I've been told
> that's no longer true, so I guess sorted builds are more widely
> applicable than I thought.

For sorted builds, I made the claim that parallelizing them is a
"solved problem" because we can use a parallel tuplesort. I was thinking
that maybe it'd be better to do that in the initial patch, and only then
introduce the more complex stuff for the unsorted case, so I gave this a
try, and it turned out to be rather pointless.

Yes, parallel tuplesort does improve the duration, but it's not a very
significant improvement - maybe 10% or so. Most of the build time is
spent in gist_indexsortbuild(), so this is the part that would need to
be parallelized for any substantial improvement. Only then is it useful
to improve the tuplesort, I think.

And parallelizing gist_indexsortbuild() is not trivial - most of the
time is spent in gist_indexsortbuild_levelstate_flush() / gistSplit(),
so ISTM a successful parallel implementation would need to divide this
work between multiple workers. I don't have a clear idea how, though.

I do have a PoC/WIP patch doing the parallel tuplesort in my github
branch at [1] (and then also some ugly experiments on top of that), but
I'm not going to attach it here because of the reasons I just explained.
It'd be just a pointless distraction.

> In any case, I'm not in a rush to parallelize sorted builds. It can be
> added later, as an improvement, IMHO. In fact, it's a well isolated part
> of the patch, which might make it a good choice for someone looking for
> an idea for their first patch ...

I still think this assessment is correct - it's fine to not parallelize
sorted builds. It can be improved in the future, or even not at all.

> buffered builds
> ---------------
>
> The lack of support for buffered builds is a very different thing. The
> basic idea is that we don't push the index entries all the way to the
> leaf pages right away, but accumulate them in buffers half-way through.
> This combines writes and reduces random I/O, which is nice.
>
> Unfortunately, the way it's implemented does not work with parallel
> builds at all - all the state is in private memory, and it assumes the
> worker is the only possible backend that can split the page (at which
> point the buffers need to be split too, etc.). But for parallel builds
> this is obviously not true.
>
> I'm not saying parallel builds can't do similar buffering, but it
> requires moving the buffers into shared memory, and introducing locking
> to coordinate accesses to the buffers. (Or perhaps it might be enough to
> only "notify" the workers about page splits, with buffers still in
> private memory?). Anyway, it seems far too complicated for v1.
>
> In fact, I'm not sure the buffering is entirely necessary - maybe the
> increase in the amount of RAM makes this less of an issue? If the index
> can fit into shared buffers (or at least page cache), maybe the amount
> of extra I/O is not that bad? I'm sure there may be cases really
> affected by this, but maybe it's OK to tell people to disable parallel
> builds in those cases?

For unsorted builds, here are the results from one of the machines: the
duration of CREATE INDEX with the requested number of workers (0 means
serial build) for different tables in the OSM databases:

db      type     size (MB) |      0      1      2      3      4
---------------------------|----------------------------------
small   line        4889   |    811    429    294    223    186
        point       2625   |    485    262    179    141    125
        polygon     7644   |   1230    623    418    318    261
        roads        273   |     40     22     16     14     12
---------------------------|----------------------------------
large   line       20592   |   3916   2157   1479   1137    948
        point      13080   |   2636   1442    981    770    667
        polygon    50598   |  10990   5648   3860   2991   2504
        roads       1322   |    228    123     85     67     56

(The table also shows the size of each index.) If we calculate the
speedup compared to the serial build, we get this:

db      type    |    1      2      3      4
----------------|--------------------------------
small   line    |  1.9    2.8    3.6    4.4
        point   |  1.9    2.7    3.4    3.9
        polygon |  2.0    2.9    3.9    4.7
        roads   |  1.8    2.5    2.9    3.3
----------------|--------------------------------
large   line    |  1.8    2.6    3.4    4.1
        point   |  1.8    2.7    3.4    4.0
        polygon |  1.9    2.8    3.7    4.4
        roads   |  1.9    2.7    3.4    4.1

Remember, the leader participates in the build, so K workers means K+1
processes are doing the work. And the speedup is pretty close to the
ideal speedup.

There's the question about buffering, though - as mentioned in the
first message, the parallel builds do not support buffering, so the
question is how bad the impact of that is. Clearly, the duration
improves a lot, so that's good, but maybe it did write out far more
buffers and the NVMe drive handled it well?

So I used pg_stat_statements to track the number of buffer writes
(shared_blks_written) for the CREATE INDEX, and for the large data set
it looks like this (this is in MBs written):

type     size (MB) |      0       1       2       3       4
-------------------|--------------------------------------------
line       20592   |  43577   47580   49574   50388   50734
point      13080   |  23331   25721   26399   26745   26889
polygon    50598   | 113108  125095  129599  130170  131249
roads       1322   |   1322    1310    1305    1300    1295

The serial builds (0 workers) are buffered, but the buffering only
applies to indexes that exceed effective_cache_size (4GB). Which means
the "roads" index is too small to activate buffering, and there should
be very little difference - which is the case (but that index also fits
into shared buffers in this case).
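
For reference, in the "auto" mode a serial build only switches to
buffering based on this check in gistBuildCallback() (roughly
paraphrased from the current code):

    /* switch to buffering once the index outgrows effective_cache_size */
    if (buildstate->buildMode == GIST_BUFFERING_AUTO &&
        buildstate->indtuples % BUFFERING_MODE_SWITCH_CHECK_STEP == 0 &&
        effective_cache_size < smgrnblocks(RelationGetSmgr(index),
                                           MAIN_FORKNUM))
        gistInitBuffering(buildstate);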

The other indexes do activate buffering, so the question is how many
more buffers get written out compared to serial builds (with buffering).
And the comparison looks like this:

            1      2      3      4
------------------------------------------
line      109%   114%   116%   116%
point     110%   113%   115%   115%
polygon   111%   115%   115%   116%
roads      99%    99%    98%    98%

So it writes about 15-20% more buffers during the index build, which is
not that much IMHO. I was wondering if this might change with smaller
shared buffers, so I tried building indexes on the smaller data set with
128MB shared buffers, but the difference remained at ~15-20%.

My conclusion from this is that it's OK to have parallel builds without
buffering, and then maybe improve that later. The thing I'm not sure
about is how this should interact with the "buffering" option. Right now
we just ignore that entirely if we decide to do a parallel build. But
maybe it'd be better to disable parallel builds when the user specifies
"buffering=on" (and only allow parallel builds with off/auto)?

I did check how parallelism affects the amount of WAL produced, but
that's pretty much exactly how I described that in the initial message,
including the "strange" decrease with more workers due to reusing the
fake LSN etc.

regards

[1]: https://github.com/tvondra/postgres/tree/parallel-gist-20240625

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v20240701-0001-WIP-parallel-GiST-build.patch (text/x-patch)
From 4a00f2505e333f1c160040478ed81a0450514b11 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Sun, 26 May 2024 21:44:27 +0200
Subject: [PATCH v20240701] WIP parallel GiST build

Implements parallel GiST index build for the unsorted case. The build
simply starts parallel workers that insert values into the index the
usual way (as if there were multiple clients doing INSERT).

The basic infrastructure is copied from parallel BRIN builds (and also
from the nearby parallel GIN build). There's nothing particularly
special or interesting, except for the gistBuildParallelCallback()
callback. The two significant changes in the callback are:

1) disabling buffering

Buffered builds assume the worker is the only backend that can split
index pages etc. With a serial build that is trivially true, but with
parallel workers this leads to confusion.

In principle this is solvable by moving the buffers into shared memory
and coordinating the workers (locking etc.). But the patch does not do
that yet - it's clearly non-trivial, and I'm not really convinced it's
worth it.

2) generating "proper" fake LSNs

The serial builds disable all WAL-logging for the index, until the very
end when the whole index is WAL-logged. This however also means we don't
set page LSNs on the index pages - but page LSNs are used to detect
concurrent changes to the index structure (e.g. page splits). For serial
builds this does not matter, because only the build worker can modify
the index, so it just sets the same LSN "1" for all pages. Both of these
(disabling WAL-logging and using bogus page LSNs) are controlled by the
same is_build flag.

Having the same page LSN does not work for parallel builds, as it would
mean workers won't notice splits done by other workers, etc.

One option would be to set is_build=false, which enables WAL-logging, as
if during regular inserts, and also assigns proper page LSNs. But we
don't want to WAL-log everything, that's unnecessary. We want to only
start WAL-logging the index once the build completes, just like for
serial builds. And only do the fake LSNs, as for unlogged indexes etc.

So this introduces a separate flag is_parallel, which forces generating
the "proper" fake LSN. But we can still do is_build=true, and only log
the index at the end of the build.
---
 src/backend/access/gist/gist.c        |  37 +-
 src/backend/access/gist/gistbuild.c   | 713 +++++++++++++++++++++++++-
 src/backend/access/gist/gistutil.c    |  10 +-
 src/backend/access/gist/gistvacuum.c  |   6 +-
 src/backend/access/transam/parallel.c |   4 +
 src/include/access/gist_private.h     |  12 +-
 src/tools/pgindent/typedefs.list      |   2 +
 7 files changed, 739 insertions(+), 45 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index ed4ffa63a77..f5f56fb2503 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -75,7 +75,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = true;
 	amroutine->amusemaintenanceworkmem = false;
 	amroutine->amsummarizing = false;
@@ -182,7 +182,7 @@ gistinsert(Relation r, Datum *values, bool *isnull,
 						 values, isnull, true /* size is currently bogus */ );
 	itup->t_tid = *ht_ctid;
 
-	gistdoinsert(r, itup, 0, giststate, heapRel, false);
+	gistdoinsert(r, itup, 0, giststate, heapRel, false, false);
 
 	/* cleanup */
 	MemoryContextSwitchTo(oldCxt);
@@ -230,7 +230,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 				List **splitinfo,
 				bool markfollowright,
 				Relation heapRel,
-				bool is_build)
+				bool is_build,
+				bool is_parallel)
 {
 	BlockNumber blkno = BufferGetBlockNumber(buffer);
 	Page		page = BufferGetPage(buffer);
@@ -501,9 +502,17 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * smaller than any real or fake unlogged LSN that might be generated
 		 * later. (There can't be any concurrent scans during index build, so
 		 * we don't need to be able to detect concurrent splits yet.)
+		 *
+		 * However, with a parallel index build, we need to assign a valid
+		 * LSN, as it's used to detect concurrent index modifications.
 		 */
 		if (is_build)
-			recptr = GistBuildLSN;
+		{
+			if (is_parallel)
+				recptr = gistGetFakeLSN(rel, is_parallel);
+			else
+				recptr = GistBuildLSN;
+		}
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -511,7 +520,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = gistGetFakeLSN(rel, false);
 		}
 
 		for (ptr = dist; ptr; ptr = ptr->next)
@@ -570,7 +579,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			MarkBufferDirty(leftchildbuf);
 
 		if (is_build)
-			recptr = GistBuildLSN;
+		{
+			if (is_parallel)
+				recptr = gistGetFakeLSN(rel, is_parallel);
+			else
+				recptr = GistBuildLSN;
+		}
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -589,7 +603,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 										leftchildbuf);
 			}
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = gistGetFakeLSN(rel, false);
 		}
 		PageSetLSN(page, recptr);
 
@@ -632,7 +646,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
  */
 void
 gistdoinsert(Relation r, IndexTuple itup, Size freespace,
-			 GISTSTATE *giststate, Relation heapRel, bool is_build)
+			 GISTSTATE *giststate, Relation heapRel, bool is_build,
+			 bool is_parallel)
 {
 	ItemId		iid;
 	IndexTuple	idxtuple;
@@ -646,6 +661,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	state.r = r;
 	state.heapRel = heapRel;
 	state.is_build = is_build;
+	state.is_parallel = is_parallel;
 
 	/* Start from the root */
 	firststack.blkno = GIST_ROOT_BLKNO;
@@ -1303,7 +1319,8 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 							   &splitinfo,
 							   true,
 							   state->heapRel,
-							   state->is_build);
+							   state->is_build,
+							   state->is_parallel);
 
 	/*
 	 * Before recursing up in case the page was split, release locks on the
@@ -1722,7 +1739,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			PageSetLSN(page, recptr);
 		}
 		else
-			PageSetLSN(page, gistGetFakeLSN(rel));
+			PageSetLSN(page, gistGetFakeLSN(rel, false));
 
 		END_CRIT_SECTION();
 	}
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index ba06df30faf..c8fa67beebb 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -36,18 +36,28 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
-
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
 
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIST_SHARED		UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000004)
+
 /* Step of index tuples for check whether to switch to buffering build mode */
 #define BUFFERING_MODE_SWITCH_CHECK_STEP 256
 
@@ -78,6 +88,106 @@ typedef enum
 	GIST_BUFFERING_ACTIVE,		/* in buffering build mode */
 } GistBuildMode;
 
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GISTShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 *
+	 * XXX nparticipants is the number of workers we expect to participate
+	 * in the build, possibly including the leader process.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			nparticipants;
+
+	/* Parameters determined by the leader, passed to the workers. */
+	GistBuildMode buildMode;
+	int			freespace;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can finish
+	 * building the index.
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields before heapdesc.
+	 *
+	 * These fields contain status information of interest to GIST index
+	 * builds that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GISTShared;
+
+/*
+ * Return pointer to a GISTShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGistShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GISTShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GISTLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipants is the exact number of worker processes successfully
+	 * launched, plus one leader process if it participates as a worker (only
+	 * DISABLE_LEADER_PARTICIPATION builds avoid leader participating as a
+	 * worker).
+	 *
+	 * XXX Seems a bit redundant with nparticipants in GISTShared. Although
+	 * that is the expected number, this is what we actually got.
+	 */
+	int			nparticipants;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GISTShared is the shared state for entire build. snapshot is the
+	 * snapshot used by the scan iff an MVCC snapshot is required.
+	 */
+	GISTShared *gistshared;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GISTLeader;
+
 /* Working state for gistbuild and its callback */
 typedef struct
 {
@@ -100,6 +210,14 @@ typedef struct
 	GISTBuildBuffers *gfbb;
 	HTAB	   *parentMap;
 
+	/*
+	 * gist_leader is only present when a parallel index build is performed,
+	 * and only in the leader process. (Actually, only the leader process has
+	 * a GISTBuildState.)
+	 */
+	bool		is_parallel;
+	GISTLeader *gist_leader;
+
 	/*
 	 * Extra data structures used during a sorting build.
 	 */
@@ -148,6 +266,12 @@ static void gistBuildCallback(Relation index,
 							  bool *isnull,
 							  bool tupleIsAlive,
 							  void *state);
+static void gistBuildParallelCallback(Relation index,
+									  ItemPointer tid,
+									  Datum *values,
+									  bool *isnull,
+									  bool tupleIsAlive,
+									  void *state);
 static void gistBufferingBuildInsert(GISTBuildState *buildstate,
 									 IndexTuple itup);
 static bool gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
@@ -171,6 +295,18 @@ static void gistMemorizeAllDownlinks(GISTBuildState *buildstate,
 									 Buffer parentbuf);
 static BlockNumber gistGetParent(GISTBuildState *buildstate, BlockNumber child);
 
+/* parallel index builds */
+static void _gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
+								 bool isconcurrent, int request);
+static void _gist_end_parallel(GISTLeader *gistleader, GISTBuildState *state);
+static Size _gist_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gist_parallel_heapscan(GISTBuildState *buildstate);
+static void _gist_leader_participate_as_worker(GISTBuildState *buildstate,
+											   Relation heap, Relation index);
+static void _gist_parallel_scan_and_build(GISTBuildState *buildstate,
+										  GISTShared *gistshared,
+										  Relation heap, Relation index,
+										  int workmem, bool progress);
 
 /*
  * Main entry point to GiST index build.
@@ -199,6 +335,10 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.sortstate = NULL;
 	buildstate.giststate = initGISTstate(index);
 
+	/* assume serial build */
+	buildstate.is_parallel = false;
+	buildstate.gist_leader = NULL;
+
 	/*
 	 * Create a temporary memory context that is reset once for each tuple
 	 * processed.  (Note: we don't bother to make this a child of the
@@ -309,37 +449,79 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 		END_CRIT_SECTION();
 
-		/* Scan the table, inserting all the tuples to the index. */
-		reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
-										   gistBuildCallback,
-										   (void *) &buildstate, NULL);
-
 		/*
-		 * If buffering was used, flush out all the tuples that are still in
-		 * the buffers.
+		 * Attempt to launch parallel worker scan when required
+		 *
+		 * XXX plan_create_index_workers makes the number of workers dependent
+		 * on maintenance_work_mem, requiring 32MB for each worker. That makes
+		 * sense for btree, but maybe not for GIST (at least when not using
+		 * buffering)? So maybe make that somehow less strict, optionally?
 		 */
-		if (buildstate.buildMode == GIST_BUFFERING_ACTIVE)
-		{
-			elog(DEBUG1, "all tuples processed, emptying buffers");
-			gistEmptyAllBuffers(&buildstate);
-			gistFreeBuildBuffers(buildstate.gfbb);
-		}
+		if (indexInfo->ii_ParallelWorkers > 0)
+			_gist_begin_parallel(&buildstate, heap,
+								 index, indexInfo->ii_Concurrent,
+								 indexInfo->ii_ParallelWorkers);
 
 		/*
-		 * We didn't write WAL records as we built the index, so if
-		 * WAL-logging is required, write all pages to the WAL now.
+		 * If parallel build requested and at least one worker process was
+		 * successfully launched, set up coordination state, wait for workers
+		 * to complete and end the parallel build.
+		 *
+		 * In serial mode, simply scan the table and build the index one index
+		 * tuple at a time.
 		 */
-		if (RelationNeedsWAL(index))
+		if (buildstate.gist_leader)
 		{
-			log_newpage_range(index, MAIN_FORKNUM,
-							  0, RelationGetNumberOfBlocks(index),
-							  true);
+			/* scan the relation and wait for parallel workers to finish */
+			reltuples = _gist_parallel_heapscan(&buildstate);
+
+			_gist_end_parallel(buildstate.gist_leader, &buildstate);
+
+			/*
+			 * We didn't write WAL records as we built the index, so if WAL-logging is
+			 * required, write all pages to the WAL now.
+			 */
+			if (RelationNeedsWAL(index))
+			{
+				log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+			}
 		}
-	}
+		else
+		{
+			/* Scan the table, inserting all the tuples to the index. */
+			reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+											   gistBuildCallback,
+											   (void *) &buildstate, NULL);
 
-	/* okay, all heap tuples are indexed */
-	MemoryContextSwitchTo(oldcxt);
-	MemoryContextDelete(buildstate.giststate->tempCxt);
+			/*
+			 * If buffering was used, flush out all the tuples that are still
+			 * in the buffers.
+			 */
+			if (buildstate.buildMode == GIST_BUFFERING_ACTIVE)
+			{
+				elog(DEBUG1, "all tuples processed, emptying buffers");
+				gistEmptyAllBuffers(&buildstate);
+				gistFreeBuildBuffers(buildstate.gfbb);
+			}
+
+			/*
+			 * We didn't write WAL records as we built the index, so if
+			 * WAL-logging is required, write all pages to the WAL now.
+			 */
+			if (RelationNeedsWAL(index))
+			{
+				log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+			}
+
+			/* okay, all heap tuples are indexed */
+			MemoryContextSwitchTo(oldcxt);
+			MemoryContextDelete(buildstate.giststate->tempCxt);
+		}
+	}
 
 	freeGISTstate(buildstate.giststate);
 
@@ -861,7 +1043,7 @@ gistBuildCallback(Relation index,
 		 * locked, we call gistdoinsert directly.
 		 */
 		gistdoinsert(index, itup, buildstate->freespace,
-					 buildstate->giststate, buildstate->heaprel, true);
+					 buildstate->giststate, buildstate->heaprel, true, false);
 	}
 
 	MemoryContextSwitchTo(oldCtx);
@@ -900,6 +1082,48 @@ gistBuildCallback(Relation index,
 	}
 }
 
+/*
+ * Per-tuple callback for table_index_build_scan.
+ *
+ * XXX Almost the same as gistBuildCallback, but passes is_parallel=true
+ * when calling gistdoinsert. Otherwise we get assert failures due to
+ * workers modifying the index concurrently.
+ */
+static void
+gistBuildParallelCallback(Relation index,
+						  ItemPointer tid,
+						  Datum *values,
+						  bool *isnull,
+						  bool tupleIsAlive,
+						  void *state)
+{
+	GISTBuildState *buildstate = (GISTBuildState *) state;
+	IndexTuple	itup;
+	MemoryContext oldCtx;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->giststate->tempCxt);
+
+	/* form an index tuple and point it at the heap tuple */
+	itup = gistFormTuple(buildstate->giststate, index,
+						 values, isnull,
+						 true);
+	itup->t_tid = *tid;
+
+	/* Update tuple count and total size. */
+	buildstate->indtuples += 1;
+	buildstate->indtuplesSize += IndexTupleSize(itup);
+
+	/*
+	 * There are no buffers (yet). Since we already have the index relation
+	 * locked, we call gistdoinsert directly.
+	 */
+	gistdoinsert(index, itup, buildstate->freespace,
+				 buildstate->giststate, buildstate->heaprel, true, true);
+
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(buildstate->giststate->tempCxt);
+}
+
 /*
  * Insert function for buffering index build.
  */
@@ -1068,7 +1292,8 @@ gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer, int level,
 							   InvalidBuffer,
 							   &splitinfo,
 							   false,
-							   buildstate->heaprel, true);
+							   buildstate->heaprel, true,
+							   buildstate->is_parallel);
 
 	/*
 	 * If this is a root split, update the root path item kept in memory. This
@@ -1577,3 +1802,439 @@ gistGetParent(GISTBuildState *buildstate, BlockNumber child)
 
 	return entry->parentblkno;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's gist_leader, which caller must use to shut down parallel
+ * mode by passing it to _gist_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
+					 bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			nparticipants;
+	Snapshot	snapshot;
+	Size		estgistshared;
+	GISTShared *gistshared;
+	GISTLeader *gistleader = (GISTLeader *) palloc0(sizeof(GISTLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of GIST
+	 * index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gist_parallel_build_main",
+								 request);
+
+	nparticipants = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIST_SHARED workspace.
+	 */
+	estgistshared = _gist_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estgistshared);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	gistshared = (GISTShared *) shm_toc_allocate(pcxt->toc, estgistshared);
+	/* Initialize immutable state */
+	gistshared->heaprelid = RelationGetRelid(heap);
+	gistshared->indexrelid = RelationGetRelid(index);
+	gistshared->isconcurrent = isconcurrent;
+	gistshared->nparticipants = nparticipants;
+
+	/* Build parameters determined by the leader */
+	gistshared->buildMode = buildstate->buildMode;
+	gistshared->freespace = buildstate->freespace;
+
+	ConditionVariableInit(&gistshared->workersdonecv);
+	SpinLockInit(&gistshared->mutex);
+
+	/* Initialize mutable state */
+	gistshared->nparticipantsdone = 0;
+	gistshared->reltuples = 0.0;
+	gistshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGistShared(gistshared),
+								  snapshot);
+
+	/* Store shared state, for which we reserved space. */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIST_SHARED, gistshared);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	gistleader->pcxt = pcxt;
+	gistleader->nparticipants = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		gistleader->nparticipants++;
+	gistleader->gistshared = gistshared;
+	gistleader->snapshot = snapshot;
+	gistleader->walusage = walusage;
+	gistleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gist_end_parallel(gistleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->is_parallel = true;
+	buildstate->gist_leader = gistleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gist_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gist_end_parallel(GISTLeader *gistleader, GISTBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(gistleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < gistleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&gistleader->bufferusage[i], &gistleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(gistleader->snapshot))
+		UnregisterSnapshot(gistleader->snapshot);
+	DestroyParallelContext(gistleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gist_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe needs to flush data if GIST_BUFFERING_ACTIVE, a bit like in
+ * the serial build?
+ */
+static double
+_gist_parallel_heapscan(GISTBuildState *state)
+{
+	GISTShared *gistshared = state->gist_leader->gistshared;
+	int			nparticipants;
+
+	nparticipants = state->gist_leader->nparticipants;
+	for (;;)
+	{
+		SpinLockAcquire(&gistshared->mutex);
+		if (gistshared->nparticipantsdone == nparticipants)
+		{
+			/* copy the data into leader state */
+			state->indtuples = gistshared->indtuples;
+
+			SpinLockRelease(&gistshared->mutex);
+			break;
+		}
+		SpinLockRelease(&gistshared->mutex);
+
+		ConditionVariableSleep(&gistshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->indtuples;
+}
+
+
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gist index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gist_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GISTShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gist_leader_participate_as_worker(GISTBuildState *buildstate,
+								   Relation heap, Relation index)
+{
+	GISTLeader *gistleader = buildstate->gist_leader;
+	int			workmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	workmem = maintenance_work_mem / gistleader->nparticipants;
+
+	/* Perform work common to all participants */
+	_gist_parallel_scan_and_build(buildstate, gistleader->gistshared,
+								  heap, index, workmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel scan and insert.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gist_parallel_scan_and_build(GISTBuildState *state,
+							  GISTShared *gistshared,
+							  Relation heap, Relation index,
+							  int workmem, bool progress)
+{
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+	MemoryContext oldcxt = CurrentMemoryContext;
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = gistshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGistShared(gistshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+									   gistBuildParallelCallback, state, scan);
+
+	/*
+	 * If buffering was used, flush out all the tuples that are still in the
+	 * buffers.
+	 */
+	if (state->buildMode == GIST_BUFFERING_ACTIVE)
+	{
+		elog(DEBUG1, "all tuples processed, emptying buffers");
+		gistEmptyAllBuffers(state);
+		gistFreeBuildBuffers(state->gfbb);
+	}
+
+	/* okay, all heap tuples are indexed */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextDelete(state->giststate->tempCxt);
+
+	/* FIXME Do we need to do something else with active buffering? */
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&gistshared->mutex);
+	gistshared->nparticipantsdone++;
+	gistshared->reltuples += reltuples;
+	gistshared->indtuples += state->indtuples;
+	SpinLockRelease(&gistshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&gistshared->workersdonecv);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gist_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GISTShared *gistshared;
+	GISTBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			workmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up GIST shared state */
+	gistshared = shm_toc_lookup(toc, PARALLEL_KEY_GIST_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!gistshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(gistshared->heaprelid, heapLockmode);
+	indexRel = index_open(gistshared->indexrelid, indexLockmode);
+
+	buildstate.indexrel = indexRel;
+	buildstate.heaprel = heapRel;
+	buildstate.sortstate = NULL;
+	buildstate.giststate = initGISTstate(indexRel);
+
+	buildstate.is_parallel = true;
+	buildstate.gist_leader = NULL;
+
+	/*
+	 * Create a temporary memory context that is reset once for each tuple
+	 * processed.  (Note: we don't bother to make this a child of the
+	 * giststate's scanCxt, so we have to delete it separately at the end.)
+	 */
+	buildstate.giststate->tempCxt = createTempGistContext();
+
+	/* FIXME */
+	buildstate.buildMode = gistshared->buildMode;
+	buildstate.freespace = gistshared->freespace;
+
+	buildstate.indtuples = 0;
+	buildstate.indtuplesSize = 0;
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	workmem = maintenance_work_mem / gistshared->nparticipants;
+
+	_gist_parallel_scan_and_build(&buildstate, gistshared,
+								  heapRel, indexRel, workmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 78e98d68b15..733d5849317 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1012,7 +1012,7 @@ gistproperty(Oid index_oid, int attno,
  * purpose.
  */
 XLogRecPtr
-gistGetFakeLSN(Relation rel)
+gistGetFakeLSN(Relation rel, bool is_parallel)
 {
 	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
 	{
@@ -1035,8 +1035,12 @@ gistGetFakeLSN(Relation rel)
 		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
 		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
 
-		/* Shouldn't be called for WAL-logging relations */
-		Assert(!RelationNeedsWAL(rel));
+		/*
+		 * Shouldn't be called for WAL-logging relations, but parallel
+		 * builds are an exception - we need the fake LSN to detect
+		 * concurrent changes.
+		 */
+		Assert(is_parallel || !RelationNeedsWAL(rel));
 
 		/* No need for an actual record if we already have a distinct LSN */
 		if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 24fb94f473e..082804e9c7d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -181,7 +181,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(rel))
 		vstate.startNSN = GetInsertRecPtr();
 	else
-		vstate.startNSN = gistGetFakeLSN(rel);
+		vstate.startNSN = gistGetFakeLSN(rel, false);
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -376,7 +376,7 @@ restart:
 				PageSetLSN(page, recptr);
 			}
 			else
-				PageSetLSN(page, gistGetFakeLSN(rel));
+				PageSetLSN(page, gistGetFakeLSN(rel, false));
 
 			END_CRIT_SECTION();
 
@@ -664,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(info->index))
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
-		recptr = gistGetFakeLSN(info->index);
+		recptr = gistGetFakeLSN(info->index, false);
 	PageSetLSN(parentPage, recptr);
 	PageSetLSN(leafPage, recptr);
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb54..7e09fc79c30 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gist_private.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gist_parallel_build_main", _gist_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 7b8749c8db0..d5b22bc1018 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -20,6 +20,7 @@
 #include "lib/pairingheap.h"
 #include "storage/bufmgr.h"
 #include "storage/buffile.h"
+#include "storage/shm_toc.h"
 #include "utils/hsearch.h"
 #include "access/genam.h"
 
@@ -254,6 +255,7 @@ typedef struct
 	Relation	heapRel;
 	Size		freespace;		/* free space to be left */
 	bool		is_build;
+	bool		is_parallel;
 
 	GISTInsertStack *stack;
 } GISTInsertState;
@@ -413,7 +415,8 @@ extern void gistdoinsert(Relation r,
 						 Size freespace,
 						 GISTSTATE *giststate,
 						 Relation heapRel,
-						 bool is_build);
+						 bool is_build,
+						 bool is_parallel);
 
 /* A List of these is returned from gistplacetopage() in *splitinfo */
 typedef struct
@@ -430,7 +433,8 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 							List **splitinfo,
 							bool markfollowright,
 							Relation heapRel,
-							bool is_build);
+							bool is_build,
+							bool is_parallel);
 
 extern SplitPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 								  int len, GISTSTATE *giststate);
@@ -531,7 +535,7 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 							 GISTENTRY *entry2, bool isnull2,
 							 Datum *dst, bool *dstisnull);
 
-extern XLogRecPtr gistGetFakeLSN(Relation rel);
+extern XLogRecPtr gistGetFakeLSN(Relation rel, bool is_parallel);
 
 /* gistvacuum.c */
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
@@ -568,4 +572,6 @@ extern void gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb,
 											List *splitinfo);
 extern void gistUnloadNodeBuffers(GISTBuildBuffers *gfbb);
 
+extern void _gist_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIST_PRIVATE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e6c1caf6498..54376940b74 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -971,6 +971,7 @@ GISTInsertStack
 GISTInsertState
 GISTIntArrayBigOptions
 GISTIntArrayOptions
+GISTLeader
 GISTNodeBuffer
 GISTNodeBufferPage
 GISTPageOpaque
@@ -981,6 +982,7 @@ GISTScanOpaque
 GISTScanOpaqueData
 GISTSearchHeapItem
 GISTSearchItem
+GISTShared
 GISTTYPE
 GIST_SPLITVEC
 GMReaderTupleBuffer
-- 
2.45.2

gist-i5.tgz (application/x-compressed-tar)
gist-xeon.tgz (application/x-compressed-tar)
#3Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Tomas Vondra (#1)
Re: WIP: parallel GiST index builds

Hi Tomas!

On 7 Jun 2024, at 20:41, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

After looking into parallel builds for BRIN and GIN indexes, I was
wondering if there's a way to do parallel builds for GiST too. I knew
next to nothing about how GiST works, but I gave it a shot and here's
what I have - the attached patch allows parallel GiST builds for the
"unsorted" case (i.e. when the opclass does not include sortsupport),
and does not support buffered builds.

I think this totally makes sense. I've taken a look at the tuple partitioning (for sorted builds) in your Github and I see that it's a complicated feature. So, probably, we can do it later.
I'm trying to review the patch as it is now. Currently I have some questions about the code.

1. Do I get it right that the is_parallel argument for gistGetFakeLSN() is only needed for an assertion? And this assertion can be ensured just by inspecting the code. Is it really necessary?
2. gistBuildParallelCallback() updates indtuplesSize, but it seems to be unused. AFAIK it's only needed for buffered builds.
3. I think we need a test that reliably triggers both parallel and serial builds.

As far as I know, there's a well-known trick to build a better GiST over PostGIS data: randomize the input. I think a parallel scan is just what is needed - it will shuffle the tuples enough...

Thanks for working on this!

Best regards, Andrey Borodin.

#4Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andrey M. Borodin (#3)
Re: WIP: parallel GiST index builds

On 7/21/24 21:31, Andrey M. Borodin wrote:

Hi Tomas!

On 7 Jun 2024, at 20:41, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

After looking into parallel builds for BRIN and GIN indexes, I was
wondering if there's a way to do parallel builds for GiST too. I
knew next to nothing about how GiST works, but I gave it a shot and
here's what I have - the attached patch allows parallel GiST builds
for the "unsorted" case (i.e. when the opclass does not include
sortsupport), and does not support buffered builds.

I think this totally makes sense. I've taken a look at the tuple
partitioning (for sorted builds) in your Github and I see that it's a
complicated feature. So, probably, we can do it later. I'm trying to
review the patch as it is now. Currently I have some questions about
the code.

OK. I'm not even sure partitioning is the right approach for sorted
builds. Or how to do it, exactly.

1. Do I get it right that the is_parallel argument for gistGetFakeLSN()
is only needed for an assertion? And this assertion can be ensured just
by inspecting the code. Is it really necessary?

Yes, in the patch it's only used for an assert. But it's actually
incorrect - just as I speculated in my initial message (in the section
about gistGetFakeLSN), it sometimes fails to detect a page split. I
noticed that while testing the patch adding GiST to amcheck, and I
reported that in that thread [1]. But I apparently forgot to post an
updated version of this patch :-(

I'll post a new version tomorrow, but in short it needs to update the
fake LSN even if (lastlsn != currlsn) if is_parallel=true. It's a bit
annoying this means we generate a new fake LSN on every call, and I'm
not sure that's actually necessary. But I've been unable to come up with
a better condition when to generate a new LSN.
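
In code, the fix amounts to something like this in the non-temporary
branch of gistGetFakeLSN() - just a sketch, not the final version:

	static XLogRecPtr lastlsn = InvalidXLogRecPtr;
	XLogRecPtr	currlsn = GetXLogInsertRecPtr();

	/* parallel builds may call this even for WAL-logged relations */
	Assert(is_parallel || !RelationNeedsWAL(rel));

	/*
	 * lastlsn is backend-local, so in a parallel build another worker
	 * may have already used currlsn for a split - insert a new record
	 * on every call. For serial builds a new record is needed only if
	 * the insert position did not advance since the previous call.
	 */
	if (is_parallel ||
		(!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn))
		currlsn = gistXLogAssignLSN();

	lastlsn = currlsn;
	return currlsn;

(gistXLogAssignLSN() inserts a tiny WAL record whose only purpose is to
advance the insert position.)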

[1]
/messages/by-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf@enterprisedb.com

2. gistBuildParallelCallback() updates indtuplesSize, but it seems to
be unused. AFAIK it's only needed for buffered builds. 3. I think we
need a test that reliably triggers both parallel and serial builds.

Yeah, it's possible the variable is unused. Agreed on the testing.

As far as I know, there's a well-known trick to build a better GiST
over PostGIS data: randomize the input. I think a parallel scan is just
what is needed - it will shuffle the tuples enough...

I'm not sure I understand this comment. What do you mean by "better
GiST" or what does that mean for this patch?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#5Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Tomas Vondra (#4)
Re: WIP: parallel GiST index builds

On 21 Jul 2024, at 23:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

1. Do I get it right that the is_parallel argument for gistGetFakeLSN()
is only needed for an assertion? And this assertion can be ensured just
by inspecting the code. Is it really necessary?

Yes, in the patch it's only used for an assert. But it's actually
incorrect - just as I speculated in my initial message (in the section
about gistGetFakeLSN), it sometimes fails to detect a page split. I
noticed that while testing the patch adding GiST to amcheck, and I
reported that in that thread [1]. But I apparently forgot to post an
updated version of this patch :-(

Oops, I just thought that this was a version with the FakeLSN problem solved.

I'll post a new version tomorrow, but in short it needs to update the
fake LSN even if (lastlsn != currlsn) when is_parallel=true. It's a bit
annoying that this means we generate a new fake LSN on every call, and
I'm not sure that's actually necessary. But I've been unable to come up
with a better condition for when to generate a new LSN.

Why don't we just use an atomic counter within the shared build state?

[1]
/messages/by-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf@enterprisedb.com

Yes, I'll be back in that thread soon. I'm still on vacation and it's hard to get continuous uninterrupted time here. You did a great review, and I want to address all the issues there holistically. Thank you!

2. gistBuildParallelCallback() updates indtuplesSize, but it seems to
be unused. AFAIK it's only needed for buffered builds.

3. I think we need a test that reliably triggers both parallel and
serial builds.

Yeah, it's possible the variable is unused. Agreed on the testing.

As far as I know, there's a well-known trick to build a better GiST
index over PostGIS data: randomize the input. I think a parallel scan
is just what is needed - it will shuffle the tuples enough...

I'm not sure I understand this comment. What do you mean by "better
GiST", and what does that mean for this patch?

I think indexes built in parallel will have faster IndexScans.

Best regards, Andrey Borodin.

#6Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andrey M. Borodin (#5)
Re: WIP: parallel GiST index builds

On 7/22/24 11:00, Andrey M. Borodin wrote:

On 21 Jul 2024, at 23:42, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

1. Do I get it right that the is_parallel argument for gistGetFakeLSN()
is only needed for an assertion? And this assertion can be verified
just by inspecting the code. Is it really necessary?

Yes, in the patch it's only used for an assert. But it's actually
incorrect - just as I speculated in my initial message (in the section
about gistGetFakeLSN), it sometimes fails to detect a page split. I
noticed that while testing the patch adding GiST to amcheck, and I
reported that in that thread [1]. But I apparently forgot to post an
updated version of this patch :-(

Oops, I just thought that this was a version with the FakeLSN problem solved.

I'll post a new version tomorrow, but in short it needs to update the
fake LSN even if (lastlsn != currlsn) when is_parallel=true. It's a bit
annoying that this means we generate a new fake LSN on every call, and
I'm not sure that's actually necessary. But I've been unable to come up
with a better condition for when to generate a new LSN.

Why don't we just use an atomic counter within the shared build state?

I don't understand how that would solve the problem - can you
elaborate? Which of the values are you suggesting should be replaced
with the shared counter? lastlsn?

[1]
/messages/by-id/79622955-6d1a-4439-b358-ec2b6a9e7cbf@enterprisedb.com

Yes, I'll be back in that thread soon. I'm still on vacation and it's hard to get continuous uninterrupted time here. You did a great review, and I want to address all the issues there holistically. Thank you!

2. gistBuildParallelCallback() updates indtuplesSize, but it seems to
be unused. AFAIK it's only needed for buffered builds.

3. I think we need a test that reliably triggers both parallel and
serial builds.

Yeah, it's possible the variable is unused. Agreed on the testing.

As far as I know, there's a well-known trick to build a better GiST
index over PostGIS data: randomize the input. I think a parallel scan
is just what is needed - it will shuffle the tuples enough...

I'm not sure I understand this comment. What do you mean by "better
GiST", and what does that mean for this patch?

I think indexes built in parallel will have faster IndexScans.

That's interesting. I hadn't thought about measuring stuff like that
(and it hadn't occurred to me it might have this benefit, or why that
would be the case).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Tomas Vondra (#6)
Re: WIP: parallel GiST index builds

On 22 Jul 2024, at 12:26, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

I don't understand how that would solve the problem - can you
elaborate? Which of the values are you suggesting should be replaced
with the shared counter? lastlsn?

I think during the build we should consider the index unlogged and always use GetFakeLSNForUnloggedRel() or something similar. Anyway, we will call log_newpage_range(RelationGetNumberOfBlocks(index)) at the end.

Best regards, Andrey Borodin.

#8Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andrey M. Borodin (#7)
Re: WIP: parallel GiST index builds

On 7/22/24 13:08, Andrey M. Borodin wrote:

On 22 Jul 2024, at 12:26, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I don't understand how that would solve the problem - can you
elaborate? Which of the values are you suggesting should be replaced
with the shared counter? lastlsn?

I think during the build we should consider the index unlogged and
always use GetFakeLSNForUnloggedRel() or something similar. Anyway, we
will call log_newpage_range(RelationGetNumberOfBlocks(index)) at the end.

But that doesn't update the page LSN, which GiST uses to detect
concurrent splits, no?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Tomas Vondra (#8)
Re: WIP: parallel GiST index builds

On 22 Jul 2024, at 14:53, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 7/22/24 13:08, Andrey M. Borodin wrote:

On 22 Jul 2024, at 12:26, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I don't understand how that would solve the problem - can you
elaborate? Which of the values are you suggesting should be replaced
with the shared counter? lastlsn?

I think during the build we should consider the index unlogged and
always use GetFakeLSNForUnloggedRel() or something similar. Anyway, we
will call log_newpage_range(RelationGetNumberOfBlocks(index)) at the end.

But that doesn't update the page LSN, which GiST uses to detect
concurrent splits, no?

While inserting tuples we need an NSN on the page. For the NSN we can use just a counter, generated by gistGetFakeLSN(), which in turn will call GetFakeLSNForUnloggedRel(). Or any other shared counter.
After inserting the tuples we call log_newpage_range() to actually WAL-log the pages.
All NSNs used during the build must be less than the LSNs used to insert new tuples after the index is built.
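
In pseudo-C, the ordering described here might look like this (a sketch
only - the counter name and where it lives are assumptions, not taken
from the patch):

    /* during the build: stamp pages with counter-generated fake LSNs/NSNs */
    recptr = (XLogRecPtr) pg_atomic_fetch_add_u64(&shared->counter, 1);
    PageSetLSN(page, recptr);

    /* after the build: WAL-log all pages, stamping them with real LSNs */
    log_newpage_range(index, MAIN_FORKNUM, 0,
                      RelationGetNumberOfBlocks(index), true);

As long as the counter starts low enough, every fake value handed out
during the build stays below the real LSNs assigned at the end.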

Best regards, Andrey Borodin.

#10Andreas Karlsson
andreas@proxel.se
In reply to: Andrey M. Borodin (#9)
Re: WIP: parallel GiST index builds

On 7/22/24 2:08 PM, Andrey M. Borodin wrote:

While inserting tuples we need an NSN on the page. For the NSN we can use just a counter, generated by gistGetFakeLSN(), which in turn will call GetFakeLSNForUnloggedRel(). Or any other shared counter.
After inserting the tuples we call log_newpage_range() to actually WAL-log the pages.
All NSNs used during the build must be less than the LSNs used to insert new tuples after the index is built.

I feel the tricky part about doing that is that we need to make sure
the fake LSNs are all less than the current real LSN when the index
build completes. While that should normally be the case, we would have
an almost-never-exercised code path for when the fake LSN becomes
bigger than the real LSN, which may contain bugs. Is that really worth
optimizing?

But if we are going to use a fake LSN: since the index being built is
not visible to any scans, we do not have to use
GetFakeLSNForUnloggedRel() but could use our own counter in shared
memory, in the GISTShared struct for this specific index, starting at
FirstNormalUnloggedLSN. This would give us slightly less contention,
plus decrease the risk (for good and bad) of the fake LSN being larger
than the real LSN.
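
A minimal sketch of that suggestion (hypothetical code - the field name
and the exact call sites are mine, not from the patch):

    typedef struct GISTShared
    {
        ...
        /* per-build fake LSN counter, shared by all participants */
        pg_atomic_uint64 fakelsn;
    } GISTShared;

    /* leader, while initializing the DSM segment */
    pg_atomic_init_u64(&gistshared->fakelsn, FirstNormalUnloggedLSN);

    /* any participant, when stamping a page during the build */
    recptr = (XLogRecPtr) pg_atomic_fetch_add_u64(&gistshared->fakelsn, 1);

Each fetch-add hands out a distinct, monotonically increasing value, so
no two workers can stamp different page versions with the same LSN.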

Andreas

#11Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Andreas Karlsson (#10)
Re: WIP: parallel GiST index builds

On 26 Jul 2024, at 14:30, Andreas Karlsson <andreas@proxel.se> wrote:

I feel the tricky part about doing that is that we need to make sure the fake LSNs are all less than the current real LSN when the index build completes. While that should normally be the case, we would have an almost-never-exercised code path for when the fake LSN becomes bigger than the real LSN, which may contain bugs. Is that really worth optimizing?

But if we are going to use a fake LSN: since the index being built is not visible to any scans, we do not have to use GetFakeLSNForUnloggedRel() but could use our own counter in shared memory, in the GISTShared struct for this specific index, starting at FirstNormalUnloggedLSN. This would give us slightly less contention, plus decrease the risk (for good and bad) of the fake LSN being larger than the real LSN.

+1 for atomic counter in GISTShared.
BTW, we can just reset the LSNs to GistBuildLSN just before doing log_newpage_range().
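
If that reset turned out to be needed, it could be as simple as the
following sketch (hypothetical code - it assumes the leader has
exclusive access to the finished index at this point):

    for (blkno = 0; blkno < RelationGetNumberOfBlocks(index); blkno++)
    {
        Buffer  buf = ReadBuffer(index, blkno);

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
        PageSetLSN(BufferGetPage(buf), GistBuildLSN);
        MarkBufferDirty(buf);
        UnlockReleaseBuffer(buf);
    }
    log_newpage_range(index, MAIN_FORKNUM, 0,
                      RelationGetNumberOfBlocks(index), true);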

Best regards, Andrey Borodin.

#12Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andrey M. Borodin (#11)
2 attachment(s)
Re: WIP: parallel GiST index builds

On 7/26/24 14:13, Andrey M. Borodin wrote:

On 26 Jul 2024, at 14:30, Andreas Karlsson <andreas@proxel.se> wrote:

I feel the tricky part about doing that is that we need to make sure the fake LSNs are all less than the current real LSN when the index build completes. While that should normally be the case, we would have an almost-never-exercised code path for when the fake LSN becomes bigger than the real LSN, which may contain bugs. Is that really worth optimizing?

But if we are going to use a fake LSN: since the index being built is not visible to any scans, we do not have to use GetFakeLSNForUnloggedRel() but could use our own counter in shared memory, in the GISTShared struct for this specific index, starting at FirstNormalUnloggedLSN. This would give us slightly less contention, plus decrease the risk (for good and bad) of the fake LSN being larger than the real LSN.

+1 for atomic counter in GISTShared.

I tried implementing this - see the attached 0002 patch, which replaces
the fake LSN with an atomic counter in shared memory. It seems to work
(more testing needed), but I can't say I'm very happy with the code :-(

The way it passes the shared counter to places that actually need it is
pretty ugly. The thing is - the counter needs to be in shared memory,
but places like gistplacetopage() have no idea/need of that. I chose to
simply pass a pg_atomic_uint64 pointer, but that's ... not pretty. Is
there a better way to do this?

I thought maybe we could simply increment the counter before each call
and pass the LSN value - 64 bits should be enough, though I'm not sure
about the overhead. But gistplacetopage() also uses the LSN twice, and
I'm not sure it'd be legal to use the same value twice.
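
For illustration, the pass-by-value variant might look like this
(hypothetical - the patch actually passes the pointer, and the
gistdoinsert signature here is invented accordingly):

    /* caller pre-allocates one fake LSN per insert, passes it by value */
    fakelsn = (XLogRecPtr) pg_atomic_fetch_add_u64(&gistshared->parallelLSN, 1);
    gistdoinsert(index, itup, buildstate->freespace,
                 buildstate->giststate, buildstate->heaprel, true, fakelsn);

with the open question being whether reusing that single value for both
writes in gistplacetopage() would be safe.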

Any better ideas?

BTW, we can just reset the LSNs to GistBuildLSN just before doing log_newpage_range().

Why would the reset be necessary? Doesn't log_newpage_range() set the
page LSN to the current insert LSN? So why reset it first?

I'm not sure about the discussion of NSN and the need to handle the
case when NSN / fake LSN values get ahead of the LSN. Is that really a
problem? If the values generated from the counter are private to the
index build, and log_newpage_range() replaces them with the current
LSN, do we still need to worry about it?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v20240730-0001-WIP-parallel-GiST-build.patchtext/x-patch; charset=UTF-8; name=v20240730-0001-WIP-parallel-GiST-build.patchDownload
From 77aadc5dab96557390449d172665a1404df97320 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Sun, 26 May 2024 21:44:27 +0200
Subject: [PATCH v20240730 1/6] WIP parallel GiST build

Implements parallel GiST index build for the unsorted case. The build
simply starts parallel workers that insert values into the index the
usual way (as if there were multiple clients doing INSERT).

The basic infrastructure is copied from parallel BRIN builds (and also
from the nearby parallel GIN build). There's nothing particularly
special or interesting, except for the gistBuildParallelCallback()
callback. The two significant changes in the callback are:

1) disabling buffering

Buffered builds assume the worker is the only backend that can split
index pages etc. With serial workers that is trivially true, but with
parallel workers this leads to confusion.

In principle this is solvable by moving the buffers into shared memory
and coordinating the workers (locking etc.). But the patch does not do
that yet - it's clearly non-trivial, and I'm not really convinced it's
worth it.

2) generating "proper" fake LSNs

The serial builds disable all WAL-logging for the index, until the very
end when the whole index is WAL-logged. This however also means we don't
set page LSNs on the index pages - but page LSNs are used to detect
concurrent changes to the index structure (e.g. page splits). For serial
builds this does not matter, because only the build worker can modify
the index, so it just sets the same LSN "1" for all pages. Both of
these (disabling WAL-logging and using bogus page LSNs) are controlled
by the same flag, is_build.

Having the same page LSN does not work for parallel builds, as it would
mean workers won't notice splits done by other workers, etc.

One option would be to set is_build=false, which enables WAL-logging, as
if during regular inserts, and also assigns proper page LSNs. But we
don't want to WAL-log everything, that's unnecessary. We want to only
start WAL-logging the index once the build completes, just like for
serial builds. And only do the fake LSNs, as for unlogged indexes etc.

So this introduces a separate flag is_parallel, which forces generating
the "proper" fake LSN. But we can still do is_build=true, and only log
the index at the end of the build.
---
 src/backend/access/gist/gist.c        |  37 +-
 src/backend/access/gist/gistbuild.c   | 713 +++++++++++++++++++++++++-
 src/backend/access/gist/gistutil.c    |  10 +-
 src/backend/access/gist/gistvacuum.c  |   6 +-
 src/backend/access/transam/parallel.c |   4 +
 src/include/access/gist_private.h     |  12 +-
 src/tools/pgindent/typedefs.list      |   2 +
 7 files changed, 739 insertions(+), 45 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index ed4ffa63a77..f5f56fb2503 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -75,7 +75,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = true;
 	amroutine->amusemaintenanceworkmem = false;
 	amroutine->amsummarizing = false;
@@ -182,7 +182,7 @@ gistinsert(Relation r, Datum *values, bool *isnull,
 						 values, isnull, true /* size is currently bogus */ );
 	itup->t_tid = *ht_ctid;
 
-	gistdoinsert(r, itup, 0, giststate, heapRel, false);
+	gistdoinsert(r, itup, 0, giststate, heapRel, false, false);
 
 	/* cleanup */
 	MemoryContextSwitchTo(oldCxt);
@@ -230,7 +230,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 				List **splitinfo,
 				bool markfollowright,
 				Relation heapRel,
-				bool is_build)
+				bool is_build,
+				bool is_parallel)
 {
 	BlockNumber blkno = BufferGetBlockNumber(buffer);
 	Page		page = BufferGetPage(buffer);
@@ -501,9 +502,17 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * smaller than any real or fake unlogged LSN that might be generated
 		 * later. (There can't be any concurrent scans during index build, so
 		 * we don't need to be able to detect concurrent splits yet.)
+		 *
+		 * However, with a parallel index build, we need to assign valid LSN,
+		 * as it's used to detect concurrent index modifications.
 		 */
 		if (is_build)
-			recptr = GistBuildLSN;
+		{
+			if (is_parallel)
+				recptr = gistGetFakeLSN(rel, is_parallel);
+			else
+				recptr = GistBuildLSN;
+		}
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -511,7 +520,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = gistGetFakeLSN(rel, false);
 		}
 
 		for (ptr = dist; ptr; ptr = ptr->next)
@@ -570,7 +579,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			MarkBufferDirty(leftchildbuf);
 
 		if (is_build)
-			recptr = GistBuildLSN;
+		{
+			if (is_parallel)
+				recptr = gistGetFakeLSN(rel, is_parallel);
+			else
+				recptr = GistBuildLSN;
+		}
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -589,7 +603,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 										leftchildbuf);
 			}
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = gistGetFakeLSN(rel, false);
 		}
 		PageSetLSN(page, recptr);
 
@@ -632,7 +646,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
  */
 void
 gistdoinsert(Relation r, IndexTuple itup, Size freespace,
-			 GISTSTATE *giststate, Relation heapRel, bool is_build)
+			 GISTSTATE *giststate, Relation heapRel, bool is_build,
+			 bool is_parallel)
 {
 	ItemId		iid;
 	IndexTuple	idxtuple;
@@ -646,6 +661,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	state.r = r;
 	state.heapRel = heapRel;
 	state.is_build = is_build;
+	state.is_parallel = is_parallel;
 
 	/* Start from the root */
 	firststack.blkno = GIST_ROOT_BLKNO;
@@ -1303,7 +1319,8 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 							   &splitinfo,
 							   true,
 							   state->heapRel,
-							   state->is_build);
+							   state->is_build,
+							   state->is_parallel);
 
 	/*
 	 * Before recursing up in case the page was split, release locks on the
@@ -1722,7 +1739,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			PageSetLSN(page, recptr);
 		}
 		else
-			PageSetLSN(page, gistGetFakeLSN(rel));
+			PageSetLSN(page, gistGetFakeLSN(rel, false));
 
 		END_CRIT_SECTION();
 	}
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index ba06df30faf..c8fa67beebb 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -36,18 +36,28 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
-
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
 
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIST_SHARED		UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000004)
+
 /* Step of index tuples for check whether to switch to buffering build mode */
 #define BUFFERING_MODE_SWITCH_CHECK_STEP 256
 
@@ -78,6 +88,106 @@ typedef enum
 	GIST_BUFFERING_ACTIVE,		/* in buffering build mode */
 } GistBuildMode;
 
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GISTShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 *
+	 * XXX nparticipants is the number of workers we expect to participate in
+	 * the build, possibly including the leader process.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			nparticipants;
+
+	/* Parameters determined by the leader, passed to the workers. */
+	GistBuildMode buildMode;
+	int			freespace;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can finish
+	 * building the index.
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields before heapdesc.
+	 *
+	 * These fields contain status information of interest to GIST index
+	 * builds that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GISTShared;
+
+/*
+ * Return pointer to a GISTShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGistShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GISTShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GISTLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipants is the exact number of worker processes successfully
+	 * launched, plus one leader process if it participates as a worker (only
+	 * DISABLE_LEADER_PARTICIPATION builds avoid leader participating as a
+	 * worker).
+	 *
+	 * XXX Seems a bit redundant with nparticipants in GISTShared. Although
+	 * that is the expected number, this is what we actually got.
+	 */
+	int			nparticipants;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GISTShared is the shared state for entire build. snapshot is the
+	 * snapshot used by the scan iff an MVCC snapshot is required.
+	 */
+	GISTShared *gistshared;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GISTLeader;
+
 /* Working state for gistbuild and its callback */
 typedef struct
 {
@@ -100,6 +210,14 @@ typedef struct
 	GISTBuildBuffers *gfbb;
 	HTAB	   *parentMap;
 
+	/*
+	 * gist_leader is only present when a parallel index build is performed,
+	 * and only in the leader process. (Actually, only the leader process has
+	 * a GISTBuildState.)
+	 */
+	bool		is_parallel;
+	GISTLeader *gist_leader;
+
 	/*
 	 * Extra data structures used during a sorting build.
 	 */
@@ -148,6 +266,12 @@ static void gistBuildCallback(Relation index,
 							  bool *isnull,
 							  bool tupleIsAlive,
 							  void *state);
+static void gistBuildParallelCallback(Relation index,
+									  ItemPointer tid,
+									  Datum *values,
+									  bool *isnull,
+									  bool tupleIsAlive,
+									  void *state);
 static void gistBufferingBuildInsert(GISTBuildState *buildstate,
 									 IndexTuple itup);
 static bool gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
@@ -171,6 +295,18 @@ static void gistMemorizeAllDownlinks(GISTBuildState *buildstate,
 									 Buffer parentbuf);
 static BlockNumber gistGetParent(GISTBuildState *buildstate, BlockNumber child);
 
+/* parallel index builds */
+static void _gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
+								 bool isconcurrent, int request);
+static void _gist_end_parallel(GISTLeader *gistleader, GISTBuildState *state);
+static Size _gist_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gist_parallel_heapscan(GISTBuildState *buildstate);
+static void _gist_leader_participate_as_worker(GISTBuildState *buildstate,
+											   Relation heap, Relation index);
+static void _gist_parallel_scan_and_build(GISTBuildState *buildstate,
+										  GISTShared *gistshared,
+										  Relation heap, Relation index,
+										  int workmem, bool progress);
 
 /*
  * Main entry point to GiST index build.
@@ -199,6 +335,10 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.sortstate = NULL;
 	buildstate.giststate = initGISTstate(index);
 
+	/* assume serial build */
+	buildstate.is_parallel = false;
+	buildstate.gist_leader = NULL;
+
 	/*
 	 * Create a temporary memory context that is reset once for each tuple
 	 * processed.  (Note: we don't bother to make this a child of the
@@ -309,37 +449,79 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 		END_CRIT_SECTION();
 
-		/* Scan the table, inserting all the tuples to the index. */
-		reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
-										   gistBuildCallback,
-										   (void *) &buildstate, NULL);
-
 		/*
-		 * If buffering was used, flush out all the tuples that are still in
-		 * the buffers.
+		 * Attempt to launch parallel worker scan when required
+		 *
+		 * XXX plan_create_index_workers makes the number of workers dependent
+		 * on maintenance_work_mem, requiring 32MB for each worker. That makes
+		 * sense for btree, but maybe not for GIST (at least when not using
+		 * buffering)? So maybe make that somehow less strict, optionally?
 		 */
-		if (buildstate.buildMode == GIST_BUFFERING_ACTIVE)
-		{
-			elog(DEBUG1, "all tuples processed, emptying buffers");
-			gistEmptyAllBuffers(&buildstate);
-			gistFreeBuildBuffers(buildstate.gfbb);
-		}
+		if (indexInfo->ii_ParallelWorkers > 0)
+			_gist_begin_parallel(&buildstate, heap,
+								 index, indexInfo->ii_Concurrent,
+								 indexInfo->ii_ParallelWorkers);
 
 		/*
-		 * We didn't write WAL records as we built the index, so if
-		 * WAL-logging is required, write all pages to the WAL now.
+		 * If parallel build requested and at least one worker process was
+		 * successfully launched, set up coordination state, wait for workers
+		 * to complete and end the parallel build.
+		 *
+		 * In serial mode, simply scan the table and build the index one index
+		 * tuple at a time.
 		 */
-		if (RelationNeedsWAL(index))
+		if (buildstate.gist_leader)
 		{
-			log_newpage_range(index, MAIN_FORKNUM,
-							  0, RelationGetNumberOfBlocks(index),
-							  true);
+			/* scan the relation and wait for parallel workers to finish */
+			reltuples = _gist_parallel_heapscan(&buildstate);
+
+			_gist_end_parallel(buildstate.gist_leader, &buildstate);
+
+			/*
+			 * We didn't write WAL records as we built the index, so if WAL-logging is
+			 * required, write all pages to the WAL now.
+			 */
+			if (RelationNeedsWAL(index))
+			{
+				log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+			}
 		}
-	}
+		else
+		{
+			/* Scan the table, inserting all the tuples to the index. */
+			reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+											   gistBuildCallback,
+											   (void *) &buildstate, NULL);
 
-	/* okay, all heap tuples are indexed */
-	MemoryContextSwitchTo(oldcxt);
-	MemoryContextDelete(buildstate.giststate->tempCxt);
+			/*
+			 * If buffering was used, flush out all the tuples that are still
+			 * in the buffers.
+			 */
+			if (buildstate.buildMode == GIST_BUFFERING_ACTIVE)
+			{
+				elog(DEBUG1, "all tuples processed, emptying buffers");
+				gistEmptyAllBuffers(&buildstate);
+				gistFreeBuildBuffers(buildstate.gfbb);
+			}
+
+			/*
+			 * We didn't write WAL records as we built the index, so if
+			 * WAL-logging is required, write all pages to the WAL now.
+			 */
+			if (RelationNeedsWAL(index))
+			{
+				log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+			}
+
+			/* okay, all heap tuples are indexed */
+			MemoryContextSwitchTo(oldcxt);
+			MemoryContextDelete(buildstate.giststate->tempCxt);
+		}
+	}
 
 	freeGISTstate(buildstate.giststate);
 
@@ -861,7 +1043,7 @@ gistBuildCallback(Relation index,
 		 * locked, we call gistdoinsert directly.
 		 */
 		gistdoinsert(index, itup, buildstate->freespace,
-					 buildstate->giststate, buildstate->heaprel, true);
+					 buildstate->giststate, buildstate->heaprel, true, false);
 	}
 
 	MemoryContextSwitchTo(oldCtx);
@@ -900,6 +1082,48 @@ gistBuildCallback(Relation index,
 	}
 }
 
+/*
+ * Per-tuple callback for table_index_build_scan.
+ *
+ * XXX Almost the same as gistBuildCallback, but with is_parallel=true when
+ * calling gistdoinsert. Otherwise we get assert failures due to workers
+ * modifying the index concurrently.
+ */
+static void
+gistBuildParallelCallback(Relation index,
+						  ItemPointer tid,
+						  Datum *values,
+						  bool *isnull,
+						  bool tupleIsAlive,
+						  void *state)
+{
+	GISTBuildState *buildstate = (GISTBuildState *) state;
+	IndexTuple	itup;
+	MemoryContext oldCtx;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->giststate->tempCxt);
+
+	/* form an index tuple and point it at the heap tuple */
+	itup = gistFormTuple(buildstate->giststate, index,
+						 values, isnull,
+						 true);
+	itup->t_tid = *tid;
+
+	/* Update tuple count and total size. */
+	buildstate->indtuples += 1;
+	buildstate->indtuplesSize += IndexTupleSize(itup);
+
+	/*
+	 * There's no buffers (yet). Since we already have the index relation
+	 * locked, we call gistdoinsert directly.
+	 */
+	gistdoinsert(index, itup, buildstate->freespace,
+				 buildstate->giststate, buildstate->heaprel, true, true);
+
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(buildstate->giststate->tempCxt);
+}
+
 /*
  * Insert function for buffering index build.
  */
@@ -1068,7 +1292,8 @@ gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer, int level,
 							   InvalidBuffer,
 							   &splitinfo,
 							   false,
-							   buildstate->heaprel, true);
+							   buildstate->heaprel, true,
+							   buildstate->is_parallel);
 
 	/*
 	 * If this is a root split, update the root path item kept in memory. This
@@ -1577,3 +1802,439 @@ gistGetParent(GISTBuildState *buildstate, BlockNumber child)
 
 	return entry->parentblkno;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's gistLeader, which caller must use to shut down parallel
+ * mode by passing it to _gist_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
+					 bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			nparticipants;
+	Snapshot	snapshot;
+	Size		estgistshared;
+	GISTShared *gistshared;
+	GISTLeader *gistleader = (GISTLeader *) palloc0(sizeof(GISTLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of GIST
+	 * index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gist_parallel_build_main",
+								 request);
+
+	nparticipants = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIST_SHARED workspace.
+	 */
+	estgistshared = _gist_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estgistshared);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	gistshared = (GISTShared *) shm_toc_allocate(pcxt->toc, estgistshared);
+	/* Initialize immutable state */
+	gistshared->heaprelid = RelationGetRelid(heap);
+	gistshared->indexrelid = RelationGetRelid(index);
+	gistshared->isconcurrent = isconcurrent;
+	gistshared->nparticipants = nparticipants;
+
+	/* Parameters determined by the leader, passed to the workers */
+	gistshared->buildMode = buildstate->buildMode;
+	gistshared->freespace = buildstate->freespace;
+
+	ConditionVariableInit(&gistshared->workersdonecv);
+	SpinLockInit(&gistshared->mutex);
+
+	/* Initialize mutable state */
+	gistshared->nparticipantsdone = 0;
+	gistshared->reltuples = 0.0;
+	gistshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGistShared(gistshared),
+								  snapshot);
+
+	/* Store shared state, for which we reserved space. */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIST_SHARED, gistshared);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	gistleader->pcxt = pcxt;
+	gistleader->nparticipants = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		gistleader->nparticipants++;
+	gistleader->gistshared = gistshared;
+	gistleader->snapshot = snapshot;
+	gistleader->walusage = walusage;
+	gistleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gist_end_parallel(gistleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->is_parallel = true;
+	buildstate->gist_leader = gistleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gist_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gist_end_parallel(GISTLeader *gistleader, GISTBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(gistleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < gistleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&gistleader->bufferusage[i], &gistleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(gistleader->snapshot))
+		UnregisterSnapshot(gistleader->snapshot);
+	DestroyParallelContext(gistleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gist_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe needs to flush data if GIST_BUFFERING_ACTIVE, a bit like in
+ * the serial build?
+ */
+static double
+_gist_parallel_heapscan(GISTBuildState *state)
+{
+	GISTShared *gistshared = state->gist_leader->gistshared;
+	int			nparticipants;
+
+	nparticipants = state->gist_leader->nparticipants;
+	for (;;)
+	{
+		SpinLockAcquire(&gistshared->mutex);
+		if (gistshared->nparticipantsdone == nparticipants)
+		{
+			/* copy the data into leader state */
+			state->indtuples = gistshared->indtuples;
+
+			SpinLockRelease(&gistshared->mutex);
+			break;
+		}
+		SpinLockRelease(&gistshared->mutex);
+
+		ConditionVariableSleep(&gistshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->indtuples;
+}
+
+
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gist index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gist_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GISTShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gist_leader_participate_as_worker(GISTBuildState *buildstate,
+								   Relation heap, Relation index)
+{
+	GISTLeader *gistleader = buildstate->gist_leader;
+	int			workmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	workmem = maintenance_work_mem / gistleader->nparticipants;
+
+	/* Perform work common to all participants */
+	_gist_parallel_scan_and_build(buildstate, gistleader->gistshared,
+								  heap, index, workmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel scan and insert.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gist_parallel_scan_and_build(GISTBuildState *state,
+							  GISTShared *gistshared,
+							  Relation heap, Relation index,
+							  int workmem, bool progress)
+{
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+	MemoryContext oldcxt = CurrentMemoryContext;
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = gistshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGistShared(gistshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+									   gistBuildParallelCallback, state, scan);
+
+	/*
+	 * If buffering was used, flush out all the tuples that are still in the
+	 * buffers.
+	 */
+	if (state->buildMode == GIST_BUFFERING_ACTIVE)
+	{
+		elog(DEBUG1, "all tuples processed, emptying buffers");
+		gistEmptyAllBuffers(state);
+		gistFreeBuildBuffers(state->gfbb);
+	}
+
+	/* okay, all heap tuples are indexed */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextDelete(state->giststate->tempCxt);
+
+	/* FIXME Do we need to do something else with active buffering? */
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&gistshared->mutex);
+	gistshared->nparticipantsdone++;
+	gistshared->reltuples += reltuples;
+	gistshared->indtuples += state->indtuples;
+	SpinLockRelease(&gistshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&gistshared->workersdonecv);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gist_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GISTShared *gistshared;
+	GISTBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			workmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up GIST shared state */
+	gistshared = shm_toc_lookup(toc, PARALLEL_KEY_GIST_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!gistshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(gistshared->heaprelid, heapLockmode);
+	indexRel = index_open(gistshared->indexrelid, indexLockmode);
+
+	buildstate.indexrel = indexRel;
+	buildstate.heaprel = heapRel;
+	buildstate.sortstate = NULL;
+	buildstate.giststate = initGISTstate(indexRel);
+
+	buildstate.is_parallel = true;
+	buildstate.gist_leader = NULL;
+
+	/*
+	 * Create a temporary memory context that is reset once for each tuple
+	 * processed.  (Note: we don't bother to make this a child of the
+	 * giststate's scanCxt, so we have to delete it separately at the end.)
+	 */
+	buildstate.giststate->tempCxt = createTempGistContext();
+
+	/* FIXME */
+	buildstate.buildMode = gistshared->buildMode;
+	buildstate.freespace = gistshared->freespace;
+
+	buildstate.indtuples = 0;
+	buildstate.indtuplesSize = 0;
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	workmem = maintenance_work_mem / gistshared->nparticipants;
+
+	_gist_parallel_scan_and_build(&buildstate, gistshared,
+								  heapRel, indexRel, workmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 78e98d68b15..733d5849317 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1012,7 +1012,7 @@ gistproperty(Oid index_oid, int attno,
  * purpose.
  */
 XLogRecPtr
-gistGetFakeLSN(Relation rel)
+gistGetFakeLSN(Relation rel, bool is_parallel)
 {
 	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
 	{
@@ -1035,8 +1035,12 @@ gistGetFakeLSN(Relation rel)
 		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
 		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
 
-		/* Shouldn't be called for WAL-logging relations */
-		Assert(!RelationNeedsWAL(rel));
+		/*
+		 * Shouldn't be called for WAL-logging relations, but parallel
+		 * builds are an exception - we need the fake LSN to detect
+		 * concurrent changes.
+		 */
+		Assert(is_parallel || !RelationNeedsWAL(rel));
 
 		/* No need for an actual record if we already have a distinct LSN */
 		if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 24fb94f473e..082804e9c7d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -181,7 +181,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(rel))
 		vstate.startNSN = GetInsertRecPtr();
 	else
-		vstate.startNSN = gistGetFakeLSN(rel);
+		vstate.startNSN = gistGetFakeLSN(rel, false);
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -376,7 +376,7 @@ restart:
 				PageSetLSN(page, recptr);
 			}
 			else
-				PageSetLSN(page, gistGetFakeLSN(rel));
+				PageSetLSN(page, gistGetFakeLSN(rel, false));
 
 			END_CRIT_SECTION();
 
@@ -664,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(info->index))
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
-		recptr = gistGetFakeLSN(info->index);
+		recptr = gistGetFakeLSN(info->index, false);
 	PageSetLSN(parentPage, recptr);
 	PageSetLSN(leafPage, recptr);
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb54..7e09fc79c30 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gist_private.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gist_parallel_build_main", _gist_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 7b8749c8db0..d5b22bc1018 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -20,6 +20,7 @@
 #include "lib/pairingheap.h"
 #include "storage/bufmgr.h"
 #include "storage/buffile.h"
+#include "storage/shm_toc.h"
 #include "utils/hsearch.h"
 #include "access/genam.h"
 
@@ -254,6 +255,7 @@ typedef struct
 	Relation	heapRel;
 	Size		freespace;		/* free space to be left */
 	bool		is_build;
+	bool		is_parallel;
 
 	GISTInsertStack *stack;
 } GISTInsertState;
@@ -413,7 +415,8 @@ extern void gistdoinsert(Relation r,
 						 Size freespace,
 						 GISTSTATE *giststate,
 						 Relation heapRel,
-						 bool is_build);
+						 bool is_build,
+						 bool is_parallel);
 
 /* A List of these is returned from gistplacetopage() in *splitinfo */
 typedef struct
@@ -430,7 +433,8 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 							List **splitinfo,
 							bool markfollowright,
 							Relation heapRel,
-							bool is_build);
+							bool is_build,
+							bool is_parallel);
 
 extern SplitPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 								  int len, GISTSTATE *giststate);
@@ -531,7 +535,7 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 							 GISTENTRY *entry2, bool isnull2,
 							 Datum *dst, bool *dstisnull);
 
-extern XLogRecPtr gistGetFakeLSN(Relation rel);
+extern XLogRecPtr gistGetFakeLSN(Relation rel, bool is_parallel);
 
 /* gistvacuum.c */
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
@@ -568,4 +572,6 @@ extern void gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb,
 											List *splitinfo);
 extern void gistUnloadNodeBuffers(GISTBuildBuffers *gfbb);
 
+extern void _gist_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIST_PRIVATE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 3deb6113b80..4a6385e022b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -973,6 +973,7 @@ GISTInsertStack
 GISTInsertState
 GISTIntArrayBigOptions
 GISTIntArrayOptions
+GISTLeader
 GISTNodeBuffer
 GISTNodeBufferPage
 GISTPageOpaque
@@ -983,6 +984,7 @@ GISTScanOpaque
 GISTScanOpaqueData
 GISTSearchHeapItem
 GISTSearchItem
+GISTShared
 GISTTYPE
 GIST_SPLITVEC
 GMReaderTupleBuffer
-- 
2.45.2

v20240730-0002-atomic-LSN-counter.patchtext/x-patch; charset=UTF-8; name=v20240730-0002-atomic-LSN-counter.patchDownload
From a2f606d5dba1d7902f925a1c9af95cd27ade7488 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 29 Jul 2024 18:03:44 +0200
Subject: [PATCH v20240730 2/6] atomic LSN counter

---
 src/backend/access/gist/gist.c       | 24 ++++++++++++------------
 src/backend/access/gist/gistbuild.c  | 19 +++++++++++++++----
 src/backend/access/gist/gistutil.c   | 10 +++-------
 src/backend/access/gist/gistvacuum.c |  6 +++---
 src/include/access/gist_private.h    |  8 ++++----
 5 files changed, 37 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index f5f56fb2503..63d5580120d 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -182,7 +182,7 @@ gistinsert(Relation r, Datum *values, bool *isnull,
 						 values, isnull, true /* size is currently bogus */ );
 	itup->t_tid = *ht_ctid;
 
-	gistdoinsert(r, itup, 0, giststate, heapRel, false, false);
+	gistdoinsert(r, itup, 0, giststate, heapRel, false, NULL);
 
 	/* cleanup */
 	MemoryContextSwitchTo(oldCxt);
@@ -231,7 +231,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 				bool markfollowright,
 				Relation heapRel,
 				bool is_build,
-				bool is_parallel)
+				pg_atomic_uint64 *fakelsn)
 {
 	BlockNumber blkno = BufferGetBlockNumber(buffer);
 	Page		page = BufferGetPage(buffer);
@@ -508,8 +508,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 */
 		if (is_build)
 		{
-			if (is_parallel)
-				recptr = gistGetFakeLSN(rel, is_parallel);
+			if (fakelsn)
+				recptr = pg_atomic_fetch_add_u64(fakelsn, 1);
 			else
 				recptr = GistBuildLSN;
 		}
@@ -520,7 +520,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
 			else
-				recptr = gistGetFakeLSN(rel, false);
+				recptr = gistGetFakeLSN(rel);
 		}
 
 		for (ptr = dist; ptr; ptr = ptr->next)
@@ -580,8 +580,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 
 		if (is_build)
 		{
-			if (is_parallel)
-				recptr = gistGetFakeLSN(rel, is_parallel);
+			if (fakelsn)
+				recptr = pg_atomic_fetch_add_u64(fakelsn, 1);
 			else
 				recptr = GistBuildLSN;
 		}
@@ -603,7 +603,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 										leftchildbuf);
 			}
 			else
-				recptr = gistGetFakeLSN(rel, false);
+				recptr = gistGetFakeLSN(rel);
 		}
 		PageSetLSN(page, recptr);
 
@@ -647,7 +647,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 void
 gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			 GISTSTATE *giststate, Relation heapRel, bool is_build,
-			 bool is_parallel)
+			 pg_atomic_uint64 *fakelsn)
 {
 	ItemId		iid;
 	IndexTuple	idxtuple;
@@ -661,7 +661,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	state.r = r;
 	state.heapRel = heapRel;
 	state.is_build = is_build;
-	state.is_parallel = is_parallel;
+	state.fakelsn = fakelsn;
 
 	/* Start from the root */
 	firststack.blkno = GIST_ROOT_BLKNO;
@@ -1320,7 +1320,7 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 							   true,
 							   state->heapRel,
 							   state->is_build,
-							   state->is_parallel);
+							   state->fakelsn);
 
 	/*
 	 * Before recursing up in case the page was split, release locks on the
@@ -1739,7 +1739,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			PageSetLSN(page, recptr);
 		}
 		else
-			PageSetLSN(page, gistGetFakeLSN(rel, false));
+			PageSetLSN(page, gistGetFakeLSN(rel));
 
 		END_CRIT_SECTION();
 	}
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index c8fa67beebb..c7ac2487cda 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -140,6 +140,9 @@ typedef struct GISTShared
 	double		reltuples;
 	double		indtuples;
 
+	/* Used to generate LSNs during parallel build. */
+	pg_atomic_uint64	parallelLSN;
+
 	/*
 	 * ParallelTableScanDescData data follows. Can't directly embed here, as
 	 * implementations of the parallel table scan desc interface might need
@@ -217,6 +220,7 @@ typedef struct
 	 */
 	bool		is_parallel;
 	GISTLeader *gist_leader;
+	pg_atomic_uint64 *fakelsn;
 
 	/*
 	 * Extra data structures used during a sorting build.
@@ -1043,7 +1047,7 @@ gistBuildCallback(Relation index,
 		 * locked, we call gistdoinsert directly.
 		 */
 		gistdoinsert(index, itup, buildstate->freespace,
-					 buildstate->giststate, buildstate->heaprel, true, false);
+					 buildstate->giststate, buildstate->heaprel, true, NULL);
 	}
 
 	MemoryContextSwitchTo(oldCtx);
@@ -1098,6 +1102,7 @@ gistBuildParallelCallback(Relation index,
 						  void *state)
 {
 	GISTBuildState *buildstate = (GISTBuildState *) state;
+
 	IndexTuple	itup;
 	MemoryContext oldCtx;
 
@@ -1118,7 +1123,8 @@ gistBuildParallelCallback(Relation index,
 	 * locked, we call gistdoinsert directly.
 	 */
 	gistdoinsert(index, itup, buildstate->freespace,
-				 buildstate->giststate, buildstate->heaprel, true, true);
+				 buildstate->giststate, buildstate->heaprel, true,
+				 buildstate->fakelsn);
 
 	MemoryContextSwitchTo(oldCtx);
 	MemoryContextReset(buildstate->giststate->tempCxt);
@@ -1292,8 +1298,7 @@ gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer, int level,
 							   InvalidBuffer,
 							   &splitinfo,
 							   false,
-							   buildstate->heaprel, true,
-							   buildstate->is_parallel);
+							   buildstate->heaprel, true, NULL);
 
 	/*
 	 * If this is a root split, update the root path item kept in memory. This
@@ -1925,6 +1930,9 @@ _gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
 	gistshared->reltuples = 0.0;
 	gistshared->indtuples = 0.0;
 
+	/* initialize the counter used to generate fake LSNs */
+	pg_atomic_init_u64(&gistshared->parallelLSN, 1);
+
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGistShared(gistshared),
 								  snapshot);
@@ -1974,6 +1982,7 @@ _gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
 	/* Save leader state now that it's clear build will be parallel */
 	buildstate->is_parallel = true;
 	buildstate->gist_leader = gistleader;
+	buildstate->fakelsn = &gistshared->parallelLSN;
 
 	/* Join heap scan ourselves */
 	if (leaderparticipates)
@@ -2202,6 +2211,8 @@ _gist_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	buildstate.is_parallel = true;
 	buildstate.gist_leader = NULL;
 
+	buildstate.fakelsn = &gistshared->parallelLSN;
+
 	/*
 	 * Create a temporary memory context that is reset once for each tuple
 	 * processed.  (Note: we don't bother to make this a child of the
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 733d5849317..78e98d68b15 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1012,7 +1012,7 @@ gistproperty(Oid index_oid, int attno,
  * purpose.
  */
 XLogRecPtr
-gistGetFakeLSN(Relation rel, bool is_parallel)
+gistGetFakeLSN(Relation rel)
 {
 	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
 	{
@@ -1035,12 +1035,8 @@ gistGetFakeLSN(Relation rel, bool is_parallel)
 		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
 		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
 
-		/*
-		 * Shouldn't be called for WAL-logging relations, but parallel
-		 * builds are an exception - we need the fake LSN to detect
-		 * concurrent changes.
-		 */
-		Assert(is_parallel || !RelationNeedsWAL(rel));
+		/* Shouldn't be called for WAL-logging relations */
+		Assert(!RelationNeedsWAL(rel));
 
 		/* No need for an actual record if we already have a distinct LSN */
 		if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 082804e9c7d..24fb94f473e 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -181,7 +181,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(rel))
 		vstate.startNSN = GetInsertRecPtr();
 	else
-		vstate.startNSN = gistGetFakeLSN(rel, false);
+		vstate.startNSN = gistGetFakeLSN(rel);
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -376,7 +376,7 @@ restart:
 				PageSetLSN(page, recptr);
 			}
 			else
-				PageSetLSN(page, gistGetFakeLSN(rel, false));
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
 			END_CRIT_SECTION();
 
@@ -664,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(info->index))
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
-		recptr = gistGetFakeLSN(info->index, false);
+		recptr = gistGetFakeLSN(info->index);
 	PageSetLSN(parentPage, recptr);
 	PageSetLSN(leafPage, recptr);
 
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index d5b22bc1018..09f535ff56e 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -255,7 +255,7 @@ typedef struct
 	Relation	heapRel;
 	Size		freespace;		/* free space to be left */
 	bool		is_build;
-	bool		is_parallel;
+	pg_atomic_uint64 *fakelsn;
 
 	GISTInsertStack *stack;
 } GISTInsertState;
@@ -416,7 +416,7 @@ extern void gistdoinsert(Relation r,
 						 GISTSTATE *giststate,
 						 Relation heapRel,
 						 bool is_build,
-						 bool is_parallel);
+						 pg_atomic_uint64 *fakelsn);
 
 /* A List of these is returned from gistplacetopage() in *splitinfo */
 typedef struct
@@ -434,7 +434,7 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 							bool markfollowright,
 							Relation heapRel,
 							bool is_build,
-							bool is_parallel);
+							pg_atomic_uint64 *fakelsn);
 
 extern SplitPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 								  int len, GISTSTATE *giststate);
@@ -535,7 +535,7 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 							 GISTENTRY *entry2, bool isnull2,
 							 Datum *dst, bool *dstisnull);
 
-extern XLogRecPtr gistGetFakeLSN(Relation rel, bool is_parallel);
+extern XLogRecPtr gistGetFakeLSN(Relation rel);
 
 /* gistvacuum.c */
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
-- 
2.45.2
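
To make the counter-based approach above concrete: each participant ends up drawing its fake LSNs from the pg_atomic_uint64 in the DSM segment, roughly like this (a minimal sketch; the helper name is made up, the patch increments the counter inline where needed):

/*
 * Draw the next fake LSN from the shared counter in GISTShared.
 * pg_atomic_add_fetch_u64() returns the incremented value, so each
 * caller gets a distinct, monotonically increasing LSN even when
 * multiple workers insert into the index concurrently.
 */
static inline XLogRecPtr
gist_next_fake_lsn(pg_atomic_uint64 *fakelsn)
{
	return (XLogRecPtr) pg_atomic_add_fetch_u64(fakelsn, 1);
}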

#13Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Tomas Vondra (#12)
Re: WIP: parallel GiST index builds

On 30 Jul 2024, at 14:05, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 7/26/24 14:13, Andrey M. Borodin wrote:

On 26 Jul 2024, at 14:30, Andreas Karlsson <andreas@proxel.se> wrote:

I feel the tricky part about doing that is that we need to make sure the fake LSNs are all less than the current real LSN when the index build completes, and while that normally should be the case, we will have an almost never exercised code path for when the fake LSN becomes bigger than the real LSN, which may contain bugs. Is that really worth optimizing?

But if we are going to use a fake LSN: since the index being built is not visible to any scans, we do not have to use GetFakeLSNForUnloggedRel() but could use our own counter in shared memory in the GISTShared struct for this specific index, which starts at FirstNormalUnloggedLSN. This would give us slightly less contention and decrease the risk (for good and bad) of the fake LSN being larger than the real LSN.

+1 for atomic counter in GISTShared.

I tried implementing this, see the attached 0002 patch that replaces the
fake LSN with an atomic counter in shared memory. It seems to work (more
testing needed), but I can't say I'm very happy with the code :-(

I agree. Passing this pointer everywhere seems ugly.

The way it passes the shared counter to places that actually need it is
pretty ugly. The thing is - the counter needs to be in shared memory,
but places like gistplacetopage() have no idea/need of that. I chose to
simply pass a pg_atomic_uint64 pointer, but that's ... not pretty. Is
there a better way to do this?

I thought maybe we could simply increment the counter before each call
and pass the LSN value - 64 bits should be enough, not sure about the
overhead. But gistplacetopage() also uses the LSN twice, and I'm not
sure it'd be legal to use the same value twice.

Any better ideas?

How about a global pointer to the fake LSN?
Just set it to the correct pointer during a parallel build, and to NULL otherwise.

BTW we can just reset LSNs to GistBuildLSN just before doing log_newpage_range().

Why would the reset be necessary? Doesn't log_newpage_range() set the
page LSN to the current insert LSN? So why reset that?

I'm not sure about the discussion about NSN and the need to handle the
case when NSN / fake LSN values get ahead of LSN. Is that really a
problem? If the values generated from the counter are private to the
index build, and log_newpage_range() replaces them with current LSN, do
we still need to worry about it?

Stamping pages with the new real LSN will do the trick. I didn’t know that log_newpage_range() is already doing so.

How do we synchronize the shared fake LSN with the global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary?
Perhaps consider using GetFakeLSNForUnloggedRel() instead of the shared counter? As long as we do not care about FakeLSN > RealLSN.

Best regards, Andrey Borodin.
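
For readers following the LSN discussion: the reason the fake LSNs must keep advancing is the concurrent-split detection GiST performs while descending the tree. Simplified from gistdoinsert() in gist.c (surrounding logic omitted):

/*
 * The page was concurrently split if it still carries the FOLLOW_RIGHT
 * flag, or if its NSN is newer than the LSN the parent page had when
 * we followed the downlink. With a constant fake LSN ("1") the NSN
 * comparison can never fire, which is why parallel builds need page
 * LSNs that actually move.
 */
if (GistFollowRight(stack->page) ||
	stack->parent->lsn < GistPageGetNSN(stack->page))
{
	/* handle the split before inserting (see gistdoinsert) */
}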

#14Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andrey M. Borodin (#13)
Re: WIP: parallel GiST index builds

On 7/30/24 11:39, Andrey M. Borodin wrote:

On 30 Jul 2024, at 14:05, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

On 7/26/24 14:13, Andrey M. Borodin wrote:

On 26 Jul 2024, at 14:30, Andreas Karlsson <andreas@proxel.se> wrote:

I feel the tricky part about doing that is that we need to make sure the fake LSNs are all less than the current real LSN when the index build completes, and while that normally should be the case, we will have an almost never exercised code path for when the fake LSN becomes bigger than the real LSN, which may contain bugs. Is that really worth optimizing?

But if we are going to use a fake LSN: since the index being built is not visible to any scans, we do not have to use GetFakeLSNForUnloggedRel() but could use our own counter in shared memory in the GISTShared struct for this specific index, which starts at FirstNormalUnloggedLSN. This would give us slightly less contention and decrease the risk (for good and bad) of the fake LSN being larger than the real LSN.

+1 for atomic counter in GISTShared.

I tried implementing this, see the attached 0002 patch that replaces the
fake LSN with an atomic counter in shared memory. It seems to work (more
testing needed), but I can't say I'm very happy with the code :-(

I agree. Passing this pointer everywhere seems ugly.

Yeah.

The way it passes the shared counter to places that actually need it is
pretty ugly. The thing is - the counter needs to be in shared memory,
but places like gistplacetopage() have no idea/need of that. I chose to
simply pass a pg_atomic_uint64 pointer, but that's ... not pretty. Is
there a better way to do this?

I thought maybe we could simply increment the counter before each call
and pass the LSN value - 64 bits should be enough, not sure about the
overhead. But gistplacetopage() also uses the LSN twice, and I'm not
sure it'd be legal to use the same value twice.

Any better ideas?

How about a global pointer to the fake LSN?
Just set it to the correct pointer during a parallel build, and to NULL otherwise.

I'm not sure adding a global variable is pretty either. What if there's
some error, for example? Will it get reset to NULL?

BTW we can just reset LSNs to GistBuildLSN just before doing log_newpage_range().

Why would the reset be necessary? Doesn't log_newpage_range() set the
page LSN to the current insert LSN? So why reset that?

I'm not sure about the discussion about NSN and the need to handle the
case when NSN / fake LSN values get ahead of LSN. Is that really a
problem? If the values generated from the counter are private to the
index build, and log_newpage_range() replaces them with current LSN, do
we still need to worry about it?

Stamping pages with the new real LSN will do the trick. I didn’t know that log_newpage_range() is already doing so.

I believe it does, or at least that's what this code at the end is
meant to do:

recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);

for (i = 0; i < nbufs; i++)
{
PageSetLSN(BufferGetPage(bufpack[i]), recptr);
UnlockReleaseBuffer(bufpack[i]);
}

Unless I misunderstood what this does.

How do we synchronize the shared fake LSN with the global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary?
Perhaps consider using GetFakeLSNForUnloggedRel() instead of the shared counter? As long as we do not care about FakeLSN > RealLSN.

I'm confused. How is this related to unloggedLSN at all?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#15Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Tomas Vondra (#14)
Re: WIP: parallel GiST index builds

On 30 Jul 2024, at 14:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

How do we synchronize the shared fake LSN with the global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary?
Perhaps consider using GetFakeLSNForUnloggedRel() instead of the shared counter? As long as we do not care about FakeLSN > RealLSN.

I'm confused. How is this related to unloggedLSN at all?

Parallel build should work for both logged and unlogged indexes.
If we use a fake LSN in shared memory, we have to make sure that FakeLSN < XLogCtl->unloggedLSN after the build.
Alternatively, we can just use XLogCtl->unloggedLSN instead of a fake LSN in shared memory.

In other words, I propose to use GetFakeLSNForUnloggedRel() instead of "pg_atomic_uint64 *fakelsn;".

Best regards, Andrey Borodin.
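
For context, GetFakeLSNForUnloggedRel() is already exactly such a shared atomic counter, just cluster-wide rather than per-build. Roughly, per src/backend/access/transam/xlog.c (older branches protect the counter with a spinlock instead of an atomic):

XLogRecPtr
GetFakeLSNForUnloggedRel(void)
{
	/* unloggedLSN starts at FirstNormalUnloggedLSN and only increases */
	return pg_atomic_fetch_add_u64(&XLogCtl->unloggedLSN, 1);
}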

#16Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andrey M. Borodin (#15)
Re: WIP: parallel GiST index builds

On 7/30/24 13:31, Andrey M. Borodin wrote:

On 30 Jul 2024, at 14:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

How do we synchronize the shared fake LSN with the global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary?
Perhaps consider using GetFakeLSNForUnloggedRel() instead of the shared counter? As long as we do not care about FakeLSN > RealLSN.

I'm confused. How is this related to unloggedLSN at all?

Parallel build should work for both logged and unlogged indexes.

Agreed, no argument here.

If we use a fake LSN in shared memory, we have to make sure that FakeLSN < XLogCtl->unloggedLSN after the build.
Alternatively, we can just use XLogCtl->unloggedLSN instead of a fake LSN in shared memory.

Ah, right. For unlogged relations we won't invoke log_newpage_range(),
so we'd end up with the bogus page LSNs ...

In other words, I propose to use GetFakeLSNForUnloggedRel() instead of "pg_atomic_uint64 *fakelsn;".

Interesting idea. IIRC you suggested this earlier, but I didn't realize
it has the benefit of already using an atomic counter - which solves the
"ugliness" of my patch.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Andreas Karlsson
andreas@proxel.se
In reply to: Andrey M. Borodin (#15)
Re: WIP: parallel GiST index builds

On 7/30/24 1:31 PM, Andrey M. Borodin wrote:

On 30 Jul 2024, at 14:57, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

How do we synchronize the shared fake LSN with the global XLogCtl->unloggedLSN? Just bump XLogCtl->unloggedLSN if necessary?
Perhaps consider using GetFakeLSNForUnloggedRel() instead of the shared counter? As long as we do not care about FakeLSN > RealLSN.

I'm confused. How is this related to unloggedLSN at all?

Parallel build should work for both logged and unlogged indexes.
If we use a fake LSN in shared memory, we have to make sure that FakeLSN < XLogCtl->unloggedLSN after the build.
Alternatively, we can just use XLogCtl->unloggedLSN instead of a fake LSN in shared memory.

In other words, I propose to use GetFakeLSNForUnloggedRel() instead of "pg_atomic_uint64 *fakelsn;".

Yeah,

Great point. Given the ugliness of passing around the fakelsn, we might
as well just use GetFakeLSNForUnloggedRel().

Andreas

#18Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#16)
2 attachment(s)
Re: WIP: parallel GiST index builds

Hi,

Here's an updated patch using GetFakeLSNForUnloggedRel() instead of the
atomic counter. I think this looks much nicer and less invasive, as it
simply uses XLogCtl shared memory (instead of having to pass a new
pointer everywhere).

We still need to pass the is_parallel flag, though. I wonder if we could
get rid of that too, and just use GetFakeLSNForUnloggedRel() for both
parallel and non-parallel builds? Why wouldn't that work?

I've spent quite a bit of time testing this, but mostly for correctness.
I haven't redone the benchmarks; that's on my TODO.

regards

--
Tomas Vondra

Attachments:

v20240805-0001-WIP-parallel-GiST-build.patch (text/x-patch; charset=UTF-8)
From c2766218c657097bb969c97d664c12803b374ff0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 26 May 2024 21:44:27 +0200
Subject: [PATCH v20240805 1/6] WIP parallel GiST build

Implements parallel GiST index build for the unsorted case. The build
simply starts parallel workers that insert values into the index the
usual way (as if there were multiple clients doing INSERT).

The basic infrastructure is copied from parallel BRIN builds (and also
from the nearby parallel GIN build). There's nothing particularly
special or interesting, except for the gistBuildParallelCallback()
callback. The two significant changes in the callback are:

1) disabling buffering

Buffered builds assume the worker is the only backend that can split
index pages etc. With serial workers that is trivially true, but with
parallel workers this leads to confusion.

In principle this is solvable by moving the buffers into shared memory
and coordinating the workers (locking etc.). But the patch does not do
that yet - it's clearly non-trivial, and I'm not really convinced it's
worth it.

2) generating "proper" fake LSNs

The serial builds disable all WAL-logging for the index, until the very
end when the whole index is WAL-logged. This however also means we don't
set page LSNs on the index pages - but page LSNs are used to detect
concurrent changes to the index structure (e.g. page splits). For serial
builds this does not matter, because only the build worker can modify
the index, so it just sets the same LSN "1" for all pages. Both of these
(disabling WAL-logging and using bogus page LSNs) are controlled by the
same is_build flag.

Having the same page LSN does not work for parallel builds, as it would
mean workers won't notice splits done by other workers, etc.

One option would be to set is_build=false, which enables WAL-logging, as
if during regular inserts, and also assigns proper page LSNs. But we
don't want to WAL-log everything; that's unnecessary. We want to only
start WAL-logging the index once the build completes, just like for
serial builds, and until then use only fake LSNs, as for unlogged
indexes etc.

So this introduces a separate flag is_parallel, which forces generating
the "proper" fake LSN. But we can still do is_build=true, and only log
the index at the end of the build.
---
 src/backend/access/gist/gist.c        |  37 +-
 src/backend/access/gist/gistbuild.c   | 713 +++++++++++++++++++++++++-
 src/backend/access/gist/gistutil.c    |  10 +-
 src/backend/access/gist/gistvacuum.c  |   6 +-
 src/backend/access/transam/parallel.c |   4 +
 src/include/access/gist_private.h     |  12 +-
 src/tools/pgindent/typedefs.list      |   2 +
 7 files changed, 739 insertions(+), 45 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index ed4ffa63a77..f5f56fb2503 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -75,7 +75,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = true;
 	amroutine->amusemaintenanceworkmem = false;
 	amroutine->amsummarizing = false;
@@ -182,7 +182,7 @@ gistinsert(Relation r, Datum *values, bool *isnull,
 						 values, isnull, true /* size is currently bogus */ );
 	itup->t_tid = *ht_ctid;
 
-	gistdoinsert(r, itup, 0, giststate, heapRel, false);
+	gistdoinsert(r, itup, 0, giststate, heapRel, false, false);
 
 	/* cleanup */
 	MemoryContextSwitchTo(oldCxt);
@@ -230,7 +230,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 				List **splitinfo,
 				bool markfollowright,
 				Relation heapRel,
-				bool is_build)
+				bool is_build,
+				bool is_parallel)
 {
 	BlockNumber blkno = BufferGetBlockNumber(buffer);
 	Page		page = BufferGetPage(buffer);
@@ -501,9 +502,17 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * smaller than any real or fake unlogged LSN that might be generated
 		 * later. (There can't be any concurrent scans during index build, so
 		 * we don't need to be able to detect concurrent splits yet.)
+		 *
+		 * However, with a parallel index build, we need to assign a valid LSN,
+		 * as it's used to detect concurrent index modifications.
 		 */
 		if (is_build)
-			recptr = GistBuildLSN;
+		{
+			if (is_parallel)
+				recptr = gistGetFakeLSN(rel, is_parallel);
+			else
+				recptr = GistBuildLSN;
+		}
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -511,7 +520,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = gistGetFakeLSN(rel, false);
 		}
 
 		for (ptr = dist; ptr; ptr = ptr->next)
@@ -570,7 +579,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			MarkBufferDirty(leftchildbuf);
 
 		if (is_build)
-			recptr = GistBuildLSN;
+		{
+			if (is_parallel)
+				recptr = gistGetFakeLSN(rel, is_parallel);
+			else
+				recptr = GistBuildLSN;
+		}
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -589,7 +603,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 										leftchildbuf);
 			}
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = gistGetFakeLSN(rel, false);
 		}
 		PageSetLSN(page, recptr);
 
@@ -632,7 +646,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
  */
 void
 gistdoinsert(Relation r, IndexTuple itup, Size freespace,
-			 GISTSTATE *giststate, Relation heapRel, bool is_build)
+			 GISTSTATE *giststate, Relation heapRel, bool is_build,
+			 bool is_parallel)
 {
 	ItemId		iid;
 	IndexTuple	idxtuple;
@@ -646,6 +661,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	state.r = r;
 	state.heapRel = heapRel;
 	state.is_build = is_build;
+	state.is_parallel = is_parallel;
 
 	/* Start from the root */
 	firststack.blkno = GIST_ROOT_BLKNO;
@@ -1303,7 +1319,8 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 							   &splitinfo,
 							   true,
 							   state->heapRel,
-							   state->is_build);
+							   state->is_build,
+							   state->is_parallel);
 
 	/*
 	 * Before recursing up in case the page was split, release locks on the
@@ -1722,7 +1739,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			PageSetLSN(page, recptr);
 		}
 		else
-			PageSetLSN(page, gistGetFakeLSN(rel));
+			PageSetLSN(page, gistGetFakeLSN(rel, false));
 
 		END_CRIT_SECTION();
 	}
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index ba06df30faf..c8fa67beebb 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -36,18 +36,28 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
-
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
 
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIST_SHARED		UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000004)
+
 /* Step of index tuples for check whether to switch to buffering build mode */
 #define BUFFERING_MODE_SWITCH_CHECK_STEP 256
 
@@ -78,6 +88,106 @@ typedef enum
 	GIST_BUFFERING_ACTIVE,		/* in buffering build mode */
 } GistBuildMode;
 
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GISTShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 *
+	 * XXX nparticipants is the number of workers we expect to participate in
+	 * the build, possibly including the leader process.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			nparticipants;
+
+	/* Parameters determined by the leader, passed to the workers. */
+	GistBuildMode buildMode;
+	int			freespace;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can finish
+	 * building the index.
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields up to the trailing parallel scan data.
+	 *
+	 * These fields contain status information of interest to GIST index
+	 * builds that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GISTShared;
+
+/*
+ * Return pointer to a GISTShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGistShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GISTShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GISTLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipants is the exact number of worker processes successfully
+	 * launched, plus one leader process if it participates as a worker (only
+	 * DISABLE_LEADER_PARTICIPATION builds avoid leader participating as a
+	 * worker).
+	 *
+	 * XXX Seems a bit redundant with nparticipants in GISTShared. Although
+	 * that is the expected number, this is what we actually got.
+	 */
+	int			nparticipants;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GISTShared is the shared state for entire build. snapshot is the
+	 * snapshot used by the scan iff an MVCC snapshot is required.
+	 */
+	GISTShared *gistshared;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GISTLeader;
+
 /* Working state for gistbuild and its callback */
 typedef struct
 {
@@ -100,6 +210,14 @@ typedef struct
 	GISTBuildBuffers *gfbb;
 	HTAB	   *parentMap;
 
+	/*
+	 * gist_leader is only present when a parallel index build is performed,
+	 * and only in the leader process. (Actually, only the leader process has
+	 * a GISTBuildState.)
+	 */
+	bool		is_parallel;
+	GISTLeader *gist_leader;
+
 	/*
 	 * Extra data structures used during a sorting build.
 	 */
@@ -148,6 +266,12 @@ static void gistBuildCallback(Relation index,
 							  bool *isnull,
 							  bool tupleIsAlive,
 							  void *state);
+static void gistBuildParallelCallback(Relation index,
+									  ItemPointer tid,
+									  Datum *values,
+									  bool *isnull,
+									  bool tupleIsAlive,
+									  void *state);
 static void gistBufferingBuildInsert(GISTBuildState *buildstate,
 									 IndexTuple itup);
 static bool gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
@@ -171,6 +295,18 @@ static void gistMemorizeAllDownlinks(GISTBuildState *buildstate,
 									 Buffer parentbuf);
 static BlockNumber gistGetParent(GISTBuildState *buildstate, BlockNumber child);
 
+/* parallel index builds */
+static void _gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
+								 bool isconcurrent, int request);
+static void _gist_end_parallel(GISTLeader *gistleader, GISTBuildState *state);
+static Size _gist_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gist_parallel_heapscan(GISTBuildState *buildstate);
+static void _gist_leader_participate_as_worker(GISTBuildState *buildstate,
+											   Relation heap, Relation index);
+static void _gist_parallel_scan_and_build(GISTBuildState *buildstate,
+										  GISTShared *gistshared,
+										  Relation heap, Relation index,
+										  int workmem, bool progress);
 
 /*
  * Main entry point to GiST index build.
@@ -199,6 +335,10 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.sortstate = NULL;
 	buildstate.giststate = initGISTstate(index);
 
+	/* assume serial build */
+	buildstate.is_parallel = false;
+	buildstate.gist_leader = NULL;
+
 	/*
 	 * Create a temporary memory context that is reset once for each tuple
 	 * processed.  (Note: we don't bother to make this a child of the
@@ -309,37 +449,79 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 		END_CRIT_SECTION();
 
-		/* Scan the table, inserting all the tuples to the index. */
-		reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
-										   gistBuildCallback,
-										   (void *) &buildstate, NULL);
-
 		/*
-		 * If buffering was used, flush out all the tuples that are still in
-		 * the buffers.
+		 * Attempt to launch parallel worker scan when required
+		 *
+		 * XXX plan_create_index_workers makes the number of workers dependent
+		 * on maintenance_work_mem, requiring 32MB for each worker. That makes
+		 * sense for btree, but maybe not for GIST (at least when not using
+		 * buffering)? So maybe make that somehow less strict, optionally?
 		 */
-		if (buildstate.buildMode == GIST_BUFFERING_ACTIVE)
-		{
-			elog(DEBUG1, "all tuples processed, emptying buffers");
-			gistEmptyAllBuffers(&buildstate);
-			gistFreeBuildBuffers(buildstate.gfbb);
-		}
+		if (indexInfo->ii_ParallelWorkers > 0)
+			_gist_begin_parallel(&buildstate, heap,
+								 index, indexInfo->ii_Concurrent,
+								 indexInfo->ii_ParallelWorkers);
 
 		/*
-		 * We didn't write WAL records as we built the index, so if
-		 * WAL-logging is required, write all pages to the WAL now.
+		 * If parallel build requested and at least one worker process was
+		 * successfully launched, set up coordination state, wait for workers
+		 * to complete and end the parallel build.
+		 *
+		 * In serial mode, simply scan the table and build the index one index
+		 * tuple at a time.
 		 */
-		if (RelationNeedsWAL(index))
+		if (buildstate.gist_leader)
 		{
-			log_newpage_range(index, MAIN_FORKNUM,
-							  0, RelationGetNumberOfBlocks(index),
-							  true);
+			/* scan the relation and wait for parallel workers to finish */
+			reltuples = _gist_parallel_heapscan(&buildstate);
+
+			_gist_end_parallel(buildstate.gist_leader, &buildstate);
+
+			/*
+			 * We didn't write WAL records as we built the index, so if
+			 * WAL-logging is required, write all pages to the WAL now.
+			 */
+			if (RelationNeedsWAL(index))
+			{
+				log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+			}
 		}
-	}
+		else
+		{
+			/* Scan the table, inserting all the tuples to the index. */
+			reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+											   gistBuildCallback,
+											   (void *) &buildstate, NULL);
 
-	/* okay, all heap tuples are indexed */
-	MemoryContextSwitchTo(oldcxt);
-	MemoryContextDelete(buildstate.giststate->tempCxt);
+			/*
+			 * If buffering was used, flush out all the tuples that are still
+			 * in the buffers.
+			 */
+			if (buildstate.buildMode == GIST_BUFFERING_ACTIVE)
+			{
+				elog(DEBUG1, "all tuples processed, emptying buffers");
+				gistEmptyAllBuffers(&buildstate);
+				gistFreeBuildBuffers(buildstate.gfbb);
+			}
+
+			/*
+			 * We didn't write WAL records as we built the index, so if
+			 * WAL-logging is required, write all pages to the WAL now.
+			 */
+			if (RelationNeedsWAL(index))
+			{
+				log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+			}
+
+			/* okay, all heap tuples are indexed */
+			MemoryContextSwitchTo(oldcxt);
+			MemoryContextDelete(buildstate.giststate->tempCxt);
+		}
+	}
 
 	freeGISTstate(buildstate.giststate);
 
@@ -861,7 +1043,7 @@ gistBuildCallback(Relation index,
 		 * locked, we call gistdoinsert directly.
 		 */
 		gistdoinsert(index, itup, buildstate->freespace,
-					 buildstate->giststate, buildstate->heaprel, true);
+					 buildstate->giststate, buildstate->heaprel, true, false);
 	}
 
 	MemoryContextSwitchTo(oldCtx);
@@ -900,6 +1082,48 @@ gistBuildCallback(Relation index,
 	}
 }
 
+/*
+ * Per-tuple callback for table_index_build_scan.
+ *
+ * XXX Almost the same as gistBuildCallback, but passes is_parallel=true
+ * when calling gistdoinsert. Otherwise we get assert failures due to
+ * workers modifying the index concurrently.
+ */
+static void
+gistBuildParallelCallback(Relation index,
+						  ItemPointer tid,
+						  Datum *values,
+						  bool *isnull,
+						  bool tupleIsAlive,
+						  void *state)
+{
+	GISTBuildState *buildstate = (GISTBuildState *) state;
+	IndexTuple	itup;
+	MemoryContext oldCtx;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->giststate->tempCxt);
+
+	/* form an index tuple and point it at the heap tuple */
+	itup = gistFormTuple(buildstate->giststate, index,
+						 values, isnull,
+						 true);
+	itup->t_tid = *tid;
+
+	/* Update tuple count and total size. */
+	buildstate->indtuples += 1;
+	buildstate->indtuplesSize += IndexTupleSize(itup);
+
+	/*
+	 * There are no buffers (yet). Since we already have the index relation
+	 * locked, we call gistdoinsert directly.
+	 */
+	gistdoinsert(index, itup, buildstate->freespace,
+				 buildstate->giststate, buildstate->heaprel, true, true);
+
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(buildstate->giststate->tempCxt);
+}
+
 /*
  * Insert function for buffering index build.
  */
@@ -1068,7 +1292,8 @@ gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer, int level,
 							   InvalidBuffer,
 							   &splitinfo,
 							   false,
-							   buildstate->heaprel, true);
+							   buildstate->heaprel, true,
+							   buildstate->is_parallel);
 
 	/*
 	 * If this is a root split, update the root path item kept in memory. This
@@ -1577,3 +1802,439 @@ gistGetParent(GISTBuildState *buildstate, BlockNumber child)
 
 	return entry->parentblkno;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's gistLeader, which caller must use to shut down parallel
+ * mode by passing it to _gist_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
+					 bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			nparticipants;
+	Snapshot	snapshot;
+	Size		estgistshared;
+	GISTShared *gistshared;
+	GISTLeader *gistleader = (GISTLeader *) palloc0(sizeof(GISTLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of GIST
+	 * index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gist_parallel_build_main",
+								 request);
+
+	nparticipants = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIST_SHARED workspace.
+	 */
+	estgistshared = _gist_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estgistshared);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	gistshared = (GISTShared *) shm_toc_allocate(pcxt->toc, estgistshared);
+	/* Initialize immutable state */
+	gistshared->heaprelid = RelationGetRelid(heap);
+	gistshared->indexrelid = RelationGetRelid(index);
+	gistshared->isconcurrent = isconcurrent;
+	gistshared->nparticipants = nparticipants;
+
+	/* Build parameters determined by the leader, passed to the workers. */
+	gistshared->buildMode = buildstate->buildMode;
+	gistshared->freespace = buildstate->freespace;
+
+	ConditionVariableInit(&gistshared->workersdonecv);
+	SpinLockInit(&gistshared->mutex);
+
+	/* Initialize mutable state */
+	gistshared->nparticipantsdone = 0;
+	gistshared->reltuples = 0.0;
+	gistshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGistShared(gistshared),
+								  snapshot);
+
+	/* Store shared state, for which we reserved space. */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIST_SHARED, gistshared);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	gistleader->pcxt = pcxt;
+	gistleader->nparticipants = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		gistleader->nparticipants++;
+	gistleader->gistshared = gistshared;
+	gistleader->snapshot = snapshot;
+	gistleader->walusage = walusage;
+	gistleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gist_end_parallel(gistleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->is_parallel = true;
+	buildstate->gist_leader = gistleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gist_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gist_end_parallel(GISTLeader *gistleader, GISTBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(gistleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < gistleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&gistleader->bufferusage[i], &gistleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(gistleader->snapshot))
+		UnregisterSnapshot(gistleader->snapshot);
+	DestroyParallelContext(gistleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gist_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe needs to flush data if GIST_BUFFERING_ACTIVE, a bit like in
+ * the serial build?
+ */
+static double
+_gist_parallel_heapscan(GISTBuildState *state)
+{
+	GISTShared *gistshared = state->gist_leader->gistshared;
+	int			nparticipants;
+
+	nparticipants = state->gist_leader->nparticipants;
+	for (;;)
+	{
+		SpinLockAcquire(&gistshared->mutex);
+		if (gistshared->nparticipantsdone == nparticipants)
+		{
+			/* copy the data into leader state */
+			state->indtuples = gistshared->indtuples;
+
+			SpinLockRelease(&gistshared->mutex);
+			break;
+		}
+		SpinLockRelease(&gistshared->mutex);
+
+		ConditionVariableSleep(&gistshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->indtuples;
+}
+
+
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gist index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gist_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GISTShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gist_leader_participate_as_worker(GISTBuildState *buildstate,
+								   Relation heap, Relation index)
+{
+	GISTLeader *gistleader = buildstate->gist_leader;
+	int			workmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	workmem = maintenance_work_mem / gistleader->nparticipants;
+
+	/* Perform work common to all participants */
+	_gist_parallel_scan_and_build(buildstate, gistleader->gistshared,
+								  heap, index, workmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel scan and insert.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gist_parallel_scan_and_build(GISTBuildState *state,
+							  GISTShared *gistshared,
+							  Relation heap, Relation index,
+							  int workmem, bool progress)
+{
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+	MemoryContext oldcxt = CurrentMemoryContext;
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = gistshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGistShared(gistshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+									   gistBuildParallelCallback, state, scan);
+
+	/*
+	 * If buffering was used, flush out all the tuples that are still in the
+	 * buffers.
+	 */
+	if (state->buildMode == GIST_BUFFERING_ACTIVE)
+	{
+		elog(DEBUG1, "all tuples processed, emptying buffers");
+		gistEmptyAllBuffers(state);
+		gistFreeBuildBuffers(state->gfbb);
+	}
+
+	/* okay, all heap tuples are indexed */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextDelete(state->giststate->tempCxt);
+
+	/* FIXME Do we need to do something else with active buffering? */
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&gistshared->mutex);
+	gistshared->nparticipantsdone++;
+	gistshared->reltuples += reltuples;
+	gistshared->indtuples += state->indtuples;
+	SpinLockRelease(&gistshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&gistshared->workersdonecv);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gist_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GISTShared *gistshared;
+	GISTBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			workmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up GIST shared state */
+	gistshared = shm_toc_lookup(toc, PARALLEL_KEY_GIST_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!gistshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(gistshared->heaprelid, heapLockmode);
+	indexRel = index_open(gistshared->indexrelid, indexLockmode);
+
+	buildstate.indexrel = indexRel;
+	buildstate.heaprel = heapRel;
+	buildstate.sortstate = NULL;
+	buildstate.giststate = initGISTstate(indexRel);
+
+	buildstate.is_parallel = true;
+	buildstate.gist_leader = NULL;
+
+	/*
+	 * Create a temporary memory context that is reset once for each tuple
+	 * processed.  (Note: we don't bother to make this a child of the
+	 * giststate's scanCxt, so we have to delete it separately at the end.)
+	 */
+	buildstate.giststate->tempCxt = createTempGistContext();
+
+	/* FIXME Copy build parameters determined by the leader. */
+	buildstate.buildMode = gistshared->buildMode;
+	buildstate.freespace = gistshared->freespace;
+
+	buildstate.indtuples = 0;
+	buildstate.indtuplesSize = 0;
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	workmem = maintenance_work_mem / gistshared->nparticipants;
+
+	_gist_parallel_scan_and_build(&buildstate, gistshared,
+								  heapRel, indexRel, workmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 78e98d68b15..733d5849317 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1012,7 +1012,7 @@ gistproperty(Oid index_oid, int attno,
  * purpose.
  */
 XLogRecPtr
-gistGetFakeLSN(Relation rel)
+gistGetFakeLSN(Relation rel, bool is_parallel)
 {
 	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
 	{
@@ -1035,8 +1035,12 @@ gistGetFakeLSN(Relation rel)
 		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
 		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
 
-		/* Shouldn't be called for WAL-logging relations */
-		Assert(!RelationNeedsWAL(rel));
+		/*
+		 * Shouldn't be called for WAL-logging relations, but parallel
+		 * builds are an exception - we need the fake LSN to detect
+		 * concurrent changes.
+		 */
+		Assert(is_parallel || !RelationNeedsWAL(rel));
 
 		/* No need for an actual record if we already have a distinct LSN */
 		if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 24fb94f473e..082804e9c7d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -181,7 +181,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(rel))
 		vstate.startNSN = GetInsertRecPtr();
 	else
-		vstate.startNSN = gistGetFakeLSN(rel);
+		vstate.startNSN = gistGetFakeLSN(rel, false);
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -376,7 +376,7 @@ restart:
 				PageSetLSN(page, recptr);
 			}
 			else
-				PageSetLSN(page, gistGetFakeLSN(rel));
+				PageSetLSN(page, gistGetFakeLSN(rel, false));
 
 			END_CRIT_SECTION();
 
@@ -664,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(info->index))
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
-		recptr = gistGetFakeLSN(info->index);
+		recptr = gistGetFakeLSN(info->index, false);
 	PageSetLSN(parentPage, recptr);
 	PageSetLSN(leafPage, recptr);
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb54..7e09fc79c30 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gist_private.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gist_parallel_build_main", _gist_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 7b8749c8db0..d5b22bc1018 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -20,6 +20,7 @@
 #include "lib/pairingheap.h"
 #include "storage/bufmgr.h"
 #include "storage/buffile.h"
+#include "storage/shm_toc.h"
 #include "utils/hsearch.h"
 #include "access/genam.h"
 
@@ -254,6 +255,7 @@ typedef struct
 	Relation	heapRel;
 	Size		freespace;		/* free space to be left */
 	bool		is_build;
+	bool		is_parallel;
 
 	GISTInsertStack *stack;
 } GISTInsertState;
@@ -413,7 +415,8 @@ extern void gistdoinsert(Relation r,
 						 Size freespace,
 						 GISTSTATE *giststate,
 						 Relation heapRel,
-						 bool is_build);
+						 bool is_build,
+						 bool is_parallel);
 
 /* A List of these is returned from gistplacetopage() in *splitinfo */
 typedef struct
@@ -430,7 +433,8 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 							List **splitinfo,
 							bool markfollowright,
 							Relation heapRel,
-							bool is_build);
+							bool is_build,
+							bool is_parallel);
 
 extern SplitPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 								  int len, GISTSTATE *giststate);
@@ -531,7 +535,7 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 							 GISTENTRY *entry2, bool isnull2,
 							 Datum *dst, bool *dstisnull);
 
-extern XLogRecPtr gistGetFakeLSN(Relation rel);
+extern XLogRecPtr gistGetFakeLSN(Relation rel, bool is_parallel);
 
 /* gistvacuum.c */
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
@@ -568,4 +572,6 @@ extern void gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb,
 											List *splitinfo);
 extern void gistUnloadNodeBuffers(GISTBuildBuffers *gfbb);
 
+extern void _gist_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIST_PRIVATE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6e6b7c27118..21af0c7ce4f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -973,6 +973,7 @@ GISTInsertStack
 GISTInsertState
 GISTIntArrayBigOptions
 GISTIntArrayOptions
+GISTLeader
 GISTNodeBuffer
 GISTNodeBufferPage
 GISTPageOpaque
@@ -983,6 +984,7 @@ GISTScanOpaque
 GISTScanOpaqueData
 GISTSearchHeapItem
 GISTSearchItem
+GISTShared
 GISTTYPE
 GIST_SPLITVEC
 GMReaderTupleBuffer
-- 
2.45.2

v20240805-0002-use-GetFakeLSNForUnloggedRel.patch (text/x-patch; charset=UTF-8)
From 92401b291bfd87a2b7db6de53fbc9d9689916039 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 29 Jul 2024 18:03:44 +0200
Subject: [PATCH v20240805 2/6] use GetFakeLSNForUnloggedRel

---
 src/backend/access/gist/gist.c       | 10 +++++-----
 src/backend/access/gist/gistutil.c   | 10 +++-------
 src/backend/access/gist/gistvacuum.c |  6 +++---
 src/include/access/gist_private.h    |  2 +-
 4 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index f5f56fb2503..21c87c8c12f 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -509,7 +509,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		if (is_build)
 		{
 			if (is_parallel)
-				recptr = gistGetFakeLSN(rel, is_parallel);
+				recptr = GetFakeLSNForUnloggedRel();
 			else
 				recptr = GistBuildLSN;
 		}
@@ -520,7 +520,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
 			else
-				recptr = gistGetFakeLSN(rel, false);
+				recptr = gistGetFakeLSN(rel);
 		}
 
 		for (ptr = dist; ptr; ptr = ptr->next)
@@ -581,7 +581,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		if (is_build)
 		{
 			if (is_parallel)
-				recptr = gistGetFakeLSN(rel, is_parallel);
+				recptr = GetFakeLSNForUnloggedRel();
 			else
 				recptr = GistBuildLSN;
 		}
@@ -603,7 +603,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 										leftchildbuf);
 			}
 			else
-				recptr = gistGetFakeLSN(rel, false);
+				recptr = gistGetFakeLSN(rel);
 		}
 		PageSetLSN(page, recptr);
 
@@ -1739,7 +1739,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			PageSetLSN(page, recptr);
 		}
 		else
-			PageSetLSN(page, gistGetFakeLSN(rel, false));
+			PageSetLSN(page, gistGetFakeLSN(rel));
 
 		END_CRIT_SECTION();
 	}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 733d5849317..78e98d68b15 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1012,7 +1012,7 @@ gistproperty(Oid index_oid, int attno,
  * purpose.
  */
 XLogRecPtr
-gistGetFakeLSN(Relation rel, bool is_parallel)
+gistGetFakeLSN(Relation rel)
 {
 	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
 	{
@@ -1035,12 +1035,8 @@ gistGetFakeLSN(Relation rel, bool is_parallel)
 		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
 		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
 
-		/*
-		 * Shouldn't be called for WAL-logging relations, but parallel
-		 * builds are an exception - we need the fake LSN to detect
-		 * concurrent changes.
-		 */
-		Assert(is_parallel || !RelationNeedsWAL(rel));
+		/* Shouldn't be called for WAL-logging relations */
+		Assert(!RelationNeedsWAL(rel));
 
 		/* No need for an actual record if we already have a distinct LSN */
 		if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 082804e9c7d..24fb94f473e 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -181,7 +181,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(rel))
 		vstate.startNSN = GetInsertRecPtr();
 	else
-		vstate.startNSN = gistGetFakeLSN(rel, false);
+		vstate.startNSN = gistGetFakeLSN(rel);
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -376,7 +376,7 @@ restart:
 				PageSetLSN(page, recptr);
 			}
 			else
-				PageSetLSN(page, gistGetFakeLSN(rel, false));
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
 			END_CRIT_SECTION();
 
@@ -664,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(info->index))
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
-		recptr = gistGetFakeLSN(info->index, false);
+		recptr = gistGetFakeLSN(info->index);
 	PageSetLSN(parentPage, recptr);
 	PageSetLSN(leafPage, recptr);
 
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index d5b22bc1018..1aeb35cdcb7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -535,7 +535,7 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 							 GISTENTRY *entry2, bool isnull2,
 							 Datum *dst, bool *dstisnull);
 
-extern XLogRecPtr gistGetFakeLSN(Relation rel, bool is_parallel);
+extern XLogRecPtr gistGetFakeLSN(Relation rel);
 
 /* gistvacuum.c */
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
-- 
2.45.2

#19Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#12)
Re: WIP: parallel GiST index builds

On Tue, Jul 30, 2024 at 11:05:56AM +0200, Tomas Vondra wrote:

I tried implementing this, see the attached 0002 patch that replaces the
fake LSN with an atomic counter in shared memory. It seems to work (more
testing needed), but I can't say I'm very happy with the code :-(

While quickly scanning the code, please be careful about the query ID
that should be passed down to _gist_parallel_build_main().
--
Michael

#20Tomas Vondra
tomas@vondra.me
In reply to: Michael Paquier (#19)
2 attachment(s)
Re: WIP: parallel GiST index builds

On 10/8/24 04:05, Michael Paquier wrote:

On Tue, Jul 30, 2024 at 11:05:56AM +0200, Tomas Vondra wrote:

I tried implementing this, see the attached 0002 patch that replaces the
fake LSN with an atomic counter in shared memory. It seems to work (more
testing needed), but I can't say I'm very happy with the code :-(

While quickly scanning the code, please be careful about the query ID
that should be passed down to _gist_parallel_build_main().

Here's an updated patch adding the queryid.
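
(Specifically, the leader now stores its query ID in the shared state,
and each worker reports it, matching what the other parallel index
builds do:)

    /* leader, while initializing GISTShared */
    gistshared->queryid = pgstat_get_my_query_id();

    /* worker, early in _gist_parallel_build_main() */
    pgstat_report_query_id(gistshared->queryid, false);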

regards

--
Tomas Vondra

Attachments:

v20241008-0001-WIP-parallel-GiST-build.patch (text/x-patch)
From dbbb8c8f03d49aa7732c80f65ece5577a0e1641d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 26 May 2024 21:44:27 +0200
Subject: [PATCH v20241008 1/2] WIP parallel GiST build

Implements parallel GiST index build for the unsorted case. The build
simply starts parallel workers that insert values into the index the
usual way (as if there were multiple clients doing INSERT).

The basic infrastructure is copied from parallel BRIN builds (and also
from the nearby parallel GIN build). There's nothing particularly
special or interesting, except for the gistBuildParallelCallback()
callback. The two significant changes in the callback are:

1) disabling buffering

Buffered builds assume the worker is the only backend that can split
index pages etc. With serial workers that is trivially true, but with
parallel workers this leads to confusion.

In principle this is solvable by moving the buffers into shared memory
and coordinating the workers (locking etc.). But the patch does not do
that yet - it's clearly non-trivial, and I'm not really convinced it's
worth it.

2) generating "proper" fake LSNs

The serial builds disable all WAL-logging for the index, until the very
end when the whole index is WAL-logged. This however also means we don't
set page LSNs on the index pages - but page LSNs are used to detect
concurrent changes to the index structure (e.g. page splits). For serial
builds this does not matter, because only the build worker can modify
the index, so it just sets the same LSN "1" for all pages. Both of these
(disabling WAL-logging and using bogus page LSNs) are controlled by the
same is_build flag.

Having the same page LSN does not work for parallel builds, as it would
mean workers won't notice splits done by other workers, etc.

One option would be to set is_build=false, which enables WAL-logging, as
if during regular inserts, and also assigns proper page LSNs. But we
don't want to WAL-log everything, that's unnecessary. We want to only
start WAL-logging the index once the build completes, just like for
serial builds. And only do the fake LSNs, as for unlogged indexes etc.

So this introduces a separate flag is_parallel, which forces generating
the "proper" fake LSN. But we can still do is_build=true, and only log
the index at the end of the build.
---
 src/backend/access/gist/gist.c        |  37 +-
 src/backend/access/gist/gistbuild.c   | 721 +++++++++++++++++++++++++-
 src/backend/access/gist/gistutil.c    |  10 +-
 src/backend/access/gist/gistvacuum.c  |   6 +-
 src/backend/access/transam/parallel.c |   4 +
 src/include/access/gist_private.h     |  12 +-
 src/tools/pgindent/typedefs.list      |   2 +
 7 files changed, 747 insertions(+), 45 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2d7a0687d4a..280a56760dc 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -75,7 +75,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = true;
 	amroutine->amusemaintenanceworkmem = false;
 	amroutine->amsummarizing = false;
@@ -183,7 +183,7 @@ gistinsert(Relation r, Datum *values, bool *isnull,
 						 values, isnull, true /* size is currently bogus */ );
 	itup->t_tid = *ht_ctid;
 
-	gistdoinsert(r, itup, 0, giststate, heapRel, false);
+	gistdoinsert(r, itup, 0, giststate, heapRel, false, false);
 
 	/* cleanup */
 	MemoryContextSwitchTo(oldCxt);
@@ -231,7 +231,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 				List **splitinfo,
 				bool markfollowright,
 				Relation heapRel,
-				bool is_build)
+				bool is_build,
+				bool is_parallel)
 {
 	BlockNumber blkno = BufferGetBlockNumber(buffer);
 	Page		page = BufferGetPage(buffer);
@@ -502,9 +503,17 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * smaller than any real or fake unlogged LSN that might be generated
 		 * later. (There can't be any concurrent scans during index build, so
 		 * we don't need to be able to detect concurrent splits yet.)
+		 *
+		 * However, with a parallel index build, we need to assign valid LSN,
+		 * as it's used to detect concurrent index modifications.
 		 */
 		if (is_build)
-			recptr = GistBuildLSN;
+		{
+			if (is_parallel)
+				recptr = gistGetFakeLSN(rel, is_parallel);
+			else
+				recptr = GistBuildLSN;
+		}
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -512,7 +521,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = gistGetFakeLSN(rel, false);
 		}
 
 		for (ptr = dist; ptr; ptr = ptr->next)
@@ -571,7 +580,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			MarkBufferDirty(leftchildbuf);
 
 		if (is_build)
-			recptr = GistBuildLSN;
+		{
+			if (is_parallel)
+				recptr = gistGetFakeLSN(rel, is_parallel);
+			else
+				recptr = GistBuildLSN;
+		}
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -590,7 +604,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 										leftchildbuf);
 			}
 			else
-				recptr = gistGetFakeLSN(rel);
+				recptr = gistGetFakeLSN(rel, false);
 		}
 		PageSetLSN(page, recptr);
 
@@ -633,7 +647,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
  */
 void
 gistdoinsert(Relation r, IndexTuple itup, Size freespace,
-			 GISTSTATE *giststate, Relation heapRel, bool is_build)
+			 GISTSTATE *giststate, Relation heapRel, bool is_build,
+			 bool is_parallel)
 {
 	ItemId		iid;
 	IndexTuple	idxtuple;
@@ -647,6 +662,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	state.r = r;
 	state.heapRel = heapRel;
 	state.is_build = is_build;
+	state.is_parallel = is_parallel;
 
 	/* Start from the root */
 	firststack.blkno = GIST_ROOT_BLKNO;
@@ -1304,7 +1320,8 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 							   &splitinfo,
 							   true,
 							   state->heapRel,
-							   state->is_build);
+							   state->is_build,
+							   state->is_parallel);
 
 	/*
 	 * Before recursing up in case the page was split, release locks on the
@@ -1723,7 +1740,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			PageSetLSN(page, recptr);
 		}
 		else
-			PageSetLSN(page, gistGetFakeLSN(rel));
+			PageSetLSN(page, gistGetFakeLSN(rel, false));
 
 		END_CRIT_SECTION();
 	}
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index ba06df30faf..e647b1e66c1 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -36,18 +36,28 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
-
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
 
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIST_SHARED		UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000004)
+
 /* Step of index tuples for check whether to switch to buffering build mode */
 #define BUFFERING_MODE_SWITCH_CHECK_STEP 256
 
@@ -78,6 +88,109 @@ typedef enum
 	GIST_BUFFERING_ACTIVE,		/* in buffering build mode */
 } GistBuildMode;
 
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GISTShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 *
+	 * XXX nparticipants is the number of workers we expect to participate in
+	 * the build, possibly including the leader process.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			nparticipants;
+
+	/* Parameters determined by the leader, passed to the workers. */
+	GistBuildMode buildMode;
+	int			freespace;
+
+	/* Query ID, for report in worker processes */
+	uint64		queryid;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can finish
+	 * building the index.
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields before heapdesc.
+	 *
+	 * These fields contain status information of interest to GIST index
+	 * builds that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GISTShared;
+
+/*
+ * Return pointer to a GISTShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGistShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GISTShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GISTLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipants is the exact number of worker processes successfully
+	 * launched, plus one leader process if it participates as a worker (only
+	 * DISABLE_LEADER_PARTICIPATION builds avoid leader participating as a
+	 * worker).
+	 *
+	 * XXX Seems a bit redundant with nparticipants in GISTShared. Although
+	 * that is the expected number, this is what we actually got.
+	 */
+	int			nparticipants;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GISTShared is the shared state for entire build. snapshot is the
+	 * snapshot used by the scan iff an MVCC snapshot is required.
+	 */
+	GISTShared *gistshared;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GISTLeader;
+
 /* Working state for gistbuild and its callback */
 typedef struct
 {
@@ -100,6 +213,14 @@ typedef struct
 	GISTBuildBuffers *gfbb;
 	HTAB	   *parentMap;
 
+	/*
+	 * gist_leader is only present when a parallel index build is performed,
+	 * and only in the leader process. (Actually, only the leader process has
+	 * a GISTBuildState.)
+	 */
+	bool		is_parallel;
+	GISTLeader *gist_leader;
+
 	/*
 	 * Extra data structures used during a sorting build.
 	 */
@@ -148,6 +269,12 @@ static void gistBuildCallback(Relation index,
 							  bool *isnull,
 							  bool tupleIsAlive,
 							  void *state);
+static void gistBuildParallelCallback(Relation index,
+									  ItemPointer tid,
+									  Datum *values,
+									  bool *isnull,
+									  bool tupleIsAlive,
+									  void *state);
 static void gistBufferingBuildInsert(GISTBuildState *buildstate,
 									 IndexTuple itup);
 static bool gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
@@ -171,6 +298,18 @@ static void gistMemorizeAllDownlinks(GISTBuildState *buildstate,
 									 Buffer parentbuf);
 static BlockNumber gistGetParent(GISTBuildState *buildstate, BlockNumber child);
 
+/* parallel index builds */
+static void _gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
+								 bool isconcurrent, int request);
+static void _gist_end_parallel(GISTLeader *gistleader, GISTBuildState *state);
+static Size _gist_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gist_parallel_heapscan(GISTBuildState *buildstate);
+static void _gist_leader_participate_as_worker(GISTBuildState *buildstate,
+											   Relation heap, Relation index);
+static void _gist_parallel_scan_and_build(GISTBuildState *buildstate,
+										  GISTShared *gistshared,
+										  Relation heap, Relation index,
+										  int workmem, bool progress);
 
 /*
  * Main entry point to GiST index build.
@@ -199,6 +338,10 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.sortstate = NULL;
 	buildstate.giststate = initGISTstate(index);
 
+	/* assume serial build */
+	buildstate.is_parallel = false;
+	buildstate.gist_leader = NULL;
+
 	/*
 	 * Create a temporary memory context that is reset once for each tuple
 	 * processed.  (Note: we don't bother to make this a child of the
@@ -309,37 +452,79 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 		END_CRIT_SECTION();
 
-		/* Scan the table, inserting all the tuples to the index. */
-		reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
-										   gistBuildCallback,
-										   (void *) &buildstate, NULL);
-
 		/*
-		 * If buffering was used, flush out all the tuples that are still in
-		 * the buffers.
+		 * Attempt to launch parallel worker scan when required
+		 *
+		 * XXX plan_create_index_workers makes the number of workers dependent
+		 * on maintenance_work_mem, requiring 32MB for each worker. That makes
+		 * sense for btree, but maybe not for GIST (at least when not using
+		 * buffering)? So maybe make that somehow less strict, optionally?
 		 */
-		if (buildstate.buildMode == GIST_BUFFERING_ACTIVE)
-		{
-			elog(DEBUG1, "all tuples processed, emptying buffers");
-			gistEmptyAllBuffers(&buildstate);
-			gistFreeBuildBuffers(buildstate.gfbb);
-		}
+		if (indexInfo->ii_ParallelWorkers > 0)
+			_gist_begin_parallel(&buildstate, heap,
+								 index, indexInfo->ii_Concurrent,
+								 indexInfo->ii_ParallelWorkers);
 
 		/*
-		 * We didn't write WAL records as we built the index, so if
-		 * WAL-logging is required, write all pages to the WAL now.
+		 * If parallel build requested and at least one worker process was
+		 * successfully launched, set up coordination state, wait for workers
+		 * to complete and end the parallel build.
+		 *
+		 * In serial mode, simply scan the table and build the index one index
+		 * tuple at a time.
 		 */
-		if (RelationNeedsWAL(index))
+		if (buildstate.gist_leader)
 		{
-			log_newpage_range(index, MAIN_FORKNUM,
-							  0, RelationGetNumberOfBlocks(index),
-							  true);
+			/* scan the relation and wait for parallel workers to finish */
+			reltuples = _gist_parallel_heapscan(&buildstate);
+
+			_gist_end_parallel(buildstate.gist_leader, &buildstate);
+
+			/*
+			 * We didn't write WAL records as we built the index, so if WAL-logging is
+			 * required, write all pages to the WAL now.
+			 */
+			if (RelationNeedsWAL(index))
+			{
+				log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+			}
 		}
-	}
+		else
+		{
+			/* Scan the table, inserting all the tuples to the index. */
+			reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+											   gistBuildCallback,
+											   (void *) &buildstate, NULL);
 
-	/* okay, all heap tuples are indexed */
-	MemoryContextSwitchTo(oldcxt);
-	MemoryContextDelete(buildstate.giststate->tempCxt);
+			/*
+			 * If buffering was used, flush out all the tuples that are still
+			 * in the buffers.
+			 */
+			if (buildstate.buildMode == GIST_BUFFERING_ACTIVE)
+			{
+				elog(DEBUG1, "all tuples processed, emptying buffers");
+				gistEmptyAllBuffers(&buildstate);
+				gistFreeBuildBuffers(buildstate.gfbb);
+			}
+
+			/*
+			 * We didn't write WAL records as we built the index, so if
+			 * WAL-logging is required, write all pages to the WAL now.
+			 */
+			if (RelationNeedsWAL(index))
+			{
+				log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+			}
+
+			/* okay, all heap tuples are indexed */
+			MemoryContextSwitchTo(oldcxt);
+			MemoryContextDelete(buildstate.giststate->tempCxt);
+		}
+	}
 
 	freeGISTstate(buildstate.giststate);
 
@@ -861,7 +1046,7 @@ gistBuildCallback(Relation index,
 		 * locked, we call gistdoinsert directly.
 		 */
 		gistdoinsert(index, itup, buildstate->freespace,
-					 buildstate->giststate, buildstate->heaprel, true);
+					 buildstate->giststate, buildstate->heaprel, true, false);
 	}
 
 	MemoryContextSwitchTo(oldCtx);
@@ -900,6 +1085,48 @@ gistBuildCallback(Relation index,
 	}
 }
 
+/*
+ * Per-tuple callback for table_index_build_scan.
+ *
+ * XXX Almost the same as gistBuildCallback, but passes is_parallel=true
+ * when calling gistdoinsert. Otherwise we get assert failures due to
+ * workers modifying the index concurrently.
+ */
+static void
+gistBuildParallelCallback(Relation index,
+						  ItemPointer tid,
+						  Datum *values,
+						  bool *isnull,
+						  bool tupleIsAlive,
+						  void *state)
+{
+	GISTBuildState *buildstate = (GISTBuildState *) state;
+	IndexTuple	itup;
+	MemoryContext oldCtx;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->giststate->tempCxt);
+
+	/* form an index tuple and point it at the heap tuple */
+	itup = gistFormTuple(buildstate->giststate, index,
+						 values, isnull,
+						 true);
+	itup->t_tid = *tid;
+
+	/* Update tuple count and total size. */
+	buildstate->indtuples += 1;
+	buildstate->indtuplesSize += IndexTupleSize(itup);
+
+	/*
+	 * There are no buffers (yet). Since we already have the index relation
+	 * locked, we call gistdoinsert directly.
+	 */
+	gistdoinsert(index, itup, buildstate->freespace,
+				 buildstate->giststate, buildstate->heaprel, true, true);
+
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(buildstate->giststate->tempCxt);
+}
+
 /*
  * Insert function for buffering index build.
  */
@@ -1068,7 +1295,8 @@ gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer, int level,
 							   InvalidBuffer,
 							   &splitinfo,
 							   false,
-							   buildstate->heaprel, true);
+							   buildstate->heaprel, true,
+							   buildstate->is_parallel);
 
 	/*
 	 * If this is a root split, update the root path item kept in memory. This
@@ -1577,3 +1805,444 @@ gistGetParent(GISTBuildState *buildstate, BlockNumber child)
 
 	return entry->parentblkno;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's gistLeader, which caller must use to shut down parallel
+ * mode by passing it to _gist_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
+					 bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			nparticipants;
+	Snapshot	snapshot;
+	Size		estgistshared;
+	GISTShared *gistshared;
+	GISTLeader *gistleader = (GISTLeader *) palloc0(sizeof(GISTLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of GIST
+	 * index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gist_parallel_build_main",
+								 request);
+
+	nparticipants = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIST_SHARED workspace.
+	 */
+	estgistshared = _gist_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estgistshared);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	gistshared = (GISTShared *) shm_toc_allocate(pcxt->toc, estgistshared);
+	/* Initialize immutable state */
+	gistshared->heaprelid = RelationGetRelid(heap);
+	gistshared->indexrelid = RelationGetRelid(index);
+	gistshared->isconcurrent = isconcurrent;
+	gistshared->nparticipants = nparticipants;
+
+	gistshared->queryid = pgstat_get_my_query_id();
+
+	/* Parameters determined by the leader, passed to the workers */
+	gistshared->buildMode = buildstate->buildMode;
+	gistshared->freespace = buildstate->freespace;
+
+	ConditionVariableInit(&gistshared->workersdonecv);
+	SpinLockInit(&gistshared->mutex);
+
+	/* Initialize mutable state */
+	gistshared->nparticipantsdone = 0;
+	gistshared->reltuples = 0.0;
+	gistshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGistShared(gistshared),
+								  snapshot);
+
+	/* Store shared state, for which we reserved space. */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIST_SHARED, gistshared);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	gistleader->pcxt = pcxt;
+	gistleader->nparticipants = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		gistleader->nparticipants++;
+	gistleader->gistshared = gistshared;
+	gistleader->snapshot = snapshot;
+	gistleader->walusage = walusage;
+	gistleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gist_end_parallel(gistleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->is_parallel = true;
+	buildstate->gist_leader = gistleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gist_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gist_end_parallel(GISTLeader *gistleader, GISTBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(gistleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < gistleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&gistleader->bufferusage[i], &gistleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(gistleader->snapshot))
+		UnregisterSnapshot(gistleader->snapshot);
+	DestroyParallelContext(gistleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gist_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe needs to flush data if GIST_BUFFERING_ACTIVE, a bit like in
+ * the serial build?
+ */
+static double
+_gist_parallel_heapscan(GISTBuildState *state)
+{
+	GISTShared *gistshared = state->gist_leader->gistshared;
+	int			nparticipants;
+
+	nparticipants = state->gist_leader->nparticipants;
+	for (;;)
+	{
+		SpinLockAcquire(&gistshared->mutex);
+		if (gistshared->nparticipantsdone == nparticipants)
+		{
+			/* copy the data into leader state */
+			state->indtuples = gistshared->indtuples;
+
+			SpinLockRelease(&gistshared->mutex);
+			break;
+		}
+		SpinLockRelease(&gistshared->mutex);
+
+		ConditionVariableSleep(&gistshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->indtuples;
+}
+
+
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gist index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gist_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GISTShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gist_leader_participate_as_worker(GISTBuildState *buildstate,
+								   Relation heap, Relation index)
+{
+	GISTLeader *gistleader = buildstate->gist_leader;
+	int			workmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	workmem = maintenance_work_mem / gistleader->nparticipants;
+
+	/* Perform work common to all participants */
+	_gist_parallel_scan_and_build(buildstate, gistleader->gistshared,
+								  heap, index, workmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel scan and insert.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gist_parallel_scan_and_build(GISTBuildState *state,
+							  GISTShared *gistshared,
+							  Relation heap, Relation index,
+							  int workmem, bool progress)
+{
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+	MemoryContext oldcxt = CurrentMemoryContext;
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = gistshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGistShared(gistshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+									   gistBuildParallelCallback, state, scan);
+
+	/*
+	 * If buffering was used, flush out all the tuples that are still in the
+	 * buffers.
+	 */
+	if (state->buildMode == GIST_BUFFERING_ACTIVE)
+	{
+		elog(DEBUG1, "all tuples processed, emptying buffers");
+		gistEmptyAllBuffers(state);
+		gistFreeBuildBuffers(state->gfbb);
+	}
+
+	/* okay, all heap tuples are indexed */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextDelete(state->giststate->tempCxt);
+
+	/* FIXME Do we need to do something else with active buffering? */
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&gistshared->mutex);
+	gistshared->nparticipantsdone++;
+	gistshared->reltuples += reltuples;
+	gistshared->indtuples += state->indtuples;
+	SpinLockRelease(&gistshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&gistshared->workersdonecv);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gist_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GISTShared *gistshared;
+	GISTBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			workmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up GIST shared state */
+	gistshared = shm_toc_lookup(toc, PARALLEL_KEY_GIST_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!gistshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Track query ID */
+	pgstat_report_query_id(gistshared->queryid, false);
+
+	/* Open relations within worker */
+	heapRel = table_open(gistshared->heaprelid, heapLockmode);
+	indexRel = index_open(gistshared->indexrelid, indexLockmode);
+
+	buildstate.indexrel = indexRel;
+	buildstate.heaprel = heapRel;
+	buildstate.sortstate = NULL;
+	buildstate.giststate = initGISTstate(indexRel);
+
+	buildstate.is_parallel = true;
+	buildstate.gist_leader = NULL;
+
+	/*
+	 * Create a temporary memory context that is reset once for each tuple
+	 * processed.  (Note: we don't bother to make this a child of the
+	 * giststate's scanCxt, so we have to delete it separately at the end.)
+	 */
+	buildstate.giststate->tempCxt = createTempGistContext();
+
+	/* FIXME */
+	buildstate.buildMode = gistshared->buildMode;
+	buildstate.freespace = gistshared->freespace;
+
+	buildstate.indtuples = 0;
+	buildstate.indtuplesSize = 0;
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	workmem = maintenance_work_mem / gistshared->nparticipants;
+
+	_gist_parallel_scan_and_build(&buildstate, gistshared,
+								  heapRel, indexRel, workmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index d2d0b36d4ea..6580dcfcc88 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1013,7 +1013,7 @@ gistproperty(Oid index_oid, int attno,
  * purpose.
  */
 XLogRecPtr
-gistGetFakeLSN(Relation rel)
+gistGetFakeLSN(Relation rel, bool is_parallel)
 {
 	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
 	{
@@ -1036,8 +1036,12 @@ gistGetFakeLSN(Relation rel)
 		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
 		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
 
-		/* Shouldn't be called for WAL-logging relations */
-		Assert(!RelationNeedsWAL(rel));
+		/*
+		 * Shouldn't be called for WAL-logging relations, but parallel
+		 * builds are an exception - we need the fake LSN to detect
+		 * concurrent changes.
+		 */
+		Assert(is_parallel || !RelationNeedsWAL(rel));
 
 		/* No need for an actual record if we already have a distinct LSN */
 		if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 24fb94f473e..082804e9c7d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -181,7 +181,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(rel))
 		vstate.startNSN = GetInsertRecPtr();
 	else
-		vstate.startNSN = gistGetFakeLSN(rel);
+		vstate.startNSN = gistGetFakeLSN(rel, false);
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -376,7 +376,7 @@ restart:
 				PageSetLSN(page, recptr);
 			}
 			else
-				PageSetLSN(page, gistGetFakeLSN(rel));
+				PageSetLSN(page, gistGetFakeLSN(rel, false));
 
 			END_CRIT_SECTION();
 
@@ -664,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(info->index))
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
-		recptr = gistGetFakeLSN(info->index);
+		recptr = gistGetFakeLSN(info->index, false);
 	PageSetLSN(parentPage, recptr);
 	PageSetLSN(leafPage, recptr);
 
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index d4e84aabac7..964162172f4 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gist_private.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gist_parallel_build_main", _gist_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 7b8749c8db0..d5b22bc1018 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -20,6 +20,7 @@
 #include "lib/pairingheap.h"
 #include "storage/bufmgr.h"
 #include "storage/buffile.h"
+#include "storage/shm_toc.h"
 #include "utils/hsearch.h"
 #include "access/genam.h"
 
@@ -254,6 +255,7 @@ typedef struct
 	Relation	heapRel;
 	Size		freespace;		/* free space to be left */
 	bool		is_build;
+	bool		is_parallel;
 
 	GISTInsertStack *stack;
 } GISTInsertState;
@@ -413,7 +415,8 @@ extern void gistdoinsert(Relation r,
 						 Size freespace,
 						 GISTSTATE *giststate,
 						 Relation heapRel,
-						 bool is_build);
+						 bool is_build,
+						 bool is_parallel);
 
 /* A List of these is returned from gistplacetopage() in *splitinfo */
 typedef struct
@@ -430,7 +433,8 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 							List **splitinfo,
 							bool markfollowright,
 							Relation heapRel,
-							bool is_build);
+							bool is_build,
+							bool is_parallel);
 
 extern SplitPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 								  int len, GISTSTATE *giststate);
@@ -531,7 +535,7 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 							 GISTENTRY *entry2, bool isnull2,
 							 Datum *dst, bool *dstisnull);
 
-extern XLogRecPtr gistGetFakeLSN(Relation rel);
+extern XLogRecPtr gistGetFakeLSN(Relation rel, bool is_parallel);
 
 /* gistvacuum.c */
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
@@ -568,4 +572,6 @@ extern void gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb,
 											List *splitinfo);
 extern void gistUnloadNodeBuffers(GISTBuildBuffers *gfbb);
 
+extern void _gist_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIST_PRIVATE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a65e1c07c5d..40998937f7a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -975,6 +975,7 @@ GISTInsertStack
 GISTInsertState
 GISTIntArrayBigOptions
 GISTIntArrayOptions
+GISTLeader
 GISTNodeBuffer
 GISTNodeBufferPage
 GISTPageOpaque
@@ -985,6 +986,7 @@ GISTScanOpaque
 GISTScanOpaqueData
 GISTSearchHeapItem
 GISTSearchItem
+GISTShared
 GISTTYPE
 GIST_SPLITVEC
 GMReaderTupleBuffer
-- 
2.46.2

v20241008-0002-use-GetFakeLSNForUnloggedRel.patch (text/x-patch)
From 49c836cf46dd1776c77f8af0f71aeeb3f58b1f29 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 29 Jul 2024 18:03:44 +0200
Subject: [PATCH v20241008 2/2] use GetFakeLSNForUnloggedRel

---
 src/backend/access/gist/gist.c       | 10 +++++-----
 src/backend/access/gist/gistutil.c   | 10 +++-------
 src/backend/access/gist/gistvacuum.c |  6 +++---
 src/include/access/gist_private.h    |  2 +-
 4 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 280a56760dc..e1e43cf7206 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -510,7 +510,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		if (is_build)
 		{
 			if (is_parallel)
-				recptr = gistGetFakeLSN(rel, is_parallel);
+				recptr = GetFakeLSNForUnloggedRel();
 			else
 				recptr = GistBuildLSN;
 		}
@@ -521,7 +521,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 									   dist, oldrlink, oldnsn, leftchildbuf,
 									   markfollowright);
 			else
-				recptr = gistGetFakeLSN(rel, false);
+				recptr = gistGetFakeLSN(rel);
 		}
 
 		for (ptr = dist; ptr; ptr = ptr->next)
@@ -582,7 +582,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		if (is_build)
 		{
 			if (is_parallel)
-				recptr = gistGetFakeLSN(rel, is_parallel);
+				recptr = GetFakeLSNForUnloggedRel();
 			else
 				recptr = GistBuildLSN;
 		}
@@ -604,7 +604,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 										leftchildbuf);
 			}
 			else
-				recptr = gistGetFakeLSN(rel, false);
+				recptr = gistGetFakeLSN(rel);
 		}
 		PageSetLSN(page, recptr);
 
@@ -1740,7 +1740,7 @@ gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 			PageSetLSN(page, recptr);
 		}
 		else
-			PageSetLSN(page, gistGetFakeLSN(rel, false));
+			PageSetLSN(page, gistGetFakeLSN(rel));
 
 		END_CRIT_SECTION();
 	}
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 6580dcfcc88..d2d0b36d4ea 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -1013,7 +1013,7 @@ gistproperty(Oid index_oid, int attno,
  * purpose.
  */
 XLogRecPtr
-gistGetFakeLSN(Relation rel, bool is_parallel)
+gistGetFakeLSN(Relation rel)
 {
 	if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
 	{
@@ -1036,12 +1036,8 @@ gistGetFakeLSN(Relation rel, bool is_parallel)
 		static XLogRecPtr lastlsn = InvalidXLogRecPtr;
 		XLogRecPtr	currlsn = GetXLogInsertRecPtr();
 
-		/*
-		 * Shouldn't be called for WAL-logging relations, but parallel
-		 * builds are an exception - we need the fake LSN to detect
-		 * concurrent changes.
-		 */
-		Assert(is_parallel || !RelationNeedsWAL(rel));
+		/* Shouldn't be called for WAL-logging relations */
+		Assert(!RelationNeedsWAL(rel));
 
 		/* No need for an actual record if we already have a distinct LSN */
 		if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 082804e9c7d..24fb94f473e 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -181,7 +181,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(rel))
 		vstate.startNSN = GetInsertRecPtr();
 	else
-		vstate.startNSN = gistGetFakeLSN(rel, false);
+		vstate.startNSN = gistGetFakeLSN(rel);
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -376,7 +376,7 @@ restart:
 				PageSetLSN(page, recptr);
 			}
 			else
-				PageSetLSN(page, gistGetFakeLSN(rel, false));
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
 			END_CRIT_SECTION();
 
@@ -664,7 +664,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (RelationNeedsWAL(info->index))
 		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
 	else
-		recptr = gistGetFakeLSN(info->index, false);
+		recptr = gistGetFakeLSN(info->index);
 	PageSetLSN(parentPage, recptr);
 	PageSetLSN(leafPage, recptr);
 
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index d5b22bc1018..1aeb35cdcb7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -535,7 +535,7 @@ extern void gistMakeUnionKey(GISTSTATE *giststate, int attno,
 							 GISTENTRY *entry2, bool isnull2,
 							 Datum *dst, bool *dstisnull);
 
-extern XLogRecPtr gistGetFakeLSN(Relation rel, bool is_parallel);
+extern XLogRecPtr gistGetFakeLSN(Relation rel);
 
 /* gistvacuum.c */
 extern IndexBulkDeleteResult *gistbulkdelete(IndexVacuumInfo *info,
-- 
2.46.2

#21Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Tomas Vondra (#20)
Re: WIP: parallel GiST index builds

On 8 Oct 2024, at 17:05, Tomas Vondra <tomas@vondra.me> wrote:

Here's an updated patch adding the queryid.

I've taken another pass through the patch.
Everything I see seems correct to me. It just occurred to me that we will have three build strategies: buffered, parallel, and sorted, all aiming to speed things up. I really hope we find a way to parallelize the sorted build, because it will surely be the fastest.

Currently we have several instances of code like the following (or something similar or related):

if (is_build)
{
    if (is_parallel)
        recptr = GetFakeLSNForUnloggedRel();
    else
        recptr = GistBuildLSN;
}
else
{
    if (RelationNeedsWAL(rel))
    {
        recptr = actuallyWriteWAL();
    }
    else
        recptr = gistGetFakeLSN(rel);
}
// Use recptr

In a previous review I argued against adding arguments to gistGetFakeLSN(). But now I see that the logic for choosing the LSN source is sprawling across various places...
I no longer have a strong opinion on this. Do you think something like the following would be clearer?
if (!is_build && RelationNeedsWAL(rel))
{
    recptr = actuallyWriteWAL();
}
else
    recptr = gistGetFakeLSN(rel, is_build, is_parallel);

Just as an idea.
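
A minimal sketch of what such a consolidated helper might look like (the
name is a placeholder and this is not code from the patch;
actuallyWriteWAL() again stands in for the real gistXLogUpdate() /
gistXLogSplit() calls, which would stay at the call sites):

    /* hypothetical helper: a single place that picks the LSN source */
    static XLogRecPtr
    gistGetAssignableLSN(Relation rel, bool is_build, bool is_parallel)
    {
        if (is_build)
        {
            /* parallel build: distinct fake LSNs, so workers detect splits */
            if (is_parallel)
                return GetFakeLSNForUnloggedRel();

            /* serial build: the constant bogus LSN is enough */
            return GistBuildLSN;
        }

        /* regular insert into an unlogged or temp index */
        return gistGetFakeLSN(rel);
    }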

I'm mostly looking at the GiST-specific parts of the patch, while the machinery around entering parallel mode seems much more complicated. But as far as I can see, large portions of that code are taken from B-tree/BRIN.

Best regards, Andrey Borodin.

#22Tomas Vondra
tomas@vondra.me
In reply to: Andrey M. Borodin (#21)
Re: WIP: parallel GiST index builds

On 10/31/24 19:05, Andrey M. Borodin wrote:

On 8 Oct 2024, at 17:05, Tomas Vondra <tomas@vondra.me> wrote:

Here's an updated patch adding the queryid.

I've taken another pass through the patch.
Everything I see seems correct to me. It just occurred to me that we
will have three build strategies: buffered, parallel, and sorted, all
aiming to speed things up. I really hope we find a way to parallelize
the sorted build, because it will surely be the fastest.

The number of different ways to build a GiST index worries me. When we
had just buffered vs. sorted builds, that was a pretty easy decision,
because buffered builds are much faster 99% of the time.

But for parallel builds it's not that clear - it can easily happen that
we use much more CPU, without speeding anything up. We just start as
many parallel workers as possible, given the maintenance_work_mem value,
and hope for the best.
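
To illustrate with the existing GUCs (the table and numbers here are
made up):

    set maintenance_work_mem = '128MB';
    set max_parallel_maintenance_workers = 8;
    create index on t using gist (p);
    -- plan_create_index_workers() requires ~32MB per participant, so
    -- this build would be capped at roughly 3 workers, not 8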

Maybe with parallel buffered builds it'd be clearer ... but I'm not
sufficiently familiar with that code, and I don't have time to study
that at the moment because of other patches. Someone else would have to
take a stab at that ...

Currently we have several instances of code like the following (or
something similar or related):

if (is_build)
{
    if (is_parallel)
        recptr = GetFakeLSNForUnloggedRel();
    else
        recptr = GistBuildLSN;
}
else
{
    if (RelationNeedsWAL(rel))
    {
        recptr = actuallyWriteWAL();
    }
    else
        recptr = gistGetFakeLSN(rel);
}
// Use recptr

In a previous review I argued against adding arguments to
gistGetFakeLSN(). But now I see that the logic for choosing the LSN
source is sprawling across various places...

I no longer have a strong opinion on this. Do you think something
like the following would be clearer?
if (!is_build && RelationNeedsWAL(rel))
{
    recptr = actuallyWriteWAL();
}
else
    recptr = gistGetFakeLSN(rel, is_build, is_parallel);

Just as an idea.

I'm mostly looking at the GiST-specific parts of the patch, while the
machinery around entering parallel mode seems much more complicated. But
as far as I can see, large portions of that code are taken from
B-tree/BRIN.

I agree the repeated code is pretty tedious, and it's also easy to miss
one of the places when changing the logic, so I think wrapping that in
some function makes sense.

regards

--
Tomas Vondra

#23Kirill Reshke
reshkekirill@gmail.com
In reply to: Tomas Vondra (#22)
1 attachment(s)
Re: WIP: parallel GiST index builds

On Fri, 8 Nov 2024 at 19:53, Tomas Vondra <tomas@vondra.me> wrote:

On 10/31/24 19:05, Andrey M. Borodin wrote:

On 8 Oct 2024, at 17:05, Tomas Vondra <tomas@vondra.me> wrote:

Here's an updated patch adding the queryid.

I've taken another pass through the patch.
Everything I see seems correct to me. It just occurred to me that we
will have three build strategies: buffered, parallel, and sorted, all
aiming to speed things up. I really hope we find a way to parallelize
the sorted build, because it will surely be the fastest.

The number of different ways to build a GiST index worries me. When we
had just buffered vs. sorted builds, that was a pretty easy decision,
because buffered builds are much faster 99% of the time.

But for parallel builds it's not that clear - it can easily happen that
we use much more CPU, without speeding anything up. We just start as
many parallel workers as possible, given the maintenance_work_mem value,
and hope for the best.

Maybe with parallel buffered builds it'd be clearer ... but I'm not
sufficiently familiar with that code, and I don't have time to study
that at the moment because of other patches. Someone else would have to
take a stab at that ...

Currently we have several instances of code like the following (or
something similar or related):

if (is_build)
{
    if (is_parallel)
        recptr = GetFakeLSNForUnloggedRel();
    else
        recptr = GistBuildLSN;
}
else
{
    if (RelationNeedsWAL(rel))
    {
        recptr = actuallyWriteWAL();
    }
    else
        recptr = gistGetFakeLSN(rel);
}
// Use recptr

In a previous review I argued against adding arguments to
gistGetFakeLSN(). But now I see that the logic for choosing the LSN
source is sprawling across various places...

I no longer have a strong opinion on this. Do you think something
like the following would be clearer?
if (!is_build && RelationNeedsWAL(rel))
{
    recptr = actuallyWriteWAL();
}
else
    recptr = gistGetFakeLSN(rel, is_build, is_parallel);

Just as an idea.

I'm mostly looking at the GiST-specific parts of the patch, while the
machinery around entering parallel mode seems much more complicated. But
as far as I can see, large portions of that code are taken from
B-tree/BRIN.

I agree the repeated code is pretty tedious, and it's also easy to miss
one of the places when changing the logic, so I think wrapping that in
some function makes sense.

regards

--
Tomas Vondra

Hi!

Here is a rebased version of your patch.
I have tested this patch on a heavily skewed dataset (400 million copies of the same point). My test was:

yes 2 | awk '{print $1","$1}'| head -400000000 > data.dat
create table t (p point);
copy t from '....data.dat';

and then building the GiST index on this.

I got a 3x speedup for the buffered build:
create index on t using gist(p) with (buffering=on);

But this was still slower than the sorted build. Should we consider
speeding up sorted builds with parallel sorting?
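
For anyone reproducing this, one way to time the variants side by side
(note the patch disables buffering once parallel workers are involved,
and the plain CREATE INDEX uses the sorted build here, since point_ops
has sortsupport):

    \timing on

    -- build with parallel workers (the 3x case above)
    create index on t using gist(p) with (buffering=on);

    -- sorted build (the default for this opclass)
    create index on t using gist(p);

    -- serial baseline: no parallel workers
    set max_parallel_maintenance_workers = 0;
    create index on t using gist(p) with (buffering=on);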

I also checked the GiST indexes with amcheck from [0]. It did not complain.

[0]: /messages/by-id/CALdSSPhs4sCd5S_euKxTufk8sOA739ydBwhoGFYQenya7YZyiA@mail.gmail.com

--
Best regards,
Kirill Reshke

Attachments:

v20251023-0001-WIP-parallel-GiST-build.patch (application/octet-stream)
From dcc597545e5f50fdd93114c1b1fb410f55ad3fb7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sun, 26 May 2024 21:44:27 +0200
Subject: [PATCH v20251023] WIP parallel GiST build

Implements parallel GiST index build for the unsorted case. The build
simply starts parallel workers that insert values into the index the
usual way (as if there were multiple clients doing INSERT).

The basic infrastructure is copied from parallel BRIN builds (and also
from the nearby parallel GIN build). There's nothing particularly
special or interesting, except for the gistBuildParallelCallback()
callback. The two significant changes in the callback are:

1) disabling buffering

Buffered builds assume the worker is the only backend that can split
index pages etc. With serial workers that is trivially true, but with
parallel workers this leads to confusion.

In principle this is solvable by moving the buffers into shared memory
and coordinating the workers (locking etc.). But the patch does not do
that yet - it's clearly non-trivial, and I'm not really convinced it's
worth it.

2) generating "proper" fake LSNs

The serial builds disable all WAL-logging for the index, until the very
end when the whole index is WAL-logged. This however also means we don't
set page LSNs on the index pages - but page LSNs are used to detect
concurrent changes to the index structure (e.g. page splits). For serial
builds this does not matter, because only the build worker can modify
the index, so it just sets the same LSN "1" for all pages. Both of these
(disabling WAL-logging and using bogus page LSNs) are controlled by the
same is_build flag.

Having the same page LSN does not work for parallel builds, as it would
mean workers won't notice splits done by other workers, etc.

One option would be to set is_build=false, which enables WAL-logging, as
if during regular inserts, and also assigns proper page LSNs. But we
don't want to WAL-log everything, that's unnecessary. We want to only
start WAL-logging the index once the build completes, just like for
serial builds. And only do the fake LSNs, as for unlogged indexes etc.

So this introduces a separate flag is_parallel, which forces generating
the "proper" fake LSN. But we can still do is_build=true, and only log
the index at the end of the build.
---
 src/backend/access/gist/gist.c        |  31 +-
 src/backend/access/gist/gistbuild.c   | 721 +++++++++++++++++++++++++-
 src/backend/access/transam/parallel.c |   4 +
 src/include/access/gist_private.h     |  10 +-
 src/tools/pgindent/typedefs.list      |   2 +
 5 files changed, 733 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index a96796d5c7d..afaf7324f0e 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -78,7 +78,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = true;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = true;
 	amroutine->amusemaintenanceworkmem = false;
 	amroutine->amsummarizing = false;
@@ -187,7 +187,7 @@ gistinsert(Relation r, Datum *values, bool *isnull,
 	itup = gistFormTuple(giststate, r, values, isnull, true);
 	itup->t_tid = *ht_ctid;
 
-	gistdoinsert(r, itup, 0, giststate, heapRel, false);
+	gistdoinsert(r, itup, 0, giststate, heapRel, false, false);
 
 	/* cleanup */
 	MemoryContextSwitchTo(oldCxt);
@@ -235,7 +235,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 				List **splitinfo,
 				bool markfollowright,
 				Relation heapRel,
-				bool is_build)
+				bool is_build,
+				bool is_parallel)
 {
 	BlockNumber blkno = BufferGetBlockNumber(buffer);
 	Page		page = BufferGetPage(buffer);
@@ -506,9 +507,17 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 		 * smaller than any real or fake unlogged LSN that might be generated
 		 * later. (There can't be any concurrent scans during index build, so
 		 * we don't need to be able to detect concurrent splits yet.)
+		 *
+		 * However, with a parallel index build, we need to assign valid LSN,
+		 * as it's used to detect concurrent index modifications.
 		 */
 		if (is_build)
-			recptr = GistBuildLSN;
+		{
+			if (is_parallel)
+				recptr = GetFakeLSNForUnloggedRel();
+			else
+				recptr = GistBuildLSN;
+		}
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -575,7 +584,12 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 			MarkBufferDirty(leftchildbuf);
 
 		if (is_build)
-			recptr = GistBuildLSN;
+		{
+			if (is_parallel)
+				recptr = GetFakeLSNForUnloggedRel();
+			else
+				recptr = GistBuildLSN;
+		}
 		else
 		{
 			if (RelationNeedsWAL(rel))
@@ -637,7 +651,8 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
  */
 void
 gistdoinsert(Relation r, IndexTuple itup, Size freespace,
-			 GISTSTATE *giststate, Relation heapRel, bool is_build)
+			 GISTSTATE *giststate, Relation heapRel, bool is_build,
+			 bool is_parallel)
 {
 	ItemId		iid;
 	IndexTuple	idxtuple;
@@ -651,6 +666,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 	state.r = r;
 	state.heapRel = heapRel;
 	state.is_build = is_build;
+	state.is_parallel = is_parallel;
 
 	/* Start from the root */
 	firststack.blkno = GIST_ROOT_BLKNO;
@@ -1315,7 +1331,8 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 							   &splitinfo,
 							   true,
 							   state->heapRel,
-							   state->is_build);
+							   state->is_build,
+							   state->is_parallel);
 
 	/*
 	 * Before recursing up in case the page was split, release locks on the
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 9b2ec9815f1..f59c0e80c04 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -36,18 +36,28 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "optimizer/optimizer.h"
 #include "storage/bufmgr.h"
 #include "storage/bulk_write.h"
-
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/memutils.h"
 #include "utils/rel.h"
 #include "utils/tuplesort.h"
 
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIST_SHARED		UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000004)
+
 /* Step of index tuples for check whether to switch to buffering build mode */
 #define BUFFERING_MODE_SWITCH_CHECK_STEP 256
 
@@ -78,6 +88,109 @@ typedef enum
 	GIST_BUFFERING_ACTIVE,		/* in buffering build mode */
 } GistBuildMode;
 
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GISTShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 *
+	 * XXX nparticipants is the number of workers we expect to participate in
+	 * the build, possibly including the leader process.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			nparticipants;
+
+	/* Parameters determined by the leader, passed to the workers. */
+	GistBuildMode buildMode;
+	int			freespace;
+
+	/* Query ID, for report in worker processes */
+	uint64		queryid;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can finish
+	 * building the index.
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects the mutable fields that follow.
+	 *
+	 * These fields contain status information of interest to GIST index
+	 * builds that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GISTShared;
+
+/*
+ * Return pointer to a GISTShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGistShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GISTShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GISTLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipants is the exact number of worker processes successfully
+	 * launched, plus one leader process if it participates as a worker (only
+	 * DISABLE_LEADER_PARTICIPATION builds avoid leader participating as a
+	 * worker).
+	 *
+	 * XXX Seems a bit redundant with nparticipants in GISTShared. Although
+	 * that is the expected number, this is what we actually got.
+	 */
+	int			nparticipants;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GISTShared is the shared state for entire build. snapshot is the
+	 * snapshot used by the scan iff an MVCC snapshot is required.
+	 */
+	GISTShared *gistshared;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GISTLeader;
+
 /* Working state for gistbuild and its callback */
 typedef struct
 {
@@ -100,6 +213,14 @@ typedef struct
 	GISTBuildBuffers *gfbb;
 	HTAB	   *parentMap;
 
+	/*
+	 * gist_leader is only present when a parallel index build is performed,
+	 * and only in the leader process. (Worker processes create their own
+	 * GISTBuildState and leave this field NULL.)
+	 */
+	bool		is_parallel;
+	GISTLeader *gist_leader;
+
 	/*
 	 * Extra data structures used during a sorting build.
 	 */
@@ -148,6 +269,12 @@ static void gistBuildCallback(Relation index,
 							  bool *isnull,
 							  bool tupleIsAlive,
 							  void *state);
+static void gistBuildParallelCallback(Relation index,
+									  ItemPointer tid,
+									  Datum *values,
+									  bool *isnull,
+									  bool tupleIsAlive,
+									  void *state);
 static void gistBufferingBuildInsert(GISTBuildState *buildstate,
 									 IndexTuple itup);
 static bool gistProcessItup(GISTBuildState *buildstate, IndexTuple itup,
@@ -171,6 +298,18 @@ static void gistMemorizeAllDownlinks(GISTBuildState *buildstate,
 									 Buffer parentbuf);
 static BlockNumber gistGetParent(GISTBuildState *buildstate, BlockNumber child);
 
+/* parallel index builds */
+static void _gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
+								 bool isconcurrent, int request);
+static void _gist_end_parallel(GISTLeader *gistleader, GISTBuildState *state);
+static Size _gist_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gist_parallel_heapscan(GISTBuildState *buildstate);
+static void _gist_leader_participate_as_worker(GISTBuildState *buildstate,
+											   Relation heap, Relation index);
+static void _gist_parallel_scan_and_build(GISTBuildState *buildstate,
+										  GISTShared *gistshared,
+										  Relation heap, Relation index,
+										  int workmem, bool progress);
 
 /*
  * Main entry point to GiST index build.
@@ -199,6 +338,10 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.sortstate = NULL;
 	buildstate.giststate = initGISTstate(index);
 
+	/* assume serial build */
+	buildstate.is_parallel = false;
+	buildstate.gist_leader = NULL;
+
 	/*
 	 * Create a temporary memory context that is reset once for each tuple
 	 * processed.  (Note: we don't bother to make this a child of the
@@ -309,37 +452,79 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 
 		END_CRIT_SECTION();
 
-		/* Scan the table, inserting all the tuples to the index. */
-		reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
-										   gistBuildCallback,
-										   &buildstate, NULL);
-
 		/*
-		 * If buffering was used, flush out all the tuples that are still in
-		 * the buffers.
+		 * Attempt to launch parallel workers for the scan, if requested.
+		 *
+		 * XXX plan_create_index_workers makes the number of workers dependent
+		 * on maintenance_work_mem, requiring 32MB for each worker. That makes
+		 * sense for btree, but maybe not for GIST (at least when not using
+		 * buffering)? So maybe make that somehow less strict, optionally?
 		 */
-		if (buildstate.buildMode == GIST_BUFFERING_ACTIVE)
-		{
-			elog(DEBUG1, "all tuples processed, emptying buffers");
-			gistEmptyAllBuffers(&buildstate);
-			gistFreeBuildBuffers(buildstate.gfbb);
-		}
+		if (indexInfo->ii_ParallelWorkers > 0)
+			_gist_begin_parallel(&buildstate, heap,
+								 index, indexInfo->ii_Concurrent,
+								 indexInfo->ii_ParallelWorkers);
 
 		/*
-		 * We didn't write WAL records as we built the index, so if
-		 * WAL-logging is required, write all pages to the WAL now.
+		 * If a parallel build was requested and at least one worker process
+		 * was successfully launched, set up coordination state, wait for the
+		 * workers to complete, and end the parallel build.
+		 *
+		 * In serial mode, simply scan the table and build the index one index
+		 * tuple at a time.
 		 */
-		if (RelationNeedsWAL(index))
+		if (buildstate.gist_leader)
 		{
-			log_newpage_range(index, MAIN_FORKNUM,
-							  0, RelationGetNumberOfBlocks(index),
-							  true);
+			/* scan the relation and wait for parallel workers to finish */
+			reltuples = _gist_parallel_heapscan(&buildstate);
+
+			_gist_end_parallel(buildstate.gist_leader, &buildstate);
+
+			/*
+			 * We didn't write WAL records as we built the index, so if
+			 * WAL-logging is required, write all pages to the WAL now.
+			 */
+			if (RelationNeedsWAL(index))
+			{
+				log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+			}
 		}
-	}
+		else
+		{
+			/* Scan the table, inserting all the tuples to the index. */
+			reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+											   gistBuildCallback,
+											   (void *) &buildstate, NULL);
 
-	/* okay, all heap tuples are indexed */
-	MemoryContextSwitchTo(oldcxt);
-	MemoryContextDelete(buildstate.giststate->tempCxt);
+			/*
+			 * If buffering was used, flush out all the tuples that are still
+			 * in the buffers.
+			 */
+			if (buildstate.buildMode == GIST_BUFFERING_ACTIVE)
+			{
+				elog(DEBUG1, "all tuples processed, emptying buffers");
+				gistEmptyAllBuffers(&buildstate);
+				gistFreeBuildBuffers(buildstate.gfbb);
+			}
+
+			/*
+			 * We didn't write WAL records as we built the index, so if
+			 * WAL-logging is required, write all pages to the WAL now.
+			 */
+			if (RelationNeedsWAL(index))
+			{
+				log_newpage_range(index, MAIN_FORKNUM,
+								  0, RelationGetNumberOfBlocks(index),
+								  true);
+			}
+
+			/* okay, all heap tuples are indexed */
+			MemoryContextSwitchTo(oldcxt);
+			MemoryContextDelete(buildstate.giststate->tempCxt);
+		}
+	}
 
 	freeGISTstate(buildstate.giststate);
 
@@ -863,7 +1048,7 @@ gistBuildCallback(Relation index,
 		 * locked, we call gistdoinsert directly.
 		 */
 		gistdoinsert(index, itup, buildstate->freespace,
-					 buildstate->giststate, buildstate->heaprel, true);
+					 buildstate->giststate, buildstate->heaprel, true, false);
 	}
 
 	MemoryContextSwitchTo(oldCtx);
@@ -902,6 +1087,48 @@ gistBuildCallback(Relation index,
 	}
 }
 
+/*
+ * Per-tuple callback for table_index_build_scan.
+ *
+ * XXX Almost the same as gistBuildCallback, but passes is_parallel=true
+ * when calling gistdoinsert. Without that we'd get assertion failures, as
+ * the workers modify the index concurrently.
+ */
+static void
+gistBuildParallelCallback(Relation index,
+						  ItemPointer tid,
+						  Datum *values,
+						  bool *isnull,
+						  bool tupleIsAlive,
+						  void *state)
+{
+	GISTBuildState *buildstate = (GISTBuildState *) state;
+	IndexTuple	itup;
+	MemoryContext oldCtx;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->giststate->tempCxt);
+
+	/* form an index tuple and point it at the heap tuple */
+	itup = gistFormTuple(buildstate->giststate, index,
+						 values, isnull,
+						 true);
+	itup->t_tid = *tid;
+
+	/* Update tuple count and total size. */
+	buildstate->indtuples += 1;
+	buildstate->indtuplesSize += IndexTupleSize(itup);
+
+	/*
+	 * There's no buffers (yet). Since we already have the index relation
+	 * locked, we call gistdoinsert directly.
+	 */
+	gistdoinsert(index, itup, buildstate->freespace,
+				 buildstate->giststate, buildstate->heaprel, true, true);
+
+	MemoryContextSwitchTo(oldCtx);
+	MemoryContextReset(buildstate->giststate->tempCxt);
+}
+
 /*
  * Insert function for buffering index build.
  */
@@ -1070,7 +1297,8 @@ gistbufferinginserttuples(GISTBuildState *buildstate, Buffer buffer, int level,
 							   InvalidBuffer,
 							   &splitinfo,
 							   false,
-							   buildstate->heaprel, true);
+							   buildstate->heaprel, true,
+							   buildstate->is_parallel);
 
 	/*
 	 * If this is a root split, update the root path item kept in memory. This
@@ -1579,3 +1807,444 @@ gistGetParent(GISTBuildState *buildstate, BlockNumber child)
 
 	return entry->parentblkno;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should already be initialized by the caller.
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's gist_leader, which caller must use to shut down parallel
+ * mode by passing it to _gist_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gist_begin_parallel(GISTBuildState *buildstate, Relation heap, Relation index,
+					 bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			nparticipants;
+	Snapshot	snapshot;
+	Size		estgistshared;
+	GISTShared *gistshared;
+	GISTLeader *gistleader = (GISTLeader *) palloc0(sizeof(GISTLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of GIST
+	 * index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gist_parallel_build_main",
+								 request);
+
+	nparticipants = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIST_SHARED workspace.
+	 */
+	estgistshared = _gist_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estgistshared);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	gistshared = (GISTShared *) shm_toc_allocate(pcxt->toc, estgistshared);
+	/* Initialize immutable state */
+	gistshared->heaprelid = RelationGetRelid(heap);
+	gistshared->indexrelid = RelationGetRelid(index);
+	gistshared->isconcurrent = isconcurrent;
+	gistshared->nparticipants = nparticipants;
+
+	gistshared->queryid = pgstat_get_my_query_id();
+
+	/* Build parameters determined by the leader */
+	gistshared->buildMode = buildstate->buildMode;
+	gistshared->freespace = buildstate->freespace;
+
+	ConditionVariableInit(&gistshared->workersdonecv);
+	SpinLockInit(&gistshared->mutex);
+
+	/* Initialize mutable state */
+	gistshared->nparticipantsdone = 0;
+	gistshared->reltuples = 0.0;
+	gistshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGistShared(gistshared),
+								  snapshot);
+
+	/* Store shared state, for which we reserved space. */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIST_SHARED, gistshared);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	gistleader->pcxt = pcxt;
+	gistleader->nparticipants = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		gistleader->nparticipants++;
+	gistleader->gistshared = gistshared;
+	gistleader->snapshot = snapshot;
+	gistleader->walusage = walusage;
+	gistleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gist_end_parallel(gistleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->is_parallel = true;
+	buildstate->gist_leader = gistleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gist_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gist_end_parallel(GISTLeader *gistleader, GISTBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(gistleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL and buffer usage.  (This must wait for the
+	 * workers to finish, or we might get incomplete data.)
+	 */
+	for (i = 0; i < gistleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&gistleader->bufferusage[i], &gistleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(gistleader->snapshot))
+		UnregisterSnapshot(gistleader->snapshot);
+	DestroyParallelContext(gistleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gist_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe needs to flush data if GIST_BUFFERING_ACTIVE, a bit like in
+ * the serial build?
+ */
+static double
+_gist_parallel_heapscan(GISTBuildState *state)
+{
+	GISTShared *gistshared = state->gist_leader->gistshared;
+	int			nparticipants;
+	double		reltuples = 0;
+
+	nparticipants = state->gist_leader->nparticipants;
+	for (;;)
+	{
+		SpinLockAcquire(&gistshared->mutex);
+		if (gistshared->nparticipantsdone == nparticipants)
+		{
+			/* copy the tuple counts into the leader state */
+			state->indtuples = gistshared->indtuples;
+			reltuples = gistshared->reltuples;
+
+			SpinLockRelease(&gistshared->mutex);
+			break;
+		}
+		SpinLockRelease(&gistshared->mutex);
+
+		ConditionVariableSleep(&gistshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gist index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gist_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GISTShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gist_leader_participate_as_worker(GISTBuildState *buildstate,
+								   Relation heap, Relation index)
+{
+	GISTLeader *gistleader = buildstate->gist_leader;
+	int			workmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	workmem = maintenance_work_mem / gistleader->nparticipants;
+
+	/* Perform work common to all participants */
+	_gist_parallel_scan_and_build(buildstate, gistleader->gistshared,
+								  heap, index, workmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel scan and insert.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gist_parallel_scan_and_build(GISTBuildState *state,
+							  GISTShared *gistshared,
+							  Relation heap, Relation index,
+							  int workmem, bool progress)
+{
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+	MemoryContext oldcxt = CurrentMemoryContext;
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = gistshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGistShared(gistshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+									   gistBuildParallelCallback, state, scan);
+
+	/*
+	 * If buffering was used, flush out all the tuples that are still in the
+	 * buffers.
+	 */
+	if (state->buildMode == GIST_BUFFERING_ACTIVE)
+	{
+		elog(DEBUG1, "all tuples processed, emptying buffers");
+		gistEmptyAllBuffers(state);
+		gistFreeBuildBuffers(state->gfbb);
+	}
+
+	/* okay, all heap tuples are indexed */
+	MemoryContextSwitchTo(oldcxt);
+	MemoryContextDelete(state->giststate->tempCxt);
+
+	/* FIXME Do we need to do something else with active buffering? */
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&gistshared->mutex);
+	gistshared->nparticipantsdone++;
+	gistshared->reltuples += reltuples;
+	gistshared->indtuples += state->indtuples;
+	SpinLockRelease(&gistshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&gistshared->workersdonecv);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gist_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GISTShared *gistshared;
+	GISTBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			workmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up GIST shared state */
+	gistshared = shm_toc_lookup(toc, PARALLEL_KEY_GIST_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!gistshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Track query ID */
+	pgstat_report_query_id(gistshared->queryid, false);
+
+	/* Open relations within worker */
+	heapRel = table_open(gistshared->heaprelid, heapLockmode);
+	indexRel = index_open(gistshared->indexrelid, indexLockmode);
+
+	buildstate.indexrel = indexRel;
+	buildstate.heaprel = heapRel;
+	buildstate.sortstate = NULL;
+	buildstate.giststate = initGISTstate(indexRel);
+
+	buildstate.is_parallel = true;
+	buildstate.gist_leader = NULL;
+
+	/*
+	 * Create a temporary memory context that is reset once for each tuple
+	 * processed.  (Note: we don't bother to make this a child of the
+	 * giststate's scanCxt, so we have to delete it separately at the end.)
+	 */
+	buildstate.giststate->tempCxt = createTempGistContext();
+
+	/* FIXME */
+	buildstate.buildMode = gistshared->buildMode;
+	buildstate.freespace = gistshared->freespace;
+
+	buildstate.indtuples = 0;
+	buildstate.indtuplesSize = 0;
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	workmem = maintenance_work_mem / gistshared->nparticipants;
+
+	_gist_parallel_scan_and_build(&buildstate, gistshared,
+								  heapRel, indexRel, workmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 94db1ec3012..cb853a4ecd8 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -16,6 +16,7 @@
 
 #include "access/brin.h"
 #include "access/gin.h"
+#include "access/gist_private.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -152,6 +153,9 @@ static const struct
 	{
 		"_gin_parallel_build_main", _gin_parallel_build_main
 	},
+	{
+		"_gist_parallel_build_main", _gist_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 39404ec7cdb..a756462aae9 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -20,6 +20,7 @@
 #include "lib/pairingheap.h"
 #include "storage/bufmgr.h"
 #include "storage/buffile.h"
+#include "storage/shm_toc.h"
 #include "utils/hsearch.h"
 #include "access/genam.h"
 
@@ -254,6 +255,7 @@ typedef struct
 	Relation	heapRel;
 	Size		freespace;		/* free space to be left */
 	bool		is_build;
+	bool		is_parallel;
 
 	GISTInsertStack *stack;
 } GISTInsertState;
@@ -413,7 +415,8 @@ extern void gistdoinsert(Relation r,
 						 Size freespace,
 						 GISTSTATE *giststate,
 						 Relation heapRel,
-						 bool is_build);
+						 bool is_build,
+						 bool is_parallel);
 
 /* A List of these is returned from gistplacetopage() in *splitinfo */
 typedef struct
@@ -430,7 +433,8 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 							List **splitinfo,
 							bool markfollowright,
 							Relation heapRel,
-							bool is_build);
+							bool is_build,
+							bool is_parallel);
 
 extern SplitPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 								  int len, GISTSTATE *giststate);
@@ -568,4 +572,6 @@ extern void gistRelocateBuildBuffersOnSplit(GISTBuildBuffers *gfbb,
 											List *splitinfo);
 extern void gistUnloadNodeBuffers(GISTBuildBuffers *gfbb);
 
+extern void _gist_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIST_PRIVATE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5fc2460f05c..e73fbbc0976 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1011,6 +1011,7 @@ GISTInsertStack
 GISTInsertState
 GISTIntArrayBigOptions
 GISTIntArrayOptions
+GISTLeader
 GISTNodeBuffer
 GISTNodeBufferPage
 GISTPageOpaque
@@ -1021,6 +1022,7 @@ GISTScanOpaque
 GISTScanOpaqueData
 GISTSearchHeapItem
 GISTSearchItem
+GISTShared
 GISTTYPE
 GIST_SPLITVEC
 GMReaderTupleBuffer
-- 
2.43.0

#24Tomas Vondra
tomas@vondra.me
In reply to: Kirill Reshke (#23)
Re: WIP: parallel GiST index builds

On 10/23/25 16:13, Kirill Reshke wrote:

On Fri, 8 Nov 2024 at 19:53, Tomas Vondra <tomas@vondra.me> wrote:

On 10/31/24 19:05, Andrey M. Borodin wrote:

On 8 Oct 2024, at 17:05, Tomas Vondra <tomas@vondra.me> wrote:

Here's an updated patch adding the queryid.

I've taken another round of looking through the patch.
Everything I see seems correct to me. It just occurred to me that we
will have: buffered build, parallel build, sorted build. All 3 aiming
to speed things up. I really hope that we will find a way to parallelize
the sorted build, because it will be the fastest for sure.

The number of different ways to build a GiST index worries me. When we
had just buffered vs. sorted builds, that was a pretty easy decision,
because buffered builds are much faster 99% of the time.

But for parallel builds it's not that clear - it can easily happen that
we use much more CPU, without speeding anything up. We just start as
many parallel workers as possible, given the maintenance_work_mem value,
and hope for the best.

Maybe with parallel buffered builds it'd be clearer ... but I'm not
sufficiently familiar with that code, and I don't have time to study
that at the moment because of other patches. Someone else would have to
take a stab at that ...

Currently we have some instances of such code... or similar or related
code:

if (is_build)
{
    if (is_parallel)
        recptr = GetFakeLSNForUnloggedRel();
    else
        recptr = GistBuildLSN;
}
else
{
    if (RelationNeedsWAL(rel))
    {
        recptr = actuallyWriteWAL();
    }
    else
        recptr = gistGetFakeLSN(rel);
}
// Use recptr

In a previous review I was a proponent of not adding arguments to
gistGetFakeLSN(). But now I see that the logic for choosing the LSN
source is sprawling across various places...

I do not have a strong point of view on this anymore. Do you think
something like the following would be clearer?

if (!is_build && RelationNeedsWAL(rel))
{
    recptr = actuallyWriteWAL();
}
else
    recptr = gistGetFakeLSN(rel, is_build, is_parallel);

Just as an idea.
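
To make the idea concrete, here's a rough sketch of what the extended
helper could look like. This is only an illustration, not a tested
patch: the three-argument signature is the one proposed above, and
GistBuildLSN / GetFakeLSNForUnloggedRel() are what the patch already
uses, but the real gistGetFakeLSN() in gistutil.c has more to it (e.g.
the special handling of temporary relations) that an actual version
would have to preserve:

XLogRecPtr
gistGetFakeLSN(Relation rel, bool is_build, bool is_parallel)
{
    /*
     * Serial build: a constant bogus LSN is enough, because no other
     * backend can split pages while we build the index.
     */
    if (is_build && !is_parallel)
        return GistBuildLSN;

    /*
     * Parallel build, or a regular insert into an index that doesn't
     * need WAL: we need LSNs that actually advance, so that concurrent
     * page splits are detected.
     */
    return GetFakeLSNForUnloggedRel();
}

Callers that do write WAL (the !is_build && RelationNeedsWAL(rel) case)
would still use the real record's LSN, so the helper only needs to
cover the "fake" cases.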

I'm mostly looking at the GiST-specific parts of the patch, while the
things around entering parallel mode seem much more complicated. But as
far as I can see, large portions of this code are taken from B-tree/BRIN.

I agree the repeated code is pretty tedious, and it's also easy to miss
one of the places when changing the logic, so I think wrapping that in
some function makes sense.

regards

--
Tomas Vondra

Hi!

Here is a rebased version of your patch.
I have tested this patch on a very biased dataset. My test was:

yes 2 | awk '{print $1","$1}'| head -400000000 > data.dat
create table t (p point);
copy t from '....data.dat';

and then build GiST index on this.

I did get a 3x speedup for the buffered build:
create index on t using gist(p) with (buffering=on);

But this was still slower than the sorted build. Should we consider
speeding up sorted builds with parallel sorting?

Yes, this was the reason why I did not try to do parallel GiST builds in
PG18. The sorted builds are more efficient than parallel builds - maybe
not always, but I'm not sure how to reliably decide that. Maybe parallel
sorted builds would be faster, not sure.

regards

--
Tomas Vondra