Proposal: speeding up GIN build with parallel workers

Started by Constantin S. Pan, almost 10 years ago. 30 messages.
#1 Constantin S. Pan
kvapen@gmail.com
1 attachment(s)

Hi, Hackers.

The task of building a GIN index can require a lot of time and eats 100 % of
one CPU, but we could easily make it use more than 100 %, especially since we
now have parallel workers in postgres.

The process of building GIN looks like this:

1. Accumulate a batch of index records into an rbtree in maintenance
work memory.

2. Dump the batch to disk.

3. Repeat.
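
Concretely, step 2 walks the accumulated rbtree in key order and inserts
each entry into the index. Stripped of memory-context handling, that loop
is roughly:

ginBeginBAScan(&buildstate.accum);
while ((list = ginGetBAEntry(&buildstate.accum,
                             &attnum, &key, &category, &nlist)) != NULL)
{
    /* there could be many entries, so be willing to abort here */
    CHECK_FOR_INTERRUPTS();
    ginEntryInsert(&buildstate.ginstate, attnum, key, category,
                   list, nlist, &buildstate.buildStats);
}
ginInitBA(&buildstate.accum);    /* start the next batch empty */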

I have a draft implementation which divides the whole process between
N parallel workers; see the patch attached. Instead of a full scan of
the relation, I give each worker a range of blocks to read.
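
The hand-out of block ranges is just a shared cursor that the workers
advance in small chunks (4 blocks at a time in the patch). A minimal
standalone sketch of the same idea with a C11 atomic counter (names are
illustrative; the patch keeps the counter in a shared-memory struct
protected by a spinlock):

#include <stdatomic.h>

typedef struct
{
    atomic_int  scanned;    /* next block that has not been claimed yet */
    int         to_scan;    /* total number of blocks in the heap */
} BuildTask;

/* Claim up to max_blocks blocks; returns the number claimed, 0 when done. */
int
claim_blocks(BuildTask *task, int max_blocks, int *first)
{
    int start = atomic_fetch_add(&task->scanned, max_blocks);

    if (start >= task->to_scan)
        return 0;                           /* nothing left to scan */
    *first = start;
    if (start + max_blocks > task->to_scan)
        return task->to_scan - start;       /* last, partial chunk */
    return max_blocks;
}

Each worker then runs IndexBuildHeapRangeScan() over the blocks it claimed.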

This speeds up the first step N times, but slows down the second one,
because when multiple workers dump item pointers for the same key, each
of them has to read and decode the results of the previous one. That is
a huge waste, but there is an idea on how to eliminate it.

When it comes to dumping the next batch, a worker does not do it
independently. Instead, it (and every other worker) sends the
accumulated index records to the parent (backend) in ascending key
order. The backend, which receives the records from the workers through
shared memory, can merge them and dump each of them once, without the
need to reread the records N-1 times.
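
The backend side is then an ordinary N-way merge of sorted streams: look at
the current key of every worker, take the smallest, combine the posting
lists of all the workers sitting on that key, insert the entry once, and
advance those workers. A tiny standalone model of that loop, with plain
integers standing in for keys (illustrative only; no shared memory or
posting lists here):

#include <stdio.h>
#include <limits.h>

#define NWORKERS 3

/* Each worker emits its keys in ascending order; INT_MAX marks the end. */
static int streams[NWORKERS][4] = {
    {2, 5, 7, INT_MAX},
    {2, 3, 7, INT_MAX},
    {1, 5, 9, INT_MAX},
};

int
main(void)
{
    int pos[NWORKERS] = {0, 0, 0};

    for (;;)
    {
        int min = INT_MAX;

        /* find the smallest current key among all workers */
        for (int i = 0; i < NWORKERS; i++)
            if (streams[i][pos[i]] < min)
                min = streams[i][pos[i]];
        if (min == INT_MAX)
            break;              /* every stream is exhausted */

        /* every worker sitting on "min" contributes its posting list */
        printf("key %d from workers:", min);
        for (int i = 0; i < NWORKERS; i++)
        {
            if (streams[i][pos[i]] == min)
            {
                printf(" %d", i);
                pos[i]++;       /* advance that worker */
            }
        }
        printf("\n");           /* insert the merged entry once */
    }
    return 0;
}

This way each index record is decoded and written exactly once by the
backend, instead of being reread by every later worker.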

In its current state the implementation is just a proof of concept
and has all the configuration hardcoded, but it already works as is,
though it does not speed up the build process more than 4 times on my
configuration (12 CPUs). There is also a problem with temporary tables,
for which the parallel mode does not work.

Please leave your feedback.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

pgin.patch (text/x-patch)
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 8bcd159..6ea8f78 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -16,11 +16,14 @@
 
 #include "access/gin_private.h"
 #include "access/xloginsert.h"
+#include "access/parallel.h"
+#include "access/xact.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
 #include "storage/indexfsm.h"
+#include "storage/spin.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -265,6 +268,10 @@ ginHeapTupleBulkInsert(GinBuildState *buildstate, OffsetNumber attnum,
 	MemoryContextReset(buildstate->funcCtx);
 }
 
+#define KEY_TASK 42
+#define GIN_MAX_WORKERS 4
+#define GIN_BLOCKS_PER_WORKER 4
+
 static void
 ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 				 bool *isnull, bool tupleIsAlive, void *state)
@@ -306,17 +313,34 @@ ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
-Datum
-ginbuild(PG_FUNCTION_ARGS)
-{
-	Relation	heap = (Relation) PG_GETARG_POINTER(0);
-	Relation	index = (Relation) PG_GETARG_POINTER(1);
-	IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
-	IndexBuildResult *result;
-	double		reltuples;
+/*
+ * This structure describes the GIN build task for the parallel workers. We use
+ * OIDs here because workers are separate processes and pointers may become
+ * meaningless for them. The "lock" is used to protect the "scanned" field as
+ * the workers read and increase its value to distribute the task between each
+ * other.
+ */
+typedef struct {
+	int		to_scan;
+	int		scanned;
+	slock_t	lock;
+	Oid		heap_oid;
+	Oid		index_oid;
+} PGinBuildTask;
+
+/*
+ * The idea behind parallel GIN build is letting the parallel workers scan the
+ * relation and build rbtree in parallel. Each block is scanned by one of the
+ * workers.
+ */
+static void pginbuild(dsm_segment *seg, shm_toc *toc) {
 	GinBuildState buildstate;
-	Buffer		RootBuffer,
-				MetaBuffer;
+
+	Relation heap;
+	Relation index;
+	IndexInfo *indexInfo;
+
+	int reltuples = 0;
 	ItemPointerData *list;
 	Datum		key;
 	GinNullCategory category;
@@ -324,48 +348,16 @@ ginbuild(PG_FUNCTION_ARGS)
 	MemoryContext oldCtx;
 	OffsetNumber attnum;
 
-	if (RelationGetNumberOfBlocks(index) != 0)
-		elog(ERROR, "index \"%s\" already contains data",
-			 RelationGetRelationName(index));
+	volatile PGinBuildTask *task = (PGinBuildTask*)shm_toc_lookup(toc, KEY_TASK);
+
+	heap = heap_open(task->heap_oid, NoLock);
+	index = index_open(task->index_oid, NoLock);
+	indexInfo = BuildIndexInfo(index);
 
 	initGinState(&buildstate.ginstate, index);
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
-	/* initialize the meta page */
-	MetaBuffer = GinNewBuffer(index);
-
-	/* initialize the root page */
-	RootBuffer = GinNewBuffer(index);
-
-	START_CRIT_SECTION();
-	GinInitMetabuffer(MetaBuffer);
-	MarkBufferDirty(MetaBuffer);
-	GinInitBuffer(RootBuffer, GIN_LEAF);
-	MarkBufferDirty(RootBuffer);
-
-	if (RelationNeedsWAL(index))
-	{
-		XLogRecPtr	recptr;
-		Page		page;
-
-		XLogBeginInsert();
-		XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT);
-		XLogRegisterBuffer(1, RootBuffer, REGBUF_WILL_INIT);
-
-		recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_CREATE_INDEX);
-
-		page = BufferGetPage(RootBuffer);
-		PageSetLSN(page, recptr);
-
-		page = BufferGetPage(MetaBuffer);
-		PageSetLSN(page, recptr);
-	}
-
-	UnlockReleaseBuffer(MetaBuffer);
-	UnlockReleaseBuffer(RootBuffer);
-	END_CRIT_SECTION();
-
 	/* count the root as first entry page */
 	buildstate.buildStats.nEntryPages++;
 
@@ -396,8 +388,27 @@ ginbuild(PG_FUNCTION_ARGS)
 	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
 	 * prefers to receive tuples in TID order.
 	 */
-	reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
-								   ginBuildCallback, (void *) &buildstate);
+	while (true) {
+		int subtuples;
+		int scan_from, to_scan_this_time;
+		SpinLockAcquire(&task->lock);
+		if (task->scanned >= task->to_scan)
+		{
+			SpinLockRelease(&task->lock);
+			break;
+		}
+		scan_from = task->scanned;
+		to_scan_this_time = GIN_BLOCKS_PER_WORKER;
+		if (to_scan_this_time > task->to_scan - task->scanned)
+			to_scan_this_time = task->to_scan - task->scanned;
+		task->scanned += to_scan_this_time;
+		SpinLockRelease(&task->lock);
+
+		subtuples = IndexBuildHeapRangeScan(heap, index, indexInfo, false, false,
+											scan_from, to_scan_this_time,
+											ginBuildCallback, (void *)&buildstate);
+		reltuples += subtuples;
+	};
 
 	/* dump remaining entries to the index */
 	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
@@ -421,13 +432,94 @@ ginbuild(PG_FUNCTION_ARGS)
 	buildstate.buildStats.nTotalPages = RelationGetNumberOfBlocks(index);
 	ginUpdateStats(index, &buildstate.buildStats);
 
+	index_close(index, NoLock);
+	heap_close(heap, NoLock);
+}
+
+Datum
+ginbuild(PG_FUNCTION_ARGS)
+{
+	Relation	heap = (Relation) PG_GETARG_POINTER(0);
+	Relation	index = (Relation) PG_GETARG_POINTER(1);
+	IndexBuildResult *result;
+	Buffer		RootBuffer,
+				MetaBuffer;
+
+	if (RelationGetNumberOfBlocks(index) != 0)
+		elog(ERROR, "index \"%s\" already contains data",
+			 RelationGetRelationName(index));
+
+	/* initialize the meta page */
+	MetaBuffer = GinNewBuffer(index);
+
+	/* initialize the root page */
+	RootBuffer = GinNewBuffer(index);
+
+	START_CRIT_SECTION();
+	GinInitMetabuffer(MetaBuffer);
+	MarkBufferDirty(MetaBuffer);
+	GinInitBuffer(RootBuffer, GIN_LEAF);
+	MarkBufferDirty(RootBuffer);
+
+	if (RelationNeedsWAL(index))
+	{
+		XLogRecPtr	recptr;
+		Page		page;
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, MetaBuffer, REGBUF_WILL_INIT);
+		XLogRegisterBuffer(1, RootBuffer, REGBUF_WILL_INIT);
+
+		recptr = XLogInsert(RM_GIN_ID, XLOG_GIN_CREATE_INDEX);
+
+		page = BufferGetPage(RootBuffer);
+		PageSetLSN(page, recptr);
+
+		page = BufferGetPage(MetaBuffer);
+		PageSetLSN(page, recptr);
+	}
+
+	UnlockReleaseBuffer(MetaBuffer);
+	UnlockReleaseBuffer(RootBuffer);
+	END_CRIT_SECTION();
+
+	EnterParallelMode();
+	{
+		int size = sizeof(PGinBuildTask);
+		int keys = 1;
+		PGinBuildTask   *task;
+		ParallelContext *pcxt;
+
+		pcxt = CreateParallelContext(pginbuild, GIN_MAX_WORKERS);
+
+		shm_toc_estimate_chunk(&pcxt->estimator, size);
+		shm_toc_estimate_keys(&pcxt->estimator, keys);
+		InitializeParallelDSM(pcxt);
+		task = (PGinBuildTask*)shm_toc_allocate(pcxt->toc, size);
+		shm_toc_insert(pcxt->toc, KEY_TASK, (void*)task);
+		SpinLockInit(&task->lock);
+		task->to_scan = RelationGetNumberOfBlocks(heap);
+		task->heap_oid = heap->rd_id;
+		task->index_oid = index->rd_id;
+		LaunchParallelWorkers(pcxt);
+
+		WaitForParallelWorkersToFinish(pcxt);
+
+		DestroyParallelContext(pcxt);
+	}
+	ExitParallelMode();
+
 	/*
 	 * Return statistics
 	 */
 	result = (IndexBuildResult *) palloc(sizeof(IndexBuildResult));
 
-	result->heap_tuples = reltuples;
-	result->index_tuples = buildstate.indtuples;
+	/*
+	 * FIXME: implement statistics aggregation from the workers.
+	 * These numbers do not mean anything until then.
+	 */
+	result->heap_tuples = 123;
+	result->index_tuples = 456;
 
 	PG_RETURN_POINTER(result);
 }
#2 Peter Geoghegan
pg@heroku.com
In reply to: Constantin S. Pan (#1)
Re: Proposal: speeding up GIN build with parallel workers

On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:

I have a draft implementation which divides the whole process between
N parallel workers, see the patch attached. Instead of a full scan of
the relation, I give each worker a range of blocks to read.

I am currently working on a patch that allows B-Tree index builds to
be performed in parallel. I think I'm a week or two away from posting
it.

Even without parallelism, wouldn't it be better if GIN indexes were
built using tuplesort? I know way way less about the gin am than the
nbtree am, but I imagine that a prominent cost for GIN index builds is
constructing the main B-Tree (the one that's constructed over key
values) itself. Couldn't tuplesort.c be adapted to cover this case?
That would be much faster in general, particularly with the recent
addition of abbreviated keys, while also leaving a clear path forward
to performing the build in parallel.

I understand that a long term ambition for the gin am is to merge it
with nbtree, to almost automatically benefit from enhancements, and to
reduce the maintenance burden of each.

--
Peter Geoghegan


#3 Constantin S. Pan
kvapen@gmail.com
In reply to: Peter Geoghegan (#2)
Re: Proposal: speeding up GIN build with parallel workers

On Fri, 15 Jan 2016 15:29:51 -0800
Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen@gmail.com>
wrote:
Even without parallelism, wouldn't it be better if GIN indexes were
built using tuplesort? I know way way less about the gin am than the
nbtree am, but I imagine that a prominent cost for GIN index builds is
constructing the main B-Tree (the one that's constructed over key
values) itself. Couldn't tuplesort.c be adapted to cover this case?
That would be much faster in general, particularly with the recent
addition of abbreviated keys, while also leaving a clear path forward
to performing the build in parallel.

While building GIN we need a quick way to update the posting list of
the same key; this is where the rbtree comes to the rescue. Using tuplesort
would require a completely different approach to building the index: dump
(key, itempointer) pairs into a tuplesort heap, then sort the heap and
merge the itempointers for the same key values.

Both the rbtree and sorting require N log N operations, and abbreviated keys
will not help here, because GIN is used for the case where there are
lots of repeated keys. The benefit of tuplesort is that it would be
better for huge data that does not fit into memory, but on the other
hand it would need twice as much free disk space for sorting as the
data itself takes. Are we ready for such a cost?

I think we have to experiment with both approaches, and see how it goes.

What are your thoughts?

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#4 Jeff Janes
jeff.janes@gmail.com
In reply to: Peter Geoghegan (#2)
Re: Proposal: speeding up GIN build with parallel workers

On Fri, Jan 15, 2016 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:

I have a draft implementation which divides the whole process between
N parallel workers, see the patch attached. Instead of a full scan of
the relation, I give each worker a range of blocks to read.

I am currently working on a patch that allows B-Tree index builds to
be performed in parallel. I think I'm a week or two away from posting
it.

Even without parallelism, wouldn't it be better if GIN indexes were
built using tuplesort? I know way way less about the gin am than the
nbtree am, but I imagine that a prominent cost for GIN index builds is
constructing the main B-Tree (the one that's constructed over key
values) itself. Couldn't tuplesort.c be adapted to cover this case?
That would be much faster in general, particularly with the recent
addition of abbreviated keys, while also leaving a clear path forward
to performing the build in parallel.

I think it would take a lot of changes to tuple sort to make this an
almost-always win.

In the general case each GIN key occurs in many tuples, and the
in-memory rbtree is good at compressing the tid list for each key to
maximize the amount of data that can be in memory (although perhaps it
could be even better, as it doesn't use varbyte encoding on the tid
list the way the posting lists on disk do; it could do so in the bulk
build case, where it receives the tids in order, but not feasibly in
the pending-list clean-up case, where it doesn't get them in order).
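
For reference, the on-disk posting lists store each TID as a delta from
the previous one in a variable-byte encoding, so a run of nearby TIDs
costs only a byte or two per pointer. A generic standalone sketch of that
kind of delta + varbyte encoding (not the exact GIN page format):

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Encode an ascending list of "TIDs" (here just 64-bit integers) as
 * deltas, 7 bits per byte, high bit set on every byte except the last.
 */
size_t
encode_deltas(const uint64_t *tids, int ntids, uint8_t *out)
{
    uint64_t prev = 0;
    size_t   len = 0;

    for (int i = 0; i < ntids; i++)
    {
        uint64_t delta = tids[i] - prev;

        prev = tids[i];
        while (delta >= 0x80)
        {
            out[len++] = (uint8_t) ((delta & 0x7F) | 0x80);
            delta >>= 7;
        }
        out[len++] = (uint8_t) delta;   /* last byte, high bit clear */
    }
    return len;
}

int
main(void)
{
    uint64_t tids[] = {1000, 1002, 1010, 5000};
    uint8_t  buf[64];

    printf("4 TIDs encoded into %zu bytes\n", encode_deltas(tids, 4, buf));
    return 0;
}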

When I was testing building an index on a column of text identifiers
where there were a couple million identifiers occurring a few dozen
times each, building it as a gin index (using btree_gin so that plain
text columns could be indexed) was quite a bit faster than building it
as a regular btree index using tuple sort. I didn't really
investigate this difference, but I assume it was due to the better
memory usage leading to less IO.

(Disclaimer: this was on those identifiers which had little difference
in the first 8 bytes, so they didn't really benefit from the
abbreviated keys.)

So in addition to shimming (key, tid) pairs to look like something
that can be fed to tuple sort, I think we would also need to make
tuple sort do on-the-fly aggregation when it finds adjacent equal sort
columns, so it can make (key, tid[]) structures rather than (key, tid).
Without that, I think using tuple sort would be a loss at least as
often as it would be a win.
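
The aggregation itself is simple once the pairs come out of the sort in
order: scan the stream and start a new posting list whenever the key
changes. A standalone sketch of that shape, with qsort standing in for
tuple sort and ints standing in for keys and TIDs:

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
    int key;
    int tid;
} Pair;

static int
cmp_pair(const void *a, const void *b)
{
    const Pair *pa = a;
    const Pair *pb = b;

    if (pa->key != pb->key)
        return (pa->key > pb->key) - (pa->key < pb->key);
    return (pa->tid > pb->tid) - (pa->tid < pb->tid);
}

int
main(void)
{
    Pair pairs[] = {{7, 4}, {2, 1}, {7, 2}, {2, 9}, {5, 3}};
    int  n = sizeof(pairs) / sizeof(pairs[0]);

    qsort(pairs, n, sizeof(Pair), cmp_pair);

    /* adjacent equal keys collapse into one (key, tid[]) entry */
    for (int i = 0; i < n;)
    {
        int j = i;

        printf("key %d:", pairs[i].key);
        while (j < n && pairs[j].key == pairs[i].key)
            printf(" %d", pairs[j++].tid);
        printf("\n");
        i = j;
    }
    return 0;
}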

But I do think this, with aggregation, would be worth it despite the
amount of work involved. In addition to automatically benefiting from
parallel sorts and any other future sort improvements, I think it
would also generate better indexes. I have a theory that a problem
with very large gin indexes is that repeatedly building
maintenance_work_mem worth of rbtree entries and then merging them to
disk yields highly fragmented btrees, in which logically adjacent
key-space does not occupy physically nearby leaf pages, which can then
cause problems either with access that follows a pattern (like range
scans, except gin indexes can't really do those anyway) or with further
bulk operations.

So I agree that a better long-term approach would probably be to
make gin index builds (at least the bulk ones; I don't know about the
pending-list cleanup) use a tuple sort. But it looks like
Constantin has already done most of the work on his current proposal,
and no one has volunteered to do the work of converting it to tuple
sort instead.

Cheers,

Jeff


#5 Peter Geoghegan
pg@heroku.com
In reply to: Jeff Janes (#4)
Re: Proposal: speeding up GIN build with parallel workers

On Sun, Jan 17, 2016 at 12:03 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

I think it would take a lot of changes to tuple sort to make this an
almost-always win.

In the general case each GIN key occurs in many tuples, and the
in-memory rbtree is good at compressing the tid list for each key to
maximize the amount of data that can be in memory (although perhaps it
could be even better, as it doesn't use varbyte encoding on the tid
list the way the posting lists on disk do; it could do so in the bulk
build case, where it receives the tids in order, but not feasibly in
the pending-list clean-up case, where it doesn't get them in order).

Interesting analysis.

When I was testing building an index on a column of text identifiers
where there were a couple million identifiers occurring a few dozen
times each, building it as a gin index (using btree_gin so that plain
text columns could be indexed) was quite a bit faster than building it
as a regular btree index using tuple sort. I didn't really
investigate this difference, but I assume it was due to the better
memory usage leading to less IO.

(Disclaimer: this was on those identifiers which had little difference
in the first 8 bytes, so they didn't really benefit from the
abbreviated keys.)

Sorry for going on about it, but I think you'd be surprised how
effective abbreviated keys are even in seemingly marginal cases.

So I agree that a better long-term approach would probably be to
make gin index builds (at least the bulk ones; I don't know about the
pending-list cleanup) use a tuple sort. But it looks like
Constantin has already done most of the work on his current proposal,
and no one has volunteered to do the work of converting it to tuple
sort instead.

I'm not going to stand in the way of incremental progress,
particularly given that this looks to be a simple patch that doesn't
commit us to anything. I suspect that we should consolidate GIN and
nbtree at some point, though. I think that there are some useful
general consequences for performance that come from consolidation. For
example, my ongoing work on external sorting makes it use much of the
same infrastructure as internal sorting. Now, external sorts
automatically benefit from optimizations to internal sorting, like the
"onlyKey" quicksort optimization.

--
Peter Geoghegan


#6 Robert Haas
robertmhaas@gmail.com
In reply to: Constantin S. Pan (#1)
Re: Proposal: speeding up GIN build with parallel workers

On Fri, Jan 15, 2016 at 5:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:

In current state the implementation is just a proof of concept
and it has all the configuration hardcoded, but it already works as is,
though it does not speed up the build process more than 4 times on my
configuration (12 CPUs). There is also a problem with temporary tables,
for which the parallel mode does not work.

I have yet to see a case where parallel query, with any current or
pending patch, gets more than about a 4x speedup. I can't think of
any reason that there should be a wall at 4x, and I'm not sure we're
hitting the wall there for the same reason in all cases. But your
observation corresponds to my experience.

I hope we eventually figure out how to make that much better, but I
wouldn't feel too bad about being at that spot now. 4x faster is
still a lot faster.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#7 Constantin S. Pan
kvapen@gmail.com
In reply to: Constantin S. Pan (#1)
3 attachment(s)
Re: [WIP] speeding up GIN build with parallel workers

On Sat, 16 Jan 2016 01:38:39 +0300
"Constantin S. Pan" <kvapen@gmail.com> wrote:

The task of building GIN can require lots of time and eats 100 % CPU,
but we could easily make it use more than a 100 %, especially since we
now have parallel workers in postgres.

The process of building GIN looks like this:

1. Accumulate a batch of index records into an rbtree in maintenance
work memory.

2. Dump the batch to disk.

3. Repeat.

I have a draft implementation which divides the whole process between
N parallel workers, see the patch attached. Instead of a full scan of
the relation, I give each worker a range of blocks to read.

This speeds up the first step N times, but slows down the second one,
because when multiple workers dump item pointers for the same key,
each of them has to read and decode the results of the previous one.
That is a huge waste, but there is an idea on how to eliminate it.

When it comes to dumping the next batch, a worker does not do it
independently. Instead, it (and every other worker) sends the
accumulated index records to the parent (backend) in ascending key
order. The backend, which receives the records from the workers
through shared memory, can merge them and dump each of them once,
without the need to reread the records N-1 times.

In current state the implementation is just a proof of concept
and it has all the configuration hardcoded, but it already works as
is, though it does not speed up the build process more than 4 times
on my configuration (12 CPUs). There is also a problem with temporary
tables, for which the parallel mode does not work.

Hey Hackers!

I have made some progress on the proposal (see the attached patch):

0. Moved some repeated code to functions (e.g. ginDumpAccumulator,
ginbuildCommon).

1. Implemented results merging on backend.

2. Disabled the use of parallel mode when creating an index on temporary
tables. There is no point in using parallel mode for temporary tables anyway,
right?

3. Added GUC parameter to control the number of workers for GIN
building.

4. Hit the 8x speedup limit. Did some analysis of the reasons (see the
attached plot or the data file).

In order to analyze the performance issues, I ran the following:

create table t (k int, v int[]);

create or replace
function randarray(width int, low int, high int)
returns int[] as
$$
select array(select (random()*(high-low) + low)::int
from generate_series(1,width))
$$ language sql;

insert into t select k, randarray(3000, 0, 100000)
from generate_series(1, 100000) as k;

create index t_v_idx on t using gin (v);

This creates 100000 arrays of 3000 random numbers each. The random
numbers are in the range [0, 100000]. Then I measure how long the gin
building steps take. There are two steps: scan and merge.

The results show that the 'scan' step is sped up perfectly. But the
'merge' step takes longer as you increase the number of workers. The
profiler shows that the bottleneck here is ginMergeItemPointers(), which
I use to merge the results.
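
For reference, ginMergeItemPointers() is essentially a linear merge of two
sorted TID lists, and the backend runs it once for every fragment it
receives from a worker. A simplified standalone sketch of the operation,
with ints standing in for item pointers:

#include <stdio.h>

/* Merge two sorted lists into dst; returns the merged length. */
int
merge_lists(const int *a, int na, const int *b, int nb, int *dst)
{
    int i = 0, j = 0, n = 0;

    while (i < na && j < nb)
        dst[n++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na)
        dst[n++] = a[i++];
    while (j < nb)
        dst[n++] = b[j++];
    return n;
}

int
main(void)
{
    int a[] = {1, 4, 9};
    int b[] = {2, 4, 7, 11};
    int dst[7];
    int n = merge_lists(a, 3, b, 4, dst);

    for (int i = 0; i < n; i++)
        printf("%d ", dst[i]);
    printf("\n");
    return 0;
}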

Also, I did encounter a problem with workers deadlocking during
heap_open, but that seems to have been resolved by Robert Haas in his
commit regarding group locking.

Please leave your feedback!

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

pgin-performance.pdf (application/pdf)
pgin-performance.dat (application/octet-stream)
pgin-2.patch (text/x-patch)
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cd21e0e..a8c5ec5 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -16,14 +16,20 @@
 
 #include "access/gin_private.h"
 #include "access/xloginsert.h"
+#include "access/parallel.h"
+#include "access/xact.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
 #include "storage/indexfsm.h"
+#include "storage/spin.h"
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC parameter */
+int gin_parallel_workers = 0;
 
 typedef struct
 {
@@ -265,6 +271,141 @@ ginHeapTupleBulkInsert(GinBuildState *buildstate, OffsetNumber attnum,
 	MemoryContextReset(buildstate->funcCtx);
 }
 
+#define KEY_TASK 42
+#define GIN_MAX_WORKERS 24
+#define GIN_BLOCKS_PER_WORKER 4
+#define GIN_RESULT_LEN 1024
+#define GIN_RESULT_KEYLEN 1024
+
+typedef struct {
+	bool ready;
+	bool finished;
+
+	Datum			key;
+	OffsetNumber	attnum;
+	GinNullCategory	category;
+
+	char keycoded[GIN_RESULT_KEYLEN];
+	int keylen;
+
+	ItemPointerData list[GIN_RESULT_LEN];
+	int			nlist;
+
+	Latch	blatch;
+	Latch	wlatch;
+} WorkerResult;
+
+/*
+ * This structure describes the GIN build task for the parallel workers. We use
+ * OIDs here because workers are separate processes and pointers may become
+ * meaningless for them. The "lock" is used to protect the "scanned" and
+ * "reltuples" fields as the workers modify them.
+ */
+typedef struct {
+	int		to_scan;
+	int		scanned;
+	slock_t	lock;
+	Oid		heap_oid;
+	Oid		index_oid;
+	double	reltuples;
+	WorkerResult results[GIN_MAX_WORKERS];
+} PGinBuildTask;
+
+static volatile PGinBuildTask *task;
+
+static void waitBool(volatile bool *actual, bool wanted, volatile Latch *l)
+{
+	if (*actual == wanted) return;
+
+	while (*actual != wanted)
+		WaitLatch(l, WL_LATCH_SET, 0);
+	ResetLatch(l);
+}
+
+static void setBool(volatile bool *actual, bool newvalue, volatile Latch *l)
+{
+	*actual = newvalue;
+	SetLatch(l);
+}
+
+static void ginDumpEntry(GinState *ginstate,
+						 volatile WorkerResult *r,
+						 OffsetNumber attnum,
+						 Datum key,
+						 GinNullCategory category,
+						 ItemPointerData *list,
+						 int nlist)
+{
+	volatile char *addr;
+	bool isnull;
+	Form_pg_attribute att;
+
+	Assert(nlist > 0);
+	waitBool(&r->ready, false, &r->wlatch);
+
+	Assert(r->keylen == 0);
+	addr = r->keycoded;
+	isnull = category == GIN_CAT_NULL_KEY;
+	att = ginstate->origTupdesc->attrs[attnum - 1];
+
+	r->attnum = attnum;
+	r->category = category;
+	r->keylen = datumEstimateSpace(key, isnull, att->attbyval, att->attlen);
+	Assert(r->keylen > 0);
+	Assert(r->keylen <= GIN_RESULT_KEYLEN);
+
+	datumSerialize(key, isnull, att->attbyval, att->attlen, (char**)&addr);
+
+	while (nlist > 0)
+	{
+		if (nlist > GIN_RESULT_LEN)
+			r->nlist = GIN_RESULT_LEN;
+		else
+			r->nlist = nlist;
+		nlist -= r->nlist;
+
+		memcpy((void*)r->list, list, r->nlist * sizeof(ItemPointerData));
+		setBool(&r->ready, true, &r->blatch);
+		waitBool(&r->ready, false, &r->wlatch);
+	}
+}
+
+static void ginDumpAccumulator(GinBuildState *buildstate)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	MemoryContext oldCtx;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+							  &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		if (IsParallelWorker())
+		{
+			volatile WorkerResult *r = &task->results[ParallelWorkerNumber];
+			ginDumpEntry(&buildstate->ginstate, r, attnum, key, category, list, nlist);
+		}
+		else
+			ginEntryInsert(&buildstate->ginstate,
+						   attnum, key, category,
+						   list, nlist,
+						   &buildstate->buildStats);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 static void
 ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 				 bool *isnull, bool tupleIsAlive, void *state)
@@ -283,52 +424,307 @@ ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 	/* If we've maxed out our available memory, dump everything to the index */
 	if (buildstate->accum.allocatedMemory >= (Size)maintenance_work_mem * 1024L)
 	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-								  &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-			ginEntryInsert(&buildstate->ginstate, attnum, key, category,
-						   list, nlist, &buildstate->buildStats);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
+		ginDumpAccumulator(buildstate);
 	}
 
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * Get the next key from the specified worker. Wait until it is available or
+ * the result is exhausted. Return true if got a key, false if the result is
+ * exhausted. Fill everything in, except "list".
+ */
+static bool getKeyFromWorker(volatile WorkerResult *result)
+{
+	if (result->finished) return false;
+
+	if (result->keylen)
+	{
+		bool isnull;
+		volatile char *addr = result->keycoded;
+		result->key = datumRestore((char**)&addr, &isnull);
+		if (isnull)
+			Assert(result->category == GIN_CAT_NULL_KEY);
+		else
+			Assert(result->category == GIN_CAT_NORM_KEY);
+		Assert(result->nlist > 0);
+		result->keylen = 0;
+	}
+	return true;
+}
+
+/*
+ * Claim "max_blocks" or less blocks. Return the actual number of claimed
+ * blocks and set "first" to point to the first block of the claimed range.
+ * 0 return value means the task has been finished.
+ */
+static int claimSomeBlocks(volatile PGinBuildTask *task, int max_blocks, int *first)
+{
+	int blocks = 0;
+
+	SpinLockAcquire(&task->lock);
+
+	if (task->scanned >= task->to_scan)
+	{
+		SpinLockRelease(&task->lock);
+		return 0;
+	}
+
+	*first = task->scanned;
+	blocks = max_blocks;
+	if (blocks > task->to_scan - task->scanned)
+		blocks = task->to_scan - task->scanned;
+	task->scanned += blocks;
+
+	SpinLockRelease(&task->lock);
+	return blocks;
+}
+
+static void reportReltuples(volatile PGinBuildTask *task, double reltuples)
+{
+	SpinLockAcquire(&task->lock);
+	task->reltuples += reltuples;
+	SpinLockRelease(&task->lock);
+}
+
+static double
+ginbuildCommon(GinBuildState *buildstate, Relation heap, Relation index, IndexInfo *indexInfo)
+{
+	double reltuples = 0;
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate->tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context",
+											   ALLOCSET_DEFAULT_MINSIZE,
+											   ALLOCSET_DEFAULT_INITSIZE,
+											   ALLOCSET_DEFAULT_MAXSIZE);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate->funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+												"Gin build temporary context for user-defined function",
+												ALLOCSET_DEFAULT_MINSIZE,
+												ALLOCSET_DEFAULT_INITSIZE,
+												ALLOCSET_DEFAULT_MAXSIZE);
+
+	buildstate->accum.ginstate = &buildstate->ginstate;
+	ginInitBA(&buildstate->accum);
+
+	/*
+	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
+	 * prefers to receive tuples in TID order.
+	 */
+	if (IsParallelWorker())
+	{
+		while (true)
+		{
+			double subtuples;
+			int first, blocks;
+
+			blocks = claimSomeBlocks(task, GIN_BLOCKS_PER_WORKER, &first);
+			if (blocks == 0)
+				break;
+
+			subtuples = IndexBuildHeapRangeScan(heap, index, indexInfo, false, false,
+												first, blocks,
+												ginBuildCallback, (void *)buildstate);
+			reltuples += subtuples;
+		}
+	}
+	else
+	{
+		reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+									   ginBuildCallback, (void *)buildstate);
+	}
+
+	/* dump remaining entries to the index */
+	ginDumpAccumulator(buildstate);
+
+	MemoryContextDelete(buildstate->funcCtx);
+	MemoryContextDelete(buildstate->tmpCtx);
+
+	/*
+	 * Update metapage stats
+	 */
+	buildstate->buildStats.nTotalPages = RelationGetNumberOfBlocks(index);
+	ginUpdateStats(index, &buildstate->buildStats);
+
+	return reltuples;
+}
+
+/*
+ * The idea behind parallel GIN build is letting the parallel workers scan the
+ * relation and build rbtree in parallel. Each block is scanned by one of the
+ * workers.
+ */
+static void ginbuildWorker(dsm_segment *seg, shm_toc *toc)
+{
+	GinBuildState buildstate;
+
+	Relation heap;
+	Relation index;
+	IndexInfo *indexInfo;
+
+	volatile WorkerResult *r;
+	double reltuples;
+
+	task = (PGinBuildTask*)shm_toc_lookup(toc, KEY_TASK);
+	r = &task->results[ParallelWorkerNumber];
+	r->finished = false;
+
+	OwnLatch(&r->wlatch);
+
+	heap = heap_open(task->heap_oid, NoLock);
+	index = index_open(task->index_oid, NoLock);
+	indexInfo = BuildIndexInfo(index);
+
+	initGinState(&buildstate.ginstate, index);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	reltuples = ginbuildCommon(&buildstate, heap, index, indexInfo);
+
+	index_close(index, NoLock);
+	heap_close(heap, NoLock);
+
+	reportReltuples(task, reltuples);
+
+	waitBool(&r->ready, false, &r->wlatch);
+	r->finished = true;
+	setBool(&r->ready, true, &r->blatch);
+}
+
+typedef struct GinEntryStack {
+	struct GinEntryStack *next;
+	Datum			key;
+	GinNullCategory	category;
+	OffsetNumber	attnum;
+	ItemPointerData	*list;
+	int				nlist;
+} GinEntryStack;
+
+static GinEntryStack *pushEntry(GinEntryStack *stack)
+{
+	GinEntryStack *head = palloc(sizeof(GinEntryStack));
+	head->next = stack;
+	head->list = palloc(sizeof(ItemPointerData)); /* make ginMergeItemPointers happy */
+	head->nlist = 0;
+	return head;
+}
+
+static GinEntryStack *popEntry(GinEntryStack *stack)
+{
+	GinEntryStack *head = stack;
+	Assert(stack != NULL);
+	stack = stack->next;
+	pfree(head->list);
+	pfree(head);
+	return stack;
+}
+
+static void mergeResults(GinBuildState *buildstate, ParallelContext *pcxt, volatile PGinBuildTask *task)
+{
+	GinEntryStack *entry = NULL;
+
+	while (true)
+	{
+		bool merged = false;
+		int i;
+
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			ItemPointerData *oldlist;
+			bool have_key;
+			int cmp;
+			volatile WorkerResult *r = &task->results[i];
+			if (pcxt->worker[i].error_mqh == NULL) continue;
+			waitBool(&r->ready, true, &r->blatch);
+			if (r->finished) continue;
+
+			/* The worker has something for us. */
+
+			have_key = getKeyFromWorker(r);
+			Assert(have_key);
+
+			cmp = -1;
+			if (entry != NULL)
+			{
+				cmp = ginCompareAttEntries(&buildstate->ginstate,
+										   r->attnum, r->key, r->category,
+										   entry->attnum, entry->key, entry->category);
+			}
+
+			if (cmp > 0)
+			{
+				/* The key is greater, skip the worker. */
+				continue;
+			}
+
+			if (cmp < 0)
+			{
+				/*
+				 * The key is less than what we have on the stack.
+				 * Push a new entry onto the stack.
+				 */
+				entry = pushEntry(entry);
+				entry->key      = r->key;
+				entry->category = r->category;
+				entry->attnum   = r->attnum;
+			}
+
+			/*
+			 * The key is less than or equal. Merge the item pointers.
+			 * FIXME: Should we first copy the list and let the worker continue
+			 * before merging?
+			 */
+			oldlist = entry->list;
+			entry->list = ginMergeItemPointers(entry->list, entry->nlist,
+											   (ItemPointerData*)r->list, r->nlist,
+											   &entry->nlist);
+			pfree(oldlist);
+			setBool(&r->ready, false, &r->wlatch);
+			merged = true;
+		}
+
+		if (!merged)
+		{
+			/* Nothing merged. Insert the entry into the index and pop the stack. */
+			if (entry == NULL)
+			{
+				/* Also nothing to dump - we have finished. */
+				break;
+			}
+
+			ginEntryInsert(&buildstate->ginstate,
+						   entry->attnum, entry->key, entry->category,
+						   entry->list, entry->nlist,
+						   &buildstate->buildStats);
+			entry = popEntry(entry);
+		}
+	}
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
+	GinBuildState buildstate;
+
 	IndexBuildResult *result;
-	double		reltuples;
-	GinBuildState buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
-	ItemPointerData *list;
-	Datum		key;
-	GinNullCategory category;
-	uint32		nlist;
-	MemoryContext oldCtx;
-	OffsetNumber attnum;
+	double reltuples = 0;
+	bool parallel_workers_helped = false;
 
 	if (RelationGetNumberOfBlocks(index) != 0)
 		elog(ERROR, "index \"%s\" already contains data",
 			 RelationGetRelationName(index));
 
-	initGinState(&buildstate.ginstate, index);
-	buildstate.indtuples = 0;
-	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
-
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -363,60 +759,76 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	UnlockReleaseBuffer(RootBuffer);
 	END_CRIT_SECTION();
 
+	initGinState(&buildstate.ginstate, index);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
 	/* count the root as first entry page */
 	buildstate.buildStats.nEntryPages++;
 
-	/*
-	 * create a temporary memory context that is used to hold data not yet
-	 * dumped out to the index
-	 */
-	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
-											  "Gin build temporary context",
-											  ALLOCSET_DEFAULT_MINSIZE,
-											  ALLOCSET_DEFAULT_INITSIZE,
-											  ALLOCSET_DEFAULT_MAXSIZE);
-
-	/*
-	 * create a temporary memory context that is used for calling
-	 * ginExtractEntries(), and can be reset after each tuple
-	 */
-	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
-					 "Gin build temporary context for user-defined function",
-											   ALLOCSET_DEFAULT_MINSIZE,
-											   ALLOCSET_DEFAULT_INITSIZE,
-											   ALLOCSET_DEFAULT_MAXSIZE);
-
-	buildstate.accum.ginstate = &buildstate.ginstate;
-	ginInitBA(&buildstate.accum);
-
-	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
-	 */
-	reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
-								   ginBuildCallback, (void *) &buildstate);
-
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+	if (gin_parallel_workers > GIN_MAX_WORKERS)
+		gin_parallel_workers = GIN_MAX_WORKERS;
+
+	if (gin_parallel_workers > 0)
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
-	}
-	MemoryContextSwitchTo(oldCtx);
+		if (RelationUsesLocalBuffers(heap))
+		{
+			fprintf(stderr, "not using parallel GIN build on temporary table %s\n", NameStr(heap->rd_rel->relname));
+		}
+		else
+		{
+			EnterParallelMode();
+			{
+				int i;
+				int size = sizeof(PGinBuildTask);
+				int keys = 1;
+				PGinBuildTask   *task;
+				ParallelContext *pcxt;
+
+				pcxt = CreateParallelContext(ginbuildWorker, gin_parallel_workers);
+
+				shm_toc_estimate_chunk(&pcxt->estimator, size);
+				shm_toc_estimate_keys(&pcxt->estimator, keys);
+				InitializeParallelDSM(pcxt);
+				task = (PGinBuildTask*)shm_toc_allocate(pcxt->toc, size);
+				shm_toc_insert(pcxt->toc, KEY_TASK, (void*)task);
+				SpinLockInit(&task->lock);
+				for (i = 0; i < pcxt->nworkers; i++)
+				{
+					volatile WorkerResult *r = &task->results[i];
+
+					InitSharedLatch(&r->blatch);
+					OwnLatch(&r->blatch);
+					InitSharedLatch(&r->wlatch);
 
-	MemoryContextDelete(buildstate.funcCtx);
-	MemoryContextDelete(buildstate.tmpCtx);
+					r->keylen = 0;
+					r->ready = false;
+				}
+				task->reltuples = 0;
+				task->to_scan = RelationGetNumberOfBlocks(heap);
+				task->heap_oid = heap->rd_id;
+				task->index_oid = index->rd_id;
+				LaunchParallelWorkers(pcxt);
+				if (pcxt->nworkers_launched > 0)
+				{
+					mergeResults(&buildstate, pcxt, task);
 
-	/*
-	 * Update metapage stats
-	 */
-	buildstate.buildStats.nTotalPages = RelationGetNumberOfBlocks(index);
-	ginUpdateStats(index, &buildstate.buildStats);
+					WaitForParallelWorkersToFinish(pcxt);
+					reltuples = task->reltuples;
+
+					parallel_workers_helped = true;
+				}
+				DestroyParallelContext(pcxt);
+			}
+			ExitParallelMode();
+		}
+	}
+
+	if (!parallel_workers_helped)
+	{
+		/* Do everything myself */
+		reltuples = ginbuildCommon(&buildstate, heap, index, indexInfo);
+	}
 
 	/*
 	 * Return statistics
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 31a69ca..b26e362 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2697,6 +2697,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"gin_parallel_workers",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Maximum number of parallel workers for GIN building."),
+			NULL,
+		},
+		&gin_parallel_workers,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 09b2003..70cb37a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -537,6 +537,7 @@
 #xmloption = 'content'
 #gin_fuzzy_search_limit = 0
 #gin_pending_list_limit = 4MB
+#gin_parallel_workers = 0		# 0 disables parallel gin build
 
 # - Locale and Formatting -
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..c487677 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -68,6 +68,7 @@ typedef char GinTernaryValue;
 /* GUC parameters */
 extern PGDLLIMPORT int GinFuzzySearchLimit;
 extern int	gin_pending_list_limit;
+extern int	gin_parallel_workers;
 
 /* ginutil.c */
 extern void ginGetStats(Relation index, GinStatsData *stats);
#8 Oleg Bartunov
obartunov@gmail.com
In reply to: Constantin S. Pan (#7)
Re: [WIP] speeding up GIN build with parallel workers

On Wed, Feb 17, 2016 at 6:55 PM, Constantin S. Pan <kvapen@gmail.com> wrote:

On Sat, 16 Jan 2016 01:38:39 +0300
"Constantin S. Pan" <kvapen@gmail.com> wrote:

The task of building GIN can require lots of time and eats 100 % CPU,
but we could easily make it use more than a 100 %, especially since we
now have parallel workers in postgres.

The process of building GIN looks like this:

1. Accumulate a batch of index records into an rbtree in maintenance
work memory.

2. Dump the batch to disk.

3. Repeat.

I have a draft implementation which divides the whole process between
N parallel workers, see the patch attached. Instead of a full scan of
the relation, I give each worker a range of blocks to read.

This speeds up the first step N times, but slows down the second one,
because when multiple workers dump item pointers for the same key,
each of them has to read and decode the results of the previous one.
That is a huge waste, but there is an idea on how to eliminate it.

When it comes to dumping the next batch, a worker does not do it
independently. Instead, it (and every other worker) sends the
accumulated index records to the parent (backend) in ascending key
order. The backend, which receives the records from the workers
through shared memory, can merge them and dump each of them once,
without the need to reread the records N-1 times.

In current state the implementation is just a proof of concept
and it has all the configuration hardcoded, but it already works as
is, though it does not speed up the build process more than 4 times
on my configuration (12 CPUs). There is also a problem with temporary
tables, for which the parallel mode does not work.

Hey Hackers!

I have made some progress on the proposal (see the attached patch):

0. Moved some repeated code to functions (e.g. ginDumpAccumulator,
ginbuildCommon).

1. Implemented results merging on backend.

2. Disabled the usage of parallel mode when creating index on temporary
tables. No point in using parallel mode for temporary tables anyway,
right?

3. Added GUC parameter to control the number of workers for GIN
building.

4. Hit the 8x speedup limit. Made some analysis of the reasons (see the
attached plot or the data file).

In order to analyze the performance issues, I have made the following:

create table t (k int, v int[]);

create or replace
function randarray(width int, low int, high int)
returns int[] as
$$
select array(select (random()*(high-low) + low)::int
from generate_series(1,width))
$$ language sql;

insert into t select k, randarray(3000, 0, 100000)
from generate_series(1, 100000) as k;

create index t_v_idx on t using gin (v);

This creates 100000 arrays of 3000 random numbers each. The random
numbers are in range [0, 100000]. Then I measure how long the gin
building steps take. There are two steps: scan and merge.

The results show that 'scan' step is sped up perfectly. But the
'merge' step takes longer as you increase the number of workers. The
profiler shows that the bottleneck here is ginMergeItemPointers(), which
I use to merge the results.

Also, I did encounter the problem with workers deadlocking during
heap_open, but that seems to have been resolved by Robert Haas in his
commit regarding group locking.

Please leave your feedback!

My feedback is (Mac OS X 10.11.3):

set gin_parallel_workers=2;
create index message_body_idx on messages using gin(body_tsvector);
LOG: worker process: parallel worker for PID 5689 (PID 6906) was
terminated by signal 11: Segmentation fault
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: FATAL: the
database system is in recovery mode
Failed.


Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#9 Peter Geoghegan
pg@heroku.com
In reply to: Constantin S. Pan (#7)
Re: [WIP] speeding up GIN build with parallel workers

On Wed, Feb 17, 2016 at 7:55 AM, Constantin S. Pan <kvapen@gmail.com> wrote:

4. Hit the 8x speedup limit. Made some analysis of the reasons (see the
attached plot or the data file).

Did you actually compare this to the master branch? I wouldn't like to
assume that the one worker case was equivalent. Obviously that's the
really interesting baseline.

--
Peter Geoghegan


#10 Constantin S. Pan
kvapen@gmail.com
In reply to: Peter Geoghegan (#9)
Re: [WIP] speeding up GIN build with parallel workers

I was testing with the number of workers starting at 0. The 0-worker case is
(in theory) equivalent to the master branch. But I should certainly compare to
the master branch, you are right. Will do that shortly.

On Wed, 17 Feb 2016 12:26:05 -0800
Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Feb 17, 2016 at 7:55 AM, Constantin S. Pan <kvapen@gmail.com>
wrote:

4. Hit the 8x speedup limit. Made some analysis of the reasons (see
the attached plot or the data file).

Did you actually compare this to the master branch? I wouldn't like to
assume that the one worker case was equivalent. Obviously that's the
really interesting baseline.


#11 Constantin S. Pan
kvapen@gmail.com
In reply to: Oleg Bartunov (#8)
1 attachment(s)
Re: [WIP] speeding up GIN build with parallel workers

On Wed, 17 Feb 2016 23:01:47 +0300
Oleg Bartunov <obartunov@gmail.com> wrote:

My feedback is (Mac OS X 10.11.3)

set gin_parallel_workers=2;
create index message_body_idx on messages using gin(body_tsvector);
LOG: worker process: parallel worker for PID 5689 (PID 6906) was
terminated by signal 11: Segmentation fault

Fixed this; try the new patch. The bug was in the incorrect handling
of some GIN categories.

Attachments:

pgin-3.patch (text/x-patch)
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cd21e0e..2f6f142 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -16,14 +16,20 @@
 
 #include "access/gin_private.h"
 #include "access/xloginsert.h"
+#include "access/parallel.h"
+#include "access/xact.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
 #include "storage/indexfsm.h"
+#include "storage/spin.h"
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC parameter */
+int gin_parallel_workers = 0;
 
 typedef struct
 {
@@ -265,6 +271,148 @@ ginHeapTupleBulkInsert(GinBuildState *buildstate, OffsetNumber attnum,
 	MemoryContextReset(buildstate->funcCtx);
 }
 
+#define KEY_TASK 42
+#define GIN_MAX_WORKERS 24
+#define GIN_BLOCKS_PER_WORKER 4
+#define GIN_RESULT_LEN 1024
+#define GIN_RESULT_KEYLEN 1024
+
+typedef struct {
+	bool ready;
+	bool finished;
+
+	Datum			key;
+	OffsetNumber	attnum;
+	GinNullCategory	category;
+
+	char keycoded[GIN_RESULT_KEYLEN];
+	int keylen;
+
+	ItemPointerData list[GIN_RESULT_LEN];
+	int			nlist;
+
+	Latch	blatch;
+	Latch	wlatch;
+} WorkerResult;
+
+/*
+ * This structure describes the GIN build task for the parallel workers. We use
+ * OIDs here because workers are separate processes and pointers may become
+ * meaningless for them. The "lock" is used to protect the "scanned" and
+ * "reltuples" fields as the workers modify them.
+ */
+typedef struct {
+	int		to_scan;
+	int		scanned;
+	slock_t	lock;
+	Oid		heap_oid;
+	Oid		index_oid;
+	double	reltuples;
+	WorkerResult results[GIN_MAX_WORKERS];
+} PGinBuildTask;
+
+static volatile PGinBuildTask *task;
+
+static void waitBool(volatile bool *actual, bool wanted, volatile Latch *l)
+{
+	if (*actual == wanted) return;
+
+	while (*actual != wanted)
+		WaitLatch(l, WL_LATCH_SET, 0);
+	ResetLatch(l);
+}
+
+static void setBool(volatile bool *actual, bool newvalue, volatile Latch *l)
+{
+	*actual = newvalue;
+	SetLatch(l);
+}
+
+static void ginDumpEntry(GinState *ginstate,
+						 volatile WorkerResult *r,
+						 OffsetNumber attnum,
+						 Datum key,
+						 GinNullCategory category,
+						 ItemPointerData *list,
+						 int nlist)
+{
+	volatile char *addr;
+	bool isnull;
+	Form_pg_attribute att;
+
+	Assert(nlist > 0);
+	waitBool(&r->ready, false, &r->wlatch);
+
+	Assert(r->keylen == 0);
+	addr = r->keycoded;
+	isnull = category == GIN_CAT_NULL_KEY;
+	att = ginstate->origTupdesc->attrs[attnum - 1];
+
+	r->attnum = attnum;
+	r->category = category;
+	if (r->category == GIN_CAT_EMPTY_ITEM || r->category == GIN_CAT_NULL_ITEM)
+	{
+		r->keylen = 1;
+	}
+	else
+	{
+		r->keylen = datumEstimateSpace(key, isnull, att->attbyval, att->attlen);
+		Assert(r->keylen > 0);
+		Assert(r->keylen <= GIN_RESULT_KEYLEN);
+
+		datumSerialize(key, isnull, att->attbyval, att->attlen, (char**)&addr);
+	}
+
+	while (nlist > 0)
+	{
+		if (nlist > GIN_RESULT_LEN)
+			r->nlist = GIN_RESULT_LEN;
+		else
+			r->nlist = nlist;
+		nlist -= r->nlist;
+
+		memcpy((void*)r->list, list, r->nlist * sizeof(ItemPointerData));
+		setBool(&r->ready, true, &r->blatch);
+		waitBool(&r->ready, false, &r->wlatch);
+	}
+}
+
+static void ginDumpAccumulator(GinBuildState *buildstate)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	MemoryContext oldCtx;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+							  &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		if (IsParallelWorker())
+		{
+			volatile WorkerResult *r = &task->results[ParallelWorkerNumber];
+			ginDumpEntry(&buildstate->ginstate, r, attnum, key, category, list, nlist);
+		}
+		else
+			ginEntryInsert(&buildstate->ginstate,
+						   attnum, key, category,
+						   list, nlist,
+						   &buildstate->buildStats);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 static void
 ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 				 bool *isnull, bool tupleIsAlive, void *state)
@@ -283,52 +431,315 @@ ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 	/* If we've maxed out our available memory, dump everything to the index */
 	if (buildstate->accum.allocatedMemory >= (Size)maintenance_work_mem * 1024L)
 	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-								  &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-			ginEntryInsert(&buildstate->ginstate, attnum, key, category,
-						   list, nlist, &buildstate->buildStats);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
+		ginDumpAccumulator(buildstate);
 	}
 
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * Get the next key from the specified worker. Wait until it is available or
+ * the result is exhausted. Return true if got a key, false if the result is
+ * exhausted. Fill everything in, except "list".
+ */
+static bool getKeyFromWorker(volatile WorkerResult *result)
+{
+	if (result->finished) return false;
+
+	if (result->keylen)
+	{
+		if (result->category == GIN_CAT_EMPTY_ITEM || result->category == GIN_CAT_NULL_ITEM)
+		{
+			result->key = 0;
+		}
+		else
+		{
+			bool isnull;
+			volatile char *addr = result->keycoded;
+			result->key = datumRestore((char**)&addr, &isnull);
+			if (isnull)
+				Assert(result->category == GIN_CAT_NULL_KEY);
+			else
+				Assert(result->category == GIN_CAT_NORM_KEY);
+		}
+		result->keylen = 0;
+	}
+
+	Assert(result->nlist > 0);
+	return true;
+}
+
+/*
+ * Claim "max_blocks" or less blocks. Return the actual number of claimed
+ * blocks and set "first" to point to the first block of the claimed range.
+ * 0 return value means the task has been finished.
+ */
+static int claimSomeBlocks(volatile PGinBuildTask *task, int max_blocks, int *first)
+{
+	int blocks = 0;
+
+	SpinLockAcquire(&task->lock);
+
+	if (task->scanned >= task->to_scan)
+	{
+		SpinLockRelease(&task->lock);
+		return 0;
+	}
+
+	*first = task->scanned;
+	blocks = max_blocks;
+	if (blocks > task->to_scan - task->scanned)
+		blocks = task->to_scan - task->scanned;
+	task->scanned += blocks;
+
+	SpinLockRelease(&task->lock);
+	return blocks;
+}
+
+static void reportReltuples(volatile PGinBuildTask *task, double reltuples)
+{
+	SpinLockAcquire(&task->lock);
+	task->reltuples += reltuples;
+	SpinLockRelease(&task->lock);
+}
+
+static double
+ginbuildCommon(GinBuildState *buildstate, Relation heap, Relation index, IndexInfo *indexInfo)
+{
+	double reltuples = 0;
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate->tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context",
+											   ALLOCSET_DEFAULT_MINSIZE,
+											   ALLOCSET_DEFAULT_INITSIZE,
+											   ALLOCSET_DEFAULT_MAXSIZE);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate->funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+												"Gin build temporary context for user-defined function",
+												ALLOCSET_DEFAULT_MINSIZE,
+												ALLOCSET_DEFAULT_INITSIZE,
+												ALLOCSET_DEFAULT_MAXSIZE);
+
+	buildstate->accum.ginstate = &buildstate->ginstate;
+	ginInitBA(&buildstate->accum);
+
+	/*
+	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
+	 * prefers to receive tuples in TID order.
+	 */
+	if (IsParallelWorker())
+	{
+		while (true)
+		{
+			double subtuples;
+			int first, blocks;
+
+			blocks = claimSomeBlocks(task, GIN_BLOCKS_PER_WORKER, &first);
+			if (blocks == 0)
+				break;
+
+			subtuples = IndexBuildHeapRangeScan(heap, index, indexInfo, false, false,
+												first, blocks,
+												ginBuildCallback, (void *)buildstate);
+			reltuples += subtuples;
+		}
+	}
+	else
+	{
+		reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+									   ginBuildCallback, (void *)buildstate);
+	}
+
+	/* dump remaining entries to the index */
+	ginDumpAccumulator(buildstate);
+
+	MemoryContextDelete(buildstate->funcCtx);
+	MemoryContextDelete(buildstate->tmpCtx);
+
+	/*
+	 * Update metapage stats
+	 */
+	buildstate->buildStats.nTotalPages = RelationGetNumberOfBlocks(index);
+	ginUpdateStats(index, &buildstate->buildStats);
+
+	return reltuples;
+}
+
+/*
+ * The idea behind parallel GIN build is letting the parallel workers scan the
+ * relation and build rbtree in parallel. Each block is scanned by one of the
+ * workers.
+ */
+static void ginbuildWorker(dsm_segment *seg, shm_toc *toc)
+{
+	GinBuildState buildstate;
+
+	Relation heap;
+	Relation index;
+	IndexInfo *indexInfo;
+
+	volatile WorkerResult *r;
+	double reltuples;
+
+	task = (PGinBuildTask*)shm_toc_lookup(toc, KEY_TASK);
+	r = &task->results[ParallelWorkerNumber];
+	r->finished = false;
+
+	OwnLatch(&r->wlatch);
+
+	heap = heap_open(task->heap_oid, NoLock);
+	index = index_open(task->index_oid, NoLock);
+	indexInfo = BuildIndexInfo(index);
+
+	initGinState(&buildstate.ginstate, index);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	reltuples = ginbuildCommon(&buildstate, heap, index, indexInfo);
+
+	index_close(index, NoLock);
+	heap_close(heap, NoLock);
+
+	reportReltuples(task, reltuples);
+
+	waitBool(&r->ready, false, &r->wlatch);
+	r->finished = true;
+	setBool(&r->ready, true, &r->blatch);
+}
+
+typedef struct GinEntryStack {
+	struct GinEntryStack *next;
+	Datum			key;
+	GinNullCategory	category;
+	OffsetNumber	attnum;
+	ItemPointerData	*list;
+	int				nlist;
+} GinEntryStack;
+
+static GinEntryStack *pushEntry(GinEntryStack *stack)
+{
+	GinEntryStack *head = palloc(sizeof(GinEntryStack));
+	head->next = stack;
+	head->list = palloc(sizeof(ItemPointerData)); /* make ginMergeItemPointers happy */
+	head->nlist = 0;
+	return head;
+}
+
+static GinEntryStack *popEntry(GinEntryStack *stack)
+{
+	GinEntryStack *head = stack;
+	Assert(stack != NULL);
+	stack = stack->next;
+	pfree(head->list);
+	pfree(head);
+	return stack;
+}
+
+static void mergeResults(GinBuildState *buildstate, ParallelContext *pcxt, volatile PGinBuildTask *task)
+{
+	GinEntryStack *entry = NULL;
+
+	while (true)
+	{
+		bool merged = false;
+		int i;
+
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			ItemPointerData *oldlist;
+			bool have_key;
+			int cmp;
+			volatile WorkerResult *r = &task->results[i];
+			if (pcxt->worker[i].error_mqh == NULL) continue;
+			waitBool(&r->ready, true, &r->blatch);
+			if (r->finished) continue;
+
+			/* The worker has something for us. */
+
+			have_key = getKeyFromWorker(r);
+			Assert(have_key);
+
+			cmp = -1;
+			if (entry != NULL)
+			{
+				cmp = ginCompareAttEntries(&buildstate->ginstate,
+										   r->attnum, r->key, r->category,
+										   entry->attnum, entry->key, entry->category);
+			}
+
+			if (cmp > 0)
+			{
+				/* The key is greater, skip the worker. */
+				continue;
+			}
+
+			if (cmp < 0)
+			{
+				/*
+				 * The key is less than what we have on the stack.
+				 * Push a new entry onto the stack.
+				 */
+				entry = pushEntry(entry);
+				entry->key      = r->key;
+				entry->category = r->category;
+				entry->attnum   = r->attnum;
+			}
+
+			/*
+			 * The key is less than or equal. Merge the item pointers.
+			 * FIXME: Should we first copy the list and let the worker continue
+			 * before merging?
+			 */
+			oldlist = entry->list;
+			entry->list = ginMergeItemPointers(entry->list, entry->nlist,
+											   (ItemPointerData*)r->list, r->nlist,
+											   &entry->nlist);
+			pfree(oldlist);
+			setBool(&r->ready, false, &r->wlatch);
+			merged = true;
+		}
+
+		if (!merged)
+		{
+			/* Nothing merged. Insert the entry into the index and pop the stack. */
+			if (entry == NULL)
+			{
+				/* Also nothing to dump - we have finished. */
+				break;
+			}
+
+			ginEntryInsert(&buildstate->ginstate,
+						   entry->attnum, entry->key, entry->category,
+						   entry->list, entry->nlist,
+						   &buildstate->buildStats);
+			entry = popEntry(entry);
+		}
+	}
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
+	GinBuildState buildstate;
+
 	IndexBuildResult *result;
-	double		reltuples;
-	GinBuildState buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
-	ItemPointerData *list;
-	Datum		key;
-	GinNullCategory category;
-	uint32		nlist;
-	MemoryContext oldCtx;
-	OffsetNumber attnum;
+	double reltuples = 0;
+	bool parallel_workers_helped = false;
 
 	if (RelationGetNumberOfBlocks(index) != 0)
 		elog(ERROR, "index \"%s\" already contains data",
 			 RelationGetRelationName(index));
 
-	initGinState(&buildstate.ginstate, index);
-	buildstate.indtuples = 0;
-	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
-
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -363,60 +774,76 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	UnlockReleaseBuffer(RootBuffer);
 	END_CRIT_SECTION();
 
+	initGinState(&buildstate.ginstate, index);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
 	/* count the root as first entry page */
 	buildstate.buildStats.nEntryPages++;
 
-	/*
-	 * create a temporary memory context that is used to hold data not yet
-	 * dumped out to the index
-	 */
-	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
-											  "Gin build temporary context",
-											  ALLOCSET_DEFAULT_MINSIZE,
-											  ALLOCSET_DEFAULT_INITSIZE,
-											  ALLOCSET_DEFAULT_MAXSIZE);
-
-	/*
-	 * create a temporary memory context that is used for calling
-	 * ginExtractEntries(), and can be reset after each tuple
-	 */
-	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
-					 "Gin build temporary context for user-defined function",
-											   ALLOCSET_DEFAULT_MINSIZE,
-											   ALLOCSET_DEFAULT_INITSIZE,
-											   ALLOCSET_DEFAULT_MAXSIZE);
-
-	buildstate.accum.ginstate = &buildstate.ginstate;
-	ginInitBA(&buildstate.accum);
-
-	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
-	 */
-	reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
-								   ginBuildCallback, (void *) &buildstate);
-
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+	if (gin_parallel_workers > GIN_MAX_WORKERS)
+		gin_parallel_workers = GIN_MAX_WORKERS;
+
+	if (gin_parallel_workers > 0)
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
-	}
-	MemoryContextSwitchTo(oldCtx);
+		if (RelationUsesLocalBuffers(heap))
+		{
+			fprintf(stderr, "not using parallel GIN build on temporary table %s\n", NameStr(heap->rd_rel->relname));
+		}
+		else
+		{
+			EnterParallelMode();
+			{
+				int i;
+				int size = sizeof(PGinBuildTask);
+				int keys = 1;
+				PGinBuildTask   *task;
+				ParallelContext *pcxt;
+
+				pcxt = CreateParallelContext(ginbuildWorker, gin_parallel_workers);
+
+				shm_toc_estimate_chunk(&pcxt->estimator, size);
+				shm_toc_estimate_keys(&pcxt->estimator, keys);
+				InitializeParallelDSM(pcxt);
+				task = (PGinBuildTask*)shm_toc_allocate(pcxt->toc, size);
+				shm_toc_insert(pcxt->toc, KEY_TASK, (void*)task);
+				SpinLockInit(&task->lock);
+				for (i = 0; i < pcxt->nworkers; i++)
+				{
+					volatile WorkerResult *r = &task->results[i];
+
+					InitSharedLatch(&r->blatch);
+					OwnLatch(&r->blatch);
+					InitSharedLatch(&r->wlatch);
 
-	MemoryContextDelete(buildstate.funcCtx);
-	MemoryContextDelete(buildstate.tmpCtx);
+					r->keylen = 0;
+					r->ready = false;
+				}
+				task->reltuples = 0;
+				task->to_scan = RelationGetNumberOfBlocks(heap);
+				task->heap_oid = heap->rd_id;
+				task->index_oid = index->rd_id;
+				LaunchParallelWorkers(pcxt);
+				if (pcxt->nworkers_launched > 0)
+				{
+					mergeResults(&buildstate, pcxt, task);
 
-	/*
-	 * Update metapage stats
-	 */
-	buildstate.buildStats.nTotalPages = RelationGetNumberOfBlocks(index);
-	ginUpdateStats(index, &buildstate.buildStats);
+					WaitForParallelWorkersToFinish(pcxt);
+					reltuples = task->reltuples;
+
+					parallel_workers_helped = true;
+				}
+				DestroyParallelContext(pcxt);
+			}
+			ExitParallelMode();
+		}
+	}
+
+	if (!parallel_workers_helped)
+	{
+		/* Do everything myself */
+		reltuples = ginbuildCommon(&buildstate, heap, index, indexInfo);
+	}
 
 	/*
 	 * Return statistics
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 31a69ca..b26e362 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2697,6 +2697,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"gin_parallel_workers",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Maximum number of parallel workers for GIN building."),
+			NULL,
+		},
+		&gin_parallel_workers,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 09b2003..70cb37a 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -537,6 +537,7 @@
 #xmloption = 'content'
 #gin_fuzzy_search_limit = 0
 #gin_pending_list_limit = 4MB
+#gin_parallel_workers = 0		# 0 disables parallel gin build
 
 # - Locale and Formatting -
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..c487677 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -68,6 +68,7 @@ typedef char GinTernaryValue;
 /* GUC parameters */
 extern PGDLLIMPORT int GinFuzzySearchLimit;
 extern int	gin_pending_list_limit;
+extern int	gin_parallel_workers;
 
 /* ginutil.c */
 extern void ginGetStats(Relation index, GinStatsData *stats);
#12David Steele
david@pgmasters.net
In reply to: Constantin S. Pan (#11)
Re: [WIP] speeding up GIN build with parallel workers

On 2/18/16 10:10 AM, Constantin S. Pan wrote:

On Wed, 17 Feb 2016 23:01:47 +0300
Oleg Bartunov <obartunov@gmail.com> wrote:

My feedback is (Mac OS X 10.11.3)

set gin_parallel_workers=2;
create index message_body_idx on messages using gin(body_tsvector);
LOG: worker process: parallel worker for PID 5689 (PID 6906) was
terminated by signal 11: Segmentation fault

Fixed this, try the new patch. The bug was in incorrect handling
of some GIN categories.

Oleg, it looks like Constantin has updated the patch to address the issue
you were seeing. Do you have time to retest and review?

Thanks,
--
-David
david@pgmasters.net


#13Constantin S. Pan
kvapen@gmail.com
In reply to: David Steele (#12)
1 attachment(s)
Re: [WIP] speeding up GIN build with parallel workers

On Mon, 14 Mar 2016 08:42:26 -0400
David Steele <david@pgmasters.net> wrote:

On 2/18/16 10:10 AM, Constantin S. Pan wrote:

On Wed, 17 Feb 2016 23:01:47 +0300
Oleg Bartunov <obartunov@gmail.com> wrote:

My feedback is (Mac OS X 10.11.3)

set gin_parallel_workers=2;
create index message_body_idx on messages using gin(body_tsvector);
LOG: worker process: parallel worker for PID 5689 (PID 6906) was
terminated by signal 11: Segmentation fault

Fixed this, try the new patch. The bug was in incorrect handling
of some GIN categories.

Oleg, it looks like Constantin has updated to patch to address the
issue you were seeing. Do you have time to retest and review?

Thanks,

Actually, there has been some progress since then. The patch is
attached.

1. Added another GUC parameter for changing the amount of
shared memory for parallel GIN workers.

2. Changed the way results are merged. It uses shared memory
message queue now.

3. Tested on some real data (GIN index on email message body
tsvectors). Here are the timings for different values of
'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU
machine. Seems 'gin_shared_mem' has no visible effect.

wnum  mem(MB)  time(s)
   0       16      247
   1       16      256
   2       16      126
   4       16       89
   0       32      247
   1       32      270
   2       32      123
   4       32       92
   0       64      254
   1       64      272
   2       64      123
   4       64       88
   0      128      250
   1      128      263
   2      128      126
   4      128       85
   0      256      247
   1      256      269
   2      256      130
   4      256       88
   0      512      257
   1      512      275
   2      512      129
   4      512       92
   0     1024      255
   1     1024      273
   2     1024      130
   4     1024       90

On Wed, 17 Feb 2016 12:26:05 -0800
Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Feb 17, 2016 at 7:55 AM, Constantin S. Pan <kvapen@gmail.com>
wrote:

4. Hit the 8x speedup limit. Made some analysis of the reasons (see
the attached plot or the data file).

Did you actually compare this to the master branch? I wouldn't like to
assume that the one worker case was equivalent. Obviously that's the
really interesting baseline.

Compared with the master branch. The case of 0 workers is
indeed equivalent to the master branch.

Regards,
Constantin

Attachments:

pgin-5.patchtext/x-patchDownload
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cd21e0e..ff267b3 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -16,14 +16,21 @@
 
 #include "access/gin_private.h"
 #include "access/xloginsert.h"
+#include "access/parallel.h"
+#include "access/xact.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
 #include "storage/indexfsm.h"
+#include "storage/spin.h"
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC parameter */
+int gin_parallel_workers = 0;
+int gin_shared_mem = 0;
 
 typedef struct
 {
@@ -35,7 +42,6 @@ typedef struct
 	BuildAccumulator accum;
 } GinBuildState;
 
-
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -265,6 +271,169 @@ ginHeapTupleBulkInsert(GinBuildState *buildstate, OffsetNumber attnum,
 	MemoryContextReset(buildstate->funcCtx);
 }
 
+#define KEY_TASK 42
+#define KEY_SHM_ORIGIN 43
+#define KEY_SHM_PER_WORKER 44
+#define GIN_MAX_WORKERS 24
+#define GIN_BLOCKS_PER_WORKER 4
+
+/*
+ * The shmem message structure:
+ * Entry, Key, List
+ */
+
+typedef struct {
+	GinNullCategory category;
+	OffsetNumber attnum;
+	int nlist;
+} GinShmemEntry;
+
+typedef struct {
+	void *mq;
+	shm_mq_handle *mqh;
+	bool end_of_tree;
+	bool end_of_forest;
+
+	void *msg_body;
+	Size msg_len;
+	Datum key;
+	int skipkey;
+} WorkerResult;
+
+/*
+ * This structure describes the GIN build task for the parallel workers. We use
+ * OIDs here because workers are separate processes and pointers may become
+ * meaningless for them. The "lock" is used to protect the "scanned" and
+ * "reltuples" fields as the workers modify them.
+ */
+typedef struct {
+	int		to_scan;
+	int		scanned;
+	slock_t	lock;
+	Oid		heap_oid;
+	Oid		index_oid;
+	double	reltuples;
+	WorkerResult results[GIN_MAX_WORKERS];
+} PGinBuildTask;
+
+static volatile PGinBuildTask *task;
+
+static shm_mq *mq;
+static shm_mq_handle *mqh;
+
+static void ginDumpEntry(GinState *ginstate,
+						 volatile WorkerResult *r,
+						 OffsetNumber attnum,
+						 Datum key,
+						 GinNullCategory category,
+						 ItemPointerData *list,
+						 int nlist)
+{
+	int keylen, listlen;
+
+	bool isnull;
+	Form_pg_attribute att;
+	GinShmemEntry e;
+
+	// The message consists of 2 or 3 parts. iovec allows us to send them as
+	// one message though the parts are located at unrelated addresses.
+	shm_mq_iovec iov[3];
+	int iovlen = 0;
+
+	char *buf = NULL;
+
+	e.category = category;
+	e.attnum = attnum;
+	e.nlist = nlist;
+
+	Assert(nlist > 0);
+
+	isnull = category == GIN_CAT_NULL_KEY;
+	att = ginstate->origTupdesc->attrs[attnum - 1];
+
+	if (e.category == GIN_CAT_NORM_KEY)
+	{
+		keylen = datumEstimateSpace(key, isnull, att->attbyval, att->attlen);
+		Assert(keylen > 0);
+		listlen = e.nlist * sizeof(ItemPointerData);
+	}
+	else
+		keylen = 0;
+
+	listlen = e.nlist * sizeof(ItemPointerData);
+
+	iov[iovlen].data = (char *)&e;
+	iov[iovlen++].len = sizeof(e);
+
+	if (keylen > 0)
+	{
+		char *cursor;
+		buf = palloc(keylen);
+		cursor = buf;
+		datumSerialize(key, isnull, att->attbyval, att->attlen, &cursor);
+		iov[iovlen].data = buf;
+		iov[iovlen++].len = keylen;
+	}
+
+	iov[iovlen].data = (char *)list;
+	iov[iovlen++].len = listlen;
+
+	shm_mq_sendv(mqh, iov, iovlen, false);
+
+	if (buf)
+		pfree(buf);
+}
+
+static void ginDumpAccumulator(GinBuildState *buildstate)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	MemoryContext oldCtx;
+	volatile WorkerResult *r = NULL;
+
+	if (IsParallelWorker())
+	{
+		r = &task->results[ParallelWorkerNumber];
+	}
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+							  &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		if (r)
+			ginDumpEntry(&buildstate->ginstate,
+						 r, attnum, key,
+						 category, list, nlist);
+		else
+			ginEntryInsert(&buildstate->ginstate,
+						   attnum, key, category,
+						   list, nlist,
+						   &buildstate->buildStats);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+
+	if (IsParallelWorker())
+	{
+		// send empty message as an "end-of-tree" marker
+		shm_mq_result r = shm_mq_send(mqh, 0, NULL, false);
+		if (r != SHM_MQ_SUCCESS)
+		{
+			elog(ERROR, "failed to send the results from worker to backend");
+		}
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 static void
 ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 				 bool *isnull, bool tupleIsAlive, void *state)
@@ -283,52 +452,350 @@ ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 	/* If we've maxed out our available memory, dump everything to the index */
 	if (buildstate->accum.allocatedMemory >= (Size)maintenance_work_mem * 1024L)
 	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-								  &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-			ginEntryInsert(&buildstate->ginstate, attnum, key, category,
-						   list, nlist, &buildstate->buildStats);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
+		ginDumpAccumulator(buildstate);
 	}
 
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * Claim "max_blocks" or fewer blocks. Return the actual number of claimed
+ * blocks and set "first" to point to the first block of the claimed range.
+ * A return value of 0 means the task is finished.
+ */
+static int claimSomeBlocks(volatile PGinBuildTask *task, int max_blocks, int *first)
+{
+	int blocks = 0;
+
+	SpinLockAcquire(&task->lock);
+
+	if (task->scanned >= task->to_scan)
+	{
+		SpinLockRelease(&task->lock);
+		return 0;
+	}
+
+	*first = task->scanned;
+	blocks = max_blocks;
+	if (blocks > task->to_scan - task->scanned)
+		blocks = task->to_scan - task->scanned;
+	task->scanned += blocks;
+
+	SpinLockRelease(&task->lock);
+	return blocks;
+}
+
+static void reportReltuples(volatile PGinBuildTask *task, double reltuples)
+{
+	SpinLockAcquire(&task->lock);
+	task->reltuples += reltuples;
+	SpinLockRelease(&task->lock);
+}
+
+static double
+ginbuildCommon(GinBuildState *buildstate, Relation heap, Relation index, IndexInfo *indexInfo)
+{
+	double reltuples = 0;
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate->tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context",
+											   ALLOCSET_DEFAULT_MINSIZE,
+											   ALLOCSET_DEFAULT_INITSIZE,
+											   ALLOCSET_DEFAULT_MAXSIZE);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate->funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+												"Gin build temporary context for user-defined function",
+												ALLOCSET_DEFAULT_MINSIZE,
+												ALLOCSET_DEFAULT_INITSIZE,
+												ALLOCSET_DEFAULT_MAXSIZE);
+
+	buildstate->accum.ginstate = &buildstate->ginstate;
+	ginInitBA(&buildstate->accum);
+
+	/*
+	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
+	 * prefers to receive tuples in TID order.
+	 */
+	if (IsParallelWorker())
+	{
+		while (true)
+		{
+			double subtuples;
+			int first, blocks;
+
+			blocks = claimSomeBlocks(task, GIN_BLOCKS_PER_WORKER, &first);
+			if (blocks == 0)
+				break;
+
+			subtuples = IndexBuildHeapRangeScan(heap, index, indexInfo, false, false,
+												first, blocks,
+												ginBuildCallback, (void *)buildstate);
+			reltuples += subtuples;
+		}
+	}
+	else
+	{
+		reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
+									   ginBuildCallback, (void *)buildstate);
+	}
+
+	/* dump remaining entries to the index */
+	ginDumpAccumulator(buildstate);
+
+	MemoryContextDelete(buildstate->funcCtx);
+	MemoryContextDelete(buildstate->tmpCtx);
+
+	/*
+	 * Update metapage stats
+	 */
+	buildstate->buildStats.nTotalPages = RelationGetNumberOfBlocks(index);
+	ginUpdateStats(index, &buildstate->buildStats);
+
+	return reltuples;
+}
+
+/*
+ * The idea behind parallel GIN build is letting the parallel workers scan the
+ * relation and build rbtree in parallel. Each block is scanned by one of the
+ * workers.
+ */
+static void ginbuildWorker(dsm_segment *seg, shm_toc *toc)
+{
+	GinBuildState buildstate;
+
+	Relation heap;
+	Relation index;
+	IndexInfo *indexInfo;
+
+	double reltuples;
+
+	char *shm_origin;
+	int mqsize;
+
+	task = (PGinBuildTask*)shm_toc_lookup(toc, KEY_TASK);
+
+	shm_origin = (char *)shm_toc_lookup(toc, KEY_SHM_ORIGIN);
+	mqsize = *(int*)shm_toc_lookup(toc, KEY_SHM_PER_WORKER);
+	mq = (shm_mq *)(shm_origin + ParallelWorkerNumber * mqsize);
+	shm_mq_set_sender(mq, MyProc);
+	mqh = shm_mq_attach(mq, seg, NULL);
+	shm_mq_wait_for_attach(mqh);
+
+	heap = heap_open(task->heap_oid, NoLock);
+	index = index_open(task->index_oid, NoLock);
+	indexInfo = BuildIndexInfo(index);
+
+	initGinState(&buildstate.ginstate, index);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	reltuples = ginbuildCommon(&buildstate, heap, index, indexInfo);
+
+	index_close(index, NoLock);
+	heap_close(heap, NoLock);
+
+	reportReltuples(task, reltuples);
+
+	shm_mq_detach(mq);
+}
+
+typedef struct GinEntryStack {
+	struct GinEntryStack *next;
+	Datum			key;
+	GinNullCategory	category;
+	OffsetNumber	attnum;
+	ItemPointerData	*list;
+	int				nlist;
+} GinEntryStack;
+
+static GinEntryStack *pushEntry(GinEntryStack *stack)
+{
+	GinEntryStack *head = palloc(sizeof(GinEntryStack));
+	head->next = stack;
+	head->list = palloc(sizeof(ItemPointerData)); /* make ginMergeItemPointers happy */
+	head->nlist = 0;
+	return head;
+}
+
+static GinEntryStack *popEntry(GinEntryStack *stack)
+{
+	GinEntryStack *head = stack;
+	Assert(stack != NULL);
+	stack = stack->next;
+	pfree(head->list);
+	pfree(head);
+	return stack;
+}
+
+/*
+ * Wait until all unfinished workers start dumping their trees.
+ * Return the number of trees to merge.
+ */
+static int waitNextTree(ParallelContext *pcxt, volatile PGinBuildTask *task)
+{
+	int i;
+	int trees = 0;
+
+	for (i = 0; i < pcxt->nworkers; ++i)
+	{
+		volatile WorkerResult *r = &task->results[i];
+
+		if (shm_mq_receive(r->mqh, (Size *)&r->msg_len, (void *)&r->msg_body, false) == SHM_MQ_SUCCESS)
+		{
+			r->end_of_tree = false;
+			trees++;
+		}
+		else
+		{
+			r->end_of_forest = true;
+		}
+	}
+	return trees;
+}
+
+/* Merge the results from all ready (but unfinished) workers. */
+static void mergeReadyAndUnfinished(GinBuildState *buildstate, ParallelContext *pcxt, volatile PGinBuildTask *task)
+{
+	GinEntryStack *entry = NULL;
+	while (true)
+	{
+		int i;
+		bool merged = false;
+
+		for (i = 0; i < pcxt->nworkers; ++i)
+		{
+			GinShmemEntry *shentry;
+			ItemPointerData *oldlist;
+			int cmp;
+			char *cursor;
+			volatile WorkerResult *r = &task->results[i];
+
+			if (r->end_of_forest || r->end_of_tree)
+			{
+				continue;
+			}
+
+			if (r->msg_len == 0) /* end-of-tree */
+			{
+				r->end_of_tree = true;
+				continue;
+			}
+
+			cursor = r->msg_body;
+			Assert(cursor != NULL);
+			shentry = (GinShmemEntry*)cursor;
+			cursor += sizeof(GinShmemEntry);
+
+			if (r->skipkey)
+				cursor += r->skipkey;
+			else
+			{
+				r->key = 0;
+				if (shentry->category == GIN_CAT_NORM_KEY)
+				{
+					bool isnull;
+					char *oldcursor = cursor;
+					r->key = datumRestore(&cursor, &isnull); // TODO: check if this leaks memory in a long-living context
+					r->skipkey = cursor - oldcursor;
+					Assert(!isnull);
+					Assert(r->skipkey);
+				}
+			}
+
+			cmp = -1;
+			if (entry != NULL)
+			{
+				cmp = ginCompareAttEntries(&buildstate->ginstate,
+										   shentry->attnum, r->key, shentry->category,
+										   entry->attnum, entry->key, entry->category);
+			}
+
+			if (cmp > 0)
+			{
+				/* The key is greater, skip the worker. */
+				continue;
+			}
+
+			if (cmp < 0)
+			{
+				/*
+				 * The key is less than what we have on the stack.
+				 * Push a new entry onto the stack.
+				 */
+				entry = pushEntry(entry);
+				entry->key      = r->key;
+				entry->category = shentry->category;
+				entry->attnum   = shentry->attnum;
+			}
+
+			/* The key is less than or equal. Merge the item pointers. */
+			{
+				ItemPointerData *list = (ItemPointerData*)cursor;
+				oldlist = entry->list;
+				entry->list = ginMergeItemPointers(entry->list, entry->nlist,
+												   list, shentry->nlist,
+												   &entry->nlist);
+
+				pfree(oldlist);
+			}
+
+			/* Message consumed. Receive the next one. */
+			r->skipkey = 0;
+			if (shm_mq_receive(r->mqh, (Size *)&r->msg_len, (void *)&r->msg_body, false) != SHM_MQ_SUCCESS)
+				r->end_of_forest = true;
+			merged = true;
+		}
+
+		if (!merged)
+		{
+			/* Nothing merged. Insert the entry into the index and pop the stack. */
+			if (entry == NULL)
+			{
+				/* Also nothing to dump - we have finished. */
+				break;
+			}
+
+			ginEntryInsert(&buildstate->ginstate,
+						   entry->attnum, entry->key, entry->category,
+						   entry->list, entry->nlist,
+						   &buildstate->buildStats);
+			entry = popEntry(entry);
+		}
+	}
+}
+
+static void mergeResults(GinBuildState *buildstate, ParallelContext *pcxt, volatile PGinBuildTask *task)
+{
+	int trees;
+	while ((trees = waitNextTree(pcxt, task)) > 0)
+	{
+		mergeReadyAndUnfinished(buildstate, pcxt, task);
+	}
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
+	GinBuildState buildstate;
+
 	IndexBuildResult *result;
-	double		reltuples;
-	GinBuildState buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
-	ItemPointerData *list;
-	Datum		key;
-	GinNullCategory category;
-	uint32		nlist;
-	MemoryContext oldCtx;
-	OffsetNumber attnum;
+	double reltuples = 0;
+	bool parallel_workers_helped = false;
 
 	if (RelationGetNumberOfBlocks(index) != 0)
 		elog(ERROR, "index \"%s\" already contains data",
 			 RelationGetRelationName(index));
 
-	initGinState(&buildstate.ginstate, index);
-	buildstate.indtuples = 0;
-	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
-
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -363,60 +830,97 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	UnlockReleaseBuffer(RootBuffer);
 	END_CRIT_SECTION();
 
+	initGinState(&buildstate.ginstate, index);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
 	/* count the root as first entry page */
 	buildstate.buildStats.nEntryPages++;
 
-	/*
-	 * create a temporary memory context that is used to hold data not yet
-	 * dumped out to the index
-	 */
-	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
-											  "Gin build temporary context",
-											  ALLOCSET_DEFAULT_MINSIZE,
-											  ALLOCSET_DEFAULT_INITSIZE,
-											  ALLOCSET_DEFAULT_MAXSIZE);
-
-	/*
-	 * create a temporary memory context that is used for calling
-	 * ginExtractEntries(), and can be reset after each tuple
-	 */
-	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
-					 "Gin build temporary context for user-defined function",
-											   ALLOCSET_DEFAULT_MINSIZE,
-											   ALLOCSET_DEFAULT_INITSIZE,
-											   ALLOCSET_DEFAULT_MAXSIZE);
-
-	buildstate.accum.ginstate = &buildstate.ginstate;
-	ginInitBA(&buildstate.accum);
-
-	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
-	 */
-	reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
-								   ginBuildCallback, (void *) &buildstate);
-
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+	if (gin_parallel_workers > GIN_MAX_WORKERS)
+		gin_parallel_workers = GIN_MAX_WORKERS;
+
+	if (gin_parallel_workers > 0)
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
-	}
-	MemoryContextSwitchTo(oldCtx);
+		if (RelationUsesLocalBuffers(heap))
+		{
+			elog(WARNING, "not using parallel GIN build on temporary table %s\n", NameStr(heap->rd_rel->relname));
+		}
+		else
+		{
+			EnterParallelMode();
+			{
+				int i;
+
+				PGinBuildTask   *task;
+				ParallelContext *pcxt;
+				void *shm;
+				void *ptr;
+
+				int *mqsize;
+
+				int size = 0, keys = 0;
+				keys++; size += sizeof(PGinBuildTask);
+				keys++; size += gin_shared_mem * 1024;
+				keys++; size += sizeof(int);
+
+				pcxt = CreateParallelContext(ginbuildWorker, gin_parallel_workers);
+
+				shm_toc_estimate_chunk(&pcxt->estimator, size);
+				shm_toc_estimate_keys(&pcxt->estimator, keys);
+				InitializeParallelDSM(pcxt);
+
+				ptr = shm_toc_allocate(pcxt->toc, sizeof(PGinBuildTask));
+				shm_toc_insert(pcxt->toc, KEY_TASK, ptr);
+				task = (PGinBuildTask*)ptr;
 
-	MemoryContextDelete(buildstate.funcCtx);
-	MemoryContextDelete(buildstate.tmpCtx);
+				ptr = shm_toc_allocate(pcxt->toc, gin_shared_mem * 1024);
+				shm_toc_insert(pcxt->toc, KEY_SHM_ORIGIN, ptr);
+				shm = ptr;
 
-	/*
-	 * Update metapage stats
-	 */
-	buildstate.buildStats.nTotalPages = RelationGetNumberOfBlocks(index);
-	ginUpdateStats(index, &buildstate.buildStats);
+				mqsize = (int *)shm_toc_allocate(pcxt->toc, sizeof(int));
+				*mqsize = gin_shared_mem * 1024 / pcxt->nworkers;
+				shm_toc_insert(pcxt->toc, KEY_SHM_PER_WORKER, mqsize);
+
+				SpinLockInit(&task->lock);
+
+				for (i = 0; i < pcxt->nworkers; i++)
+				{
+					volatile WorkerResult *r = &task->results[i];
+					r->mq = shm_mq_create((char *)shm + i * (*mqsize), *mqsize);
+					shm_mq_set_receiver(r->mq, MyProc);
+					r->mqh = shm_mq_attach(r->mq, pcxt->seg, NULL);
+					r->end_of_tree = false;
+					r->end_of_forest = false;
+					r->msg_body = NULL;
+					r->msg_len = 0;
+					r->skipkey = 0;
+				}
+				task->reltuples = 0;
+				task->to_scan = RelationGetNumberOfBlocks(heap);
+				task->heap_oid = heap->rd_id;
+				task->index_oid = index->rd_id;
+				LaunchParallelWorkers(pcxt);
+				if (pcxt->nworkers_launched > 0)
+				{
+					mergeResults(&buildstate, pcxt, task);
+
+					WaitForParallelWorkersToFinish(pcxt);
+					reltuples = task->reltuples;
+
+					parallel_workers_helped = true;
+				}
+				DestroyParallelContext(pcxt);
+			}
+			ExitParallelMode();
+		}
+	}
+
+	if (!parallel_workers_helped)
+	{
+		/* Do everything myself */
+		reltuples = ginbuildCommon(&buildstate, heap, index, indexInfo);
+	}
 
 	/*
 	 * Return statistics
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index edcafce..e53cc93 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2744,6 +2744,31 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"gin_parallel_workers",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Maximum number of parallel workers for GIN building."),
+			NULL,
+		},
+		&gin_parallel_workers,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"gin_shared_mem",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("The size of shared memory segment for parallel GIN building."),
+			NULL,
+			GUC_UNIT_KB
+		},
+		&gin_shared_mem,
+		16 * 1024, 1024, MAX_KILOBYTES,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index ee3d378..a756218 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -538,6 +538,8 @@
 #xmloption = 'content'
 #gin_fuzzy_search_limit = 0
 #gin_pending_list_limit = 4MB
+#gin_parallel_workers = 0		# 0 disables parallel gin build
+#gin_shared_mem = 16MB
 
 # - Locale and Formatting -
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..91e5b27 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -68,6 +68,8 @@ typedef char GinTernaryValue;
 /* GUC parameters */
 extern PGDLLIMPORT int GinFuzzySearchLimit;
 extern int	gin_pending_list_limit;
+extern int	gin_parallel_workers;
+extern int	gin_shared_mem;
 
 /* ginutil.c */
 extern void ginGetStats(Relation index, GinStatsData *stats);
#14Amit Kapila
amit.kapila16@gmail.com
In reply to: Constantin S. Pan (#13)
Re: [WIP] speeding up GIN build with parallel workers

On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan <kvapen@gmail.com> wrote:

On Mon, 14 Mar 2016 08:42:26 -0400
David Steele <david@pgmasters.net> wrote:

On 2/18/16 10:10 AM, Constantin S. Pan wrote:

On Wed, 17 Feb 2016 23:01:47 +0300
Oleg Bartunov <obartunov@gmail.com> wrote:

My feedback is (Mac OS X 10.11.3)

set gin_parallel_workers=2;
create index message_body_idx on messages using gin(body_tsvector);
LOG: worker process: parallel worker for PID 5689 (PID 6906) was
terminated by signal 11: Segmentation fault

Fixed this, try the new patch. The bug was in incorrect handling
of some GIN categories.

Oleg, it looks like Constantin has updated to patch to address the
issue you were seeing. Do you have time to retest and review?

Thanks,

Actually, there was some progress since. The patch is
attached.

1. Added another GUC parameter for changing the amount of
shared memory for parallel GIN workers.

2. Changed the way results are merged. It uses shared memory
message queue now.

3. Tested on some real data (GIN index on email message body
tsvectors). Here are the timings for different values of
'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU
machine. Seems 'gin_shared_mem' has no visible effect.

wnum mem(MB) time(s)
0 16 247
1 16 256

It seems from your data that with 1 worker you are always seeing a slowdown;
have you investigated the reason for that?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#15Constantin S. Pan
kvapen@gmail.com
In reply to: Amit Kapila (#14)
Re: [WIP] speeding up GIN build with parallel workers

On Wed, 16 Mar 2016 12:14:51 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan <kvapen@gmail.com>
wrote:

On Mon, 14 Mar 2016 08:42:26 -0400
David Steele <david@pgmasters.net> wrote:

On 2/18/16 10:10 AM, Constantin S. Pan wrote:

On Wed, 17 Feb 2016 23:01:47 +0300
Oleg Bartunov <obartunov@gmail.com> wrote:

My feedback is (Mac OS X 10.11.3)

set gin_parallel_workers=2;
create index message_body_idx on messages using
gin(body_tsvector); LOG: worker process: parallel worker for
PID 5689 (PID 6906) was terminated by signal 11: Segmentation
fault

Fixed this, try the new patch. The bug was in incorrect handling
of some GIN categories.

Oleg, it looks like Constantin has updated to patch to address the
issue you were seeing. Do you have time to retest and review?

Thanks,

Actually, there was some progress since. The patch is
attached.

1. Added another GUC parameter for changing the amount of
shared memory for parallel GIN workers.

2. Changed the way results are merged. It uses shared memory
message queue now.

3. Tested on some real data (GIN index on email message body
tsvectors). Here are the timings for different values of
'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU
machine. Seems 'gin_shared_mem' has no visible effect.

wnum mem(MB) time(s)
0 16 247
1 16 256

It seems from you data that with 1 worker, you are always seeing
slowdown, have you investigated the reason of same?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

That slowdown is expected. It slows down because with 1 worker it
has to transfer the results from the worker to the backend.

The backend just waits for the results from the workers and merges them
(in case wnum > 0). So the 1-worker configuration should never be used,
because it is as sequential as the 0-worker, but adds data transfer.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#16Peter Geoghegan
pg@heroku.com
In reply to: Constantin S. Pan (#15)
Re: [WIP] speeding up GIN build with parallel workers

On Wed, Mar 16, 2016 at 2:25 AM, Constantin S. Pan <kvapen@gmail.com> wrote:

The backend just waits for the results from the workers and merges them
(in case wnum > 0). So the 1-worker configuration should never be used,
because it is as sequential as the 0-worker, but adds data transfer.

This is why I wanted an easy way of atomically guaranteeing some
number of workers (typically 2), or not using parallelism at all. I
think the parallel worker API should offer a simple way to do that in
cases like this, where having only 1 worker is never going to win.

--
Peter Geoghegan


#17Constantin S. Pan
kvapen@gmail.com
In reply to: Peter Geoghegan (#16)
Re: [WIP] speeding up GIN build with parallel workers

On Wed, 16 Mar 2016 02:43:47 -0700
Peter Geoghegan <pg@heroku.com> wrote:

On Wed, Mar 16, 2016 at 2:25 AM, Constantin S. Pan <kvapen@gmail.com>
wrote:

The backend just waits for the results from the workers and merges
them (in case wnum > 0). So the 1-worker configuration should never
be used, because it is as sequential as the 0-worker, but adds data
transfer.

This is why I wanted an easy way of atomically guaranteeing some
number of workers (typically 2), or not using parallelism at all. I
think the parallel worker API should offer a simple way to do that in
cases like this, where having only 1 worker is never going to win.

Well, we can check the number of workers actually launched and revert
to the single-backend path when fewer than 2 workers are launched. Let me
code that in.
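
For illustration, a minimal sketch of where that check could go in the
current ginbuild() code (hypothetical, and a real version would also need
to shut a lone launched worker down cleanly before falling back):

	LaunchParallelWorkers(pcxt);
	if (pcxt->nworkers_launched >= 2)
	{
		/* enough workers actually started: merge their results here */
		mergeResults(&buildstate, pcxt, task);
		WaitForParallelWorkersToFinish(pcxt);
		reltuples = task->reltuples;
		parallel_workers_helped = true;
	}
	DestroyParallelContext(pcxt);
	/* with parallel_workers_helped still false, the serial path runs below */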

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Constantin S. Pan (#15)
Re: [WIP] speeding up GIN build with parallel workers

On Wed, Mar 16, 2016 at 2:55 PM, Constantin S. Pan <kvapen@gmail.com> wrote:

On Wed, 16 Mar 2016 12:14:51 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan <kvapen@gmail.com>
wrote:

3. Tested on some real data (GIN index on email message body
tsvectors). Here are the timings for different values of
'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU
machine. Seems 'gin_shared_mem' has no visible effect.

wnum mem(MB) time(s)
0 16 247
1 16 256

It seems from you data that with 1 worker, you are always seeing
slowdown, have you investigated the reason of same?

That slowdown is expected. It slows down because with 1 worker it
has to transfer the results from the worker to the backend.

The backend just waits for the results from the workers and merges them
(in case wnum > 0).

Why does the backend just wait? Why can't it do the same work as any worker
does? In general, for other parallelism features the backend also behaves
the same way as a worker, producing results itself when the results from the
workers are not yet available.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#19Constantin S. Pan
kvapen@gmail.com
In reply to: Amit Kapila (#18)
Re: [WIP] speeding up GIN build with parallel workers

On Wed, 16 Mar 2016 18:08:38 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 16, 2016 at 2:55 PM, Constantin S. Pan <kvapen@gmail.com>
wrote:

On Wed, 16 Mar 2016 12:14:51 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 16, 2016 at 5:41 AM, Constantin S. Pan
<kvapen@gmail.com> wrote:

3. Tested on some real data (GIN index on email message body
tsvectors). Here are the timings for different values of
'gin_shared_mem' and 'gin_parallel_workers' on a 4-CPU
machine. Seems 'gin_shared_mem' has no visible effect.

wnum mem(MB) time(s)
0 16 247
1 16 256

It seems from you data that with 1 worker, you are always seeing
slowdown, have you investigated the reason of same?

That slowdown is expected. It slows down because with 1 worker it
has to transfer the results from the worker to the backend.

The backend just waits for the results from the workers and merges
them (in case wnum > 0).

Why backend just waits, why can't it does the same work as any worker
does? In general, for other parallelism features the backend also
behaves the same way as worker in producing the results if the
results from workers is not available.

We can make the backend do the same work as any worker, but that
will complicate the code for less than a 2% performance boost.
This performance issue matters even less as you increase the
number of workers N, since you save only 1/N-th of the
transfer time.

Is a backend that just waits for its workers really such a bad
practice that it should be avoided at all costs, or can we leave
it as is?

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Constantin S. Pan (#19)
Re: [WIP] speeding up GIN build with parallel workers

On Wed, Mar 16, 2016 at 7:50 PM, Constantin S. Pan <kvapen@gmail.com> wrote:

On Wed, 16 Mar 2016 18:08:38 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

Why backend just waits, why can't it does the same work as any worker
does? In general, for other parallelism features the backend also
behaves the same way as worker in producing the results if the
results from workers is not available.

We can make backend do the same work as any worker, but that
will complicate the code for less than 2 % perfomance boost.

Why do you think it will be just 2%? I think for the single-worker case it
should be much more, as the master backend will be less busy consuming
tuples from the tuple queue. I can't say much about code complexity, as I
haven't yet looked carefully at the logic of the patch, but we didn't find much
difficulty while doing it for parallel scans. One of the commits which
might help you understand how heap scans are currently parallelised
is ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92; you can see if that can help
you in any way with writing a generic API for GIN parallel builds.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#21Constantin S. Pan
kvapen@gmail.com
In reply to: Amit Kapila (#20)
Re: [WIP] speeding up GIN build with parallel workers

On Thu, 17 Mar 2016 13:21:32 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 16, 2016 at 7:50 PM, Constantin S. Pan <kvapen@gmail.com>
wrote:

On Wed, 16 Mar 2016 18:08:38 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

Why backend just waits, why can't it does the same work as any
worker does? In general, for other parallelism features the
backend also behaves the same way as worker in producing the
results if the results from workers is not available.

We can make backend do the same work as any worker, but that
will complicate the code for less than 2 % perfomance boost.

Why do you think it will be just 2%? I think for single worker case,
it should be much more as the master backend will be less busy in
consuming tuples from tuple queue. I can't say much about
code-complexity, as I haven't yet looked carefully at the logic of
patch, but we didn't find much difficulty while doing it for parallel
scans. One of the commit which might help you in understanding how
currently heap scans are parallelised is
ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92, you can see if that can
help you in anyway for writing a generic API for Gin parallel builds.

I looked at the timing details some time ago, which showed
that the backend spent about 1% of total time on data
transfer from 1 worker, and 3% on transfer and merging from
2 workers. So if we use (active backend + 1 worker) instead
of (passive backend + 2 workers), we still have to spend
1.5% on transfer and merging.

Or we can look at these measurements (from yesterday's
message):

wnum  mem(MB)  time(s)
   0       16      247
   1       16      256
   2       16      126

If 2 workers didn't have to transfer and merge their
results, they would have finished in 247 / 2 = 123.5
seconds. But the transfer and merging took another 2.5
seconds. The merging takes a little longer than the
transfer. If we now use backend+worker we get rid of 1
transfer, but still have to do 1 transfer and then merge, so
we will save less than a quarter of those 2.5 seconds.

In other words, we gain almost nothing by teaching the
backend how to be a worker.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Constantin S. Pan (#21)
Re: [WIP] speeding up GIN build with parallel workers

On Thu, Mar 17, 2016 at 2:56 PM, Constantin S. Pan <kvapen@gmail.com> wrote:

On Thu, 17 Mar 2016 13:21:32 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Mar 16, 2016 at 7:50 PM, Constantin S. Pan <kvapen@gmail.com>
wrote:

On Wed, 16 Mar 2016 18:08:38 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

Why backend just waits, why can't it does the same work as any
worker does? In general, for other parallelism features the
backend also behaves the same way as worker in producing the
results if the results from workers is not available.

We can make backend do the same work as any worker, but that
will complicate the code for less than 2 % perfomance boost.

Why do you think it will be just 2%? I think for single worker case,
it should be much more as the master backend will be less busy in
consuming tuples from tuple queue. I can't say much about
code-complexity, as I haven't yet looked carefully at the logic of
patch, but we didn't find much difficulty while doing it for parallel
scans. One of the commit which might help you in understanding how
currently heap scans are parallelised is
ee7ca559fcf404f9a3bd99da85c8f4ea9fbc2e92, you can see if that can
help you in anyway for writing a generic API for Gin parallel builds.

I looked at the timing details some time ago, which showed
that the backend spent about 1% of total time on data
transfer from 1 worker, and 3% on transfer and merging from
2 workers. So if we use (active backend + 1 worker) instead
of (passive backend + 2 workers), we still have to spend
1.5% on transfer and merging.

I think here the comparison should be between (active backend + 1 worker)
and (passive backend + 1 worker), or between (active backend + 2 workers)
and (passive backend + 2 workers). I don't think it is a good assumption
that workers are always freely available and can be used as and when
required for any operation.

Or we can look at these measurements (from yesterday's
message):

wnum mem(MB) time(s)
0 16 247
1 16 256
2 16 126

If 2 workers didn't have to transfer and merge their
results, they would have finished in 247 / 2 = 123.5
seconds. But the transfer and merging took another 2.5
seconds. The merging takes a little longer than the
transfer. If we now use backend+worker we get rid of 1
transfer, but still have to do 1 transfer and then merge, so
we will save less than a quarter of those 2.5 seconds.

If I understand the above data correctly, it seems to indicate that the
majority of the work is done in processing the data, so I think it would
be better if the master and the workers both work together.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#23Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#22)
Re: [WIP] speeding up GIN build with parallel workers

On Thu, Mar 17, 2016 at 11:42 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

I think here the comparison should be between the case of (active backend +
1 worker) with (passive backend + 1 worker) or (active backend + 2 worker)
with (passive backend + 2 workers). I don't think it is good assumption
that workers are always freely available and you can use them as and when
required for any operation.

Strong +1. The pool of background workers is necessarily quite
limited and you can't just gobble them up. I'm not saying that it's
absolutely essential that the leader can also participate, but saying
that 1 active leader + 1 worker is only 2% faster than 1 passive
leader + 2 workers is not comparing apples to apples.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#24Dmitry Ivanov
d.ivanov@postgrespro.ru
In reply to: Constantin S. Pan (#13)
Re: [WIP] speeding up GIN build with parallel workers

Hi Constantin,

I did a quick review of your patch, and here are my comments:

- This patch applies cleanly to the current HEAD (61d2ebdbf91).

- Code compiles without warnings.

- Currently there's no documentation regarding the parallel GIN build feature
and the provided GUC variables.

- Built indexes seem to work normally.

Performance
-----------

I've made a few runs on my laptop (i7-4710HQ, default postgresql.conf), here
are the results:

workers  avg. time (s)
      0            412
      4            133
      8             81

Looks like 8 workers & a backend do the job 5x faster than a sole
backend, which is good!

Code style
----------

There are some things that you've probably overlooked, such as:

task->heap_oid = heap->rd_id;
task->index_oid = index->rd_id;

You could replace direct access to 'rd_id' field with the RelationGetRelid
macro.
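
That is, for example:

	task->heap_oid = RelationGetRelid(heap);
	task->index_oid = RelationGetRelid(index);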

static void ginDumpEntry(GinState *ginstate,
volatile WorkerResult *r

Parameter 'r' is unused, you could remove it.

Some of the functions and pieces of code that you've added do not comply with
the formatting conventions, e.g.

static int claimSomeBlocks(...
static GinEntryStack *pushEntry(

// The message consists of 2 or 3 parts. iovec allows us to send them as

etc.
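
For reference, the usual tree style puts the return type on its own line and
uses C-style block comments, e.g. (illustration only):

	/*
	 * The message consists of 2 or 3 parts. iovec allows us to send them
	 * as one message though the parts are located at unrelated addresses.
	 */
	static int
	claimSomeBlocks(volatile PGinBuildTask *task, int max_blocks, int *first)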

Keep up the good work!


#25Constantin S. Pan
kvapen@gmail.com
In reply to: Dmitry Ivanov (#24)
Re: [WIP] speeding up GIN build with parallel workers

On Fri, 18 Mar 2016 20:40:16 +0300
Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:

Hi Constantin,

I did a quick review of your patch, and here are my comments:

- This patch applies cleanly to the current HEAD (61d2ebdbf91).

- Code compiles without warnings.

- Currently there's no documentation regarding parallel gin build
feature and provided GUC variables.

- Built indexes work seem to work normally.

Performance
-----------

I've made a few runs on my laptop (i7-4710HQ, default
postgresql.conf), here are the results:

workers avg. time (s)
0 412
4 133
8 81

Looks like 8 workers & a backend do the job 5x times faster than a
sole backend, which is good!

Code style
----------

There are some things that you've probably overlooked, such as:

task->heap_oid = heap->rd_id;
task->index_oid = index->rd_id;

You could replace direct access to 'rd_id' field with the
RelationGetRelid macro.

static void ginDumpEntry(GinState *ginstate,
volatile
WorkerResult *r

Parameter 'r' is unused, you could remove it.

Some of the functions and pieces of code that you've added do not
comply to the formatting conventions, e. g.

static int claimSomeBlocks(...
static GinEntryStack *pushEntry(

// The message consists of 2 or 3 parts. iovec allows us to send
them as

etc.

Keep up the good work!

Hi Dmitry,

Thank you for the review. I am working on a new version that will fix
the code style and also involve the backend in the process as one of the
workers.


#26Constantin S. Pan
kvapen@gmail.com
In reply to: Dmitry Ivanov (#24)
1 attachment(s)
Re: [WIP] speeding up GIN build with parallel workers

On Fri, 18 Mar 2016 20:40:16 +0300
Dmitry Ivanov <d.ivanov@postgrespro.ru> wrote:

- Currently there's no documentation regarding parallel gin build
feature and provided GUC variables.

You could replace direct access to 'rd_id' field with the
RelationGetRelid macro.

Parameter 'r' is unused, you could remove it.

Some of the functions and pieces of code that you've added do not
comply to the formatting conventions, e. g.

On Fri, 18 Mar 2016 11:18:46 -0400
Robert Haas <robertmhaas@gmail.com> wrote:

Strong +1. The pool of background workers is necessarily quite
limited and you can't just gobble them up. I'm not saying that it's
absolutely essential that the leader can also participate, but saying
that 1 active leader + 1 worker is only 2% faster than 1 passive
leader + 2 workers is not comparing apples to apples.

On Fri, 18 Mar 2016 09:12:59 +0530
Amit Kapila <amit.kapila16@gmail.com> wrote:

If I understand the above data correctly, then it seems to indicate
that majority of the work is done in processing the data, so I think
it should be better if master and worker both can work together.

Thank you for the support and feedback! Fixed all of these in the new
version. Also added new queries to the 'gin' regression test, which
immediately helped me catch one silly bug. The new patch is
attached.
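
Concretely, the way the new version lets the backend work alongside the
workers is a single shared cursor over the heap blocks, protected by a
spinlock; every participant, leader included, claims small ranges from it in
a loop. Here is a condensed sketch (SharedScanTask and claim_blocks are
simplified stand-ins for the corresponding pieces of the attached patch):

#include "postgres.h"
#include "storage/spin.h"

/* Simplified stand-in for the shared task state in the attached patch. */
typedef struct SharedScanTask
{
	slock_t		lock;		/* set up with SpinLockInit(), protects counters */
	int			to_scan;	/* total number of heap blocks */
	int			scanned;	/* first block that has not been claimed yet */
} SharedScanTask;

/*
 * Claim up to max_blocks blocks.  Returns the number of blocks claimed and
 * sets *first to the start of the claimed range; 0 means the scan is done.
 * The backend and every parallel worker call this in a loop and feed the
 * claimed range to IndexBuildHeapRangeScan, so the leader is never idle.
 */
static int
claim_blocks(volatile SharedScanTask *task, int max_blocks, int *first)
{
	int			nblocks;

	SpinLockAcquire(&task->lock);
	if (task->scanned >= task->to_scan)
	{
		SpinLockRelease(&task->lock);
		return 0;
	}
	*first = task->scanned;
	nblocks = max_blocks;
	if (nblocks > task->to_scan - task->scanned)
		nblocks = task->to_scan - task->scanned;
	task->scanned += nblocks;
	SpinLockRelease(&task->lock);

	return nblocks;
}

With this in place, "1 leader + N workers" really means N+1 processes
scanning the heap, which is what makes the comparison with a passive leader
fair.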

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

pgin-6.patch (text/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7695ec1..7f2e302 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6850,6 +6850,35 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-gin-parallel-workers" xreflabel="gin_parallel_workers">
+      <term><varname>gin_parallel_workers</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>gin_parallel_workers</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Upper limit of the number of parallel workers that should be used when
+        building GIN. Set to 0 to disable parallel GIN build.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-gin-shared-mem" xreflabel="gin_shared_mem">
+      <term><varname>gin_shared_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>gin_shared_mem</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The size of the shared memory segment for parallel GIN building. The segment
+        is used to transfer partial results from workers for final merging.
+        Ignored if not using parallel GIN build.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
    </sect1>
diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README
index fade0cb..4325d88 100644
--- a/src/backend/access/gin/README
+++ b/src/backend/access/gin/README
@@ -49,7 +49,7 @@ Features
   * Write-Ahead Logging (WAL).  (Recoverability from crashes.)
   * User-defined opclasses.  (The scheme is similar to GiST.)
   * Optimized index creation (Makes use of maintenance_work_mem to accumulate
-    postings in memory.)
+    postings in memory and gin_parallel_workers to speed the process up.)
   * Text search support via an opclass
   * Soft upper limit on the returned results set using a GUC variable:
     gin_fuzzy_search_limit
@@ -324,6 +324,24 @@ page-deletions safe; it stamps the deleted pages with an XID and keeps the
 deleted pages around with the right-link intact until all concurrent scans
 have finished.)
 
+Parallel Building
+-----------------
+
+The most expensive part of GIN index building is accumulating the rbtree. GIN
+supports using parallel workers which divide the work between each other. This
+speeds up the accumulating stage almost perfectly, but slows down the dumping
+of the result a little, since the backend now needs to merge lists that
+correspond to the same key but come from multiple workers.
+
+When it is time to build a GIN index on a relation, the backend checks the value of
+gin_parallel_workers config variable and tries to launch the corresponding
+number of parallel workers. The backend also sets up a shared memory segment of
+configurable size (gin_shared_mem). The workers scan the relation, dividing it
+by blocks, until they fill the maintenance_work_mem, then send the accumulated
+tree to the backend through the shared memory segment. The backend merges the
+trees and inserts the entries to the index. The process repeats until the
+relation is fully scanned.
+
 Compatibility
 -------------
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cd21e0e..81ec5be 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -16,16 +16,69 @@
 
 #include "access/gin_private.h"
 #include "access/xloginsert.h"
+#include "access/parallel.h"
+#include "access/xact.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
 #include "storage/indexfsm.h"
+#include "storage/spin.h"
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
+/* GUC parameter */
+int gin_parallel_workers = 0;
+int gin_shared_mem = 0;
 
-typedef struct
+#define KEY_TASK 42
+#define KEY_SHM_ORIGIN 43
+#define KEY_SHM_PER_WORKER 44
+#define GIN_BLOCKS_PER_WORKER 4
+
+/*
+ * The shmem message structure:
+ * Entry, Key, List
+ */
+
+typedef struct GinShmemEntry
+{
+	GinNullCategory category;
+	OffsetNumber attnum;
+	int nlist;
+} GinShmemEntry;
+
+typedef struct WorkerResult
+{
+	void *mq;
+	shm_mq_handle *mqh;
+	bool end_of_tree;
+	bool end_of_forest;
+
+	void *msg_body;
+	Size msg_len;
+	Datum key;
+	int skipkey;
+} WorkerResult;
+
+/*
+ * This structure describes the GIN build task for the parallel workers. We use
+ * OIDs here because workers are separate processes and pointers may become
+ * meaningless for them. The "lock" is used to protect the "scanned" and
+ * "reltuples" fields as the workers modify them.
+ */
+typedef struct GinBuildTask
+{
+	int		to_scan;
+	int		scanned;
+	slock_t	lock;
+	Oid		heap_oid;
+	Oid		index_oid;
+	double	reltuples;
+} GinBuildTask;
+
+typedef struct GinBuildState
 {
 	GinState	ginstate;
 	double		indtuples;
@@ -33,9 +86,16 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+
+	ParallelContext *pcxt;
+	volatile GinBuildTask *task;
+	WorkerResult *results; /* NULL in workers, array in backend */
+
+	/* these only get used by workers */
+	shm_mq *mq;
+	shm_mq_handle *mqh;
 } GinBuildState;
 
-
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -266,6 +326,359 @@ ginHeapTupleBulkInsert(GinBuildState *buildstate, OffsetNumber attnum,
 }
 
 static void
+ginDumpEntry(GinState *ginstate,
+			 shm_mq_handle *mqh,
+			 OffsetNumber attnum,
+			 Datum key,
+			 GinNullCategory category,
+			 ItemPointerData *list,
+			 int nlist)
+{
+	int keylen, listlen;
+
+	bool isnull;
+	Form_pg_attribute att;
+	GinShmemEntry e;
+
+	/*
+	 * The message consists of 2 or 3 parts. iovec allows us to send them as
+	 * one message though the parts are located at unrelated addresses.
+	 */
+	shm_mq_iovec iov[3];
+	int iovlen = 0;
+
+	char *buf = NULL;
+
+	e.category = category;
+	e.attnum = attnum;
+	e.nlist = nlist;
+
+	Assert(nlist > 0);
+
+	isnull = category == GIN_CAT_NULL_KEY;
+	att = ginstate->origTupdesc->attrs[attnum - 1];
+
+	if (e.category == GIN_CAT_NORM_KEY)
+	{
+		keylen = datumEstimateSpace(key, isnull, att->attbyval, att->attlen);
+		Assert(keylen > 0);
+		listlen = e.nlist * sizeof(ItemPointerData);
+	}
+	else
+		keylen = 0;
+
+	listlen = e.nlist * sizeof(ItemPointerData);
+
+	iov[iovlen].data = (char *)&e;
+	iov[iovlen++].len = sizeof(e);
+
+	if (keylen > 0)
+	{
+		char *cursor;
+		buf = palloc(keylen);
+		cursor = buf;
+		datumSerialize(key, isnull, att->attbyval, att->attlen, &cursor);
+		iov[iovlen].data = buf;
+		iov[iovlen++].len = keylen;
+	}
+
+	iov[iovlen].data = (char *)list;
+	iov[iovlen++].len = listlen;
+
+	if (shm_mq_sendv(mqh, iov, iovlen, false) != SHM_MQ_SUCCESS)
+		elog(ERROR,
+			 "worker %d failed to send a result entry to the backend",
+			 ParallelWorkerNumber);
+
+	if (buf)
+		pfree(buf);
+}
+
+static void
+ginSendTree(GinBuildState *buildstate)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	MemoryContext oldCtx;
+
+	Assert(IsParallelWorker());
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+							  &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		ginDumpEntry(&buildstate->ginstate,
+					 buildstate->mqh, attnum, key,
+					 category, list, nlist);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+
+	/* send an empty message as an "end-of-tree" marker */
+	if (shm_mq_send(buildstate->mqh, 0, NULL, false) != SHM_MQ_SUCCESS)
+	{
+		elog(ERROR,
+			 "worker %d failed to send the end-of-tree marker to the backend",
+			 ParallelWorkerNumber);
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+/*
+ * Wait until all unfinished workers start dumping their trees.
+ * Return the number of trees to merge (except the backend's one).
+ */
+static int
+waitNextTree(GinBuildState *buildstate)
+{
+	int i;
+	int trees = 0;
+
+	int wnum = buildstate->pcxt ? buildstate->pcxt->nworkers_launched : 0;
+	for (i = 0; i < wnum; ++i)
+	{
+		WorkerResult *r = buildstate->results + i;
+
+		if (shm_mq_receive(r->mqh, (Size *)&r->msg_len, (void *)&r->msg_body, false) == SHM_MQ_SUCCESS)
+		{
+			r->end_of_tree = false;
+			trees++;
+		}
+		else
+		{
+			r->end_of_forest = true;
+		}
+	}
+	return trees;
+}
+
+typedef struct GinEntry
+{
+	struct GinEntry *next;
+	Datum			key;
+	GinNullCategory	category;
+	OffsetNumber	attnum;
+	ItemPointerData	*list;
+	uint32			nlist;
+} GinEntry;
+
+static bool
+getCandidate(GinBuildState *buildstate,
+			 int worker,
+			 GinEntry *backend_candidate,
+			 GinEntry *candidate)
+{
+	if (worker < 0)
+	{
+		if (!backend_candidate->list)
+			backend_candidate->list = ginGetBAEntry(&buildstate->accum,
+													&backend_candidate->attnum,
+													&backend_candidate->key, 
+													&backend_candidate->category, 
+													&backend_candidate->nlist);
+		if (!backend_candidate->list)
+			return false;
+		*candidate = *backend_candidate;
+	}
+	else
+	{
+		GinShmemEntry *shentry;
+		char *cursor;
+		WorkerResult *r = buildstate->results + worker;
+		Assert(worker >= 0);
+
+		if (r->end_of_forest || r->end_of_tree)
+			return false;
+
+		if (r->msg_len == 0) /* end-of-tree */
+		{
+			r->end_of_tree = true;
+			return false;
+		}
+
+		cursor = r->msg_body;
+		Assert(cursor != NULL);
+		shentry = (GinShmemEntry*)cursor;
+		cursor += sizeof(shentry);
+
+		if (r->skipkey)
+		{
+			/* the key has already been restored */
+			cursor += r->skipkey;
+		}
+		else
+		{
+			r->key = 0;
+			if (shentry->category == GIN_CAT_NORM_KEY)
+			{
+				bool isnull;
+				char *oldcursor = cursor;
+				r->key = datumRestore(&cursor, &isnull);
+				r->skipkey = cursor - oldcursor;
+				Assert(!isnull);
+				Assert(r->skipkey);
+			}
+		}
+		candidate->list = (ItemPointerData*)cursor;
+		candidate->nlist = shentry->nlist;
+		candidate->attnum = shentry->attnum;
+		candidate->key = r->key;
+		candidate->category = shentry->category;
+	}
+	return true;
+}
+
+/* Message consumed. Receive the next one, if not backend. */
+static void
+consumeCandidate(GinBuildState *buildstate, int worker, GinEntry *backend_candidate)
+{
+	if (worker < 0)
+		backend_candidate->list = NULL;
+	else
+	{
+		WorkerResult *r = buildstate->results + worker;
+		r->skipkey = 0;
+		if (shm_mq_receive(r->mqh, (Size *)&r->msg_len, (void *)&r->msg_body, false) != SHM_MQ_SUCCESS)
+			r->end_of_forest = true;
+	}
+}
+
+static GinEntry *
+pushEntry(GinEntry *stack)
+{
+	GinEntry *head = palloc(sizeof(GinEntry));
+	head->next = stack;
+	head->list = palloc(sizeof(ItemPointerData)); /* make ginMergeItemPointers happy */
+	head->nlist = 0;
+	return head;
+}
+
+static GinEntry *
+popEntry(GinEntry *stack)
+{
+	GinEntry *head = stack;
+	Assert(stack != NULL);
+	stack = stack->next;
+	pfree(head->list);
+	pfree(head);
+	return stack;
+}
+
+static int
+compare(GinBuildState *buildstate, GinEntry *a, GinEntry *b)
+{
+	return ginCompareAttEntries(&buildstate->ginstate,
+							    a->attnum, a->key, a->category,
+								b->attnum, b->key, b->category);
+}
+
+/* Merge the trees from all ready (but unfinished) workers (and from myself). */
+static void
+mergeTrees(GinBuildState *buildstate)
+{
+	GinEntry *entry = NULL;
+	GinEntry backend_candidate;
+	backend_candidate.list = NULL;
+
+	ginBeginBAScan(&buildstate->accum);
+
+	while (true)
+	{
+		int i;
+		bool merged = false;
+		int wnum = buildstate->pcxt ? buildstate->pcxt->nworkers_launched : 0;
+
+		/* -1 is used as the id of the backend */
+		for (i = -1; i < wnum; ++i)
+		{
+			GinEntry candidate;
+			int cmp;
+
+			if (!getCandidate(buildstate, i, &backend_candidate, &candidate))
+				continue;
+
+			cmp = -1;
+			if (entry != NULL)
+				cmp = compare(buildstate, &candidate, entry);
+
+			if (cmp > 0)
+			{
+				/* The candidate's key is greater, skip the worker. */
+				continue;
+			}
+
+			if (cmp < 0)
+			{
+				/*
+				 * The candidate's key is less than what we have on the stack.
+				 * Push a new entry onto the stack.
+				 */
+				entry = pushEntry(entry);
+				entry->key      = candidate.key;
+				entry->category = candidate.category;
+				entry->attnum   = candidate.attnum;
+			}
+
+			/* The key is less than or equal. Merge the item pointers. */
+			{
+				int newnlist;
+				ItemPointerData *oldlist = entry->list;
+				entry->list = ginMergeItemPointers(entry->list, entry->nlist,
+												   candidate.list, candidate.nlist,
+												   &newnlist);
+				entry->nlist = newnlist;
+				pfree(oldlist);
+			}
+
+			consumeCandidate(buildstate, i, &backend_candidate);
+			merged = true;
+		}
+
+		if (!merged)
+		{
+			/* Nothing merged. Insert the entry into the index and pop the stack. */
+			if (entry == NULL)
+			{
+				/* Also nothing to dump - we have finished. */
+				break;
+			}
+
+			ginEntryInsert(&buildstate->ginstate,
+						   entry->attnum, entry->key, entry->category,
+						   entry->list, entry->nlist,
+						   &buildstate->buildStats);
+			entry = popEntry(entry);
+		}
+	}
+
+	ginInitBA(&buildstate->accum);
+}
+
+static void
+ginDumpCommon(GinBuildState *buildstate, bool last)
+{
+	if (IsParallelWorker())
+		ginSendTree(buildstate);
+	else
+	{
+		int trees;
+		do
+		{
+			trees = waitNextTree(buildstate);
+			mergeTrees(buildstate);
+		} while (last && (trees > 0));
+	}
+}
+
+static void
 ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 				 bool *isnull, bool tupleIsAlive, void *state)
 {
@@ -282,53 +695,239 @@ ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 
 	/* If we've maxed out our available memory, dump everything to the index */
 	if (buildstate->accum.allocatedMemory >= (Size)maintenance_work_mem * 1024L)
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-								  &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-			ginEntryInsert(&buildstate->ginstate, attnum, key, category,
-						   list, nlist, &buildstate->buildStats);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
-	}
+		ginDumpCommon(buildstate, false);
 
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * Claim "max_blocks" or less blocks. Return the actual number of claimed
+ * blocks and set "first" to point to the first block of the claimed range.
+ * 0 return value means the task has been finished.
+ */
+static int
+claimSomeBlocks(volatile GinBuildTask *task, int max_blocks, int *first)
+{
+	int blocks = 0;
+
+	SpinLockAcquire(&task->lock);
+
+	if (task->scanned >= task->to_scan)
+	{
+		SpinLockRelease(&task->lock);
+		return 0;
+	}
+
+	*first = task->scanned;
+	blocks = max_blocks;
+	if (blocks > task->to_scan - task->scanned)
+		blocks = task->to_scan - task->scanned;
+	task->scanned += blocks;
+
+	SpinLockRelease(&task->lock);
+	return blocks;
+}
+
+static void
+reportReltuples(volatile GinBuildTask *task, double reltuples)
+{
+	SpinLockAcquire(&task->lock);
+	task->reltuples += reltuples;
+	SpinLockRelease(&task->lock);
+}
+
+static double
+ginbuildCommon(GinBuildState *buildstate, Relation heap, Relation index, IndexInfo *indexInfo)
+{
+	double reltuples = 0;
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate->tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context",
+											   ALLOCSET_DEFAULT_MINSIZE,
+											   ALLOCSET_DEFAULT_INITSIZE,
+											   ALLOCSET_DEFAULT_MAXSIZE);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate->funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+												"Gin build temporary context for user-defined function",
+												ALLOCSET_DEFAULT_MINSIZE,
+												ALLOCSET_DEFAULT_INITSIZE,
+												ALLOCSET_DEFAULT_MAXSIZE);
+
+	buildstate->accum.ginstate = &buildstate->ginstate;
+	ginInitBA(&buildstate->accum);
+
+	/*
+	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
+	 * prefers to receive tuples in TID order.
+	 */
+	while (true)
+	{
+		double subtuples;
+		int first, blocks;
+
+		blocks = claimSomeBlocks(buildstate->task, GIN_BLOCKS_PER_WORKER, &first);
+		if (blocks == 0)
+			break;
+
+		subtuples = IndexBuildHeapRangeScan(heap, index, indexInfo, false, false,
+											first, blocks,
+											ginBuildCallback, (void *)buildstate);
+		reltuples += subtuples;
+	}
+
+	/* dump remaining entries to the index */
+	ginDumpCommon(buildstate, true);
+
+	MemoryContextDelete(buildstate->funcCtx);
+	MemoryContextDelete(buildstate->tmpCtx);
+
+	/*
+	 * Update metapage stats
+	 */
+	buildstate->buildStats.nTotalPages = RelationGetNumberOfBlocks(index);
+	ginUpdateStats(index, &buildstate->buildStats);
+
+	return reltuples;
+}
+
+/*
+ * The idea behind parallel GIN build is letting the parallel workers scan the
+ * relation and build rbtree in parallel. Each block is scanned by one of the
+ * workers.
+ */
+static void
+ginbuildWorker(dsm_segment *seg, shm_toc *toc)
+{
+	GinBuildState buildstate;
+
+	Relation heap;
+	Relation index;
+	IndexInfo *indexInfo;
+
+	double reltuples;
+
+	char *shm_origin;
+	int mqsize;
+
+	buildstate.task = (GinBuildTask*)shm_toc_lookup(toc, KEY_TASK);
+
+	shm_origin = (char *)shm_toc_lookup(toc, KEY_SHM_ORIGIN);
+	mqsize = *(int*)shm_toc_lookup(toc, KEY_SHM_PER_WORKER);
+	buildstate.mq = (shm_mq *)(shm_origin + ParallelWorkerNumber * mqsize);
+	shm_mq_set_sender(buildstate.mq, MyProc);
+	buildstate.mqh = shm_mq_attach(buildstate.mq, seg, NULL);
+	shm_mq_wait_for_attach(buildstate.mqh);
+
+	heap = heap_open(buildstate.task->heap_oid, NoLock);
+	index = index_open(buildstate.task->index_oid, NoLock);
+	indexInfo = BuildIndexInfo(index);
+
+	initGinState(&buildstate.ginstate, index);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	reltuples = ginbuildCommon(&buildstate, heap, index, indexInfo);
+
+	index_close(index, NoLock);
+	heap_close(heap, NoLock);
+
+	reportReltuples(buildstate.task, reltuples);
+
+	shm_mq_detach(buildstate.mq);
+}
+
+static void
+initTask(volatile GinBuildTask *task, Relation heap, Relation index)
+{
+	task->to_scan = RelationGetNumberOfBlocks(heap);
+	task->scanned = 0;
+	SpinLockInit(&task->lock);
+	task->heap_oid = RelationGetRelid(heap);
+	task->index_oid = RelationGetRelid(index);
+	task->reltuples = 0;
+}
+
+static void
+launchWorkers(GinBuildState *buildstate, int wnum, Relation heap, Relation index)
+{
+	int i;
+	void *origin;
+	int *mqsize;
+
+	EnterParallelMode();
+	buildstate->pcxt = CreateParallelContext(ginbuildWorker, wnum);
+	{
+		int size = 0, keys = 0;
+		keys++; size += sizeof(GinBuildTask);
+		keys++; size += gin_shared_mem * 1024;
+		keys++; size += sizeof(int); /* for mqsize */
+
+		shm_toc_estimate_chunk(&buildstate->pcxt->estimator, size);
+		shm_toc_estimate_keys(&buildstate->pcxt->estimator, keys);
+	}
+	InitializeParallelDSM(buildstate->pcxt);
+
+	buildstate->task = (GinBuildTask*)shm_toc_allocate(buildstate->pcxt->toc, sizeof(GinBuildTask));
+	shm_toc_insert(buildstate->pcxt->toc, KEY_TASK, (GinBuildTask*)buildstate->task);
+
+	origin = shm_toc_allocate(buildstate->pcxt->toc, gin_shared_mem * 1024);
+	shm_toc_insert(buildstate->pcxt->toc, KEY_SHM_ORIGIN, origin);
+
+	mqsize = (int *)shm_toc_allocate(buildstate->pcxt->toc, sizeof(int));
+	*mqsize = gin_shared_mem * 1024 / buildstate->pcxt->nworkers;
+	shm_toc_insert(buildstate->pcxt->toc, KEY_SHM_PER_WORKER, mqsize);
+
+	initTask(buildstate->task, heap, index);
+
+	buildstate->results = palloc(buildstate->pcxt->nworkers * sizeof(WorkerResult));
+	for (i = 0; i < buildstate->pcxt->nworkers; i++)
+	{
+		WorkerResult *r = buildstate->results + i;
+		r->mq = shm_mq_create((char *)origin + i * (*mqsize), *mqsize);
+		shm_mq_set_receiver(r->mq, MyProc);
+		r->mqh = shm_mq_attach(r->mq, buildstate->pcxt->seg, NULL);
+		r->end_of_tree = false;
+		r->end_of_forest = false;
+		r->msg_body = NULL;
+		r->msg_len = 0;
+		r->skipkey = 0;
+	}
+
+	LaunchParallelWorkers(buildstate->pcxt);
+}
+
+static void
+finishWorkers(GinBuildState *buildstate, GinBuildTask *task)
+{
+	WaitForParallelWorkersToFinish(buildstate->pcxt);
+	memcpy(task, (GinBuildTask *)buildstate->task, sizeof(GinBuildTask)); /* copy the task out of the context before destroying it */
+	DestroyParallelContext(buildstate->pcxt);
+	ExitParallelMode();
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
+	GinBuildState buildstate;
+
 	IndexBuildResult *result;
-	double		reltuples;
-	GinBuildState buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
-	ItemPointerData *list;
-	Datum		key;
-	GinNullCategory category;
-	uint32		nlist;
-	MemoryContext oldCtx;
-	OffsetNumber attnum;
+	double reltuples = 0;
+	int wnum = gin_parallel_workers;
 
 	if (RelationGetNumberOfBlocks(index) != 0)
 		elog(ERROR, "index \"%s\" already contains data",
 			 RelationGetRelationName(index));
 
-	initGinState(&buildstate.ginstate, index);
-	buildstate.indtuples = 0;
-	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
-
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -363,60 +962,40 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	UnlockReleaseBuffer(RootBuffer);
 	END_CRIT_SECTION();
 
+	initGinState(&buildstate.ginstate, index);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
 	/* count the root as first entry page */
 	buildstate.buildStats.nEntryPages++;
 
-	/*
-	 * create a temporary memory context that is used to hold data not yet
-	 * dumped out to the index
-	 */
-	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
-											  "Gin build temporary context",
-											  ALLOCSET_DEFAULT_MINSIZE,
-											  ALLOCSET_DEFAULT_INITSIZE,
-											  ALLOCSET_DEFAULT_MAXSIZE);
-
-	/*
-	 * create a temporary memory context that is used for calling
-	 * ginExtractEntries(), and can be reset after each tuple
-	 */
-	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
-					 "Gin build temporary context for user-defined function",
-											   ALLOCSET_DEFAULT_MINSIZE,
-											   ALLOCSET_DEFAULT_INITSIZE,
-											   ALLOCSET_DEFAULT_MAXSIZE);
-
-	buildstate.accum.ginstate = &buildstate.ginstate;
-	ginInitBA(&buildstate.accum);
+	if ((wnum > 0) && RelationUsesLocalBuffers(heap))
+	{
+		elog(DEBUG1, "not using parallel GIN build on temporary table %s\n", RelationGetRelationName(heap));
+		wnum = 0;
+	}
 
-	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
-	 */
-	reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
-								   ginBuildCallback, (void *) &buildstate);
+	buildstate.pcxt = NULL;
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
-	}
-	MemoryContextSwitchTo(oldCtx);
+		GinBuildTask task;
+		double backend_reltuples = 0;
 
-	MemoryContextDelete(buildstate.funcCtx);
-	MemoryContextDelete(buildstate.tmpCtx);
+		if (wnum > 0)
+		{
+			launchWorkers(&buildstate, wnum, heap, index);
+			backend_reltuples += ginbuildCommon(&buildstate, heap, index, indexInfo);
+			finishWorkers(&buildstate, &task);
+		}
+		else
+		{
+			buildstate.task = &task;
+			initTask(buildstate.task, heap, index);
+			backend_reltuples += ginbuildCommon(&buildstate, heap, index, indexInfo);
+		}
 
-	/*
-	 * Update metapage stats
-	 */
-	buildstate.buildStats.nTotalPages = RelationGetNumberOfBlocks(index);
-	ginUpdateStats(index, &buildstate.buildStats);
+		reltuples = backend_reltuples + task.reltuples;
+	}
 
 	/*
 	 * Return statistics
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a325943..32c6ea0 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2775,6 +2775,31 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"gin_parallel_workers",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Maximum number of parallel workers for GIN building."),
+			NULL,
+		},
+		&gin_parallel_workers,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"gin_shared_mem",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("The size of the shared memory segment for parallel GIN building."),
+			NULL,
+			GUC_UNIT_KB
+		},
+		&gin_shared_mem,
+		16 * 1024, 1024, MAX_KILOBYTES,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 773b4e8..321d572 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -541,6 +541,8 @@
 #xmloption = 'content'
 #gin_fuzzy_search_limit = 0
 #gin_pending_list_limit = 4MB
+#gin_parallel_workers = 0		# 0 disables parallel gin build
+#gin_shared_mem = 16MB
 
 # - Locale and Formatting -
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..91e5b27 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -68,6 +68,8 @@ typedef char GinTernaryValue;
 /* GUC parameters */
 extern PGDLLIMPORT int GinFuzzySearchLimit;
 extern int	gin_pending_list_limit;
+extern int	gin_parallel_workers;
+extern int	gin_shared_mem;
 
 /* ginutil.c */
 extern void ginGetStats(Relation index, GinStatsData *stats);
diff --git a/src/test/regress/expected/gin.out b/src/test/regress/expected/gin.out
index a3911a6..fc6bac0 100644
--- a/src/test/regress/expected/gin.out
+++ b/src/test/regress/expected/gin.out
@@ -23,6 +23,88 @@ select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
                       0
 (1 row)
 
+-- Test parallel building
+drop index gin_test_idx;
+select count(*) from gin_test_tbl where i @> array[1,2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[8];
+ count 
+-------
+     3
+(1 row)
+
+set gin_parallel_workers = 1;
+create index gin_test_idx on gin_test_tbl using gin (i);
+select count(*) from gin_test_tbl where i @> array[1,2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[8];
+ count 
+-------
+     3
+(1 row)
+
+drop index gin_test_idx;
+set gin_parallel_workers = 2;
+create index gin_test_idx on gin_test_tbl using gin (i);
+select count(*) from gin_test_tbl where i @> array[1,2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[8];
+ count 
+-------
+     3
+(1 row)
+
+drop index gin_test_idx;
+set gin_parallel_workers = 8;
+create index gin_test_idx on gin_test_tbl using gin (i);
+select count(*) from gin_test_tbl where i @> array[1,2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[8];
+ count 
+-------
+     3
+(1 row)
+
 -- Test vacuuming
 delete from gin_test_tbl where i @> array[2];
 vacuum gin_test_tbl;
diff --git a/src/test/regress/sql/gin.sql b/src/test/regress/sql/gin.sql
index c566e9b..8b0eb45 100644
--- a/src/test/regress/sql/gin.sql
+++ b/src/test/regress/sql/gin.sql
@@ -19,6 +19,33 @@ vacuum gin_test_tbl; -- flush the fastupdate buffers
 
 select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
 
+-- Test parallel building
+
+drop index gin_test_idx;
+select count(*) from gin_test_tbl where i @> array[1,2];
+select count(*) from gin_test_tbl where i @> array[2];
+select count(*) from gin_test_tbl where i @> array[8];
+
+set gin_parallel_workers = 1;
+create index gin_test_idx on gin_test_tbl using gin (i);
+select count(*) from gin_test_tbl where i @> array[1,2];
+select count(*) from gin_test_tbl where i @> array[2];
+select count(*) from gin_test_tbl where i @> array[8];
+
+drop index gin_test_idx;
+set gin_parallel_workers = 2;
+create index gin_test_idx on gin_test_tbl using gin (i);
+select count(*) from gin_test_tbl where i @> array[1,2];
+select count(*) from gin_test_tbl where i @> array[2];
+select count(*) from gin_test_tbl where i @> array[8];
+
+drop index gin_test_idx;
+set gin_parallel_workers = 8;
+create index gin_test_idx on gin_test_tbl using gin (i);
+select count(*) from gin_test_tbl where i @> array[1,2];
+select count(*) from gin_test_tbl where i @> array[2];
+select count(*) from gin_test_tbl where i @> array[8];
+
 -- Test vacuuming
 delete from gin_test_tbl where i @> array[2];
 vacuum gin_test_tbl;
#27Constantin S. Pan
kvapen@gmail.com
In reply to: Constantin S. Pan (#26)
1 attachment(s)
Re: [PATCH] speeding up GIN build with parallel workers

Here is a new version of the patch, which:

1. Fixes some minor stylistic issues.

2. Uses binaryheap (instead of a custom ugly stack) for merging.

Regards,

Constantin S. Pan
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments:

pgin-7.patch (text/x-patch)
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index d48a13f..c9ff58d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6844,6 +6844,35 @@ dynamic_library_path = 'C:\tools\postgresql;H:\my_project\lib;$libdir'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-gin-parallel-workers" xreflabel="gin_parallel_workers">
+      <term><varname>gin_parallel_workers</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>gin_parallel_workers</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Upper limit of the number of parallel workers that should be used when
+        building GIN. Set to 0 to disable parallel GIN build.
+       </para>
+      </listitem>
+     </varlistentry>
+
+     <varlistentry id="guc-gin-shared-mem" xreflabel="gin_shared_mem">
+      <term><varname>gin_shared_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>gin_shared_mem</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        The size of the shared memory segment for parallel GIN building. The segment
+        is used to transfer partial results from workers for final merging.
+        Ignored if not using parallel GIN build.
+       </para>
+      </listitem>
+     </varlistentry>
+
      </variablelist>
     </sect2>
    </sect1>
diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README
index fade0cb..4325d88 100644
--- a/src/backend/access/gin/README
+++ b/src/backend/access/gin/README
@@ -49,7 +49,7 @@ Features
   * Write-Ahead Logging (WAL).  (Recoverability from crashes.)
   * User-defined opclasses.  (The scheme is similar to GiST.)
   * Optimized index creation (Makes use of maintenance_work_mem to accumulate
-    postings in memory.)
+    postings in memory and gin_parallel_workers to speed the process up.)
   * Text search support via an opclass
   * Soft upper limit on the returned results set using a GUC variable:
     gin_fuzzy_search_limit
@@ -324,6 +324,24 @@ page-deletions safe; it stamps the deleted pages with an XID and keeps the
 deleted pages around with the right-link intact until all concurrent scans
 have finished.)
 
+Parallel Building
+-----------------
+
+The most expensive part of GIN index building is accumulating the rbtree. GIN
+supports using parallel workers which divide the work between each other. This
+speeds up the accumulating stage almost perfectly, but slows down the dumping
+of the result a little, since the backend now needs to merge lists that
+correspond to the same key but come from multiple workers.
+
+When it is time to build a GIN index on a relation, the backend checks the value of
+gin_parallel_workers config variable and tries to launch the corresponding
+number of parallel workers. The backend also sets up a shared memory segment of
+configurable size (gin_shared_mem). The workers scan the relation, dividing it
+by blocks, until they fill the maintenance_work_mem, then send the accumulated
+tree to the backend through the shared memory segment. The backend merges the
+trees and inserts the entries to the index. The process repeats until the
+relation is fully scanned.
+
 Compatibility
 -------------
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cd21e0e..d5cbf6c 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -16,26 +16,95 @@
 
 #include "access/gin_private.h"
 #include "access/xloginsert.h"
+#include "access/parallel.h"
+#include "access/xact.h"
 #include "catalog/index.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/smgr.h"
 #include "storage/indexfsm.h"
+#include "storage/spin.h"
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "lib/binaryheap.h"
 
+/* GUC parameter */
+int gin_parallel_workers = 0;
+int gin_shared_mem = 0;
 
-typedef struct
+#define KEY_TASK 42
+#define KEY_SHM_ORIGIN 43
+#define KEY_SHM_PER_WORKER 44
+#define GIN_BLOCKS_PER_WORKER 4
+
+/*
+ * The results are passed through the shared message queue as a sequence of
+ * messages of the following structure: GinShmemEntry, Key, List
+ */
+typedef struct GinShmemEntry
+{
+	GinNullCategory	category;
+	OffsetNumber	attnum;
+	int				nlist;
+} GinShmemEntry;
+
+/*
+ * This structure is used by the backend to track the current state of the
+ * workers.
+ */
+typedef struct WorkerResult
+{
+	void			*mq;
+	shm_mq_handle	*mqh;
+	bool			 end_of_tree;
+	bool			 end_of_forest;
+} WorkerResult;
+
+typedef struct GinEntry
+{
+	int				 worker;
+	Datum			 key;
+	GinNullCategory	 category;
+	OffsetNumber	 attnum;
+	ItemPointerData	*list;
+	uint32			 nlist;
+} GinEntry;
+
+/*
+ * This shared structure describes the GIN build task for the parallel workers.
+ * We use OIDs here because workers are separate processes and pointers may
+ * become meaningless for them. The "lock" is used to protect the "scanned" and
+ * "reltuples" fields as the workers modify them.
+ */
+typedef struct GinBuildTask
+{
+	int		to_scan;
+	int		scanned;
+	slock_t	lock;
+	Oid		heap_oid;
+	Oid		index_oid;
+	double	reltuples;
+} GinBuildTask;
+
+typedef struct GinBuildState
 {
-	GinState	ginstate;
-	double		indtuples;
-	GinStatsData buildStats;
-	MemoryContext tmpCtx;
-	MemoryContext funcCtx;
+	GinState		 ginstate;
+	double			 indtuples;
+	GinStatsData	 buildStats;
+	MemoryContext	 tmpCtx;
+	MemoryContext	 funcCtx;
 	BuildAccumulator accum;
+
+	ParallelContext	*pcxt;
+	volatile GinBuildTask *task;
+	WorkerResult	*results; /* NULL in workers, array in backend */
+
+	/* these only get used by workers */
+	shm_mq			*mq;
+	shm_mq_handle	*mqh;
 } GinBuildState;
 
-
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -265,6 +334,334 @@ ginHeapTupleBulkInsert(GinBuildState *buildstate, OffsetNumber attnum,
 	MemoryContextReset(buildstate->funcCtx);
 }
 
+/*
+ * Dump one GIN entry from the worker to the backend (through the shared
+ * message queue).
+ */
+static void
+ginDumpEntry(GinState *ginstate,
+			 shm_mq_handle *mqh,
+			 OffsetNumber attnum,
+			 Datum key,
+			 GinNullCategory category,
+			 ItemPointerData *list,
+			 int nlist)
+{
+	int keylen, listlen;
+
+	bool isnull;
+	Form_pg_attribute att;
+	GinShmemEntry e;
+
+	/*
+	 * The message consists of 2 or 3 parts. The IO vector allows us to send
+	 * them as one message though the parts are located at unrelated addresses.
+	 */
+	shm_mq_iovec iov[3];
+	int iovlen = 0;
+
+	char *buf = NULL;
+
+	e.category = category;
+	e.attnum = attnum;
+	e.nlist = nlist;
+
+	Assert(nlist > 0);
+
+	isnull = category == GIN_CAT_NULL_KEY;
+	att = ginstate->origTupdesc->attrs[attnum - 1];
+
+	if (e.category == GIN_CAT_NORM_KEY)
+	{
+		keylen = datumEstimateSpace(key, isnull, att->attbyval, att->attlen);
+		Assert(keylen > 0);
+		listlen = e.nlist * sizeof(ItemPointerData);
+	}
+	else
+	{
+		keylen = 0;
+	}
+
+	listlen = e.nlist * sizeof(ItemPointerData);
+
+	iov[iovlen].data = (char *)&e;
+	iov[iovlen++].len = sizeof(e);
+
+	if (keylen > 0)
+	{
+		char *cursor;
+		buf = palloc(keylen);
+		cursor = buf;
+		datumSerialize(key, isnull, att->attbyval, att->attlen, &cursor);
+		iov[iovlen].data = buf;
+		iov[iovlen++].len = keylen;
+	}
+
+	iov[iovlen].data = (char *)list;
+	iov[iovlen++].len = listlen;
+
+	if (shm_mq_sendv(mqh, iov, iovlen, false) != SHM_MQ_SUCCESS)
+		elog(ERROR,
+			 "worker %d failed to send a result entry to the backend",
+			 ParallelWorkerNumber);
+
+	if (buf)
+		pfree(buf);
+}
+
+/*
+ * Send the accumulated tree from the worker to the backend (through the shared
+ * message queue).
+ */
+static void
+ginSendTree(GinBuildState *buildstate)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	MemoryContext oldCtx;
+
+	Assert(IsParallelWorker());
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+							  &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		ginDumpEntry(&buildstate->ginstate,
+					 buildstate->mqh, attnum, key,
+					 category, list, nlist);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+
+	/* send an empty message as an "end-of-tree" marker */
+	if (shm_mq_send(buildstate->mqh, 0, NULL, false) != SHM_MQ_SUCCESS)
+	{
+		elog(ERROR,
+			 "worker %d failed to send the end-of-tree marker to the backend",
+			 ParallelWorkerNumber);
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
+/*
+ * Resets the 'end-of-tree' marker for all workers.
+ */
+static void
+resetEndOfTree(GinBuildState *buildstate)
+{
+	int i;
+	int wnum = buildstate->pcxt ? buildstate->pcxt->nworkers_launched : 0;
+	for (i = 0; i < wnum; ++i)
+	{
+		WorkerResult *r = buildstate->results + i;
+		r->end_of_tree = false;
+	}
+}
+
+/*
+ * Get the next entry from the message queue (or from the rbtree directly, if
+ * backend, i.e. worker < 0).
+ */
+static GinEntry *
+getNextEntry(GinBuildState *buildstate, int worker)
+{
+	GinEntry e, *ecopy;
+	e.worker = worker;
+
+	if (worker < 0)
+	{
+		e.list = ginGetBAEntry(&buildstate->accum,
+							 &e.attnum,
+							 &e.key,
+							 &e.category,
+							 &e.nlist);
+		if (!e.list)
+			return NULL;
+	}
+	else
+	{
+		GinShmemEntry *shentry;
+		char *cursor;
+		Size msglen;
+		void *msg;
+		WorkerResult *r = buildstate->results + worker;
+
+		if (r->end_of_forest || r->end_of_tree)
+			return NULL;
+
+		if (shm_mq_receive(r->mqh, &msglen, &msg, false) != SHM_MQ_SUCCESS)
+		{
+			r->end_of_forest = true;
+			return NULL;
+		}
+
+		if (msglen == 0) /* end-of-tree */
+		{
+			r->end_of_tree = true;
+			return NULL;
+		}
+
+		cursor = msg;
+		Assert(cursor != NULL);
+		shentry = (GinShmemEntry*)cursor;
+		cursor += sizeof(shentry);
+
+		e.key = 0;
+		if (shentry->category == GIN_CAT_NORM_KEY)
+		{
+			bool isnull;
+			e.key = datumRestore(&cursor, &isnull);
+			Assert(!isnull);
+		}
+		e.nlist = shentry->nlist;
+		e.list = palloc(sizeof(ItemPointerData) * e.nlist);
+		memcpy(e.list, cursor, sizeof(ItemPointerData) * e.nlist);
+		e.attnum = shentry->attnum;
+		e.category = shentry->category;
+	}
+
+	ecopy = palloc(sizeof(e));
+	memcpy(ecopy, &e, sizeof(e));
+	return ecopy;
+}
+
+static int
+compare(GinBuildState *buildstate, GinEntry *a, GinEntry *b)
+{
+	return ginCompareAttEntries(&buildstate->ginstate,
+							    a->attnum, a->key, a->category,
+								b->attnum, b->key, b->category);
+}
+
+static int binheap_compare(Datum a, Datum b, void *arg)
+{
+	return -compare((GinBuildState *)arg, (GinEntry *)a, (GinEntry *)b);
+}
+
+/*
+ * Merge the trees from all ready (but unfinished) workers (and from myself).
+ * Return the number of entries inserted.
+ */
+static int
+mergeTrees(GinBuildState *buildstate)
+{
+	GinEntry *minentry = NULL;
+	binaryheap *binheap = NULL;
+	int i;
+	int inserted = 0;
+	int wnum = buildstate->pcxt ? buildstate->pcxt->nworkers_launched : 0;
+
+	ginBeginBAScan(&buildstate->accum);
+
+	binheap = binaryheap_allocate(wnum + 1, binheap_compare, buildstate);
+
+	/* Populate the binheap */
+	for (i = -1; i < wnum; i++)
+	{
+		GinEntry *e = getNextEntry(buildstate, i);
+		if (e)
+			binaryheap_add(binheap, PointerGetDatum(e));
+	}
+
+	while (!binaryheap_empty(binheap))
+	{
+		GinEntry *candidate = (GinEntry *)DatumGetPointer(binaryheap_remove_first(binheap));
+
+		{
+			/* refill from the same message queue */
+			GinEntry *e;
+			e = getNextEntry(buildstate, candidate->worker);
+			if (e)
+				binaryheap_add(binheap, PointerGetDatum(e));
+		}
+
+		if (minentry)
+		{
+			int cmp = compare(buildstate, candidate, minentry);
+			if (cmp)
+			{
+				/* Merge finished, insert the entry into the index. */
+				Assert(cmp > 0);
+				ginEntryInsert(&buildstate->ginstate,
+							   minentry->attnum, minentry->key, minentry->category,
+							   minentry->list, minentry->nlist,
+							   &buildstate->buildStats);
+				inserted++;
+				pfree(minentry->list);
+				pfree(minentry);
+				minentry = candidate;
+			} else {
+				/* Merge the candidate with minentry. */
+				int newnlist;
+
+				ItemPointerData *oldlist = minentry->list;
+				minentry->list = ginMergeItemPointers(minentry->list, minentry->nlist,
+													  candidate->list, candidate->nlist,
+													  &newnlist);
+				minentry->nlist = newnlist;
+				pfree(candidate->list);   
+				pfree(candidate);
+				pfree(oldlist);           
+			}
+		}
+		else
+		{
+			minentry = candidate;
+		}
+
+		if (minentry)
+		{
+			ginEntryInsert(&buildstate->ginstate,
+						   minentry->attnum, minentry->key, minentry->category,
+						   minentry->list, minentry->nlist,
+						   &buildstate->buildStats);
+			inserted++;
+		}
+	}
+
+	Assert(binaryheap_empty(binheap));
+
+	binaryheap_free(binheap);
+	ginInitBA(&buildstate->accum);
+	return inserted;
+}
+
+/*
+ * The common function used by both backend and worker to dump the accumulator
+ * when it gets full. 'last' should be true on the last call, to make the
+ * backend merge everything from the workers even if the backend has no more
+ * results from itself.
+ */
+static void
+ginDumpCommon(GinBuildState *buildstate, bool last)
+{
+	if (IsParallelWorker())
+		ginSendTree(buildstate);
+	else
+	{
+		int inserted;
+		do
+		{
+			resetEndOfTree(buildstate);
+			inserted = mergeTrees(buildstate);
+		} while (last && (inserted > 0));
+		/*
+		 * If it is not the 'last' dump, then the backend can merge the next
+		 * incoming trees on the next dump. But if it is the 'last' dump, the
+		 * backend has to repeat merging until all workers are finished.
+		 */
+	}
+}
+
 static void
 ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 				 bool *isnull, bool tupleIsAlive, void *state)
@@ -282,53 +679,244 @@ ginBuildCallback(Relation index, HeapTuple htup, Datum *values,
 
 	/* If we've maxed out our available memory, dump everything to the index */
 	if (buildstate->accum.allocatedMemory >= (Size)maintenance_work_mem * 1024L)
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-								  &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-			ginEntryInsert(&buildstate->ginstate, attnum, key, category,
-						   list, nlist, &buildstate->buildStats);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
-	}
+		ginDumpCommon(buildstate, false);
 
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * Claim "max_blocks" or less blocks. Return the actual number of claimed
+ * blocks and set "first" to point to the first block of the claimed range.
+ * 0 return value means the task has been finished.
+ */
+static int
+claimSomeBlocks(volatile GinBuildTask *task, int max_blocks, int *first)
+{
+	int blocks = 0;
+
+	SpinLockAcquire(&task->lock);
+
+	if (task->scanned >= task->to_scan)
+	{
+		SpinLockRelease(&task->lock);
+		return 0;
+	}
+
+	*first = task->scanned;
+	blocks = max_blocks;
+	if (blocks > task->to_scan - task->scanned)
+		blocks = task->to_scan - task->scanned;
+	task->scanned += blocks;
+
+	SpinLockRelease(&task->lock);
+	return blocks;
+}
+
+static void
+reportReltuples(volatile GinBuildTask *task, double reltuples)
+{
+	SpinLockAcquire(&task->lock);
+	task->reltuples += reltuples;
+	SpinLockRelease(&task->lock);
+}
+
+static double
+ginbuildCommon(GinBuildState *buildstate, Relation heap, Relation index, IndexInfo *indexInfo)
+{
+	double reltuples = 0;
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate->tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context",
+											   ALLOCSET_DEFAULT_MINSIZE,
+											   ALLOCSET_DEFAULT_INITSIZE,
+											   ALLOCSET_DEFAULT_MAXSIZE);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate->funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+												"Gin build temporary context for user-defined function",
+												ALLOCSET_DEFAULT_MINSIZE,
+												ALLOCSET_DEFAULT_INITSIZE,
+												ALLOCSET_DEFAULT_MAXSIZE);
+
+	buildstate->accum.ginstate = &buildstate->ginstate;
+	ginInitBA(&buildstate->accum);
+
+	/*
+	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
+	 * prefers to receive tuples in TID order.
+	 */
+	while (true)
+	{
+		double subtuples;
+		int first, blocks;
+
+		blocks = claimSomeBlocks(buildstate->task, GIN_BLOCKS_PER_WORKER, &first);
+		if (blocks == 0)
+			break;
+
+		subtuples = IndexBuildHeapRangeScan(heap, index, indexInfo, false, false,
+											first, blocks,
+											ginBuildCallback, (void *)buildstate);
+		reltuples += subtuples;
+	}
+
+	/* dump remaining entries to the index */
+	ginDumpCommon(buildstate, true);
+
+	MemoryContextDelete(buildstate->funcCtx);
+	MemoryContextDelete(buildstate->tmpCtx);
+
+	/*
+	 * Update metapage stats
+	 */
+	buildstate->buildStats.nTotalPages = RelationGetNumberOfBlocks(index);
+	ginUpdateStats(index, &buildstate->buildStats);
+
+	return reltuples;
+}
+
+/*
+ * The idea behind parallel GIN build is letting the parallel workers scan the
+ * relation and build rbtree in parallel. Each block is scanned by one of the
+ * workers.
+ */
+static void
+ginbuildWorker(dsm_segment *seg, shm_toc *toc)
+{
+	GinBuildState buildstate;
+
+	Relation heap;
+	Relation index;
+	IndexInfo *indexInfo;
+
+	double reltuples;
+
+	char *shm_origin;
+	int mqsize;
+
+	buildstate.task = (GinBuildTask*)shm_toc_lookup(toc, KEY_TASK);
+
+	shm_origin = (char *)shm_toc_lookup(toc, KEY_SHM_ORIGIN);
+	mqsize = *(int*)shm_toc_lookup(toc, KEY_SHM_PER_WORKER);
+	buildstate.mq = (shm_mq *)(shm_origin + ParallelWorkerNumber * mqsize);
+	shm_mq_set_sender(buildstate.mq, MyProc);
+	buildstate.mqh = shm_mq_attach(buildstate.mq, seg, NULL);
+	shm_mq_wait_for_attach(buildstate.mqh);
+
+	/*
+	 * NoLock here because the backend has already opened the heap and the
+	 * index. We reopen them in the worker, because we cannot just pass the
+	 * Relation pointers through the shared memory, thus we rely on OIDs.
+	 */
+	heap = heap_open(buildstate.task->heap_oid, NoLock);
+	index = index_open(buildstate.task->index_oid, NoLock);
+	indexInfo = BuildIndexInfo(index);
+
+	initGinState(&buildstate.ginstate, index);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	reltuples = ginbuildCommon(&buildstate, heap, index, indexInfo);
+
+	index_close(index, NoLock);
+	heap_close(heap, NoLock);
+
+	reportReltuples(buildstate.task, reltuples);
+
+	shm_mq_detach(buildstate.mq);
+}
+
+/* Initialize the parallel gin building task in the shared memory. */
+static void
+initTask(volatile GinBuildTask *task, Relation heap, Relation index)
+{
+	task->to_scan = RelationGetNumberOfBlocks(heap);
+	task->scanned = 0;
+	SpinLockInit(&task->lock);
+	task->heap_oid = RelationGetRelid(heap);
+	task->index_oid = RelationGetRelid(index);
+	task->reltuples = 0;
+}
+
+/* Launch 'wnum' parallel workers to build 'index' on 'heap'. */
+static void
+launchWorkers(GinBuildState *buildstate, int wnum, Relation heap, Relation index)
+{
+	int i;
+	void *origin;
+	int *mqsize;
+
+	EnterParallelMode();
+	buildstate->pcxt = CreateParallelContext(ginbuildWorker, wnum);
+	{
+		int size = 0, keys = 0;
+		keys++; size += sizeof(GinBuildTask);
+		keys++; size += gin_shared_mem * 1024;
+		keys++; size += sizeof(int); /* for mqsize */
+
+		shm_toc_estimate_chunk(&buildstate->pcxt->estimator, size);
+		shm_toc_estimate_keys(&buildstate->pcxt->estimator, keys);
+	}
+	InitializeParallelDSM(buildstate->pcxt);
+
+	buildstate->task = (GinBuildTask*)shm_toc_allocate(buildstate->pcxt->toc, sizeof(GinBuildTask));
+	shm_toc_insert(buildstate->pcxt->toc, KEY_TASK, (GinBuildTask*)buildstate->task);
+
+	origin = shm_toc_allocate(buildstate->pcxt->toc, gin_shared_mem * 1024);
+	shm_toc_insert(buildstate->pcxt->toc, KEY_SHM_ORIGIN, origin);
+
+	mqsize = (int *)shm_toc_allocate(buildstate->pcxt->toc, sizeof(int));
+	*mqsize = gin_shared_mem * 1024 / buildstate->pcxt->nworkers;
+	shm_toc_insert(buildstate->pcxt->toc, KEY_SHM_PER_WORKER, mqsize);
+
+	initTask(buildstate->task, heap, index);
+
+	buildstate->results = palloc(buildstate->pcxt->nworkers * sizeof(WorkerResult));
+	for (i = 0; i < buildstate->pcxt->nworkers; i++)
+	{
+		WorkerResult *r = buildstate->results + i;
+		r->mq = shm_mq_create((char *)origin + i * (*mqsize), *mqsize);
+		shm_mq_set_receiver(r->mq, MyProc);
+		r->mqh = shm_mq_attach(r->mq, buildstate->pcxt->seg, NULL);
+		r->end_of_tree = false;
+		r->end_of_forest = false;
+	}
+
+	LaunchParallelWorkers(buildstate->pcxt);
+}
+
+static void
+finishWorkers(GinBuildState *buildstate, GinBuildTask *task)
+{
+	WaitForParallelWorkersToFinish(buildstate->pcxt);
+	/* copy the task out of the context before destroying it */
+	memcpy(task, (GinBuildTask *)buildstate->task, sizeof(GinBuildTask));
+	DestroyParallelContext(buildstate->pcxt);
+	ExitParallelMode();
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
+	GinBuildState buildstate;
+
 	IndexBuildResult *result;
-	double		reltuples;
-	GinBuildState buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
-	ItemPointerData *list;
-	Datum		key;
-	GinNullCategory category;
-	uint32		nlist;
-	MemoryContext oldCtx;
-	OffsetNumber attnum;
+	double reltuples = 0;
+	int wnum = gin_parallel_workers;
 
 	if (RelationGetNumberOfBlocks(index) != 0)
 		elog(ERROR, "index \"%s\" already contains data",
 			 RelationGetRelationName(index));
 
-	initGinState(&buildstate.ginstate, index);
-	buildstate.indtuples = 0;
-	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
-
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -363,60 +951,40 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	UnlockReleaseBuffer(RootBuffer);
 	END_CRIT_SECTION();
 
+	initGinState(&buildstate.ginstate, index);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
 	/* count the root as first entry page */
 	buildstate.buildStats.nEntryPages++;
 
-	/*
-	 * create a temporary memory context that is used to hold data not yet
-	 * dumped out to the index
-	 */
-	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
-											  "Gin build temporary context",
-											  ALLOCSET_DEFAULT_MINSIZE,
-											  ALLOCSET_DEFAULT_INITSIZE,
-											  ALLOCSET_DEFAULT_MAXSIZE);
-
-	/*
-	 * create a temporary memory context that is used for calling
-	 * ginExtractEntries(), and can be reset after each tuple
-	 */
-	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
-					 "Gin build temporary context for user-defined function",
-											   ALLOCSET_DEFAULT_MINSIZE,
-											   ALLOCSET_DEFAULT_INITSIZE,
-											   ALLOCSET_DEFAULT_MAXSIZE);
-
-	buildstate.accum.ginstate = &buildstate.ginstate;
-	ginInitBA(&buildstate.accum);
+	if ((wnum > 0) && RelationUsesLocalBuffers(heap))
+	{
+		elog(DEBUG1, "not using parallel GIN build on temporary table %s\n", RelationGetRelationName(heap));
+		wnum = 0;
+	}
 
-	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
-	 */
-	reltuples = IndexBuildHeapScan(heap, index, indexInfo, false,
-								   ginBuildCallback, (void *) &buildstate);
+	buildstate.pcxt = NULL;
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
-	}
-	MemoryContextSwitchTo(oldCtx);
+		GinBuildTask	task;
+		double			backend_reltuples = 0;
 
-	MemoryContextDelete(buildstate.funcCtx);
-	MemoryContextDelete(buildstate.tmpCtx);
+		if (wnum > 0)
+		{
+			launchWorkers(&buildstate, wnum, heap, index);
+			backend_reltuples += ginbuildCommon(&buildstate, heap, index, indexInfo);
+			finishWorkers(&buildstate, &task);
+		}
+		else
+		{
+			buildstate.task = &task;
+			initTask(buildstate.task, heap, index);
+			backend_reltuples += ginbuildCommon(&buildstate, heap, index, indexInfo);
+		}
 
-	/*
-	 * Update metapage stats
-	 */
-	buildstate.buildStats.nTotalPages = RelationGetNumberOfBlocks(index);
-	ginUpdateStats(index, &buildstate.buildStats);
+		reltuples = backend_reltuples + task.reltuples;
+	}
 
 	/*
 	 * Return statistics
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 65a6cd4..6afbb8a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2775,6 +2775,31 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"gin_parallel_workers",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("Maximum number of parallel workers for GIN building."),
+			NULL,
+		},
+		&gin_parallel_workers,
+		0, 0, MAX_BACKENDS,
+		NULL, NULL, NULL
+	},
+
+	{
+		{"gin_shared_mem",
+			PGC_USERSET,
+			RESOURCES_ASYNCHRONOUS,
+			gettext_noop("The size of the shared memory segment for parallel GIN building."),
+			NULL,
+			GUC_UNIT_KB
+		},
+		&gin_shared_mem,
+		16 * 1024, 1024, MAX_KILOBYTES,
+		NULL, NULL, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5536012..15bf5ca 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -541,6 +541,8 @@
 #xmloption = 'content'
 #gin_fuzzy_search_limit = 0
 #gin_pending_list_limit = 4MB
+#gin_parallel_workers = 0		# 0 disables parallel gin build
+#gin_shared_mem = 16MB
 
 # - Locale and Formatting -
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e5b2e10..91e5b27 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -68,6 +68,8 @@ typedef char GinTernaryValue;
 /* GUC parameters */
 extern PGDLLIMPORT int GinFuzzySearchLimit;
 extern int	gin_pending_list_limit;
+extern int	gin_parallel_workers;
+extern int	gin_shared_mem;
 
 /* ginutil.c */
 extern void ginGetStats(Relation index, GinStatsData *stats);
diff --git a/src/test/regress/expected/gin.out b/src/test/regress/expected/gin.out
index a3911a6..fc6bac0 100644
--- a/src/test/regress/expected/gin.out
+++ b/src/test/regress/expected/gin.out
@@ -23,6 +23,88 @@ select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
                       0
 (1 row)
 
+-- Test parallel building
+drop index gin_test_idx;
+select count(*) from gin_test_tbl where i @> array[1,2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[8];
+ count 
+-------
+     3
+(1 row)
+
+set gin_parallel_workers = 1;
+create index gin_test_idx on gin_test_tbl using gin (i);
+select count(*) from gin_test_tbl where i @> array[1,2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[8];
+ count 
+-------
+     3
+(1 row)
+
+drop index gin_test_idx;
+set gin_parallel_workers = 2;
+create index gin_test_idx on gin_test_tbl using gin (i);
+select count(*) from gin_test_tbl where i @> array[1,2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[8];
+ count 
+-------
+     3
+(1 row)
+
+drop index gin_test_idx;
+set gin_parallel_workers = 8;
+create index gin_test_idx on gin_test_tbl using gin (i);
+select count(*) from gin_test_tbl where i @> array[1,2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[2];
+ count 
+-------
+ 20002
+(1 row)
+
+select count(*) from gin_test_tbl where i @> array[8];
+ count 
+-------
+     3
+(1 row)
+
 -- Test vacuuming
 delete from gin_test_tbl where i @> array[2];
 vacuum gin_test_tbl;
diff --git a/src/test/regress/sql/gin.sql b/src/test/regress/sql/gin.sql
index c566e9b..8b0eb45 100644
--- a/src/test/regress/sql/gin.sql
+++ b/src/test/regress/sql/gin.sql
@@ -19,6 +19,33 @@ vacuum gin_test_tbl; -- flush the fastupdate buffers
 
 select gin_clean_pending_list('gin_test_idx'); -- nothing to flush
 
+-- Test parallel building
+
+drop index gin_test_idx;
+select count(*) from gin_test_tbl where i @> array[1,2];
+select count(*) from gin_test_tbl where i @> array[2];
+select count(*) from gin_test_tbl where i @> array[8];
+
+set gin_parallel_workers = 1;
+create index gin_test_idx on gin_test_tbl using gin (i);
+select count(*) from gin_test_tbl where i @> array[1,2];
+select count(*) from gin_test_tbl where i @> array[2];
+select count(*) from gin_test_tbl where i @> array[8];
+
+drop index gin_test_idx;
+set gin_parallel_workers = 2;
+create index gin_test_idx on gin_test_tbl using gin (i);
+select count(*) from gin_test_tbl where i @> array[1,2];
+select count(*) from gin_test_tbl where i @> array[2];
+select count(*) from gin_test_tbl where i @> array[8];
+
+drop index gin_test_idx;
+set gin_parallel_workers = 8;
+create index gin_test_idx on gin_test_tbl using gin (i);
+select count(*) from gin_test_tbl where i @> array[1,2];
+select count(*) from gin_test_tbl where i @> array[2];
+select count(*) from gin_test_tbl where i @> array[8];
+
 -- Test vacuuming
 delete from gin_test_tbl where i @> array[2];
 vacuum gin_test_tbl;
#28Robert Haas
robertmhaas@gmail.com
In reply to: Constantin S. Pan (#27)
Re: [PATCH] speeding up GIN build with parallel workers

On Fri, Apr 8, 2016 at 10:04 AM, Constantin S. Pan <kvapen@gmail.com> wrote:

Here is a new version of the patch, which:

1. Fixes some minor stylistic issues.

2. Uses binaryheap (instead of a custom ugly stack) for merging.

I think we need to push this patch out to 9.7. This code has had a
little review here and there from various people, but clearly not
enough to push it into the tree at the very last minute. I don't
think we even have enough discussion to conclude that things like the
gin_shared_mem GUCs are good ideas rather than bad ones.

Also, I personally find this code to be extremely low on comments.
There's barely any explanation of what the overall methodology is
here. Most of the functions do not have a header comment explaining
what they do. The code hasn't been run through pgindent. There's no
check to see whether the computation that will be done inside the GIN
index is parallel-safe; what if it's an expression index on an unsafe
function? Opening the heap and index with no lock in the worker is
wrong; the worker should use the same lock mode as the leader.
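
For illustration, a worker-side open could look something like the
sketch below, mirroring the lock modes the leader already holds for a
plain CREATE INDEX (ShareLock on the heap, AccessExclusiveLock on the
new index). The task fields are hypothetical stand-ins for whatever
the patch passes through shared memory:

	/* Sketch only: reopen with the leader's lock modes, not NoLock. */
	Relation	heap = heap_open(task->heap_oid, ShareLock);
	Relation	index = index_open(task->index_oid, AccessExclusiveLock);

	/* ... scan the assigned block range and accumulate entries ... */

	index_close(index, NoLock);		/* keep the locks until end of xact */
	heap_close(heap, NoLock);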

That's just on a quick read-through; I'm sure there's more. I'm going
to move this to the next CommitFest. Hopefully someone will volunteer
to do some serious review of this, because the feature sounds cool.
Possibly that person will even be me. But it will not be today.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#29Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Jeff Janes (#4)
Re: Proposal: speeding up GIN build with parallel workers

On 01/17/2016 10:03 PM, Jeff Janes wrote:

On Fri, Jan 15, 2016 at 3:29 PM, Peter Geoghegan <pg@heroku.com> wrote:

On Fri, Jan 15, 2016 at 2:38 PM, Constantin S. Pan <kvapen@gmail.com> wrote:

I have a draft implementation which divides the whole process between
N parallel workers, see the patch attached. Instead of a full scan of
the relation, I give each worker a range of blocks to read.

I am currently working on a patch that allows B-Tree index builds to
be performed in parallel. I think I'm a week or two away from posting
it.

Even without parallelism, wouldn't it be better if GIN indexes were
built using tuplesort? I know way way less about the gin am than the
nbtree am, but I imagine that a prominent cost for GIN index builds is
constructing the main B-Tree (the one that's constructed over key
values) itself. Couldn't tuplesort.c be adapted to cover this case?
That would be much faster in general, particularly with the recent
addition of abbreviated keys, while also leaving a clear path forward
to performing the build in parallel.

I think it would take a lot of changes to tuple sort to make this an
almost-always win.

In the general case each GIN key occurs in many tuples, and the
in-memory rbtree is good at compressing the TID list for each key to
maximize the amount of data that can be held in memory. (It could
perhaps be even better: it doesn't use varbyte encoding on the TID
list the way the on-disk posting lists do. It could do so in the bulk
build case, where it receives the TIDs in order, but not feasibly in
the pending-list clean-up case, where it doesn't get them in order.)
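
(For reference, the varbyte idea is 7 data bits per byte plus a
continuation bit, applied to the deltas of an ascending TID stream.
The standalone sketch below is only an illustration of that idea; the
real encoder in ginpostinglist.c packs block and offset numbers into
the value first.)

#include <stddef.h>
#include <stdint.h>

static size_t
encode_varbyte_deltas(const uint64_t *tids, int ntids, unsigned char *out)
{
	size_t		len = 0;
	uint64_t	prev = 0;
	int			i;

	for (i = 0; i < ntids; i++)
	{
		uint64_t	delta = tids[i] - prev;	/* requires ascending input */

		prev = tids[i];
		do
		{
			unsigned char	b = delta & 0x7F;

			delta >>= 7;
			if (delta)
				b |= 0x80;		/* continuation bit: more bytes follow */
			out[len++] = b;
		} while (delta > 0);
	}
	return len;
}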

When I was testing building an index on a column of text identifiers
where there were a couple million identifiers occurring a few dozen
times each, building it as a gin index (using btree_gin so that plain
text columns could be indexed) was quite a bit faster than building it
as a regular btree index using tuple sort. I didn't really
investigate this difference, but I assume it was due to the better
memory usage leading to less IO.

I've been wondering about this too, using tuplesort to build GIN
indexes, for a long time. Surely the highly-optimized quicksort+merge
code in tuplesort.c is faster than maintaining a red-black tree? Or so I
thought.

I wrote a quick prototype of using tuplesort.c for GIN index build. I
tested it with a 500 MB table of integer arrays, so that the sort fit
completely in memory. It's a lot slower than the current code. It turns
out eliminating the duplicates early is really really important.

So we want to keep the red-black tree, to eliminate the duplicates. Or
add that capability to tuplesort.c, which might speed up Sort+Unique and
aggregates in general, but that's a big effort.
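
To make the trade-off concrete, the Sort+Unique idea amounts to
something like the standalone sketch below (not tuplesort.c code):
sort (key, tid) pairs, then collapse each run of equal keys into a
single posting list before touching the index.

#include <stdint.h>
#include <stdlib.h>

typedef struct
{
	uint32_t	key;			/* stand-in for a GIN key */
	uint64_t	tid;			/* stand-in for an item pointer */
} KeyTid;

static int
cmp_keytid(const void *a, const void *b)
{
	const KeyTid *x = a;
	const KeyTid *y = b;

	if (x->key != y->key)
		return (x->key < y->key) ? -1 : 1;
	if (x->tid != y->tid)
		return (x->tid < y->tid) ? -1 : 1;
	return 0;
}

static void
sort_and_collapse(KeyTid *items, int n)
{
	int			i = 0;

	qsort(items, n, sizeof(KeyTid), cmp_keytid);
	while (i < n)
	{
		int			j = i;

		while (j < n && items[j].key == items[i].key)
			j++;
		/* items[i .. j-1] is the sorted posting list for one key; do a
		 * single index insertion here instead of one per heap tuple. */
		i = j;
	}
}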

However, I still wonder if we shouldn't use a merge approach when the
tree doesn't fit in memory, like tuplesort.c does. Currently, when the
tree is full, we flush it out to the index by inserting all the
entries. That can get quite expensive, I/O-wise. It also generates
more WAL than writing each page only once would.

If we flushed the tree to a tape instead, then we could perhaps use the
machinery that Peter's parallel B-tree patch is adding to tuplesort.c,
to merge the tapes. I'm not sure if that works out, but I think it's
worth some experimentation.
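
The merge step itself is conceptually simple. Here is a crude
standalone sketch of merging sorted runs ("tapes") of keys, with a
linear scan over the run heads where tuplesort.c would use a binary
heap, and the actual index append left as a comment:

#include <stdint.h>

typedef struct
{
	const uint32_t *keys;		/* ascending keys of one flushed run */
	int			nkeys;
	int			pos;
} Run;

static int
pick_min_run(Run *runs, int nruns)
{
	int			best = -1;
	int			i;

	for (i = 0; i < nruns; i++)
	{
		if (runs[i].pos >= runs[i].nkeys)
			continue;
		if (best < 0 ||
			runs[i].keys[runs[i].pos] < runs[best].keys[runs[best].pos])
			best = i;
	}
	return best;
}

static void
merge_runs(Run *runs, int nruns)
{
	int			r;

	while ((r = pick_min_run(runs, nruns)) >= 0)
	{
		uint32_t	key = runs[r].keys[runs[r].pos];
		int			i;

		/* Advance every run past this key, so its posting data gets
		 * combined and written to the index exactly once. */
		for (i = 0; i < nruns; i++)
			while (runs[i].pos < runs[i].nkeys &&
				   runs[i].keys[runs[i].pos] == key)
				runs[i].pos++;

		/* ... append the merged posting list for "key" here ... */
	}
}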

But I do think this with aggregation would be worth it despite the
amount of work involved. In addition to automatically benefiting from
parallel sorts and any other future sort improvements, I think it
would also generate better indexes. I have a theory that a problem
with very large gin indexes is that repeatedly building
maintenance_work_mem worth of rbtree entries and then merging them to
disk yields highly fragmented btrees, in which logically adjacent
key-space does not occupy physically nearby leaf pages. That can then
cause problems either with access that follows a pattern (like range
scans, except gin indexes can't really do those anyway) or with
further bulk operations.

Yeah, there's that too.

- Heikki

#30Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#29)
Re: Proposal: speeding up GIN build with parallel workers

On Wed, Sep 14, 2016 at 3:48 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

If we flushed the tree to a tape instead, then we could perhaps use the
machinery that Peter's parallel B-tree patch is adding to tuplesort.c, to
merge the tapes. I'm not sure if that works out, but I think it's worth some
experimentation.

Marking as returned with feedback per mainly this comment.
--
Michael
