Parallel CREATE INDEX for GIN indexes

Started by Tomas Vondra, over 1 year ago; 71 messages
#1 Tomas Vondra <tomas.vondra@enterprisedb.com>
9 attachment(s)

Hi,

In PG17 we shall have parallel CREATE INDEX for BRIN indexes, and back
when working on that I was thinking about how difficult it would be to do
something similar for other index types, like GIN. I even had that on my
list of ideas to pitch to potential contributors, as I was fairly sure
it's doable and reasonably isolated / well-defined.

However, I was not aware of any takers, so a couple days ago on a slow
weekend I took a stab at it. And yes, it's doable - attached is a fairly
complete, tested and polished version of the feature, I think. It turned
out to be a bit more complex than I expected, for reasons that I'll get
into when discussing the patches.

First, let's talk about the benefits - how much faster is this than the
single-process build we have for GIN indexes? I have a table with the
archive of all our mailing lists - it's ~1.5M messages, the table is
~21GB (the raw dump is about 28GB). It includes simple text data (message
body), JSONB (headers) and tsvector (full-text on message body).

If I do CREATE INDEX with different numbers of workers (0 means a serial
build), I get these timings (in seconds):

   workers    trgm    tsvector    jsonb    jsonb (hash)
   -----------------------------------------------------
         0    1240         378      104              57
         1     773         196       59              85
         2     548         163       51              78
         3     423         153       45              75
         4     362         142       43              75
         5     323         134       40              70
         6     295         130       39              73

Perhaps easier to understand is this table of timings relative to the
serial build:

   workers    trgm    tsvector    jsonb    jsonb (hash)
   -----------------------------------------------------
         1     62%         52%      57%            149%
         2     44%         43%      49%            136%
         3     34%         40%      43%            132%
         4     29%         38%      41%            131%
         5     26%         35%      39%            123%
         6     24%         34%      38%            129%

This shows the benefits are pretty significant, depending on the opclass.
For most indexes it's maybe ~3-4x faster, which is nice, and I don't
think it's possible to do much better - the actual index inserts can
happen from a single process only, which is the main limit.

For some of the opclasses it can regress (like the jsonb_path_ops). I
don't think that's a major issue. Or more precisely, I'm not surprised
by it. It'd be nice to be able to disable the parallel builds in these
cases somehow, but I haven't thought about that.

I do plan to do some tests with btree_gin, but I don't expect that to
behave significantly differently.

There are small variations in the index size, when built in the serial
way and the parallel way. It's generally within ~5-10%, and I believe
it's due to the serial build adding the TIDs incrementally, while the
parallel build adds them in much larger chunks (possibly even in one
chunk with all the TIDs for the key). I believe the same size variation
if the index gets built in a different way, e.g. by inserting the data
in a different order, etc. I did a number of tests to check if the index
produces the correct results, and I haven't found any issues. So I think
this is OK, and neither a problem nor an advantage of the patch.

Now, let's talk about the code - the series has 7 patches, with 6
non-trivial parts, each making changes in a focused and (I hope) easy to
understand piece.

1) v20240502-0001-Allow-parallel-create-for-GIN-indexes.patch

This is the initial feature, adding the "basic" version, implemented as
pretty much a 1:1 copy of the BRIN parallel build, with minimal changes
to make it work for GIN (mostly about how to store intermediate results).

The basic idea is that the workers do the regular build, but instead of
flushing the data into the index after hitting the memory limit, the data
gets written into a shared tuplesort and sorted by the index key. The
leader then reads this sorted data, accumulates the TIDs for a given key,
and inserts them into the index in one go.
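
To make that concrete, here's a heavily condensed sketch of the two
sides, pieced together from the patches below (the GinBuffer helpers and
tuplesort_putgintuple/tuplesort_getgintuple come from the patch series;
declarations, error handling and memory accounting are omitted):

/* worker: on hitting the memory limit, spill the accumulated entries
 * into the shared tuplesort instead of inserting them into the index */
ginBeginBAScan(&buildstate->accum);
while ((list = ginGetBAEntry(&buildstate->accum,
							 &attnum, &key, &category, &nlist)) != NULL)
{
	Size		tuplen;
	GinTuple   *tup = _gin_build_tuple(attnum, category, key,
									   attr->attlen, attr->attbyval,
									   list, nlist, &tuplen);

	tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
	pfree(tup);
}

/* leader: read the tuples back sorted by key, accumulate the TID lists
 * for each key, and insert every key into the index just once */
while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
{
	if (!GinBufferIsEmpty(buffer) && !GinBufferKeyEquals(buffer, tup))
	{
		ginEntryInsert(&state->ginstate,
					   buffer->attnum, buffer->key, buffer->category,
					   buffer->items, buffer->nitems, &state->buildStats);
		GinBufferReset(buffer);
	}
	GinBufferStoreTuple(buffer, tup);
}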

2) v20240502-0002-Use-mergesort-in-the-leader-process.patch

The approach implemented by 0001 works, but there's a bit of an issue -
if there are many distinct keys (which can happen very easily with
trigrams), the workers will hit the memory limit with only very short
TID lists for most keys. For a serial build that means merging the data
into a lot of random places, and in a parallel build it means the leader
will have to merge a lot of tiny lists from many sorted rows.

Which can be quite annoying and expensive, because the leader does so
using qsort() in the serial part. It'd be better to ensure most of the
sorting happens in the workers, and the leader can do a mergesort. But
the mergesort must not happen too often - merging many small lists is
not cheaper than a single qsort (especially when the lists overlap).

So this patch changes the workers to process the data in two phases. The
first phase works as before, except that the data is flushed into a local
tuplesort. Then each worker sorts the results it produced, combines them
into results with much larger TID lists, and writes those results to the
shared tuplesort. So the leader only gets very few lists to combine for a
given key - usually just one list per worker.
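
The worker control flow then looks roughly like this (a simplified
sketch; the function names match the patch series):

/* phase 1: parallel heap scan - the callback accumulates entries and
 * flushes them into a *local* tuplesort whenever the memory limit is
 * hit */
reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
								   ginBuildCallbackParallel, state, scan);

/* phase 2: sort the local data by key, merge all runs belonging to the
 * same key into one much longer TID list, and write the combined tuples
 * into the shared tuplesort read by the leader */
tuplesort_performsort(state->bs_worker_sort);
_gin_process_worker_data(state, state->bs_worker_sort);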

3) v20240502-0003-Remove-the-explicit-pg_qsort-in-workers.patch

In 0002 the workers still do an explicit qsort() on the TID list before
writing the data into the shared tuplesort. But we can do better - the
workers can do a merge sort too. To help with this, we add the first TID
to the tuplesort tuple and sort by that too - it helps the workers process
the data in an order that allows simple concatenation instead of a full
mergesort.
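
For illustration, with made-up TID values, the tuples a single worker
produces for one key now arrive like this:

/*
 * lists for key 'abc', ordered by first TID (hypothetical values):
 *
 *   [(1,1) ... (100,5)]  [(100,6) ... (205,2)]  [(205,4) ... (342,9)]
 *
 * each list starts after the previous one ends, so ginMergeItemPointers()
 * notices the lists don't overlap and simply concatenates them, without
 * doing a full element-by-element merge
 */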

Note: There's a non-obvious issue due to parallel scans always being
"sync scans", which may lead to very "wide" TID ranges when the scan
wraps around. More about that later.

4) v20240502-0004-Compress-TID-lists-before-writing-tuples-t.patch

The parallel build passes data between processes using temporary files,
which means it may need a significant amount of disk space. For BRIN this
was not a major concern, because the summaries tend to be pretty small.

But for GIN that's not the case, and the two-phase processing introduced
by 0002 makes it worse, because the worker essentially creates another
copy of the intermediate data. It does not need to copy the key, so maybe
it's not exactly 2x the space requirement, but in the worst case it's not
far from that.

But there's a simple way to improve this - the TID lists tend to be very
compressible, and GIN already implements a very lightweight TID
compression, so this patch does just that - when building the tuple to be
written into the tuplesort, we just compress the TIDs.
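
The gist of the serialization loop (as in the 0004 patch below;
ginCompressPostingList() is the existing GIN posting-list compressor,
and it caps the segment size, hence the loop):

/* generate compressed segments covering the whole (sorted) TID array */
ncompressed = 0;
compresslen = 0;
while (ncompressed < nitems)
{
	int			cnt;
	GinPostingList *seg = ginCompressPostingList(&items[ncompressed],
												 (nitems - ncompressed),
												 UINT16_MAX, &cnt);

	ncompressed += cnt;			/* TIDs encoded so far */
	compresslen += SizeOfGinPostingList(seg);	/* compressed bytes */

	/* stash the segment in a list - it gets copied into the GinTuple
	 * once the total length is known */
}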

5) v20240502-0005-Collect-and-print-compression-stats.patch

This patch simply collects some statistics about the compression, to show
how much it reduces the amount of data in the various phases. The data
I've seen so far usually shows a compression ratio (compressed/raw) of
~75% in the first phase, and ~30% in the second phase.

That is, in the first phase we save ~25% of the space, and in the second
phase we save ~70%. An example of the log messages from this patch, for
one worker (of two) in the trigram build:

LOG: _gin_parallel_scan_and_build raw 10158870494 compressed 7519211584
ratio 74.02%
LOG: _gin_process_worker_data raw 4593563782 compressed 1314800758
ratio 28.62%
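
Spelling out the arithmetic behind those ratios:

  first phase:   7519211584 / 10158870494 = 74.02%  (compressed / raw)
  second phase:  1314800758 /  4593563782 = 28.62%
  total written: 7519211584 + 1314800758  ~ 8.8GB   (vs. ~10GB raw)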

Put differently, a single-phase version without compression (as in 0001)
would need ~10GB of disk space per worker. With compression, we need
only about ~8.8GB for both phases (or ~7.5GB for the first phase alone).

I do think these numbers look pretty good. The numbers are different for
other opclasses (trigrams are rather extreme in how much space they
need), but the overall behavior is the same.

6) v20240502-0006-Enforce-memory-limit-when-combining-tuples.patch

Up to this point, there's no limit on the memory used when combining
results for a single index key - it'll simply use as much memory as
needed to combine all the TID lists. That may not be a huge issue,
because each TID is only 6B, so we can accumulate a lot of them in a
couple of MB. And a parallel CREATE INDEX usually runs with a fairly
significant value of maintenance_work_mem (in fact, it requires that to
even allow a parallel build). But still, there should be some memory
limit.

It is, however, not as simple as dumping the current state into the
index, because the TID lists produced by the workers may overlap, so the
tail of the list may still receive TIDs from some future TID list. And
that's a problem, because ginEntryInsert() expects to receive TIDs in
order, and if that's not the case it may fail with "could not split GIN
page".

But we already have the first TID of each sort tuple (and we consider it
when sorting the data), which is useful for deducing how far we can
safely flush the data, keeping just the minimal tail of the TID list
that may still change by merging.
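
In code, the freezing is a linear scan over the not-yet-frozen part of
the buffer (this is the gist of the loop in the 0006 patch below):

/* everything before tup->first is final - thanks to sorting by
 * (key, first TID), no future tuple for this key can add TIDs there */
for (int i = buffer->nfrozen; i < buffer->nitems; i++)
{
	/* TID after the first TID of the new tuple? can't freeze it */
	if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
		break;

	buffer->nfrozen++;
}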

So this patch implements that - it introduces the concept of "freezing"
the head of the TID list up to the "first TID" of the next tuple, and
uses that to write data into the index when the memory limit requires it.

We don't want to do that too often, so it only happens if we hit the
memory limit and there are at least a certain number (1024) of frozen
TIDs.
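
The decision itself is cheap - this is essentially GinBufferShouldTrim()
from the 0006 patch below:

static bool
GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
{
	/* not enough frozen TIDs to make the write worthwhile */
	if (buffer->nfrozen < 1024)
		return false;

	/* adding this tuple would not exceed the memory limit yet */
	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
		return false;

	/* over the limit, and enough frozen TIDs - time to write them out */
	return true;
}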

7) v20240502-0007-Detect-wrap-around-in-parallel-callback.patch

There's one more efficiency problem - the parallel scans are required to
be synchronized, i.e. the scan may start half-way through the table and
then wrap around. That however means the TID list will have a very wide
range of TID values, essentially spanning from the min to the max TID
for the key.

Without 0006 this would cause frequent failures of the index build, with
the error I already mentioned:

ERROR: could not split GIN page; all old items didn't fit

Tracking the "safe" TID horizon addresses that. But there's still an
efficiency issue - such a wide TID list forces the mergesort to actually
walk the lists, because the wide list overlaps with every other list
produced by the worker. And that's much more expensive than simply
concatenating them, which is what happens without the wrap around
(because in that case the worker produces non-overlapping lists).

One way to fix this would be to allow parallel scans to not be sync
scans, but that seems fairly tricky and I'm not sure if that can be
done. The BRIN parallel build had a similar issue, and it was just
simpler to deal with this in the build code.

So 0007 does something similar - it tracks whether the TID value goes
backwards in the callback, and if it does, it dumps the state into the
tuplesort before processing the first tuple from the beginning of the
table. That means we end up with two separate "narrow" TID lists, not
one very wide one.
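
The detection itself is just a check at the top of the build callback
(from the 0007 patch below):

/* TID went backwards - the sync scan wrapped around, so flush all the
 * state accumulated so far, before processing this "early" tuple */
if (ItemPointerCompare(tid, &buildstate->tid) < 0)
	ginFlushBuildState(buildstate, index);

/* remember the TID we're about to process */
memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));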

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v20240502-0003-Remove-the-explicit-pg_qsort-in-workers.patch (text/x-patch)
From c1dafb0209ce225e9e825c07ddcb1416ce2435fe Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:36 +0200
Subject: [PATCH v20240502 3/8] Remove the explicit pg_qsort in workers

We don't need to do the explicit sort before building the GIN tuple,
because the mergesort in GinBufferStoreTuple is already maintaining the
correct order (this was added in an earlier commit).

The commit also adds a field with the first TID, and modifies the
comparator to sort by it (for each key value). This helps workers to
build non-overlapping TID lists and simply append values instead of
having to do the actual mergesort to combine them. This is best-effort,
i.e. it's not guaranteed to eliminate the mergesort - in particular,
parallel scans are synchronized, and thus may start somewhere in the
middle of the table and wrap around, in which case there may be a very
wide list (with low/high TID values).

Note: There's an XXX comment with a couple ideas on how to improve this,
at the cost of more complexity.
---
 src/backend/access/gin/gininsert.c | 130 +++++++++++++++++++++--------
 src/include/access/gin_tuple.h     |   9 +-
 2 files changed, 104 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 8011e0b5ad5..5762c9520d8 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1110,12 +1110,6 @@ _gin_parallel_heapscan(GinBuildState *state)
 	return state->bs_reltuples;
 }
 
-static int
-tid_cmp(const void *a, const void *b)
-{
-	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
-}
-
 /*
  * State used to accumulate TIDs from multiple GinTuples for the same
  * key value.
@@ -1147,17 +1141,21 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 }
 
 static void
-AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
-	for (int i = 0; i < nitems; i++)
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
 	{
-		Assert(ItemPointerIsValid(&items[i]));
+		Assert(ItemPointerIsValid(&buffer->items[i]));
 
 		if ((i == 0) || !sorted)
 			continue;
 
-		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
 	}
 #endif
 }
@@ -1220,6 +1218,45 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (memcmp(tup->data, DatumGetPointer(buffer->key), buffer->keylen) == 0);
 }
 
+/*
+ * GinBufferStoreTuple
+ *		Add data from a GinTuple into the GinBuffer.
+ *
+ * If the buffer is empty, we simply initialize it with data from the tuple.
+ * Otherwise data from the tuple (the TID list) is added to the TID data in
+ * the buffer, either by simply appending the TIDs or doing merge sort.
+ *
+ * The data (for the same key) is expected to be processed sorted by first
+ * TID. But this does not guarantee the lists do not overlap, especially in
+ * the leader, because the workers process interleaving data. But even in
+ * a single worker, lists can overlap - parallel scans require sync-scans,
+ * and if the scan starts somewhere in the table and then wraps around, it
+ * may contain very wide lists (in terms of TID range).
+ *
+ * But the ginMergeItemPointers() is already smart about detecting cases
+ * where it can simply concatenate the lists, and when full mergesort is
+ * needed. And does the right thing.
+ *
+ * By keeping the first TID in the GinTuple and sorting by that, we make
+ * it more likely the lists won't overlap very often.
+ *
+ * XXX How frequent can the overlaps be? If the scan does not wrap around,
+ * there should be no overlapping lists, and thus no mergesort. After an
+ * overlap, there probably can be many - the one list will be very wide,
+ * with a very low and high TID, and all other lists will overlap with it.
+ * I'm not sure how much we can do to prevent that, short of disabling sync
+ * scans (which for parallel scans is currently not possible). One option
+ * would be to keep two lists of TIDs, and see if the new list can be
+ * concatenated with one of them. The idea is that there's only one wide
+ * list (because the wraparound happens only once), and then do the
+ * mergesort only once at the very end.
+ *
+ * XXX Alternatively, we could simply detect the case when the lists can't
+ * be appended, and flush the current list out. The wrap around happens only
+ * once, so there's going to be only such wide list, and it'll be sorted
+ * first (because it has the lowest TID for the key). So we'd do this at
+ * most once per key.
+ */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 {
@@ -1246,7 +1283,12 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* copy the new TIDs into the buffer, combine using merge-sort */
+	/*
+	 * Copy the new TIDs into the buffer, combine with existing data (if any)
+	 * using merge-sort. The mergesort is already smart about cases where it
+	 * can simply concatenate the two lists, and when it actually needs to
+	 * merge the data in an expensive way.
+	 */
 	{
 		int			nnew;
 		ItemPointer new;
@@ -1261,21 +1303,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		buffer->items = new;
 		buffer->nitems = nnew;
-	}
-
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
-}
-
-static void
-GinBufferSortItems(GinBuffer *buffer)
-{
-	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
-
-	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
 
-	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
+	}
 }
 
 /* XXX probably would be better to have a memory context for the buffer */
@@ -1299,6 +1329,11 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 
+	if (buffer->items)
+	{
+		pfree(buffer->items);
+		buffer->items = NULL;
+	}
 	/* XXX should do something with extremely large array of items? */
 }
 
@@ -1390,7 +1425,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+			AssertCheckItemPointers(buffer, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1400,14 +1435,17 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1510,7 +1548,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer, true);
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
@@ -1524,7 +1562,10 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
@@ -1534,7 +1575,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		GinTuple   *ntup;
 		Size		ntuplen;
 
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer, true);
 
 		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
@@ -1835,6 +1876,7 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
+	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -1919,6 +1961,12 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
  * assumption that if we get two keys that are two different representations
  * of a logically equal value, it'll get merged by the index build.
  *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
+ *
  * FIXME Is the assumption we can just memcmp() actually valid? Won't this
  * trigger the "could not split GIN page; all old items didn't fit" error
  * when trying to update the TID list?
@@ -1947,20 +1995,34 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b)
 		keya = _gin_parse_tuple(a, NULL);
 		keyb = _gin_parse_tuple(b, NULL);
 
+		/*
+		 * works for both byval and byref types with fixed length, because for
+		 * byval we set keylen to sizeof(Datum)
+		 */
 		if (a->typlen > 0)
-			return memcmp(&keya, &keyb, a->keylen);
+		{
+			int			r = memcmp(&keya, &keyb, a->keylen);
+
+			/* if the key is the same, consider the first TID in the array */
+			return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
+		}
 
 		if (a->typlen < 0)
 		{
+			int			r;
+
 			if (a->keylen < b->keylen)
 				return -1;
 
 			if (a->keylen > b->keylen)
 				return 1;
 
-			return memcmp(DatumGetPointer(keya), DatumGetPointer(keyb), a->keylen);
+			r = memcmp(DatumGetPointer(keya), DatumGetPointer(keyb), a->keylen);
+
+			/* if the key is the same, consider the first TID in the array */
+			return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 		}
 	}
 
-	return 0;
+	return ItemPointerCompare(&a->first, &b->first);
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index b5304b73ff1..c3641edd5fc 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -10,7 +10,13 @@
 #ifndef GIN_TUPLE_
 #define GIN_TUPLE_
 
-
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ */
 typedef struct GinTuple
 {
 	Size		tuplen;			/* length of the whole tuple */
@@ -19,6 +25,7 @@ typedef struct GinTuple
 	bool		typbyval;		/* typbyval for key */
 	OffsetNumber attrnum;
 	signed char category;		/* category: normal or NULL? */
+	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
-- 
2.44.0

v20240502-0004-Compress-TID-lists-before-writing-tuples-t.patch (text/x-patch)
From 6e6fc47199850c55069bf380079f61029f5a1b66 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:39 +0200
Subject: [PATCH v20240502 4/8] Compress TID lists before writing tuples to
 disk

When serializing GIN tuples to tuplesorts, we can significantly reduce
the amount of data by compressing the TID lists. The GIN opclasses may
produce a lot of data (depending on how many keys are extracted from
each row), and the compression is very effective and efficient.

If the number of different keys is high, the first worker pass may not
benefit from the compression very much - the data will be spilled to
disk before the TID lists can grow long enough for the compression to
actually help. In the second pass the impact is much more significant.

For real-world data (full-text on mailing list archives), I usually see
the compression save only about ~15% in the first pass, but ~50% in
the second pass.
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 5762c9520d8..607ce9b34d6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -187,7 +187,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1265,7 +1267,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1306,6 +1309,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer, true);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /* XXX probably would be better to have a memory context for the buffer */
@@ -1806,6 +1812,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1818,6 +1833,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
  * like endianess etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized as compressed - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1831,6 +1851,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -1854,12 +1879,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "invalid typlen");
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * SHORTALIGN, but we do MAXALIGN anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -1909,37 +1956,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible, the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -1952,6 +2002,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array of the decompressed TIDs.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -1992,8 +2064,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b)
 	if ((a->category == GIN_CAT_NORM_KEY) &&
 		(b->category == GIN_CAT_NORM_KEY))
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		/*
 		 * works for both byval and byref types with fixed length, because for
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0ee1b9ab19b..97b8770ed5f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1023,6 +1023,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinShared
 GinState
 GinStatsData
-- 
2.44.0

v20240502-0005-Collect-and-print-compression-stats.patch (text/x-patch)
From 6d726c68fef8969b8c6acba0f6647dd1db83174e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:43 +0200
Subject: [PATCH v20240502 5/8] Collect and print compression stats

Allows evaluating the benefits of compressing the TID lists.
---
 src/backend/access/gin/gininsert.c | 36 +++++++++++++++++++++++++-----
 src/include/access/gin.h           |  2 ++
 2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 607ce9b34d6..acfc4e56838 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -190,7 +190,8 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 static ItemPointer _gin_parse_tuple_items(GinTuple *a);
 static Datum _gin_parse_tuple_key(GinTuple *a);
 
-static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+static GinTuple *_gin_build_tuple(GinBuildState *state,
+								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
 								  Size *len);
@@ -539,7 +540,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(buildstate, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
@@ -1530,6 +1531,15 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* sort the raw per-worker data */
 	tuplesort_performsort(state->bs_worker_sort);
 
+	/* print some basic info */
+	elog(LOG, "_gin_parallel_scan_and_build raw %lu compressed %lu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
+	/* reset before the second phase */
+	state->buildStats.sizeCompressed = 0;
+	state->buildStats.sizeRaw = 0;
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
 	 * merge them into larger chunks for the leader to combine.
@@ -1556,7 +1566,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 */
 			AssertCheckItemPointers(buffer, true);
 
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
 									buffer->items, buffer->nitems, &ntuplen);
 
@@ -1583,7 +1593,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 		AssertCheckItemPointers(buffer, true);
 
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
 								buffer->items, buffer->nitems, &ntuplen);
 
@@ -1598,6 +1608,11 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* release all the memory */
 	GinBufferFree(buffer);
 
+	/* print some basic info */
+	elog(LOG, "_gin_process_worker_data raw %lu compressed %lu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
 	tuplesort_end(worker_sort);
 }
 
@@ -1669,7 +1684,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(state, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
@@ -1763,6 +1778,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* initialize the GIN build state */
 	initGinState(&buildstate.ginstate, indexRel);
 	buildstate.indtuples = 0;
+	/* XXX shouldn't this initialize the other fields, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
 	/*
@@ -1840,7 +1856,8 @@ typedef struct
  * of that into the GIN tuple.
  */
 static GinTuple *
-_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+_gin_build_tuple(GinBuildState *state,
+				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
 				 ItemPointerData *items, uint32 nitems,
 				 Size *len)
@@ -1971,6 +1988,13 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 		pfree(seginfo);
 	}
 
+	/* how large would the tuple be without compression? */
+	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		nitems * sizeof(ItemPointerData);
+
+	/* compressed size */
+	state->buildStats.sizeCompressed += tuplen;
+
 	return tuple;
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index be76d8446f4..2b6633d068a 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -49,6 +49,8 @@ typedef struct GinStatsData
 	BlockNumber nDataPages;
 	int64		nEntries;
 	int32		ginVersion;
+	Size		sizeRaw;
+	Size		sizeCompressed;
 } GinStatsData;
 
 /*
-- 
2.44.0

v20240502-0006-Enforce-memory-limit-when-combining-tuples.patch (text/x-patch)
From ace40a448684ffb1fac1e4630b5657d8ffd3d27d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:49 +0200
Subject: [PATCH v20240502 6/8] Enforce memory limit when combining tuples

When combining intermediate results during a parallel GIN index build, we
want to restrict the memory usage. In ginBuildCallbackParallel() this is
done simply by dumping the working state into a tuplesort after hitting
the memory limit.

This commit introduces a memory limit to the following steps, merging the
intermediate results in both the worker and the leader. The merge only
deals with one key at a time, and the primary risk is that a key might
have too many different TIDs. While this is not very likely (the TID
array only needs 6B per item, so a lot of them fit in memory), it's
still a potential issue.

We can't simply dump the whole current TID list - the index requires the
TID values to be inserted in the correct order, but if the lists overlap
(as they do between workers), the tail of the list may change during the
mergesort. But thanks to sorting GIN tuples by first TID, we can derive
a safe TID horizon - we know no future tuples will have TIDs from before
this value, so it's safe to output this part of the list.

This commit tracks the "frozen" part of the TID list, which is the part
we know won't change after merging additional TID lists. Then if the TID
list grows too large (more than 64kB), we try to trim it - write out the
frozen part of the list, and discard it from the buffer. We only do the
trimming if there are at least 1024 frozen items - we don't want to write
the data into the index in tiny chunks.

The freezing also allows us to skip the frozen part during mergesort.
The frozen part of the list is known to be fully sorted, so we can just
skip it and mergesort only the rest of the data.

Note: These limits (1024 and 64kB) are mostly arbitrary, but seem high
enough to get good efficiency for compression/batching, and low enough
to release memory early and work in small increments.
---
 src/backend/access/gin/gininsert.c | 245 +++++++++++++++++++++++++++--
 src/include/access/gin.h           |   1 +
 2 files changed, 237 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index acfc4e56838..1d9557692a3 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1130,8 +1130,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempt to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	ItemPointerData *items;
 } GinBuffer;
 
@@ -1166,7 +1170,21 @@ AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 static GinBuffer *
 GinBufferInit(void)
 {
-	return palloc0(sizeof(GinBuffer));
+	GinBuffer  *buffer = (GinBuffer *) palloc0(sizeof(GinBuffer));
+
+	/*
+	 * How many items can we fit into the memory limit? 64kB seems more than
+	 * enough and we don't want a limit that's too high. OTOH maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 *
+	 * XXX This is not enough to prevent repeated merges after a wraparound,
+	 * but it should be enough to make the merges cheap because it quickly
+	 * reaches the end of the second list and can just memcpy the rest
+	 * without walking it item by item.
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
+	return buffer;
 }
 
 static bool
@@ -1221,6 +1239,54 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (memcmp(tup->data, DatumGetPointer(buffer->key), buffer->keylen) == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we understand writing out and removing the tuple IDs that
+ * we know can't change by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding if to trim - if the resulting list
+ * (after adding TIDs from the new tuple) would be too long, and if there is
+ * enough TIDs to trim (with values less than "first" TID from the new tuple),
+ * we do the trim. By enough we mean at least 1024 TIDs (mostly an arbitrary
+ * number).
+ *
+ * XXX This does help for the wrap around case, because the "wide" TID list
+ * is essentially two ranges - one at the beginning of the table, one at the
+ * end. And all the other ranges (from GIN tuples) come in between, and also
+ * do not overlap. So by trimming up to the range we're about to add, this
+ * guarantees we'll be able to "concatenate" the two lists cheaply.
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* We're not going to hit the memory limit after adding this tuple. */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data from a GinTuple into the GinBuffer.
@@ -1259,6 +1325,11 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  * once, so there's going to be only such wide list, and it'll be sorted
  * first (because it has the lowest TID for the key). So we'd do this at
  * most once per key.
+ *
+ * XXX Maybe we could/should allocate the buffer once and then keep it
+ * without palloc/pfree. That won't help when just calling the mergesort,
+ * as that does palloc internally, but if we detected the append case,
+ * we could do without it. Not sure how much overhead it is, though.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1287,26 +1358,82 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now search the list linearly, to find the last frozen TID. If we found
+	 * the whole list is frozen, this just does nothing.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher.
+	 *
+	 * XXX Maybe this should do a binary search if the number of "non-frozen"
+	 * items is sufficiently high (enough to make linear search slower than
+	 * binsearch).
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/*
 	 * Copy the new TIDs into the buffer, combine with existing data (if any)
 	 * using merge-sort. The mergesort is already smart about cases where it
 	 * can simply concatenate the two lists, and when it actually needs to
 	 * merge the data in an expensive way.
+	 *
+	 * XXX We could check if (buffer->nitems > buffer->nfrozen) and only do
+	 * the mergesort in that case. ginMergeItemPointers does some palloc
+	 * internally, and this way we could eliminate that. But let's keep the
+	 * code simple for now.
 	 */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		pfree(new);
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer, true);
 	}
@@ -1332,6 +1459,7 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
@@ -1344,6 +1472,23 @@ GinBufferReset(GinBuffer *buffer)
 	/* XXX should do something with extremely large array of items? */
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /* XXX probably would be better to have a memory context for the buffer */
 static void
 GinBufferFree(GinBuffer *buffer)
@@ -1402,7 +1547,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers have already completed.
+	 */
 	buffer = GinBufferInit();
 
 	/*
@@ -1442,6 +1592,36 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 *
+		 * XXX The buffer may also be empty, but in that case we skip this.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1465,6 +1645,8 @@ _gin_parallel_merge(GinBuildState *state)
 	/* release all the memory */
 	GinBufferFree(buffer);
 
+	elog(LOG, "_gin_parallel_merge ntrims %ld", state->buildStats.nTrims);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1525,7 +1707,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit();
 
 	/* sort the raw per-worker data */
@@ -1578,6 +1766,43 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 *
+		 * XXX The buffer may also be empty, but in that case we skip this.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The buffer is not empty and we've hit the memory limit - write
+			 * the frozen part of the TID list into the tuplesort, and keep
+			 * just the unfrozen tail in the buffer.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
@@ -1613,6 +1838,8 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
 		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
 
+	elog(LOG, "_gin_process_worker_data trims %ld", state->buildStats.nTrims);
+
 	tuplesort_end(worker_sort);
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 2b6633d068a..9381329fac5 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -51,6 +51,7 @@ typedef struct GinStatsData
 	int32		ginVersion;
 	Size		sizeRaw;
 	Size		sizeCompressed;
+	int64		nTrims;
 } GinStatsData;
 
 /*
-- 
2.44.0

v20240502-0007-Detect-wrap-around-in-parallel-callback.patch (text/x-patch)
From a569086cd2dc70ee0b9152980eb2301d99f8c580 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:55 +0200
Subject: [PATCH v20240502 7/8] Detect wrap around in parallel callback

When the sync scan during an index build wraps around, that may result
in some keys having very long TID lists, requiring "full" merge sort
runs when combining data in workers. It also causes problems with
enforcing the memory limit, because we can't just dump the data - the
index build requires append-only posting lists, and violating that may
result in errors like

  ERROR: could not split GIN page; all old items didn't fit

because after the scan wraps around, some of the TIDs may belong to the
beginning of the list, affecting the compression.

But we can deal with this in the callback - if we see the TID jump
backwards, that must mean a wraparound happened. In that case we simply
dump all the data accumulated in memory, and start from scratch.

This means there won't be any tuples with very wide TID ranges, instead
there'll be one tuple with a range at the end of the table, and another
tuple at the beginning. And all the lists in the worker will be
non-overlapping, and sort nicely based on first TID.

For the leader, we still need to do the full merge - the lists may
overlap and interleave in various ways. But there should be only very
few of those lists, about one per worker, making it not an issue.
---
 src/backend/access/gin/gininsert.c | 89 ++++++++++++++++++------------
 1 file changed, 55 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 1d9557692a3..acdf45416fe 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -142,6 +142,7 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
 
 	/* FIXME likely duplicate with indtuples */
 	double		bs_numtuples;
@@ -474,6 +475,47 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(buildstate, attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
+/*
+ * FIXME Another way to deal with the wrap around of sync scans would be to
+ * detect when tid wraps around and just flush the state.
+ */
 static void
 ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 						 bool *isnull, bool tupleIsAlive, void *state)
@@ -484,6 +526,16 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
 
+	/* flush contents before wrapping around */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+	{
+		elog(LOG, "calling ginFlushBuildState");
+		ginFlushBuildState(buildstate, index);
+	}
+
+	/* remember the TID we're about to process */
+	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
+
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
 		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
 							   values[i], isnull[i], tid);
@@ -518,40 +570,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 	 * XXX probably should use 32MB, not work_mem, as used during planning?
 	 */
 	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			/* GIN tuple and tuple length */
-			GinTuple   *tup;
-			Size		tuplen;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(buildstate, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &tuplen);
-
-			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
-	}
+		ginFlushBuildState(buildstate, index);
 
 	MemoryContextSwitchTo(oldCtx);
 }
@@ -587,6 +606,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.bs_numtuples = 0;
 	buildstate.bs_reltuples = 0;
 	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
@@ -2007,6 +2027,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	buildstate.indtuples = 0;
 	/* XXX shouldn't this initialize the other fields, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/*
 	 * create a temporary memory context that is used to hold data not yet
-- 
2.44.0

gin-parallel-absolute.png (image/png)
gin-parallel-relative.png (image/png)
v20240502-0001-Allow-parallel-create-for-GIN-indexes.patch (text/x-patch)
From 35cf84ee568df2cb7eb4027dcf84fefd02e45509 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:26 +0200
Subject: [PATCH v20240502 1/8] Allow parallel create for GIN indexes

Add support for parallel create of GIN indexes, using an approach and
code very similar to the one used by BRIN indexes.

Each worker reads a subset of the table (from a parallel scan) and
accumulates index entries in memory. But instead of writing the results
into the index after hitting the memory limit, the data is written to a
shared tuplesort (and sorted by index key).

The leader then reads the data from the tuplesort and combines it into
entries that get inserted into the index.
---
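(Not part of the patch, just to make review easier: a condensed sketch
of the worker/leader control flow, strung together from the functions
the diff below introduces. Setup, error handling and most arguments are
omitted, so it won't compile on its own.)

    /* each participant: scan its chunk of the heap, spilling entries
     * into the shared tuplesort whenever the accumulator fills up */
    if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
    {
        tup = _gin_build_tuple(attnum, category, key,
                               attr->attlen, attr->attbyval,
                               list, nlist, &tuplen);
        tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
        pfree(tup);
    }
    ...
    tuplesort_performsort(buildstate->bs_sortstate);  /* partial sort */

    /* leader: read everything back in key order, batching the TIDs for
     * runs of equal keys, and insert each combined entry just once */
    while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
    {
        if (!GinBufferCanAddKey(buffer, tup))
        {
            GinBufferSortItems(buffer);       /* TIDs arrive unsorted */
            ginEntryInsert(&state->ginstate,
                           buffer->attnum, buffer->key, buffer->category,
                           buffer->items, buffer->nitems, &state->buildStats);
            GinBufferReset(buffer);
        }
        GinBufferStoreTuple(buffer, tup);
    }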
 src/backend/access/gin/ginbulk.c           |    7 +
 src/backend/access/gin/gininsert.c         | 1335 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  154 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   28 +
 src/include/utils/tuplesort.h              |    6 +
 src/tools/pgindent/typedefs.list           |    4 +
 9 files changed, 1528 insertions(+), 16 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/ginbulk.c b/src/backend/access/gin/ginbulk.c
index a522801c2f7..12eeff04a6c 100644
--- a/src/backend/access/gin/ginbulk.c
+++ b/src/backend/access/gin/ginbulk.c
@@ -153,6 +153,13 @@ ginInsertBAEntry(BuildAccumulator *accum,
 	GinEntryAccumulator *ea;
 	bool		isNew;
 
+	/*
+	 * FIXME prevents writes of uninitialized bytes reported by valgrind in
+	 * writetup (likely that _gin_build_tuple copies some fields that are only
+	 * initialized for a certain category, or something similar)
+	 */
+	memset(&eatmp, 0, sizeof(GinEntryAccumulator));
+
 	/*
 	 * For the moment, fill only the fields of eatmp that will be looked at by
 	 * cmpEntryAccumulator or ginCombineData.
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 71f38be90c3..4d6d152403e 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,124 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields before heapdesc.
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinShared;
+
+/*
+ * Return pointer to a GinShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +142,49 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process. (Actually, only the leader process has a
+	 * GinBuildState.)
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +463,95 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * XXX idea - Instead of writing the entries directly into the shared
+	 * tuplesort, write it into a local one, do the sort in the worker, and
+	 * combine the results. For large tables with many different keys that's
+	 * going to work better than the current approach where we don't get many
+	 * matches in work_mem (maybe this should use 32MB, which is what we use
+	 * when planning, but even that may not be great). Which means we are
+	 * likely to have many entries with a single TID, forcing the leader to do
+	 * a qsort() when merging the data, often amounting to ~50% of the serial
+	 * part. By doing the qsort() in a worker, the leader can then do a
+	 * mergesort (likely cheaper). Also, it means the amount of data passed
+	 * from workers to the leader is going to be lower thanks to
+	 * deduplication.
+	 *
+	 * Disadvantage: It needs more disk space, possibly up to 2x, because each
+	 * worker creates a local tuplesort, then "transforms it" into the shared
+	 * tuplesort (hopefully less data, but not guaranteed).
+	 *
+	 * It's however possible to partition the data into multiple tuplesorts
+	 * per worker (by hashing). We don't need perfect sorting, and we can even
+	 * live with "equal" keys having multiple hashes (if there are multiple
+	 * binary representations of the value).
+	 */
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort
+	 *
+	 * XXX probably should use 32MB, not work_mem, as used during planning?
+	 */
+	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&buildstate->accum);
+		while ((list = ginGetBAEntry(&buildstate->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			/* GIN tuple and tuple length */
+			GinTuple   *tup;
+			Size		tuplen;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &tuplen);
+
+			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(buildstate->tmpCtx);
+		ginInitBA(&buildstate->accum);
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +569,14 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * XXX Make sure to initialize a bunch of fields, not to trip valgrind.
+	 * Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,25 +617,91 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. That makes sense
+	 * for btree, but not for GIN, which can do with much less memory. So
+	 * maybe make that somehow less strict, optionally?
+	 */
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
+
+
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, (void *) &buildstate,
-									   NULL);
+	if (state->bs_leader)
+	{
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+		/* scan the relation and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
+
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, (void *) &buildstate,
+										   NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -534,3 +841,1001 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	/*
+	 * Store the shared state and the tuplesort state in the TOC, so that
+	 * worker processes can look them up.
+	 */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinShared  *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+static int
+tid_cmp(const void *a, const void *b)
+{
+	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+}
+
+/*
+ * State used to accumulate TIDs from multiple GinTuples for the same
+ * key value.
+ *
+ * XXX Similar purpose to BuildAccumulator, but much simpler.
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	int			maxitems;
+	ItemPointerData *items;
+} GinBuffer;
+
+/* XXX should do more checks */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	Assert(buffer->nitems <= buffer->maxitems);
+#endif
+}
+
+static void
+AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+{
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 0; i < nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&items[i]));
+
+		if ((i == 0) || !sorted)
+			continue;
+
+		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+	}
+#endif
+}
+
+static GinBuffer *
+GinBufferInit(void)
+{
+	return palloc0(sizeof(GinBuffer));
+}
+
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * Compare if the tuple matches the already accumulated data. Compare
+ * scalar fields first, before the actual key.
+ *
+ * XXX The key is compared using memcmp, which means that if a key has
+ * multiple binary representations, we may end up treating them as
+ * different here. But that's OK, the index will merge them anyway.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	if (tup->keylen != buffer->keylen)
+		return false;
+
+	/*
+	 * For NULL/empty keys, this means equality, for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * Compare the key value, depending on the type information.
+	 *
+	 * XXX Not sure this works correctly for byval types that don't need the
+	 * whole Datum. What if there is garbage in the padding bytes?
+	 */
+	if (buffer->typbyval)
+		return (buffer->key == *(Datum *) tup->data);
+
+	/* byref values are simply compared using memcmp */
+	return (memcmp(tup->data, DatumGetPointer(buffer->key), buffer->keylen) == 0);
+}
+
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* enlarge the TID buffer, if needed */
+	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	{
+		/* 64 seems like a good init value */
+		buffer->maxitems = Max(buffer->maxitems, 64);
+
+		while (buffer->nitems + tup->nitems > buffer->maxitems)
+			buffer->maxitems *= 2;
+
+		if (buffer->items == NULL)
+			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 buffer->maxitems * sizeof(ItemPointerData));
+	}
+
+	/* now we should be guaranteed to have enough space for all the TIDs */
+	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+
+	/* copy the new TIDs into the buffer */
+	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
+	buffer->nitems += tup->nitems;
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+}
+
+static void
+GinBufferSortItems(GinBuffer *buffer)
+{
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+}
+
+/* XXX probably would be better to have a memory context for the buffer */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/* XXX not really needed, but easier to trigger NULL deref etc. */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+
+	/* XXX should do something with extremely large array of items? */
+}
+
+/*
+ * XXX Maybe check the size of the TID array, and return false if it's too
+ * large (more than maintenance_work_mem or something?).
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, merge the per-worker results into
+ * the complete index. The results from each worker arrive sorted by the
+ * index key. While combining the per-worker results we accumulate the TID
+ * lists for runs of entries with the same key, and insert each combined
+ * entry into the index at once.
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME probably should have local memory contexts similar to what
+ * _brin_parallel_merge does.
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit();
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us order matching how data is organized in the
+	 * index.
+	 *
+	 * XXX Maybe we should sort by key first, then by category?
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert them all at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinBufferSortItems(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(sortmem, coordinate,
+													TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write the remaining accumulated entries into the shared tuplesort */
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&state->accum);
+		while ((list = ginGetBAEntry(&state->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			GinTuple   *tup;
+			Size		len;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &len);
+
+			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(state->tmpCtx);
+		ginInitBA(&state->accum);
+	}
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have an actual non-empty key. We include varlena headers and \0 bytes
+	 * for strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to do the MAXALIGN to allow
+	 * direct access to the item pointers.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "invalid typlen");
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * allocate space for the whole GIN tuple
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplestore representation.
+ *
+ * Most of the fields are actually directly accessible, the only thing that
+ * needs more care is the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, the key value is
+ * compared last. The comparisons are done simply by "memcmp", based on the
+ * assumption that if we get two keys that are two different representations
+ * of a logically equal value, they'll get merged by the index build.
+ *
+ * FIXME Is the assumption we can just memcmp() actually valid? Won't this
+ * trigger the "could not split GIN page; all old items didn't fit" error
+ * when trying to update the TID list?
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b)
+{
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if ((a->category == GIN_CAT_NORM_KEY) &&
+		(b->category == GIN_CAT_NORM_KEY))
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		/* for byval types, compare the copies of the whole Datum directly */
+		if (a->typbyval)
+			return memcmp(&keya, &keyb, a->keylen);
+
+		/* byref types (fixed and variable length): compare the data bytes */
+		if (a->keylen < b->keylen)
+			return -1;
+
+		if (a->keylen > b->keylen)
+			return 1;
+
+		return memcmp(DatumGetPointer(keya), DatumGetPointer(keyb), a->keylen);
+	}
+
+	return 0;
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 5747ae6a4ca..dd22b44aca9 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb54..c9ea769afb5 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 05a853caa36..55cc55969e5 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,6 +20,7 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
@@ -46,6 +47,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +77,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +87,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -580,6 +589,35 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+
+Tuplesortstate *
+tuplesort_begin_index_gin(int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	base->nKeys = 1;			/* Only the index key */
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -817,6 +855,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tup, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tup, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -989,6 +1058,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1777,6 +1869,68 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 25983b7a505..be76d8446f4 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 00000000000..b5304b73ff1
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,28 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations for GIN tuples produced by parallel index builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+
+typedef struct GinTuple
+{
+	Size		tuplen;			/* length of the whole tuple */
+	Size		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	OffsetNumber attrnum;
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index e7941a1f09f..35fa5ae2442 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,8 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +459,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tup, Size len);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tup, Size len);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +469,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e10ff28ee54..0ee1b9ab19b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1004,11 +1004,13 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1021,9 +1023,11 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinShared
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.44.0

v20240502-0002-Use-mergesort-in-the-leader-process.patch
From bc51aa8ff10b0a53f65564ace2a14da671363341 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:32 +0200
Subject: [PATCH v20240502 2/8] Use mergesort in the leader process

The leader process (executing the serial part of the index build) spent
a significant part of the time in pg_qsort, after combining the partial
results from the workers. But we can improve this and move some of the
costs to the parallel part in workers - if workers produce sorted TID
lists, the leader can combine them by mergesort.

But to make this really efficient, the mergesort must not be executed
too many times. The workers may easily produce very short TID lists, if
there are many different keys, hitting the memory limit often. So this
adds an intermediate tuplesort pass into each worker, to combine TIDs
for each key and only then write the result into the shared tuplesort.

This means the number of mergesort invocations for each key should be
about the same as the number of workers. We can't really do better, and
it's low enough to keep the mergesort approach efficient.

Note: If we introduce a memory limit on GinBuffer (to not accumulate too
many TIDs in memory), we could end up with more chunks, but it should
not be very common.
---
 src/backend/access/gin/gininsert.c | 171 +++++++++++++++++++++++++----
 1 file changed, 148 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 4d6d152403e..8011e0b5ad5 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -161,6 +161,14 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate is used only within a worker for the first merge pass
+	 * that happens in the worker. In principle it doesn't need to be part of
+	 * the build state and we could pass it around directly, but it's more
+	 * convenient this way.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -533,7 +541,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
-			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
 
 			pfree(tup);
 		}
@@ -1127,7 +1135,6 @@ typedef struct GinBuffer
 
 	/* array of TID values */
 	int			nitems;
-	int			maxitems;
 	ItemPointerData *items;
 } GinBuffer;
 
@@ -1136,7 +1143,6 @@ static void
 AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
-	Assert(buffer->nitems <= buffer->maxitems);
 #endif
 }
 
@@ -1240,28 +1246,22 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* enlarge the TID buffer, if needed */
-	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	/* copy the new TIDs into the buffer, combine using merge-sort */
 	{
-		/* 64 seems like a good init value */
-		buffer->maxitems = Max(buffer->maxitems, 64);
+		int			nnew;
+		ItemPointer new;
 
-		while (buffer->nitems + tup->nitems > buffer->maxitems)
-			buffer->maxitems *= 2;
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
 
-		if (buffer->items == NULL)
-			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
-		else
-			buffer->items = repalloc(buffer->items,
-									 buffer->maxitems * sizeof(ItemPointerData));
-	}
+		Assert(nnew == buffer->nitems + tup->nitems);
 
-	/* now we should be guaranteed to have enough space for all the TIDs */
-	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+		if (buffer->items)
+			pfree(buffer->items);
 
-	/* copy the new TIDs into the buffer */
-	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
-	buffer->nitems += tup->nitems;
+		buffer->items = new;
+		buffer->nitems = nnew;
+	}
 
 	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
 }
@@ -1302,6 +1302,21 @@ GinBufferReset(GinBuffer *buffer)
 	/* XXX should do something with extremely large array of items? */
 }
 
+/* XXX probably would be better to have a memory context for the buffer */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
 /*
  * XXX Maybe check size of the TID arrays, and return false if it's too
+ * large (more than maintenance_work_mem or something?).
@@ -1375,7 +1390,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1392,7 +1407,7 @@ _gin_parallel_merge(GinBuildState *state)
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1402,6 +1417,9 @@ _gin_parallel_merge(GinBuildState *state)
 		GinBufferReset(buffer);
 	}
 
+	/* release all the memory */
+	GinBufferFree(buffer);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1440,6 +1458,102 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short. But combining many tiny lists is expensive,
+ * so we try to do as much as possible in the workers and only then pass the
+ * results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker. Which means the leader will
+ * do a very limited number of mergesorts, which is good.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit();
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the local per-worker tuplesort, sorted by
+	 * the key, and merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the tuplesort first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the tuplesort, and start a new entry for the
+			 * current GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		GinBufferSortItems(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1471,6 +1585,10 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	state->bs_sortstate = tuplesort_begin_index_gin(sortmem, coordinate,
 													TUPLESORT_NONE);
 
+	/* Local per-worker sort of the raw data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(sortmem, NULL,
+													  TUPLESORT_NONE);
+
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1508,7 +1626,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
-			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
 
 			pfree(tup);
 		}
@@ -1517,6 +1635,13 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 		ginInitBA(&state->accum);
 	}
 
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place those
+	 * into the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
-- 
2.44.0

#2Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#1)
Re: Parallel CREATE INDEX for GIN indexes

On Thu, 2 May 2024 at 17:19, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

In PG17 we shall have parallel CREATE INDEX for BRIN indexes, and back
when working on that I was thinking how difficult would it be to do
something similar to do that for other index types, like GIN. I even had
that on my list of ideas to pitch to potential contributors, as I was
fairly sure it's doable and reasonably isolated / well-defined.

However, I was not aware of any takers, so a couple days ago on a slow
weekend I took a stab at it. And yes, it's doable - attached is a fairly
complete, tested and polished version of the feature, I think. It turned
out to be a bit more complex than I expected, for reasons that I'll get
into when discussing the patches.

This is great. I've been thinking about approximately the same issue
recently, too, but haven't had time to discuss/implement any of this
yet. I think some solutions may even be portable to the btree parallel
build: it also has key deduplication (though to a much smaller degree)
and could benefit from deduplication during the scan/ssup load phase,
rather than only during insertion.

First, let's talk about the benefits - how much faster is that than the
single-process build we have for GIN indexes? I do have a table with the
archive of all our mailing lists - it's ~1.5M messages, table is ~21GB
(raw dump is about 28GB). This does include simple text data (message
body), JSONB (headers) and tsvector (full-text on message body).

Sidenote: Did you include the tsvector in the table to reduce time
spent during index creation? I would have used an expression in the
index definition, rather than a direct column.

If I do CREATE index with different number of workers (0 means serial
build), I get this timings (in seconds):

[...]

This shows the benefits are pretty nice, depending on the opclass. For
most indexes it's maybe ~3-4x faster, which is nice, and I don't think
it's possible to do much better - the actual index inserts can happen
from a single process only, which is the main limit.

Can we really not insert with multiple processes? It seems to me that
GIN could be very suitable for that purpose, with its clear double
tree structure distinction that should result in few buffer conflicts
if different backends work on known-to-be-very-different keys.
We'd probably need multiple read heads on the shared tuplesort, and a
way to join the generated top-level subtrees, but I don't think that
is impossible. Maybe it's work for later effort though.

Have you tested and/or benchmarked this with multi-column GIN indexes?

For some of the opclasses it can regress (like the jsonb_path_ops). I
don't think that's a major issue. Or more precisely, I'm not surprised
by it. It'd be nice to be able to disable the parallel builds in these
cases somehow, but I haven't thought about that.

Do you know why it regresses?

I do plan to do some tests with btree_gin, but I don't expect that to
behave significantly differently.

There are small variations in the index size, when built in the serial
way and the parallel way. It's generally within ~5-10%, and I believe
it's due to the serial build adding the TIDs incrementally, while the
build adds them in much larger chunks (possibly even in one chunk with
all the TIDs for the key).

I assume that was '[...] while the [parallel] build adds them [...]', right?

I believe the same size variation can happen
if the index gets built in a different way, e.g. by inserting the data
in a different order, etc. I did a number of tests to check if the index
produces the correct results, and I haven't found any issues. So I think
this is OK, and neither a problem nor an advantage of the patch.

Now, let's talk about the code - the series has 7 patches, with 6
non-trivial parts doing changes in focused and easier to understand
pieces (I hope so).

The following comments are generally based on the descriptions; I
haven't really looked much at the patches yet except to validate some
assumptions.

1) v20240502-0001-Allow-parallel-create-for-GIN-indexes.patch

This is the initial feature, adding the "basic" version, implemented as
pretty much 1:1 copy of the BRIN parallel build and minimal changes to
make it work for GIN (mostly about how to store intermediate results).

The basic idea is that the workers do the regular build, but instead of
flushing the data into the index after hitting the memory limit, it gets
written into a shared tuplesort and sorted by the index key. And the
leader then reads this sorted data, accumulates the TID for a given key
and inserts that into the index in one go.

In the code, GIN insertions are still basically single btree
insertions, all starting from the top (but with many same-valued
tuples at once). Now that we have a tuplesort with the full table's
data, couldn't the code be adapted to do more efficient btree loading,
such as that seen in the nbtree code, where the rightmost pages are
cached and filled sequentially without requiring repeated searches
down the tree? I suspect we can gain a lot of time there.

I don't need you to do that, but what's your opinion on this?
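
To sketch the kind of thing I mean (untested, the helper is invented, and
WAL-logging plus the parent downlinks are omitted entirely): keep the
current rightmost entry-tree leaf pinned and append the pre-sorted tuples
to it, splitting only when the page fills up:

static void
gin_append_sorted_entry(Relation index, Buffer *rightmost, IndexTuple itup)
{
	Page		page = BufferGetPage(*rightmost);
	Size		itemsz = MAXALIGN(IndexTupleSize(itup));

	if (PageGetFreeSpace(page) < itemsz)
	{
		/* leaf is full - start a new rightmost leaf (linking it into
		 * the tree and WAL-logging are left out of this sketch) */
		Buffer		newbuf = GinNewBuffer(index);

		GinInitBuffer(newbuf, GIN_LEAF);
		MarkBufferDirty(*rightmost);
		UnlockReleaseBuffer(*rightmost);
		*rightmost = newbuf;
		page = BufferGetPage(newbuf);
	}

	/* tuples arrive in key order, so appending at the end is correct */
	if (PageAddItem(page, (Item) itup, itemsz, InvalidOffsetNumber,
					false, false) == InvalidOffsetNumber)
		elog(ERROR, "failed to add item to GIN entry page");
}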

2) v20240502-0002-Use-mergesort-in-the-leader-process.patch

The approach implemented by 0001 works, but there's a little bit of
issue - if there are many distinct keys (e.g. for trigrams that can
happen very easily), the workers will hit the memory limit with only
very short TID lists for most keys. For serial build that means merging
the data into a lot of random places, and in parallel build it means the
leader will have to merge a lot of tiny lists from many sorted rows.

Which can be quite annoying and expensive, because the leader does so
using qsort() in the serial part. It'd be better to ensure most of the
sorting happens in the workers, and the leader can do a mergesort. But
the mergesort must not happen too often - merging many small lists is
not cheaper than a single qsort (especially when the lists overlap).

So this patch changes the workers to process the data in two phases. The
first works as before, but the data is flushed into a local tuplesort.
And then each workers sorts the results it produced, and combines them
into results with much larger TID lists, and those results are written
to the shared tuplesort. So the leader only gets very few lists to
combine for a given key - usually just one list per worker.

Hmm, I was hoping we could implement the merging inside the tuplesort
itself during its own flush phase, as it could save significantly on
IO, and could help other users of tuplesort with deduplication, too.
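
Purely hypothetical sketch of what I mean - no such callback exists in
tuplesort today, and the names are made up:

/* merge two tuples that compare as equal, before they hit the tape */
typedef void *(*TuplesortCombineFunc) (Tuplesortstate *state,
									   void *a, void *b, Size *len);

/* variant of the 0001 entry point that registers the callback */
extern Tuplesortstate *tuplesort_begin_index_gin_dedup(int workMem,
													   SortCoordinate coordinate,
													   TuplesortCombineFunc combine,
													   int sortopt);

For GIN the callback would essentially be _gin_build_tuple() applied to
the union of the two TID lists, so each run written to disk would already
be deduplicated.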

3) v20240502-0003-Remove-the-explicit-pg_qsort-in-workers.patch

In 0002 the workers still do an explicit qsort() on the TID list before
writing the data into the shared tuplesort. But we can do better - the
workers can do a merge sort too. To help with this, we add the first TID
to the tuplesort tuple, and sort by that too - it helps the workers to
process the data in an order that allows simple concatenation instead of
the full mergesort.

Note: There's a non-obvious issue due to parallel scans always being
"sync scans", which may lead to very "wide" TID ranges when the scan
wraps around. More about that later.

As this note seems to imply, this seems to have a strong assumption
that data received in parallel workers is always in TID order, with
one optional wraparound. Non-HEAP TAMs may break with this assumption,
so what's the plan on that?

4) v20240502-0004-Compress-TID-lists-before-writing-tuples-t.patch

The parallel build passes data between processes using temporary files,
which means it may need significant amount of disk space. For BRIN this
was not a major concern, because the summaries tend to be pretty small.

But for GIN that's not the case, and the two-phase processing introduced
by 0002 make it worse, because the worker essentially creates another
copy of the intermediate data. It does not need to copy the key, so
maybe it's not exactly 2x the space requirement, but in the worst case
it's not far from that.

But there's a simple way how to improve this - the TID lists tend to be
very compressible, and GIN already implements a very light-weight TID
compression, so this patch does just that - when building the tuple to
be written into the tuplesort, we just compress the TIDs.

See note on 0002: Could we do this in the tuplesort writeback, rather
than by moving the data around multiple times?

[...]

So 0007 does something similar - it tracks if the TID value goes
backward in the callback, and if it does it dumps the state into the
tuplesort before processing the first tuple from the beginning of the
table. Which means we end up with two separate "narrow" TID list, not
one very wide one.

See note above: We may still need a merge phase, just to make sure we
handle all TAM parallel scans correctly, even if that merge join phase
wouldn't get hit in vanilla PostgreSQL.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#3Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Matthias van de Meent (#2)
Re: Parallel CREATE INDEX for GIN indexes

On 5/2/24 19:12, Matthias van de Meent wrote:

On Thu, 2 May 2024 at 17:19, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Hi,

In PG17 we shall have parallel CREATE INDEX for BRIN indexes, and back
when working on that I was thinking how difficult would it be to do
something similar to do that for other index types, like GIN. I even had
that on my list of ideas to pitch to potential contributors, as I was
fairly sure it's doable and reasonably isolated / well-defined.

However, I was not aware of any takers, so a couple days ago on a slow
weekend I took a stab at it. And yes, it's doable - attached is a fairly
complete, tested and polished version of the feature, I think. It turned
out to be a bit more complex than I expected, for reasons that I'll get
into when discussing the patches.

This is great. I've been thinking about approximately the same issue
recently, too, but haven't had time to discuss/implement any of this
yet. I think some solutions may even be portable to the btree parallel
build: it also has key deduplication (though to a much smaller degree)
and could benefit from deduplication during the scan/ssup load phase,
rather than only during insertion.

Perhaps, although I'm not that familiar with the details of btree
builds, and I haven't thought about it when working on this over the
past couple days.

First, let's talk about the benefits - how much faster is that than the
single-process build we have for GIN indexes? I do have a table with the
archive of all our mailing lists - it's ~1.5M messages, table is ~21GB
(raw dump is about 28GB). This does include simple text data (message
body), JSONB (headers) and tsvector (full-text on message body).

Sidenote: Did you include the tsvector in the table to reduce time
spent during index creation? I would have used an expression in the
index definition, rather than a direct column.

Yes, it's a materialized column, not computed during index creation.

If I do CREATE index with different number of workers (0 means serial
build), I get this timings (in seconds):

[...]

This shows the benefits are pretty nice, depending on the opclass. For
most indexes it's maybe ~3-4x faster, which is nice, and I don't think
it's possible to do much better - the actual index inserts can happen
from a single process only, which is the main limit.

Can we really not insert with multiple processes? It seems to me that
GIN could be very suitable for that purpose, with its clear double
tree structure distinction that should result in few buffer conflicts
if different backends work on known-to-be-very-different keys.
We'd probably need multiple read heads on the shared tuplesort, and a
way to join the generated top-level subtrees, but I don't think that
is impossible. Maybe it's work for later effort though.

Maybe, but I took it as a restriction, and it seemed too difficult to
relax (or at least I assumed it would be).

Have you tested and/or benchmarked this with multi-column GIN indexes?

I did test that, and I'm not aware of any bugs/issues. Performance-wise
it depends on which opclasses are used by the columns - if you take the
speedup for each of them independently, the speedup for the whole index
is roughly the average of that.
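
For example, if one column's opclass builds in 38% of the serial time
with 4 workers and another in 41%, a two-column index combining them
should end up somewhere around (38 + 41) / 2 ~ 40% (assuming the two
columns contribute similar amounts of work).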

For some of the opclasses it can regress (like the jsonb_path_ops). I
don't think that's a major issue. Or more precisely, I'm not surprised
by it. It'd be nice to be able to disable the parallel builds in these
cases somehow, but I haven't thought about that.

Do you know why it regresses?

No, but one thing that stands out is that the index is much smaller than
the other columns/opclasses, and the compression does not save much
(only about 5% for both phases). So I assume it's the overhead of
writing and reading a bunch of GB of data without really gaining
much from doing that.

I do plan to do some tests with btree_gin, but I don't expect that to
behave significantly differently.

There are small variations in the index size, when built in the serial
way and the parallel way. It's generally within ~5-10%, and I believe
it's due to the serial build adding the TIDs incrementally, while the
build adds them in much larger chunks (possibly even in one chunk with
all the TIDs for the key).

I assume that was '[...] while the [parallel] build adds them [...]', right?

Right. The parallel build adds them in larger chunks.

I believe the same size variation can happen
if the index gets built in a different way, e.g. by inserting the data
in a different order, etc. I did a number of tests to check if the index
produces the correct results, and I haven't found any issues. So I think
this is OK, and neither a problem nor an advantage of the patch.

Now, let's talk about the code - the series has 7 patches, with 6
non-trivial parts doing changes in focused and easier to understand
pieces (I hope so).

The following comments are generally based on the descriptions; I
haven't really looked much at the patches yet except to validate some
assumptions.

OK

1) v20240502-0001-Allow-parallel-create-for-GIN-indexes.patch

This is the initial feature, adding the "basic" version, implemented as
pretty much 1:1 copy of the BRIN parallel build and minimal changes to
make it work for GIN (mostly about how to store intermediate results).

The basic idea is that the workers do the regular build, but instead of
flushing the data into the index after hitting the memory limit, it gets
written into a shared tuplesort and sorted by the index key. And the
leader then reads this sorted data, accumulates the TID for a given key
and inserts that into the index in one go.

In the code, GIN insertions are still basically single btree
insertions, all starting from the top (but with many same-valued
tuples at once). Now that we have a tuplesort with the full table's
data, couldn't the code be adapted to do more efficient btree loading,
such as that seen in the nbtree code, where the rightmost pages are
cached and filled sequentially without requiring repeated searches
down the tree? I suspect we can gain a lot of time there.

I don't need you to do that, but what's your opinion on this?

I have no idea. I started working on this with only very basic idea of
how GIN works / is structured, so I simply leveraged the existing
callback and massaged it to work in the parallel case too.

2) v20240502-0002-Use-mergesort-in-the-leader-process.patch

The approach implemented by 0001 works, but there's a little bit of
issue - if there are many distinct keys (e.g. for trigrams that can
happen very easily), the workers will hit the memory limit with only
very short TID lists for most keys. For serial build that means merging
the data into a lot of random places, and in parallel build it means the
leader will have to merge a lot of tiny lists from many sorted rows.

Which can be quite annoying and expensive, because the leader does so
using qsort() in the serial part. It'd be better to ensure most of the
sorting happens in the workers, and the leader can do a mergesort. But
the mergesort must not happen too often - merging many small lists is
not cheaper than a single qsort (especially when the lists overlap).

So this patch changes the workers to process the data in two phases. The
first works as before, but the data is flushed into a local tuplesort.
And then each workers sorts the results it produced, and combines them
into results with much larger TID lists, and those results are written
to the shared tuplesort. So the leader only gets very few lists to
combine for a given key - usually just one list per worker.

Hmm, I was hoping we could implement the merging inside the tuplesort
itself during its own flush phase, as it could save significantly on
IO, and could help other users of tuplesort with deduplication, too.

Would that happen in the worker or in the leader process? My goal was
to do the expensive part in the workers, because that's what helps with
the parallelization.

3) v20240502-0003-Remove-the-explicit-pg_qsort-in-workers.patch

In 0002 the workers still do an explicit qsort() on the TID list before
writing the data into the shared tuplesort. But we can do better - the
workers can do a merge sort too. To help with this, we add the first TID
to the tuplesort tuple, and sort by that too - it helps the workers to
process the data in an order that allows simple concatenation instead of
the full mergesort.

Note: There's a non-obvious issue due to parallel scans always being
"sync scans", which may lead to very "wide" TID ranges when the scan
wraps around. More about that later.

As this note seems to imply, this seems to have a strong assumption
that data received in parallel workers is always in TID order, with
one optional wraparound. Non-HEAP TAMs may break with this assumption,
so what's the plan on that?

Well, that would break the serial build too, right? Anyway, the way this
patch works can be extended to deal with that by actually sorting the
TIDs when serializing the tuplestore tuple. The consequence of that is
the combining will be more expensive, because it'll require a proper
mergesort, instead of just appending the lists.
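
To illustrate the difference (just a sketch, not literally what the
patches do): with the first-TID ordering the buffer can simply append
when the incoming list starts after the last TID it already holds, and
only has to fall back to an actual merge otherwise:

static void
GinBufferAddItems(GinBuffer *buffer, ItemPointerData *items, int nitems)
{
	if (buffer->nitems > 0 &&
		ItemPointerCompare(&buffer->items[buffer->nitems - 1], &items[0]) < 0)
	{
		/* no overlap - new TIDs all follow the buffered ones, so plain
		 * concatenation is enough */
		buffer->items = repalloc(buffer->items,
								 (buffer->nitems + nitems) * sizeof(ItemPointerData));
		memcpy(&buffer->items[buffer->nitems], items,
			   nitems * sizeof(ItemPointerData));
		buffer->nitems += nitems;
	}
	else
	{
		/* ranges may overlap (or buffer is empty) - proper mergesort */
		int			nnew;
		ItemPointer new;

		new = ginMergeItemPointers(buffer->items, buffer->nitems,
								   items, nitems, &nnew);

		if (buffer->items)
			pfree(buffer->items);

		buffer->items = new;
		buffer->nitems = nnew;
	}
}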

4) v20240502-0004-Compress-TID-lists-before-writing-tuples-t.patch

The parallel build passes data between processes using temporary files,
which means it may need significant amount of disk space. For BRIN this
was not a major concern, because the summaries tend to be pretty small.

But for GIN that's not the case, and the two-phase processing introduced
by 0002 make it worse, because the worker essentially creates another
copy of the intermediate data. It does not need to copy the key, so
maybe it's not exactly 2x the space requirement, but in the worst case
it's not far from that.

But there's a simple way how to improve this - the TID lists tend to be
very compressible, and GIN already implements a very light-weight TID
compression, so this patch does just that - when building the tuple to
be written into the tuplesort, we just compress the TIDs.

See note on 0002: Could we do this in the tuplesort writeback, rather
than by moving the data around multiple times?

No idea, I've never done that ...

[...]

So 0007 does something similar - it tracks if the TID value goes
backward in the callback, and if it does it dumps the state into the
tuplesort before processing the first tuple from the beginning of the
table. Which means we end up with two separate "narrow" TID list, not
one very wide one.

See note above: We may still need a merge phase, just to make sure we
handle all TAM parallel scans correctly, even if that merge join phase
wouldn't get hit in vanilla PostgreSQL.

Well, yeah. But in fact the parallel code already does that, while the
existing serial code may fail with the "data don't fit" error.

The parallel code will do the mergesort correctly, and only emit TIDs
that we know are safe to write to the index (i.e. no future TIDs will go
before the "TID horizon").

But the serial build has nothing like that - it will sort the TIDs that
fit into the memory limit, but it also relies on not processing data out
of order (and disables sync scans to avoid wraparound issues). But
if the TAM does something funny, this may break.
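
The horizon check itself is simple - a sketch (assuming the input
streams feeding the mergesort are sorted by the first TID of each
chunk):

/*
 * How many buffered TIDs are below the "horizon", i.e. cannot be
 * preceded by any TID still waiting in the sorted inputs? Only those
 * may be flushed to the index at this point.
 */
static int
GinBufferItemsBelowHorizon(GinBuffer *buffer, ItemPointer horizon)
{
	int			nsafe = 0;

	while (nsafe < buffer->nitems &&
		   ItemPointerCompare(&buffer->items[nsafe], horizon) < 0)
		nsafe++;

	return nsafe;
}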

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#3)
7 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

Hi,

Here's a slightly improved version, fixing a couple of bugs in handling
byval/byref values that caused issues on 32-bit machines (but not only there).
It also fixes a couple of compiler warnings about string formatting.

Other than that, no changes.

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v20240505-0001-Allow-parallel-create-for-GIN-indexes.patch
From 510af00802c04b8d6d3982069c96082572a76c72 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:26 +0200
Subject: [PATCH v20240505 1/8] Allow parallel create for GIN indexes

Add support for parallel create of GIN indexes, using an approach and
code very similar to the one used by BRIN indexes.

Each worker reads a subset of the table (from a parallel scan), and
accumulates index entries in memory. But instead of writing the results
into the index (after hitting the memory limit), the data are written
to a shared tuplesort (and sorted by index key).

The leader then reads data from the tuplesort, and combines them into
entries that get inserted into the index.
---
 src/backend/access/gin/ginbulk.c           |    7 +
 src/backend/access/gin/gininsert.c         | 1340 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  154 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   29 +
 src/include/utils/tuplesort.h              |    6 +
 src/tools/pgindent/typedefs.list           |    4 +
 9 files changed, 1534 insertions(+), 16 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/ginbulk.c b/src/backend/access/gin/ginbulk.c
index a522801c2f7..12eeff04a6c 100644
--- a/src/backend/access/gin/ginbulk.c
+++ b/src/backend/access/gin/ginbulk.c
@@ -153,6 +153,13 @@ ginInsertBAEntry(BuildAccumulator *accum,
 	GinEntryAccumulator *ea;
 	bool		isNew;
 
+	/*
+	 * FIXME prevents writes of uninitialized bytes reported by valgrind in
+	 * writetup (likely that build_gin_tuple copies some fields that are only
+	 * initialized for a certain category, or something similar)
+	 */
+	memset(&eatmp, 0, sizeof(GinEntryAccumulator));
+
 	/*
 	 * For the moment, fill only the fields of eatmp that will be looked at by
 	 * cmpEntryAccumulator or ginCombineData.
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 71f38be90c3..b353e155fc6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,124 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields between here and the parallel scan data.
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinShared;
+
+/*
+ * Return pointer to a GinShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +142,49 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process. (Actually, only the leader process has a
+	 * GinBuildState.)
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +463,95 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * XXX idea - Instead of writing the entries directly into the shared
+	 * tuplesort, write it into a local one, do the sort in the worker, and
+	 * combine the results. For large tables with many different keys that's
+	 * going to work better than the current approach where we don't get many
+	 * matches in work_mem (maybe this should use 32MB, which is what we use
+	 * when planning, but even that may not be great). Which means we are
+	 * likely to have many entries with a single TID, forcing the leader to do
+	 * a qsort() when merging the data, often amounting to ~50% of the serial
+	 * part. By doing the qsort() in a worker, leader then can do a mergesort
+	 * (likely cheaper). Also, it means the amount of data worker->leader is
+	 * going to be lower thanks to deduplication.
+	 *
+	 * Disadvantage: It needs more disk space, possibly up to 2x, because each
+	 * worker creates a tuplesort, then "transforms it" into the shared
+	 * tuplesort (hopefully less data, but not guaranteed).
+	 *
+	 * It's however possible to partition the data into multiple tuplesorts
+	 * per worker (by hashing). We don't need perfect sorting, and we can even
+	 * live with "equal" keys having multiple hashes (if there are multiple
+	 * binary representations of the value).
+	 */
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort
+	 *
+	 * XXX probably should use 32MB, not work_mem, as used during planning?
+	 */
+	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&buildstate->accum);
+		while ((list = ginGetBAEntry(&buildstate->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			/* GIN tuple and tuple length */
+			GinTuple   *tup;
+			Size		tuplen;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &tuplen);
+
+			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(buildstate->tmpCtx);
+		ginInitBA(&buildstate->accum);
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +569,14 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * XXX Make sure to initialize a bunch of fields, not to trip valgrind.
+	 * Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,25 +617,91 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. That makes sense
+	 * for btree, but not for GIN, which can do with much less memory. So
+	 * maybe make that somehow less strict, optionally?
+	 */
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
+
+
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, (void *) &buildstate,
-									   NULL);
+	if (state->bs_leader)
+	{
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+		/* scan the relation and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
+
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, (void *) &buildstate,
+										   NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -534,3 +841,1006 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	/*
+	 * Insert the shared state and the tuplesort state into the TOC, so
+	 * that the worker processes can look them up.
+	 */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinShared  *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+static int
+tid_cmp(const void *a, const void *b)
+{
+	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+}
+
+/*
+ * State used to accumulate TIDs from multiple GinTuples for the same
+ * key value.
+ *
+ * XXX Similar purpose to BuildAccumulator, but much simpler.
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	int			maxitems;
+	ItemPointerData *items;
+} GinBuffer;
+
+/* XXX should do more checks */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	Assert(buffer->nitems <= buffer->maxitems);
+#endif
+}
+
+static void
+AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+{
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 0; i < nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&items[i]));
+
+		if ((i == 0) || !sorted)
+			continue;
+
+		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+	}
+#endif
+}
+
+static GinBuffer *
+GinBufferInit(void)
+{
+	return palloc0(sizeof(GinBuffer));
+}
+
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * Check whether the tuple matches the already accumulated data. Compare
+ * scalar fields first, before the actual key.
+ *
+ * XXX The key is compared using memcmp, which means that if a key has
+ * multiple binary representations, we may end up treating them as
+ * different here. But that's OK, the index will merge them anyway.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	if (tup->keylen != buffer->keylen)
+		return false;
+
+	/*
+	 * For NULL/empty keys, this means equality, for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * Compare the key value, depending on the type information.
+	 *
+	 * XXX Not sure this works correctly for byval types that don't need the
+	 * whole Datum. What if there is garbage in the padding bytes?
+	 */
+	if (buffer->typbyval)
+		return (buffer->key == *(Datum *) tup->data);
+
+	/* byref values are simply compared using memcmp */
+	return (memcmp(tup->data, DatumGetPointer(buffer->key), buffer->keylen) == 0);
+}
+
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* enlarge the TID buffer, if needed */
+	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	{
+		/* 64 seems like a good init value */
+		buffer->maxitems = Max(buffer->maxitems, 64);
+
+		while (buffer->nitems + tup->nitems > buffer->maxitems)
+			buffer->maxitems *= 2;
+
+		if (buffer->items == NULL)
+			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 buffer->maxitems * sizeof(ItemPointerData));
+	}
+
+	/* now we should be guaranteed to have enough space for all the TIDs */
+	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+
+	/* copy the new TIDs into the buffer */
+	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
+	buffer->nitems += tup->nitems;
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+}
+
+static void
+GinBufferSortItems(GinBuffer *buffer)
+{
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+}
+
+/* XXX probably would be better to have a memory context for the buffer */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/* XXX not really needed, but makes stale use fail early (NULL deref etc.) */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+
+	/* XXX should do something with extremely large array of items? */
+}
+
+/*
+ * XXX Maybe check size of the TID arrays, and return false if it's too
+ * large (more than maintenance_work_mem or something?).
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, merge the per-worker results into
+ * the index. The results produced by each worker are sorted by key; while
+ * combining them we accumulate the TID lists for the same key, so that each
+ * key is inserted into the index just once.
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME probably should have local memory contexts similar to what
+ * _brin_parallel_merge does.
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit();
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us an order matching how the data is
+	 * organized in the index.
+	 *
+	 * XXX Maybe we should sort by key first, then by category?
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the
+			 * current GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinBufferSortItems(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(sortmem, coordinate,
+													TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write the remaining accumulated entries into the tuplesort */
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&state->accum);
+		while ((list = ginGetBAEntry(&state->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			GinTuple   *tup;
+			Size		len;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &len);
+
+			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(state->tmpCtx);
+		ginInitBA(&state->accum);
+	}
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have an actual non-empty key. We include varlena headers and \0 bytes for
+	 * strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to do the MAXALIGN to allow
+	 * direct access to items pointers.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "invalid typlen");
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * allocate space for the whole GIN tuple
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplestore representation.
+ *
+ * Most of the fields are actually directly accessible, the only thing that
+ * needs more care is the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, the key value is
+ * compared last. The comparisons are done simply by "memcmp", based on the
+ * assumption that if we get two keys that are two different representations
+ * of a logically equal value, it'll get merged by the index build.
+ *
+ * FIXME Is the assumption we can just memcmp() actually valid? Won't this
+ * trigger the "could not split GIN page; all old items didn't fit" error
+ * when trying to update the TID list?
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b)
+{
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if ((a->category == GIN_CAT_NORM_KEY) &&
+		(b->category == GIN_CAT_NORM_KEY))
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		/*
+		 * works for both byval and byref types with fixed length, because for
+		 * byval we set keylen to sizeof(Datum)
+		 */
+		if (a->typbyval)
+		{
+			return memcmp(&keya, &keyb, a->keylen);
+		}
+		else
+		{
+			if (a->keylen < b->keylen)
+				return -1;
+
+			if (a->keylen > b->keylen)
+				return 1;
+
+			return memcmp(DatumGetPointer(keya), DatumGetPointer(keyb), a->keylen);
+		}
+	}
+
+	return 0;
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 5747ae6a4ca..dd22b44aca9 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb54..c9ea769afb5 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 05a853caa36..55cc55969e5 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,6 +20,7 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
@@ -46,6 +47,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +77,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +87,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -580,6 +589,35 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+
+Tuplesortstate *
+tuplesort_begin_index_gin(int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	base->nKeys = 1;			/* Only the index key */
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -817,6 +855,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tup, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tup, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -989,6 +1058,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1777,6 +1869,68 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 25983b7a505..be76d8446f4 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 00000000000..56aed40fb96
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations of the GinTuple format used by parallel GIN builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "storage/itemptr.h"
+
+typedef struct GinTuple
+{
+	Size		tuplen;			/* length of the whole tuple */
+	Size		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	OffsetNumber attrnum;
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index e7941a1f09f..35fa5ae2442 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,8 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +459,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tup, Size len);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tup, Size len);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +469,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index eee989bba17..9769f4d6b09 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1004,11 +1004,13 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1021,9 +1023,11 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinShared
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.44.0

Attachment: v20240505-0002-Use-mergesort-in-the-leader-process.patch (text/x-patch)
From f43aeb97f766b24092c3758fa5d6a9f0e6676eaf Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:32 +0200
Subject: [PATCH v20240505 2/8] Use mergesort in the leader process

The leader process (executing the serial part of the index build) spent
a significant part of the time in pg_qsort, after combining the partial
results from the workers. But we can improve this and move some of the
costs to the parallel part in workers - if workers produce sorted TID
lists, the leader can combine them by mergesort.

But to make this really efficient, the mergesort must not be executed
too many times. The workers may easily produce very short TID lists, if
there are many different keys, hitting the memory limit often. So this
adds an intermediate tuplesort pass into each worker, to combine TIDs
for each key and only then write the result into the shared tuplesort.

This means the number of mergesort invocations for each key should be
about the same as the number of workers. We can't really do better, and
it's low enough to keep the mergesort approach efficient.

Note: If we introduce a memory limit on GinBuffer (to not accumulate too
many TIDs in memory), we could end up with more chunks, but it should
not be very common.
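
To illustrate the shape shared by both merge passes (a standalone sketch
with simplified types and hypothetical names, not the patch code):
consume a stream sorted by key, accumulate TIDs while the key repeats,
and flush each completed group downstream - into the shared tuplesort in
the worker pass, into the index in the leader pass.

#include <stdio.h>

/* toy stand-in for a GinTuple: a single (key, TID) pair */
typedef struct
{
	int			key;
	int			tid;
} Entry;

/* stand-in for tuplesort_putgintuple() / ginEntryInsert() */
static void
flush_group(int key, const int *tids, int ntids)
{
	printf("key %d: %d TIDs\n", key, ntids);
}

/* group-by-key loop over input sorted by key */
static void
merge_sorted_stream(const Entry *entries, int n)
{
	int			tids[1024];		/* no memory limit, like the patch */
	int			ntids = 0;

	for (int i = 0; i < n; i++)
	{
		/* key change - flush the accumulated group first */
		if (ntids > 0 && entries[i].key != entries[i - 1].key)
		{
			flush_group(entries[i - 1].key, tids, ntids);
			ntids = 0;
		}
		tids[ntids++] = entries[i].tid;
	}

	/* flush the last group */
	if (ntids > 0)
		flush_group(entries[n - 1].key, tids, ntids);
}
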
---
 src/backend/access/gin/gininsert.c | 171 +++++++++++++++++++++++++----
 1 file changed, 148 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b353e155fc6..cf7a6278914 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -161,6 +161,14 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate is used only within a worker for the first merge pass
+	 * that happens in the worker. In principle it doesn't need to be part of
+	 * the build state and we could pass it around directly, but it's more
+	 * convenient this way.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -533,7 +541,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
-			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
 
 			pfree(tup);
 		}
@@ -1127,7 +1135,6 @@ typedef struct GinBuffer
 
 	/* array of TID values */
 	int			nitems;
-	int			maxitems;
 	ItemPointerData *items;
 } GinBuffer;
 
@@ -1136,7 +1143,6 @@ static void
 AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
-	Assert(buffer->nitems <= buffer->maxitems);
 #endif
 }
 
@@ -1240,28 +1246,22 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* enlarge the TID buffer, if needed */
-	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	/* copy the new TIDs into the buffer, combine using merge-sort */
 	{
-		/* 64 seems like a good init value */
-		buffer->maxitems = Max(buffer->maxitems, 64);
+		int			nnew;
+		ItemPointer new;
 
-		while (buffer->nitems + tup->nitems > buffer->maxitems)
-			buffer->maxitems *= 2;
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
 
-		if (buffer->items == NULL)
-			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
-		else
-			buffer->items = repalloc(buffer->items,
-									 buffer->maxitems * sizeof(ItemPointerData));
-	}
+		Assert(nnew == buffer->nitems + tup->nitems);
 
-	/* now we should be guaranteed to have enough space for all the TIDs */
-	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+		if (buffer->items)
+			pfree(buffer->items);
 
-	/* copy the new TIDs into the buffer */
-	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
-	buffer->nitems += tup->nitems;
+		buffer->items = new;
+		buffer->nitems = nnew;
+	}
 
 	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
 }
@@ -1302,6 +1302,21 @@ GinBufferReset(GinBuffer *buffer)
 	/* XXX should do something with extremely large array of items? */
 }
 
+/* XXX probably would be better to have a memory context for the buffer */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
 /*
  * XXX Maybe check size of the TID arrays, and return false if it's too
  * large (more than maintenance_work_mem or something?).
@@ -1375,7 +1390,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the index, and start a new entry for the
 			 * current GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1392,7 +1407,7 @@ _gin_parallel_merge(GinBuildState *state)
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1402,6 +1417,9 @@ _gin_parallel_merge(GinBuildState *state)
 		GinBufferReset(buffer);
 	}
 
+	/* release all the memory */
+	GinBufferFree(buffer);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1440,6 +1458,102 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short. But combining many tiny lists is expensive,
+ * so we try to do as much as possible in the workers and only then pass the
+ * results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker, which means the leader will
+ * do a very limited number of mergesorts, which is good.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit();
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
+	 * merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the
+			 * current GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		GinBufferSortItems(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1471,6 +1585,10 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	state->bs_sortstate = tuplesort_begin_index_gin(sortmem, coordinate,
 													TUPLESORT_NONE);
 
+	/* Local per-worker sort of raw-data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(sortmem, NULL,
+													  TUPLESORT_NONE);
+
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1508,7 +1626,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
-			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
 
 			pfree(tup);
 		}
@@ -1517,6 +1635,13 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 		ginInitBA(&state->accum);
 	}
 
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place those
+	 * into the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
-- 
2.44.0

Attachment: v20240505-0003-Remove-the-explicit-pg_qsort-in-workers.patch (text/x-patch)
From 45e7f09ec81932c54eef891017d2a10818dd25b6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:36 +0200
Subject: [PATCH v20240505 3/8] Remove the explicit pg_qsort in workers

We don't need to do the explicit sort before building the GIN tuple,
because the mergesort in GinBufferStoreTuple is already maintaining the
correct order (this was added in an earlier commit).

The commit also adds a field with the first TID to each GinTuple, and
modifies the comparator to sort by it (for each key value). This helps
workers build non-overlapping TID lists, which can simply be appended
instead of having to do an actual mergesort to combine them. This is
best-effort, i.e. it's not guaranteed to eliminate the mergesort - in
particular, parallel scans are synchronized, and thus may start somewhere
in the middle of the table and wrap around, in which case there may be a
very wide list (with low/high TID values).

Note: There's an XXX comment with a couple ideas on how to improve this,
at the cost of more complexity.
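
To show why sorting by the first TID helps (a standalone sketch with
TIDs simplified to uint64_t - the real code relies on
ginMergeItemPointers, which the patch comments note has an equivalent
fast path): merging two sorted lists degenerates to plain concatenation
whenever the first list ends before the second one starts, which is
exactly the order the comparator tries to arrange.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static uint64_t *
merge_tid_lists(const uint64_t *a, int na,
				const uint64_t *b, int nb, int *nout)
{
	uint64_t   *r = malloc(sizeof(uint64_t) * (na + nb));
	int			i = 0,
				j = 0,
				k = 0;

	*nout = na + nb;

	/* non-overlapping (or empty) inputs - just concatenate */
	if (na == 0 || nb == 0 || a[na - 1] < b[0])
	{
		memcpy(r, a, sizeof(uint64_t) * na);
		memcpy(r + na, b, sizeof(uint64_t) * nb);
		return r;
	}

	/* overlapping ranges - fall back to a regular linear merge */
	while (i < na && j < nb)
		r[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
	while (i < na)
		r[k++] = a[i++];
	while (j < nb)
		r[k++] = b[j++];

	return r;
}
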
---
 src/backend/access/gin/gininsert.c | 124 +++++++++++++++++++++--------
 src/include/access/gin_tuple.h     |   8 ++
 2 files changed, 98 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cf7a6278914..b2b44066329 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1110,12 +1110,6 @@ _gin_parallel_heapscan(GinBuildState *state)
 	return state->bs_reltuples;
 }
 
-static int
-tid_cmp(const void *a, const void *b)
-{
-	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
-}
-
 /*
  * State used to accumulate TIDs from multiple GinTuples for the same
  * key value.
@@ -1147,17 +1141,21 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 }
 
 static void
-AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
-	for (int i = 0; i < nitems; i++)
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
 	{
-		Assert(ItemPointerIsValid(&items[i]));
+		Assert(ItemPointerIsValid(&buffer->items[i]));
 
 		if ((i == 0) || !sorted)
 			continue;
 
-		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
 	}
 #endif
 }
@@ -1220,6 +1218,45 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (memcmp(tup->data, DatumGetPointer(buffer->key), buffer->keylen) == 0);
 }
 
+/*
+ * GinBufferStoreTuple
+ *		Add data from a GinTuple into the GinBuffer.
+ *
+ * If the buffer is empty, we simply initialize it with data from the tuple.
+ * Otherwise data from the tuple (the TID list) is added to the TID data in
+ * the buffer, either by simply appending the TIDs or doing merge sort.
+ *
+ * The data (for the same key) is expected to be processed sorted by first
+ * TID. But this does not guarantee the lists do not overlap, especially in
+ * the leader, because the workers process interleaving data. But even in
+ * a single worker, lists can overlap - parallel scans require sync-scans,
+ * and if the scan starts somewhere in the table and then wraps around, it
+ * may contain very wide lists (in terms of TID range).
+ *
+ * But the ginMergeItemPointers() is already smart about detecting cases
+ * where it can simply concatenate the lists, and when full mergesort is
+ * needed. And does the right thing.
+ *
+ * By keeping the first TID in the GinTuple and sorting by that, we make
+ * it more likely the lists won't overlap very often.
+ *
+ * XXX How frequent can the overlaps be? If the scan does not wrap around,
+ * there should be no overlapping lists, and thus no mergesort. After an
+ * overlap, there probably can be many - the one list will be very wide,
+ * with a very low and high TID, and all other lists will overlap with it.
+ * I'm not sure how much we can do to prevent that, short of disabling sync
+ * scans (which for parallel scans is currently not possible). One option
+ * would be to keep two lists of TIDs, and see if the new list can be
+ * concatenated with one of them. The idea is that there's only one wide
+ * list (because the wraparound happens only once), and then do the
+ * mergesort only once at the very end.
+ *
+ * XXX Alternatively, we could simply detect the case when the lists can't
+ * be appended, and flush the current list out. The wrap around happens only
+ * once, so there's going to be only one such wide list, and it'll be sorted
+ * first (because it has the lowest TID for the key). So we'd do this at
+ * most once per key.
+ */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 {
@@ -1246,7 +1283,12 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* copy the new TIDs into the buffer, combine using merge-sort */
+	/*
+	 * Copy the new TIDs into the buffer, combine with existing data (if any)
+	 * using merge-sort. The mergesort is already smart about cases where it
+	 * can simply concatenate the two lists, and when it actually needs to
+	 * merge the data in an expensive way.
+	 */
 	{
 		int			nnew;
 		ItemPointer new;
@@ -1261,21 +1303,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		buffer->items = new;
 		buffer->nitems = nnew;
-	}
-
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
-}
 
-static void
-GinBufferSortItems(GinBuffer *buffer)
-{
-	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
-
-	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
-
-	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
+	}
 }
 
 /* XXX probably would be better to have a memory context for the buffer */
@@ -1299,6 +1329,11 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 
+	if (buffer->items)
+	{
+		pfree(buffer->items);
+		buffer->items = NULL;
+	}
 	/* XXX should do something with extremely large array of items? */
 }
 
@@ -1390,7 +1425,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the index, and start a new entry for the
 			 * current GinTuple.
 			 */
-			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+			AssertCheckItemPointers(buffer, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1400,14 +1435,17 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1510,7 +1548,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 * the data into the index, and start a new entry for the
 			 * current GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer, true);
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
@@ -1524,7 +1562,10 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
@@ -1534,7 +1575,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		GinTuple   *ntup;
 		Size		ntuplen;
 
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer, true);
 
 		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
@@ -1835,6 +1876,7 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
+	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -1919,6 +1961,12 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
  * assumption that if we get two keys that are two different representations
  * of a logically equal value, it'll get merged by the index build.
  *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
+ *
  * FIXME Is the assumption we can just memcmp() actually valid? Won't this
  * trigger the "could not split GIN page; all old items didn't fit" error
  * when trying to update the TID list?
@@ -1953,19 +2001,27 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b)
 		 */
 		if (a->typbyval)
 		{
-			return memcmp(&keya, &keyb, a->keylen);
+			int			r = memcmp(&keya, &keyb, a->keylen);
+
+			/* if the key is the same, consider the first TID in the array */
+			return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 		}
 		else
 		{
+			int			r;
+
 			if (a->keylen < b->keylen)
 				return -1;
 
 			if (a->keylen > b->keylen)
 				return 1;
 
-			return memcmp(DatumGetPointer(keya), DatumGetPointer(keyb), a->keylen);
+			r = memcmp(DatumGetPointer(keya), DatumGetPointer(keyb), a->keylen);
+
+			/* if the key is the same, consider the first TID in the array */
+			return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 		}
 	}
 
-	return 0;
+	return ItemPointerCompare(&a->first, &b->first);
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 56aed40fb96..8efa33a8d31 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -12,6 +12,13 @@
 
 #include "storage/itemptr.h"
 
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ */
 typedef struct GinTuple
 {
 	Size		tuplen;			/* length of the whole tuple */
@@ -20,6 +27,7 @@ typedef struct GinTuple
 	bool		typbyval;		/* typbyval for key */
 	OffsetNumber attrnum;
 	signed char category;		/* category: normal or NULL? */
+	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
-- 
2.44.0

Attachment: v20240505-0004-Compress-TID-lists-before-writing-tuples-t.patch (text/x-patch)
From 08f8ecd7a21370cc452a6185781428729ad58330 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:39 +0200
Subject: [PATCH v20240505 4/8] Compress TID lists before writing tuples to
 disk

When serializing GIN tuples to tuplesorts, we can significantly reduce
the amount of data by compressing the TID lists. The GIN opclasses may
produce a lot of data (depending on how many keys are extracted from
each row), and the compression is very effective and efficient.

If the number of different keys is high, the first worker pass may not
benefit from the compression very much - the data will be spilled to
disk before the TID lists can grow long enough for the compression to
actually help. In the second pass the impact is much more significant.

For real-world data (full-text on mailing list archives), I usually see
the compression save only about ~15% in the first pass, but ~50% in
the second pass.
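
For context, the savings come from the usual delta + varbyte scheme for
posting lists: sorted TIDs yield small deltas, and small deltas encode
into single bytes. A simplified standalone sketch of the idea (plain
uint64_t values instead of item pointers, no segment headers -
ginCompressPostingList implements the real format):

#include <stddef.h>
#include <stdint.h>

/*
 * Encode a sorted list of values as varbyte-encoded deltas. Returns
 * the number of bytes written to "out"; the caller must provide up
 * to 10 bytes per value for the worst case.
 */
static size_t
encode_deltas(const uint64_t *vals, int n, unsigned char *out)
{
	unsigned char *p = out;
	uint64_t	prev = 0;

	for (int i = 0; i < n; i++)
	{
		uint64_t	delta = vals[i] - prev; /* small for dense TIDs */

		prev = vals[i];

		/* 7 bits per byte, high bit marks a continuation byte */
		while (delta >= 0x80)
		{
			*p++ = (unsigned char) (delta | 0x80);
			delta >>= 7;
		}
		*p++ = (unsigned char) delta;
	}

	return p - out;
}
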
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b2b44066329..b84fb8f12b6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -187,7 +187,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1265,7 +1267,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1306,6 +1309,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer, true);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /* XXX probably would be better to have a memory context for the buffer */
@@ -1806,6 +1812,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1818,6 +1833,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 * like endianness etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized as compressed - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1831,6 +1851,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -1854,12 +1879,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "invalid typlen");
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * SHORTALIGN, but we do MAXALIGN anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -1909,37 +1956,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible, the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -1952,6 +2002,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array of decompressed TIDs.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -1992,8 +2064,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b)
 	if ((a->category == GIN_CAT_NORM_KEY) &&
 		(b->category == GIN_CAT_NORM_KEY))
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		/*
	 * works for both byval and byref types with fixed length, because for
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9769f4d6b09..6b67756ebb9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1023,6 +1023,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinShared
 GinState
 GinStatsData
-- 
2.44.0
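
For reference, the compression round-trip added by this patch boils down
to the following sketch (using only the calls visible in the patch;
assumes a sorted TID array items[] of length nitems that fits into a
single segment, and omits the multi-segment loop):

  int             cnt;
  int             ndecoded;
  GinPostingList *seg;
  ItemPointer     decoded;

  /* encode the sorted TID array into one posting-list segment */
  seg = ginCompressPostingList(items, nitems, UINT16_MAX, &cnt);

  /* decode the segment back into a palloc'd TID array */
  decoded = ginPostingListDecodeAllSegments(seg,
                                            SizeOfGinPostingList(seg),
                                            &ndecoded);
  Assert(ndecoded == cnt);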

v20240505-0005-Collect-and-print-compression-stats.patch (text/x-patch)
From eaefe4ed07054fd43428f06de35496aecbbadb4a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:43 +0200
Subject: [PATCH v20240505 5/8] Collect and print compression stats

Allows evaluating the benefits of compressing the TID lists.
---
 src/backend/access/gin/gininsert.c | 36 +++++++++++++++++++++++++-----
 src/include/access/gin.h           |  2 ++
 2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b84fb8f12b6..2206c47dfb1 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -190,7 +190,8 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 static ItemPointer _gin_parse_tuple_items(GinTuple *a);
 static Datum _gin_parse_tuple_key(GinTuple *a);
 
-static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+static GinTuple *_gin_build_tuple(GinBuildState *state,
+								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
 								  Size *len);
@@ -539,7 +540,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(buildstate, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
@@ -1530,6 +1531,15 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* sort the raw per-worker data */
 	tuplesort_performsort(state->bs_worker_sort);
 
+	/* print some basic info */
+	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
+	/* reset before the second phase */
+	state->buildStats.sizeCompressed = 0;
+	state->buildStats.sizeRaw = 0;
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
 	 * merge them into larger chunks for the leader to combine.
@@ -1556,7 +1566,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 */
 			AssertCheckItemPointers(buffer, true);
 
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
 									buffer->items, buffer->nitems, &ntuplen);
 
@@ -1583,7 +1593,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 		AssertCheckItemPointers(buffer, true);
 
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
 								buffer->items, buffer->nitems, &ntuplen);
 
@@ -1598,6 +1608,11 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
	/* release all the memory */
 	GinBufferFree(buffer);
 
+	/* print some basic info */
+	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
 	tuplesort_end(worker_sort);
 }
 
@@ -1669,7 +1684,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(state, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
@@ -1763,6 +1778,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* initialize the GIN build state */
 	initGinState(&buildstate.ginstate, indexRel);
 	buildstate.indtuples = 0;
+	/* XXX shouldn't this initialize the other fields, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
 	/*
@@ -1840,7 +1856,8 @@ typedef struct
  * of that into the GIN tuple.
  */
 static GinTuple *
-_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+_gin_build_tuple(GinBuildState *state,
+				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
 				 ItemPointerData *items, uint32 nitems,
 				 Size *len)
@@ -1971,6 +1988,13 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 		pfree(seginfo);
 	}
 
+	/* how large would the tuple be without compression? */
+	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		nitems * sizeof(ItemPointerData);
+
+	/* compressed size */
+	state->buildStats.sizeCompressed += tuplen;
+
 	return tuple;
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index be76d8446f4..2b6633d068a 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -49,6 +49,8 @@ typedef struct GinStatsData
 	BlockNumber nDataPages;
 	int64		nEntries;
 	int32		ginVersion;
+	Size		sizeRaw;
+	Size		sizeCompressed;
 } GinStatsData;
 
 /*
-- 
2.44.0
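
With these elogs in place, each worker logs lines like the following
(numbers made up to match the ~15% and ~50% savings mentioned in the
commit message of 0004):

  LOG:  _gin_parallel_scan_and_build raw 1000000000 compressed 850000000 ratio 85.00%
  LOG:  _gin_process_worker_data raw 850000000 compressed 425000000 ratio 50.00%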

v20240505-0006-Enforce-memory-limit-when-combining-tuples.patch (text/x-patch)
From fb14d8f86276dc08bec2e93c3191832613a6d56a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:49 +0200
Subject: [PATCH v20240505 6/8] Enforce memory limit when combining tuples

When combining intermediate results during a parallel GIN index build, we
want to restrict the memory usage. In ginBuildCallbackParallel() this is
done simply by dumping working state into tuplesort after hitting the
memory limit.

This commit introduces a memory limit for the following steps, which
merge the intermediate results in both the worker and the leader. The
merge only deals with one key at a time, and the primary risk is that a
key might have too many different TIDs. While this is not very likely,
because the TID array only needs 6B per item, it's still a potential issue.

We can't simply dump the whole current TID list - the index requires the
TID values to be inserted in the correct order, but if the lists overlap
(as they do between workers), the tail of the list may change during the
mergesort. But thanks to sorting GIN tuples by first TID, we can derive
a safe TID horizon - we know no future tuples will have TIDs from before
this value, so it's safe to output this part of the list.

This commit tracks the "frozen" part of the TID list, which is the part
we know won't change after merging additional TID lists. Then if the TID
list grows too large (more than 64kB), we try to trim it - write out the
frozen part of the list, and discard it from the buffer. We only do the
trimming if there's at least 1024 frozen items - we don't want to write
the data into the index in tiny chunks.

The freezing also allows us to skip the frozen part during mergesort.
The frozen part of the list is known to be fully sorted, so we can just
skip it and mergesort only the rest of the data.

Note: These limits (1024 and 64kB) are mostly arbitrary, but seem high
enough to get good efficiency for compression/batching, but low enough
to release memory early and work in small increments.
---
 src/backend/access/gin/gininsert.c | 245 +++++++++++++++++++++++++++--
 src/include/access/gin.h           |   1 +
 2 files changed, 237 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 2206c47dfb1..f4a4b8f00e9 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1130,8 +1130,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempt to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	ItemPointerData *items;
 } GinBuffer;
 
@@ -1166,7 +1170,21 @@ AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 static GinBuffer *
 GinBufferInit(void)
 {
-	return palloc0(sizeof(GinBuffer));
+	GinBuffer  *buffer = (GinBuffer *) palloc0(sizeof(GinBuffer));
+
+	/*
+	 * How many items can we fit into the memory limit? 64kB seems more than
+	 * enough and we don't want a limit that's too high. OTOH maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 *
+	 * XXX This is not enough to prevent repeated merges after a wraparound,
+	 * but it should be enough to make the merges cheap because it quickly
+	 * reaches the end of the second list and can just memcpy the rest
+	 * without walking it item by item.
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
+	return buffer;
 }
 
 static bool
@@ -1221,6 +1239,54 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (memcmp(tup->data, DatumGetPointer(buffer->key), buffer->keylen) == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we understand writing out and removing the tuple IDs that
+ * we know can't change by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding if to trim - if the resulting list
+ * (after adding TIDs from the new tuple) would be too long, and if there is
+ * enough TIDs to trim (with values less than "first" TID from the new tuple),
+ * we do the trim. By enough we mean at least 128 TIDs (mostly an arbitrary
+ * number).
+ *
+ * XXX This does help for the wrap around case, because the "wide" TID list
+ * is essentially two ranges - one at the beginning of the table, one at the
+ * end. And all the other ranges (from GIN tuples) come in between, and also
+ * do not overlap. So by trimming up to the range we're about to add, this
+ * guarantees we'll be able to "concatenate" the two lists cheaply.
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* We won't hit the memory limit even after adding this tuple. */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data from a GinTuple into the GinBuffer.
@@ -1259,6 +1325,11 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  * once, so there's going to be only such wide list, and it'll be sorted
  * first (because it has the lowest TID for the key). So we'd do this at
  * most once per key.
+ *
+ * XXX Maybe we could/should allocate the buffer once and then keep it
+ * without palloc/pfree. That won't help when just calling the mergesort,
+ * as that does palloc internally, but if we detected the append case,
+ * we could do without it. Not sure how much overhead it is, though.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1287,26 +1358,82 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now search the list linearly, to find the last frozen TID. If we already
+	 * found the whole list to be frozen, this just does nothing.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher.
+	 *
+	 * XXX Maybe this should do a binary search if the number of "non-frozen"
+	 * items is sufficiently high (enough to make linear search slower than
+	 * binsearch).
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/*
 	 * Copy the new TIDs into the buffer, combine with existing data (if any)
 	 * using merge-sort. The mergesort is already smart about cases where it
 	 * can simply concatenate the two lists, and when it actually needs to
 	 * merge the data in an expensive way.
+	 *
+	 * XXX We could check if (buffer->nitems > buffer->nfrozen) and only do
+	 * the mergesort in that case. ginMergeItemPointers does some palloc
+	 * internally, and this way we could eliminate that. But let's keep the
+	 * code simple for now.
 	 */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		pfree(new);
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer, true);
 	}
@@ -1332,6 +1459,7 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
@@ -1344,6 +1472,23 @@ GinBufferReset(GinBuffer *buffer)
 	/* XXX should do something with extremely large array of items? */
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /* XXX probably would be better to have a memory context for the buffer */
 static void
 GinBufferFree(GinBuffer *buffer)
@@ -1402,7 +1547,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers already completed.
+	 * combine data, as the parallel workers have already completed.
 	buffer = GinBufferInit();
 
 	/*
@@ -1442,6 +1592,36 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 *
+		 * XXX The buffer may also be empty, but in that case we skip this.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The buffer is at risk of exceeding the memory limit - write
+			 * out the frozen part of its TID list into the index, and then
+			 * discard that part from the buffer.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
@@ -1465,6 +1645,8 @@ _gin_parallel_merge(GinBuildState *state)
	/* release all the memory */
 	GinBufferFree(buffer);
 
+	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1525,7 +1707,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit();
 
 	/* sort the raw per-worker data */
@@ -1578,6 +1766,43 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 *
+		 * XXX The buffer may also be empty, but in that case we skip this.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The buffer is at risk of exceeding the memory limit - write
+			 * out the frozen part of its TID list into the shared tuplesort,
+			 * and then discard that part from the buffer.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
@@ -1613,6 +1838,8 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
 		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
 
+	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(worker_sort);
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 2b6633d068a..9381329fac5 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -51,6 +51,7 @@ typedef struct GinStatsData
 	int32		ginVersion;
 	Size		sizeRaw;
 	Size		sizeCompressed;
+	int64		nTrims;
 } GinStatsData;
 
 /*
-- 
2.44.0
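
A small worked example of the freezing logic (hypothetical TIDs): say
the buffer holds TIDs (0,1) .. (500,9) for some key, and the next GIN
tuple for that key has first = (300,1). Because tuples are sorted by
(key, first TID), no later tuple can contribute a TID below (300,1), so
every buffered TID up to (300,1) is frozen. Once at least 1024 TIDs are
frozen and adding the new tuple would exceed maxitems,
GinBufferShouldTrim() allows writing that frozen prefix out, and
GinBufferTrim() then discards it from the buffer.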

v20240505-0007-Detect-wrap-around-in-parallel-callback.patch (text/x-patch)
From f8af37eb2278d3a97b148458161c30122cdbf9e1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:55 +0200
Subject: [PATCH v20240505 7/8] Detect wrap around in parallel callback

When a sync scan wraps around during the index build, some keys may end
up with very long TID lists, requiring "full" merge sort runs when
combining data in workers. It also causes problems with enforcing the
memory limit, because we can't just dump the data - the index build
requires append-only posting lists, and violating that may result in
errors like

  ERROR: could not split GIN page; all old items didn't fit

because after the scan wrap around some of the TIDs may belong to the
beginning of the list, affecting the compression.

But we can deal with this in the callback - if we see the TID jump
back, that must mean a wraparound happened. In that case we simply dump
all the data accumulated in memory, and start from scratch.

This means there won't be any tuples with very wide TID ranges; instead,
there'll be one tuple with a range at the end of the table, and another
tuple at the beginning. And all the lists in the worker will be
non-overlapping, and sort nicely based on first TID.

For the leader, we still need to do the full merge - the lists may
overlap and interleave in various ways. But there should be only very
few of those lists, about one per worker, making it not an issue.
---
 src/backend/access/gin/gininsert.c | 89 ++++++++++++++++++------------
 1 file changed, 55 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index f4a4b8f00e9..7705eddfa70 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -142,6 +142,7 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
 
 	/* FIXME likely duplicate with indtuples */
 	double		bs_numtuples;
@@ -474,6 +475,47 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(buildstate, attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
+/*
+ * FIXME Another way to deal with the wrap around of sync scans would be to
+ * detect when tid wraps around and just flush the state.
+ */
 static void
 ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 						 bool *isnull, bool tupleIsAlive, void *state)
@@ -484,6 +526,16 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
 
+	/* flush contents before wrapping around */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+	{
+		elog(LOG, "calling ginFlushBuildState");
+		ginFlushBuildState(buildstate, index);
+	}
+
+	/* remember the TID we're about to process */
+	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
+
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
 		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
 							   values[i], isnull[i], tid);
@@ -518,40 +570,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 	 * XXX probably should use 32MB, not work_mem, as used during planning?
 	 */
 	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			/* GIN tuple and tuple length */
-			GinTuple   *tup;
-			Size		tuplen;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(buildstate, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &tuplen);
-
-			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
-	}
+		ginFlushBuildState(buildstate, index);
 
 	MemoryContextSwitchTo(oldCtx);
 }
@@ -587,6 +606,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.bs_numtuples = 0;
 	buildstate.bs_reltuples = 0;
 	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
@@ -2007,6 +2027,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	buildstate.indtuples = 0;
	/* XXX shouldn't this initialize the other fields, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/*
 	 * create a temporary memory context that is used to hold data not yet
-- 
2.44.0

#5Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#3)
Re: Parallel CREATE INDEX for GIN indexes

Hello Tomas,

2) v20240502-0002-Use-mergesort-in-the-leader-process.patch

The approach implemented by 0001 works, but there's a little bit of
issue - if there are many distinct keys (e.g. for trigrams that can
happen very easily), the workers will hit the memory limit with only
very short TID lists for most keys. For serial build that means merging
the data into a lot of random places, and in parallel build it means the
leader will have to merge a lot of tiny lists from many sorted rows.

Which can be quite annoying and expensive, because the leader does so
using qsort() in the serial part. It'd be better to ensure most of the
sorting happens in the workers, and the leader can do a mergesort. But
the mergesort must not happen too often - merging many small lists is
not cheaper than a single qsort (especially when the lists overlap).

So this patch changes the workers to process the data in two phases. The
first works as before, but the data is flushed into a local tuplesort.
And then each worker sorts the results it produced, and combines them
into results with much larger TID lists, and those results are written
to the shared tuplesort. So the leader only gets very few lists to
combine for a given key - usually just one list per worker.

Hmm, I was hoping we could implement the merging inside the tuplesort
itself during its own flush phase, as it could save significantly on
IO, and could help other users of tuplesort with deduplication, too.

Would that happen in the worker or leader process? Because my goal was
to do the expensive part in the worker, because that's what helps with
the parallelization.

I guess both of you are talking about the worker process; here is
something on my mind:

*btbuild* also lets the WORKER dump the tuples into the Sharedsort
struct and lets the LEADER merge them directly. I think the aim of this
design is that it can potentially save a merge run. In the current
patch, the worker dumps to a local tuplesort and merges its runs, and
then the leader runs the merges again. I admit the goal of this patch
is reasonable, but I feel we need to adopt this approach conditionally
somehow, and if we find the way, we can apply it to btbuild as well.

--
Best Regards
Andy Fan

#6Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#1)
Re: Parallel CREATE INDEX for GIN indexes

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

3) v20240502-0003-Remove-the-explicit-pg_qsort-in-workers.patch

In 0002 the workers still do an explicit qsort() on the TID list before
writing the data into the shared tuplesort. But we can do better - the
workers can do a merge sort too. To help with this, we add the first TID
to the tuplesort tuple, and sort by that too - it helps the workers to
process the data in an order that allows simple concatenation instead of
the full mergesort.

Note: There's a non-obvious issue due to parallel scans always being
"sync scans", which may lead to very "wide" TID ranges when the scan
wraps around. More about that later.

This is really amazing.

7) v20240502-0007-Detect-wrap-around-in-parallel-callback.patch

There's one more efficiency problem - the parallel scans are required to
be synchronized, i.e. the scan may start half-way through the table, and
then wrap around. Which however means the TID list will have a very wide
range of TID values, essentially the min and max TID for the key.

Without 0006 this would cause frequent failures of the index build, with
the error I already mentioned:

ERROR: could not split GIN page; all old items didn't fit

I have two questions here, and both of them are general GIN index
questions rather than questions about this patch.

1. What does "wrap around" mean in "the scan may start half-way
through the table, and then wrap around"? Searching for "wrap" in
gin/README finds nothing.

2. I can't understand the below error.

ERROR: could not split GIN page; all old items didn't fit

When the posting list is too long, we have the posting tree strategy,
so in which situation could we get this ERROR?

issue with efficiency - having such a wide TID list forces the mergesort
to actually walk the lists, because this wide list overlaps with every
other list produced by the worker.

If we split the blocks among workers 1 block at a time, we will have a
serious issue like this. If we could hand out N blocks at a time, where
the N blocks roughly fill work_mem and thus make up a dedicated temp
file, could we make things much better?

--
Best Regards
Andy Fan

#7Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#6)
Re: Parallel CREATE INDEX for GIN indexes

On 5/9/24 12:14, Andy Fan wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

3) v20240502-0003-Remove-the-explicit-pg_qsort-in-workers.patch

In 0002 the workers still do an explicit qsort() on the TID list before
writing the data into the shared tuplesort. But we can do better - the
workers can do a merge sort too. To help with this, we add the first TID
to the tuplesort tuple, and sort by that too - it helps the workers to
process the data in an order that allows simple concatenation instead of
the full mergesort.

Note: There's a non-obvious issue due to parallel scans always being
"sync scans", which may lead to very "wide" TID ranges when the scan
wraps around. More about that later.

This is really amazing.

7) v20240502-0007-Detect-wrap-around-in-parallel-callback.patch

There's one more efficiency problem - the parallel scans are required to
be synchronized, i.e. the scan may start half-way through the table, and
then wrap around. Which however means the TID list will have a very wide
range of TID values, essentially the min and max TID for the key.

Without 0006 this would cause frequent failures of the index build, with
the error I already mentioned:

ERROR: could not split GIN page; all old items didn't fit

I have two questions here, and both of them are general GIN index
questions rather than questions about this patch.

1. What does "wrap around" mean in "the scan may start half-way
through the table, and then wrap around"? Searching for "wrap" in
gin/README finds nothing.

The "wrap around" is about the scan used to read data from the table
when building the index. A "sync scan" may start e.g. at TID (1000,0),
read till the end of the table, and then wrap around and return the
remaining part at the beginning of the table, for blocks 0-999.

This means the callback would not see a monotonically increasing
sequence of TIDs.

Which is why the serial build disables sync scans, allowing simply
appending values to the sorted list, and even with regular flushes of
data into the index we can simply append data to the posting lists.
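
For example (hypothetical numbers): a sync scan of a 2000-block table
may produce TIDs in the order (1000,1) .. (1999,n) followed by
(0,1) .. (999,n). The callback sees the TID sequence decrease exactly
once, at the wrap, which is where patch 0007 flushes the accumulated
state - so every TID list the worker writes out stays internally
monotonic.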

2. I can't understand the below error.

ERROR: could not split GIN page; all old items didn't fit

When the posting list is too long, we have the posting tree strategy,
so in which situation could we get this ERROR?

AFAICS the index build relies on the assumption that we only append data
to the TID list on a leaf page, and when the page gets split, the "old"
part will always fit. Which may not be true, if there was a wrap around
and we're adding low TID values to the list on the leaf page.

FWIW the error in dataBeginPlaceToPageLeaf looks like this:

if (!append || ItemPointerCompare(&maxOldItem, &remaining) >= 0)
    elog(ERROR, "could not split GIN page; all old items didn't fit");

It can fail simply because of the !append part.

I'm not sure why dataBeginPlaceToPageLeaf() relies on this assumption,
or with GIN details in general, and I haven't found any explanation. But
AFAIK this is why the serial build disables sync scans.

issue with efficiency - having such a wide TID list forces the mergesort
to actually walk the lists, because this wide list overlaps with every
other list produced by the worker.

If we split the blocks among workers 1 block at a time, we will have a
serious issue like this. If we could hand out N blocks at a time, where
the N blocks roughly fill work_mem and thus make up a dedicated temp
file, could we make things much better?

I don't understand the question. The blocks are distributed to workers
by the parallel table scan, and it certainly does not do that block by
block. But even if it did, that's not a problem for this code.

The problem is that if the scan wraps around, then one of the TID lists
for a given worker will have the min TID and max TID, so it will overlap
with every other TID list for the same key in that worker. And when the
worker does the merging, this list will force a "full" merge sort for
all TID lists (for that key), which is very expensive.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#8Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#5)
Re: Parallel CREATE INDEX for GIN indexes

On 5/9/24 11:44, Andy Fan wrote:

Hello Tomas,

2) v20240502-0002-Use-mergesort-in-the-leader-process.patch

The approach implemented by 0001 works, but there's a little bit of
issue - if there are many distinct keys (e.g. for trigrams that can
happen very easily), the workers will hit the memory limit with only
very short TID lists for most keys. For serial build that means merging
the data into a lot of random places, and in parallel build it means the
leader will have to merge a lot of tiny lists from many sorted rows.

Which can be quite annoying and expensive, because the leader does so
using qsort() in the serial part. It'd be better to ensure most of the
sorting happens in the workers, and the leader can do a mergesort. But
the mergesort must not happen too often - merging many small lists is
not cheaper than a single qsort (especially when the lists overlap).

So this patch changes the workers to process the data in two phases. The
first works as before, but the data is flushed into a local tuplesort.
And then each worker sorts the results it produced, and combines them
into results with much larger TID lists, and those results are written
to the shared tuplesort. So the leader only gets very few lists to
combine for a given key - usually just one list per worker.

Hmm, I was hoping we could implement the merging inside the tuplesort
itself during its own flush phase, as it could save significantly on
IO, and could help other users of tuplesort with deduplication, too.

Would that happen in the worker or leader process? Because my goal was
to do the expensive part in the worker, because that's what helps with
the parallelization.

I guess both of you are talking about worker process, if here are
something in my mind:

*btbuild* also let the WORKER dump the tuples into Sharedsort struct
and let the LEADER merge them directly. I think this aim of this design
is it is potential to save a mergeruns. In the current patch, worker dump
to local tuplesort and mergeruns it and then leader run the merges
again. I admit the goal of this patch is reasonable, but I'm feeling we
need to adapt this way conditionally somehow. and if we find the way, we
can apply it to btbuild as well.

I'm a bit confused about what you're proposing here, or how it is
related to what this patch is doing and/or to what Matthias mentioned
in his e-mail from last week.

Let me explain the relevant part of the patch, and how I understand the
improvement suggested by Matthias. The patch does the work in three phases:

1) Worker gets data from the table, splits that into index items, and
adds those into a "private" tuplesort, and finally sorts that. So a worker
may see a key many times, with different TIDs, so the tuplesort may
contain many items for the same key, with distinct TID lists:

key1: 1, 2, 3, 4
key1: 5, 6, 7
key1: 8, 9, 10
key2: 1, 2, 3
...

2) Worker reads the sorted data, and combines TIDs for the same key into
larger TID lists, depending on work_mem etc. and writes the result into
a shared tuplesort. So the worker may write this:

key1: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
key2: 1, 2, 3

3) Leader reads this, combines TID lists from all workers (using a
higher memory limit, probably), and writes the result into the index.

Step (2) is optional - it would work without it. But it helps, as it
moves a potentially expensive sort into the workers (and thus the
parallel part of the build), and it also makes it cheaper, because in a
single worker the lists do not overlap and thus can be simply appended.
Which in the leader is not the case, forcing an expensive mergesort.
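
In code terms, phase (2) is roughly the following loop. This is only a
sketch: tuplesort_getgintuple() and emit_combined_tuple() are
hypothetical stand-ins (the latter for the _gin_build_tuple() +
tuplesort_putgintuple() sequence), while the GinBuffer helpers are the
ones from the patch:

  GinBuffer  *buffer = GinBufferInit();
  GinTuple   *tup;

  while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
  {
      /* key changed - flush the combined TID list built so far */
      if (!GinBufferIsEmpty(buffer) && !GinBufferKeyEquals(buffer, tup))
      {
          emit_combined_tuple(state, buffer);	/* write to shared sort */
          GinBufferReset(buffer);
      }

      /* merge this tuple's TID list into the buffer */
      GinBufferStoreTuple(buffer, tup);
  }

  /* flush the last key */
  if (!GinBufferIsEmpty(buffer))
      emit_combined_tuple(state, buffer);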

The trouble with (2) is that it "just copies" data from one tuplesort
into another, increasing the disk space requirements. In an extreme
case, when nothing can be combined, it pretty much doubles the amount of
disk space, and makes the build longer.

What I think Matthias is suggesting, is that this "TID list merging"
could be done directly as part of the tuplesort in step (1). So instead
of just moving the "sort tuples" from the appropriate runs, it could
also do an optional step of combining the tuples and writing this
combined tuple into the tuplesort result (for that worker).

Matthias also mentioned this might be useful when building btree indexes
with key deduplication.

AFAICS this might work, although it probably requires for the "combined"
tuple to be smaller than the sum of the combined tuples (in order to fit
into the space). But at least in the GIN build in the workers this is
likely true, because the TID lists do not overlap (and thus not hurting
the compressibility).

That being said, I still see this more as an optimization than something
required for the patch, and I don't think I'll have time to work on this
anytime soon. The patch is not extremely complex, but it's not trivial
either. But if someone wants to take a stab at extending tuplesort to
allow this, I won't object ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#3)
Re: Parallel CREATE INDEX for GIN indexes

On 5/2/24 20:22, Tomas Vondra wrote:

For some of the opclasses it can regress (like the jsonb_path_ops). I
don't think that's a major issue. Or more precisely, I'm not surprised
by it. It'd be nice to be able to disable the parallel builds in these
cases somehow, but I haven't thought about that.

Do you know why it regresses?

No, but one thing that stands out is that the index is much smaller than
the other columns/opclasses, and the compression does not save much
(only about 5% for both phases). So I assume it's the overhead of
writing writing and reading a bunch of GB of data without really gaining
much from doing that.

I finally got to look into this regression, but I think I must have done
something wrong before, because I can't reproduce it. These are the
timings I get now, if I rerun the benchmark:

workers trgm tsvector jsonb jsonb (hash)
-------------------------------------------------------
0 1225 404 104 56
1 772 180 57 60
2 549 143 47 52
3 426 127 43 50
4 364 116 40 48
5 323 111 38 46
6 292 111 37 45

and the speedup, relative to serial build:

workers trgm tsvector jsonb jsonb (hash)
--------------------------------------------------------
1 63% 45% 54% 108%
2 45% 35% 45% 94%
3 35% 31% 41% 89%
4 30% 29% 38% 86%
5 26% 28% 37% 83%
6 24% 28% 35% 81%

So there's a small regression for the jsonb_path_ops opclass, but only
with one worker. After that, it gets a bit faster than serial build.
While not a great speedup, it's far better than the earlier results that
showed maybe 40% regression.

I don't know what I did wrong before - maybe I had a build with extra
debug info or something like that? No idea why that would affect only
one of the opclasses. But this time I made doubly sure the results are
correct etc.

Anyway, I'm fairly happy with these results. I don't think it's
surprising there are cases where parallel build does not help much.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#10Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#8)
Re: Parallel CREATE INDEX for GIN indexes

On Thu, 9 May 2024 at 15:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Let me explain the relevant part of the patch, and how I understand the
improvement suggested by Matthias. The patch does the work in three phases:

1) Worker gets data from the table, splits that into index items, and
adds those into a "private" tuplesort, and finally sorts that. So a worker
may see a key many times, with different TIDs, so the tuplesort may
contain many items for the same key, with distinct TID lists:

key1: 1, 2, 3, 4
key1: 5, 6, 7
key1: 8, 9, 10
key2: 1, 2, 3
...

This step is actually split into several components/phases, too.
As opposed to btree, which directly puts each tuple's data into the
tuplesort, this GIN approach actually buffers the tuples in memory to
generate these TID lists for data keys, and flushes these pairs of Key
+ TID list into the tuplesort when its own memory limit is exceeded.
That means we essentially double the memory used for this data: One
GIN deform buffer, and one in-memory sort buffer in the tuplesort.
This is fine for now, but feels duplicative, hence my "let's allow
tuplesort to merge the key+TID pairs into pairs of key+TID list"
comment.

The trouble with (2) is that it "just copies" data from one tuplesort
into another, increasing the disk space requirements. In an extreme
case, when nothing can be combined, it pretty much doubles the amount of
disk space, and makes the build longer.

What I think Matthias is suggesting, is that this "TID list merging"
could be done directly as part of the tuplesort in step (1). So instead
of just moving the "sort tuples" from the appropriate runs, it could
also do an optional step of combining the tuples and writing this
combined tuple into the tuplesort result (for that worker).

Yes, but with a slightly more extensive approach than that even, see above.

Matthias also mentioned this might be useful when building btree indexes
with key deduplication.

AFAICS this might work, although it probably requires for the "combined"
tuple to be smaller than the sum of the combined tuples (in order to fit
into the space).

*must not be larger than the sum; not "must be smaller than the sum" [^0].
For btree tuples with posting lists this is guaranteed to be true: The
added size of a btree tuple with a posting list (containing at least 2
values) vs one without is the maxaligned size of 2 TIDs, or 16 bytes
(12 on 32-bit systems). The smallest btree tuple with data is also 16
bytes (or 12 bytes on 32-bit systems), so this works out nicely.
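
Spelling the arithmetic out: sizeof(ItemPointerData) is 6 bytes, so two
TIDs take 12 bytes, and MAXALIGN(12) = 16 with the usual 8-byte maximum
alignment (or 12 with 4-byte alignment on 32-bit systems).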

But at least in the GIN build in the workers this is
likely true, because the TID lists do not overlap (and thus not hurting
the compressibility).

That being said, I still see this more as an optimization than something
required for the patch,

Agreed.

and I don't think I'll have time to work on this
anytime soon. The patch is not extremely complex, but it's not trivial
either. But if someone wants to take a stab at extending tuplesort to
allow this, I won't object ...

Same here: While I do have some ideas on where and how to implement
this, I'm not planning on working on that soon.

Kind regards,

Matthias van de Meent

[^0] There's some overhead in the tuplesort serialization too, so
there is some leeway there, too.

#11Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Matthias van de Meent (#10)
Re: Parallel CREATE INDEX for GIN indexes

On 5/9/24 17:51, Matthias van de Meent wrote:

On Thu, 9 May 2024 at 15:13, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

Let me explain the relevant part of the patch, and how I understand the
improvement suggested by Matthias. The patch does the work in three phases:

1) Worker gets data from the table, splits that into index items, and
adds those into a "private" tuplesort, and finally sorts that. So a worker
may see a key many times, with different TIDs, so the tuplesort may
contain many items for the same key, with distinct TID lists:

key1: 1, 2, 3, 4
key1: 5, 6, 7
key1: 8, 9, 10
key2: 1, 2, 3
...

This step is actually split into several components/phases, too.
As opposed to btree, which directly puts each tuple's data into the
tuplesort, this GIN approach actually buffers the tuples in memory to
generate these TID lists for data keys, and flushes these pairs of Key
+ TID list into the tuplesort when its own memory limit is exceeded.
That means we essentially double the memory used for this data: One
GIN deform buffer, and one in-memory sort buffer in the tuplesort.
This is fine for now, but feels duplicative, hence my "let's allow
tuplesort to merge the key+TID pairs into pairs of key+TID list"
comment.

True, although the "GIN deform buffer" (flushed by the callback if using
too much memory) likely does most of the merging already. If it only
happened in the tuplesort merge, we'd likely have far more tuples and
overhead associated with that. So we certainly won't get rid of either
of these things.

You're right the memory limits are a bit unclear, and need more thought.
I certainly have not thought very much about not using more than the
specified maintenance_work_mem amount. This includes the planner code
determining the number of workers to use - right now it simply does the
same thing as for btree/brin, i.e. assumes each workers uses 32MB of
memory and checks how many workers fit into maintenance_work_mem.

That was a bit bogus even for BRIN, because BRIN sorts only summaries,
which is typically tiny - perhaps a couple kB, much less than 32MB. But
it's still just one sort, and some opclasses may be much larger (like
bloom, for example). So I just went with it.

But for GIN it's more complicated, because we have two tuplesorts (not
sure if both can use the memory at the same time) and the GIN deform
buffer. Which probably means we need to have a per-worker allowance
considering all these buffers.

The trouble with (2) is that it "just copies" data from one tuplesort
into another, increasing the disk space requirements. In an extreme
case, when nothing can be combined, it pretty much doubles the amount of
disk space, and makes the build longer.

What I think Matthias is suggesting, is that this "TID list merging"
could be done directly as part of the tuplesort in step (1). So instead
of just moving the "sort tuples" from the appropriate runs, it could
also do an optional step of combining the tuples and writing this
combined tuple into the tuplesort result (for that worker).

Yes, but with a slightly more extensive approach than that even, see above.

Matthias also mentioned this might be useful when building btree indexes
with key deduplication.

AFAICS this might work, although it probably requires for the "combined"
tuple to be smaller than the sum of the combined tuples (in order to fit
into the space).

*must not be larger than the sum; not "must be smaller than the sum" [^0].

Yeah, I wrote that wrong.

For btree tuples with posting lists this is guaranteed to be true: The
added size of a btree tuple with a posting list (containing at least 2
values) vs one without is the maxaligned size of 2 TIDs, or 16 bytes
(12 on 32-bit systems). The smallest btree tuple with data is also 16
bytes (or 12 bytes on 32-bit systems), so this works out nicely.

But at least in the GIN build in the workers this is
likely true, because the TID lists do not overlap (and thus not hurting
the compressibility).

That being said, I still see this more as an optimization than something
required for the patch,

Agreed.

OK

and I don't think I'll have time to work on this
anytime soon. The patch is not extremely complex, but it's not trivial
either. But if someone wants to take a stab at extending tuplesort to
allow this, I won't object ...

Same here: While I do have some ideas on where and how to implement
this, I'm not planning on working on that soon.

Understood. I don't have a very good intuition on how significant the
benefit could be, which is one of the reasons why I have not prioritized
this very much.

I did a quick experiment, to measure how expensive it is to build the
second worker tuplesort - for the pg_trgm index build with 2 workers, it
takes ~30 seconds. The index build takes ~550s in total, so 30s is ~5%.
If we eliminated all of this work we'd save those ~30 seconds, but in
reality some of it will still be necessary.

Perhaps it's more significant for other indexes / slower storage, but it
does not seem like a *must have* for v1.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#8)
Re: Parallel CREATE INDEX for GIN indexes

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

I guess both of you are talking about the worker process; here is
something on my mind:

*btbuild* also lets the WORKER dump the tuples into the Sharedsort struct
and lets the LEADER merge them directly. I think the aim of this design
is that it can potentially save a mergerun. In the current patch, the
worker dumps to a local tuplesort and mergeruns it, and then the leader
runs the merges again. I admit the goal of this patch is reasonable, but
I feel we need to adopt this approach conditionally somehow - and if we
find the way, we can apply it to btbuild as well.

I'm a bit confused about what you're proposing here, or how it is
related to what this patch is doing and/or to what Matthias
mentioned in his e-mail from last week.

Let me explain the relevant part of the patch, and how I understand the
improvement suggested by Matthias. The patch does the work in three phases:

What's in my mind is:

1. WORKER-1

Tempfile 1:

key1: 1
key3: 2
...

Tempfile 2:

key5: 3
key7: 4
...

2. WORKER-2

Tempfile 1:

Key2: 1
Key6: 2
...

Tempfile 2:
Key3: 3
Key6: 4
..

In the above example, if we do the merge in the LEADER, only 1 mergerun
is needed: reading 4 tempfiles (8 tuples in total) and writing 8 tuples.

If we add another mergerun in the WORKER, the result will be:

WORKER1: reads 2 tempfiles (4 tuples) and writes 1 tempfile (called X)
with 4 tuples.
WORKER2: reads 2 tempfiles (4 tuples) and writes 1 tempfile (called Y)
with 4 tuples.

LEADER: reads 2 tempfiles (X & Y), 8 tuples in total, and writes them
into the final tempfile.

So the intermediate results X & Y require some extra effort, which is
why I think the "extra mergerun in worker" is *not always* a win; my
proposal is that we need to distinguish the cases and decide in which
ones we should add the "extra mergerun in worker" step.

The trouble with (2) is that it "just copies" data from one tuplesort
into another, increasing the disk space requirements. In an extreme
case, when nothing can be combined, it pretty much doubles the amount of
disk space, and makes the build longer.

This sounds like the same question as I talked about above. However, my
proposal is to determine which cost is bigger: "the cost saved by
merging TIDs in WORKERS" or "the cost paid for the extra copy", and then
do it only when we are sure we can benefit from it. But I know that is
hard, and I'm not sure it is doable.

What I think Matthias is suggesting, is that this "TID list merging"
could be done directly as part of the tuplesort in step (1). So instead
of just moving the "sort tuples" from the appropriate runs, it could
also do an optional step of combining the tuples and writing this
combined tuple into the tuplesort result (for that worker).

OK, I get it now. So we have talked about lots of merges so far, at
different stages and for different sets of tuples.

1. The "GIN deform buffer" merges the TIDs for the same key for the
tuples in one "deform buffer" batch, as current master already does.

2. The "in-memory buffer sort" stage; currently there is no TID merge
here, and Matthias suggests adding one.

3. Merging the TIDs for the same keys in the LEADER only, vs. in the
WORKER first and then the LEADER; this is what your 0002 commit does
now, and I raised some concerns about it above.

Matthias also mentioned this might be useful when building btree indexes
with key deduplication.

AFAICS this might work, although it probably requires the "combined"
tuple to be smaller than the sum of the combined tuples (in order to fit
into the space). But at least in the GIN build in the workers this is
likely true, because the TID lists do not overlap (and thus do not hurt
compressibility).

That being said, I still see this more as an optimization than something
required for the patch,

If the GIN deform buffer is big enough (e.g. greater than the in-memory
buffer sort), would we gain anything from this, given that the scope is
just the tuples in the in-memory buffer sort?

and I don't think I'll have time to work on this
anytime soon. The patch is not extremely complex, but it's not trivial
either. But if someone wants to take a stab at extending tuplesort to
allow this, I won't object ...

Agree with this. I am more interested in understanding the whole design
and the scope of this patch, and then I can do some code review and
testing; for now, I am still in the "understanding design and scope"
stage. If I'm too slow with this patch, please feel free to commit it at
any time - I don't expect to find any valuable improvements or bugs. I
probably need another 1-2 weeks to study this patch.

--
Best Regards
Andy Fan

#13Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#12)
Re: Parallel CREATE INDEX for GIN indexes

On 5/10/24 07:53, Andy Fan wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

I guess both of you are talking about the worker process; here is
something on my mind:

*btbuild* also lets the WORKER dump the tuples into the Sharedsort struct
and lets the LEADER merge them directly. I think the aim of this design
is that it can potentially save a mergerun. In the current patch, the
worker dumps to a local tuplesort and mergeruns it, and then the leader
runs the merges again. I admit the goal of this patch is reasonable, but
I feel we need to adopt this approach conditionally somehow - and if we
find the way, we can apply it to btbuild as well.

I'm a bit confused about what you're proposing here, or how it is
related to what this patch is doing and/or to what Matthias
mentioned in his e-mail from last week.

Let me explain the relevant part of the patch, and how I understand the
improvement suggested by Matthias. The patch does the work in three phases:

What's in my mind is:

1. WORKER-1

Tempfile 1:

key1: 1
key3: 2
...

Tempfile 2:

key5: 3
key7: 4
...

2. WORKER-2

Tempfile 1:

Key2: 1
Key6: 2
...

Tempfile 2:
Key3: 3
Key6: 4
..

In the above example, if we do the merge in the LEADER, only 1 mergerun
is needed: reading 4 tempfiles (8 tuples in total) and writing 8 tuples.

If we add another mergerun in the WORKER, the result will be:

WORKER1: reads 2 tempfiles (4 tuples) and writes 1 tempfile (called X)
with 4 tuples.
WORKER2: reads 2 tempfiles (4 tuples) and writes 1 tempfile (called Y)
with 4 tuples.

LEADER: reads 2 tempfiles (X & Y), 8 tuples in total, and writes them
into the final tempfile.

So the intermediate results X & Y require some extra effort, which is
why I think the "extra mergerun in worker" is *not always* a win; my
proposal is that we need to distinguish the cases and decide in which
ones we should add the "extra mergerun in worker" step.

The thing you're forgetting is that the mergesort in the worker is
*always* a simple append, because the lists are guaranteed to be
non-overlapping, so it's very cheap. The lists from different workers
are however very likely to overlap, and hence a "full" mergesort is
needed, which is way more expensive.

And not only that - without the intermediate merge, there will be very
many of those lists the leader would have to merge.

If we do the append-only merges in the workers first, we still need to
merge them in the leader, of course, but we have few lists to merge
(only about one per worker).
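
To illustrate why the in-worker merge is so cheap, here is a minimal
sketch (a hypothetical helper, not the patch's actual code, which uses
ginMergeItemPointers): for lists with non-overlapping TID ranges, the
"merge" needs no per-item comparisons at all.

static void
merge_nonoverlapping_tids(ItemPointerData *dst,
                          const ItemPointerData *a, int na,
                          const ItemPointerData *b, int nb)
{
    /* caller guarantees a[na-1] < b[0], so this is a plain append */
    memcpy(dst, a, na * sizeof(ItemPointerData));
    memcpy(dst + na, b, nb * sizeof(ItemPointerData));
}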

Of course, this means extra I/O on the intermediate tuplesort, and it's
not difficult to imagine cases with no benefit, or perhaps even a
regression. For example, if the keys are unique, the in-worker merge
step can't really do anything. But that seems quite unlikely IMHO.

Also, if this overhead were really significant, we would not see the nice
speedups I measured during testing.

The trouble with (2) is that it "just copies" data from one tuplesort
into another, increasing the disk space requirements. In an extreme
case, when nothing can be combined, it pretty much doubles the amount of
disk space, and makes the build longer.

This sounds like the same question as I talked about above. However, my
proposal is to determine which cost is bigger: "the cost saved by
merging TIDs in WORKERS" or "the cost paid for the extra copy", and then
do it only when we are sure we can benefit from it. But I know that is
hard, and I'm not sure it is doable.

Yeah. I'm not against picking the right execution strategy during the
index build, but it's going to be difficult, because we really don't
have the information to make a reliable decision.

We can't even use the per-column stats, because they don't say much
about the keys extracted by GIN, I think. And we need to make the
decision at the very beginning, before we write the first batch of data
to either the local or shared tuplesort.

But maybe we could wait until we need to flush the first batch of data
(in the callback), and make the decision then? In principle, if we only
flush once at the end, the intermediate sort is not needed at all
(fairly unlikely for large data sets, though).

Well, in principle, maybe we could even start writing into the local
tuplesort, and then "rethink" after a while and switch to the shared
one. We'd still need to copy data we've already written to the local
tuplesort, but hopefully that'd be just a fraction compared to doing
that for the whole table.
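
A very rough sketch of what that lazy decision could look like
(everything here is hypothetical - neither the fields nor
looks_compressible() exist in the patch):

static void
maybe_choose_strategy(GinBuildState *buildstate)
{
    /* hypothetical fields, for illustration only */
    if (buildstate->bs_strategy_decided)
        return;

    /*
     * First flush from the callback: if the accumulator merged many
     * TIDs per key, the in-worker merge is likely to pay off.
     * looks_compressible() is a made-up placeholder for whatever
     * heuristic we'd come up with.
     */
    buildstate->bs_use_local_sort = looks_compressible(&buildstate->accum);
    buildstate->bs_strategy_decided = true;
}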

What I think Matthias is suggesting, is that this "TID list merging"
could be done directly as part of the tuplesort in step (1). So instead
of just moving the "sort tuples" from the appropriate runs, it could
also do an optional step of combining the tuples and writing this
combined tuple into the tuplesort result (for that worker).

OK, I get it now. So we have talked about lots of merges so far, at
different stages and for different sets of tuples.

1. The "GIN deform buffer" merges the TIDs for the same key for the
tuples in one "deform buffer" batch, as current master already does.

2. The "in-memory buffer sort" stage; currently there is no TID merge
here, and Matthias suggests adding one.

3. Merging the TIDs for the same keys in the LEADER only, vs. in the
WORKER first and then the LEADER; this is what your 0002 commit does
now, and I raised some concerns about it above.

OK

Matthias also mentioned this might be useful when building btree indexes
with key deduplication.

AFAICS this might work, although it probably requires the "combined"
tuple to be smaller than the sum of the combined tuples (in order to fit
into the space). But at least in the GIN build in the workers this is
likely true, because the TID lists do not overlap (and thus do not hurt
compressibility).

That being said, I still see this more as an optimization than something
required for the patch,

If the GIN deform buffer is big enough (e.g. greater than the in-memory
buffer sort), would we gain anything from this, given that the scope is
just the tuples in the in-memory buffer sort?

I don't think this is very likely. The only case when the GIN deform
buffer is "big enough" is when we don't need to flush in the callback,
but that is going to happen only for "small" tables. And for those we
should not really do parallel builds. And even if we do, the overhead
would be pretty insignificant.

and I don't think I'll have time to work on this
anytime soon. The patch is not extremely complex, but it's not trivial
either. But if someone wants to take a stab at extending tuplesort to
allow this, I won't object ...

Agree with this. I am more interested in understanding the whole design
and the scope of this patch, and then I can do some code review and
testing; for now, I am still in the "understanding design and scope"
stage. If I'm too slow with this patch, please feel free to commit it at
any time - I don't expect to find any valuable improvements or bugs. I
probably need another 1-2 weeks to study this patch.

Sure, happy to discuss and answer questions.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#14Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#7)
Re: Parallel CREATE INDEX for GIN indexes

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

7) v20240502-0007-Detect-wrap-around-in-parallel-callback.patch

There's one more efficiency problem - the parallel scans are required to
be synchronized, i.e. the scan may start half-way through the table, and
then wrap around. Which however means the TID list will have a very wide
range of TID values, essentially the min and max for the key.

I have two questions here, and both of them are general GIN index
questions rather than questions about this patch.

1. What does "wrap around" mean in "the scan may start half-way
through the table, and then wrap around"? Searching for "wrap" in
gin/README turns up nothing.

The "wrap around" is about the scan used to read data from the table
when building the index. A "sync scan" may start e.g. at TID (1000,0)
and read till the end of the table, and then wraps and returns the
remaining part at the beginning of the table for blocks 0-999.

This means the callback would not see a monotonically increasing
sequence of TIDs.

Which is why the serial build disables sync scans - that allows simply
appending values to the sorted list, and even with regular flushes of
data into the index we can simply append data to the posting lists.
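
For illustration, the block order a sync scan produces is essentially
this (a simplified sketch, not the actual syncscan.c code):

/*
 * A sync scan over nblocks blocks starting at block `start` visits
 * block (start + i) % nblocks at step i - e.g. 1000, 1001, ..., 1199,
 * 0, 1, ..., 999 - so the build callback does not see monotonically
 * increasing TIDs.
 */
static BlockNumber
syncscan_block_at(BlockNumber start, BlockNumber nblocks, BlockNumber i)
{
    return (start + i) % nblocks;
}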

Thanks for the hints - I see now that the sync strategy comes from
syncscan.c.

Without 0006 this would cause frequent failures of the index build, with
the error I already mentioned:

ERROR: could not split GIN page; all old items didn't fit

2. I can't understand the error below.

ERROR: could not split GIN page; all old items didn't fit

if (!append || ItemPointerCompare(&maxOldItem, &remaining) >= 0)
elog(ERROR, "could not split GIN page; all old items didn't fit");

It can fail simply because of the !append part.

Got it, Thanks!

If we split the blocks among workers one block at a time, we will have a
serious issue like this. If we could instead distribute them N blocks at
a time, with N chosen so that the blocks roughly fill work_mem before
spilling to a dedicated temp file, we could make things much better,
couldn't we?

I don't understand the question. The blocks are distributed to workers
by the parallel table scan, and it certainly does not do that block by
block. But even if it did, that's not a problem for this code.

OK, I get ParallelBlockTableScanWorkerData.phsw_chunk_size is designed
for this.

The problem is that if the scan wraps around, then one of the TID lists
for a given worker will have the min TID and max TID, so it will overlap
with every other TID list for the same key in that worker. And when the
worker does the merging, this list will force a "full" merge sort for
all TID lists (for that key), which is very expensive.

OK.

Thanks for all the answers, they are pretty instructive!

--
Best Regards
Andy Fan

#15Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#14)
Re: Parallel CREATE INDEX for GIN indexes

On 5/13/24 10:19, Andy Fan wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

...

I don't understand the question. The blocks are distributed to workers
by the parallel table scan, and it certainly does not do that block by
block. But even if it did, that's not a problem for this code.

OK, I get ParallelBlockTableScanWorkerData.phsw_chunk_size is designed
for this.

The problem is that if the scan wraps around, then one of the TID lists
for a given worker will have the min TID and max TID, so it will overlap
with every other TID list for the same key in that worker. And when the
worker does the merging, this list will force a "full" merge sort for
all TID lists (for that key), which is very expensive.

OK.

Thanks for all the answers, they are pretty instructive!

Thanks for the questions - they force me to articulate the arguments
more clearly. I guess it'd be good to put some of this into a README or at
least a comment at the beginning of gininsert.c or somewhere close.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#15)
Re: Parallel CREATE INDEX for GIN indexes

Hi Tomas,

I have completed my first round of review; generally it looks good to
me, though more testing needs to be done in the next few days. Here are
some tiny comments from my side, just FYI.

1. The comment about GinBuildState.bs_leader doesn't look right to me.

/*
* bs_leader is only present when a parallel index build is performed, and
* only in the leader process. (Actually, only the leader process has a
* GinBuildState.)
*/
GinLeader *bs_leader;

In the worker function _gin_parallel_build_main,
initGinState(&buildstate.ginstate, indexRel) is called, and workers
initialize at least the following members: buildstate.funcCtx,
buildstate.accum and so on. So is the comment "only the leader process
has a GinBuildState" correct?

2. The progress argument is not used?
_gin_parallel_scan_and_build(GinBuildState *state,
GinShared *ginshared, Sharedsort *sharedsort,
Relation heap, Relation index,
int sortmem, bool progress)

3. In the function tuplesort_begin_index_gin, the comment about nKeys
took me some time to figure out - why 1 is correct (rather than
IndexRelationGetNumberOfKeyAttributes), and what "only the index key"
means.

base->nKeys = 1; /* Only the index key */

Finally I think it is because a GIN index stores each attribute value
as an individual index entry for a multi-column index, so each index
entry has only 1 key. So could we comment it as follows?

"A GIN index stores the value of each attribute as a separate index
entry for a multi-column index, so each index entry always has only 1
key." This probably makes it easier to understand.

4. GinBuffer: The comment "Similar purpose to BuildAccumulator, but much
simpler." made me wonder why we need a simpler but similar structure.
After some thought, they are similar only in that both accumulate TIDs;
GinBuffer is designed for a single key value (hence GinBufferCanAddKey).
So IMO the first comment is good enough, and the second one just
confuses newcomers - it could potentially be removed.

/*
* State used to combine accumulate TIDs from multiple GinTuples for the same
* key value.
*
* XXX Similar purpose to BuildAccumulator, but much simpler.
*/
typedef struct GinBuffer

5. GinBuffer: ginMergeItemPointers always allocates new memory for the
new items, and hence we have to pfree the old memory each time. However,
that is not necessary in some places - for example, the new items could
often simply be appended to Buffer->items (and this should be a common
case). So could we pre-allocate some space for items, to reduce the
number of pfree/palloc calls and save some copying of TID items in the
desired case? (A rough sketch of what I mean follows after this list.)

6. GinTuple has "ItemPointerData first; /* first TID in the array */".

Is ItemPointerData.ip_blkid good enough for its purpose? If so, we can
save the memory of the OffsetNumber in each GinTuple.

Items 5) and 6) need some coding and testing. If it is OK to do, I'd
like to take them as an exercise in this area (also including items
1-4).
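
Here is a rough sketch of the idea from item 5 (a hypothetical helper
built on the GinBuffer struct from the patch; it assumes the incoming
TIDs all sort after the already-buffered ones):

static void
gin_buffer_append(GinBuffer *buffer, const ItemPointerData *items, int nitems)
{
    /* grow the array in place (doubling), instead of palloc + pfree */
    if (buffer->nitems + nitems > buffer->maxitems)
    {
        buffer->maxitems = Max(buffer->maxitems * 2,
                               buffer->nitems + nitems);
        buffer->items = (buffer->items == NULL)
            ? palloc(buffer->maxitems * sizeof(ItemPointerData))
            : repalloc(buffer->items,
                       buffer->maxitems * sizeof(ItemPointerData));
    }

    /* append without copying the existing TIDs anywhere */
    memcpy(&buffer->items[buffer->nitems], items,
           nitems * sizeof(ItemPointerData));
    buffer->nitems += nitems;
}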

--
Best Regards
Andy Fan

#17Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#16)
11 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

Hi Andy,

Thanks for the review. Here's an updated patch series, addressing most
of the points you've raised - I've kept them in "fixup" patches for now,
should be merged into 0001.

More detailed responses below.

On 5/28/24 11:29, Andy Fan wrote:

Hi Tomas,

I have completed my first round of review; generally it looks good to
me, though more testing needs to be done in the next few days. Here are
some tiny comments from my side, just FYI.

1. The comment about GinBuildState.bs_leader doesn't look right to me.

/*
* bs_leader is only present when a parallel index build is performed, and
* only in the leader process. (Actually, only the leader process has a
* GinBuildState.)
*/
GinLeader *bs_leader;

In the worker function _gin_parallel_build_main,
initGinState(&buildstate.ginstate, indexRel) is called, and workers
initialize at least the following members: buildstate.funcCtx,
buildstate.accum and so on. So is the comment "only the leader process
has a GinBuildState" correct?

Yeah, this is misleading. I don't remember exactly what my reasoning for
this wording was; I've removed the comment.

2. The progress argument is not used?
_gin_parallel_scan_and_build(GinBuildState *state,
GinShared *ginshared, Sharedsort *sharedsort,
Relation heap, Relation index,
int sortmem, bool progress)

I've modified the code to use the progress flag, but now that I look at
it I'm a bit unsure I understand the purpose of this. I've modeled this
after what the btree does, and I see that there are two places calling
_bt_parallel_scan_and_sort:

1) _bt_leader_participate_as_worker: progress=true

2) _bt_parallel_build_main: progress=false

Isn't that a bit weird? AFAIU the progress will be updated only by the
leader, but will that progress be correct? And doesn't that mean that if
the leader does not participate as a worker, the progress won't be updated?

FWIW The parallel BRIN code has the same issue - it's not using the
progress flag in _brin_parallel_scan_and_build.

3. In the function tuplesort_begin_index_gin, the comment about nKeys
took me some time to figure out - why 1 is correct (rather than
IndexRelationGetNumberOfKeyAttributes), and what "only the index key"
means.

base->nKeys = 1; /* Only the index key */

Finally I think it is because a GIN index stores each attribute value
as an individual index entry for a multi-column index, so each index
entry has only 1 key. So could we comment it as follows?

"A GIN index stores the value of each attribute as a separate index
entry for a multi-column index, so each index entry always has only 1
key." This probably makes it easier to understand.

OK, I see what you mean. The other tuplesort_begin_ functions nearby
have similar comments, but you're right GIN is a bit special in that it
"splits" multi-column indexes into individual index entries. I've added
a comment (hopefully) clarifying this.
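
To spell out why one key is enough, a simplified sketch (not the actual
comparator in the patch): the sort compares the attribute number and
category before the single key value, so a multi-column index still
sorts on one key per GinTuple.

static int
gin_tuple_cmp_sketch(const GinTuple *a, const GinTuple *b)
{
    /* multi-column indexes are decomposed into per-attribute entries */
    if (a->attrnum != b->attrnum)
        return (a->attrnum < b->attrnum) ? -1 : 1;
    if (a->category != b->category)
        return (a->category < b->category) ? -1 : 1;
    /* ... only then is the single key value compared ... */
    return 0;
}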

4. GinBuffer: The comment "Similar purpose to BuildAccumulator, but much
simpler." made me wonder why we need a simpler but similar structure.
After some thought, they are similar only in that both accumulate TIDs;
GinBuffer is designed for a single key value (hence GinBufferCanAddKey).
So IMO the first comment is good enough, and the second one just
confuses newcomers - it could potentially be removed.

/*
* State used to combine accumulate TIDs from multiple GinTuples for the same
* key value.
*
* XXX Similar purpose to BuildAccumulator, but much simpler.
*/
typedef struct GinBuffer

I've updated the comment to explain the differences a bit more clearly.

5. GinBuffer: ginMergeItemPointers always allocates new memory for the
new items, and hence we have to pfree the old memory each time. However,
that is not necessary in some places - for example, the new items could
often simply be appended to Buffer->items (and this should be a common
case). So could we pre-allocate some space for items, to reduce the
number of pfree/palloc calls and save some copying of TID items in the
desired case?

Perhaps, but that seems rather independent of this patch.

Also, I'm not sure how much this optimization would matter in practice.
The merge should happen fairly rarely, when we decide to store the TIDs
into the index. And then it's also subject to the caching built into the
memory contexts, limiting the malloc costs. We'll still pay for the
memcpy, of course.

Anyway, it's an optimization that would affect existing callers of
ginMergeItemPointers. I don't plan to tweak this in this patch.

6. GinTuple has "ItemPointerData first; /* first TID in the array */".

Is ItemPointerData.ip_blkid good enough for its purpose? If so, we can
save the memory of the OffsetNumber in each GinTuple.

Items 5) and 6) need some coding and testing. If it is OK to do, I'd
like to take them as an exercise in this area (also including items
1-4).

It might save 2 bytes in the struct, but that's negligible compared to
the memory usage overall (we only keep one GinTuple, but many TIDs and
so on), and we allocate the space in a power-of-2 pattern anyway (which
means the 2B won't matter).

Moreover, using just the block number would make it harder to compare
the TIDs (now we can just call ItemPointerCompare).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v20240619-0001-Allow-parallel-create-for-GIN-indexes.patch (text/x-patch)
From fec86016bdd9c52aae9750b0ff53e6087b6b774b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 19 Jun 2024 12:42:24 +0200
Subject: [PATCH v20240619 01/11] Allow parallel create for GIN indexes

Add support for parallel create of GIN indexes, using an approach and
code very similar to the one used by BRIN indexes.

Each worker reads a subset of the table (from a parallel scan), and
accumulates index entries in memory. But instead of writing the results
into the index (after hitting the memory limit), the data are written
to a shared tuplesort (and sorted by index key).

The leader then reads data from the tuplesort, and combines them into
entries that get inserted into the index.
---
 src/backend/access/gin/ginbulk.c           |    7 +
 src/backend/access/gin/gininsert.c         | 1340 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  154 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   29 +
 src/include/utils/tuplesort.h              |    6 +
 src/tools/pgindent/typedefs.list           |    4 +
 9 files changed, 1534 insertions(+), 16 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/ginbulk.c b/src/backend/access/gin/ginbulk.c
index 7f89cd5e826..8dbb5c6b065 100644
--- a/src/backend/access/gin/ginbulk.c
+++ b/src/backend/access/gin/ginbulk.c
@@ -153,6 +153,13 @@ ginInsertBAEntry(BuildAccumulator *accum,
 	GinEntryAccumulator *ea;
 	bool		isNew;
 
+	/*
+	 * FIXME prevents writes of uninitialized bytes reported by valgrind in
+	 * writetup (likely that _gin_build_tuple copies some fields that are only
+	 * initialized for a certain category, or something similar)
+	 */
+	memset(&eatmp, 0, sizeof(GinEntryAccumulator));
+
 	/*
 	 * For the moment, fill only the fields of eatmp that will be looked at by
 	 * cmpEntryAccumulator or ginCombineData.
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 71f38be90c3..b353e155fc6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,124 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields before heapdesc.
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinShared;
+
+/*
+ * Return pointer to a GinShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +142,49 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process. (Actually, only the leader process has a
+	 * GinBuildState.)
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +463,95 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * XXX idea - Instead of writing the entries directly into the shared
+	 * tuplesort, write it into a local one, do the sort in the worker, and
+	 * combine the results. For large tables with many different keys that's
+	 * going to work better than the current approach where we don't get many
+	 * matches in work_mem (maybe this should use 32MB, which is what we use
+	 * when planning, but even that may not be great). Which means we are
+	 * likely to have many entries with a single TID, forcing the leader to do
+	 * a qsort() when merging the data, often amounting to ~50% of the serial
+	 * part. By doing the qsort() in a worker, leader then can do a mergesort
+	 * (likely cheaper). Also, it means the amount of data worker->leader is
+	 * going to be lower thanks to deduplication.
+	 *
+	 * Disadvantage: It needs more disk space, possibly up to 2x, because each
+	 * worker creates a tuplesort, then "transforms it" into the shared
+	 * tuplesort (hopefully less data, but not guaranteed).
+	 *
+	 * It's however possible to partition the data into multiple tuplesorts
+	 * per worker (by hashing). We don't need perfect sorting, and we can even
+	 * live with "equal" keys having multiple hashes (if there are multiple
+	 * binary representations of the value).
+	 */
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort
+	 *
+	 * XXX probably should use 32MB, not work_mem, as used during planning?
+	 */
+	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&buildstate->accum);
+		while ((list = ginGetBAEntry(&buildstate->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			/* GIN tuple and tuple length */
+			GinTuple   *tup;
+			Size		tuplen;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &tuplen);
+
+			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(buildstate->tmpCtx);
+		ginInitBA(&buildstate->accum);
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +569,14 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * XXX Make sure to initialize a bunch of fields, not to trip valgrind.
+	 * Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,25 +617,91 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. That makes sense
+	 * for btree, but not for GIN, which can do with much less memory. So
+	 * maybe make that somehow less strict, optionally?
+	 */
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
+
+
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, (void *) &buildstate,
-									   NULL);
+	if (state->bs_leader)
+	{
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+		/* scan the relation and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
+
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, (void *) &buildstate,
+										   NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -534,3 +841,1006 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinShared  *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+static int
+tid_cmp(const void *a, const void *b)
+{
+	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+}
+
+/*
+ * State used to combine accumulate TIDs from multiple GinTuples for the same
+ * key value.
+ *
+ * XXX Similar purpose to BuildAccumulator, but much simpler.
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	int			maxitems;
+	ItemPointerData *items;
+} GinBuffer;
+
+/* XXX should do more checks */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	Assert(buffer->nitems <= buffer->maxitems);
+#endif
+}
+
+static void
+AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+{
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 0; i < nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&items[i]));
+
+		if ((i == 0) || !sorted)
+			continue;
+
+		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+	}
+#endif
+}
+
+static GinBuffer *
+GinBufferInit(void)
+{
+	return palloc0(sizeof(GinBuffer));
+}
+
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * Compare if the tuple matches the already accumulated data. Compare
+ * scalar fields first, before the actual key.
+ *
+ * XXX The key is compared using memcmp, which means that if a key has
+ * multiple binary representations, we may end up treating them as
+ * different here. But that's OK, the index will merge them anyway.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	if (tup->keylen != buffer->keylen)
+		return false;
+
+	/*
+	 * For NULL/empty keys, this means equality, for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * Compare the key value, depending on the type information.
+	 *
+	 * XXX Not sure this works correctly for byval types that don't need the
+	 * whole Datum. What if there is garbage in the padding bytes?
+	 */
+	if (buffer->typbyval)
+		return (buffer->key == *(Datum *) tup->data);
+
+	/* byref values simply uses memcmp for comparison */
+	return (memcmp(tup->data, DatumGetPointer(buffer->key), buffer->keylen) == 0);
+}
+
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* enlarge the TID buffer, if needed */
+	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	{
+		/* 64 seems like a good init value */
+		buffer->maxitems = Max(buffer->maxitems, 64);
+
+		while (buffer->nitems + tup->nitems > buffer->maxitems)
+			buffer->maxitems *= 2;
+
+		if (buffer->items == NULL)
+			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 buffer->maxitems * sizeof(ItemPointerData));
+	}
+
+	/* now we should be guaranteed to have enough space for all the TIDs */
+	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+
+	/* copy the new TIDs into the buffer */
+	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
+	buffer->nitems += tup->nitems;
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+}
+
+static void
+GinBufferSortItems(GinBuffer *buffer)
+{
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+}
+
+/* XXX probably would be better to have a memory context for the buffer */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/* XXX not really needed, but easier to trigger NULL deref etc. */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+
+	/* XXX should do something with extremely large array of items? */
+}
+
+/*
+ * XXX Maybe check size of the TID arrays, and return false if it's too
+ * large (more than maintenance_work_mem or something?).
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, read the GIN tuples from the
+ * shared tuplesort, sorted by category and key. While combining the
+ * per-worker results, we merge the TID lists for entries with the same
+ * key, accumulating as many TIDs as possible before inserting the
+ * combined entry into the index.
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME probably should have local memory contexts similar to what
+ * _brin_parallel_merge  does.
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit();
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us order matching how data is organized in the
+	 * index.
+	 *
+	 * XXX Maybe we should sort by key first, then by category?
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinBufferSortItems(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(sortmem, coordinate,
+													TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write the remaining accumulated entries */
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&state->accum);
+		while ((list = ginGetBAEntry(&state->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			GinTuple   *tup;
+			Size		len;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &len);
+
+			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(state->tmpCtx);
+		ginInitBA(&state->accum);
+	}
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have an actual non-empty key. We include varlena headers and \0 bytes for
+	 * strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to do the MAXALIGN to allow
+	 * direct access to items pointers.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "invalid typlen");
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * allocate space for the whole GIN tuple
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplestore representation.
+ *
+ * Most of the fields are actually directly accessible, the only thing that
+ * needs more care is the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, the key value is
+ * compared last. The comparisons are done simply by "memcmp", based on the
+ * assumption that if we get two keys that are two different representations
+ * of a logically equal value, it'll get merged by the index build.
+ *
+ * FIXME Is the assumption we can just memcmp() actually valid? Won't this
+ * trigger the "could not split GIN page; all old items didn't fit" error
+ * when trying to update the TID list?
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b)
+{
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if ((a->category == GIN_CAT_NORM_KEY) &&
+		(b->category == GIN_CAT_NORM_KEY))
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		/*
+		 * works for both byval and byref types with fixed length, because for
+		 * byval we set keylen to sizeof(Datum)
+		 */
+		if (a->typbyval)
+		{
+			return memcmp(&keya, &keyb, a->keylen);
+		}
+		else
+		{
+			if (a->keylen < b->keylen)
+				return -1;
+
+			if (a->keylen > b->keylen)
+				return 1;
+
+			return memcmp(DatumGetPointer(keya), DatumGetPointer(keyb), a->keylen);
+		}
+	}
+
+	return 0;
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 5747ae6a4ca..dd22b44aca9 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb54..c9ea769afb5 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 05a853caa36..060fb64164f 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,6 +20,7 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
@@ -46,6 +47,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +77,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +87,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -580,6 +589,35 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+
+Tuplesortstate *
+tuplesort_begin_index_gin(int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	base->nKeys = 1;			/* Only the index key */
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -817,6 +855,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tuple, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -989,6 +1058,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1777,6 +1869,68 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 25983b7a505..be76d8446f4 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 00000000000..56aed40fb96
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations of the GinTuple format used by parallel GIN index builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "storage/itemptr.h"
+
+typedef struct GinTuple
+{
+	Size		tuplen;			/* length of the whole tuple */
+	Size		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	OffsetNumber attrnum;
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..659d551247a 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,8 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +459,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +469,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 61ad417cde6..af86c22093e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1015,11 +1015,13 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1032,9 +1034,11 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinShared
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.45.2
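
To make the GinTuple serialization above easier to review, here is a
stand-alone sketch of the layout that _gin_build_tuple writes and
_gin_parse_tuple reads back: a fixed header, the key copied into data[],
and the TID array starting at the MAXALIGN'ed offset past the key. This is
only an illustration (simplified stand-in types, by-value key only, plain
malloc instead of palloc), not the patch code:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MY_MAXALIGN(x)  (((x) + 7) & ~((size_t) 7))

    /* simplified stand-ins for ItemPointerData and GinTuple */
    typedef struct { uint32_t blk; uint16_t off; } Tid;

    typedef struct
    {
        size_t   tuplen;   /* length of the whole tuple */
        size_t   keylen;   /* bytes used by the key in data[] */
        uint32_t nitems;   /* number of TIDs stored after the key */
        char     data[];   /* key value, then MAXALIGN'ed TID array */
    } SketchTuple;

    static SketchTuple *
    build_tuple(uint64_t key, const Tid *items, uint32_t nitems)
    {
        size_t  keylen = sizeof(key);  /* by-value key: copy the whole value */
        size_t  off = MY_MAXALIGN(offsetof(SketchTuple, data) + keylen);
        SketchTuple *t = calloc(1, off + nitems * sizeof(Tid));

        t->tuplen = off + nitems * sizeof(Tid);
        t->keylen = keylen;
        t->nitems = nitems;
        memcpy(t->data, &key, keylen);  /* key goes first */
        memcpy((char *) t + off, items, nitems * sizeof(Tid));  /* then TIDs */
        return t;
    }

    static Tid *
    parse_items(SketchTuple *t)
    {
        /* the TID array starts at the MAXALIGN'ed offset past the key */
        return (Tid *) ((char *) t +
                        MY_MAXALIGN(offsetof(SketchTuple, data) + t->keylen));
    }

    int
    main(void)
    {
        Tid         tids[3] = {{1, 1}, {1, 2}, {7, 5}};
        SketchTuple *t = build_tuple(42, tids, 3);
        Tid        *items = parse_items(t);

        printf("nitems = %u, first TID = (%u,%u)\n",
               (unsigned) t->nitems,
               (unsigned) items[0].blk, (unsigned) items[0].off);
        free(t);
        return 0;
    }

The MAXALIGN of the offset past the key is the one non-obvious detail - it
is what allows both sides to access the TID array in place, without copying
it out first.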

v20240619-0002-fixup-pass-progress-flag.patchtext/x-patch; charset=UTF-8; name=v20240619-0002-fixup-pass-progress-flag.patchDownload
From f8132e1c3499958bcbe106b4fe3e98c8ccd23a08 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 19 Jun 2024 13:16:54 +0200
Subject: [PATCH v20240619 02/11] fixup: pass progress flag

---
 src/backend/access/gin/gininsert.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b353e155fc6..025556d7538 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1478,7 +1478,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	scan = table_beginscan_parallel(heap,
 									ParallelTableScanFromGinShared(ginshared));
 
-	reltuples = table_index_build_scan(heap, index, indexInfo, true, true,
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
 									   ginBuildCallbackParallel, state, scan);
 
 	/* insert the last item */
-- 
2.45.2

v20240619-0003-fixup-remove-inaccurate-comment.patchtext/x-patch; charset=UTF-8; name=v20240619-0003-fixup-remove-inaccurate-comment.patchDownload
From 25f240930e8c414618303eebba2fa50fd779ee96 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 19 Jun 2024 13:18:17 +0200
Subject: [PATCH v20240619 03/11] fixup: remove inaccurate comment

---
 src/backend/access/gin/gininsert.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 025556d7538..71a0750da51 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -149,8 +149,7 @@ typedef struct
 
 	/*
 	 * bs_leader is only present when a parallel index build is performed, and
-	 * only in the leader process. (Actually, only the leader process has a
-	 * GinBuildState.)
+	 * only in the leader process.
 	 */
 	GinLeader  *bs_leader;
 	int			bs_worker_id;
-- 
2.45.2

v20240619-0004-fixup-clarify-tuplesort_begin_index_gin.patchtext/x-patch; charset=UTF-8; name=v20240619-0004-fixup-clarify-tuplesort_begin_index_gin.patchDownload
From d5d74b732a860f477219f28a18d7a1aece26a182 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 19 Jun 2024 13:23:48 +0200
Subject: [PATCH v20240619 04/11] fixup: clarify tuplesort_begin_index_gin

---
 src/backend/utils/sort/tuplesortvariants.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 060fb64164f..0f0c1d71027 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -606,7 +606,13 @@ tuplesort_begin_index_gin(int workMem, SortCoordinate coordinate,
 			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
 #endif
 
-	base->nKeys = 1;			/* Only the index key */
+	/*
+	 * Only one sort column, the index key.
+	 *
+	 * Multi-column GIN indexes store the values of each attribute in separate
+	 * index entries, so each entry has a single sort key.
+	 */
+	base->nKeys = 1;
 
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
-- 
2.45.2

v20240619-0005-fixup-clarify-GinBuffer-comment.patchtext/x-patch; charset=UTF-8; name=v20240619-0005-fixup-clarify-GinBuffer-comment.patchDownload
From 45a551b5567b01406f26a5ff58d20ce949f79273 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 19 Jun 2024 13:26:42 +0200
Subject: [PATCH v20240619 05/11] fixup: clarify GinBuffer comment

---
 src/backend/access/gin/gininsert.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 71a0750da51..7751ab5b0be 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1108,10 +1108,11 @@ tid_cmp(const void *a, const void *b)
 }
 
 /*
- * State used to combine accumulate TIDs from multiple GinTuples for the same
- * key value.
+ * Buffer used to accumulate TIDs from multiple GinTuples for the same key.
  *
- * XXX Similar purpose to BuildAccumulator, but much simpler.
+ * This is similar to BuildAccumulator in that it's used to collect TIDs
+ * in memory before inserting them into the index, but it's much simpler
+ * as it only deals with a single index key at a time.
  */
 typedef struct GinBuffer
 {
-- 
2.45.2
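
Both _gin_parallel_merge and (in the next patches) the worker-side pass are
built around this buffer, so it may help to spell out the underlying
accumulate-and-flush pattern explicitly: read a stream sorted by key,
accumulate values for the current key, flush whenever the key changes, and
flush once more at the end. A stand-alone sketch with toy types (not the
patch code, which does the same thing with GinTuples):

    #include <stdio.h>

    typedef struct { int key; int tid; } Entry;  /* toy (key, TID) pair */

    static void
    flush(int key, const int *tids, int ntids)
    {
        /* stands in for ginEntryInsert() / writing a combined GinTuple */
        printf("insert key=%d with %d TIDs\n", key, ntids);
    }

    int
    main(void)
    {
        /* input stream, sorted by key (as produced by the tuplesort) */
        Entry stream[] = {{1, 10}, {1, 11}, {2, 5}, {2, 6}, {2, 7}, {9, 1}};
        int   n = 6;
        int   tids[16];
        int   ntids = 0;
        int   curkey = 0;
        int   empty = 1;        /* the buffer starts out empty */

        for (int i = 0; i < n; i++)
        {
            /* key change: flush the accumulated TIDs, start a new group */
            if (!empty && stream[i].key != curkey)
            {
                flush(curkey, tids, ntids);
                ntids = 0;
            }
            curkey = stream[i].key;
            empty = 0;
            tids[ntids++] = stream[i].tid;
        }

        /* flush data remaining in the buffer (for the last key) */
        if (!empty)
            flush(curkey, tids, ntids);
        return 0;
    }

The GinBufferCanAddKey / GinBufferStoreTuple / GinBufferReset calls in
_gin_parallel_merge map directly onto the branches of this loop.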

v20240619-0006-Use-mergesort-in-the-leader-process.patchtext/x-patch; charset=UTF-8; name=v20240619-0006-Use-mergesort-in-the-leader-process.patchDownload
From 7e719ff590e27261ff344d7549c021d1450485ff Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:32 +0200
Subject: [PATCH v20240619 06/11] Use mergesort in the leader process

The leader process (executing the serial part of the index build) spent
a significant part of the time in pg_qsort, after combining the partial
results from the workers. But we can improve this and move some of the
costs to the parallel part in workers - if workers produce sorted TID
lists, the leader can combine them by mergesort.

But to make this really efficient, the mergesort must not be executed
too many times. The workers may easily produce very short TID lists, if
there are many different keys, hitting the memory limit often. So this
adds an intermediate tuplesort pass into each worker, to combine TIDs
for each key and only then write the result into the shared tuplestore.

This means the number of mergesort invocations for each key should be
about the same as the number of workers. We can't really do better, and
it's low enough to keep the mergesort approach efficient.

Note: If we introduce a memory limit on GinBuffer (to not accumulate too
many TIDs in memory), we could end up with more chunks, but it should
not be very common.
---
 src/backend/access/gin/gininsert.c | 171 +++++++++++++++++++++++++----
 1 file changed, 148 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 7751ab5b0be..49251f86b6a 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -160,6 +160,14 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * This sortstate is used only for the first merge pass, which happens
+	 * within each worker. In principle it doesn't need to be part of
+	 * the build state and we could pass it around directly, but it's more
+	 * convenient this way.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -532,7 +540,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
-			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
 
 			pfree(tup);
 		}
@@ -1127,7 +1135,6 @@ typedef struct GinBuffer
 
 	/* array of TID values */
 	int			nitems;
-	int			maxitems;
 	ItemPointerData *items;
 } GinBuffer;
 
@@ -1136,7 +1143,6 @@ static void
 AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
-	Assert(buffer->nitems <= buffer->maxitems);
 #endif
 }
 
@@ -1240,28 +1246,22 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* enlarge the TID buffer, if needed */
-	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	/* copy the new TIDs into the buffer, combine using merge-sort */
 	{
-		/* 64 seems like a good init value */
-		buffer->maxitems = Max(buffer->maxitems, 64);
+		int			nnew;
+		ItemPointer new;
 
-		while (buffer->nitems + tup->nitems > buffer->maxitems)
-			buffer->maxitems *= 2;
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
 
-		if (buffer->items == NULL)
-			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
-		else
-			buffer->items = repalloc(buffer->items,
-									 buffer->maxitems * sizeof(ItemPointerData));
-	}
+		Assert(nnew == buffer->nitems + tup->nitems);
 
-	/* now we should be guaranteed to have enough space for all the TIDs */
-	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+		if (buffer->items)
+			pfree(buffer->items);
 
-	/* copy the new TIDs into the buffer */
-	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
-	buffer->nitems += tup->nitems;
+		buffer->items = new;
+		buffer->nitems = nnew;
+	}
 
 	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
 }
@@ -1302,6 +1302,21 @@ GinBufferReset(GinBuffer *buffer)
 	/* XXX should do something with extremely large array of items? */
 }
 
+/* XXX probably would be better to have a memory context for the buffer */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
 /*
  * XXX Maybe check size of the TID arrays, and return false if it's too
  * large (more than maintenance_work_mem or something?).
@@ -1375,7 +1390,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1392,7 +1407,7 @@ _gin_parallel_merge(GinBuildState *state)
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1402,6 +1417,9 @@ _gin_parallel_merge(GinBuildState *state)
 		GinBufferReset(buffer);
 	}
 
+	/* release all the memory */
+	GinBufferFree(buffer);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1440,6 +1458,102 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short. But combining many tiny lists is expensive,
+ * so we try to do as much as possible in the workers and only then pass the
+ * results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker. Which means the leader will
+ * do a very limited number of mergesorts, which is good.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit();
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the local per-worker tuplesort, sorted by
+	 * the key, and merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the tuplesort first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the tuplesort, and start a new entry for the current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		GinBufferSortItems(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1471,6 +1585,10 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	state->bs_sortstate = tuplesort_begin_index_gin(sortmem, coordinate,
 													TUPLESORT_NONE);
 
+	/* Local per-worker sort of raw data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(sortmem, NULL,
+													  TUPLESORT_NONE);
+
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1508,7 +1626,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
-			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
 
 			pfree(tup);
 		}
@@ -1517,6 +1635,13 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 		ginInitBA(&state->accum);
 	}
 
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place that
+	 * into the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
-- 
2.45.2
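
The reason this two-pass scheme stays cheap is that merging two sorted TID
lists degenerates to plain concatenation whenever the lists do not overlap,
which is the common case once each worker emits one combined list per key.
A stand-alone sketch of that property, using integers as stand-ins for TIDs
(a hypothetical helper - the patch itself relies on ginMergeItemPointers,
which detects the same fast path):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef uint64_t Tid;   /* stand-in for ItemPointerData */

    /* merge two sorted TID arrays into a newly allocated sorted array */
    static Tid *
    merge_tids(const Tid *a, int na, const Tid *b, int nb, int *nout)
    {
        Tid *out = malloc((na + nb) * sizeof(Tid));

        *nout = na + nb;

        if (na == 0 || nb == 0 || a[na - 1] < b[0])
        {
            /* fast path: no overlap, concatenate without comparisons */
            memcpy(out, a, na * sizeof(Tid));
            memcpy(out + na, b, nb * sizeof(Tid));
            return out;
        }

        /* slow path: classic two-way merge (real code also checks b then a) */
        {
            int i = 0, j = 0, k = 0;

            while (i < na && j < nb)
                out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
            while (i < na)
                out[k++] = a[i++];
            while (j < nb)
                out[k++] = b[j++];
        }
        return out;
    }

    int
    main(void)
    {
        Tid a[] = {1, 3, 9};
        Tid b[] = {10, 12};
        int n;
        Tid *m = merge_tids(a, 3, b, 2, &n);   /* hits the concat path */

        for (int i = 0; i < n; i++)
            printf("%llu ", (unsigned long long) m[i]);
        printf("\n");
        free(m);
        return 0;
    }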

v20240619-0007-Remove-the-explicit-pg_qsort-in-workers.patchtext/x-patch; charset=UTF-8; name=v20240619-0007-Remove-the-explicit-pg_qsort-in-workers.patchDownload
From 3caade89109c4080a5c7673828576f7a1053bb7e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:36 +0200
Subject: [PATCH v20240619 07/11] Remove the explicit pg_qsort in workers

We don't need to do the explicit sort before building the GIN tuple,
because the mergesort in GinBufferStoreTuple is already maintaining the
correct order (this was added in an earlier commit).

The patch also adds a field with the first TID, and modifies the
comparator to sort by it (for each key value). This helps workers to
build non-overlapping TID lists and simply append values instead of
having to do the actual mergesort to combine them. This is best-effort,
i.e. it's not guaranteed to eliminate the mergesort - in particular,
parallel scans are synchronized, and thus may start somewhere in the
middle of the table and wrap around, in which case there may be a very
wide list (with low/high TID values).

Note: There's an XXX comment with a couple ideas on how to improve this,
at the cost of more complexity.
---
 src/backend/access/gin/gininsert.c | 124 +++++++++++++++++++++--------
 src/include/access/gin_tuple.h     |   8 ++
 2 files changed, 98 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 49251f86b6a..735de3459f1 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1109,12 +1109,6 @@ _gin_parallel_heapscan(GinBuildState *state)
 	return state->bs_reltuples;
 }
 
-static int
-tid_cmp(const void *a, const void *b)
-{
-	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
-}
-
 /*
  * Buffer used to accumulate TIDs from multiple GinTuples for the same key.
  *
@@ -1147,17 +1141,21 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 }
 
 static void
-AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
-	for (int i = 0; i < nitems; i++)
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
 	{
-		Assert(ItemPointerIsValid(&items[i]));
+		Assert(ItemPointerIsValid(&buffer->items[i]));
 
 		if ((i == 0) || !sorted)
 			continue;
 
-		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
 	}
 #endif
 }
@@ -1220,6 +1218,45 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (memcmp(tup->data, DatumGetPointer(buffer->key), buffer->keylen) == 0);
 }
 
+/*
+ * GinBufferStoreTuple
+ *		Add data from a GinTuple into the GinBuffer.
+ *
+ * If the buffer is empty, we simply initialize it with data from the tuple.
+ * Otherwise data from the tuple (the TID list) is added to the TID data in
+ * the buffer, either by simply appending the TIDs or doing merge sort.
+ *
+ * The data (for the same key) is expected to be processed sorted by first
+ * TID. But this does not guarantee the lists do not overlap, especially in
+ * the leader, because the workers process interleaving data. But even in
+ * a single worker, lists can overlap - parallel scans require sync-scans,
+ * and if the scan starts somewhere in the table and then wraps around, it
+ * may contain very wide lists (in terms of TID range).
+ *
+ * But ginMergeItemPointers() is already smart about detecting cases where
+ * it can simply concatenate the lists and when a full mergesort is needed,
+ * and it does the right thing in each case.
+ *
+ * By keeping the first TID in the GinTuple and sorting by that, we make
+ * it more likely the lists won't overlap very often.
+ *
+ * XXX How frequent can the overlaps be? If the scan does not wrap around,
+ * there should be no overlapping lists, and thus no mergesort. After an
+ * overlap there can probably be many - that one list will be very wide,
+ * with a very low and a very high TID, and all other lists will overlap it.
+ * I'm not sure how much we can do to prevent that, short of disabling sync
+ * scans (which for parallel scans is currently not possible). One option
+ * would be to keep two lists of TIDs, and see if the new list can be
+ * concatenated with one of them. The idea is that there's only one wide
+ * list (because the wraparound happens only once), and then do the
+ * mergesort only once at the very end.
+ *
+ * XXX Alternatively, we could simply detect the case when the lists can't
+ * be appended, and flush the current list out. The wrap around happens only
+ * once, so there's going to be only one such wide list, and it'll be sorted
+ * first (because it has the lowest TID for the key). So we'd do this at
+ * most once per key.
+ */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 {
@@ -1246,7 +1283,12 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* copy the new TIDs into the buffer, combine using merge-sort */
+	/*
+	 * Copy the new TIDs into the buffer, combine with existing data (if any)
+	 * using merge-sort. The mergesort is already smart about cases where it
+	 * can simply concatenate the two lists, and when it actually needs to
+	 * merge the data in an expensive way.
+	 */
 	{
 		int			nnew;
 		ItemPointer new;
@@ -1261,21 +1303,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		buffer->items = new;
 		buffer->nitems = nnew;
-	}
-
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
-}
 
-static void
-GinBufferSortItems(GinBuffer *buffer)
-{
-	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
-
-	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
-
-	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
+	}
 }
 
 /* XXX probably would be better to have a memory context for the buffer */
@@ -1299,6 +1329,11 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 
+	if (buffer->items)
+	{
+		pfree(buffer->items);
+		buffer->items = NULL;
+	}
 	/* XXX should do something with extremely large array of items? */
 }
 
@@ -1390,7 +1425,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the index, and start a new entry for the current
 			 * GinTuple.
 			 */
-			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+			AssertCheckItemPointers(buffer, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1400,14 +1435,17 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append if to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1510,7 +1548,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 * the data into the tuplesort, and start a new entry for the current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer, true);
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
@@ -1524,7 +1562,10 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
@@ -1534,7 +1575,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		GinTuple   *ntup;
 		Size		ntuplen;
 
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer, true);
 
 		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
@@ -1835,6 +1876,7 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
+	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -1919,6 +1961,12 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
  * assumption that if we get two keys that are two different representations
  * of a logically equal value, it'll get merged by the index build.
  *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
+ *
  * FIXME Is the assumption we can just memcmp() actually valid? Won't this
  * trigger the "could not split GIN page; all old items didn't fit" error
  * when trying to update the TID list?
@@ -1953,19 +2001,27 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b)
 		 */
 		if (a->typbyval)
 		{
-			return memcmp(&keya, &keyb, a->keylen);
+			int			r = memcmp(&keya, &keyb, a->keylen);
+
+			/* if the key is the same, consider the first TID in the array */
+			return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 		}
 		else
 		{
+			int			r;
+
 			if (a->keylen < b->keylen)
 				return -1;
 
 			if (a->keylen > b->keylen)
 				return 1;
 
-			return memcmp(DatumGetPointer(keya), DatumGetPointer(keyb), a->keylen);
+			r = memcmp(DatumGetPointer(keya), DatumGetPointer(keyb), a->keylen);
+
+			/* if the key is the same, consider the first TID in the array */
+			return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 		}
 	}
 
-	return 0;
+	return ItemPointerCompare(&a->first, &b->first);
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 56aed40fb96..8efa33a8d31 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -12,6 +12,13 @@
 
 #include "storage/itemptr.h"
 
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ */
 typedef struct GinTuple
 {
 	Size		tuplen;			/* length of the whole tuple */
@@ -20,6 +27,7 @@ typedef struct GinTuple
 	bool		typbyval;		/* typbyval for key */
 	OffsetNumber attrnum;
 	signed char category;		/* category: normal or NULL? */
+	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
-- 
2.45.2
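
To make the effect of the new sort key concrete: with the first TID as a
tiebreak, tuples for the same key arrive ordered by where their TID list
starts, which is exactly the order in which appending (rather than a full
merge) is most likely to work. A stand-alone sketch with toy types (not the
patch code):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* toy stand-in for a GinTuple: key plus the first TID of its list */
    typedef struct
    {
        uint32_t key;
        uint64_t first;
    } MiniTuple;

    /* compare by key first, then by first TID - the tiebreak from 0007 */
    static int
    compare_tuples(const void *pa, const void *pb)
    {
        const MiniTuple *a = pa;
        const MiniTuple *b = pb;

        if (a->key != b->key)
            return (a->key < b->key) ? -1 : 1;
        if (a->first != b->first)
            return (a->first < b->first) ? -1 : 1;
        return 0;
    }

    int
    main(void)
    {
        MiniTuple tups[] = {{7, 900}, {7, 100}, {3, 500}, {7, 400}};

        qsort(tups, 4, sizeof(MiniTuple), compare_tuples);

        /*
         * Prints (3,500) (7,100) (7,400) (7,900): equal-key runs are
         * ordered by first TID, so successive lists tend to concatenate.
         */
        for (int i = 0; i < 4; i++)
            printf("key=%u first=%llu\n",
                   (unsigned) tups[i].key,
                   (unsigned long long) tups[i].first);
        return 0;
    }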

v20240619-0008-Compress-TID-lists-before-writing-tuples-t.patchtext/x-patch; charset=UTF-8; name=v20240619-0008-Compress-TID-lists-before-writing-tuples-t.patchDownload
From 9d6c72df5d59fa405968c441a3702def5803b062 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:39 +0200
Subject: [PATCH v20240619 08/11] Compress TID lists before writing tuples to
 disk

When serializing GIN tuples to tuplesorts, we can significantly reduce
the amount of data by compressing the TID lists. The GIN opclasses may
produce a lot of data (depending on how many keys are extracted from
each row), and the compression is very effective and efficient.

If the number of different keys is high, the first worker pass may not
benefit from the compression very much - the data will be spilled to
disk before the TID lists can grow long enough for the compression to
actually help. In the second pass the impact is much more significant.

For real-world data (full-text on mailing list archives), I usually see
the compression save only about ~15% in the first pass, but ~50% in
the second pass.
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 735de3459f1..9b640bfe5f6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -186,7 +186,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1265,7 +1267,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1306,6 +1309,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer, true);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /* XXX probably would be better to have a memory context for the buffer */
@@ -1806,6 +1812,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1818,6 +1833,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
  * like endianness etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized in compressed form - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1831,6 +1851,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -1854,12 +1879,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "invalid typlen");
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * SHORTALIGN, but we do MAXALIGN anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -1909,37 +1956,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible; the only things that
 * need more care are the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -1952,6 +2002,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array of TIDs decompressed from the tuple.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -1992,8 +2064,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b)
 	if ((a->category == GIN_CAT_NORM_KEY) &&
 		(b->category == GIN_CAT_NORM_KEY))
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		/*
 		 * works for both byval and byref types with fixed length, because for
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index af86c22093e..621b77febee 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1034,6 +1034,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinShared
 GinState
 GinStatsData
-- 
2.45.2
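
To illustrate the new serialized format: here is a minimal sketch (not part
of the patch series; check_gin_tuple_roundtrip is a hypothetical helper) of
decoding the compressed TID list back out of a GinTuple, using the same
internals as _gin_parse_tuple_items - the posting-list segments start right
after the MAXALIGNed key:

#include "postgres.h"

#include "access/gin_private.h"
#include "access/gin_tuple.h"

static void
check_gin_tuple_roundtrip(GinTuple *tup, ItemPointerData *orig, uint32 norig)
{
	Size		off = MAXALIGN(offsetof(GinTuple, data) + tup->keylen);
	char	   *ptr = (char *) tup + off;
	int			len = tup->tuplen - off;
	int			ndecoded;
	ItemPointer items;

	/* decode all posting-list segments into a palloc'd TID array */
	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len,
											&ndecoded);

	/* the decompressed list must match what _gin_build_tuple was given */
	Assert(ndecoded == norig);
	Assert(memcmp(items, orig, norig * sizeof(ItemPointerData)) == 0);

	pfree(items);
}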

v20240619-0009-Collect-and-print-compression-stats.patchtext/x-patch; charset=UTF-8; name=v20240619-0009-Collect-and-print-compression-stats.patchDownload
From 1b6cb18b11f63da1acc53b3c85b0b8e38de979d1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:43 +0200
Subject: [PATCH v20240619 09/11] Collect and print compression stats

Allows evaluating the benefits of compressing the TID lists.
---
 src/backend/access/gin/gininsert.c | 36 +++++++++++++++++++++++++-----
 src/include/access/gin.h           |  2 ++
 2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 9b640bfe5f6..47007aa63b4 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -189,7 +189,8 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 static ItemPointer _gin_parse_tuple_items(GinTuple *a);
 static Datum _gin_parse_tuple_key(GinTuple *a);
 
-static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+static GinTuple *_gin_build_tuple(GinBuildState *state,
+								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
 								  Size *len);
@@ -538,7 +539,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(buildstate, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
@@ -1530,6 +1531,15 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* sort the raw per-worker data */
 	tuplesort_performsort(state->bs_worker_sort);
 
+	/* print some basic info */
+	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
+	/* reset before the second phase */
+	state->buildStats.sizeCompressed = 0;
+	state->buildStats.sizeRaw = 0;
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
 	 * merge them into larger chunks for the leader to combine.
@@ -1556,7 +1566,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 */
 			AssertCheckItemPointers(buffer, true);
 
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
 									buffer->items, buffer->nitems, &ntuplen);
 
@@ -1583,7 +1593,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 		AssertCheckItemPointers(buffer, true);
 
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
 								buffer->items, buffer->nitems, &ntuplen);
 
@@ -1598,6 +1608,11 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* release all the memory */
 	GinBufferFree(buffer);
 
+	/* print some basic info */
+	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
 	tuplesort_end(worker_sort);
 }
 
@@ -1669,7 +1684,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(state, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
@@ -1763,6 +1778,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* initialize the GIN build state */
 	initGinState(&buildstate.ginstate, indexRel);
 	buildstate.indtuples = 0;
+	/* XXX shouldn't this initialize the other fields, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
 	/*
@@ -1840,7 +1856,8 @@ typedef struct
  * of that into the GIN tuple.
  */
 static GinTuple *
-_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+_gin_build_tuple(GinBuildState *state,
+				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
 				 ItemPointerData *items, uint32 nitems,
 				 Size *len)
@@ -1971,6 +1988,13 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 		pfree(seginfo);
 	}
 
+	/* how large would the tuple be without compression? */
+	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		nitems * sizeof(ItemPointerData);
+
+	/* compressed size */
+	state->buildStats.sizeCompressed += tuplen;
+
 	return tuple;
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index be76d8446f4..2b6633d068a 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -49,6 +49,8 @@ typedef struct GinStatsData
 	BlockNumber nDataPages;
 	int64		nEntries;
 	int32		ginVersion;
+	Size		sizeRaw;
+	Size		sizeCompressed;
 } GinStatsData;
 
 /*
-- 
2.45.2
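
As a back-of-the-envelope example of what the logged numbers mean (the
figures here are illustrative assumptions, not measurements): a key with
1,000,000 TIDs needs 6,000,000 bytes in raw form (an ItemPointerData is 6
bytes), while the varbyte-encoded posting lists often need only ~1-2 bytes
per TID for reasonably dense lists, i.e. ~1-2 million bytes - so the elog
above would report a ratio somewhere around 17-33% for such a key.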

v20240619-0010-Enforce-memory-limit-when-combining-tuples.patchtext/x-patch; charset=UTF-8; name=v20240619-0010-Enforce-memory-limit-when-combining-tuples.patchDownload
From e08b66cc678ee013b7c5bb1acf26c5287a71f909 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:49 +0200
Subject: [PATCH v20240619 10/11] Enforce memory limit when combining tuples

When combining intermediate results during a parallel GIN index build, we
want to restrict the memory usage. In ginBuildCallbackParallel() this is
done simply by dumping working state into tuplesort after hitting the
memory limit.

This commit introduces memory limit to the following steps, merging the
intermediate results in both worker and leader. The merge only deals
with one key at a time, and the primary risk is the key might have too
many different TIDs. While this is not very likely - the TID array only
needs 6B per item - it's still a potential issue.

We can't simply dump the whole current TID list - the index requires the
TID values to be inserted in the correct order, but if the lists overlap
(as they do between workers), the tail of the list may change during the
mergesort. But thanks to sorting GIN tuples by first TID, we can derive
a safe TID horizon - we know no future tuples will have TIDs from before
this value, so it's safe to output this part of the list.

This commit tracks the "frozen" part of the TID list, which is the part
we know won't change after merging additional TID lists. Then if the TID
list grows too large (more than 64kB), we try to trim it - write out the
frozen part of the list, and discard it from the buffer. We only do the
trimming if there are at least 1024 frozen items - we don't want to write
the data into the index in tiny chunks.

The freezing also allows us to skip the frozen part during mergesort.
The frozen part of the list is known to be fully sorted, so we can just
skip it and mergesort only the rest of the data.

Note: These limits (1024 and 64kB) are mostly arbitrary - but they seem
high enough to get good efficiency for compression/batching, and low
enough to release memory early and work in small increments.
---
 src/backend/access/gin/gininsert.c | 245 +++++++++++++++++++++++++++--
 src/include/access/gin.h           |   1 +
 2 files changed, 237 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 47007aa63b4..8d16e484093 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1130,8 +1130,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempting to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	ItemPointerData *items;
 } GinBuffer;
 
@@ -1166,7 +1170,21 @@ AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 static GinBuffer *
 GinBufferInit(void)
 {
-	return palloc0(sizeof(GinBuffer));
+	GinBuffer  *buffer = (GinBuffer *) palloc0(sizeof(GinBuffer));
+
+	/*
+	 * How many items can we fit into the memory limit? 64kB seems more than
+	 * enough and we don't want a limit that's too high. OTOH maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 *
+	 * XXX This is not enough to prevent repeated merges after a wraparound,
+	 * but it should be enough to make the merges cheap because it quickly
+	 * reaches the end of the second list and can just memcpy the rest
+	 * without walking it item by item.
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
+	return buffer;
 }
 
 static bool
@@ -1221,6 +1239,54 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (memcmp(tup->data, DatumGetPointer(buffer->key), buffer->keylen) == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we mean writing out and removing the TIDs that we know
+ * can't be changed by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding whether to trim - if the
+ * resulting list (after adding TIDs from the new tuple) would be too long,
+ * and if there are enough TIDs to trim (with values less than "first" TID
+ * from the new tuple), we do the trim. By enough we mean at least 1024
+ * TIDs (a mostly arbitrary number, matching the check in the code).
+ *
+ * XXX This does help for the wrap around case, because the "wide" TID list
+ * is essentially two ranges - one at the beginning of the table, one at the
+ * end. And all the other ranges (from GIN tuples) come in between, and also
+ * do not overlap. So by trimming up to the range we're about to add, this
+ * guarantees we'll be able to "concatenate" the two lists cheaply.
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* We're not going to hit the memory limit after adding this tuple. */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data from a GinTuple into the GinBuffer.
@@ -1259,6 +1325,11 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  * once, so there's going to be only such wide list, and it'll be sorted
  * first (because it has the lowest TID for the key). So we'd do this at
  * most once per key.
+ *
+ * XXX Maybe we could/should allocate the buffer once and then keep it
+ * without palloc/pfree. That won't help when just calling the mergesort,
+ * as that does palloc internally, but if we detected the append case,
+ * we could do without it. Not sure how much overhead it is, though.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1287,26 +1358,82 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now search the list linearly, to find the last frozen TID. If we
+	 * already found the whole list to be frozen above, this does nothing.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher.
+	 *
+	 * XXX Maybe this should do a binary search if the number of "non-frozen"
+	 * items is sufficiently high (enough to make linear search slower than
+	 * binsearch).
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/*
 	 * Copy the new TIDs into the buffer, combine with existing data (if any)
 	 * using merge-sort. The mergesort is already smart about cases where it
 	 * can simply concatenate the two lists, and when it actually needs to
 	 * merge the data in an expensive way.
+	 *
+	 * XXX We could check if (buffer->nitems > buffer->nfrozen) and only do
+	 * the mergesort in that case. ginMergeItemPointers does some palloc
+	 * internally, and this way we could eliminate that. But let's keep the
+	 * code simple for now.
 	 */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		pfree(new);
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer, true);
 	}
@@ -1332,6 +1459,7 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
@@ -1344,6 +1472,23 @@ GinBufferReset(GinBuffer *buffer)
 	/* XXX should do something with extremely large array of items? */
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /* XXX probably would be better to have a memory context for the buffer */
 static void
 GinBufferFree(GinBuffer *buffer)
@@ -1402,7 +1547,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers have already completed.
+	 */
 	buffer = GinBufferInit();
 
 	/*
@@ -1442,6 +1592,36 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 *
+		 * XXX The buffer may also be empty, but in that case we skip this.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * Buffer is not empty and adding the new tuple would exceed the
+			 * memory limit - write the frozen part of the TID list into the
+			 * index, and discard it from the buffer below.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1465,6 +1645,8 @@ _gin_parallel_merge(GinBuildState *state)
 	/* release all the memory */
 	GinBufferFree(buffer);
 
+	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1525,7 +1707,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit();
 
 	/* sort the raw per-worker data */
@@ -1578,6 +1766,43 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 *
+		 * XXX The buffer may also be empty, but in that case we skip this.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * Buffer is not empty and adding the new tuple would exceed the
+			 * memory limit - write the frozen part of the TID list into the
+			 * shared tuplesort, and discard it from the buffer below.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1613,6 +1838,8 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
 		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
 
+	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(worker_sort);
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 2b6633d068a..9381329fac5 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -51,6 +51,7 @@ typedef struct GinStatsData
 	int32		ginVersion;
 	Size		sizeRaw;
 	Size		sizeCompressed;
+	int64		nTrims;
 } GinStatsData;
 
 /*
-- 
2.45.2
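
To make the "freezing" logic easier to follow, here is a condensed sketch
of the TID-horizon idea (gin_count_frozen is a hypothetical standalone
helper, not something the patch adds): since the tuplesort returns the GIN
tuples for a key ordered by their first TID, any buffered TID at or before
that first TID can never be reordered by a later merge, so that prefix is
safe to write out:

#include "postgres.h"

#include "storage/itemptr.h"

/*
 * Count how many TIDs at the start of a sorted array are "frozen",
 * i.e. at or before the horizon (the first TID of the tuple we are
 * about to merge in). That prefix may be flushed and discarded.
 */
static int
gin_count_frozen(ItemPointerData *items, int nitems, ItemPointer horizon)
{
	int			nfrozen = 0;

	while (nfrozen < nitems &&
		   ItemPointerCompare(&items[nfrozen], horizon) <= 0)
		nfrozen++;

	return nfrozen;
}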

v20240619-0011-Detect-wrap-around-in-parallel-callback.patchtext/x-patch; charset=UTF-8; name=v20240619-0011-Detect-wrap-around-in-parallel-callback.patchDownload
From 123ef5a67b3548b7226c0c6bcc8d3f758d96ee82 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:55 +0200
Subject: [PATCH v20240619 11/11] Detect wrap around in parallel callback

When the sync scan during an index build wraps around, some keys may end
up with very long TID lists, requiring "full" merge sort runs when
combining data in workers. It also causes problems with enforcing the
memory limit, because we can't just dump the data - the index build
requires append-only posting lists, and violating that may result in
errors like

  ERROR: could not split GIN page; all old items didn't fit

because after the scan wraps around, some of the TIDs may belong to the
beginning of the list, affecting the compression.

But we can deal with this in the callback - if we see the TID jump
backwards, that must mean a wraparound happened. In that case we simply
dump all the data accumulated in memory, and start from scratch.

This means there won't be any tuples with very wide TID ranges, instead
there'll be one tuple with a range at the end of the table, and another
tuple at the beginning. And all the lists in the worker will be
non-overlapping, and sort nicely based on first TID.

For the leader, we still need to do the full merge - the lists may
overlap and interleave in various ways. But there should be only a few
of those lists, about one per worker, so this is not a big issue.
---
 src/backend/access/gin/gininsert.c | 89 ++++++++++++++++++------------
 1 file changed, 55 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 8d16e484093..10d65abbf47 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -142,6 +142,7 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
 
 	/* FIXME likely duplicate with indtuples */
 	double		bs_numtuples;
@@ -473,6 +474,47 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(buildstate, attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
+/*
+ * Deal with the wraparound of sync scans by watching for the TID to jump
+ * backwards, and flushing the accumulated state when that happens.
+ */
 static void
 ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 						 bool *isnull, bool tupleIsAlive, void *state)
@@ -483,6 +525,16 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
 
+	/* flush contents before wrapping around */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+	{
+		elog(LOG, "calling ginFlushBuildState");
+		ginFlushBuildState(buildstate, index);
+	}
+
+	/* remember the TID we're about to process */
+	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
+
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
 		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
 							   values[i], isnull[i], tid);
@@ -517,40 +569,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 	 * XXX probably should use 32MB, not work_mem, as used during planning?
 	 */
 	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			/* GIN tuple and tuple length */
-			GinTuple   *tup;
-			Size		tuplen;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(buildstate, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &tuplen);
-
-			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
-	}
+		ginFlushBuildState(buildstate, index);
 
 	MemoryContextSwitchTo(oldCtx);
 }
@@ -586,6 +605,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.bs_numtuples = 0;
 	buildstate.bs_reltuples = 0;
 	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
@@ -2007,6 +2027,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	buildstate.indtuples = 0;
 	/* XXX shouldn't this initialize the other fields, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/*
 	 * create a temporary memory context that is used to hold data not yet
-- 
2.45.2
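
The detection itself boils down to a single comparison per heap tuple. A
rough sketch of the idea (gin_observe_tid and gin_flush_state are
hypothetical stand-ins for the callback logic and ginFlushBuildState, and
GinBuildState is assumed to have the new "tid" field from this patch):

/*
 * Heap TIDs arrive in increasing order from the scan, except when a
 * synchronized scan wraps around to the beginning of the table. A TID
 * jumping backwards is therefore a reliable wraparound signal, and the
 * right moment to flush everything accumulated so far.
 */
static void
gin_observe_tid(GinBuildState *state, ItemPointer tid)
{
	if (ItemPointerCompare(tid, &state->tid) < 0)
		gin_flush_state(state);	/* stand-in for ginFlushBuildState() */

	/* remember the TID we're about to process */
	state->tid = *tid;
}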

#18Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#17)
7 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

Here's a cleaned up patch series, merging the fixup patches into 0001.

I've also removed the memset() from ginInsertBAEntry(). This was meant
to fix valgrind reports, but I believe this was just a symptom of
incorrect handling of byref data types, which was fixed in 2024/05/02
patch version.

The other thing I did is cleanup of FIXME and XXX comments. There were a
couple of stale/obsolete comments, discussing issues that had already
been fixed (like the scan wrapping around).

A couple of things to fix remain, but all of them are minor. And there
are also a couple of XXX comments, often describing things that are then
done in one of the following patches.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v20240620-0001-Allow-parallel-create-for-GIN-indexes.patchtext/x-patch; charset=UTF-8; name=v20240620-0001-Allow-parallel-create-for-GIN-indexes.patchDownload
From 60d0ed63c06b4f16826c805f8811fd8cbd42541d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 19 Jun 2024 12:42:24 +0200
Subject: [PATCH v20240620 1/7] Allow parallel create for GIN indexes

Add support for parallel create of GIN indexes, using an approach and
code very similar to the one used by BRIN indexes.

Each worker reads a subset of the table (from a parallel scan), and
accumulates index entries in memory. But instead of writing the results
into the index (after hitting the memory limit), the data are written
to a shared tuplesort (and sorted by index key).

The leader then reads data from the tuplesort, and combines them into
entries that get inserted into the index.
---
 src/backend/access/gin/gininsert.c         | 1357 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  158 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   29 +
 src/include/utils/tuplesort.h              |    6 +
 src/tools/pgindent/typedefs.list           |    4 +
 8 files changed, 1548 insertions(+), 16 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 71f38be90c3..cdadd389185 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,124 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects the mutable state fields below.
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinShared;
+
+/*
+ * Return pointer to a GinShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +142,48 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process.
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +462,98 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * XXX Instead of writing the entries directly into the shared tuplesort,
+ * we might write them into a local one, do a sort in the worker, combine
+ * the results, and only then write the results into the shared tuplesort.
+ * For large tables with many different keys that's going to work better
+ * than the current approach where we don't get many matches in work_mem
+ * (maybe this should use 32MB, which is what we use when planning, but
+ * even that may not be sufficient). Which means we are likely to have
+ * many entries with a small number of TIDs, forcing the leader to merge
+ * the data, often amounting to ~50% of the serial part. By doing the
+ * first sort in the workers, the leader could then do fewer merges with
+ * longer TID lists, which is much cheaper. Also, the amount of data sent
+ * from workers to the leader would be lower.
+ *
+ * The disadvantage is increased disk space usage, possibly up to 2x, if
+ * no entries get combined at the worker level.
+ *
+ * It would be possible to partition the data into multiple tuplesorts
+ * per worker (by hashing) - we don't need the data produced by workers
+ * to be perfectly sorted, and we could even live with multiple entries
+ * for the same key (in case it has multiple binary representations with
+ * distinct hash values).
+ */
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort
+	 *
+	 * XXX It might seem this should set the memory limit to 32MB, same as
+	 * what plan_create_index_workers() uses to calculate the number of
+	 * parallel workers, but that's the limit for tuplesort. So it's better to
+	 * keep using work_mem here.
+	 */
+	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&buildstate->accum);
+		while ((list = ginGetBAEntry(&buildstate->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			/* GIN tuple and tuple length */
+			GinTuple   *tup;
+			Size		tuplen;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &tuplen);
+
+			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(buildstate->tmpCtx);
+		ginInitBA(&buildstate->accum);
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +571,15 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * Initialize all the fields, not to trip valgrind.
+	 *
+	 * XXX Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,25 +620,92 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. For GIN that's
+	 * reasonable too, because we sort the data just like btree. It does
+	 * ignore the memory used to accumulate data in memory (set by work_mem),
+	 * but there is no way to communicate that to plan_create_index_workers.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, (void *) &buildstate,
-									   NULL);
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
+	 */
+	if (state->bs_leader)
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
+
+		/* scan the relation and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
+
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
+	{
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, (void *) &buildstate,
+										   NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -534,3 +845,1019 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	/*
+	 * Store the shared build state and the tuplesort state in the TOC, so
+	 * that the worker processes can look them up.
+	 */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinShared  *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+static int
+tid_cmp(const void *a, const void *b)
+{
+	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+}
+
+/*
+ * Buffer used to accumulate TIDs from multiple GinTuples for the same key.
+ *
+ * This is similar to BuildAccumulator in that it's used to collect TIDs
+ * in memory before inserting them into the index, but it's much simpler
+ * as it only deals with a single index key at a time.
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	int			maxitems;
+	ItemPointerData *items;
+} GinBuffer;
+
+/* XXX should do more checks */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	Assert(buffer->nitems <= buffer->maxitems);
+#endif
+}
+
+static void
+AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+{
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 0; i < nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&items[i]));
+
+		if ((i == 0) || !sorted)
+			continue;
+
+		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+	}
+#endif
+}
+
+static GinBuffer *
+GinBufferInit(void)
+{
+	return palloc0(sizeof(GinBuffer));
+}
+
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * Compare if the tuple matches the already accumulated data. Compare
+ * scalar fields first, before the actual key.
+ *
+ * XXX The key is compared using memcmp, which means that if a key has
+ * multiple binary representations, we may end up treating them as
+ * different here. But that's OK, the index will merge them anyway.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	if (tup->keylen != buffer->keylen)
+		return false;
+
+	/*
+	 * For NULL/empty keys this means equality; for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * Compare the key value, depending on the type information.
+	 *
+	 * XXX Does this work correctly for byval types that don't need the whole
+	 * Datum value? What if there is garbage in the padding bytes?
+	 */
+	if (buffer->typbyval)
+		return (buffer->key == *(Datum *) tup->data);
+
+	/* byref values simply use memcmp for comparison */
+	return (memcmp(tup->data, DatumGetPointer(buffer->key), buffer->keylen) == 0);
+}
+
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* enlarge the TID buffer, if needed */
+	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	{
+		/* 64 seems like a good init value */
+		buffer->maxitems = Max(buffer->maxitems, 64);
+
+		while (buffer->nitems + tup->nitems > buffer->maxitems)
+			buffer->maxitems *= 2;
+
+		if (buffer->items == NULL)
+			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 buffer->maxitems * sizeof(ItemPointerData));
+	}
+
+	/* now we should be guaranteed to have enough space for all the TIDs */
+	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+
+	/* copy the new TIDs into the buffer */
+	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
+	buffer->nitems += tup->nitems;
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+}
+
+static void
+GinBufferSortItems(GinBuffer *buffer)
+{
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+}
+
+/* XXX Might be better to have a separate memory context for the buffer. */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/*
+	 * Not required, but makes it more likely to trigger NULL dereference if
+	 * using the value incorrectly, etc.
+	 */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+
+	/*
+	 * XXX Should we do something if the array of TIDs gets too large? It may
+	 * grow too much, and we'll not free it until the worker finishes
+	 * building. But it's better to not let the array grow arbitrarily large,
+	 * and enforce work_mem as memory limit by flushing the buffer into the
+	 * tuplestore.
+	 */
+}
+
+/*
+ * XXX This could / should also enforce a memory limit by checking the size of
+ * the TID array, and returning false if it's too large (more than work_mem,
+ * for example).
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, merge the per-worker results into
+ * the complete index. The results from each worker are sorted by the index
+ * key (attribute number, category and key value). While combining the
+ * per-worker results we merge the TID lists accumulated for the same key,
+ * so each key is inserted into the index only once.
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe should have local memory contexts similar to what
+ * _brin_parallel_merge does?
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit();
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us an order matching how the data is organized
+	 * in the
+	 * index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in an order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinBufferSortItems(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(sortmem, coordinate,
+													TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write the remaining accumulated entries */
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&state->accum);
+		while ((list = ginGetBAEntry(&state->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			GinTuple   *tup;
+			Size		len;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &len);
+
+			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(state->tmpCtx);
+		ginInitBA(&state->accum);
+	}
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
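+ *
+ * The resulting layout is: the fixed GinTuple fields, the key value,
+ * padding up to MAXALIGN, and then the array of ItemPointerData.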
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have an actual non-empty key. We include varlena headers and \0 bytes for
+	 * strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to do the MAXALIGN to allow
+	 * direct access to item pointers.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "invalid typlen");
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * Allocate space for the whole GIN tuple.
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplesort representation.
+ *
+ * Most of the fields are actually directly accessible, the only thing that
+ * needs more care is the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, the key value is
+ * compared last. The comparisons are done simply by "memcmp", based on the
+ * assumption that if we get two keys that are two different representations
+ * of a logically equal value, it'll get merged by the index build.
+ *
+ * FIXME Is the assumption we can just memcmp() actually valid? Won't this
+ * trigger the "could not split GIN page; all old items didn't fit" error
+ * when trying to update the TID list?
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b)
+{
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if ((a->category == GIN_CAT_NORM_KEY) &&
+		(b->category == GIN_CAT_NORM_KEY))
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		/*
+		 * works for both byval and byref types with fixed length, because for
+		 * byval we set keylen to sizeof(Datum)
+		 */
+		if (a->typbyval)
+		{
+			return memcmp(&keya, &keyb, a->keylen);
+		}
+		else
+		{
+			if (a->keylen < b->keylen)
+				return -1;
+
+			if (a->keylen > b->keylen)
+				return 1;
+
+			return memcmp(DatumGetPointer(keya), DatumGetPointer(keyb), a->keylen);
+		}
+	}
+
+	return 0;
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 5747ae6a4ca..dd22b44aca9 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb54..c9ea769afb5 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 05a853caa36..3d5b5ce0155 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,6 +20,7 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
@@ -46,6 +47,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +77,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +87,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -580,6 +589,41 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+
+Tuplesortstate *
+tuplesort_begin_index_gin(int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	/*
+	 * Only one sort column, the index key.
+	 *
+	 * Multi-column GIN indexes store the value of each attribute in separate
+	 * index entries, so each entry has a single sort key.
+	 */
+	base->nKeys = 1;
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
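+	/* GinTuple carries no leading Datum, so the datum1 optimization is unused */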
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -817,6 +861,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tuple, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -989,6 +1064,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1777,6 +1875,66 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
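+	/* 'len' includes the leading length word written by writetup_index_gin */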
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 25983b7a505..be76d8446f4 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 00000000000..2cd5e716a9a
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations for the GinTuple format used by parallel GIN builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "storage/itemptr.h"
+
+typedef struct GinTuple
+{
+	Size		tuplen;			/* length of the whole tuple */
+	Size		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	OffsetNumber attrnum;		/* attnum of index key */
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..659d551247a 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,8 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +459,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +469,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 61ad417cde6..af86c22093e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1015,11 +1015,13 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1032,9 +1034,11 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinShared
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.45.2

v20240620-0002-Use-mergesort-in-the-leader-process.patchtext/x-patch; charset=UTF-8; name=v20240620-0002-Use-mergesort-in-the-leader-process.patchDownload
From 8408fa11a4ac56709815e5dfbde6d25d8ac4257d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:32 +0200
Subject: [PATCH v20240620 2/7] Use mergesort in the leader process

The leader process (executing the serial part of the index build) spent
a significant part of the time in pg_qsort, after combining the partial
results from the workers. But we can improve this and move some of the
costs to the parallel part in workers - if workers produce sorted TID
lists, the leader can combine them by mergesort.

But to make this really efficient, the mergesort must not be executed
too many times. The workers may easily produce very short TID lists, if
there are many different keys, hitting the memory limit often. So this
adds an intermediate tuplesort pass into each worker, to combine TIDs
for each key and only then write the result into the shared tuplesort.

This means the number of mergesort invocations for each key should be
about the same as the number of workers. We can't really do better, and
it's low enough to keep the mergesort approach efficient.

Note: If we introduce a memory limit on GinBuffer (to not accumulate too
many TIDs in memory), we could end up with more chunks, but it should
not be very common.
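
To illustrate the core step, here is a minimal sketch (not code from the
patch - the patch relies on the existing ginMergeItemPointers(), which also
detects non-overlapping inputs and simply concatenates them; the name
merge_tid_lists is made up for this example):

    #include "postgres.h"
    #include "storage/itemptr.h"

    /* merge two sorted TID lists into a newly palloc'd sorted array */
    static ItemPointer
    merge_tid_lists(ItemPointer a, int na, ItemPointer b, int nb, int *nout)
    {
        ItemPointer dst = palloc(sizeof(ItemPointerData) * (na + nb));
        int         i = 0, j = 0, k = 0;

        while (i < na && j < nb)
            dst[k++] = (ItemPointerCompare(&a[i], &b[j]) <= 0) ? a[i++] : b[j++];
        while (i < na)
            dst[k++] = a[i++];
        while (j < nb)
            dst[k++] = b[j++];

        *nout = k;
        return dst;
    }

One such linear merge per worker chunk is what keeps the leader cost low
once the workers pre-combine their TID lists.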
---
 src/backend/access/gin/gininsert.c | 191 ++++++++++++++++++++++++-----
 1 file changed, 159 insertions(+), 32 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cdadd389185..8d46b53c43b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -160,6 +160,14 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate is used only within a worker for the first merge pass
+	 * that happens in the worker. In principle it doesn't need to be part of
+	 * the build state and we could pass it around directly, but it's more
+	 * convenient this way.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -463,23 +471,23 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 }
 
 /*
- * XXX Instead of writing the entries directly into the shared tuplesort,
- * we might write them into a local one, do a sort in the worker, combine
+ * Instead of writing the entries directly into the shared tuplesort, write
+ * them into a local one (in each worker), do a sort in the worker, combine
  * the results, and only then write the results into the shared tuplesort.
  * For large tables with many different keys that's going to work better
  * than the current approach where we don't get many matches in work_mem
  * (maybe this should use 32MB, which is what we use when planning, but
- * even that may not be sufficient). Which means we are likely to have
- * many entries with a small number of TIDs, forcing the leader to merge
- * the data, often amounting to ~50% of the serial part. By doing the
- * first sort workers, the leader then could do fewer merges with longer
- * TID lists, which is much cheaprr. Also, the amount of data sent from
- * workers to the leader woiuld be lower.
+ * even that may not be sufficient). Which means we would end up with many
+ * entries with a small number of TIDs, forcing the leader to merge the data,
+ * often amounting to ~50% of the serial part. By doing the first sort in
+ * workers, this work is parallelized and the leader does fewer merges with
+ * longer TID lists, which is much cheaper and more efficient. Also, the
+ * amount of data sent from workers to the leader gets lower.
  *
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
- * It would be possible to partition the data into multiple tuplesorts
+ * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
  * for the same key (in case it has multiple binary representations with
@@ -535,7 +543,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
-			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
 
 			pfree(tup);
 		}
@@ -1132,7 +1140,6 @@ typedef struct GinBuffer
 
 	/* array of TID values */
 	int			nitems;
-	int			maxitems;
 	ItemPointerData *items;
 } GinBuffer;
 
@@ -1141,7 +1148,6 @@ static void
 AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
-	Assert(buffer->nitems <= buffer->maxitems);
 #endif
 }
 
@@ -1245,28 +1251,22 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* enlarge the TID buffer, if needed */
-	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	/* copy the new TIDs into the buffer, combine using merge-sort */
 	{
-		/* 64 seems like a good init value */
-		buffer->maxitems = Max(buffer->maxitems, 64);
+		int			nnew;
+		ItemPointer new;
 
-		while (buffer->nitems + tup->nitems > buffer->maxitems)
-			buffer->maxitems *= 2;
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
 
-		if (buffer->items == NULL)
-			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
-		else
-			buffer->items = repalloc(buffer->items,
-									 buffer->maxitems * sizeof(ItemPointerData));
-	}
+		Assert(nnew == buffer->nitems + tup->nitems);
 
-	/* now we should be guaranteed to have enough space for all the TIDs */
-	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+		if (buffer->items)
+			pfree(buffer->items);
 
-	/* copy the new TIDs into the buffer */
-	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
-	buffer->nitems += tup->nitems;
+		buffer->items = new;
+		buffer->nitems = nnew;
+	}
 
 	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
 }
@@ -1316,6 +1316,23 @@ GinBufferReset(GinBuffer *buffer)
 	 */
 }
 
+/*
+ * Release all memory associated with the GinBuffer (including TID array).
+ */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
 /*
  * XXX This could / should also enforce a memory limit by checking the size of
  * the TID array, and returning false if it's too large (more than work_mem,
@@ -1392,7 +1409,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1409,7 +1426,7 @@ _gin_parallel_merge(GinBuildState *state)
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1419,6 +1436,9 @@ _gin_parallel_merge(GinBuildState *state)
 		GinBufferReset(buffer);
 	}
 
+	/* release all the memory */
+	GinBufferFree(buffer);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1457,6 +1477,102 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short. But combining many tiny lists is expensive,
+ * so we try to do as much as possible in the workers and only then pass the
+ * results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker. This means the leader only
+ * has to do a very limited number of mergesorts, which is good.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit();
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the local per-worker tuplesort, sorted by the
+	 * key, and merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		GinBufferSortItems(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1488,6 +1604,10 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	state->bs_sortstate = tuplesort_begin_index_gin(sortmem, coordinate,
 													TUPLESORT_NONE);
 
+	/* Local per-worker sort of the raw data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(sortmem, NULL,
+													  TUPLESORT_NONE);
+
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1525,7 +1645,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
-			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
 
 			pfree(tup);
 		}
@@ -1534,6 +1654,13 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 		ginInitBA(&state->accum);
 	}
 
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place that into
+	 * the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
-- 
2.45.2

v20240620-0003-Remove-the-explicit-pg_qsort-in-workers.patchtext/x-patch; charset=UTF-8; name=v20240620-0003-Remove-the-explicit-pg_qsort-in-workers.patchDownload
From 976349a5abfe23d672650a611c3377638db06b46 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Thu, 20 Jun 2024 22:26:26 +0200
Subject: [PATCH v20240620 3/7] Remove the explicit pg_qsort in workers

We don't need to do the explicit sort before building the GIN tuple,
because the mergesort in GinBufferStoreTuple is already maintaining the
correct order (this was added in an earlier commit).

The commit also adds a field with the first TID, and modifies the
comparator to sort by it (for each key value). This helps workers to
build non-overlapping TID lists and simply append values instead of
having to do the actual mergesort to combine them. This is best-effort,
i.e. it's not guaranteed to eliminate the mergesort - in particular,
parallel scans are synchronized, and thus may start somewhere in the
middle of the table, and wrap around - in which case there may be a very
wide list (spanning both low and high TID values).

Note: There's an XXX comment with a couple ideas on how to improve this,
at the cost of more complexity.
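
As a sketch of why ordering by the first TID helps (illustration only, not
the patch's code - ginMergeItemPointers() performs the equivalent check
internally, and tid_lists_concatenable is a made-up name):

    #include "postgres.h"
    #include "storage/itemptr.h"

    /*
     * A new sorted chunk can simply be appended to an already sorted list
     * iff the existing list ends before the new chunk begins.
     */
    static bool
    tid_lists_concatenable(ItemPointer olditems, int nold,
                           ItemPointer newitems, int nnew)
    {
        if (nold == 0 || nnew == 0)
            return true;

        return ItemPointerCompare(&olditems[nold - 1], &newitems[0]) < 0;
    }

With the chunks for each key sorted by their first TID, consecutive chunks
mostly satisfy this check, and the merge degenerates to cheap appends.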
---
 src/backend/access/gin/gininsert.c | 119 +++++++++++++++++++----------
 src/include/access/gin_tuple.h     |   8 ++
 2 files changed, 86 insertions(+), 41 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 8d46b53c43b..77e5efc58dd 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1114,12 +1114,6 @@ _gin_parallel_heapscan(GinBuildState *state)
 	return state->bs_reltuples;
 }
 
-static int
-tid_cmp(const void *a, const void *b)
-{
-	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
-}
-
 /*
  * Buffer used to accumulate TIDs from multiple GinTuples for the same key.
  *
@@ -1152,17 +1146,21 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 }
 
 static void
-AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
-	for (int i = 0; i < nitems; i++)
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
 	{
-		Assert(ItemPointerIsValid(&items[i]));
+		Assert(ItemPointerIsValid(&buffer->items[i]));
 
 		if ((i == 0) || !sorted)
 			continue;
 
-		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
 	}
 #endif
 }
@@ -1225,6 +1223,33 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (memcmp(tup->data, DatumGetPointer(buffer->key), buffer->keylen) == 0);
 }
 
+/*
+ * GinBufferStoreTuple
+ *		Add data from a GinTuple into the GinBuffer.
+ *
+ * If the buffer is empty, we simply initialize it with data from the tuple.
+ * Otherwise data from the tuple (the TID list) is added to the TID data in
+ * the buffer, either by simply appending the TIDs or doing merge sort.
+ *
+ * The data (for the same key) is expected to be processed sorted by first
+ * TID. But this does not guarantee the lists do not overlap, especially in
+ * the leader, because the workers process interleaving data. But even in
+ * a single worker, lists can overlap - parallel scans require sync-scans,
+ * and if the scan starts somewhere in the table and then wraps around, it
+ * may contain very wide lists (in terms of TID range).
+ *
+ * But ginMergeItemPointers() is already smart about detecting cases where
+ * it can simply concatenate the lists and when a full mergesort is needed,
+ * and it does the right thing.
+ *
+ * By keeping the first TID in the GinTuple and sorting by that, we make
+ * it more likely the lists won't overlap very often.
+ *
+ * XXX How frequent can the overlaps be? If the scan does not wrap around,
+ * there should be no overlapping lists, and thus no mergesort. After a
+ * wraparound, there probably can be many - the one list will be very wide,
+ * with a very low and high TID, and all other lists will overlap with it.
+ */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 {
@@ -1251,7 +1276,12 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* copy the new TIDs into the buffer, combine using merge-sort */
+	/*
+	 * Copy the new TIDs into the buffer, combine with existing data (if any)
+	 * using merge-sort. The mergesort is already smart about cases where it
+	 * can simply concatenate the two lists, and when it actually needs to
+	 * merge the data in an expensive way.
+	 */
 	{
 		int			nnew;
 		ItemPointer new;
@@ -1266,21 +1296,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		buffer->items = new;
 		buffer->nitems = nnew;
-	}
-
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
-}
 
-static void
-GinBufferSortItems(GinBuffer *buffer)
-{
-	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
-
-	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
-
-	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
+	}
 }
 
 /* XXX Might be better to have a separate memory context for the buffer. */
@@ -1307,13 +1325,11 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 
-	/*
-	 * XXX Should we do something if the array of TIDs gets too large? It may
-	 * grow too much, and we'll not free it until the worker finishes
-	 * building. But it's better to not let the array grow arbitrarily large,
-	 * and enforce work_mem as memory limit by flushing the buffer into the
-	 * tuplestore.
-	 */
+	if (buffer->items)
+	{
+		pfree(buffer->items);
+		buffer->items = NULL;
+	}
 }
 
 /*
@@ -1409,7 +1425,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the index, and start a new entry for the current
 			 * GinTuple.
 			 */
-			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+			AssertCheckItemPointers(buffer, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1419,14 +1435,17 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1529,7 +1548,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 * the data into the index, and start a new entry for the current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer, true);
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
@@ -1543,7 +1562,10 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
@@ -1553,7 +1575,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		GinTuple   *ntup;
 		Size		ntuplen;
 
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer, true);
 
 		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
@@ -1854,6 +1876,7 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
+	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -1938,6 +1961,12 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
  * assumption that if we get two keys that are two different representations
  * of a logically equal value, it'll get merged by the index build.
  *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
+ *
  * FIXME Is the assumption we can just memcmp() actually valid? Won't this
  * trigger the "could not split GIN page; all old items didn't fit" error
  * when trying to update the TID list?
@@ -1972,19 +2001,27 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b)
 		 */
 		if (a->typbyval)
 		{
-			return memcmp(&keya, &keyb, a->keylen);
+			int			r = memcmp(&keya, &keyb, a->keylen);
+
+			/* if the key is the same, consider the first TID in the array */
+			return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 		}
 		else
 		{
+			int			r;
+
 			if (a->keylen < b->keylen)
 				return -1;
 
 			if (a->keylen > b->keylen)
 				return 1;
 
-			return memcmp(DatumGetPointer(keya), DatumGetPointer(keyb), a->keylen);
+			r = memcmp(DatumGetPointer(keya), DatumGetPointer(keyb), a->keylen);
+
+			/* if the key is the same, consider the first TID in the array */
+			return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 		}
 	}
 
-	return 0;
+	return ItemPointerCompare(&a->first, &b->first);
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 2cd5e716a9a..73280066f2c 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -12,6 +12,13 @@
 
 #include "storage/itemptr.h"
 
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ */
 typedef struct GinTuple
 {
 	Size		tuplen;			/* length of the whole tuple */
@@ -20,6 +27,7 @@ typedef struct GinTuple
 	bool		typbyval;		/* typbyval for key */
 	OffsetNumber attrnum;		/* attnum of index key */
 	signed char category;		/* category: normal or NULL? */
+	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
-- 
2.45.2

v20240620-0004-Compress-TID-lists-before-writing-tuples-t.patchtext/x-patch; charset=UTF-8; name=v20240620-0004-Compress-TID-lists-before-writing-tuples-t.patchDownload
From 391633a7f115bf83250866cec913fbd5b55559fe Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:39 +0200
Subject: [PATCH v20240620 4/7] Compress TID lists before writing tuples to
 disk

When serializing GIN tuples to tuplesorts, we can significantly reduce
the amount of data by compressing the TID lists. The GIN opclasses may
produce a lot of data (depending on how many keys are extracted from
each row), and the compression is very effective and efficient.

If the number of different keys is high, the first worker pass may not
benefit from the compression very much - the data will be spilled to
disk before the TID lists can grow long enough for the compression to
actually help. In the second pass the impact is much more significant.

For real-world data (full-text on mailing list archives), I usually see
the compression save only about ~15% in the first pass, but ~50% on
the second pass.
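
For illustration, this is roughly how the existing compression API gets
used (a sketch under the same assumptions as the patch; the real loop is
in _gin_build_tuple, and compressed_tid_list_size is a made-up name):

    #include "postgres.h"
    #include "access/gin_private.h"		/* ginCompressPostingList() */

    /* compute the compressed size of a sorted TID array */
    static Size
    compressed_tid_list_size(ItemPointer items, int nitems)
    {
        Size        total = 0;
        int         done = 0;

        while (done < nitems)
        {
            int             nwritten;
            GinPostingList *seg;

            seg = ginCompressPostingList(&items[done], nitems - done,
                                         UINT16_MAX, &nwritten);
            total += SizeOfGinPostingList(seg);
            done += nwritten;
            pfree(seg);
        }

        return total;
    }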
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 77e5efc58dd..bbc3a5d4d38 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -186,7 +186,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1258,7 +1260,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1299,6 +1302,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer, true);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /* XXX Might be better to have a separate memory context for the buffer. */
@@ -1806,6 +1812,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1818,6 +1833,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
  * like endianness etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized as compressed - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1831,6 +1851,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -1854,12 +1879,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "invalid typlen");
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * SHORTALIGN, but we do MAXALIGN anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -1909,37 +1956,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible, the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -1952,6 +2002,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a pointer to a palloc'd array of decompressed TIDs.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -1992,8 +2064,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b)
 	if ((a->category == GIN_CAT_NORM_KEY) &&
 		(b->category == GIN_CAT_NORM_KEY))
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		/*
 		 * works for both byval and byref types with fixed length, because for
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index af86c22093e..621b77febee 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1034,6 +1034,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinShared
 GinState
 GinStatsData
-- 
2.45.2

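As background for why sorted TID lists compress so well: consecutive TIDs
differ by small deltas, which fit in one or two bytes each. A simplified
sketch of the varbyte-delta idea follows (the actual on-disk format produced
by ginCompressPostingList differs in detail; TIDs shown as uint64):

#include <stdint.h>

/*
 * Encode a sorted TID array as varbyte deltas: 7 bits per byte, with
 * the high bit indicating that more bytes follow. Returns the number
 * of bytes written to "out" (the caller must provide enough space,
 * at most 10 bytes per item).
 */
static int
encode_deltas(const uint64_t *tids, int n, unsigned char *out)
{
	uint64_t	prev = 0;
	int			len = 0;

	for (int i = 0; i < n; i++)
	{
		uint64_t	delta = tids[i] - prev;

		prev = tids[i];

		while (delta >= 0x80)
		{
			out[len++] = (unsigned char) ((delta & 0x7F) | 0x80);
			delta >>= 7;
		}
		out[len++] = (unsigned char) delta;
	}

	return len;
}

For dense lists most deltas fit in a single byte, far less than the 6 bytes
of an uncompressed ItemPointerData - which is why long lists (as in the
second pass) benefit the most.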
v20240620-0005-Collect-and-print-compression-stats.patch (text/x-patch; charset=UTF-8)
From 9bbe94b8bf27f4e0fa726d86cff6902414470494 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:43 +0200
Subject: [PATCH v20240620 5/7] Collect and print compression stats

Allows evaluating the benefits of compressing the TID lists.
---
 src/backend/access/gin/gininsert.c | 36 +++++++++++++++++++++++++-----
 src/include/access/gin.h           |  2 ++
 2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index bbc3a5d4d38..e3c177c7c31 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -189,7 +189,8 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 static ItemPointer _gin_parse_tuple_items(GinTuple *a);
 static Datum _gin_parse_tuple_key(GinTuple *a);
 
-static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+static GinTuple *_gin_build_tuple(GinBuildState *state,
+								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
 								  Size *len);
@@ -541,7 +542,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(buildstate, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
@@ -1530,6 +1531,15 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* sort the raw per-worker data */
 	tuplesort_performsort(state->bs_worker_sort);
 
+	/* print some basic info */
+	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
+	/* reset before the second phase */
+	state->buildStats.sizeCompressed = 0;
+	state->buildStats.sizeRaw = 0;
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
 	 * merge them into larger chunks for the leader to combine.
@@ -1556,7 +1566,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 */
 			AssertCheckItemPointers(buffer, true);
 
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
 									buffer->items, buffer->nitems, &ntuplen);
 
@@ -1583,7 +1593,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 		AssertCheckItemPointers(buffer, true);
 
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
 								buffer->items, buffer->nitems, &ntuplen);
 
@@ -1598,6 +1608,11 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* release all the memory */
 	GinBufferFree(buffer);
 
+	/* print some basic info */
+	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
 	tuplesort_end(worker_sort);
 }
 
@@ -1669,7 +1684,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(state, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
@@ -1763,6 +1778,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* initialize the GIN build state */
 	initGinState(&buildstate.ginstate, indexRel);
 	buildstate.indtuples = 0;
+	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
 	/*
@@ -1840,7 +1856,8 @@ typedef struct
  * of that into the GIN tuple.
  */
 static GinTuple *
-_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+_gin_build_tuple(GinBuildState *state,
+				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
 				 ItemPointerData *items, uint32 nitems,
 				 Size *len)
@@ -1971,6 +1988,13 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 		pfree(seginfo);
 	}
 
+	/* how large would the tuple be without compression? */
+	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		nitems * sizeof(ItemPointerData);
+
+	/* compressed size */
+	state->buildStats.sizeCompressed += tuplen;
+
 	return tuple;
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index be76d8446f4..2b6633d068a 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -49,6 +49,8 @@ typedef struct GinStatsData
 	BlockNumber nDataPages;
 	int64		nEntries;
 	int32		ginVersion;
+	Size		sizeRaw;
+	Size		sizeCompressed;
 } GinStatsData;
 
 /*
-- 
2.45.2

v20240620-0006-Enforce-memory-limit-when-combining-tuples.patch (text/x-patch; charset=UTF-8)
From 97bd4b96a96189102ee2973f927d07e69e5792a3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Thu, 20 Jun 2024 22:36:23 +0200
Subject: [PATCH v20240620 6/7] Enforce memory limit when combining tuples

When combining intermediate results during parallel GIN index build, we
want to restrict the memory usage. In ginBuildCallbackParallel() this is
done simply by dumping working state into the tuplesort after hitting
the memory limit.

This commit introduces a memory limit in the following steps, which
merge the intermediate results in both the workers and the leader. The
merge only deals with one key at a time, and the primary risk is that a
key might have too many different TIDs. While this is not very likely
(the TID array needs only 6B per item), it's still a potential issue.

We can't simply dump the whole current TID list - the index requires the
TID values to be inserted in the correct order, but if the lists overlap
(as they do between workers), the tail of the list may change during the
mergesort. But thanks to sorting GIN tuples by first TID, we can derive
a safe TID horizon - we know no future tuples will have TIDs from before
this value, so it's safe to output this part of the list.

This commit tracks the "frozen" part of the TID list, which is the part
we know won't change after merging additional TID lists. Then if the TID
list grows too large (more than 64kB), we try to trim it - write out the
frozen part of the list, and discard it from the buffer. We only do the
trimming if there are at least 1024 frozen items - we don't want to write
the data into the index in tiny chunks.

The freezing also allows us to skip the frozen part during mergesort.
The frozen part of the list is known to be fully sorted, so we can just
skip it and mergesort only the rest of the data.

Note: These limits (1024 and 64kB) are mostly arbitrary, but they seem
high enough to get good efficiency for compression/batching, yet low
enough to release memory early and work in small increments.
---
 src/backend/access/gin/gininsert.c | 243 +++++++++++++++++++++++++++--
 src/include/access/gin.h           |   1 +
 2 files changed, 235 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e3c177c7c31..b6a40dd7ddc 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1135,8 +1135,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempting to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	ItemPointerData *items;
 } GinBuffer;
 
@@ -1171,7 +1175,21 @@ AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 static GinBuffer *
 GinBufferInit(void)
 {
-	return palloc0(sizeof(GinBuffer));
+	GinBuffer  *buffer = (GinBuffer *) palloc0(sizeof(GinBuffer));
+
+	/*
+	 * How many items can we fit into the memory limit? 64kB seems more than
+	 * enough and we don't want a limit that's too high. OTOH maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 *
+	 * XXX This is not enough to prevent repeated merges after a wraparound,
+	 * but it should be enough to make the merges cheap because it quickly
+	 * reaches the end of the second list and can just memcpy the rest without
+	 * walking it item by item.
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
+	return buffer;
 }
 
 static bool
@@ -1226,6 +1244,54 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (memcmp(tup->data, DatumGetPointer(buffer->key), buffer->keylen) == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we mean writing out and removing the tuple IDs that we
+ * know can't be changed by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding whether to trim - if the resulting
+ * list (after adding TIDs from the new tuple) would be too long, and if there
+ * are enough TIDs to trim (with values less than the "first" TID from the new
+ * tuple), we do the trim. By enough we mean at least 1024 TIDs (mostly an
+ * arbitrary number, matching the check below).
+ *
+ * XXX This does help for the wraparound case too, because the "wide" TID list
+ * is essentially two ranges - one at the beginning of the table, one at the
+ * end. And all the other ranges (from GIN tuples) come in between, and also
+ * do not overlap. So by trimming up to the range we're about to add, this
+ * guarantees we'll be able to "concatenate" the two lists cheaply.
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* We're not going to hit the memory limit after adding this tuple. */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data from a GinTuple into the GinBuffer.
@@ -1252,6 +1318,11 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  * there should be no overlapping lists, and thus no mergesort. After a
  * wraparound, there probably can be many - the one list will be very wide,
  * with a very low and high TID, and all other lists will overlap with it.
+ *
+ * XXX Maybe we could/should allocate the buffer once and then keep it
+ * without palloc/pfree. That won't help when just calling the mergesort,
+ * as that does palloc internally, but if we detected the append case,
+ * we could do without it. Not sure how much overhead it is, though.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1280,26 +1351,82 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now search the list linearly, to find the last frozen TID. If the
+	 * whole list turns out to be frozen, this just does nothing.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher.
+	 *
+	 * XXX Maybe this should do a binary search if the number of "non-frozen"
+	 * items is sufficiently high (enough to make linear search slower than
+	 * binsearch).
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/*
 	 * Copy the new TIDs into the buffer, combine with existing data (if any)
 	 * using merge-sort. The mergesort is already smart about cases where it
 	 * can simply concatenate the two lists, and when it actually needs to
 	 * merge the data in an expensive way.
+	 *
+	 * XXX We could check if (buffer->nitems > buffer->nfrozen) and only do
+	 * the mergesort in that case. ginMergeItemPointers does some palloc
+	 * internally, and this way we could eliminate that. But let's keep the
+	 * code simple for now.
 	 */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		pfree(new);
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer, true);
 	}
@@ -1328,6 +1455,7 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
@@ -1339,8 +1467,27 @@ GinBufferReset(GinBuffer *buffer)
 	}
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /*
  * Release all memory associated with the GinBuffer (including TID array).
+ *
+ * XXX Might be easier if the buffer had its own memory context.
  */
 static void
 GinBufferFree(GinBuffer *buffer)
@@ -1400,7 +1547,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers already completed.
+	 */
 	buffer = GinBufferInit();
 
 	/*
@@ -1442,6 +1594,34 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The buffer has accumulated a frozen prefix of TIDs - write it
+			 * into the index, and remove it from the buffer before merging
+			 * in the new GinTuple.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1465,6 +1645,8 @@ _gin_parallel_merge(GinBuildState *state)
 	/* release all the memory */
 	GinBufferFree(buffer);
 
+	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1525,7 +1707,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just as in ginBuildCallbackParallel.
+	 */
 	buffer = GinBufferInit();
 
 	/* sort the raw per-worker data */
@@ -1578,6 +1766,41 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The buffer has accumulated a frozen prefix of TIDs - write it
+			 * out as a GIN tuple into the tuplesort, and remove it from the
+			 * buffer before merging in the new GinTuple.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
@@ -1613,6 +1836,8 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
 		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
 
+	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(worker_sort);
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 2b6633d068a..9381329fac5 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -51,6 +51,7 @@ typedef struct GinStatsData
 	int32		ginVersion;
 	Size		sizeRaw;
 	Size		sizeCompressed;
+	int64		nTrims;
 } GinStatsData;
 
 /*
-- 
2.45.2

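The freezing logic boils down to computing a prefix of the sorted buffer
that later merges cannot disturb. A standalone sketch of that computation
(TIDs simplified to uint64; the patch works with ItemPointerData and
ItemPointerCompare):

#include <stdint.h>

/*
 * Count the prefix of a sorted TID buffer that is "frozen", i.e. at or
 * before the horizon - the first TID of the next tuple to be merged.
 * Since tuples arrive sorted by their first TID, no later merge can add
 * anything before the horizon, so this prefix may be written out and
 * discarded.
 */
static int
count_frozen(const uint64_t *items, int nitems, uint64_t horizon)
{
	int			nfrozen = 0;

	while (nfrozen < nitems && items[nfrozen] <= horizon)
		nfrozen++;

	return nfrozen;
}

For scale, the 64kB buffer limit works out to 64 * 1024 / 6 = 10922 item
pointers, and trimming only kicks in once at least 1024 of them are frozen.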
v20240620-0007-Detect-wrap-around-in-parallel-callback.patch (text/x-patch; charset=UTF-8)
From d3635b15c1acadeec0d2d74ee18fc295b6c1a912 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 20 Jun 2024 20:50:51 +0200
Subject: [PATCH v20240620 7/7] Detect wrap around in parallel callback

When a sync scan during index build wraps around, that may result in
some keys having very long TID lists, requiring "full" merge sort runs
when combining data in workers. It also causes problems with enforcing
the memory limit, because we can't just dump the data - the index build
requires append-only posting lists, and violating that may result in
errors like

  ERROR: could not split GIN page; all old items didn't fit

because after the scan wraps around, some of the TIDs may belong to the
beginning of the list, affecting the compression.

But we can deal with this in the callback - if we see the TID jump
backwards, a wraparound must have happened. In that case we simply dump
all the data accumulated in memory, and start from scratch.

This means there won't be any tuples with very wide TID ranges. Instead,
there'll be one tuple with a range at the end of the table, and another
tuple at the beginning. And all the lists in the worker will be
non-overlapping, and will sort nicely based on the first TID.

For the leader, we still need to do the full merge - the lists may
overlap and interleave in various ways. But there should be only very
few of those lists, about one per worker, so that's not an issue.
---
 src/backend/access/gin/gininsert.c | 96 ++++++++++++++++++------------
 1 file changed, 57 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b6a40dd7ddc..cc0e63ff7f8 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -142,6 +142,7 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
 
 	/* FIXME likely duplicate with indtuples */
 	double		bs_numtuples;
@@ -497,6 +498,49 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * distinct hash values).
  */
 static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(buildstate, attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
+/*
+ * To detect a wraparound (which can happen with sync scans), we remember the
+ * last TID seen by each worker - if the next TID seen by the worker is lower,
+ * the scan must have wrapped around. We handle that by flushing the current
+ * buildstate to the tuplesort, so that we don't end up with wide TID lists.
+ */
+static void
 ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 						 bool *isnull, bool tupleIsAlive, void *state)
 {
@@ -506,6 +550,16 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
 
+	/* scan wrapped around - flush accumulated entries */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+	{
+		elog(LOG, "calling ginFlushBuildState");
+		ginFlushBuildState(buildstate, index);
+	}
+
+	/* remember the TID we're about to process */
+	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
+
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
 		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
 							   values[i], isnull[i], tid);
@@ -520,40 +574,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 	 * keep using work_mem here.
 	 */
 	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			/* GIN tuple and tuple length */
-			GinTuple   *tup;
-			Size		tuplen;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(buildstate, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &tuplen);
-
-			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
-	}
+		ginFlushBuildState(buildstate, index);
 
 	MemoryContextSwitchTo(oldCtx);
 }
@@ -590,6 +611,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.bs_numtuples = 0;
 	buildstate.bs_reltuples = 0;
 	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
@@ -1314,11 +1336,6 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * By keeping the first TID in the GinTuple and sorting by that, we make
  * it more likely the lists won't overlap very often.
  *
- * XXX How frequent can the overlaps be? If the scan does not wrap around,
- * there should be no overlapping lists, and thus no mergesort. After a
- * wraparound, there probably can be many - the one list will be very wide,
- * with a very low and high TID, and all other lists will overlap with it.
- *
  * XXX Maybe we could/should allocate the buffer once and then keep it
  * without palloc/pfree. That won't help when just calling the mergesort,
  * as that does palloc internally, but if we detected the append case,
@@ -2005,6 +2022,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	buildstate.indtuples = 0;
 	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/*
 	 * create a temporary memory context that is used to hold data not yet
-- 
2.45.2

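The detection itself is cheap - a single TID comparison per heap tuple. A
hedged sketch of the control flow (hypothetical names, TIDs simplified to
uint64; the patch compares ItemPointers and flushes via ginFlushBuildState):

#include <stdint.h>

/* Hypothetical per-worker state, standing in for GinBuildState. */
typedef struct
{
	uint64_t	last_tid;		/* last TID seen by this worker */
} WorkerState;

static void
flush_accumulated_entries(WorkerState *state)
{
	/* stand-in: dump accumulated entries to the tuplesort and reset */
	(void) state;
}

/*
 * Called once per heap tuple. A synchronized scan returns TIDs in
 * increasing order until it wraps around to the start of the table;
 * the first TID lower than its predecessor marks the wraparound.
 */
static void
process_tuple(WorkerState *state, uint64_t tid)
{
	if (tid < state->last_tid)
		flush_accumulated_entries(state);

	state->last_tid = tid;
	/* ... accumulate index entries for this tuple ... */
}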
#19Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#18)
7 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

Here's a bit more cleaned up version, clarifying a lot of comments,
removing a bunch of obsolete comments, or comments speculating about
possible solutions, that sort of thing. I've also removed a couple more
XXX comments, etc.

The main change however is that the sorting no longer relies on memcmp()
to compare the values. I did that because it was enough for the initial
WIP patches, and it worked till now - but the comments explained this
may not be a good idea if the data type allows the same value to have
multiple binary representations, or something like that.

I don't have a practical example to show an issue, but I guess if using
memcmp() was safe we'd be doing it in a bunch of places already, and
AFAIK we're not. And even if it happened to be OK, this is probably
not the place to start doing it.

So I've switched this to use the regular data-type comparisons, with
SortSupport etc. There's a bit more cleanup remaining and testing
needed, but I'm not aware of any bugs.
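For reference, a minimal sketch of what the SortSupport-based comparison
looks like (assuming a SortSupport entry initialized elsewhere for the key's
ordering operator; this is not the exact patch code):

#include "postgres.h"
#include "utils/sortsupport.h"

/*
 * Compare two non-NULL key Datums using the type's own comparator via
 * SortSupport, instead of memcmp() on the binary representation, which
 * is unsafe for types where the same value can have multiple binary
 * representations.
 */
static int
compare_keys(Datum a, Datum b, SortSupport ssup)
{
	return ApplySortComparator(a, false, b, false, ssup);
}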

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v20240624-0001-Allow-parallel-create-for-GIN-indexes.patch (text/x-patch; charset=UTF-8)
From 1c1a40a806acc46d0683783b1a77ddf1e5682309 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 19 Jun 2024 12:42:24 +0200
Subject: [PATCH v20240624 1/7] Allow parallel create for GIN indexes

Add support for parallel create of GIN indexes, using an approach and
code very similar to the one used by BRIN indexes.

Each worker reads a subset of the table (from a parallel scan), and
accumulates index entries in memory. But instead of writing the results
into the index (after hitting the memory limit), the data are written
to a shared tuplesort (and sorted by index key).

The leader then reads data from the tuplesort, and combines them into
entries that get inserted into the index.
---
 src/backend/access/gin/gininsert.c         | 1449 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  199 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   31 +
 src/include/utils/tuplesort.h              |    8 +
 src/tools/pgindent/typedefs.list           |    4 +
 8 files changed, 1685 insertions(+), 16 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 71f38be90c3..d8767b0fe81 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,125 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/sortsupport.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields before the parallel table scan data.
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinShared;
+
+/*
+ * Return pointer to a GinShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +143,48 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process.
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +463,109 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginBuildCallbackParallel
+ *		Callback for the parallel index build.
+ *
+ * This is very similar to the serial build callback ginBuildCallback,
+ * except that instead of writing the accumulated entries into the index,
+ * we write them into a tuplesort that is then processed by the leader.
+ *
+ * XXX Instead of writing the entries directly into the shared tuplesort,
+ * we might write them into a local one, do a sort in the worker, combine
+ * the results, and only then write the results into the shared tuplesort.
+ * For large tables with many different keys that's going to work better
+ * than the current approach where we don't get many matches in work_mem
+ * (maybe this should use 32MB, which is what we use when planning, but
+ * even that may not be sufficient). Which means we are likely to have
+ * many entries with a small number of TIDs, forcing the leader to merge
+ * the data, often amounting to ~50% of the serial part. By doing the
+ * first sort in the workers, the leader could then do fewer merges with
+ * longer TID lists, which is much cheaper. Also, the amount of data sent
+ * from workers to the leader would be lower.
+ *
+ * The disadvantage is increased disk space usage, possibly up to 2x, if
+ * no entries get combined at the worker level.
+ *
+ * It would be possible to partition the data into multiple tuplesorts
+ * per worker (by hashing) - we don't need the data produced by workers
+ * to be perfectly sorted, and we could even live with multiple entries
+ * for the same key (in case it has multiple binary representations with
+ * distinct hash values).
+ */
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort
+	 *
+	 * XXX It might seem this should set the memory limit to 32MB, same as
+	 * what plan_create_index_workers() uses to calculate the number of
+	 * parallel workers, but that's the limit for tuplesort. So it seems
+	 * better to keep using work_mem here.
+	 *
+	 * XXX But maybe we should calculate this as a per-worker fraction of
+	 * maintenance_work_mem. It's weird to use work_mem here, in a clearly
+	 * maintenance command.
+	 */
+	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&buildstate->accum);
+		while ((list = ginGetBAEntry(&buildstate->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the index key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			/* GIN tuple and tuple length that we'll use for tuplesort */
+			GinTuple   *tup;
+			Size		tuplen;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &tuplen);
+
+			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(buildstate->tmpCtx);
+		ginInitBA(&buildstate->accum);
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +583,15 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * Initialize all the fields, not to trip valgrind.
+	 *
+	 * XXX Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,25 +632,93 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. For GIN that's
+	 * reasonable too, because we sort the data just like btree. It does
+	 * ignore the memory used to accumulate data in memory (set by work_mem),
+	 * but there is no way to communicate that to plan_create_index_workers.
+	 */
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
+
+
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, (void *) &buildstate,
-									   NULL);
+	if (state->bs_leader)
+	{
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(heap, index,
+									  maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
+
+		/* scan the relation in parallel and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, (void *) &buildstate,
+										   NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -534,3 +858,1098 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
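+
+/*
+ * For reference, the DSM segment set up in _gin_begin_parallel contains
+ * the following TOC entries (a summary of the code above, not a separate
+ * structure):
+ *
+ *	PARALLEL_KEY_GIN_SHARED		GinShared state + parallel scan descriptor
+ *	PARALLEL_KEY_TUPLESORT		shared tuplesort state
+ *	PARALLEL_KEY_QUERY_TEXT		query string (only if available)
+ *	PARALLEL_KEY_WAL_USAGE		per-worker WalUsage array
+ *	PARALLEL_KEY_BUFFER_USAGE	per-worker BufferUsage array
+ */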
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinShared  *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+/*
+ * Buffer used to accumulate TIDs from multiple GinTuples for the same key
+ * (we read these from the tuplesort, sorted by the key).
+ *
+ * This is similar to BuildAccumulator in that it's used to collect TIDs
+ * in memory before inserting them into the index, but it's much simpler
+ * as it only deals with a single index key at a time.
+ *
+ * XXX The TID values in the "items" array are not guaranteed to be sorted,
+ * so we have to sort them explicitly. This is because parallel scans may be
+ * synchronized (and thus may wrap around), and because we combine values
+ * from multiple workers.
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	int			maxitems;
+	SortSupport ssup;			/* for sorting/comparing keys */
+	ItemPointerData *items;
+} GinBuffer;
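+
+/*
+ * Intended use of the buffer, as a sketch (the actual logic is in
+ * _gin_parallel_merge below):
+ *
+ *	buffer = GinBufferInit(index);
+ *
+ *	while ((tup = tuplesort_getgintuple(sortstate, &len, true)) != NULL)
+ *	{
+ *		if (!GinBufferCanAddKey(buffer, tup))
+ *		{
+ *			... flush buffer->items for buffer->key into the index ...
+ *			GinBufferReset(buffer);
+ *		}
+ *		GinBufferStoreTuple(buffer, tup);
+ *	}
+ *
+ *	if (!GinBufferIsEmpty(buffer))
+ *		... flush the last key ...
+ */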
+
+/*
+ * Check that TID array contains valid values, and that it's sorted (if we
+ * expect it to be).
+ */
+static void
+AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+{
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 0; i < nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&items[i]));
+
+		if ((i == 0) || !sorted)
+			continue;
+
+		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+	}
+#endif
+}
+
+/* basic GinBuffer checks */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	Assert(buffer->nitems <= buffer->maxitems);
+
+	/* if we have any items, the array must exist */
+	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+
+	/*
+	 * we don't know if the TID array is expected to be sorted or not
+	 *
+	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
+	 */
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+#endif
+}
+
+/*
+ * Initialize the buffer used to accumulate TIDs for a single key at a time
+ * (we process the data in sorted order, so we know when we have received
+ * all data for a given key).
+ *
+ * Initializes sort support procedures for all index attributes.
+ */
+static GinBuffer *
+GinBufferInit(Relation index)
+{
+	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
+	int			i,
+				nKeys;
+	TupleDesc	desc = RelationGetDescr(index);
+
+	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
+
+	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
+
+	/*
+	 * Lookup ordering operator for the index key data type, and initialize
+	 * the sort support function.
+	 */
+	for (i = 0; i < nKeys; i++)
+	{
+		SortSupport sortKey = &buffer->ssup[i];
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = index->rd_indcollation[i];
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	return buffer;
+}
+
+/* Is the buffer empty, i.e. has no TID values in the array? */
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * Check if the tuple matches the data already accumulated in the GIN
+ * buffer. Compare the scalar fields first, before the actual key.
+ *
+ * Returns true if the key matches, and the TIDs belong to the buffer.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	int			r;
+	Datum		tupkey;
+
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	/*
+	 * For NULL/empty keys, this means equality, for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * For the tuple, get either the first sizeof(Datum) bytes for byval
+	 * types, or a pointer to the beginning of the data array.
+	 */
+	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+
+	r = ApplySortComparator(buffer->key, false,
+							tupkey, false,
+							&buffer->ssup[buffer->attnum - 1]);
+
+	return (r == 0);
+}
+
+/*
+ * GinBufferStoreTuple
+ *		Add data (especially TID list) from a GIN tuple to the buffer.
+ *
+ * The buffer is expected to be empty (in which case it's initialized), or
+ * having the same key. The TID values from the tuple are simply appended
+ * to the array, without sorting.
+ *
+ * XXX We expect the tuples to contain sorted TID lists, so maybe we should
+ * check that's true with an assert. And we could also check if the values
+ * are already in sorted order, in which case we can skip the sort later.
+ * But it seems like a waste of time, because it will become unnecessary
+ * after switching to mergesort in a later patch, and also because it's
+ * reasonable to expect the arrays to overlap.
+ */
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* enlarge the TID buffer, if needed */
+	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	{
+		/* 64 seems like a good init value */
+		buffer->maxitems = Max(buffer->maxitems, 64);
+
+		while (buffer->nitems + tup->nitems > buffer->maxitems)
+			buffer->maxitems *= 2;
+
+		if (buffer->items == NULL)
+			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 buffer->maxitems * sizeof(ItemPointerData));
+	}
+
+	/* now we should be guaranteed to have enough space for all the TIDs */
+	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+
+	/* copy the new TIDs into the buffer */
+	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
+	buffer->nitems += tup->nitems;
+
+	/* we simply append the TID values, so don't check sorting */
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+}
+
+/* TID comparator for qsort */
+static int
+tid_cmp(const void *a, const void *b)
+{
+	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+}
+
+/*
+ * GinBufferSortItems
+ *		Sort the TID values stored in the TID buffer.
+ */
+static void
+GinBufferSortItems(GinBuffer *buffer)
+{
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+}
+
+/*
+ * GinBufferReset
+ *		Reset the buffer into a state as if it contains no data.
+ *
+ * XXX Should we do something if the array of TIDs gets too large? It may
+ * grow too much, and we won't free it until the worker finishes building.
+ * It would be better to not let the array grow arbitrarily large, and to
+ * enforce work_mem as a memory limit by flushing the buffer into the
+ * tuplesort.
+ *
+ * XXX Might be better to have a separate memory context for the buffer.
+ */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/*
+	 * Not required, but makes it more likely to trigger a NULL dereference
+	 * if the value is used incorrectly, etc.
+	 */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+}
+
+/*
+ * GinBufferCanAddKey
+ *		Check if a given GIN tuple can be added to the current buffer.
+ *
+ * Returns true if the buffer is either empty or for the same index key.
+ *
+ * XXX This could / should also enforce a memory limit by checking the size of
+ * the TID array, and returning false if it's too large (more than work_mem,
+ * for example).
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, merge the per-worker results into
+ * the complete index. The results from each worker are sorted by the index
+ * key. While combining the per-worker results we accumulate the TIDs for
+ * the same key, and insert the whole TID list into the index at once.
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe should have local memory contexts similar to what
+ * _brin_parallel_merge does?
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by attribute,
+	 * category and key. That probably gives us an order matching how data
+	 * is organized in the index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert them at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in an order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinBufferSortItems(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
+													sortmem, coordinate,
+													TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write remaining accumulated entries */
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&state->accum);
+		while ((list = ginGetBAEntry(&state->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			GinTuple   *tup;
+			Size		len;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &len);
+
+			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(state->tmpCtx);
+		ginInitBA(&state->accum);
+	}
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set for the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
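+ *
+ * The resulting layout is, as an illustration (actual offsets depend on
+ * the key type and MAXALIGN):
+ *
+ *	+------------------+-----------+---------+--------------------------+
+ *	| scalar fields    | key bytes | padding | ItemPointerData[nitems]  |
+ *	+------------------+-----------+---------+--------------------------+
+ *	                   ^ data                ^ MAXALIGN(offsetof(GinTuple,
+ *	                                                    data) + keylen)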
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have an actual non-empty key. We include varlena headers and \0 bytes
+	 * for strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to do the MAXALIGN to allow
+	 * direct access to items pointers.
+	 *
+	 * XXX Note that for byval types we store the whole datum, no matter what
+	 * the typlen value is.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "invalid typlen");
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * Allocate space for the whole GIN tuple.
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplestore representation.
+ *
+ * Most of the fields are actually directly accessible, the only thing that
+ * needs more care is the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, the key value is
+ * compared last. The comparisons are done using type-specific sort support
+ * functions.
+ *
+ * XXX We might try using memcmp(), based on the assumption that if we get
+ * two keys that are two different representations of a logically equal
+ * value, it'll get merged by the index build. But it's not clear that's
+ * safe, because for keys with multiple binary representations we might
+ * end up with overlapping lists, which might affect performance by requiring
+ * a full merge of the TID lists, and perhaps even failures (e.g. errors like
+ * "could not split GIN page; all old items didn't fit" when inserting data
+ * into the index).
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
+{
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if ((a->category == GIN_CAT_NORM_KEY) &&
+		(b->category == GIN_CAT_NORM_KEY))
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		return ApplySortComparator(keya, false,
+								   keyb, false,
+								   &ssup[a->attrnum - 1]);
+	}
+
+	return 0;
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 5747ae6a4ca..dd22b44aca9 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb54..c9ea769afb5 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 05a853caa36..ed6084960b8 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,6 +20,7 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
@@ -46,6 +47,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +77,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +87,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -580,6 +589,79 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+/*
+ * XXX Maybe we should pass the ordering functions, not the heap/index?
+ */
+Tuplesortstate *
+tuplesort_begin_index_gin(Relation heapRel,
+						  Relation indexRel,
+						  int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext;
+	int			i;
+	TupleDesc	desc = RelationGetDescr(indexRel);
+
+	oldcontext = MemoryContextSwitchTo(base->maincontext);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	/*
+	 * Multi-column GIN indexes expand the row into a separate index entry for
+	 * attribute, and that's what we write into the tuplesort. But we still
+	 * need to initialize sortsupport for all the attributes.
+	 */
+	base->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
+
+	/* Prepare SortSupport data for each column */
+	base->sortKeys = (SortSupport) palloc0(base->nKeys *
+										   sizeof(SortSupportData));
+
+	for (i = 0; i < base->nKeys; i++)
+	{
+		SortSupport sortKey = base->sortKeys + i;
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = indexRel->rd_indcollation[i];
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		/*
+		 * Look up an ordering operator for the index key data type, and then
+		 * the sort support function.
+		 *
+		 * XXX does this use the right opckeytype/opcintype for GIN?
+		 */
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -817,6 +899,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tuple, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -989,6 +1102,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1777,6 +1913,69 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple,
+							   base->sortKeys);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 25983b7a505..be76d8446f4 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 00000000000..6f529a5aaf0
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,31 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations for the GinTuple format used during parallel GIN builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "storage/itemptr.h"
+#include "utils/sortsupport.h"
+
+/* XXX do we still need all the fields now that we use SortSupport? */
+typedef struct GinTuple
+{
+	Size		tuplen;			/* length of the whole tuple */
+	Size		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	OffsetNumber attrnum;		/* attnum of index key */
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..0ed71ae922a 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,10 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(Relation heapRel,
+												 Relation indexRel,
+												 int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +461,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +471,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 61ad417cde6..af86c22093e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1015,11 +1015,13 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1032,9 +1034,11 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinShared
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.45.2

v20240624-0002-Use-mergesort-in-the-leader-process.patchtext/x-patch; charset=UTF-8; name=v20240624-0002-Use-mergesort-in-the-leader-process.patchDownload
From 30830df85273b4c647aba06de6f27f415cf7ce97 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:02:29 +0200
Subject: [PATCH v20240624 2/7] Use mergesort in the leader process

The leader process (executing the serial part of the index build) spent
a significant part of the time in pg_qsort, after combining the partial
results from the workers. But we can improve this and move some of the
costs to the parallel part in workers - if workers produce sorted TID
lists, the leader can combine them by mergesort.

But to make this really efficient, the mergesort must not be executed
too many times. The workers may easily produce very short TID lists, if
there are many different keys, hitting the memory limit often. So this
adds an intermediate tuplesort pass into each worker, to combine TIDs
for each key and only then write the result into the shared tuplesort.

This means the number of mergesort invocations for each key should be
about the same as the number of workers. We can't really do better, and
it's low enough to keep the mergesort approach efficient.

Note: If we introduce a memory limit on GinBuffer (to not accumulate too
many TIDs in memory), we could end up with more chunks, but it should
not be very common.
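
The per-worker flow thus becomes, as a rough sketch (using the names
from the patch):

    heap scan -> ginBuildCallbackParallel
              -> local tuplesort (bs_worker_sort)
              -> _gin_process_worker_data (combine TIDs per key)
              -> shared tuplesort (bs_sortstate)
              -> leader combines the results in _gin_parallel_merge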
---
 src/backend/access/gin/gininsert.c | 200 +++++++++++++++++++++++------
 1 file changed, 162 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d8767b0fe81..1fa40e3ff72 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -161,6 +161,14 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate used only within a single worker for the first merge pass
+	 * happening there. In principle it doesn't need to be part of the build
+	 * state and we could pass it around directly, but it's more convenient
+	 * this way. And it's part of the build state, after all.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -471,23 +479,23 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * except that instead of writing the accumulated entries into the index,
  * we write them into a tuplesort that is then processed by the leader.
  *
- * XXX Instead of writing the entries directly into the shared tuplesort,
- * we might write them into a local one, do a sort in the worker, combine
+ * Instead of writing the entries directly into the shared tuplesort, write
+ * them into a local one (in each worker), do a sort in the worker, combine
  * the results, and only then write the results into the shared tuplesort.
  * For large tables with many different keys that's going to work better
  * than the current approach where we don't get many matches in work_mem
  * (maybe this should use 32MB, which is what we use when planning, but
- * even that may not be sufficient). Which means we are likely to have
- * many entries with a small number of TIDs, forcing the leader to merge
- * the data, often amounting to ~50% of the serial part. By doing the
- * first sort workers, the leader then could do fewer merges with longer
- * TID lists, which is much cheaper. Also, the amount of data sent from
- * workers to the leader woiuld be lower.
+ * even that may not be sufficient). Which means we would end up with many
+ * entries with a small number of TIDs, forcing the leader to merge the data,
+ * often amounting to ~50% of the serial part. By doing the first sort in
+ * workers, this work is parallelized and the leader does fewer merges with
+ * longer TID lists, which is much cheaper and more efficient. Also, the
+ * amount of data sent from workers to the leader gets lower.
  *
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
- * It would be possible to partition the data into multiple tuplesorts
+ * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
  * for the same key (in case it has multiple binary representations with
@@ -547,7 +555,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
-			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
 
 			pfree(tup);
 		}
@@ -1145,7 +1153,6 @@ typedef struct GinBuffer
 
 	/* array of TID values */
 	int			nitems;
-	int			maxitems;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1175,8 +1182,6 @@ static void
 AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
-	Assert(buffer->nitems <= buffer->maxitems);
-
 	/* if we have any items, the array must exist */
 	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
 
@@ -1294,11 +1299,7 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  * to the array, without sorting.
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
- * check that's true with an assert. And we could also check if the values
- * are already in sorted order, in which case we can skip the sort later.
- * But it seems like a waste of time, because it won't be unnecessary after
- * switching to mergesort in a later patch, and also because it's reasonable
- * to expect the arrays to overlap.
+ * check that's true with an assert.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1326,28 +1327,22 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* enlarge the TID buffer, if needed */
-	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
-		/* 64 seems like a good init value */
-		buffer->maxitems = Max(buffer->maxitems, 64);
+		int			nnew;
+		ItemPointer new;
 
-		while (buffer->nitems + tup->nitems > buffer->maxitems)
-			buffer->maxitems *= 2;
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
 
-		if (buffer->items == NULL)
-			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
-		else
-			buffer->items = repalloc(buffer->items,
-									 buffer->maxitems * sizeof(ItemPointerData));
-	}
+		Assert(nnew == buffer->nitems + tup->nitems);
 
-	/* now we should be guaranteed to have enough space for all the TIDs */
-	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+		if (buffer->items)
+			pfree(buffer->items);
 
-	/* copy the new TIDs into the buffer */
-	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
-	buffer->nitems += tup->nitems;
+		buffer->items = new;
+		buffer->nitems = nnew;
+	}
 
 	/* we simply append the TID values, so don't check sorting */
 	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
@@ -1411,6 +1406,24 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferFree
+ *		Release memory associated with the GinBuffer (including TID array).
+ */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
 /*
  * GinBufferCanAddKey
  *		Check if a given GIN tuple can be added to the current buffer.
@@ -1492,7 +1505,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1509,7 +1522,7 @@ _gin_parallel_merge(GinBuildState *state)
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1519,6 +1532,9 @@ _gin_parallel_merge(GinBuildState *state)
 		GinBufferReset(buffer);
 	}
 
+	/* release all the memory */
+	GinBufferFree(buffer);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1557,6 +1573,102 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short (due to frequent evictions in the callback).
+ * But combining many tiny lists is expensive, so we try to do as much as
+ * possible in the workers and only then pass the results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker. Which means the leader will
+ * do a very limited number of mergesorts, which is good.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the local per-worker tuplesort, sorted by the
+	 * key, and merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the shared tuplesort, and start a new entry for
+			 * the current GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		GinBufferSortItems(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1589,6 +1701,11 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													sortmem, coordinate,
 													TUPLESORT_NONE);
 
+	/* Local per-worker sort of raw-data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
+													  sortmem, NULL,
+													  TUPLESORT_NONE);
+
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1625,7 +1742,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
-			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
 
 			pfree(tup);
 		}
@@ -1634,6 +1751,13 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 		ginInitBA(&state->accum);
 	}
 
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place those
+	 * into the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
-- 
2.45.2

v20240624-0003-Remove-the-explicit-pg_qsort-in-workers.patchtext/x-patch; charset=UTF-8; name=v20240624-0003-Remove-the-explicit-pg_qsort-in-workers.patchDownload
From ec1798ee481d4a3ccc466eb5e5c14a5a80fd87fd Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:14:52 +0200
Subject: [PATCH v20240624 3/7] Remove the explicit pg_qsort in workers

We don't need to do the explicit sort before building the GIN tuple,
because the mergesort in GinBufferStoreTuple is already maintaining the
correct order (this was added in an earlier commit).

The commit also adds a field with the first TID, and modifies the
comparator to sort by it (for each key value). This helps workers to
build non-overlapping TID lists and simply append values instead of
having to do the actual mergesort to combine them. This is best-effort,
i.e. it's not guaranteed to eliminate the mergesort - in particular,
parallel scans are synchronized, and thus may start somewhere in the
middle of the table and wrap around. In that case there may be a very
wide list (with both low and high TID values).

Note: There's an XXX comment with a couple ideas on how to improve this,
at the cost of more complexity.
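
As an illustration (with hypothetical TIDs, and assuming no wraparound):
if for some key worker A emits the TID list [(0,1) ... (499,9)] and
worker B emits [(500,1) ... (999,9)], sorting the GinTuples by their
first TID means the leader sees A's list before B's, and the lists can
simply be concatenated instead of merged element by element.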
---
 src/backend/access/gin/gininsert.c | 107 +++++++++++++++++------------
 src/include/access/gin_tuple.h     |  11 ++-
 2 files changed, 74 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 1fa40e3ff72..df33e5947d8 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1160,19 +1160,27 @@ typedef struct GinBuffer
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
  * expect it to be).
+ *
+ * XXX At this point there are no places where "sorted=false" should be
+ * necessary, because we always use merge-sort to combine the old and new
+ * TID list. So maybe we should get rid of the argument entirely.
  */
 static void
-AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
-	for (int i = 0; i < nitems; i++)
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
 	{
-		Assert(ItemPointerIsValid(&items[i]));
+		Assert(ItemPointerIsValid(&buffer->items[i]));
 
 		if ((i == 0) || !sorted)
 			continue;
 
-		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
 	}
 #endif
 }
@@ -1189,8 +1197,10 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 	 * we don't know if the TID array is expected to be sorted or not
 	 *
 	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
+	 * XXX actually with the mergesort in GinBufferStoreTuple, we
+	 * should not need 'false' here. See AssertCheckItemPointers.
 	 */
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+	AssertCheckItemPointers(buffer, false);
 #endif
 }
 
@@ -1295,8 +1305,26 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  *		Add data (especially TID list) from a GIN tuple to the buffer.
  *
  * The buffer is expected to be empty (in which case it's initialized), or
- * having the same key. The TID values from the tuple are simply appended
- * to the array, without sorting.
+ * having the same key. The TID values from the tuple are combined with the
+ * stored values using a merge sort.
+ *
+ * The tuples (for the same key) are expected to be sorted by first TID. But
+ * this does not guarantee the lists do not overlap, especially in the leader,
+ * because the workers process interleaving data. But even in a single worker,
+ * lists can overlap - parallel scans require sync-scans, and if a scan wraps,
+ * one of the lists may be very wide (in terms of TID range).
+ *
+ * But ginMergeItemPointers() is already smart about detecting cases when it
+ * can simply concatenate the lists, and when a full mergesort is needed, and
+ * does the right thing.
+ *
+ * By keeping the first TID in the GinTuple and sorting by that, we make
+ * it less likely the lists will overlap very often.
+ *
+ * XXX How frequent can the overlaps be? If the scan does not wrap around,
+ * there should be no overlapping lists, and thus no mergesort. After a
+ * wraparound, there can probably be many - one list will be very wide,
+ * with a very low and high TID, and all other lists will overlap with it.
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
  * check that's true with an assert.
@@ -1342,33 +1370,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		buffer->items = new;
 		buffer->nitems = nnew;
-	}
-
-	/* we simply append the TID values, so don't check sorting */
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
-}
-
-/* TID comparator for qsort */
-static int
-tid_cmp(const void *a, const void *b)
-{
-	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
-}
-
-/*
- * GinBufferSortItems
- *		Sort the TID values stored in the TID buffer.
- */
-static void
-GinBufferSortItems(GinBuffer *buffer)
-{
-	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
-
-	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
 
-	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
+	}
 }
 
 /*
@@ -1505,7 +1509,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+			AssertCheckItemPointers(buffer, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1515,14 +1519,17 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+	 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1625,7 +1632,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer, true);
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
@@ -1639,7 +1646,10 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+	 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
@@ -1649,7 +1659,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		GinTuple   *ntup;
 		Size		ntuplen;
 
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer, true);
 
 		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
@@ -1954,6 +1964,7 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
+	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -2037,6 +2048,12 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
  * compared last. The comparisons are done using type-specific sort support
  * functions.
  *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
+ *
  * XXX We might try using memcmp(), based on the assumption that if we get
  * two keys that are two different representations of a logically equal
  * value, it'll get merged by the index build. But it's not clear that's
@@ -2049,6 +2066,7 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 int
 _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 {
+	int			r;
 	Datum		keya,
 				keyb;
 
@@ -2070,10 +2088,13 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 		keya = _gin_parse_tuple(a, NULL);
 		keyb = _gin_parse_tuple(b, NULL);
 
-		return ApplySortComparator(keya, false,
-								   keyb, false,
-								   &ssup[a->attrnum - 1]);
+		r = ApplySortComparator(keya, false,
+								keyb, false,
+								&ssup[a->attrnum - 1]);
+
+		/* if the key is the same, consider the first TID in the array */
+		return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 	}
 
-	return 0;
+	return ItemPointerCompare(&a->first, &b->first);
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 6f529a5aaf0..55dd8544b21 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -13,7 +13,15 @@
 #include "storage/itemptr.h"
 #include "utils/sortsupport.h"
 
-/* XXX do we still need all the fields now that we use SortSupport? */
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ *
+ * XXX do we still need all the fields now that we use SortSupport?
+ */
 typedef struct GinTuple
 {
 	Size		tuplen;			/* length of the whole tuple */
@@ -22,6 +30,7 @@ typedef struct GinTuple
 	bool		typbyval;		/* typbyval for key */
 	OffsetNumber attrnum;		/* attnum of index key */
 	signed char category;		/* category: normal or NULL? */
+	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
-- 
2.45.2
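
The "simply append values instead of having to do the actual mergesort"
behavior this commit message relies on is easy to see in a standalone C
sketch (plain ints instead of ItemPointerData; this illustrates the fast
path, it is not PostgreSQL's ginMergeItemPointers):

#include <stdlib.h>
#include <string.h>

/* merge two sorted TID lists; cheap append when they don't overlap */
static int *
merge_tids(const int *a, int na, const int *b, int nb, int *nout)
{
	int		   *out = malloc((na + nb) * sizeof(int));
	int			i = 0, j = 0, k = 0;

	*nout = na + nb;

	/* fast path: 'b' starts after 'a' ends, so just concatenate */
	if (na == 0 || a[na - 1] < b[0])
	{
		memcpy(out, a, na * sizeof(int));
		memcpy(out + na, b, nb * sizeof(int));
		return out;
	}

	/* slow path: a full merge of the two sorted lists */
	while (i < na && j < nb)
		out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
	while (i < na)
		out[k++] = a[i++];
	while (j < nb)
		out[k++] = b[j++];
	return out;
}

Sorting the tuples by their first TID, as the patch does, maximizes how
often the fast path applies.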

v20240624-0004-Compress-TID-lists-before-writing-tuples-t.patch (text/x-patch)
From 338062afaea504377ccf414cb82b3f0dbe87c997 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:39 +0200
Subject: [PATCH v20240624 4/7] Compress TID lists before writing tuples to
 disk

When serializing GIN tuples to tuplesorts, we can significantly reduce
the amount of data by compressing the TID lists. The GIN opclasses may
produce a lot of data (depending on how many keys are extracted from
each row), and the compression is very effective and efficient.

If the number of different keys is high, the first worker pass may not
benefit from the compression very much - the data will be spilled to
disk before the TID lists can grow long enough for the compression to
actually help. In the second pass the impact is much more significant.

For real-world data (full-text on mailing list archives), I usually see
the compression save only about ~15% in the first pass, but ~50% in the
second pass.
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index df33e5947d8..5b75e04d951 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -187,7 +187,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1337,7 +1339,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1373,6 +1376,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer, true);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /*
@@ -1891,6 +1897,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1903,6 +1918,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 * like endianness etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized in compressed form - it's highly compressible,
+ * and we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1916,6 +1936,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -1942,12 +1967,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "invalid typlen");
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * SHORTALIGN, but we do MAXALIGN anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -1997,37 +2044,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible, the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -2040,6 +2090,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array of decompressed TIDs.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -2085,8 +2157,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	if ((a->category == GIN_CAT_NORM_KEY) &&
 		(b->category == GIN_CAT_NORM_KEY))
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		r = ApplySortComparator(keya, false,
 								keyb, false,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index af86c22093e..621b77febee 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1034,6 +1034,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinShared
 GinState
 GinStatsData
-- 
2.45.2
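
For intuition on why the TID lists compress so well: sorted TIDs can be
stored as deltas in a variable-byte encoding, so a dense posting list
needs roughly 1-2 bytes per item instead of the 6 bytes of a raw
ItemPointerData. A standalone sketch of the principle (this mimics the
idea behind ginCompressPostingList, not its actual segment format):

#include <stddef.h>
#include <stdint.h>

/* variable-byte encoding: 7 bits per byte, high bit = continuation */
static size_t
encode_varbyte(uint64_t v, unsigned char *out)
{
	size_t		n = 0;

	while (v >= 0x80)
	{
		out[n++] = (unsigned char) (v | 0x80);
		v >>= 7;
	}
	out[n++] = (unsigned char) v;
	return n;
}

/* compress a sorted TID list (TIDs as 48-bit integers) into 'out' */
static size_t
compress_tids(const uint64_t *tids, int ntids, unsigned char *out)
{
	size_t		len = 0;
	uint64_t	prev = 0;

	for (int i = 0; i < ntids; i++)
	{
		len += encode_varbyte(tids[i] - prev, out + len);	/* delta */
		prev = tids[i];
	}
	return len;
}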

v20240624-0005-Collect-and-print-compression-stats.patch (text/x-patch)
From a712705ca2cb2f37dc776a9bb44a98130fd0f1b7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:43 +0200
Subject: [PATCH v20240624 5/7] Collect and print compression stats

Allows evaluating the benefits of compressing the TID lists.
---
 src/backend/access/gin/gininsert.c | 42 +++++++++++++++++++++++-------
 src/include/access/gin.h           |  2 ++
 2 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 5b75e04d951..bb993dfdf80 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -190,7 +190,8 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 static ItemPointer _gin_parse_tuple_items(GinTuple *a);
 static Datum _gin_parse_tuple_key(GinTuple *a);
 
-static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+static GinTuple *_gin_build_tuple(GinBuildState *state,
+								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
 								  Size *len);
@@ -553,7 +554,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(buildstate, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
@@ -1198,9 +1199,9 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 	/*
 	 * we don't know if the TID array is expected to be sorted or not
 	 *
-	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
-	 * XXX actually with the mergesort in GinBufferStoreTuple, we
-	 * should not need 'false' here. See AssertCheckItemPointers.
+	 * XXX maybe we can pass that to AssertCheckGinBuffer() call? XXX actually
+	 * with the mergesort in GinBufferStoreTuple, we should not need 'false'
+	 * here. See AssertCheckItemPointers.
 	 */
 	AssertCheckItemPointers(buffer, false);
 #endif
@@ -1614,6 +1615,15 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* sort the raw per-worker data */
 	tuplesort_performsort(state->bs_worker_sort);
 
+	/* print some basic info */
+	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
+	/* reset before the second phase */
+	state->buildStats.sizeCompressed = 0;
+	state->buildStats.sizeRaw = 0;
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
 	 * merge them into larger chunks for the leader to combine.
@@ -1640,7 +1650,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 */
 			AssertCheckItemPointers(buffer, true);
 
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
 									buffer->items, buffer->nitems, &ntuplen);
 
@@ -1667,7 +1677,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 		AssertCheckItemPointers(buffer, true);
 
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
 								buffer->items, buffer->nitems, &ntuplen);
 
@@ -1682,6 +1692,11 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* release all the memory */
 	GinBufferFree(buffer);
 
+	/* print some basic info */
+	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
 	tuplesort_end(worker_sort);
 }
 
@@ -1754,7 +1769,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(state, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
@@ -1848,6 +1863,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* initialize the GIN build state */
 	initGinState(&buildstate.ginstate, indexRel);
 	buildstate.indtuples = 0;
+	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
 	/*
@@ -1925,7 +1941,8 @@ typedef struct
  * of that into the GIN tuple.
  */
 static GinTuple *
-_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+_gin_build_tuple(GinBuildState *state,
+				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
 				 ItemPointerData *items, uint32 nitems,
 				 Size *len)
@@ -2059,6 +2076,13 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 		pfree(seginfo);
 	}
 
+	/* how large would the tuple be without compression? */
+	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		nitems * sizeof(ItemPointerData);
+
+	/* compressed size */
+	state->buildStats.sizeCompressed += tuplen;
+
 	return tuple;
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index be76d8446f4..2b6633d068a 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -49,6 +49,8 @@ typedef struct GinStatsData
 	BlockNumber nDataPages;
 	int64		nEntries;
 	int32		ginVersion;
+	Size		sizeRaw;
+	Size		sizeCompressed;
 } GinStatsData;
 
 /*
-- 
2.45.2

v20240624-0006-Enforce-memory-limit-when-combining-tuples.patch (text/x-patch)
From 2633c013c4921a38c0ff16dc9119f4212ceb2c80 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:46:48 +0200
Subject: [PATCH v20240624 6/7] Enforce memory limit when combining tuples

When combining intermediate results during a parallel GIN index build, we
want to restrict the memory usage. In ginBuildCallbackParallel() this is
done simply by dumping working state into tuplesort after hitting the
memory limit.

This commit introduces memory limit to the following steps, merging the
intermediate results in both worker and leader. The merge only deals
with one key at a time, and the primary risk is that a key might have too
many different TIDs. While this is not very likely, because the TID array
only needs 6B per item, it's still a potential issue.

We can't simply dump the whole current TID list - the index requires the
TID values to be inserted in the correct order, but if the lists overlap
(as they do between workers), the tail of the list may change during the
mergesort. But thanks to sorting GIN tuples by first TID, we can derive
a safe TID horizon - we know no future tuples will have TIDs from before
this value, so it's safe to output this part of the list.

This commit tracks the "frozen" part of the TID list, which is the part
we know won't change after merging additional TID lists. Then if the TID
list grows too large (more than 64kB), we try to trim it - write out the
frozen part of the list, and discard it from the buffer. We only do the
trimming if there's at least 1024 frozen items - we don't want to write
the data into the index in tiny chunks.

The freezing also allows us to skip the frozen part during mergesort.
The frozen part of the list is known to be fully sorted, so we can just
skip it and mergesort only the rest of the data.

Note: These limits (1024 and 64kB) are mostly arbitrary - they seem high
enough to get good efficiency for compression/batching, but low enough
to release memory early and work in small increments.
---
 src/backend/access/gin/gininsert.c | 232 ++++++++++++++++++++++++++++-
 src/include/access/gin.h           |   1 +
 2 files changed, 225 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index bb993dfdf80..cc380f03593 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1154,8 +1154,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempting to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1222,6 +1226,18 @@ GinBufferInit(Relation index)
 				nKeys;
 	TupleDesc	desc = RelationGetDescr(index);
 
+	/*
+	 * How many items can we fit into the memory limit? We don't want to end
+	 * up with too many TIDs, and 64kB seems more than enough. But maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 *
+	 * XXX This is not enough to prevent repeated merges after a wraparound
+	 * of the parallel scan, but it should be enough to make the merges cheap
+	 * because it quickly reaches the end of the second list and can just
+	 * memcpy the rest without walking it item by item.
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
 	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
 
 	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
@@ -1303,6 +1319,54 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (r == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we mean writing out and removing the tuple IDs that we know
+ * can't be changed by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding whether to trim - if the resulting
+ * list (after adding TIDs from the new tuple) would be too long, and if
+ * there are enough TIDs to trim (with values less than the "first" TID from
+ * the new tuple), we do the trim. By enough we mean at least 1024 TIDs (a
+ * mostly arbitrary number).
+ *
+ * XXX This does help for the wraparound case too, because the "wide" TID list
+ * is essentially two ranges - one at the beginning of the table, one at the
+ * end. And all the other ranges (from GIN tuples) come in between, and also
+ * do not overlap. So by trimming up to the range we're about to add, this
+ * guarantees we'll be able to "concatenate" the two lists cheaply.
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* We're not going to hit the memory limit after adding this tuple. */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1331,6 +1395,11 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
  * check that's true with an assert.
+ *
+ * XXX Maybe we could/should allocate the buffer once and then keep it
+ * without palloc/pfree. That won't help when just calling the mergesort,
+ * as that does palloc internally, but if we detected the append case,
+ * we could do without it. Not sure how much overhead it is, though.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1359,21 +1428,72 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now search the list linearly, to find the last frozen TID. If we found
+	 * the whole list is frozen, this just does nothing.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher.
+	 *
+	 * XXX Maybe this should do a binary search if the number of "non-frozen"
+	 * items is sufficiently high (enough to make linear search slower than
+	 * binsearch).
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		pfree(new);
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer, true);
 	}
@@ -1412,11 +1532,29 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /*
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
@@ -1484,7 +1622,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers have already completed.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/*
@@ -1526,6 +1669,34 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index,
+		 * if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * Flush the frozen part of the TID list into the index, and
+			 * keep only the unfrozen tail in the buffer.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1549,6 +1720,8 @@ _gin_parallel_merge(GinBuildState *state)
 	/* release all the memory */
 	GinBufferFree(buffer);
 
+	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1609,7 +1782,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/* sort the raw per-worker data */
@@ -1662,6 +1841,41 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the shared
+		 * tuplesort, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * Flush the frozen part of the TID list into the shared
+			 * tuplesort, and keep only the unfrozen tail in the buffer.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1697,6 +1911,8 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
 		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
 
+	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(worker_sort);
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 2b6633d068a..9381329fac5 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -51,6 +51,7 @@ typedef struct GinStatsData
 	int32		ginVersion;
 	Size		sizeRaw;
 	Size		sizeCompressed;
+	int64		nTrims;
 } GinStatsData;
 
 /*
-- 
2.45.2
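
The freeze/trim machinery above condenses to a short standalone sketch
(plain ints instead of ItemPointerData; the 1024-item and 64kB limits
mirror the mostly arbitrary thresholds in the patch). The invariant is
that TIDs sorting at or before the first TID of the incoming tuple can
never be reordered by later merges, so they may be written out and
dropped from memory:

#include <string.h>

#define MIN_FROZEN	1024
#define MAX_ITEMS	((64 * 1024) / 6)	/* ~64kB of 6-byte TIDs */

typedef struct
{
	int		   *items;
	int			nitems;
	int			nfrozen;
} TidBuffer;

/* advance the frozen horizon up to the incoming tuple's first TID */
static void
buffer_freeze(TidBuffer *buf, int tuple_first)
{
	while (buf->nfrozen < buf->nitems &&
		   buf->items[buf->nfrozen] <= tuple_first)
		buf->nfrozen++;
}

/* trim only when enough is frozen AND we'd exceed the memory budget */
static int
buffer_should_trim(const TidBuffer *buf, int tuple_nitems)
{
	return buf->nfrozen >= MIN_FROZEN &&
		buf->nitems + tuple_nitems >= MAX_ITEMS;
}

/* discard the frozen prefix after it has been written out */
static void
buffer_trim(TidBuffer *buf)
{
	memmove(buf->items, buf->items + buf->nfrozen,
			(buf->nitems - buf->nfrozen) * sizeof(int));
	buf->nitems -= buf->nfrozen;
	buf->nfrozen = 0;
}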

v20240624-0007-Detect-wrap-around-in-parallel-callback.patch (text/x-patch)
From 1a8851891f8c3e7e760aa3a6f21ff2e5467f5f59 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 20 Jun 2024 20:50:51 +0200
Subject: [PATCH v20240624 7/7] Detect wrap around in parallel callback

When a sync scan during the index build wraps around, that may result in
some keys having very long TID lists, requiring "full" merge sort runs
when combining data in workers. It also causes problems with enforcing
the memory limit, because we can't just dump the data - the index build
requires append-only posting lists, and violating that may result in
errors like

  ERROR: could not split GIN page; all old items didn't fit

because after the scan wraps around, some of the TIDs may belong to the
beginning of the list, affecting the compression.

But we can deal with this in the callback - if we see the TID jump
back, that must mean a wraparound happened. In that case we simply dump
all the data accumulated in memory, and start from scratch.

This means there won't be any tuples with very wide TID ranges, instead
there'll be one tuple with a range at the end of the table, and another
tuple at the beginning. And all the lists in the worker will be
non-overlapping, and sort nicely based on first TID.

For the leader, we still need to do the full merge - the lists may
overlap and interleave in various ways. But there should be only very
few of those lists, about one per worker, making it not an issue.
---
 src/backend/access/gin/gininsert.c | 132 ++++++++++++++---------------
 1 file changed, 63 insertions(+), 69 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cc380f03593..4483eedcbe2 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -143,6 +143,7 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
 
 	/* FIXME likely duplicate with indtuples */
 	double		bs_numtuples;
@@ -474,6 +475,47 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginFlushBuildState
+ *		Write all data from BuildAccumulator into the tuplesort.
+ */
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(buildstate, attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
 /*
  * ginBuildCallbackParallel
  *		Callback for the parallel index build.
@@ -498,6 +540,11 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
+ * To detect a wraparound (which can happen with sync scans), we remember the
+ * last TID seen by each worker - if the next TID seen by the worker is lower,
+ * the scan must have wrapped around. We handle that by flushing the current
+ * buildstate to the tuplesort, so that we don't end up with wide TID lists.
+ *
  * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
@@ -514,6 +561,16 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
 
+	/* scan wrapped around - flush accumulated entries and start anew */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+	{
+		elog(LOG, "calling ginFlushBuildState");
+		ginFlushBuildState(buildstate, index);
+	}
+
+	/* remember the TID we're about to process */
+	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
+
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
 		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
 							   values[i], isnull[i], tid);
@@ -532,40 +589,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 	 * maintenance command.
 	 */
 	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the index key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			/* GIN tuple and tuple length that we'll use for tuplesort */
-			GinTuple   *tup;
-			Size		tuplen;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(buildstate, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &tuplen);
-
-			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
-	}
+		ginFlushBuildState(buildstate, index);
 
 	MemoryContextSwitchTo(oldCtx);
 }
@@ -602,6 +626,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.bs_numtuples = 0;
 	buildstate.bs_reltuples = 0;
 	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
@@ -1231,8 +1256,8 @@ GinBufferInit(Relation index)
 	 * with too many TIDs. and 64kB seems more than enough. But maybe this
 	 * should be tied to maintenance_work_mem or something like that?
 	 *
-	 * XXX This is not enough to prevent repeated merges after a wraparound
-	 * of the parallel scan, but it should be enough to make the merges cheap
+	 * XXX This is not enough to prevent repeated merges after a wraparound of
+	 * the parallel scan, but it should be enough to make the merges cheap
 	 * because it quickly reaches the end of the second list and can just
 	 * memcpy the rest without walking it item by item.
 	 */
@@ -1964,39 +1989,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 									   ginBuildCallbackParallel, state, scan);
 
 	/* write remaining accumulated entries */
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&state->accum);
-		while ((list = ginGetBAEntry(&state->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			GinTuple   *tup;
-			Size		len;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(state, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &len);
-
-			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(state->tmpCtx);
-		ginInitBA(&state->accum);
-	}
+	ginFlushBuildState(state, index);
 
 	/*
 	 * Do the first phase of in-worker processing - sort the data produced by
@@ -2081,6 +2074,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	buildstate.indtuples = 0;
 	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/*
 	 * create a temporary memory context that is used to hold data not yet
-- 
2.45.2
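
The detection itself is tiny; a standalone sketch of the rule the
callback applies (TIDs as plain ints, flushing reduced to a printf):

#include <stdio.h>

static int	last_tid = -1;

static void
callback(int tid)
{
	if (tid < last_tid)			/* scan wrapped around - dump everything */
		printf("wraparound detected - flushing accumulated entries\n");
	last_tid = tid;				/* remember the TID we're about to process */
}

int
main(void)
{
	/* a sync scan that starts at block 5 and wraps around to block 0 */
	int			tids[] = {5, 6, 7, 8, 9, 0, 1, 2, 3, 4};

	for (int i = 0; i < 10; i++)
		callback(tids[i]);
	return 0;
}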

#20Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#19)
3 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

Hi Tomas,

I am in the middle of an incomplete review, but I probably won't have
much time for this because of my internal tasks. So I'm just sharing
what I did, along with a patch that doesn't show good results.

What I tried to do is:
1) remove all the "sort" effort for the state->bs_sort_state tuples since
its input comes from state->bs_worker_state which is sorted already.

2) remove *partial* "sort" operations between accum.rbtree and
state->bs_worker_state because the tuples in accum.rbtree are sorted
already.

Both of them can depend on the same API changes.

1. 
struct Tuplesortstate
{
..
+ bool input_presorted; /* the tuples are presorted */
+ bool new_tapes;       /* write the in-memory tuples into a new 'run' */
}

and the user can set it during tuplesort_begin_xx(.., presorted=true);

2. in tuplesort_puttuple, if memory is full but presorted is
true, we can (a) avoid the sort, and (b) reuse the existing 'runs'
to reduce the effort of 'mergeruns', unless new_tapes is set to
true. Once it switches to a new tape, set state->new_tapes to false
and wait for 3) to change it to true again.

3. tuplesort_dumptuples(..); // dump the in-memory tuples and set
new_tapes=true to indicate that *this batch of input is presorted and
complete; the next batch is only presorted within its own batch*.

In the gin-parallel-build case, for the case 1), we can just use

for tuple in bs_worker_sort:
    tuplesort_putgintuple(state->bs_sortstate, ..);
tuplesort_dumptuples(..);

In the end we can get: a) only 1 run in the worker, so the leader has
fewer runs to merge in mergeruns; b) less sorting, both in
perform_sort_tuplesort and in sortstate_puttuple_common.

for the case 2), we can have:

for tuple in RBTree.tuples:
    tuplesort_puttuples(tuple);
    // this may cause a dumptuples internally when the memory is full,
    // but that is OK.
tuplesort_dumptuples(..);

This way we can just remove the "sort" in sortstate_puttuple_common, but
it will probably increase the number of 'runs' in the sortstate, which
in turn increases the effort of mergeruns at the end.
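
To make the intent concrete, here is a self-contained toy (not the real
tuplesort API - all names here are made up) showing how a "presorted"
promise lets the dump skip the qsort, and how an explicit dump marks a
run boundary:

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	int			items[8];
	int			n;
	int			presorted;		/* caller's promise, per the proposal */
	int			nruns;
} ToySort;

static int
int_cmp(const void *a, const void *b)
{
	return (*(const int *) a > *(const int *) b) -
		(*(const int *) a < *(const int *) b);
}

static void
toy_dump(ToySort *ts)
{
	if (!ts->presorted)
		qsort(ts->items, ts->n, sizeof(int), int_cmp);	/* skipped if presorted */
	printf("run %d:", ts->nruns++);
	for (int i = 0; i < ts->n; i++)
		printf(" %d", ts->items[i]);
	printf("\n");
	ts->n = 0;
}

static void
toy_put(ToySort *ts, int v)
{
	if (ts->n == 8)
		toy_dump(ts);			/* memory full: spill a run */
	ts->items[ts->n++] = v;
}

int
main(void)
{
	ToySort		ts = {.n = 0, .presorted = 1, .nruns = 0};

	for (int v = 0; v < 20; v++)
		toy_put(&ts, v);		/* input arrives already sorted */
	toy_dump(&ts);				/* explicit dump at a batch boundary */
	return 0;
}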

But the test result is not good - maybe the "sort" is not a key factor
here. I did miss the perf step before doing this, or maybe my test data
is too small.

Here is the patch I used for the above activity, and the SQL I used to
test it.

CREATE TABLE t(a int[], b numeric[]);

-- generate 1000 * 1000 rows.
insert into t select i, n
from normal_rand_array(1000, 90, 1::int4, 10000::int4) i,
normal_rand_array(1000, 90, '1.00233234'::numeric, '8.239241989134'::numeric) n;

alter table t set (parallel_workers=4);
set debug_parallel_query to on;
set max_parallel_maintenance_workers to 4;

create index on t using gin(a);
create index on t using gin(b);

I found normal_rand_array handy to use in this case, and I registered
it at https://commitfest.postgresql.org/48/5061/.

Besides the above, I didn't find anything wrong in the current patch,
and the above can be categorized as "future improvement", even though
it is worth doing.

--
Best Regards
Andy Fan

Attachments:

v20240702-0001-Add-function-normal_rand_array-function-to.patch (text/x-diff)
From 48c2e03fd854c8f88f781adc944f37b004db0721 Mon Sep 17 00:00:00 2001
From: Andy Fan <zhihuifan1213@163.com>
Date: Sat, 8 Jun 2024 13:21:08 +0800
Subject: [PATCH v20240702 1/3] Add function normal_rand_array to
 contrib/tablefunc.

It can produce arrays of numbers with a controllable mean array length
and duplicated elements in these arrays.
---
 contrib/tablefunc/Makefile                |   2 +-
 contrib/tablefunc/expected/tablefunc.out  |  26 ++++
 contrib/tablefunc/sql/tablefunc.sql       |  10 ++
 contrib/tablefunc/tablefunc--1.0--1.1.sql |   7 ++
 contrib/tablefunc/tablefunc.c             | 140 ++++++++++++++++++++++
 contrib/tablefunc/tablefunc.control       |   2 +-
 doc/src/sgml/tablefunc.sgml               |  10 ++
 src/backend/utils/adt/arrayfuncs.c        |   7 ++
 8 files changed, 202 insertions(+), 2 deletions(-)
 create mode 100644 contrib/tablefunc/tablefunc--1.0--1.1.sql

diff --git a/contrib/tablefunc/Makefile b/contrib/tablefunc/Makefile
index 191a3a1d38..f0c67308fd 100644
--- a/contrib/tablefunc/Makefile
+++ b/contrib/tablefunc/Makefile
@@ -3,7 +3,7 @@
 MODULES = tablefunc
 
 EXTENSION = tablefunc
-DATA = tablefunc--1.0.sql
+DATA = tablefunc--1.0.sql tablefunc--1.0--1.1.sql
 PGFILEDESC = "tablefunc - various functions that return tables"
 
 REGRESS = tablefunc
diff --git a/contrib/tablefunc/expected/tablefunc.out b/contrib/tablefunc/expected/tablefunc.out
index ddece79029..9f0cbbfbbe 100644
--- a/contrib/tablefunc/expected/tablefunc.out
+++ b/contrib/tablefunc/expected/tablefunc.out
@@ -12,6 +12,32 @@ SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
 -- negative number of tuples
 SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
 ERROR:  number of rows cannot be negative
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::numeric, 8::numeric) as i;
+ count |        avg         
+-------+--------------------
+    10 | 3.0000000000000000
+(1 row)
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int4, 8::int4) as i;
+ count |        avg         
+-------+--------------------
+    10 | 3.0000000000000000
+(1 row)
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int8, 8::int8) as i;
+ count |        avg         
+-------+--------------------
+    10 | 3.0000000000000000
+(1 row)
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::float8, 8::float8) as i;
+ count |        avg         
+-------+--------------------
+    10 | 3.0000000000000000
+(1 row)
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 'abc'::text, 'def'::text) as i;
+ERROR:  unsupported type 25 in normal_rand_array.
 --
 -- crosstab()
 --
diff --git a/contrib/tablefunc/sql/tablefunc.sql b/contrib/tablefunc/sql/tablefunc.sql
index 0fb8e40de2..dec57cfc66 100644
--- a/contrib/tablefunc/sql/tablefunc.sql
+++ b/contrib/tablefunc/sql/tablefunc.sql
@@ -8,6 +8,16 @@ SELECT avg(normal_rand)::int, count(*) FROM normal_rand(100, 250, 0.2);
 -- negative number of tuples
 SELECT avg(normal_rand)::int, count(*) FROM normal_rand(-1, 250, 0.2);
 
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::numeric, 8::numeric) as i;
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int4, 8::int4) as i;
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::int8, 8::int8) as i;
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 1.23::float8, 8::float8) as i;
+
+SELECT count(*), avg(COALESCE(array_length(i, 1), 0)) FROM normal_rand_array(10, 3, 'abc'::text, 'def'::text) as i;
+
 --
 -- crosstab()
 --
diff --git a/contrib/tablefunc/tablefunc--1.0--1.1.sql b/contrib/tablefunc/tablefunc--1.0--1.1.sql
new file mode 100644
index 0000000000..9d13e80ff0
--- /dev/null
+++ b/contrib/tablefunc/tablefunc--1.0--1.1.sql
@@ -0,0 +1,7 @@
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION tablefunc UPDATE TO '1.1'" to load this file. \quit
+
+CREATE FUNCTION normal_rand_array(int4, int4, anyelement, anyelement)
+RETURNS setof anyarray
+AS 'MODULE_PATHNAME','normal_rand_array'
+LANGUAGE C VOLATILE STRICT;
diff --git a/contrib/tablefunc/tablefunc.c b/contrib/tablefunc/tablefunc.c
index 7d1b5f5143..6d26aa843b 100644
--- a/contrib/tablefunc/tablefunc.c
+++ b/contrib/tablefunc/tablefunc.c
@@ -42,7 +42,9 @@
 #include "lib/stringinfo.h"
 #include "miscadmin.h"
 #include "tablefunc.h"
+#include "utils/array.h"
 #include "utils/builtins.h"
+#include "utils/fmgroids.h"
 
 PG_MODULE_MAGIC;
 
@@ -91,6 +93,13 @@ typedef struct
 	bool		use_carry;		/* use second generated value */
 } normal_rand_fctx;
 
+typedef struct
+{
+	int		carry_len;
+	FunctionCallInfo fcinfo;
+	FunctionCallInfo random_len_fcinfo;
+} normal_rand_array_fctx;
+
 #define xpfree(var_) \
 	do { \
 		if (var_ != NULL) \
@@ -269,6 +278,137 @@ normal_rand(PG_FUNCTION_ARGS)
 		SRF_RETURN_DONE(funcctx);
 }
 
+/*
+ * normal_rand_array - return requested number of random arrays
+ * with a Gaussian (Normal) distribution.
+ *
+ * inputs are int numvals, int mean_len, anyelement minvalue,
+ * anyelement maxvalue returns setof anyelement[]
+ */
+PG_FUNCTION_INFO_V1(normal_rand_array);
+Datum
+normal_rand_array(PG_FUNCTION_ARGS)
+{
+	FuncCallContext *funcctx;
+	uint64		call_cntr;
+	uint64		max_calls;
+	normal_rand_array_fctx *fctx;
+	MemoryContext oldcontext;
+	Datum	minvalue, maxvalue;
+	int array_mean_len;
+	Oid target_oid, random_fn_oid;
+
+	array_mean_len = PG_GETARG_INT32(1);
+	minvalue = PG_GETARG_DATUM(2);
+	maxvalue = PG_GETARG_DATUM(3);
+
+	target_oid = get_fn_expr_argtype(fcinfo->flinfo, 2);
+
+	if (target_oid == INT4OID)
+		random_fn_oid = F_RANDOM_INT4_INT4;
+	else if (target_oid == INT8OID)
+		random_fn_oid = F_RANDOM_INT8_INT8;
+	else if	(target_oid == FLOAT8OID)
+		random_fn_oid = F_RANDOM_;
+	else if (target_oid == NUMERICOID)
+		random_fn_oid = F_RANDOM_NUMERIC_NUMERIC;
+	else
+		elog(ERROR, "unsupported type %d in normal_rand_array.",
+			 target_oid);
+
+	/* stuff done only on the first call of the function */
+	if (SRF_IS_FIRSTCALL())
+	{
+		int32		num_tuples;
+		FmgrInfo	*random_len_flinfo, *random_val_flinfo;
+		FunctionCallInfo random_len_fcinfo, random_val_fcinfo;
+
+		/* create a function context for cross-call persistence */
+		funcctx = SRF_FIRSTCALL_INIT();
+
+		/*
+		 * switch to memory context appropriate for multiple function calls
+		 */
+		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+
+		/* total number of tuples to be returned */
+		num_tuples = PG_GETARG_INT32(0);
+		if (num_tuples < 0)
+			ereport(ERROR,
+					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+					 errmsg("number of rows cannot be negative")));
+		funcctx->max_calls = num_tuples;
+
+		/* allocate memory for user context */
+		fctx = (normal_rand_array_fctx *) palloc(sizeof(normal_rand_array_fctx));
+
+		random_len_fcinfo = (FunctionCallInfo) palloc0(SizeForFunctionCallInfo(2));
+		random_len_flinfo = (FmgrInfo *) palloc0(sizeof(FmgrInfo));
+		fmgr_info(F_RANDOM_INT4_INT4, random_len_flinfo);
+		InitFunctionCallInfoData(*random_len_fcinfo, random_len_flinfo, 2, InvalidOid, NULL, NULL);
+
+		random_len_fcinfo->args[0].isnull = false;
+		random_len_fcinfo->args[1].isnull = false;
+		random_len_fcinfo->args[0].value = 0;
+		random_len_fcinfo->args[1].value = array_mean_len;
+
+		random_val_fcinfo = (FunctionCallInfo) palloc0(SizeForFunctionCallInfo(2));
+		random_val_flinfo = (FmgrInfo *) palloc0(sizeof(FmgrInfo));
+		fmgr_info(random_fn_oid, random_val_flinfo);
+		InitFunctionCallInfoData(*random_val_fcinfo, random_val_flinfo, 2, InvalidOid, NULL, NULL);
+
+		random_val_fcinfo->args[0].isnull = false;
+		random_val_fcinfo->args[1].isnull = false;
+		random_val_fcinfo->args[0].value = minvalue;
+		random_val_fcinfo->args[1].value = maxvalue;
+
+		fctx->carry_len = -1;
+		fctx->fcinfo = random_val_fcinfo;
+		fctx->random_len_fcinfo = random_len_fcinfo;
+
+		funcctx->user_fctx = fctx;
+
+		MemoryContextSwitchTo(oldcontext);
+	}
+
+	/* stuff done on every call of the function */
+	funcctx = SRF_PERCALL_SETUP();
+
+	call_cntr = funcctx->call_cntr;
+	max_calls = funcctx->max_calls;
+	fctx = funcctx->user_fctx;
+
+	if (call_cntr < max_calls)	/* do when there is more left to send */
+	{
+		int		array_len;
+		int		i;
+		Datum	*results;
+
+		if (fctx->carry_len != -1)
+		{
+			/* use the complement length computed on the previous call */
+			array_len = fctx->carry_len;
+			fctx->carry_len = -1;
+		}
+		else
+		{
+			/*
+			 * Draw a random length in [0, mean] and remember its complement
+			 * (2 * mean - len) for the next call, so that each pair of
+			 * generated arrays averages the requested mean length.
+			 */
+			array_len = DatumGetInt32(FunctionCallInvoke(fctx->random_len_fcinfo));
+			fctx->carry_len = 2 * array_mean_len - array_len;
+		}
+
+		results = palloc(array_len * sizeof(Datum));
+
+	for (i = 0; i < array_len; i++)
+		results[i] = FunctionCallInvoke(fctx->fcinfo);
+
+		SRF_RETURN_NEXT(funcctx, PointerGetDatum(
+							construct_array_builtin(results, array_len, target_oid)));
+	}
+	else
+		/* do when there is no more left */
+		SRF_RETURN_DONE(funcctx);
+}
+
 /*
  * get_normal_pair()
  * Assigns normally distributed (Gaussian) values to a pair of provided
diff --git a/contrib/tablefunc/tablefunc.control b/contrib/tablefunc/tablefunc.control
index 7b25d16170..9cc6222a4f 100644
--- a/contrib/tablefunc/tablefunc.control
+++ b/contrib/tablefunc/tablefunc.control
@@ -1,6 +1,6 @@
 # tablefunc extension
 comment = 'functions that manipulate whole tables, including crosstab'
-default_version = '1.0'
+default_version = '1.1'
 module_pathname = '$libdir/tablefunc'
 relocatable = true
 trusted = true
diff --git a/doc/src/sgml/tablefunc.sgml b/doc/src/sgml/tablefunc.sgml
index e10fe7009d..014c36b81c 100644
--- a/doc/src/sgml/tablefunc.sgml
+++ b/doc/src/sgml/tablefunc.sgml
@@ -53,6 +53,16 @@
        </para></entry>
       </row>
 
+      <row>
+       <entry role="func_table_entry"><para role="func_signature">
        <function>normal_rand_array</function> ( <parameter>numvals</parameter> <type>integer</type>, <parameter>meanarraylen</parameter> <type>integer</type>, <parameter>minval</parameter> <type>anyelement</type>, <parameter>maxval</parameter> <type>anyelement</type> )
+        <returnvalue>setof anyarray</returnvalue>
+       </para>
+       <para>
        Produces a set of random arrays of numbers, with the array
        lengths varying around <parameter>meanarraylen</parameter>.
+       </para></entry>
+      </row>
+
       <row>
        <entry role="func_table_entry"><para role="func_signature">
         <function>crosstab</function> ( <parameter>sql</parameter> <type>text</type> )
diff --git a/src/backend/utils/adt/arrayfuncs.c b/src/backend/utils/adt/arrayfuncs.c
index d6641b570d..7c95cc05bc 100644
--- a/src/backend/utils/adt/arrayfuncs.c
+++ b/src/backend/utils/adt/arrayfuncs.c
@@ -3397,6 +3397,12 @@ construct_array_builtin(Datum *elems, int nelems, Oid elmtype)
 			elmalign = TYPALIGN_INT;
 			break;
 
+		case FLOAT8OID:
+			elmlen = sizeof(float8);
+			elmbyval = FLOAT8PASSBYVAL;
+			elmalign = TYPALIGN_DOUBLE;
+			break;
+
 		case INT2OID:
 			elmlen = sizeof(int16);
 			elmbyval = true;
@@ -3429,6 +3435,7 @@ construct_array_builtin(Datum *elems, int nelems, Oid elmtype)
 			break;
 
 		case TEXTOID:
+		case NUMERICOID:
 			elmlen = -1;
 			elmbyval = false;
 			elmalign = TYPALIGN_INT;
-- 
2.45.1

v20240702-0002-fix-incorrect-comments.patchtext/x-diffDownload
From 3acff4722a642c43bad5cd9ac89b81989d32998e Mon Sep 17 00:00:00 2001
From: Andy Fan <zhihuifan1213@163.com>
Date: Sun, 23 Jun 2024 14:31:41 +0000
Subject: [PATCH v20240702 2/3] fix incorrect comments.

---
 src/backend/catalog/index.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 55fdde4b24..73bfe5da00 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2958,8 +2958,7 @@ index_build(Relation heapRelation,
 	Assert(PointerIsValid(indexRelation->rd_indam->ambuildempty));
 
 	/*
-	 * Determine worker process details for parallel CREATE INDEX.  Currently,
-	 * only btree has support for parallel builds.
+	 * Determine worker process details for parallel CREATE INDEX.
 	 *
 	 * Note that planner considers parallel safety for us.
 	 */
-- 
2.45.1

v20240702-0003-optimize-some-sorts-on-tuplesort.c-if-the-.patchtext/x-diffDownload
From 27949647f968fc7914a48ce9c4dae9462c2b7707 Mon Sep 17 00:00:00 2001
From: Andy Fan <zhihuifan1213@163.com>
Date: Tue, 2 Jul 2024 07:40:00 +0800
Subject: [PATCH v20240702 3/3] optimize some sorts on tuplesort.c if the input
 is sorted.

Add an input_presorted member to Tuplesortstate to indicate that the
input tuples are already sorted - either 'partially' (within each
batch) or 'overall'.

When input_presorted is set, we can skip the sorts in
tuplesort_puttuple_common when memory fills up, and keep reusing the
previous 'run' on the current tape so that mergeruns has less work to
do - unless the caller tells tuplesort.c to put tuples into the next
run. This covers the use case where the input is presorted only within
separate batches. The side effect is that the number of 'runs' is no
longer determined by work_mem, but by the caller's calls.

I also applied this optimization to the GIN parallel index build, at
the 'bs_worker_sort' -> 'bs_sort_state' stage, where all the input is
sorted, so the number of 'runs' can be reduced to 1 for sure. However,
the improvement is not measurable (either the sort is too cheap in this
case, or the reduction in 'runs' isn't helpful with my test case?).
---
 src/backend/access/gin/gininsert.c         |  41 +++++++--
 src/backend/utils/sort/tuplesort.c         | 102 ++++++++++++++++++---
 src/backend/utils/sort/tuplesortvariants.c |   6 +-
 src/include/utils/tuplesort.h              |   9 +-
 4 files changed, 132 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 29bca0c54c..df469e3d9e 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -718,7 +718,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 		 */
 		state->bs_sortstate =
 			tuplesort_begin_index_gin(maintenance_work_mem, coordinate,
-									  TUPLESORT_NONE);
+									  TUPLESORT_NONE, false);
 
 		/* scan the relation and merge per-worker results */
 		reltuples = _gin_parallel_merge(state);
@@ -1743,7 +1743,9 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	buffer = GinBufferInit();
 
 	/* sort the raw per-worker data */
+	elog(LOG, "tuplesort_performsort(state->bs_worker_sort); started");
 	tuplesort_performsort(state->bs_worker_sort);
+	elog(LOG, "tuplesort_performsort(state->bs_worker_sort); done");
 
 	/* print some basic info */
 	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
@@ -1754,6 +1756,8 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	state->buildStats.sizeCompressed = 0;
 	state->buildStats.sizeRaw = 0;
 
+	elog(LOG, "start to fill to bs_sortstate");
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
 	 * merge them into larger chunks for the leader to combine.
@@ -1854,6 +1858,8 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		GinBufferReset(buffer);
 	}
 
+	elog(LOG, "finish to fill to bs_sortstate");
+
 	/* relase all the memory */
 	GinBufferFree(buffer);
 
@@ -1894,13 +1900,21 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	coordinate->nParticipants = -1;
 	coordinate->sharedsort = sharedsort;
 
-	/* Begin "partial" tuplesort */
+	/*
+	 * Begin "partial" tuplesort. The input tuples come from bs_worker_sort,
+	 * already merged into larger chunks in key order, so the input is fully
+	 * presorted.
+	 */
 	state->bs_sortstate = tuplesort_begin_index_gin(sortmem, coordinate,
-													TUPLESORT_NONE);
+													TUPLESORT_NONE, true);
 
-	/* Local per-worker sort of raw-data */
+	/*
+	 * Local per-worker sort of raw data. The input tuples come from the
+	 * RBTree, so they are pre-sorted only within each (small) batch, and
+	 * the tuplesort still has to sort across batches.
+	 */
 	state->bs_worker_sort = tuplesort_begin_index_gin(sortmem, NULL,
-													  TUPLESORT_NONE);
+													  TUPLESORT_NONE, false);
 
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
@@ -1909,6 +1923,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	scan = table_beginscan_parallel(heap,
 									ParallelTableScanFromGinShared(ginshared));
 
+	elog(LOG, "start to fill into bs_worker_sort");
 	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
 									   ginBuildCallbackParallel, state, scan);
 
@@ -1948,6 +1963,8 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 		ginInitBA(&state->accum);
 	}
 
+	elog(LOG, "end to fill into bs_worker_sort");
+
 	/*
 	 * Do the first phase of in-worker processing - sort the data produced by
 	 * the callback, and combine them into much larger chunks and place that
@@ -1955,8 +1972,18 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	 */
 	_gin_process_worker_data(state, state->bs_worker_sort);
 
-	/* sort the GIN tuples built by this worker */
-	tuplesort_performsort(state->bs_sortstate);
+	elog(LOG, "start to sort bs_sortstate");
+
+	/*
+	 * The tuples are already sorted in bs_worker_sort, so just dump the
+	 * remaining tuples to the tapes; no sort is needed.
+	 */
+	tuplesort_dump_sortedtuples(state->bs_sortstate);
+
+	/* mark that the worker has finished its work */
+	worker_freeze_result_tape(state->bs_sortstate);
+
+	elog(LOG, "end to sort bs_sortstate");
 
 	state->bs_reltuples += reltuples;
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7c4d6dc106..36e0a77b8d 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -187,6 +187,9 @@ struct Tuplesortstate
 {
 	TuplesortPublic base;
 	TupSortStatus status;		/* enumerated value as shown above */
+	bool		input_presorted;	/* is the input presorted? */
+	bool		new_tapes;		/* force selectnewtape; used with
+								 * input_presorted, see dumptuples */
 	bool		bounded;		/* did caller specify a maximum number of
 								 * tuples to return? */
 	bool		boundUsed;		/* true if we made use of a bounded heap */
@@ -475,7 +478,6 @@ static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(LogicalTape *tape, bool eofOK);
 static void markrunend(LogicalTape *tape);
 static int	worker_get_identifier(Tuplesortstate *state);
-static void worker_freeze_result_tape(Tuplesortstate *state);
 static void worker_nomergeruns(Tuplesortstate *state);
 static void leader_takeover_tapes(Tuplesortstate *state);
 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
@@ -643,6 +645,12 @@ qsort_tuple_int32_compare(SortTuple *a, SortTuple *b, Tuplesortstate *state)
 
 Tuplesortstate *
 tuplesort_begin_common(int workMem, SortCoordinate coordinate, int sortopt)
+{
+	return tuplesort_begin_common_ext(workMem, coordinate, sortopt, false);
+}
+
+Tuplesortstate *
+tuplesort_begin_common_ext(int workMem, SortCoordinate coordinate, int sortopt, bool presorted)
 {
 	Tuplesortstate *state;
 	MemoryContext maincontext;
@@ -742,6 +750,8 @@ tuplesort_begin_common(int workMem, SortCoordinate coordinate, int sortopt)
 	}
 
 	MemoryContextSwitchTo(oldcontext);
+	state->input_presorted = presorted;
+	state->new_tapes = true;
 
 	return state;
 }
@@ -1846,6 +1856,36 @@ tuplesort_merge_order(int64 allowedMem)
 	return mOrder;
 }
 
+/*
+ * Dump the presorted in-memory tuples to the current tape, and arrange for
+ * the next batch of sorted tuples to be dumped to a new tape.
+ */
+void
+tuplesort_dump_sortedtuples(Tuplesortstate *state)
+{
+	MemoryContext oldcontext = MemoryContextSwitchTo(state->base.sortcontext);
+
+	if (state->tapeset == NULL)
+	{
+		inittapes(state, true);
+	}
+
+	dumptuples(state, true);
+
+	/* add an end-of-run marker for this run */
+	markrunend(state->destTape);
+
+	/* the next dumptuples batch needs to select a new tape */
+	state->new_tapes = true;
+
+	/*
+	 * Record the result_tape for the sake of worker_freeze_result_tape,
+	 * which calls 'LogicalTapeFreeze(state->result_tape, &output)'.
+	 */
+	state->result_tape = state->outputTapes[0];
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Helper function to calculate how much memory to allocate for the read buffer
  * of each input tape in a merge pass.
@@ -2105,6 +2145,7 @@ mergeruns(Tuplesortstate *state)
 	 * don't bother.  (The initial input tapes are still in outputTapes.  The
 	 * number of input tapes will not increase between passes.)
 	 */
+	elog(INFO, "number of runs %d ", state->currentRun);
 	state->memtupsize = state->nOutputTapes;
 	state->memtuples = (SortTuple *) MemoryContextAlloc(state->base.maincontext,
 														state->nOutputTapes * sizeof(SortTuple));
@@ -2334,6 +2375,10 @@ mergereadnext(Tuplesortstate *state, LogicalTape *srcTape, SortTuple *stup)
  *
  * When alltuples = true, dump everything currently in memory.  (This case is
  * only used at end of input data.)
+ *
+ * When input_presorted = true and new_tapes = false, dump everything to the
+ * existing tape (rather than selecting a new tape), in order to reduce the
+ * number of tapes and runs.
  */
 static void
 dumptuples(Tuplesortstate *state, bool alltuples)
@@ -2372,23 +2417,41 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 				 errmsg("cannot have more than %d runs for an external sort",
 						INT_MAX)));
 
-	if (state->currentRun > 0)
-		selectnewtape(state);
+	if (!state->input_presorted)
+	{
+		if (state->currentRun > 0)
+			selectnewtape(state);
 
-	state->currentRun++;
+		state->currentRun++;
 
 #ifdef TRACE_SORT
-	if (trace_sort)
-		elog(LOG, "worker %d starting quicksort of run %d: %s",
-			 state->worker, state->currentRun,
-			 pg_rusage_show(&state->ru_start));
+		if (trace_sort)
+			elog(LOG, "worker %d starting quicksort of run %d: %s",
+				 state->worker, state->currentRun,
+				 pg_rusage_show(&state->ru_start));
 #endif
 
-	/*
-	 * Sort all tuples accumulated within the allowed amount of memory for
-	 * this run using quicksort
-	 */
-	tuplesort_sort_memtuples(state);
+		/*
+		 * Sort all tuples accumulated within the allowed amount of memory for
+		 * this run using quicksort.
+		 */
+		tuplesort_sort_memtuples(state);
+	}
+	else if (state->new_tapes)
+	{
+		if (state->currentRun > 0)
+			selectnewtape(state);
+
+		state->currentRun++;
+		/* reuse the existing tape next time */
+		state->new_tapes = false;
+	}
+	else
+	{
+		/*
+		 * Keep using the preexisting tape, to reduce the number of tapes.
+		 */
+	}
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -2423,7 +2486,14 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 	FREEMEM(state, state->tupleMem);
 	state->tupleMem = 0;
 
-	markrunend(state->destTape);
+	if (!state->input_presorted)
+	{
+		markrunend(state->destTape);
+	}
+	else
+	{
+		/* the run-end marker is written in tuplesort_dump_sortedtuples */
+	}
 
 #ifdef TRACE_SORT
 	if (trace_sort)
@@ -3043,12 +3113,14 @@ worker_get_identifier(Tuplesortstate *state)
  * There should only be one final output run for each worker, which consists
  * of all tuples that were originally input into worker.
  */
-static void
+void
 worker_freeze_result_tape(Tuplesortstate *state)
 {
 	Sharedsort *shared = state->shared;
 	TapeShare	output;
 
+	elog(LOG, "No. of runs  %d ", state->currentRun);
+	elog(INFO, "No. of runs  %d ", state->currentRun);
 	Assert(WORKER(state));
 	Assert(state->result_tape != NULL);
 	Assert(state->memtupcount == 0);
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 3d5b5ce015..94e098974b 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -592,10 +592,10 @@ tuplesort_begin_index_brin(int workMem,
 
 Tuplesortstate *
 tuplesort_begin_index_gin(int workMem, SortCoordinate coordinate,
-						  int sortopt)
+						  int sortopt, bool presorted)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
-												   sortopt);
+	Tuplesortstate *state = tuplesort_begin_common_ext(workMem, coordinate,
+													   sortopt, presorted);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 
 #ifdef TRACE_SORT
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 659d551247..26fbb6b757 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -375,6 +375,12 @@ typedef struct
 extern Tuplesortstate *tuplesort_begin_common(int workMem,
 											  SortCoordinate coordinate,
 											  int sortopt);
+extern Tuplesortstate *tuplesort_begin_common_ext(int workMem,
+												  SortCoordinate coordinate,
+												  int sortopt,
+												  bool input_presorted);
+extern void worker_freeze_result_tape(Tuplesortstate *state);
+
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
 extern bool tuplesort_used_bound(Tuplesortstate *state);
 extern void tuplesort_puttuple_common(Tuplesortstate *state,
@@ -387,6 +393,7 @@ extern bool tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples,
 								 bool forward);
 extern void tuplesort_end(Tuplesortstate *state);
 extern void tuplesort_reset(Tuplesortstate *state);
+extern void tuplesort_dump_sortedtuples(Tuplesortstate *state);
 
 extern void tuplesort_get_stats(Tuplesortstate *state,
 								TuplesortInstrumentation *stats);
@@ -445,7 +452,7 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_gin(int workMem, SortCoordinate coordinate,
-												 int sortopt);
+												 int sortopt, bool presorted);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
-- 
2.45.1

#21Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Andy Fan (#20)
Re: Parallel CREATE INDEX for GIN indexes

On 7/2/24 02:07, Andy Fan wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

Hi Tomas,

I am in the middle of an incomplete review, and I probably won't have
much time for this because of my internal tasks. So I'm just sharing
what I did, along with the patch that didn't produce good results.

What I tried to do is:
1) remove all the "sort" effort for the state->bs_sort_state tuples, since
its input comes from state->bs_worker_state, which is sorted already.

2) remove the *partial* "sort" operations between accum.rbtree and
state->bs_worker_state, because the tuples in accum.rbtree are sorted
already.

Both of them can depend on the same API changes.

1.
struct Tuplesortstate
{
..
+ bool input_presorted; /* the input tuples are presorted */
+ bool new_tapes;  /* write the in-memory tuples into a new 'run' */
}

and the user can set it during tuplesort_begin_xx(.., presorted=true);

2. in tuplesort_puttuple, if memory is full but presorted is
true, we can (a) avoid the sort and (b) reuse the existing 'runs'
to reduce the effort of 'mergeruns', unless new_tapes is set to
true. Once it switches to a new tape, set state->new_tapes to false
and wait for 3) to change it to true again.

3. tuplesort_dumptuples(..); // dump the in-memory tuples and set
new_tapes=true to say *this batch of presorted input is done; the
next batch is presorted only within its own batch*.

In the gin-parallel-build case, for case 1) we can just use

for tuple in bs_worker_sort:
    tuplesort_putgintuple(state->bs_sortstate, ..);
tuplesort_dumptuples(..);

In the end we get a) only 1 run in the worker, so the leader has fewer
runs to merge in mergeruns, and b) less sorting, both in
tuplesort_performsort and in tuplesort_puttuple_common.

For case 2) we can have:

for tuple in RBTree.tuples:
    tuplesort_puttuple(tuple);
    // this may trigger a dumptuples internally when the memory is
    // full, but that is OK.
tuplesort_dumptuples(..);

This lets us skip the "sort" in tuplesort_puttuple_common, but it
probably increases the number of 'runs' in the sortstate, which
increases the effort of mergeruns in the end.

But the test results are not good; maybe the 'sort' is not a key factor
here, and I skipped the perf step before doing this. Or maybe my test
data is too small.

If I understand the idea correctly, you're saying that we write the data
from BuildAccumulator already sorted, so if we do that only once, it's
already sorted and we don't actually need the in-worker tuplesort.

I think that's a good idea in principle, but maybe the simplest way to
handle this is by remembering if we already flushed any data, and if we
do that for the first time at the very end of the scan, we can write
stuff directly to the shared tuplesort. That seems much simpler than
doing this inside the tuplesort code.

Or did I get the idea wrong?
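
Something like this minimal sketch, in the build callback - note that
the bs_flushed flag and the ginFlushAccumulator() helper are
hypothetical, just to illustrate the idea:

static void
ginFlushAccumulatedData(GinBuildState *buildstate, bool end_of_scan)
{
	Tuplesortstate *dest;

	/*
	 * If the first flush happens only at the very end of the scan, the
	 * accumulated data is already sorted - write it straight into the
	 * shared tuplesort and skip the per-worker sort entirely.
	 */
	if (!buildstate->bs_flushed && end_of_scan)
		dest = buildstate->bs_sortstate;
	else
		dest = buildstate->bs_worker_sort;

	buildstate->bs_flushed = true;

	/* hypothetical helper writing the accumulated GIN tuples to 'dest' */
	ginFlushAccumulator(buildstate, dest);
}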

FWIW I'm not sure how much this will help in practice. We only really
want to do parallel index build for fairly large tables, which makes it
less likely the data will fit into the buffer (and if we flush during
the scan, that disables the optimization).

Here is the patch I used for the above activity. and I used the
following sql to test.

CREATE TABLE t(a int[], b numeric[]);

-- generate 1000 * 1000 rows.
insert into t select i, n
from normal_rand_array(1000, 90, 1::int4, 10000::int4) i,
normal_rand_array(1000, 90, '1.00233234'::numeric, '8.239241989134'::numeric) n;

alter table t set (parallel_workers=4);
set debug_parallel_query to on;

I don't think this forces parallel index builds - this GUC only affects
queries that go through the regular planner, but index build does not do
that, it just scans the table directly.

So maybe your testing did not actually do any parallel index builds?
That might explain why you didn't see any improvements.

Maybe try this to "force" parallel index builds:

set min_parallel_table_scan_size = '64kB';
set maintenance_work_mem = '256MB';

set max_parallel_maintenance_workers to 4;

create index on t using gin(a);
create index on t using gin(b);

I found normal_rand_array is handy to use in this case, so I
registered it at https://commitfest.postgresql.org/48/5061/.

Besides the above stuff, I didn't find anything wrong in the current
patch, and the above stuff can be categorized as "future improvements",
even though they are worth doing.

Thanks for the review!

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#22Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#19)
Re: Parallel CREATE INDEX for GIN indexes

On Mon, 24 Jun 2024 at 02:58, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Here's a bit more cleaned up version, clarifying a lot of comments,
removing a bunch of obsolete comments, or comments speculating about
possible solutions, that sort of thing. I've also removed couple more
XXX comments, etc.

The main change however is that the sorting no longer relies on memcmp()
to compare the values. I did that because it was enough for the initial
WIP patches, and it worked till now - but the comments explained this
may not be a good idea if the data type allows the same value to have
multiple binary representations, or something like that.

I don't have a practical example to show an issue, but I guess if using
memcmp() was safe we'd be doing it in a bunch of places already, and
AFAIK we're not. And even if it happened to be OK, this is probably
not the place to start doing it.

I think one such example would be the values '5.00'::jsonb and
'5'::jsonb when indexed using GIN's jsonb_ops, though I'm not sure if
they're treated as having the same value inside the opclass' ordering.
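
At the jsonb level, at least, it's easy to check that the two values
compare as equal while their binary representations differ (jsonb
preserves the numeric display scale):

SELECT '5.00'::jsonb = '5'::jsonb AS equal_value,  -- true
       '5.00'::jsonb::text AS repr_a,              -- '5.00'
       '5'::jsonb::text    AS repr_b;              -- '5'

So memcmp() on the raw values would see two different keys where the
datatype comparison sees one (whether jsonb_ops extracts identical keys
for them is a separate question).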

So I've switched this to use the regular data-type comparisons, with
SortSupport etc. There's a bit more cleanup remaining and testing
needed, but I'm not aware of any bugs.

A review of patch 0001:

---

src/backend/access/gin/gininsert.c | 1449 +++++++++++++++++++-

The nbtree code has `nbtsort.c` for its sort- and (parallel) build
state handling, which is exclusively used during index creation. As
the changes here seem to be largely related to bulk insertion, how
much effort would it be to split the bulk insertion code path into a
separate file?

I noticed that new fields in GinBuildState do get to have a
bs_*-prefix, but none of the other added or previous fields of the
modified structs in gininsert.c have such prefixes. Could this be
unified?

+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED            UINT64CONST(0xB000000000000001)
...

These overlap with BRIN's keys; can we make them unique while we're at it?
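
E.g. something like this - the exact constants don't matter, as long as
they don't collide with the other index AMs; these values are just a
suggestion:

#define PARALLEL_KEY_GIN_SHARED		UINT64CONST(0xC000000000000001)
#define PARALLEL_KEY_TUPLESORT		UINT64CONST(0xC000000000000002)
#define PARALLEL_KEY_QUERY_TEXT		UINT64CONST(0xC000000000000003)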

+ * mutex protects all fields before heapdesc.

I can't find the field that this `heapdesc` might refer to.

+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
...
+     if (!isconcurrent)
+        snapshot = SnapshotAny;
+    else
+        snapshot = RegisterSnapshot(GetTransactionSnapshot());

grumble: I know this is required from the index AM with the current APIs,
but I'm kind of annoyed that each index AM has to construct the table
scan and snapshot in their own code. I mean, this shouldn't be
meaningfully different across AMs, so every AM implementing this same
code makes me feel like we've got the wrong abstraction.

I'm not asking you to change this, but it's one more case where I'm
annoyed by the state of the system, but not quite enough yet to change
it.

---

+++ b/src/backend/utils/sort/tuplesortvariants.c

I was thinking some more about merging tuples inside the tuplesort. I
realized that this could be implemented by allowing buffering of tuple
writes in writetup. This would require adding a flush operation at the
end of mergeonerun to store the final unflushed tuple on the tape, but
that shouldn't be too expensive. This buffering, when implemented
through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
merge the TID lists of same-valued GIN tuples while they're getting
stored and re-sorted, thus reducing the temporary space usage of the
tuplesort by some amount with limited overhead for other
non-deduplicating tuplesorts.
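
A rough sketch of such a writetup, reusing the GinBuffer API from the
patch (GinBufferFlush, which would serialize the buffered tuple and
write it to the tape, is a hypothetical helper):

static void
writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
				   SortTuple *stup)
{
	TuplesortPublic *base = TuplesortstateGetPublic(state);
	GinBuffer  *buffer = (GinBuffer *) base->arg;
	GinTuple   *tup = (GinTuple *) stup->tuple;

	if (GinBufferIsEmpty(buffer) || GinBufferKeyEquals(buffer, tup))
	{
		/* first tuple, or same key: merge the TID list, write nothing */
		GinBufferStoreTuple(buffer, tup);
		return;
	}

	/* key changed: emit the merged tuple for the previous key ... */
	GinBufferFlush(buffer, tape);

	/* ... and start buffering the new key */
	GinBufferReset(buffer);
	GinBufferStoreTuple(buffer, tup);
}

mergeonerun (and the end of the worker pass) would then need one final
GinBufferFlush() call to emit the last buffered tuple.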

I've not yet spent the time to get this to work though, but I'm fairly
sure it'd use less temporary space than the current approach with the
2 tuplesorts, and could have lower overall CPU overhead as well
because the number of sortable items gets reduced much earlier in the
process.

---

+++ b/src/include/access/gin_tuple.h
+ typedef struct GinTuple

I think this needs some more care: currently, each GinTuple is at
least 36 bytes in size on 64-bit systems. By using int instead of Size
(no normal indexable tuple can be larger than MaxAllocSize), and
packing the fields better we can shave off 10 bytes; or 12 bytes if
GinTuple.keylen is further adjusted to (u)int16: a key needs to fit on
a page, so we can probably safely assume that the key size fits in
(u)int16.
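
For illustration, a repacked layout could look like the following - the
exact field order is just an assumption, but it brings the header down
to roughly 24 bytes before the data array on 64-bit builds:

typedef struct GinTuple
{
	int			tuplen;			/* length of the whole tuple; fits in int,
								 * tuples can't exceed MaxAllocSize */
	OffsetNumber attrnum;		/* attnum of index key */
	uint16		keylen;			/* bytes in data for key value; keys must
								 * fit on a page, so uint16 is enough */
	int16		typlen;			/* typlen for key */
	bool		typbyval;		/* typbyval for key */
	signed char category;		/* category: normal or NULL? */
	ItemPointerData first;		/* first TID in the array */
	int			nitems;			/* number of TIDs in the data */
	char		data[FLEXIBLE_ARRAY_MEMBER];
} GinTuple;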

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#23Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Matthias van de Meent (#22)
9 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

On Wed, 3 Jul 2024 at 20:36, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Mon, 24 Jun 2024 at 02:58, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

So I've switched this to use the regular data-type comparisons, with
SortSupport etc. There's a bit more cleanup remaining and testing
needed, but I'm not aware of any bugs.

I've hit assertion failures in my testing of the combined patches, in
AssertCheckItemPointers: it assumes it's never called when the buffer
is empty and uninitialized, but that's wrong: we don't initialize the
items array until the first tuple, which will cause the assertion to
fire. By updating the first 2 assertions of AssertCheckItemPointers, I
could get it working.

---

+++ b/src/backend/utils/sort/tuplesortvariants.c

I was thinking some more about merging tuples inside the tuplesort. I
realized that this could be implemented by allowing buffering of tuple
writes in writetup. This would require adding a flush operation at the
end of mergeonerun to store the final unflushed tuple on the tape, but
that shouldn't be too expensive. This buffering, when implemented
through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
merge the TID lists of same-valued GIN tuples while they're getting
stored and re-sorted, thus reducing the temporary space usage of the
tuplesort by some amount with limited overhead for other
non-deduplicating tuplesorts.

I've not yet spent the time to get this to work though, but I'm fairly
sure it'd use less temporary space than the current approach with the
2 tuplesorts, and could have lower overall CPU overhead as well
because the number of sortable items gets reduced much earlier in the
process.

I've now spent some time on this. Attached is the original patchset, plus
2 incremental patches, the first of which implements the above design
(patch no. 8).

Local tests show it's significantly faster: for the below test case
I've seen reindex time reduced from 777455ms to 626217ms, or ~20%
improvement.

After applying the 'reduce the size of GinTuple' patch, index creation
time is down to 551514ms, or about 29% faster total. This all was
tested with a fresh stock postgres configuration.

"""
CREATE UNLOGGED TABLE testdata
AS SELECT sha256(i::text::bytea)::text
FROM generate_series(1, 15000000) i;
CREATE INDEX ON testdata USING gin (sha256 gin_trgm_ops);
"""

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

Attachments:

v20240705-0003-Remove-the-explicit-pg_qsort-in-workers.patchapplication/x-patch; name=v20240705-0003-Remove-the-explicit-pg_qsort-in-workers.patchDownload
From b5f62379bdec78122b0831a5184aa18738bc056f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:14:52 +0200
Subject: [PATCH v20240705 3/9] Remove the explicit pg_qsort in workers

We don't need to do the explicit sort before building the GIN tuple,
because the mergesort in GinBufferStoreTuple is already maintaining the
correct order (this was added in an earlier commit).

The commit also adds a field with the first TID, and modifies the
comparator to sort by it (for each key value). This helps workers to
build non-overlapping TID lists and simply append values instead of
having to do the actual mergesort to combine them. This is best-effort,
i.e. it's not guaranteed to eliminate the mergesort - in particular,
parallel scans are synchronized, and thus may start somewhere in the
middle of the table, and wrap around, in which case there may be a very
wide list (with both low and high TID values).

Note: There's an XXX comment with a couple ideas on how to improve this,
at the cost of more complexity.
---
 src/backend/access/gin/gininsert.c | 107 +++++++++++++++++------------
 src/include/access/gin_tuple.h     |  11 ++-
 2 files changed, 74 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 1fa40e3ff7..df33e5947d 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1160,19 +1160,27 @@ typedef struct GinBuffer
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
  * expect it to be).
+ *
+ * XXX At this point there are no places where "sorted=false" should be
+ * necessary, because we always use merge-sort to combine the old and new
+ * TID list. So maybe we should get rid of the argument entirely.
  */
 static void
-AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
-	for (int i = 0; i < nitems; i++)
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
 	{
-		Assert(ItemPointerIsValid(&items[i]));
+		Assert(ItemPointerIsValid(&buffer->items[i]));
 
 		if ((i == 0) || !sorted)
 			continue;
 
-		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
 	}
 #endif
 }
@@ -1189,8 +1197,10 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 	 * we don't know if the TID array is expected to be sorted or not
 	 *
 	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
+	 * XXX actually with the mergesort in GinBufferStoreTuple, we
+	 * should not need 'false' here. See AssertCheckItemPointers.
 	 */
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+	AssertCheckItemPointers(buffer, false);
 #endif
 }
 
@@ -1295,8 +1305,26 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  *		Add data (especially TID list) from a GIN tuple to the buffer.
  *
  * The buffer is expected to be empty (in which case it's initialized), or
- * having the same key. The TID values from the tuple are simply appended
- * to the array, without sorting.
+ * having the same key. The TID values from the tuple are combined with the
+ * stored values using a merge sort.
+ *
+ * The tuples (for the same key) are expected to be sorted by first TID. But
+ * this does not guarantee the lists do not overlap, especially in the leader,
+ * because the workers process interleaving data. But even in a single worker,
+ * lists can overlap - parallel scans require sync-scans, and if a scan wraps,
+ * one of the lists may be very wide (in terms of TID range).
+ *
+ * But ginMergeItemPointers() is already smart about detecting cases when
+ * it can simply concatenate the lists and when a full mergesort is needed,
+ * and it does the right thing.
+ *
+ * By keeping the first TID in the GinTuple and sorting by that, we make it
+ * more likely the lists won't overlap very often.
+ *
+ * XXX How frequent can the overlaps be? If the scan does not wrap around,
+ * there should be no overlapping lists, and thus no mergesort. After a
+ * wraparound, there probably can be many - the one list will be very wide,
+ * with a very low and high TID, and all other lists will overlap with it.
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
  * check that's true with an assert.
@@ -1342,33 +1370,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		buffer->items = new;
 		buffer->nitems = nnew;
-	}
-
-	/* we simply append the TID values, so don't check sorting */
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
-}
-
-/* TID comparator for qsort */
-static int
-tid_cmp(const void *a, const void *b)
-{
-	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
-}
-
-/*
- * GinBufferSortItems
- *		Sort the TID values stored in the TID buffer.
- */
-static void
-GinBufferSortItems(GinBuffer *buffer)
-{
-	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
-
-	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
 
-	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
+	}
 }
 
 /*
@@ -1505,7 +1509,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+			AssertCheckItemPointers(buffer, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1515,14 +1519,17 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1625,7 +1632,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer, true);
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
@@ -1639,7 +1646,10 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
@@ -1649,7 +1659,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		GinTuple   *ntup;
 		Size		ntuplen;
 
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer, true);
 
 		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
@@ -1954,6 +1964,7 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
+	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -2037,6 +2048,12 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
  * compared last. The comparisons are done using type-specific sort support
  * functions.
  *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
+ *
  * XXX We might try using memcmp(), based on the assumption that if we get
  * two keys that are two different representations of a logically equal
  * value, it'll get merged by the index build. But it's not clear that's
@@ -2049,6 +2066,7 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 int
 _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 {
+	int			r;
 	Datum		keya,
 				keyb;
 
@@ -2070,10 +2088,13 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 		keya = _gin_parse_tuple(a, NULL);
 		keyb = _gin_parse_tuple(b, NULL);
 
-		return ApplySortComparator(keya, false,
-								   keyb, false,
-								   &ssup[a->attrnum - 1]);
+		r = ApplySortComparator(keya, false,
+								keyb, false,
+								&ssup[a->attrnum - 1]);
+
+		/* if the key is the same, consider the first TID in the array */
+		return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 	}
 
-	return 0;
+	return ItemPointerCompare(&a->first, &b->first);
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 6f529a5aaf..55dd8544b2 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -13,7 +13,15 @@
 #include "storage/itemptr.h"
 #include "utils/sortsupport.h"
 
-/* XXX do we still need all the fields now that we use SortSupport? */
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ *
+ * XXX do we still need all the fields now that we use SortSupport?
+ */
 typedef struct GinTuple
 {
 	Size		tuplen;			/* length of the whole tuple */
@@ -22,6 +30,7 @@ typedef struct GinTuple
 	bool		typbyval;		/* typbyval for key */
 	OffsetNumber attrnum;		/* attnum of index key */
 	signed char category;		/* category: normal or NULL? */
+	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
-- 
2.40.1

v20240705-0004-Compress-TID-lists-before-writing-tuples-t.patchapplication/x-patch; name=v20240705-0004-Compress-TID-lists-before-writing-tuples-t.patchDownload
From 8c0572530bd96463c7308fec98fdf948503de286 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:39 +0200
Subject: [PATCH v20240705 4/9] Compress TID lists before writing tuples to
 disk

When serializing GIN tuples to tuplesorts, we can significantly reduce
the amount of data by compressing the TID lists. The GIN opclasses may
produce a lot of data (depending on how many keys are extracted from
each row), and the compression is very effective and efficient.

If the number of different keys is high, the first worker pass may not
benefit from the compression very much - the data will be spilled to
disk before the TID lists can grow long enough for the compression to
actually help. In the second pass the impact is much more significant.

For real-world data (full-text on mailing list archives), I usually see
the compression to save only about ~15% in the first pass, but ~50% on
the second pass.
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index df33e5947d..5b75e04d95 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -187,7 +187,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1337,7 +1339,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1373,6 +1376,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer, true);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /*
@@ -1891,6 +1897,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1903,6 +1918,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
  * like endianess etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized as compressed - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1916,6 +1936,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -1942,12 +1967,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "invalid typlen");
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * SHORTALIGN, but we do MAXALIGN anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -1997,37 +2044,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible, the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -2040,6 +2090,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array of the decompressed TIDs.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -2085,8 +2157,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	if ((a->category == GIN_CAT_NORM_KEY) &&
 		(b->category == GIN_CAT_NORM_KEY))
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		r = ApplySortComparator(keya, false,
 								keyb, false,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0516852c8f..757fd5f4f5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1034,6 +1034,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinShared
 GinState
 GinStatsData
-- 
2.40.1

v20240705-0005-Collect-and-print-compression-stats.patchapplication/x-patch; name=v20240705-0005-Collect-and-print-compression-stats.patchDownload
From 236ed63e979381f6cfeb4850ac587873bdc27a48 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:43 +0200
Subject: [PATCH v20240705 5/9] Collect and print compression stats

Allows evaluating the benefits of compressing the TID lists.
---
 src/backend/access/gin/gininsert.c | 42 +++++++++++++++++++++++-------
 src/include/access/gin.h           |  2 ++
 2 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 5b75e04d95..bb993dfdf8 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -190,7 +190,8 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 static ItemPointer _gin_parse_tuple_items(GinTuple *a);
 static Datum _gin_parse_tuple_key(GinTuple *a);
 
-static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+static GinTuple *_gin_build_tuple(GinBuildState *state,
+								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
 								  Size *len);
@@ -553,7 +554,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(buildstate, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
@@ -1198,9 +1199,9 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 	/*
 	 * we don't know if the TID array is expected to be sorted or not
 	 *
-	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
-	 * XXX actually with the mergesort in GinBufferStoreTuple, we
-	 * should not need 'false' here. See AssertCheckItemPointers.
+	 * XXX maybe we can pass that to AssertCheckGinBuffer() call? XXX actually
+	 * with the mergesort in GinBufferStoreTuple, we should not need 'false'
+	 * here. See AssertCheckItemPointers.
 	 */
 	AssertCheckItemPointers(buffer, false);
 #endif
@@ -1614,6 +1615,15 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* sort the raw per-worker data */
 	tuplesort_performsort(state->bs_worker_sort);
 
+	/* print some basic info */
+	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
+	/* reset before the second phase */
+	state->buildStats.sizeCompressed = 0;
+	state->buildStats.sizeRaw = 0;
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
 	 * merge them into larger chunks for the leader to combine.
@@ -1640,7 +1650,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 */
 			AssertCheckItemPointers(buffer, true);
 
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
 									buffer->items, buffer->nitems, &ntuplen);
 
@@ -1667,7 +1677,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 		AssertCheckItemPointers(buffer, true);
 
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
 								buffer->items, buffer->nitems, &ntuplen);
 
@@ -1682,6 +1692,11 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* relase all the memory */
 	GinBufferFree(buffer);
 
+	/* print some basic info */
+	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
 	tuplesort_end(worker_sort);
 }
 
@@ -1754,7 +1769,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(state, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
@@ -1848,6 +1863,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* initialize the GIN build state */
 	initGinState(&buildstate.ginstate, indexRel);
 	buildstate.indtuples = 0;
+	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
 	/*
@@ -1925,7 +1941,8 @@ typedef struct
  * of that into the GIN tuple.
  */
 static GinTuple *
-_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+_gin_build_tuple(GinBuildState *state,
+				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
 				 ItemPointerData *items, uint32 nitems,
 				 Size *len)
@@ -2059,6 +2076,13 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 		pfree(seginfo);
 	}
 
+	/* how large would the tuple be without compression? */
+	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		nitems * sizeof(ItemPointerData);
+
+	/* compressed size */
+	state->buildStats.sizeCompressed += tuplen;
+
 	return tuple;
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index be76d8446f..2b6633d068 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -49,6 +49,8 @@ typedef struct GinStatsData
 	BlockNumber nDataPages;
 	int64		nEntries;
 	int32		ginVersion;
+	Size		sizeRaw;
+	Size		sizeCompressed;
 } GinStatsData;
 
 /*
-- 
2.40.1

v20240705-0002-Use-mergesort-in-the-leader-process.patchapplication/x-patch; name=v20240705-0002-Use-mergesort-in-the-leader-process.patchDownload
From 04621f730d466fdb3e4ff026d2097ff632695ec5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:02:29 +0200
Subject: [PATCH v20240705 2/9] Use mergesort in the leader process

The leader process (executing the serial part of the index build) spent
a significant part of the time in pg_qsort, after combining the partial
results from the workers. But we can improve this and move some of the
costs to the parallel part in workers - if workers produce sorted TID
lists, the leader can combine them by mergesort.

But to make this really efficient, the mergesort must not be executed
too many times. The workers may easily produce very short TID lists, if
there are many different keys, hitting the memory limit often. So this
adds an intermediate tuplesort pass into each worker, to combine TIDs
for each key and only then write the result into the shared tuplestore.

This means the number of mergesort invocations for each key should be
about the same as the number of workers. We can't really do better, and
it's low enough to keep the mergesort approach efficient.

Note: If we introduce a memory limit on GinBuffer (to not accumulate too
many TIDs in memory), we could end up with more chunks, but it should
not be very common.
---
 src/backend/access/gin/gininsert.c | 200 +++++++++++++++++++++++------
 1 file changed, 162 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d8767b0fe8..1fa40e3ff7 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -161,6 +161,14 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate used only within a single worker for the first merge pass
+	 * happenning there. In principle it doesn't need to be part of the build
+	 * happening there. In principle it doesn't need to be part of the build
+	 * this way. And it's part of the build state, after all.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -471,23 +479,23 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * except that instead of writing the accumulated entries into the index,
  * we write them into a tuplesort that is then processed by the leader.
  *
- * XXX Instead of writing the entries directly into the shared tuplesort,
- * we might write them into a local one, do a sort in the worker, combine
+ * Instead of writing the entries directly into the shared tuplesort, write
+ * them into a local one (in each worker), do a sort in the worker, combine
  * the results, and only then write the results into the shared tuplesort.
  * For large tables with many different keys that's going to work better
  * than the current approach where we don't get many matches in work_mem
  * (maybe this should use 32MB, which is what we use when planning, but
- * even that may not be sufficient). Which means we are likely to have
- * many entries with a small number of TIDs, forcing the leader to merge
- * the data, often amounting to ~50% of the serial part. By doing the
- * first sort in workers, the leader then could do fewer merges with longer
- * TID lists, which is much cheaper. Also, the amount of data sent from
- * workers to the leader would be lower.
+ * even that may not be sufficient). Which means we would end up with many
+ * entries with a small number of TIDs, forcing the leader to merge the data,
+ * often amounting to ~50% of the serial part. By doing the first sort in
+ * workers, this work is parallelized and the leader does fewer merges with
+ * longer TID lists, which is much cheaper and more efficient. Also, the
+ * amount of data sent from workers to the leader is lower.
  *
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
- * It would be possible to partition the data into multiple tuplesorts
+ * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
  * for the same key (in case it has multiple binary representations with
@@ -547,7 +555,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
-			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
 
 			pfree(tup);
 		}
@@ -1145,7 +1153,6 @@ typedef struct GinBuffer
 
 	/* array of TID values */
 	int			nitems;
-	int			maxitems;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1175,8 +1182,6 @@ static void
 AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
-	Assert(buffer->nitems <= buffer->maxitems);
-
 	/* if we have any items, the array must exist */
 	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
 
@@ -1294,11 +1299,7 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  * to the array, without sorting.
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
- * check that's true with an assert. And we could also check if the values
- * are already in sorted order, in which case we can skip the sort later.
- * But it seems like a waste of time, because it won't be necessary after
- * switching to mergesort in a later patch, and also because it's reasonable
- * to expect the arrays to overlap.
+ * check that's true with an assert.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1326,28 +1327,22 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* enlarge the TID buffer, if needed */
-	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
-		/* 64 seems like a good init value */
-		buffer->maxitems = Max(buffer->maxitems, 64);
+		int			nnew;
+		ItemPointer new;
 
-		while (buffer->nitems + tup->nitems > buffer->maxitems)
-			buffer->maxitems *= 2;
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
 
-		if (buffer->items == NULL)
-			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
-		else
-			buffer->items = repalloc(buffer->items,
-									 buffer->maxitems * sizeof(ItemPointerData));
-	}
+		Assert(nnew == buffer->nitems + tup->nitems);
 
-	/* now we should be guaranteed to have enough space for all the TIDs */
-	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+		if (buffer->items)
+			pfree(buffer->items);
 
-	/* copy the new TIDs into the buffer */
-	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
-	buffer->nitems += tup->nitems;
+		buffer->items = new;
+		buffer->nitems = nnew;
+	}
 
 	/* we simply append the TID values, so don't check sorting */
 	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
@@ -1411,6 +1406,24 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferFree
+ *		Release memory associated with the GinBuffer (including TID array).
+ */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
 /*
  * GinBufferCanAddKey
  *		Check if a given GIN tuple can be added to the current buffer.
@@ -1492,7 +1505,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the index, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1509,7 +1522,7 @@ _gin_parallel_merge(GinBuildState *state)
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1519,6 +1532,9 @@ _gin_parallel_merge(GinBuildState *state)
 		GinBufferReset(buffer);
 	}
 
+	/* release all the memory */
+	GinBufferFree(buffer);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1557,6 +1573,102 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short (due to frequent evictions in the callback).
+ * But combining many tiny lists is expensive, so we try to do as much as
+ * possible in the workers and only then pass the results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker. Which means the leader will
+ * do a very limited number of mergesorts, which is good.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the worker's local tuplesort, sorted by key, and
+	 * merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the tuplesort first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the tuplesort, and start a new entry for current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		GinBufferSortItems(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1589,6 +1701,11 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													sortmem, coordinate,
 													TUPLESORT_NONE);
 
+	/* Local per-worker sort of raw data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
+													  sortmem, NULL,
+													  TUPLESORT_NONE);
+
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1625,7 +1742,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
-			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
 
 			pfree(tup);
 		}
@@ -1634,6 +1751,13 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 		ginInitBA(&state->accum);
 	}
 
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place those into
+	 * the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
-- 
2.40.1
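The serialized GinTuple used by these patches stores the key right after the
fixed header and MAXALIGNs the start of the TID array, so the item pointers
can be addressed directly (see _gin_build_tuple / _gin_parse_tuple in the
patch below). A standalone sketch of just the layout arithmetic, with
stand-in types instead of the PostgreSQL definitions:

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* same rounding the PostgreSQL MAXALIGN macro performs (8-byte alignment) */
#define MAXIMUM_ALIGNOF 8
#define MAXALIGN(len) \
	(((uintptr_t) (len) + (MAXIMUM_ALIGNOF - 1)) & ~((uintptr_t) (MAXIMUM_ALIGNOF - 1)))

typedef struct
{
	uint32_t	blk;			/* stand-in for ItemPointerData */
	uint16_t	off;
} Tid;

typedef struct
{
	int			tuplen;
	int			keylen;
	int			nitems;
	char		data[];			/* key bytes, then MAXALIGNed TID array */
} FakeGinTuple;

int
main(void)
{
	int			keylen = 13;	/* e.g. a short varlena key */
	int			nitems = 4;

	/* TID array starts at the next MAXALIGN boundary after the key */
	size_t		tidoff = MAXALIGN(offsetof(FakeGinTuple, data) + keylen);
	size_t		tuplen = tidoff + sizeof(Tid) * nitems;

	printf("TIDs start at offset %zu, total tuple %zu bytes\n",
		   tidoff, tuplen);
	return 0;
}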

Attachment: v20240705-0001-Allow-parallel-create-for-GIN-indexes.patch (application/x-patch)
From 10dffd40c87b53d8846a381be724e31319afc612 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 19 Jun 2024 12:42:24 +0200
Subject: [PATCH v20240705 1/9] Allow parallel create for GIN indexes

Add support for parallel create of GIN indexes, using an approach and
code very similar to the one used by BRIN indexes.

Each worker reads a subset of the table (from a parallel scan), and
accumulates index entries in memory. But instead of writing the results
into the index (after hitting the memory limit), the data are written
to a shared tuplesort (and sorted by index key).

The leader then reads data from the tuplesort, and combines them into
entries that get inserted into the index.
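In pseudocode terms the leader's merge is a group-by over the sorted stream:
accumulate TIDs while the key repeats, and flush one combined entry per key.
A standalone sketch under simplified types - insert_entry() stands in for
ginEntryInsert(), and none of these names come from the patch:

#include <stdio.h>
#include <string.h>

typedef struct
{
	const char *key;			/* stand-in for the GIN key */
	int			tid;			/* stand-in for an ItemPointer */
} SortedTuple;

static void
insert_entry(const char *key, const int *tids, int ntids)
{
	printf("insert key=%s ntids=%d\n", key, ntids);
}

int
main(void)
{
	/* pretend output of the shared tuplesort, already sorted by key */
	SortedTuple stream[] = {
		{"alpha", 1}, {"alpha", 4}, {"beta", 2}, {"beta", 3}, {"gamma", 5},
	};
	int			n = sizeof(stream) / sizeof(stream[0]);

	int			buf[8];
	int			nbuf = 0;
	const char *cur = NULL;

	for (int i = 0; i < n; i++)
	{
		/* key change: flush the accumulated TIDs as one index entry */
		if (cur != NULL && strcmp(cur, stream[i].key) != 0)
		{
			insert_entry(cur, buf, nbuf);
			nbuf = 0;
		}
		cur = stream[i].key;
		buf[nbuf++] = stream[i].tid;
	}

	/* flush the last key */
	if (cur != NULL)
		insert_entry(cur, buf, nbuf);
	return 0;
}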
---
 src/backend/access/gin/gininsert.c         | 1449 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  199 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   31 +
 src/include/utils/tuplesort.h              |    8 +
 src/tools/pgindent/typedefs.list           |    4 +
 8 files changed, 1685 insertions(+), 16 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 71f38be90c..d8767b0fe8 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,125 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/sortsupport.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields before heapdesc.
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinShared;
+
+/*
+ * Return pointer to a GinShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +143,48 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process.
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +463,109 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginBuildCallbackParallel
+ *		Callback for the parallel index build.
+ *
+ * This is very similar to the serial build callback ginBuildCallback,
+ * except that instead of writing the accumulated entries into the index,
+ * we write them into a tuplesort that is then processed by the leader.
+ *
+ * XXX Instead of writing the entries directly into the shared tuplesort,
+ * we might write them into a local one, do a sort in the worker, combine
+ * the results, and only then write the results into the shared tuplesort.
+ * For large tables with many different keys that's going to work better
+ * than the current approach where we don't get many matches in work_mem
+ * (maybe this should use 32MB, which is what we use when planning, but
+ * even that may not be sufficient). Which means we are likely to have
+ * many entries with a small number of TIDs, forcing the leader to merge
+ * the data, often amounting to ~50% of the serial part. By doing the
+ * first sort in workers, the leader then could do fewer merges with longer
+ * TID lists, which is much cheaper. Also, the amount of data sent from
+ * workers to the leader would be lower.
+ *
+ * The disadvantage is increased disk space usage, possibly up to 2x, if
+ * no entries get combined at the worker level.
+ *
+ * It would be possible to partition the data into multiple tuplesorts
+ * per worker (by hashing) - we don't need the data produced by workers
+ * to be perfectly sorted, and we could even live with multiple entries
+ * for the same key (in case it has multiple binary representations with
+ * distinct hash values).
+ */
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort
+	 *
+	 * XXX It might seem this should set the memory limit to 32MB, same as
+	 * what plan_create_index_workers() uses to calculate the number of
+	 * parallel workers, but that's the limit for tuplesort. So it seems
+	 * better to keep using work_mem here.
+	 *
+	 * XXX But maybe we should calculate this as a per-worker fraction of
+	 * maintenance_work_mem. It's weird to use work_mem here, in a clearly
+	 * maintenance command.
+	 */
+	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&buildstate->accum);
+		while ((list = ginGetBAEntry(&buildstate->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the index key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			/* GIN tuple and tuple length that we'll use for tuplesort */
+			GinTuple   *tup;
+			Size		tuplen;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &tuplen);
+
+			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(buildstate->tmpCtx);
+		ginInitBA(&buildstate->accum);
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +583,15 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * Initialize all the fields, so as not to trip valgrind.
+	 *
+	 * XXX Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,25 +632,93 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. For GIN that's
+	 * reasonable too, because we sort the data just like btree. It does
+	 * ignore the memory used to accumulate data in memory (set by work_mem),
+	 * but there is no way to communicate that to plan_create_index_workers.
+	 */
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
+
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, (void *) &buildstate,
-									   NULL);
+	if (state->bs_leader)
+	{
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(heap, index,
+									  maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
+
+		/* scan the relation in parallel and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, (void *) &buildstate,
+										   NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -534,3 +858,1098 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	/*
+	 * Insert both shared states into the TOC, so that the worker processes
+	 * can look them up.
+	 */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinShared  *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+/*
+ * Buffer used to accumulate TIDs from multiple GinTuples for the same key
+ * (we read these from the tuplesort, sorted by the key).
+ *
+ * This is similar to BuildAccumulator in that it's used to collect TIDs
+ * in memory before inserting them into the index, but it's much simpler
+ * as it only deals with a single index key at a time.
+ *
+ * XXX The TID values in the "items" array are not guaranteed to be sorted,
+ * so we have to sort them explicitly. This is because parallel scans may be
+ * synchronized (and thus wrap around), and because we combine values from
+ * multiple workers.
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	int			maxitems;
+	SortSupport ssup;			/* for sorting/comparing keys */
+	ItemPointerData *items;
+} GinBuffer;
+
+/*
+ * Check that TID array contains valid values, and that it's sorted (if we
+ * expect it to be).
+ */
+static void
+AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+{
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 0; i < nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&items[i]));
+
+		if ((i == 0) || !sorted)
+			continue;
+
+		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+	}
+#endif
+}
+
+/* basic GinBuffer checks */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	Assert(buffer->nitems <= buffer->maxitems);
+
+	/* if we have any items, the array must exist */
+	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+
+	/*
+	 * we don't know if the TID array is expected to be sorted or not
+	 *
+	 * XXX maybe we could pass that to the AssertCheckGinBuffer() call?
+	 */
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+#endif
+}
+
+/*
+ * Initialize the buffer used to accumulate TIDs for a single key at a time
+ * (we process the data sorted), so we know when we have received all data
+ * for a given key.
+ *
+ * Initializes sort support procedures for all index attributes.
+ */
+static GinBuffer *
+GinBufferInit(Relation index)
+{
+	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
+	int			i,
+				nKeys;
+	TupleDesc	desc = RelationGetDescr(index);
+
+	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
+
+	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
+
+	/*
+	 * Lookup ordering operator for the index key data type, and initialize
+	 * the sort support function.
+	 */
+	for (i = 0; i < nKeys; i++)
+	{
+		SortSupport sortKey = &buffer->ssup[i];
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = index->rd_indcollation[i];
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	return buffer;
+}
+
+/* Is the buffer empty, i.e. has no TID values in the array? */
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * Compare if the tuple matches the already accumulated data in the GIN
+ * buffer. Compare scalar fields first, before the actual key.
+ *
+ * Returns true if the key matches, and the TIDs belong to the buffer.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	int			r;
+	Datum		tupkey;
+
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	/*
+	 * For NULL/empty keys, this means equality, for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * For the tuple, get either the first sizeof(Datum) bytes for byval
+	 * types, or a pointer to the beginning of the data array.
+	 */
+	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+
+	r = ApplySortComparator(buffer->key, false,
+							tupkey, false,
+							&buffer->ssup[buffer->attnum - 1]);
+
+	return (r == 0);
+}
+
+/*
+ * GinBufferStoreTuple
+ *		Add data (especially TID list) from a GIN tuple to the buffer.
+ *
+ * The buffer is expected to be empty (in which case it's initialized), or
+ * to hold the same key. The TID values from the tuple are simply appended
+ * to the array, without sorting.
+ *
+ * XXX We expect the tuples to contain sorted TID lists, so maybe we should
+ * check that's true with an assert. And we could also check if the values
+ * are already in sorted order, in which case we can skip the sort later.
+ * But it seems like a waste of time, because it won't be necessary after
+ * switching to mergesort in a later patch, and also because it's reasonable
+ * to expect the arrays to overlap.
+ */
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* enlarge the TID buffer, if needed */
+	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	{
+		/* 64 seems like a good init value */
+		buffer->maxitems = Max(buffer->maxitems, 64);
+
+		while (buffer->nitems + tup->nitems > buffer->maxitems)
+			buffer->maxitems *= 2;
+
+		if (buffer->items == NULL)
+			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 buffer->maxitems * sizeof(ItemPointerData));
+	}
+
+	/* now we should be guaranteed to have enough space for all the TIDs */
+	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+
+	/* copy the new TIDs into the buffer */
+	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
+	buffer->nitems += tup->nitems;
+
+	/* we simply append the TID values, so don't check sorting */
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+}
+
+/* TID comparator for qsort */
+static int
+tid_cmp(const void *a, const void *b)
+{
+	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+}
+
+/*
+ * GinBufferSortItems
+ *		Sort the TID values stored in the TID buffer.
+ */
+static void
+GinBufferSortItems(GinBuffer *buffer)
+{
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+}
+
+/*
+ * GinBufferReset
+ *		Reset the buffer into a state as if it contains no data.
+ *
+ * XXX Should we do something if the array of TIDs gets too large? It may
+ * grow too much, and we won't free it until the worker finishes building.
+ * It would be better to not let the array grow arbitrarily large, and to
+ * enforce work_mem as a memory limit by flushing the buffer into the
+ * tuplesort.
+ *
+ * XXX Might be better to have a separate memory context for the buffer.
+ */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/*
+	 * Not required, but makes it more likely to trigger a NULL dereference if
+	 * using the value incorrectly, etc.
+	 */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+}
+
+/*
+ * GinBufferCanAddKey
+ *		Check if a given GIN tuple can be added to the current buffer.
+ *
+ * Returns true if the buffer is either empty or for the same index key.
+ *
+ * XXX This could / should also enforce a memory limit by checking the size of
+ * the TID array, and returning false if it's too large (more than work_mem,
+ * for example).
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, merge the per-worker results into
+ * the complete index. The results from each worker are sorted by block number
+ * (start of the page range). While combinig the per-worker results we merge
+ * summaries for the same page range, and also fill-in empty summaries for
+ * ranges without any tuples.
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe should have local memory contexts similar to what
+ * _brin_parallel_merge does?
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us an order matching how the data is organized
+	 * in the index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinBufferSortItems(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
+													sortmem, coordinate,
+													TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write remaining accumulated entries */
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&state->accum);
+		while ((list = ginGetBAEntry(&state->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			GinTuple   *tup;
+			Size		len;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &len);
+
+			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(state->tmpCtx);
+		ginInitBA(&state->accum);
+	}
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have an actual non-empty key. We include varlena headers and \0 bytes for
+	 * strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to do the MAXALIGN to allow
+	 * direct access to items pointers.
+	 *
+	 * XXX Note that for byval types we store the whole datum, no matter what
+	 * the typlen value is.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "invalid typlen");
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * Allocate space for the whole GIN tuple.
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplestore representation.
+ *
+ * Most of the fields are directly accessible; the only things that need
+ * more care are the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first; the key value is
+ * compared last. The comparisons are done using type-specific sort support
+ * functions.
+ *
+ * XXX We might try using memcmp(), based on the assumption that if we get
+ * two keys that are two different representations of a logically equal
+ * value, it'll get merged by the index build. But it's not clear that's
+ * safe, because for keys with multiple binary representations we might end
+ * up with overlapping lists, which might affect performance by requiring a
+ * full merge of the TID lists, and perhaps even cause failures (e.g. errors
+ * like "could not split GIN page; all old items didn't fit" when inserting
+ * data into the index).
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
+{
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if ((a->category == GIN_CAT_NORM_KEY) &&
+		(b->category == GIN_CAT_NORM_KEY))
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		return ApplySortComparator(keya, false,
+								   keyb, false,
+								   &ssup[a->attrnum - 1]);
+	}
+
+	return 0;
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 5747ae6a4c..dd22b44aca 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb5..c9ea769afb 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 05a853caa3..ed6084960b 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,6 +20,7 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
@@ -46,6 +47,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +77,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +87,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -580,6 +589,79 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+/*
+ * XXX Maybe we should pass the ordering functions, not the heap/index?
+ */
+Tuplesortstate *
+tuplesort_begin_index_gin(Relation heapRel,
+						  Relation indexRel,
+						  int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext;
+	int			i;
+	TupleDesc	desc = RelationGetDescr(indexRel);
+
+	oldcontext = MemoryContextSwitchTo(base->maincontext);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	/*
+	 * Multi-column GIN indexes expand the row into a separate index entry
+	 * for each attribute, and that's what we write into the tuplesort. But
+	 * we still need to initialize sortsupport for all the attributes.
+	 */
+	base->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
+
+	/* Prepare SortSupport data for each column */
+	base->sortKeys = (SortSupport) palloc0(base->nKeys *
+										   sizeof(SortSupportData));
+
+	for (i = 0; i < base->nKeys; i++)
+	{
+		SortSupport sortKey = base->sortKeys + i;
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = indexRel->rd_indcollation[i];
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		/*
+		 * Look for an ordering operator for the index key data type, and
+		 * then the sort support function.
+		 *
+		 * XXX does this use the right opckeytype/opcintype for GIN?
+		 */
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -817,6 +899,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tuple, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -989,6 +1102,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1777,6 +1913,69 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple,
+							   base->sortKeys);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 25983b7a50..be76d8446f 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 0000000000..6f529a5aaf
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,31 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations of the GinTuple format used by parallel GIN index builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "storage/itemptr.h"
+#include "utils/sortsupport.h"
+
+/* XXX do we still need all the fields now that we use SortSupport? */
+typedef struct GinTuple
+{
+	Size		tuplen;			/* length of the whole tuple */
+	Size		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	OffsetNumber attrnum;		/* attnum of index key */
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f6201..0ed71ae922 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,10 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(Relation heapRel,
+												 Relation indexRel,
+												 int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +461,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +471,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e6c1caf649..0516852c8f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1015,11 +1015,13 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1032,9 +1034,11 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinShared
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.40.1

v20240705-0006-Enforce-memory-limit-when-combining-tuples.patch (application/x-patch)
From 95808dfcca3e251e1e935fb49cfe525b422851ef Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:46:48 +0200
Subject: [PATCH v20240705 6/9] Enforce memory limit when combining tuples

When combining intermediate results during a parallel GIN index build, we
want to restrict the memory usage. In ginBuildCallbackParallel() this is
done simply by dumping working state into tuplesort after hitting the
memory limit.

This commit introduces a memory limit for the following steps, which merge
the intermediate results in both the workers and the leader. The merge only
deals with one key at a time, and the primary risk is that a key might have
too many different TIDs. Even though this is not very likely - the TID array
only needs 6B per item - it's still a potential issue.

We can't simply dump the whole current TID list - the index requires the
TID values to be inserted in the correct order, but if the lists overlap
(as they do between workers), the tail of the list may change during the
mergesort. But thanks to sorting GIN tuples by first TID, we can derive
a safe TID horizon - we know no future tuples will have TIDs from before
this value, so it's safe to output this part of the list.

This commit tracks "frozen" part of the the TID list, which is the part
we know won't change after merging additional TID lists. Then if the TID
list grows too large (more than 64kB), we try to trim it - write out the
frozen part of the list, and discard it from the buffer. We only do the
trimming if there are at least 1024 frozen items - we don't want to write
the data into the index in tiny chunks.

The freezing also allows us to skip the frozen part during mergesort.
The frozen part of the list is known to be fully sorted, so we can just
skip it and mergesort only the rest of the data.

Note: These limits (1024 and 64kB) are mostly arbitrary, but they seem high
enough to get good efficiency from compression/batching, yet low enough to
release memory early and work in small increments.
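
In pseudocode, the per-tuple trimming decision looks roughly like this (a
sketch of the logic described above, not the exact patch code; flush_frozen()
stands in for writing the frozen TIDs into the index or tuplesort):

  if (buffer->nfrozen >= 1024 &&                        /* enough to batch */
      buffer->nitems + tup->nitems >= buffer->maxitems) /* over the 64kB cap */
  {
      /* write out the TIDs that are guaranteed not to change anymore ... */
      flush_frozen(buffer->items, buffer->nfrozen);

      /* ... and keep only the unfrozen tail in memory */
      memmove(buffer->items, &buffer->items[buffer->nfrozen],
              sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
      buffer->nitems -= buffer->nfrozen;
      buffer->nfrozen = 0;
  }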
---
 src/backend/access/gin/gininsert.c | 232 ++++++++++++++++++++++++++++-
 src/include/access/gin.h           |   1 +
 2 files changed, 225 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index bb993dfdf8..cc380f0359 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1154,8 +1154,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempt to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1222,6 +1226,18 @@ GinBufferInit(Relation index)
 				nKeys;
 	TupleDesc	desc = RelationGetDescr(index);
 
+	/*
+	 * How many items can we fit into the memory limit? We don't want to end
+	 * up with too many TIDs, and 64kB seems more than enough. But maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 *
+	 * XXX This is not enough to prevent repeated merges after a wraparound
+	 * of the parallel scan, but it should be enough to make the merges cheap
+	 * because it quickly reaches the end of the second list and can just
+	 * memcpy the rest without walking it item by item.
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
 	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
 
 	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
@@ -1303,6 +1319,54 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (r == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we understand writing out and removing the tuple IDs that
+ * we know can't change by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding whether to trim - if the resulting
+ * list (after adding TIDs from the new tuple) would be too long, and if there
+ * are enough TIDs to trim (with values less than the "first" TID from the new
+ * tuple), we do the trim. By enough we mean at least 1024 TIDs (mostly an
+ * arbitrary number).
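+ *
+ * For example: if the frozen part of the buffer ends at TID (120,5) and the
+ * incoming tuple's first TID is (130,1), no later tuple for the same key can
+ * contain a TID before (130,1), so the TIDs up to (120,5) can be written out
+ * and discarded right away.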
+ *
+ * XXX This does help for the wraparound case too, because the "wide" TID list
+ * is essentially two ranges - one at the beginning of the table, one at the
+ * end. And all the other ranges (from GIN tuples) come in between, and also
+ * do not overlap. So by trimming up to the range we're about to add, this
+ * guarantees we'll be able to "concatenate" the two lists cheaply.
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* We won't hit the memory limit even after adding this tuple. */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1331,6 +1395,11 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
  * check that's true with an assert.
+ *
+ * XXX Maybe we could/should allocate the buffer once and then keep it
+ * without palloc/pfree. That won't help when just calling the mergesort,
+ * as that does palloc internally, but if we detected the append case,
+ * we could do without it. Not sure how much overhead it is, though.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1359,21 +1428,72 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now search the list linearly to find the last frozen TID. If the whole
+	 * list is already frozen, this does nothing.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher.
+	 *
+	 * XXX Maybe this should do a binary search if the number of "non-frozen"
+	 * items is sufficiently high (enough to make linear search slower than
+	 * binsearch).
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		pfree(new);
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer, true);
 	}
@@ -1412,11 +1532,29 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /*
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
@@ -1484,7 +1622,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data, as the parallel workers have already completed.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/*
@@ -1526,6 +1669,34 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index,
+		 * if needed (and possible). We only flush the part of the TID list
+		 * that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The buffer exceeds the memory limit - insert the frozen part
+			 * of the TID list into the index, and then discard it from the
+			 * buffer, keeping only the unfrozen tail.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1549,6 +1720,8 @@ _gin_parallel_merge(GinBuildState *state)
 	/* release all the memory */
 	GinBufferFree(buffer);
 
+	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1609,7 +1782,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/* sort the raw per-worker data */
@@ -1662,6 +1841,41 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the
+		 * tuplesort, if needed (and possible). We only flush the part of
+		 * the TID list that we know won't change, and only if there's
+		 * enough data for compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The buffer exceeds the memory limit - serialize the frozen
+			 * part of the TID list into a GIN tuple, write it into the
+			 * tuplesort, and then discard it from the buffer.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1697,6 +1911,8 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
 		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
 
+	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(worker_sort);
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 2b6633d068..9381329fac 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -51,6 +51,7 @@ typedef struct GinStatsData
 	int32		ginVersion;
 	Size		sizeRaw;
 	Size		sizeCompressed;
+	int64		nTrims;
 } GinStatsData;
 
 /*
-- 
2.40.1

v20240705-0009-Reduce-the-size-of-GinTuple-by-12-bytes.patch (application/x-patch)
From d283552b956733a912021b8f6ea327ae1b38aef3 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 5 Jul 2024 20:58:37 +0200
Subject: [PATCH v20240705 9/9] Reduce the size of GinTuple by 12 bytes

The size of a GIN tuple can't be larger than what we can allocate, which is
just shy of 1GB. This reduces the number of useful bits in the size fields
to 30, so int is enough here.

Next, a key must fit in a single page (up to 32KB), so uint16 should be enough for
the keylen attribute.

Then, re-organize the fields to minimize alignment losses, while maintaining
an order that still makes sense as a logical grouping.

Finally, use the first posting list to get the first stored ItemPointer; this
deduplicates stored data and thus improves performance again. In passing, adjust the
alignment of the first GinPostingList in GinTuple from MAXALIGN to SHORTALIGN.
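
A quick way to see the effect of the reordering is to compare where the
flexible data array starts before and after (a self-contained sketch with
stand-in typedefs; the exact numbers depend on the platform ABI):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  /* stand-ins for the PostgreSQL typedefs, for illustration only */
  typedef unsigned short OffsetNumber;
  typedef struct { unsigned short bi_hi, bi_lo, posid; } ItemPointerData;

  /* the field order before this patch */
  typedef struct OldGinTuple
  {
      size_t          tuplen;
      size_t          keylen;
      short           typlen;
      bool            typbyval;
      OffsetNumber    attrnum;
      signed char     category;
      ItemPointerData first;
      int             nitems;
      char            data[];
  } OldGinTuple;

  /* the reordered, narrowed fields after this patch */
  typedef struct NewGinTuple
  {
      int             tuplen;
      OffsetNumber    attrnum;
      unsigned short  keylen;
      short           typlen;
      bool            typbyval;
      signed char     category;
      int             nitems;
      char            data[];
  } NewGinTuple;

  int main(void)
  {
      printf("old header: %zu bytes, new header: %zu bytes\n",
             offsetof(OldGinTuple, data), offsetof(NewGinTuple, data));
      return 0;
  }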
---
 src/backend/access/gin/gininsert.c | 21 ++++++++++++---------
 src/include/access/gin_tuple.h     | 19 +++++++++++++++----
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 45901f0c03..5d5d793359 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1545,7 +1545,8 @@ GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 	 * when merging non-overlapping lists, e.g. in each parallel worker.
 	 */
 	if ((buffer->nitems > 0) &&
-		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+							GinTupleGetFirst(tup)) == 0))
 		buffer->nfrozen = buffer->nitems;
 
 	/*
@@ -1562,7 +1563,8 @@ GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
 	{
 		/* Is the TID after the first TID of the new tuple? Can't freeze. */
-		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+		if (ItemPointerCompare(&buffer->items[i],
+							   GinTupleGetFirst(tup)) > 0)
 			break;
 
 		buffer->nfrozen++;
@@ -2171,7 +2173,7 @@ _gin_build_tuple(GinBuildState *state,
 	 * alignment, to allow direct access to compressed segments (those require
 	 * SHORTALIGN, but we do MAXALIGN anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	/*
 	 * Allocate space for the whole GIN tuple.
@@ -2186,7 +2188,6 @@ _gin_build_tuple(GinBuildState *state,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
-	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -2217,7 +2218,7 @@ _gin_build_tuple(GinBuildState *state,
 	}
 
 	/* finally, copy the TIDs into the array */
-	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
 
 	/* copy in the compressed data, and free the segments */
 	dlist_foreach_modify(iter, &segments)
@@ -2287,8 +2288,8 @@ _gin_parse_tuple_items(GinTuple *a)
 	int			ndecoded;
 	ItemPointer items;
 
-	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	len = a->tuplen - SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
 
 	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
 
@@ -2350,8 +2351,10 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 								&ssup[a->attrnum - 1]);
 
 		/* if the key is the same, consider the first TID in the array */
-		return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
+		return (r != 0) ? r : ItemPointerCompare(GinTupleGetFirst(a),
+												 GinTupleGetFirst(b));
 	}
 
-	return ItemPointerCompare(&a->first, &b->first);
+	return ItemPointerCompare(GinTupleGetFirst(a),
+							  GinTupleGetFirst(b));
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 4ac8cfcc2b..f4dbdfd3f7 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -10,10 +10,12 @@
 #ifndef GIN_TUPLE_H
 #define GIN_TUPLE_H
 
+#include "access/ginblock.h"
 #include "storage/itemptr.h"
 #include "utils/sortsupport.h"
 
 /*
+ * XXX: Update description with new architecture
  * Each worker sees tuples in CTID order, so if we track the first TID and
  * compare that when combining results in the worker, we would not need to
  * do an expensive sort in workers (the mergesort is already smart about
@@ -24,17 +26,26 @@
  */
 typedef struct GinTuple
 {
-	Size		tuplen;			/* length of the whole tuple */
-	Size		keylen;			/* bytes in data for key value */
+	int			tuplen;			/* length of the whole tuple */
+	OffsetNumber attrnum;		/* attnum of index key */
+	uint16		keylen;			/* bytes in data for key value */
 	int16		typlen;			/* typlen for key */
 	bool		typbyval;		/* typbyval for key */
-	OffsetNumber attrnum;		/* attnum of index key */
 	signed char category;		/* category: normal or NULL? */
-	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
 
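+/*
+ * Return the first TID stored in the tuple, by peeking at the header of the
+ * first compressed posting list in the data array (which starts right after
+ * the key, at a SHORTALIGN'ed offset).
+ */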
+static inline ItemPointer
+GinTupleGetFirst(GinTuple *tup)
+{
+	GinPostingList *list;
+
+	list = (GinPostingList *) SHORTALIGN(tup->data + tup->keylen);
+
+	return &list->first;
+}
+
 typedef struct GinBuffer GinBuffer;
 
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
-- 
2.40.1

v20240705-0007-Detect-wrap-around-in-parallel-callback.patch (application/x-patch)
From 75b252aa2fb1058dd72580024c3c45313b732655 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 20 Jun 2024 20:50:51 +0200
Subject: [PATCH v20240705 7/9] Detect wrap around in parallel callback

When a sync scan during an index build wraps around, some keys may end up
with very long TID lists, requiring "full" merge sort runs when combining
data in workers. It also causes problems with enforcing the memory limit,
because we can't just dump the data - the index build requires append-only
posting lists, and violating that may result in errors like

  ERROR: could not split GIN page; all old items didn't fit

because after the scan wraps around, some of the TIDs may belong to the
beginning of the list, affecting the compression.

But we can deal with this in the callback - if we see the TID jump back,
a wraparound must have happened. In that case we simply dump all the data
accumulated in memory, and start from scratch.

This means there won't be any tuples with very wide TID ranges; instead,
there'll be one tuple with a range at the end of the table, and another
tuple at the beginning. And all the lists in the worker will be
non-overlapping, and sort nicely based on first TID.

For the leader, we still need to do the full merge - the lists may
overlap and interleave in various ways. But there should be only very
few of those lists, about one per worker, making it not an issue.
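
The detection itself is just a TID comparison in the build callback; the
shape of the check (a sketch matching the patch below):

  /* TIDs only grow during a single scan pass, so seeing a smaller TID
   * means the synchronized scan wrapped around to the start of the heap */
  if (ItemPointerCompare(tid, &buildstate->tid) < 0)
      ginFlushBuildState(buildstate, index);

  /* remember the TID we're about to process */
  buildstate->tid = *tid;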
---
 src/backend/access/gin/gininsert.c | 132 ++++++++++++++---------------
 1 file changed, 63 insertions(+), 69 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cc380f0359..4483eedcbe 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -143,6 +143,7 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
 
 	/* FIXME likely duplicate with indtuples */
 	double		bs_numtuples;
@@ -474,6 +475,47 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginFlushBuildState
+ *		Write all data from BuildAccumulator into the tuplesort.
+ */
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(buildstate, attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
 /*
  * ginBuildCallbackParallel
  *		Callback for the parallel index build.
@@ -498,6 +540,11 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
+ * To detect a wraparound (which can happen with sync scans), we remember the
+ * last TID seen by each worker - if the next TID seen by the worker is lower,
+ * the scan must have wrapped around. We handle that by flushing the current
+ * buildstate to the tuplesort, so that we don't end up with wide TID lists.
+ *
  * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
@@ -514,6 +561,16 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
 
+	/* scan wrapped around - flush accumulated entries and start anew */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+	{
+		elog(LOG, "calling ginFlushBuildState");
+		ginFlushBuildState(buildstate, index);
+	}
+
+	/* remember the TID we're about to process */
+	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
+
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
 		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
 							   values[i], isnull[i], tid);
@@ -532,40 +589,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 	 * maintenance command.
 	 */
 	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the index key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			/* GIN tuple and tuple length that we'll use for tuplesort */
-			GinTuple   *tup;
-			Size		tuplen;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(buildstate, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &tuplen);
-
-			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
-	}
+		ginFlushBuildState(buildstate, index);
 
 	MemoryContextSwitchTo(oldCtx);
 }
@@ -602,6 +626,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.bs_numtuples = 0;
 	buildstate.bs_reltuples = 0;
 	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
@@ -1231,8 +1256,8 @@ GinBufferInit(Relation index)
 	 * with too many TIDs. and 64kB seems more than enough. But maybe this
 	 * should be tied to maintenance_work_mem or something like that?
 	 *
-	 * XXX This is not enough to prevent repeated merges after a wraparound
-	 * of the parallel scan, but it should be enough to make the merges cheap
+	 * XXX This is not enough to prevent repeated merges after a wraparound of
+	 * the parallel scan, but it should be enough to make the merges cheap
 	 * because it quickly reaches the end of the second list and can just
 	 * memcpy the rest without walking it item by item.
 	 */
@@ -1964,39 +1989,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 									   ginBuildCallbackParallel, state, scan);
 
 	/* write remaining accumulated entries */
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&state->accum);
-		while ((list = ginGetBAEntry(&state->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			GinTuple   *tup;
-			Size		len;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(state, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &len);
-
-			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(state->tmpCtx);
-		ginInitBA(&state->accum);
-	}
+	ginFlushBuildState(state, index);
 
 	/*
 	 * Do the first phase of in-worker processing - sort the data produced by
@@ -2081,6 +2074,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	buildstate.indtuples = 0;
 	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/*
 	 * create a temporary memory context that is used to hold data not yet
-- 
2.40.1

v20240705-0008-Use-a-single-GIN-tuplesort.patch (application/x-patch)
From fde3e4d775fa815686f79a2131fd853e229f900c Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 5 Jul 2024 19:22:32 +0200
Subject: [PATCH v20240705 8/9] Use a single GIN tuplesort

The previous approach was to sort the data in a private per-worker sort,
read it back, merge the GinTuples, and write the result into the shared
sort used by the leader.

The new approach is to use a single sort, merging tuples as we write them to disk.
This reduces temporary disk space usage.

An optimization was added to GinBuffer in which we don't deserialize tuples unless
we need access to the itemIds.

This modifies Tuplesort to have a new flushwrites callback. A sort's writetup
can now decide to buffer writes until the next flushwrites() call.
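
Conceptually, the writetup/flushwrites pair behaves like this (a sketch of
the idea, not the patch's exact code; the buffer_* helpers are made up for
illustration):

  static void
  writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
                     SortTuple *stup)
  {
      GinTuple   *tup = (GinTuple *) stup->tuple;

      if (buffer_key_matches(tup))      /* same key as the buffered tuple? */
          buffer_merge_tids(tup);       /* just extend the TID list */
      else
      {
          buffer_write_out(tape);       /* emit the merged tuple */
          buffer_start_new_group(tup);  /* start buffering the new key */
      }
  }

  static void
  flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
  {
      buffer_write_out(tape);           /* emit the final buffered tuple */
  }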
---
 src/backend/access/gin/gininsert.c         | 427 +++++++++------------
 src/backend/utils/sort/tuplesort.c         |   5 +
 src/backend/utils/sort/tuplesortvariants.c | 102 ++++-
 src/include/access/gin_private.h           |   3 +
 src/include/access/gin_tuple.h             |  10 +
 src/include/utils/tuplesort.h              |  10 +-
 6 files changed, 307 insertions(+), 250 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 4483eedcbe..45901f0c03 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -162,14 +162,6 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
-
-	/*
-	 * The sortstate used only within a single worker for the first merge pass
-	 * happenning there. In principle it doesn't need to be part of the build
-	 * state and we could pass it around directly, but it's more convenient
-	 * this way. And it's part of the build state, after all.
-	 */
-	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -194,8 +186,7 @@ static Datum _gin_parse_tuple_key(GinTuple *a);
 static GinTuple *_gin_build_tuple(GinBuildState *state,
 								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
-								  ItemPointerData *items, uint32 nitems,
-								  Size *len);
+								  ItemPointerData *items, uint32 nitems);
 
 /*
  * Adds array of item pointers to tuple's posting list, or
@@ -498,16 +489,15 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
-		Size		tuplen;
 
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
 		tup = _gin_build_tuple(buildstate, attnum, category,
 							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+		tuplesort_putgintuple(buildstate->bs_sortstate, tup);
 
 		pfree(tup);
 	}
@@ -1168,8 +1158,14 @@ _gin_parallel_heapscan(GinBuildState *state)
  * synchronized (and thus may wrap around), and when combining values from
  * multiple workers.
  */
-typedef struct GinBuffer
+struct GinBuffer
 {
+	/*
+	 * The memory context holds the dynamic allocation of items, key, and any
+	 * produced GinTuples.
+	 */
+	MemoryContext context;
+	GinTuple   *cached; /* copy of previous GIN tuple */
 	OffsetNumber attnum;
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
@@ -1187,7 +1183,7 @@ typedef struct GinBuffer
 	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
-} GinBuffer;
+};
 
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
@@ -1202,8 +1198,7 @@ AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
+	Assert(buffer->nitems == 0 || buffer->items != NULL);
 
 	for (int i = 0; i < buffer->nitems; i++)
 	{
@@ -1223,7 +1218,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* if we have any items, the array must exist */
-	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+	Assert((buffer->nitems == 0) || (buffer->items != NULL));
 
 	/*
 	 * we don't know if the TID array is expected to be sorted or not
@@ -1243,7 +1238,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
  *
  * Initializes sort support procedures for all index attributes.
  */
-static GinBuffer *
+GinBuffer *
 GinBufferInit(Relation index)
 {
 	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
@@ -1289,15 +1284,18 @@ GinBufferInit(Relation index)
 
 		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
 	}
+	buffer->context = GenerationContextCreate(CurrentMemoryContext,
+											  "Gin Buffer",
+											  ALLOCSET_DEFAULT_SIZES);
 
 	return buffer;
 }
 
 /* Is the buffer empty, i.e. has no TID values in the array? */
-static bool
+bool
 GinBufferIsEmpty(GinBuffer *buffer)
 {
-	return (buffer->nitems == 0);
+	return (buffer->nitems == 0 && buffer->cached == NULL);
 }
 
 /*
@@ -1309,37 +1307,71 @@ GinBufferIsEmpty(GinBuffer *buffer)
 static bool
 GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	int			r;
+	AttrNumber	attnum;
 	Datum		tupkey;
+	Datum		bufkey;
 
 	AssertCheckGinBuffer(buffer);
+	if (buffer->cached)
+	{
+		GinTuple   *cached = buffer->cached;
 
-	if (tup->attrnum != buffer->attnum)
-		return false;
+		if (tup->attrnum != cached->attrnum)
+			return false;
 
-	/* same attribute should have the same type info */
-	Assert(tup->typbyval == buffer->typbyval);
-	Assert(tup->typlen == buffer->typlen);
+		Assert(tup->typbyval == cached->typbyval);
+		Assert(tup->typlen == cached->typlen);
 
-	if (tup->category != buffer->category)
-		return false;
+		if (tup->category != cached->category)
+			return false;
 
-	/*
-	 * For NULL/empty keys, this means equality, for normal keys we need to
-	 * compare the actual key value.
-	 */
-	if (buffer->category != GIN_CAT_NORM_KEY)
-		return true;
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need to
+		 * compare the actual key value.
+		 */
+		if (cached->category != GIN_CAT_NORM_KEY)
+			return true;
+
+		attnum = cached->attrnum;
+		bufkey = _gin_parse_tuple_key(cached);
+	}
+	else
+	{
+		if (tup->attrnum != buffer->attnum)
+			return false;
+
+		/* same attribute should have the same type info */
+		Assert(tup->typbyval == buffer->typbyval);
+		Assert(tup->typlen == buffer->typlen);
+
+		if (tup->category != buffer->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need to
+		 * compare the actual key value.
+		 */
+		if (buffer->category != GIN_CAT_NORM_KEY)
+			return true;
+		attnum = buffer->attnum;
+		bufkey = buffer->key;
+	}
 
 	/*
 	 * For the tuple, get either the first sizeof(Datum) bytes for byval
 	 * types, or a pointer to the beginning of the data array.
 	 */
-	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+	tupkey = _gin_parse_tuple_key(tup);
+
+	prev = MemoryContextSwitchTo(buffer->context);
 
-	r = ApplySortComparator(buffer->key, false,
+	r = ApplySortComparator(bufkey, false,
 							tupkey, false,
-							&buffer->ssup[buffer->attnum - 1]);
+							&buffer->ssup[attnum - 1]);
+
+	MemoryContextSwitchTo(prev);
 
 	return (r == 0);
 }
@@ -1392,6 +1424,55 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 	return true;
 }
 
+static void
+GinBufferUnpackCached(GinBuffer *buffer, int reserve_space)
+{
+	Datum		key;
+	ItemPointer	items;
+	GinTuple   *cached;
+	int			totitems;
+
+	cached = buffer->cached;
+	totitems = cached->nitems + reserve_space;
+	key = _gin_parse_tuple_key(cached);
+
+	buffer->category = cached->category;
+	buffer->keylen = cached->keylen;
+	buffer->attnum = cached->attrnum;
+
+	buffer->typlen = cached->typlen;
+	buffer->typbyval = cached->typbyval;
+
+	if (cached->category == GIN_CAT_NORM_KEY)
+		buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+	else
+		buffer->key = (Datum) 0;
+
+	items = _gin_parse_tuple_items(cached);
+
+	if (buffer->items == NULL)
+	{
+		buffer->items = palloc0(totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else if (buffer->maxitems < totitems)
+	{
+		buffer->items = repalloc(buffer->items,
+								 totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else
+	{
+		Assert(PointerIsValid(buffer->items) &&
+			   buffer->maxitems >= totitems);
+	}
+	memcpy(buffer->items, items, cached->nitems * sizeof(ItemPointerData));
+	buffer->nitems = cached->nitems;
+
+	buffer->cached = NULL;
+	pfree(cached);
+	pfree(items);
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1426,32 +1507,28 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * as that does palloc internally, but if we detected the append case,
  * we could do without it. Not sure how much overhead it is, though.
  */
-static void
-GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+void
+GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	ItemPointerData *items;
-	Datum		key;
 
+	prev = MemoryContextSwitchTo(buffer->context);
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple_key(tup);
-	items = _gin_parse_tuple_items(tup);
-
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
 	{
-		buffer->category = tup->category;
-		buffer->keylen = tup->keylen;
-		buffer->attnum = tup->attrnum;
-
-		buffer->typlen = tup->typlen;
-		buffer->typbyval = tup->typbyval;
-
-		if (tup->category == GIN_CAT_NORM_KEY)
-			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
-		else
-			buffer->key = (Datum) 0;
+		GinTuple   *tuple = palloc(tup->tuplen);
+		memcpy(tuple, tup, tup->tuplen);
+		buffer->cached = tuple;
 	}
+	else if (buffer->cached != NULL)
+	{
+		GinBufferUnpackCached(buffer, tup->nitems);
+	}
+
+	items = _gin_parse_tuple_items(tup);
 
 	/*
 	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
@@ -1525,6 +1602,33 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	/* free the decompressed TID list */
 	pfree(items);
+
+	MemoryContextSwitchTo(prev);
+}
+
+GinTuple *
+GinBufferBuildTuple(GinBuffer *buffer)
+{
+	MemoryContext prev = MemoryContextSwitchTo(buffer->context);
+	GinTuple *result;
+
+	if (buffer->cached)
+	{
+		result = buffer->cached;
+		buffer->cached = NULL;
+	}
+	else
+	{
+		result = _gin_build_tuple(NULL, buffer->attnum, buffer->category,
+								  buffer->key, buffer->typlen,
+								  buffer->typbyval, buffer->items,
+								  buffer->nitems);
+	}
+
+	GinBufferReset(buffer);
+
+	MemoryContextSwitchTo(prev);
+	return result;
 }
 
 /*
@@ -1538,14 +1642,21 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
  *
  * XXX Might be better to have a separate memory context for the buffer.
  */
-static void
+void
 GinBufferReset(GinBuffer *buffer)
 {
 	Assert(!GinBufferIsEmpty(buffer));
 
-	/* release byref values, do nothing for by-val ones */
-	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
-		pfree(DatumGetPointer(buffer->key));
+	/* release cached buffer tuple, if present */
+	if (buffer->cached)
+		pfree(buffer->cached);
+	else
+	{
+		/* release byref values, do nothing for by-val ones */
+		if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval
+			&& PointerIsValid(DatumGetPointer(buffer->key)))
+			pfree(DatumGetPointer(buffer->key));
+	}
 
 	/*
 	 * Not required, but makes it more likely to trigger NULL dereference if
@@ -1561,6 +1672,7 @@ GinBufferReset(GinBuffer *buffer)
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
+	/* Note that we don't reset the memory context; this is deliberate */
 }
 
 /*
@@ -1584,7 +1696,7 @@ GinBufferTrim(GinBuffer *buffer)
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
  */
-static void
+void
 GinBufferFree(GinBuffer *buffer)
 {
 	if (buffer->items)
@@ -1595,6 +1707,7 @@ GinBufferFree(GinBuffer *buffer)
 		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
 		pfree(DatumGetPointer(buffer->key));
 
+	MemoryContextDelete(buffer->context);
 	pfree(buffer);
 }
 
@@ -1608,7 +1721,7 @@ GinBufferFree(GinBuffer *buffer)
 * the TID array, and returning false if it's too large (more than work_mem,
  * for example).
  */
-static bool
+bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 {
 	/* empty buffer can accept data for any key */
@@ -1685,6 +1798,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1713,6 +1827,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1726,7 +1841,10 @@ _gin_parallel_merge(GinBuildState *state)
 		 * Remember data for the current tuple (either remember the new key,
 	 * or append it to the existing data).
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
@@ -1734,6 +1852,7 @@ _gin_parallel_merge(GinBuildState *state)
 	{
 		AssertCheckItemPointers(buffer, true);
 
+		Assert(!PointerIsValid(buffer->cached));
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
 					   buffer->items, buffer->nitems, &state->buildStats);
@@ -1785,162 +1904,6 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
-/*
- * _gin_process_worker_data
- *		First phase of the key merging, happening in the worker.
- *
- * Depending on the number of distinct keys, the TID lists produced by the
- * callback may be very short (due to frequent evictions in the callback).
- * But combining many tiny lists is expensive, so we try to do as much as
- * possible in the workers and only then pass the results to the leader.
- *
- * We read the tuples sorted by the key, and merge them into larger lists.
- * At the moment there's no memory limit, so this will just produce one
- * huge (sorted) list per key in each worker. Which means the leader will
- * do a very limited number of mergesorts, which is good.
- */
-static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-
-	GinBuffer  *buffer;
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The workers are limited to the same amount of memory as during the sort
-	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
-	 * during planning, just like there.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/* sort the raw per-worker data */
-	tuplesort_performsort(state->bs_worker_sort);
-
-	/* print some basic info */
-	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
-		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
-		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
-
-	/* reset before the second phase */
-	state->buildStats.sizeCompressed = 0;
-	state->buildStats.sizeRaw = 0;
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
-	 * merge them into larger chunks for the leader to combine.
-	 */
-	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
-	{
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-
-			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			Assert(buffer->nfrozen > 0);
-
-			state->buildStats.nTrims++;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-
-			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferStoreTuple(buffer, tup);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		GinTuple   *ntup;
-		Size		ntuplen;
-
-		AssertCheckItemPointers(buffer, true);
-
-		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems, &ntuplen);
-
-		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-		pfree(ntup);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	/* print some basic info */
-	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
-		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
-		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
-
-	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
-
-	tuplesort_end(worker_sort);
-}
-
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1973,11 +1936,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													sortmem, coordinate,
 													TUPLESORT_NONE);
 
-	/* Local per-worker sort of raw-data */
-	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
-													  sortmem, NULL,
-													  TUPLESORT_NONE);
-
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1991,13 +1949,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* write remaining accumulated entries */
 	ginFlushBuildState(state, index);
 
-	/*
-	 * Do the first phase of in-worker processing - sort the data produced by
-	 * the callback, and combine them into much larger chunks and place that
-	 * into the shared tuplestore for leader to process.
-	 */
-	_gin_process_worker_data(state, state->bs_worker_sort);
-
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
@@ -2154,8 +2105,7 @@ static GinTuple *
 _gin_build_tuple(GinBuildState *state,
 				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
-				 ItemPointerData *items, uint32 nitems,
-				 Size *len)
+				 ItemPointerData *items, uint32 nitems)
 {
 	GinTuple   *tuple;
 	char	   *ptr;
@@ -2223,8 +2173,6 @@ _gin_build_tuple(GinBuildState *state,
 	 */
 	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
-	*len = tuplen;
-
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
@@ -2286,12 +2234,15 @@ _gin_build_tuple(GinBuildState *state,
 		pfree(seginfo);
 	}
 
-	/* how large would the tuple be without compression? */
-	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		nitems * sizeof(ItemPointerData);
+	if (state)
+	{
+		/* how large would the tuple be without compression? */
+		state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+									 nitems * sizeof(ItemPointerData);
 
-	/* compressed size */
-	state->buildStats.sizeCompressed += tuplen;
+		/* compressed size */
+		state->buildStats.sizeCompressed += tuplen;
+	}
 
 	return tuple;
 }
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 7c4d6dc106..6006085717 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -398,6 +398,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2276,6 +2277,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2405,6 +2408,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index ed6084960b..adbd48f009 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -31,6 +31,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "access/gin.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -89,6 +90,7 @@ static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
 static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
 							   SortTuple *stup);
+static void flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape);
 static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -100,6 +102,7 @@ static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static void freestate_cluster(Tuplesortstate *state);
+static void freestate_index_gin(Tuplesortstate *state);
 
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the CLUSTER case.  Set by
@@ -134,6 +137,16 @@ typedef struct
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
 } TuplesortIndexBTreeArg;
 
+/*
+ * Data structure pointed by "TuplesortPublic.arg" for the index_gin subcase.
+ */
+typedef struct
+{
+	TuplesortIndexArg index;
+	GinBuffer *buffer;
+} TuplesortIndexGinArg;
+
+
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the index_hash subcase.
  */
@@ -210,6 +223,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -288,6 +302,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -398,6 +413,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -479,6 +495,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -525,6 +542,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -582,6 +600,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -601,6 +620,7 @@ tuplesort_begin_index_gin(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   sortopt);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg;
 	MemoryContext oldcontext;
 	int			i;
 	TupleDesc	desc = RelationGetDescr(indexRel);
@@ -625,6 +645,10 @@ tuplesort_begin_index_gin(Relation heapRel,
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
 										   sizeof(SortSupportData));
+	arg = palloc0(sizeof(TuplesortIndexGinArg));
+	arg->index.indexRel = indexRel;
+	arg->index.heapRel = heapRel;
+	arg->buffer = GinBufferInit(indexRel);
 
 	for (i = 0; i < base->nKeys; i++)
 	{
@@ -653,9 +677,11 @@ tuplesort_begin_index_gin(Relation heapRel,
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
 	base->writetup = writetup_index_gin;
+	base->flushwrites = flushwrites_index_gin;
 	base->readtup = readtup_index_gin;
+	base->freestate = freestate_index_gin;
 	base->haveDatum1 = false;
-	base->arg = NULL;
+	base->arg = arg;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -698,6 +724,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -900,17 +927,17 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 }
 
 void
-tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple)
 {
 	SortTuple	stup;
 	GinTuple   *ctup;
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
-	Size		tuplen;
+	Size		tuplen = tuple->tuplen;
 
 	/* copy the GinTuple into the right memory context */
-	ctup = palloc(size);
-	memcpy(ctup, tuple, size);
+	ctup = palloc(tuplen);
+	memcpy(ctup, tuple, tuplen);
 
 	stup.tuple = ctup;
 	stup.datum1 = (Datum) 0;
@@ -918,7 +945,7 @@ tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
 
 	/* GetMemoryChunkSpace is not supported for bump contexts */
 	if (TupleSortUseBumpTupleCxt(base->sortopt))
-		tuplen = MAXALIGN(size);
+		tuplen = MAXALIGN(tuplen);
 	else
 		tuplen = GetMemoryChunkSpace(ctup);
 
@@ -1938,19 +1965,61 @@ comparetup_index_gin(const SortTuple *a, const SortTuple *b,
 }
 
 static void
-writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+_writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, GinTuple *tup)
 {
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
-	GinTuple   *tuple = (GinTuple *) stup->tuple;
-	unsigned int tuplen = tuple->tuplen;
-
+	unsigned int tuplen = tup->tuplen;
 	tuplen = tuplen + sizeof(tuplen);
+
 	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
-	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	LogicalTapeWrite(tape, tup, tup->tuplen);
+
 	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 }
 
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *otup;
+	GinTuple   *ntup = (GinTuple *) stup->tuple;
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(PointerIsValid(arg));
+
+	if (GinBufferCanAddKey(arg->buffer, ntup))
+	{
+		GinBufferMergeTuple(arg->buffer, ntup);
+		return;
+	}
+
+	otup = GinBufferBuildTuple(arg->buffer);
+
+	_writetup_index_gin(state, tape, otup);
+
+	pfree(otup);
+
+	Assert(GinBufferCanAddKey(arg->buffer, ntup));
+
+	GinBufferMergeTuple(arg->buffer, ntup);
+}
+
+static void
+flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	if (!GinBufferIsEmpty(arg->buffer))
+	{
+		GinTuple *tuple = GinBufferBuildTuple(arg->buffer);
+		_writetup_index_gin(state, tape, tuple);
+		pfree(tuple);
+		Assert(GinBufferIsEmpty(arg->buffer));
+	}
+}
+
 static void
 readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 				  LogicalTape *tape, unsigned int len)
@@ -1976,6 +2045,17 @@ readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = (Datum) 0;
 }
 
+static void
+freestate_index_gin(Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(arg != NULL);
+	Assert(GinBufferIsEmpty(arg->buffer));
+	GinBufferFree(arg->buffer);
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 3013a44bae..149191b7df 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -475,6 +475,9 @@ extern int	ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len, TI
 
 extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *segment, int len,
 												   int *ndecoded_out);
+extern bool ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+												ItemPointer into, int capacity,
+												int *ndecoded_out);
 extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded_out);
 extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
 										ItemPointerData *b, uint32 nb,
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 55dd8544b2..4ac8cfcc2b 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -35,6 +35,16 @@ typedef struct GinTuple
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
 
+typedef struct GinBuffer GinBuffer;
+
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
 
+extern GinBuffer *GinBufferInit(Relation index);
+extern bool GinBufferIsEmpty(GinBuffer *buffer);
+extern bool GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup);
+extern void GinBufferReset(GinBuffer *buffer);
+extern void GinBufferFree(GinBuffer *buffer);
+extern void GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup);
+extern GinTuple *GinBufferBuildTuple(GinBuffer *buffer);
+
 #endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 0ed71ae922..6c56e40bff 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -194,6 +194,14 @@ typedef struct
 	 */
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient
+	 * use of the tape's resources, e.g. when deduplicating or merging
+	 * values.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
@@ -461,7 +469,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
-extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, struct GinTuple *tuple);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
-- 
2.40.1

#24Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Matthias van de Meent (#22)
Re: Parallel CREATE INDEX for GIN indexes

On 7/3/24 20:36, Matthias van de Meent wrote:

On Mon, 24 Jun 2024 at 02:58, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Here's a bit more cleaned up version, clarifying a lot of comments,
removing a bunch of obsolete comments, or comments speculating about
possible solutions, that sort of thing. I've also removed a couple more
XXX comments, etc.

The main change however is that the sorting no longer relies on memcmp()
to compare the values. I did that because it was enough for the initial
WIP patches, and it worked till now - but the comments explained this
may not be a good idea if the data type allows the same value to have
multiple binary representations, or something like that.

I don't have a practical example to show an issue, but I guess if using
memcmp() was safe we'd be doing it in a bunch of places already, and
AFAIK we're not. And even if it happened to be OK, this is probably
not the place to start doing it.

I think one such example would be the values '5.00'::jsonb and
'5'::jsonb when indexed using GIN's jsonb_ops, though I'm not sure if
they're treated as having the same value inside the opclass' ordering.

Yeah, possibly. But doing the comparison the "proper" way seems to be
working pretty well, so I don't plan to investigate this further.

So I've switched this to use the regular data-type comparisons, with
SortSupport etc. There's a bit more cleanup remaining and testing
needed, but I'm not aware of any bugs.

A review of patch 0001:

---

src/backend/access/gin/gininsert.c | 1449 +++++++++++++++++++-

The nbtree code has `nbtsort.c` for its sort- and (parallel) build
state handling, which is exclusively used during index creation. As
the changes here seem to be largely related to bulk insertion, how
much effort would it be to split the bulk insertion code path into a
separate file?

Hmmm. I haven't tried doing that, but I guess it's doable. I assume we'd
want to do the move first, because it involves pre-existing code, and
then do the patch on top of that.

But what would be the benefit of doing that? I'm not sure doing it just
to make it look more like btree code is really worth it. Do you expect
the result to be clearer?

I noticed that new fields in GinBuildState do get to have a
bs_*-prefix, but none of the other added or previous fields of the
modified structs in gininsert.c have such prefixes. Could this be
unified?

Yeah, these are inconsistencies from copying the infrastructure code to
make the parallel builds work, etc.

+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED            UINT64CONST(0xB000000000000001)
...

These overlap with BRIN's keys; can we make them unique while we're at it?

We could, and I recall we had a similar discussion in the parallel BRIN
thread, right? But I'm somewhat unsure why we would actually want/need
these keys to be unique. Surely, we don't need to mix those keys in the
single shm segment, right? So it seems more like an aesthetic thing. Or
is there some policy to have unique values for these keys?

+ * mutex protects all fields before heapdesc.

I can't find the field that this `heapdesc` might refer to.

Yeah, likely a leftover from copying a bunch of code and then removing
it without updating the comment. Will fix.

+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
...
+     if (!isconcurrent)
+        snapshot = SnapshotAny;
+    else
+        snapshot = RegisterSnapshot(GetTransactionSnapshot());

grumble: I know this is required from the index with the current APIs,
but I'm kind of annoyed that each index AM has to construct the table
scan and snapshot in their own code. I mean, this shouldn't be
meaningfully different across AMs, so every AM implementing this same
code makes me feel like we've got the wrong abstraction.

I'm not asking you to change this, but it's one more case where I'm
annoyed by the state of the system, but not quite enough yet to change
it.

Yeah, it's not great, but not something I intend to rework.

---

+++ b/src/backend/utils/sort/tuplesortvariants.c

I was thinking some more about merging tuples inside the tuplesort. I
realized that this could be implemented by allowing buffering of tuple
writes in writetup. This would require adding a flush operation at the
end of mergeonerun to store the final unflushed tuple on the tape, but
that shouldn't be too expensive. This buffering, when implemented
through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
merge the TID lists of same-valued GIN tuples while they're getting
stored and re-sorted, thus reducing the temporary space usage of the
tuplesort by some amount with limited overhead for other
non-deduplicating tuplesorts.

I've not yet spent the time to get this to work though, but I'm fairly
sure it'd use less temporary space than the current approach with the
2 tuplesorts, and could have lower overall CPU overhead as well
because the number of sortable items gets reduced much earlier in the
process.

Will respond to your later message about this.

---

+++ b/src/include/access/gin_tuple.h
+ typedef struct GinTuple

I think this needs some more care: currently, each GinTuple is at
least 36 bytes in size on 64-bit systems. By using int instead of Size
(no normal indexable tuple can be larger than MaxAllocSize), and
packing the fields better we can shave off 10 bytes; or 12 bytes if
GinTuple.keylen is further adjusted to (u)int16: a key needs to fit on
a page, so we can probably safely assume that the key size fits in
(u)int16.
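
For illustration, a minimal sketch of the packed layout this suggests
(the field names follow the patch, but the exact arrangement and the
narrower types here are assumptions, not the patch itself):

typedef struct GinTuple
{
	int			tuplen;		/* length of the whole tuple; int is enough,
							 * as nothing can exceed MaxAllocSize */
	OffsetNumber attrnum;	/* index attribute number */
	uint16		keylen;		/* key length; a key has to fit on a page */
	int16		typlen;		/* typlen of the key type */
	bool		typbyval;	/* typbyval of the key type */
	signed char category;	/* GIN_CAT_NORM_KEY etc. */
	int			nitems;		/* number of TIDs in data */
	char		data[FLEXIBLE_ARRAY_MEMBER];
} GinTuple;

That packs the fixed-size header considerably tighter than the current
Size-based layout.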

Yeah, I guess using int64 is a bit excessive - you're right about that.
I'm not sure this is necessarily about "indexable tuples" (GinTuple is
not indexed, it's more an intermediate representation). But if we can
make it smaller, that probably can't hurt.

I don't have a great intuition on how beneficial this might be. For
cases with many TIDs per index key, it probably won't matter much. But
if there's many keys (so that GinTuples store only very few TIDs), it
might make a difference.

I'll try to measure the impact on the same "realistic" cases I used for
the earlier steps.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#25Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Matthias van de Meent (#23)
Re: Parallel CREATE INDEX for GIN indexes

On 7/5/24 21:45, Matthias van de Meent wrote:

On Wed, 3 Jul 2024 at 20:36, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Mon, 24 Jun 2024 at 02:58, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

So I've switched this to use the regular data-type comparisons, with
SortSupport etc. There's a bit more cleanup remaining and testing
needed, but I'm not aware of any bugs.

I've hit assertion failures in my testing of the combined patches, in
AssertCheckItemPointers: it assumes it's never called when the buffer
is empty and uninitialized, but that's wrong: we don't initialize the
items array until the first tuple, which will cause the assertion to
fire. By updating the first 2 assertions of AssertCheckItemPointers, I
could get it working.

Yeah, sorry. I think I broke this assert while doing the recent
cleanups. The assert used to be called only when doing a sort, in
which case the buffer certainly can't be empty. But then I moved it to
be called from the
generic GinTuple assert function, and that broke this assumption.

---

+++ b/src/backend/utils/sort/tuplesortvariants.c

I was thinking some more about merging tuples inside the tuplesort. I
realized that this could be implemented by allowing buffering of tuple
writes in writetup. This would require adding a flush operation at the
end of mergeonerun to store the final unflushed tuple on the tape, but
that shouldn't be too expensive. This buffering, when implemented
through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
merge the TID lists of same-valued GIN tuples while they're getting
stored and re-sorted, thus reducing the temporary space usage of the
tuplesort by some amount with limited overhead for other
non-deduplicating tuplesorts.

I've not yet spent the time to get this to work though, but I'm fairly
sure it'd use less temporary space than the current approach with the
2 tuplesorts, and could have lower overall CPU overhead as well
because the number of sortable items gets reduced much earlier in the
process.

I've now spent some time on this. Attached the original patchset, plus
2 incremental patches, the first of which implement the above design
(patch no. 8).

Local tests show it's significantly faster: for the below test case
I've seen reindex time reduced from 777455ms to 626217ms, or ~20%
improvement.

After applying the 'reduce the size of GinTuple' patch, index creation
time is down to 551514ms, or about 29% faster total. This all was
tested with a fresh stock postgres configuration.

"""
CREATE UNLOGGED TABLE testdata
AS SELECT sha256(i::text::bytea)::text
FROM generate_series(1, 15000000) i;
CREATE INDEX ON testdata USING gin (sha256 gin_trgm_ops);
"""

Those results look really promising. I certainly would not have expected
such improvements - 20-30% speedup on top of the already parallel run
seems great. I'll do a bit more testing to see how much this helps for
the "more realistic" data sets.

I can't say I 100% understand how the new stuff in tuplesortvariants.c
works, but that's mostly because my knowledge of tuplesort internals is
fairly limited (and I managed to survive without that until now).

What makes me a bit unsure/uneasy is that this seems to "inject" custom
code fairly deep into the tuplesort logic. I'm not quite sure if this is
a good idea ...

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#26Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#24)
Re: Parallel CREATE INDEX for GIN indexes

On Sun, 7 Jul 2024, 18:11 Tomas Vondra, <tomas.vondra@enterprisedb.com> wrote:

On 7/3/24 20:36, Matthias van de Meent wrote:

On Mon, 24 Jun 2024 at 02:58, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

So I've switched this to use the regular data-type comparisons, with
SortSupport etc. There's a bit more cleanup remaining and testing
needed, but I'm not aware of any bugs.

A review of patch 0001:

---

src/backend/access/gin/gininsert.c | 1449 +++++++++++++++++++-

The nbtree code has `nbtsort.c` for its sort- and (parallel) build
state handling, which is exclusively used during index creation. As
the changes here seem to be largely related to bulk insertion, how
much effort would it be to split the bulk insertion code path into a
separate file?

Hmmm. I haven't tried doing that, but I guess it's doable. I assume we'd
want to do the move first, because it involves pre-existing code, and
then do the patch on top of that.

But what would be the benefit of doing that? I'm not sure doing it just
to make it look more like btree code is really worth it. Do you expect
the result to be clearer?

It was mostly a consideration of file size and separation of concerns.
The sorted build path is quite different from the unsorted build,
after all.

I noticed that new fields in GinBuildState do get to have a
bs_*-prefix, but none of the other added or previous fields of the
modified structs in gininsert.c have such prefixes. Could this be
unified?

Yeah, these are inconsistencies from copying the infrastructure code to
make the parallel builds work, etc.

+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED            UINT64CONST(0xB000000000000001)
...

These overlap with BRIN's keys; can we make them unique while we're at it?

We could, and I recall we had a similar discussion in the parallel BRIN
thread, right?. But I'm somewhat unsure why would we actually want/need
these keys to be unique. Surely, we don't need to mix those keys in the
single shm segment, right? So it seems more like an aesthetic thing. Or
is there some policy to have unique values for these keys?

Uniqueness would be mostly useful for debugging shared memory issues,
but indeed, in a correctly working system we wouldn't have to worry
about parallel state key type confusion.

---

+++ b/src/include/access/gin_tuple.h
+ typedef struct GinTuple

I think this needs some more care: currently, each GinTuple is at
least 36 bytes in size on 64-bit systems. By using int instead of Size
(no normal indexable tuple can be larger than MaxAllocSize), and
packing the fields better we can shave off 10 bytes; or 12 bytes if
GinTuple.keylen is further adjusted to (u)int16: a key needs to fit on
a page, so we can probably safely assume that the key size fits in
(u)int16.

Yeah, I guess using int64 is a bit excessive - you're right about that.
I'm not sure this is necessarily about "indexable tuples" (GinTuple is
not indexed, it's more an intermediate representation).

Yes, but even if the GinTuple itself isn't stored on disk in the
index, a GinTuple's key *is* part of the primary GIN btree's keys
somewhere down the line, and thus must fit on a page somewhere. That's
what I was referring to.

But if we can make it smaller, that probably can't hurt.

I don't have a great intuition on how beneficial this might be. For
cases with many TIDs per index key, it probably won't matter much. But
if there's many keys (so that GinTuples store only very few TIDs), it
might make a difference.

This will indeed matter most when small TID lists are common. I
suspect it's not uncommon to find ourselves in such a situation when
users have a low maintenance_work_mem (and thus don't have much space
to buffer and combine index tuples before they're flushed), or when
the generated index keys can't be stored in the available memory (such
as in my example; I didn't tune m_w_m at all, and the table I tested
had ~15GB of data).

I'll try to measure the impact on the same "realistic" cases I used for
the earlier steps.

That would be greatly appreciated, thanks!

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#27Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#25)
Re: Parallel CREATE INDEX for GIN indexes

On Sun, 7 Jul 2024, 18:26 Tomas Vondra, <tomas.vondra@enterprisedb.com> wrote:

On 7/5/24 21:45, Matthias van de Meent wrote:

On Wed, 3 Jul 2024 at 20:36, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Mon, 24 Jun 2024 at 02:58, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

So I've switched this to use the regular data-type comparisons, with
SortSupport etc. There's a bit more cleanup remaining and testing
needed, but I'm not aware of any bugs.

---

+++ b/src/backend/utils/sort/tuplesortvariants.c

I was thinking some more about merging tuples inside the tuplesort. I
realized that this could be implemented by allowing buffering of tuple
writes in writetup. This would require adding a flush operation at the
end of mergeonerun to store the final unflushed tuple on the tape, but
that shouldn't be too expensive. This buffering, when implemented
through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
merge the TID lists of same-valued GIN tuples while they're getting
stored and re-sorted, thus reducing the temporary space usage of the
tuplesort by some amount with limited overhead for other
non-deduplicating tuplesorts.

I've not yet spent the time to get this to work though, but I'm fairly
sure it'd use less temporary space than the current approach with the
2 tuplesorts, and could have lower overall CPU overhead as well
because the number of sortable items gets reduced much earlier in the
process.

I've now spent some time on this. Attached the original patchset, plus
2 incremental patches, the first of which implement the above design
(patch no. 8).

Local tests show it's significantly faster: for the below test case
I've seen reindex time reduced from 777455ms to 626217ms, or ~20%
improvement.

After applying the 'reduce the size of GinTuple' patch, index creation
time is down to 551514ms, or about 29% faster total. This all was
tested with a fresh stock postgres configuration.

"""
CREATE UNLOGGED TABLE testdata
AS SELECT sha256(i::text::bytea)::text
FROM generate_series(1, 15000000) i;
CREATE INDEX ON testdata USING gin (sha256 gin_trgm_ops);
"""

Those results look really promising. I certainly would not have expected
such improvements - 20-30% speedup on top of the already parallel run
seems great. I'll do a bit more testing to see how much this helps for
the "more realistic" data sets.

I can't say I 100% understand how the new stuff in tuplesortvariants.c
works, but that's mostly because my knowledge of tuplesort internals is
fairly limited (and I managed to survive without that until now).

What makes me a bit unsure/uneasy is that this seems to "inject" custom
code fairly deep into the tuplesort logic. I'm not quite sure if this is
a good idea ...

I thought it was still fairly high-level: it adds (what seems to me) a
natural extension to the pre-existing "write a tuple to the tape" API,
allowing the Tuplesort (TS) implementation to further optimize its
ordered tape writes through buffering (and combining) of tuple writes.
While it does remove the current 1:1 relation of TS tape writes to
tape reads for the GIN case, there is AFAIK no code in TS that relies
on that 1:1 relation.

As to the GIN TS code itself; yes it's more complicated, mainly
because it uses several optimizations to reduce unnecessary
allocations and (de)serializations of GinTuples, and I'm aware of even
more such optimizations that can be added at some point.

As an example: I suspect the use of GinBuffer in writetup_index_gin to
be a significant resource drain in my patch because it lacks
"freezing" in the tuplesort buffer. When no duplicate key values are
encountered, the code is nearly optimal (except for a full tuple copy
to get the data into the GinBuffer's memory context), but if more than
one GinTuple has the same key in the merge phase we deserialize both
tuple's posting lists and merge the two. I suspect that merge to be
more expensive than operating on the compressed posting lists of the
GinTuples themselves, so that's something I think could be improved. I
suspect/guess it could save another 10% in select cases, and will
definitely reduce the memory footprint of the buffer.
Another thing that can be optimized is the current approach of
inserting data into the index: I think it's kind of wasteful to
decompress and later re-compress the posting lists once we start
storing the tuples on disk.

Kind regards,

Matthias van de Meent

#28Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Matthias van de Meent (#27)
Re: Parallel CREATE INDEX for GIN indexes

On 7/8/24 11:45, Matthias van de Meent wrote:

On Sun, 7 Jul 2024, 18:26 Tomas Vondra, <tomas.vondra@enterprisedb.com> wrote:

On 7/5/24 21:45, Matthias van de Meent wrote:

On Wed, 3 Jul 2024 at 20:36, Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Mon, 24 Jun 2024 at 02:58, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

So I've switched this to use the regular data-type comparisons, with
SortSupport etc. There's a bit more cleanup remaining and testing
needed, but I'm not aware of any bugs.

---

+++ b/src/backend/utils/sort/tuplesortvariants.c

I was thinking some more about merging tuples inside the tuplesort. I
realized that this could be implemented by allowing buffering of tuple
writes in writetup. This would require adding a flush operation at the
end of mergeonerun to store the final unflushed tuple on the tape, but
that shouldn't be too expensive. This buffering, when implemented
through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
merge the TID lists of same-valued GIN tuples while they're getting
stored and re-sorted, thus reducing the temporary space usage of the
tuplesort by some amount with limited overhead for other
non-deduplicating tuplesorts.

I've not yet spent the time to get this to work though, but I'm fairly
sure it'd use less temporary space than the current approach with the
2 tuplesorts, and could have lower overall CPU overhead as well
because the number of sortable items gets reduced much earlier in the
process.

I've now spent some time on this. Attached the original patchset, plus
2 incremental patches, the first of which implement the above design
(patch no. 8).

Local tests show it's significantly faster: for the below test case
I've seen reindex time reduced from 777455ms to 626217ms, or ~20%
improvement.

After applying the 'reduce the size of GinTuple' patch, index creation
time is down to 551514ms, or about 29% faster total. This all was
tested with a fresh stock postgres configuration.

"""
CREATE UNLOGGED TABLE testdata
AS SELECT sha256(i::text::bytea)::text
FROM generate_series(1, 15000000) i;
CREATE INDEX ON testdata USING gin (sha256 gin_trgm_ops);
"""

Those results look really promising. I certainly would not have expected
such improvements - 20-30% speedup on top of the already parallel run
seems great. I'll do a bit more testing to see how much this helps for
the "more realistic" data sets.

I can't say I 100% understand how the new stuff in tuplesortvariants.c
works, but that's mostly because my knowledge of tuplesort internals is
fairly limited (and I managed to survive without that until now).

What makes me a bit unsure/uneasy is that this seems to "inject" custom
code fairly deep into the tuplesort logic. I'm not quite sure if this is
a good idea ...

I thought it was still fairly high-level: it adds (what seems to me) a
natural extension to the pre-existing "write a tuple to the tape" API,
allowing the Tuplesort (TS) implementation to further optimize its
ordered tape writes through buffering (and combining) of tuple writes.
While it does remove the current 1:1 relation of TS tape writes to
tape reads for the GIN case, there is AFAIK no code in TS that relies
on that 1:1 relation.

As to the GIN TS code itself; yes it's more complicated, mainly
because it uses several optimizations to reduce unnecessary
allocations and (de)serializations of GinTuples, and I'm aware of even
more such optimizations that can be added at some point.

As an example: I suspect the use of GinBuffer in writetup_index_gin to
be a significant resource drain in my patch because it lacks
"freezing" in the tuplesort buffer. When no duplicate key values are
encountered, the code is nearly optimal (except for a full tuple copy
to get the data into the GinBuffer's memory context), but if more than
one GinTuple has the same key in the merge phase, we deserialize both
tuples' posting lists and merge the two. I suspect that merge to be
more expensive than operating on the compressed posting lists of the
GinTuples themselves, so that's something I think could be improved. I
suspect/guess it could save another 10% in select cases, and will
definitely reduce the memory footprint of the buffer.
Another thing that can be optimized is the current approach of
inserting data into the index: I think it's kind of wasteful to
decompress and later re-compress the posting lists once we start
storing the tuples on disk.

I need to experiment with this a bit more, to better understand the
behavior and pros/cons. But one thing that's not clear to me is why
would this be better than simply increasing the amount of memory for the
initial BuildAccumulator buffer ...

Wouldn't that have pretty much the same effect?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#29Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#28)
Re: Parallel CREATE INDEX for GIN indexes

On Mon, 8 Jul 2024, 13:38 Tomas Vondra, <tomas.vondra@enterprisedb.com> wrote:

On 7/8/24 11:45, Matthias van de Meent wrote:

As to the GIN TS code itself; yes it's more complicated, mainly
because it uses several optimizations to reduce unnecessary
allocations and (de)serializations of GinTuples, and I'm aware of even
more such optimizations that can be added at some point.

As an example: I suspect the use of GinBuffer in writetup_index_gin to
be a significant resource drain in my patch because it lacks
"freezing" in the tuplesort buffer. When no duplicate key values are
encountered, the code is nearly optimal (except for a full tuple copy
to get the data into the GinBuffer's memory context), but if more than
one GinTuple has the same key in the merge phase we deserialize both
tuple's posting lists and merge the two. I suspect that merge to be
more expensive than operating on the compressed posting lists of the
GinTuples themselves, so that's something I think could be improved. I
suspect/guess it could save another 10% in select cases, and will
definitely reduce the memory footprint of the buffer.
Another thing that can be optimized is the current approach of
inserting data into the index: I think it's kind of wasteful to
decompress and later re-compress the posting lists once we start
storing the tuples on disk.

I need to experiment with this a bit more, to better understand the
behavior and pros/cons. But one thing that's not clear to me is why
would this be better than simply increasing the amount of memory for the
initial BuildAccumulator buffer ...

Wouldn't that have pretty much the same effect?

I don't think so:

The BuildAccumulator buffer will probably never be guaranteed to have
space for all index entries, though it does use memory more
efficiently than Tuplesort. Therefore, it will likely have to flush
keys multiple times into sorted runs, with likely duplicate keys
existing in the tuplesort.

My patch 0008 targets the reduction of IO and CPU once
BuildAccumulator has exceeded its memory limits. It reduces the IO and
computational cost of Tuplesort's sorted-run merging by combining
same-keyed tuples during that merge, reducing the number of tuples,
bytes stored, and comparisons required in later operations. It enables
some BuildAccumulator-like behaviour
inside the tuplesort, without needing to assign huge amounts of memory
to the BuildAccumulator by allowing efficient spilling to disk of the
incomplete index data. And, during merging, it can work with
O(n_merge_tapes * tupsz) of memory, rather than O(n_distinct_keys *
tupsz). This doesn't make BuildAccumulator totally useless, but it
does reduce the relative impact of assigning more memory.

One significant difference between the modified Tuplesort and
BuildAccumulator is that the modified Tuplesort only merges the
entries once they're getting written, i.e. flushed from the in-memory
structure; while BuildAccumulator merges entries as they're being
added to the in-memory structure.

Note that this difference causes BuildAccumulator to use memory more
efficiently during in-memory workloads (it doesn't duplicate keys in
memory), but as BuildAccumulator doesn't have spilling, it doesn't
handle a full index's worth of data (it does duplicate keys on disk).

I hope this clarifies things a bit. I'd be thrilled if we'd be able to
put BuildAccumulator-like behaviour into the in-memory portion of
Tuplesort, but that'd require a significantly deeper understanding of
the Tuplesort internals than what I currently have, especially in the
area of its memory management.

Kind regards

Matthias van de Meent
Neon (https://neon.tech)

#30Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#1)
Re: Parallel CREATE INDEX for GIN indexes

Andy Fan <zhihuifan1213@163.com> writes:

I just realized all my replies recently went to the sender only,
probably because I upgraded the email client and the shortcut changed
silently. Resent the latest one only....

Suppose RBTree's output is:

batch-1 at RBTree:
1 [tid1, tid8, tid100]
2 [tid1, tid9, tid800]
...
78 [tid23, tid99, tid800]

batch-2 at RBTree
1 [tid1001, tid1203, tid1991]
...
...
97 [tid1023, tid1099, tid1800]

Since all the tuples in each batch (1, 2, .. 78) are sorted already, we
can just flush them into the tuplesort as a 'run' *without any
sorting*. However, this way it is possible to produce more 'runs' than
your patch does.

Oh! Now I think I understand what you were proposing - you're saying
that when dumping the RBTree to the tuplesort, we could tell the
tuplesort this batch of tuples is already sorted, and tuplesort might
skip some of the work when doing the sort later.

I guess that's true, and maybe it'd be useful elsewhere, I still think
this could be left as a future improvement. Allowing it seems far from
trivial, and it's not quite clear if it'd be a win (it might interfere
with the existing sort code in unexpected ways).

Yes, and I agree that can be done later and I'm thinking Matthias's
proposal is more promising now.

new way: the number of batches depends on the RBTree's batch size.
existing way: the number of batches depends on work_mem in tuplesort.
Usually the new way would produce more runs, which is harmful for
mergeruns, so I couldn't say whether it was an improvement or not, and
did not include it in my previous patch.

However, case 1 sounds like a good candidate for this method.

Tuples from state->bs_worker_state after the perform_sort and ctid
merge:

1 [tid1, tid8, tid100, tid1001, tid1203, tid1991]
2 [tid1, tid9, tid800]
78 [tid23, tid99, tid800]
97 [tid1023, tid1099, tid1800]

Then when we move tuples to bs_sort_state: a) we don't need to sort at
all; b) we can merge all of them into one run, which is good for the
mergeruns on the leader as well. That's the thing I did in the previous
patch.

I'm sorry, I don't understand what you're proposing. Could you maybe
elaborate in more detail?

After we called "tuplesort_performsort(state->bs_worker_sort);" in
_gin_process_worker_data, all the tuples in bs_worker_sort are sorted
already, and in the same function _gin_process_worker_data, we have
code:

while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
{

....(1)

tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);

}

and later we call 'tuplesort_performsort(state->bs_sortstate);'. Even
though we have some CTID merge activity in '....(1)', the tuples are
still ordered, so the sorts (in both tuplesort_putgintuple and
'tuplesort_performsort') are not necessary. What's more, each
'flush-memory-to-disk' in tuplesort creates a 'sorted run', and in this
case we actually need only 1 run, since all the input tuples in the
worker are sorted. The reduction of 'sorted runs' in the worker will be
helpful for the leader's final mergeruns. The 'sorted-run' benefit
doesn't exist for case-1 (RBTree -> worker_state).

If Matthias's proposal is adopted, my optimization will not be useful
anymore, and Matthias's proposal looks like a more natural and
efficient way.

--
Best Regards
Andy Fan

#31Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Matthias van de Meent (#29)
13 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

Hi,

I got to do the detailed benchmarking on the latest version of the patch
series, so here are the results. My goal was to better understand the
impact of each patch individually - especially the two parts introduced
by Matthias, but not only - so I ran the test on a build with each of
the 0001-0009 patches.

This is the same test I did at the very beginning, but the basic details
are that I have a 22GB table with archives of our mailing lists (1.6M
messages, roughly), and I build a couple different GIN indexes on that:

create index trgm on messages using gin (msg_body gin_trgm_ops);
create index tsvector on messages using gin (msg_body_tsvector);
create index jsonb on messages using gin (msg_headers);
create index jsonb_hash on messages using gin (msg_headers jsonb_path_ops);

The indexes are 700MB-3GB, so not huge, but also not tiny. I did the
test with a varying number of parallel workers for each patch, measuring
the execution time and a couple more metrics (using pg_stat_statements).
See the attached scripts for details, and also conf/results from the two
machines I use for these tests.

Attached is also a PDF with a summary of the tests - there are four
sections with results in total, two for each machine with different
work_mem values (more on this later).

For each configuration, there are tables/charts for three metrics:

- total CREATE INDEX duration
- relative CREATE INDEX duration (relative to serial build)
- amount of temporary files written

Hopefully it's easy to understand/interpret, but feel free to ask.
There's also CSVs with raw results, in case you choose to do your own
analysis (there's more metrics than presented here).

While doing these tests, I realized there's a bug in how the patches
handle collations - it simply grabbed the collation of the indexed column,
but if that's missing (e.g. for tsvector), it fell over. Instead the
patch needs to use the default collation, so that's fixed in 0001.
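
For reference, a minimal sketch of what such a fallback typically looks
like (the exact code in 0001 may differ; this only illustrates the
idea):

	/* use the index collation, or fall back to the default one */
	Oid		collOid = indexRel->rd_indcollation[i];

	if (!OidIsValid(collOid))
		collOid = DEFAULT_COLLATION_OID;

	sortKey->ssup_collation = collOid;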

The other thing I realized while working on this is that it's probably
wrong to tie the parallel callback to work_mem - both conceptually and
for performance reasons. I did the first run with the default work_mem
(4MB), and that showed some serious regressions with the 0002 patch
(where it took ~3.5x longer than serial build). It seemed to be due to a
lot of merges of small TID lists, so I tried re-running the tests with
work_mem=32MB, and the regression pretty much disappeared.

Also, with 4MB there were almost no benefits of parallelism on the
smaller indexes (jsonb and jsonb_hash) - that's probably not unexpected,
but 32MB did improve that a little bit (still not great, though).

In practice this would not be a huge issue, because the later patches
make the regression go away - so unless we commit only the first couple
patches, the users would not be affected by this. But it's annoying, and
more importantly it's a bit bogus to use work_mem here - why should that
be appropriate? It was more a temporary hack because I didn't have a
better idea, and the comment in ginBuildCallbackParallel() questions
this too, after all.

My plan is to derive this from maintenance_work_mem, or rather the
fraction we "allocate" for each worker. The planner logic caps the
number of workers to maintenance_work_mem / 32MB, which means each
worker has >=32MB of maintenance_work_mem at its disposal. The worker
needs to do the BuildAccumulator thing, and also the tuplesort. So it
seems reasonable to use 1/2 of the budget (>=16MB) for each of those.
Which seems good enough, IMHO. It's significantly more than 4MB, and the
32MB I used for the second round was rather arbitrary.
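
To make the budgeting concrete, a sketch (the variable names are
illustrative, not from the patch):

	/* planner cap: nworkers <= maintenance_work_mem / 32MB */
	int		worker_mem = maintenance_work_mem / nworkers;	/* >= 32MB */

	int		accum_mem = worker_mem / 2;		/* BuildAccumulator buffer */
	int		sort_mem = worker_mem / 2;		/* per-worker tuplesort */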

So for further discussion, let's focus on results in the two sections
for 32MB ...

And let's talk about the improvement by Matthias, namely:

* 0008 Use a single GIN tuplesort
* 0009 Reduce the size of GinTuple by 12 bytes

I haven't really seen any impact on duration - it seems more or less
within noise. Maybe it would be different on machines with less RAM, but
on my two systems it didn't really make a difference.

It did significantly reduce the amount of temporary data written, by
~40% or so. This is pretty nicely visible on the "trgm" case, which
generates the most temp files of the four indexes. An example from the
i5/32MB section (temp files written, in MB, per patch) looks like this:

label      0000  0001  0002  0003  0004  0005  0006  0007  0008  0009  0010
---------------------------------------------------------------------------
trgm / 3      0  2635  3690  3715  1177  1177  1179  1179   696   682  1016

So we start with patches producing 2.6GB - 3.7GB of temp files. Then the
compression of TID lists cuts that down to ~1.2GB, and the 0008 patch
cuts that to just 700MB. That's pretty nice, even if it doesn't speed
things up. The 0009 (GinTuple reduction) improves that a little bit, but
the difference is smaller.

I'm still a bit unsure about the tuplesort changes, but producing less
temporary files seems like a good thing.

Now, what's the 0010 patch about?

For some indexes (e.g. trgm), the parallel builds help a lot, because
they produce a lot of temporary data and the parallel sort is a
substantial part of the work. But for other indexes (especially the
"smaller" indexes on jsonb headers), it's not that great. For example
for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.

Clearly, this happens because a lot of time is spent outside the sort,
actually inserting data into the index. So I was wondering if we might
parallelize that too, and how much time it would save - 0010 is an
experimental patch doing that. It splits the processing into 3 phases:

1. workers feeding data into tuplesort
2. leader finishes sort and "repartitions" the data
3. workers inserting their partition into index

The patch is far from perfect (more of a PoC) - it implements these
phases by introducing a barrier to coordinate the processes. Workers
feed the data into the tuplesort as before, but instead of terminating
they wait on a barrier.

The leader reads the data from the tuplesort and partitions it evenly
into a SharedFileSet with one file per worker. Then it wakes up the
workers through the barrier again, and they do the inserts.
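
In pseudo-code, the coordination looks roughly like this (heavily
simplified; the field and variable names are illustrative, only the
Barrier and tuplesort calls are the real APIs):

/* phase 1: feed tuples into the shared tuplesort, then wait */
/* ... heap scan, accumulate entries, write them to the tuplesort ... */
BarrierArriveAndWait(&shared->barrier, 0);

/*
 * phase 2: leader only - finish the sort and repartition the data
 * into a SharedFileSet, with one file per worker
 */
if (is_leader)
{
	tuplesort_performsort(sortstate);
	/* ... write the sorted entries round-robin into the files ... */
}
BarrierArriveAndWait(&shared->barrier, 0);

/* phase 3: each worker reads its own file and inserts the entries */
/* ... read the partition, ginEntryInsert() for each key ... */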

This does help a little bit, reducing the duration by ~15-25%. I wonder
if this might be improved by partitioning the data differently - not by
shuffling everything from the tuplesort into the fileset (which
increases the amount of temporary data in the charts) - and also by
distributing the data differently. Right now it's a bit of a round
robin, because we don't know in advance how many entries there are.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

gin-parallel-builds.pdf (application/pdf)
S�LY�p�����]����A��q&/�~��`����n/���K�N�8������;v�����������?�4h���;���_�xq���"��o�I>�������v��-/]���y���|�P�y�f]]�Q�FM�:����={����W��5�\��}���XFV��p��z��O�<�5kV��-]]]�^��0L^^�����9s��=��O��'�
���Y?�m=�
o:88<x�`��zzz�g�^�|���c�&M,,,N�:��c������5+k��;�>��|sN���"�t��k��=z�����+~��������?_*�������v��={��c���^wr^Z�����������NJJ�������U�8���]]]SSSU�=�4i��rssso����w�"&�BnrO�y�����g��i��INN��7o�>}��_?����y��n�����'22�B�
E?2�P����������G�
����{��G���Y355����O�>vvv�������V�ZRRRtt���T������3gN�2e��M���+V�;vlZZZ��+V�����������������m�������3;;���s���k�z�Z�n���x���!C������{w�����{���+���@OO�z����5�������VVV���_�re�+/��#""�W�ngg�k�.�@�����Q���k��={��%���;w�\�re�eYlmO�2e��EZZZ/^�x�����k�^�LLLTG���o�����m�������������������B��?~��u���G�5q�D{{{OO����"W^|m7j����K�.������_ll���O�����i���w����w��h�����E����vvv���^�@y���|~�=,--���������v���"�\|mK$��+W��3���_ �\��K�.������</11q��%vvv��u����������/_&%%5k�l������������o��}������]\\<<<���6l���^����vxxx���������RSS{����a��7o������ok���0aBtt�P(l��A����6�(��}||6o������m��-[N�2Eu
�������>���?��?���U��/��������;7%%e���...�z�������_�n]�N�TGI8p������9�gk���{��3f�HLL��sg����^�z��Y�X���bdd�f��I�&��bk�2f������:uj��i���k����������.]�z�������1��=���6��[�lqrr���suu�r���Y�f��1`��#FX[[���888�����bk���;\.���C�f�j��������bcc���&O�<t�����q��]�~�������U�VM�<�[�n���[�jU��--,,��Y�i�����;v�q��111�^�gk��������7��m[^^����,X0y��E������93::Z��E����n����s�:v����1d�����V�X����������{���fY�Q�F7o��2e�H$��a������3MLL�n�:k�,�!���??��������=z����_�z�c�����e��13g�WW�}�����>x��w�����EL^lm?~�8&&&;;���m����w�;v��������(cc�-[�,_�\(,�������4@��-��^�|imm������������D"���M~~~RR���������_W�\���666������YYYB�P[[[u8\OO���W+VLLL477���+:����bqbbb�*U8����+V�(
i��P���'O8������q+�\mgee�{����N.�'$$�����xCCC]]��/_*�AhfVd_���JN�J�VVV���������
�">>�b��ZZZ/_��i�r��E���m�Q�@`aa������^�J�D���k�U�=*r���6!$>>���D__?99Y&�U�\9;;;--����a��/_ZZZjk����k�e����[YY����/_jkk���fdddff�V���`mm��r�\y���zxmll8N||��������7o�r����B��
��[��&~��M�S��'E[[���,11�ak����Z|mB���sss���E"���o���(�z�������GI ���=y���:S��iKK������T;;�W�^�d2��lmmtuuMLL�^9�
�6���F�.��M=!91]���>&�}#�u���]�h���Y�4Y~���M�����Ke�w�
85���&XvE�i]lZ���I�f�c���YB=!f�Je���<������!�;8sK�j*G%��{!������f��D�.���Z]w���Y5+cb;���Y(�~t�����LI���_8���W���:uM�>�����EM#t����Mt�~q��
PVP�e�
PVP�e�
PVP�e�
PVP�e�
PV���O�8!���w�^�J�+V�k�N.�_�x���C�}�����YY�1���G����r����[�<x�S�N���o�������}���}}}7l�P�h������g/Y�$44t��%�o�~��Arrr�v�����-[&����]����g�������P�P���@-PE��-���-�H����\�����h��u��M�81==]$�{�n���'N����_�~��C������x�����(w8Lh�=�iZ |��_w&I^^^LLL�6m�w�������=n�8�a~���k��M�:5""BK�A(U�L�9#B���f���=
�W�5I@����g
MqjS�^?z���������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������sg==�#G�T�Z�������
��}��o���x�b����\n��j4�W�����W�^�����a���G���o���������>>>����:u���+�xf(gP�����L�>}�>|��9���<HMMm��Mzzz@@�X,^�n�����UK��2����.��bk�`�<~�8���/<��+j�������]�t
�������
6���JII9~�xxxxXX���C�m��Z>''����19Z1�EI�maL�~�4���hkk��_Q���/		���<xp�f�j���{��f��	����_~�e������wpp���>�3I@c��$��P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���P���J��Y��(�4���6h���mOO�
6��1���M L�0!66v���5k�3fL��]��_���_�3@9�Q��x�����8l����Z���V�?~��7.^��P(7n����a�����
6x{{�����g�rF�j[�%�����6S~�,��}��$...�O�vvv�x���+W�Ba�F�RSS7m�������78p��]���ll(G(E��C{���*��{��q�Eq-�*Br�f�LF��i����|33�/Y�+j�O�>��g^^^||�X,�:u���w}}}����N���w�y����qc���6<����3z���uv�~?z��P��T_AYO����������� ��H��4�L��f���~<\�j4j�j4j�j4j�j4j�j4j�j4j�j4j�j4j�j4j�j4j�j4j�j4j�j4j�j4j�j4j�j4j�����������}�"���c�mP{�6k�y��������R�CH ;��_/E�
�
�:��������mP{�6����O^\H��)��G��	a��	C�K�;�w�������mP{�����.'��0�;������/�����@=��@���@c��@���@c��@���@c��@���@c��@���@c��@���@c��@���@c��@���@c��@���@c��@���@c��@���@c��@���@c��@���@c��@���@c��@�������r69��������
j�]6��
�GW]DW���g��Bm��Cm����m������g��Bm��Cm�
�6�w�����.�m���
j�]6P��j�j�l����6�=�v�@m|�mP{������P��Bz.�4Y��*�"�p���%���!�^NMomn���gIr�����}��,!"1{���Y5��{Q����Cyk{�������6@���\$g�
/a�S���;i�
&��2����$����/��yo������\�b{"�%3��������.��&�W[}\�- ��C��j�����\"�����xq�G��
��ZLj�D
,���DrY�G����������iiT�|����%]x�z�v�%"��f�8�<��w����6@��q��Q�����O.G��MLB��O)�9����I�����q
�U�(Y?�Q��|q3c���.K)�,rfNC	�-���e�)��B��1��r������\&qDh@�w�x�vKD4E&vt����Xu��&ss��(����>��@=k��b�&����j����P����j{E�$���~>�Q����m�NE��%w
����B�m�r�]P��Y�m�r�]P��Y_W�W�\i��9!���sVVV��U���Q(M�6MKK�u�V����\��	�+�vY@m�g}Em���#"""88x����F�=z��]��?nbbbee5x�����O�>=  ��g����.�m���;�=l���[���_���[qqqiii�[�NOO_�bE~~��u��Z855�������'�7h3
2���u�*u��7�Y������<b6fJs�4�X�9F�k��N����M�������|��E���d��D T4h�w���S�$����q�	!�+�l������@�]���\Z�Q�j��J�S-�jE)r����%��qZ����{R�����������c�����2G]��v����I�&��utl��r�f������6%��[����>�|E�?�94�f�X�K��^���msJ��o1]l1���ix<�����,Y��vuu=|���'���gcc���z����������C�n��M��\./��P��\��r1� S\y-N-��:F���b#M	EQM�u�\����+������~2�I�[�������������v_n�TJ��sn-��E��+�m�=>9z%KH��[<�L=:M�r���D{.�j�DLSd|'n�z���(&�D��8Ox�<�m�����?�)8�����Dy��z��+�����W�����T���Q���w�4��S�����V��$�EQ��2	�&5+P��}��%��I��f�@�P��p�`�����#G����u��G�i��m�������7o����m���u�~��P�p&IY(��$F�#�U�
&���-���3I~\��\@m���k[Ri��[[��w�edGf����{\��)��	��-~�QQ�?
j�\@m��bj�[m���]��js�Q}w�1�RdQ��{I���P������
�����
�������9���X��d�~tL�\t~_���Em�/�m�rAk�u�B����gz������-P�?
j�\���n{. ���������Bm�6����(P�e�
���(P�e�
���(P�e�
���(P�e�
���(P�e�
���(P�e�
���(P�e�
���(�Om�]�����y�'%��x
&Z���X~PX�C��9������{�m�j�\('�=��~�%)!d��0�dB��Fw�(&~1��D���J������N?.�%�|���[�Qk���6����TbuG~N�k[.!��tXB�#�i������1�m�r�����J�6���,Y�.�S��mU�>��k	��ipj�����nU���H&��@�
-�����N�{�����
�`�h2����+i4�Nuh�=G��Am������(��j{�����U���<��?F��d�����������5z|����1|�'d�I��
OsD�����=G��Am�����ur?�����rW���D���j��F���������^Ad���O��VDm|;�6@���Fm�j{m����\nq�Qamo�nFm�
�6@���Fm�j��Z7�fOd���/��_Bm�9�6@���Fm��~�6@���Fm��~�6@���Fm��~�6@���Fm��~�6@���V��^zD���i�����������l]�a����QP��j[�j{m��N�Y�L�!��X��AW��7����F�l85��>9�3���J:%��j�GAm��m���:
VW���BAAms*[���<�LX9�������^G����I�pR���K�GEm�(�m�r��!���y<��������<����e��O���D� ��K<*j�GAm��mM��g��>�6��Bm��m������DF*'N66/Oz�
�Y+��/�0�m��C�G�jK����,�P��j�]���h�<'L�2���`�j&)�c]Ew�?�r��)�v����onu�i��y%��?C~�"<.e��S�����j�\@m�������B�t�Y��~������y8�P�S�����|��I�a}K<��<�"\.m���{�G�j�
P.��Q��_��&Bq��;�P��;WnQ��D8<2|_�G�t�m��6@���Fm����sk�\LF%\a���h�m��6@���Fm����6�j�\@m��5���rB���KB+�)S���3�2I}J
$4G�P������LE(/e(W0\���W!'���T}\�r��Cs(B�W���[��P��<��J���Cm��m����vB���#�[�>-MiU���7��a�7a3b8
?���}��O�n�;@|��Q�-<]w��;����h� �9�'a�����\��+\����H�x<!4�q[���2�u�i���>������c%~T�2�����js+~����~l�K��>���~����H�1;-~5�����%��%p��<+��m���^��(_��W�G���f����w�o��(P��mM��~�y��l�
���HZ���Lj���B"K��.:�6]x=�r�=���!�VV�s�mj�L~�S�9a�Q�v��^�H|M����qg�+Gu��]�H�2�~$d4�K���eP�Q'��}�T0�v��������Z}�
k����yW�)����M�Q~��=O������<9s�c����l��#����D��:���Vc��FDYx&��E��m[�����������qh�z���-��~�)����I�R��R!d~_a���/X\����6j[Sj{�oyo���V�Z7�k{U�>�����ab���?	v�T�&�f�z���m!W����}mU�.����I
KCjW��4������\����g�.-��ydR?�O������3���N(�Sc��O��g�6�2���[�J�X��,��i�v������es�b�7V>W;��q����e"�%�;�{4���Fm��m�6j�k������_[�����\2�U ��E�8�
�g�b�*����~{r�Y�KKm�i
}H�*�x�%�������\kPDm%�f����}�oWTm��v��WAm��m�6j[�k{c�*J|�yam��]�����������\�������c���pt'f���DN���kq�$��5�\+V� �k�iZ�#�8��J���|����Y7������"Y���[����"j[!�G
�'���&\�R�,�j�\@m��Q�^���Q-��=L�f� ��G����6,f�:�j����|�������>��^��z�.����M:iW=|S&��^�������5� �Z�Wk�����gDmG�j%��c����.7s�0����um�Q��"���k�a�um��������L����6j���V��"o�������;;��u[Wj�����nmn��V���dY�_�9}��k{���������V��F�&�����"h{�g��Z���@����������6���q��u��"����Y#�z� �lg��q��>���-��=���1�[��"i+aY���������]E������P��j���Fm�mm���se;I�M,j��%���M����WX���bX�����F�����.����d���������sRM~����:��z�qMN�9��B��|��e�"��{50�U�	�j�\@m��Q������:�u.�I��*�m}�F�~�mjOdy��U��������$�XF�u)�������p�D�AX��+�'/J��6�)����h����\( ���Y>�p��J!��
N���#=B�%�T�H�g���mcB�����RV$!�z��U�y<��=��
#�V�(���rf"JS����1,�.��
T��\����`��������Z ed��Qu��- cH��5���w�*X���Zh������Fm��Q��v��.��������������/���������D�~���o���;!���G�|'�Q5C��Fc������n>}+�x�Tr��^qL���a��ID�h{
�T|�&n$��n9��<>��Ww`�G��_a�����x���<)�u�V����Kx�i��*��#4�4��q��_��s=7��B�k���WWR3�V���Zu���v��0k�xBA�v�f���B�ixn���r3��gmms��T�O��x���EYydvOA���/X�Gj�T�0�o�'����?���Fm��Q��k�V�1{��[D���^�Q����v��?#4���Y�itm`h�K~��P\n��a9]I��P�a$�Q�^��W\F�TVS_���1���Q�e���������m~�D6-��
�����%��h�������	�cJl-��]V��r�����pC��`JK���9�)NAm_tm���x���u�K&(/^C����,�^f����8����KE��2��S|>���V;������K+��k�v�_%�"�[�a�%���\=!5�-���$k�nu{�c�X�N�����e76v�m�����Oy$~�����[���me���88ZJQT��<��K��{�.=8�������{�e����A�=�����4���+�\���]LB<mfAiky6+~I���d��d��F*8B�s�lu�U��`�P[���y���PQ��$��Z�g������
:f�>�J|���\�{(��KZ����'
)+����[��Y	u,��B"������
F�������XF�>����Q�b�oAy#J}?j�k"�SF�7x���j^��2+O�*�Y����z#+���'V&j?(j���Fm��Q��P�a3���+k{_�<���������v'���_�)������|�1��/0=~A~�����U�����N�I�T�]44����vx[~��+��&,kyM~�Ch�[m��7��	s����N�q�(2���WlR��ZGkQ
�u�������$`&+��z�z|��~D���13z���z�pm�oMN.$����7������5c+�X�x"?C+�P\74E���K+[Th0��E)/x�4a�����w�����g��S��G��c��[����'s��P�)|��u���N�-���-���z�����6��Y�>��G�P�_��_� �����l�qQ��u�0�C)�
�C�i5<��
���tp��GLQdjW�����c�m;��������?1��:�,��I��
�{FF��-��O������h^�^����_x�
8*	�K^����������������+����6j���Fm��?@m�mm_	�?q�����Os��]}��7��QY����"����k#��^���u��pZ{J����SH���J%�PK�����}��Y���I��M�>!��+O���Guu��qN�,`�������t�?*�W��Zv�O���>;��-���g=����������b����B����Ris]�^�'�a9��:;_�������u+rt6�1!]��x��"CLt��]^f�z�b�����5c����FoW����n.���BFRDH�>��l�9:5]�(����*&qk*�q^�������b�.������Ks��*N�lDM%���<�|���m���(�ZP�k+��RB-�ax�b��R�!��_U�1r
����&��������
��!�!�r��L�<��+���<��R�3%��e<�eT��������x��Ra�����2�`I����"7A���~�j���Fm��Q���5��������������C��}������u/5�a���Ey�
���s���qq�]�w;��Kj��������G8U����j;�m}B(����+���]ef�����9,ao�?B�R���\r����q�!�������o�Wv+�65��8��uS���1h��coE��Z&�'������ds���.���1j +�o
Q$�a�l�l����tj�>_7��ed�7yy����T}�N��uj�����R�h��NO����d���S�x2aet�5�==�p��u���u�g"�Dh�C����J�X�(�O��P<��$���}c�[A�t�[���l;B�����!����kQ|.a9��N���Z�V��
V�#���Y�m�6j�����������%K.^�8f��o_j���.����	Q0fgo�/U!�D��/J���K�V�u�[P��&n�3!�eb�L����1M�am�f��)_S���G1w�����_t�D�$�%�/M���~�G�;X�+����������rv�".��t�f����2�b>���$5���w��G}�M�)���<��-�����[���������p�-����_�q�B�9�}��������Cm{j�;�H|���Z���3���t+��M�MnjY�w��duk��jm�[��,��KE���v��I�4Y�V���8�!#i��lz�L��I���_o`���+���n�����Fm��Q��m��m�X<a��?�����{�����vK�;�����N�\*��{�g�Es������C+k�p=�����^�����t��^_o��������'������\������sa���G-����UZwJ�C/��m%�s@�T�h������=���}���'�C�M�
��(��\���F�����<q�������f���j�
Wn���O�6���������2IA{��S.0���m��/n*_(k��%9W�+$>�=�,�S��}6F~�&���j�����$9O�O��'CO��I���Y�h-�wZsR����[�.��J$������L�h��vf�1#���MY�W������x�HOH�������^6_Y������B����!�����uy���$��s|��)������������*����IK�-,�F�fO)��#�����i5�t���� e�'��'�v
���Y=z����Q��F�c��L�����v�w����G(A!!nk6�|95]�(.�j�����������V������+8��Z��96z�&ThLNVn�}�	���i�M�0��fIo���������b.�����q�uL���a�����n?qK��!	�s
v`3j����l�Z��+Z�@o������k���vx�$��K��d�;a�dp����%2m���3�D�
�������������)��7��9K�2�<��fU y��F�h���<���`��OF����#��A(�`�
&�����/��:�V�}S�����H��S}d	w`,!�+7VK�����W��
�����?�Q6��U��.W�Vm�A[���*7VkRz���1Lc�Af5wGKe�?�<e�r�����U6�"�hE���q)�G�n����W�~E�����JqY���=��O,�?yDY�a��S�&��\"��j�r��d��r�e���2���[��x�T�qIB����(�PK���)K�Z�85IY��G
���/�����.��Rn�XY���
��]�~HY�'��O)/���2���KI7%��n���C�����<HM��76�����Y�`��PT����vUu<�1����S/����{�.�����Gj��G1���R���_m���IY����#iB(n����?:*�.��V�6���o�IL�H-��uu�W���^B�����+k��=��
j����E<�������c�� ���x�6�
j�"g��������}�sA�eD���7��m������<��QnW���SAU�g:8�U�:���p�:���u�!���*k����?8��&��s���5D&���U����f*j[��=�T�VD&5�p��`��R9��c�.�H{��>��Q>��VU�{�MJ�L~�S+�Q?���EJ2��B��������������W����;R�]��J�$� q\"ciH����:X�� ���$Ex4m������W�H�k���L*?W�^'�n\�4_�\��rjAmO�*���,�@yk�l&�����������	t��5�!��v	
H������W���ve��K��u
j�����>_�M�&�������E].�aX�����V��7�Y�_�5m��t�y�:���X��;��~q)������v����������TY��
��/�����M��6P�v����G������d��r�_��o<y+����������M-�����O	��<����7j�������������M�eD�B���j{��q�6m8p��]����5�JzA���V�/��_��=����(�s,a�_��U~&��T�x��O�D����g�G�,����pT���_���~2��QY���<*M���Q�����Q?���G��'@Q�����g����G���������*�Q����b�p���F���!�����U���o�
�+������������jc�������]�����dc���%�X��3*��%�?����,v�oxT�����+*�3IV�Z���[*+�tj�.I��M����$����v��=GG��O_�}�v�z�!����������mT]]]}����������U��?�+|���-ZB���ckk��W:W!��7o�T�P��=zT�f����{�n�:�3�"""LMM6lX�c*�������@lll���?����'��W'��8q����q���;���74hP�S���:t�����=�Z��+������Q�������'FDDD���K6jlll�z�>~��Ctt�?~@�R���/�U�V���?�Q����>1���W�Z5���o�z�����]1�����CBB���M�����_NNN�S�����022j��I�F}����{|Tn����A�����laaQ��W�^m��i�3�e�;w��6b&&&�5*�����d<���)�w�����I�������[�����<� �������yyy�%��=��}����,�J��bvX/���li��
��#G5jT�b��2�'3|���yyyR����P"��<y�Y�f���\�������z�����]g\\\�Z��,v������o�KEJJ�����,���nl\��Z.\���ukB���o
E=��222���~�{TTT�6m!��_���)�GU$���|������][�����9#Z�jU�c~�&���/,--<(�|||._������C�^�t)!!���e���-Z�066vuu%��{�n���~~~W�^U(���nnn_���v���&&&g��IOO>|��C������g��L��M���W�n�Z__�c���]]HH���_�|�r�J���;P5��;�����c�����L�t�6m6l���A���+���+W�7o����
*l��i��I�g������x�DDD$''{{{_�t)11�]�v���������M���={���"""��i�v��3f��H�m���


=<<8 ��<==�^�����=z��������7���`�##��c����9r����e�m���j�j�����|>�[�n�'���{
���F��\�r��%_����<((H[[���#44T&�
0����YYY��w�����Q�5jT��*55u��evvv����_�6l����eK�&M:��qccc��FW�0��[�n������u��888������g��}���C�9{�lRRR�~��M���i�:u�888�>�e���Wkii���%%%8p`�������C=z��2eJ�=LMMw�v�8p��+Wj���i���S��lT��-[����9r$;;{����n�z��������g�=?~Q���6~��w�����yzzJ��}��u��}���>>>+Vl�������^=x��S�N�j��I@@��E��y_2������uppHMM0`��c�����V����(��������W�J�<VVV���c�v�����S�;�gG���quu}���������������g�������T���\�R�^=mm�������%K�|�9U��bbbZ�h�����u�c���HKK322JJJj����H��#G�V�������Err����_��x|>_*����:t�V�Z\.733����������W��:u��j�5}����cc����2�M�P��[����QQQ<�j��u���p8R����ZZZ}��}��Att������[xxxHH�@ pss�v�������=!d��y.���>w�\���MMM�0�����V||���'y<���}��
�={fff���w���o��z��������������cgg��M�7nT, �����7l�@Y�b������w��I�W�����CBB

���Z�n���.N�<����w�~����W����{���'O���i����O?�t���Z�j��[�n���999aaa���iii]�v-�9�������[��=kll���:u��z����{������)S����{��-���;v���477W���y���ORR��y����E���I�}��a�5k���3'//o�������3g��[���93|�p��+W��7/  ���sg�3s��i����s�n���a����_���K�.���o�vrr:y���I��={���Q�J��'N�:u��MZZZ�;v�X����q(��=z�e�������#��������G�k��9;;+
//���@ww���0��:t(<���W�C�}�Q7o�<|��M�6���$&&8p���/^�?���w���j��:r���7���m���*U"��������/�2������_���lmm���?~ljj*�/^|��__�c���z����9::���T��}��W�^��3���a��aS�L�}����{��?d��#G������1
DEE����������������5j����z�J__��������o�?~��bcc?�����M�<y��e�.]���6l��������+._�|�����W�1�������WWx��9��-qm���...QQQ����J�������			nnn���6l�H$
		i����K�T������������{��=z���������g�n��%88���k��=����w�i�F�� ����;j�j~�J$e/0PZ��A����E6A��(�M�6e+�����U�AF$) !,E !��H#Z*�cG,`�1�,���3_����}?Ke������9���9�y��'N�����Z�������������O�>144���			MMMqqq�=jll���8� ���,kll,((���@__���d`` ..���EOOoxx����8�J����477o���gdd�F���KII-,,���R("����jmm�����/���F�x���3������R���bv������"�Db��~���?�,""������X���{���f^FFF===`�������w-H�������D"��b�p���]]��g�����������{�����K^��o��!<��n�

(***��=?CCC�+W�hkkOOO[XXp8A��� HRR���Tbb�N�w�����*�6??������:�#G�LOOKKK���3�?�����������y{{��tPkPPP�����h������EEE!!!�*��P(������>4���}��


>����{������#___OO�����W�^�~=...++\u���4�������=NA\\\�������������B��d������������fAA�L����������p8\ii���{dd��[�@���w����x���}��������!!!�?�b����������hmm����|>���6,,LYY�F�	������Za9����B	��333w�L����'N���2P�������B�����������yyy


���&$����;�������@ ��uO&�y��!			*����A�055���x����W_}UXX����%����srr������>��:�,������Baa!� G���������---�~��w������F�yxx��|KK���������:]]���WUU��q#<<|�=4����n{'I`` �J���.//�|�2�Y[[���z��A^^���?�����

�P(��������i4Zllldd�����L���"�D
		)..F�PVVVl6A������/��7����T�w�x<^VV��k����������jkkY,VtttGGG���qww7�N�?^RR��r����������|G�%�HAAA%%%���������������������������466���WTT��������yS88 ��P(nnn����}��@�}�vii)���s����h�������DFF����mmm����������JKK��;���h``���������B �����	�?������������$��`0������Y,���M[[� ����������{���L^^~���%%%���FFF@"x�����oo���:sss.�� ���iWW��m.�����l�����gFFFss3�������_O�+W�dff���H�=77711���F��n��elllhh�����bGFF����>}*������rrrv9T.�kjj���911�����/�xxxhii}���|�������Iww���
�SV��ebb����:=H����A__���`^^^~~~RR���������d29?c����C$edd���/]���������z��w���,--�=T		����k``�F����111===EEE2����x��)33������
;;���v�����]�z�*�����;w��P�������������s��]+++>�O"�fffdee�Tjpp���Jrr��}��������L&LW]]
�����%��u���������>55������?7���)�@���444
�<���lrrRNN.>>�����g������������Z������O�������X,�y:�������������:::6v���iii��<yRDD���swRC&�������+�coo����f�A1���EIIiiiill��(��trr����r�������1
��r����sssl6[RRr}}]LL�H$���IKK���������#RVV��������pv���J���������fGGG>����jkk��r������0���huu���jff����B�cfee988HKK755���KHH��������{CC��3g�O �H������4MCC���� !366~��EFF���h(Jxb��|SSS,��D������H$OOO���v333
���JY!!!CCCjjj����Ltzz:??�H$���TWW��������������$�um>1���LMMWWW;::@���E�P|}}---Q(��,,,p�\;;;&����v�����e

���Tp���zNN���Wyy��8���occ���lmm���������iXzzzEE��������F577���K�~�������~koo/:::bbb�d�����'O8����y��`0\]]�t���7555GGG���������JKK���������D"�J}tnnn�����������E,+~��������!�@hmm�����p�;���+��Y���_TTTPP����MMM�:uJMM����������


<������>KMM'������0(e�H���\�����|�������C������===jjjSSS������wttx{{#���zKw3���Y;;���a�@@�PZ[[���������^�x��b�����IQQQaeeeFF���AQ��7�M������)**B�Pl6;,,,88������
A6����C�u���{��]�xqee�������������]���������v"T������p�������:v�����@ ��~pp���(��~����Lfll���*�F������D�0�/����|>��=TWWW<���-..��p������UTT��=����������8���:33cii���F���B������J833���������t���m5""BCC��dJJJ�X���O766b0qqq6����3::������500����hteeeDD��
�����fff����j���L����N����9��u��PSS��b����o���*�7�k^:����`r��+�[���[VV���|{dggGGG��(^����(����}SS����O?��X�`���������k��>��hs���B����l{��o������^�	{hT��P_��^\\���	�v�B�����w��������P_�������������B��3��}�v��m�VWW��������P!h�K�6AA�-0�� � ����m� � �)0�� � ����m� � �)0�� � ����m� � �)0�� � ����m� � �)0�� � ����m� � �)0�� � ����_]�p�
endstream
endobj
27 0 obj
<</R20
20 0 R/R8
8 0 R>>
endobj
32 0 obj
<</R23
23 0 R>>
endobj
33 0 obj
<</R31
31 0 R>>
endobj
31 0 obj
<</Subtype/Image
/ColorSpace/DeviceRGB
/Width 1024
/Height 247
/BitsPerComponent 8
/Filter/FlateDecode
/DecodeParms<</Predictor 15
/Columns 1024
/Colors 3>>/Length 18522>>stream
x���w\S����sov�\���^8��m��E�ZmU,8� CQ+(h-V�-UZ�Z�(u�U��u/@qo6	d��{�\�
��Q���}���|����#�X�% @8���   A@@��������Q��(q�L�������_�{��qS���%�>!�?��4��?%5%b��Z������^�{J?�n\Z���$�!�	 ���S�k�A%�>!��>z�������k�P�Kj�{���;����v����Jj}�1i��J��N����OI$%vi�X����P�)��%�~�.3�����T�����������"��*�Y��vsH��xy\e2�|}}�ry��=srr��9������$�J===333������ki}M����
yO��h����[�n}��M�S�L�]����Z�����W�^�6m���^^^5j�(�*��z(��H)���~�����k	!�����o��uww��k�^�		����^���1c
\? n���G��2���)�F����%KD"Q��
�����lhh(��!C��9�e���������#dM�WO�`r�y��~Y??����W6����.]��i����_��k�<y��V�����=�U���w+��?���|���{�&O�\��^3�y����:U��{Hs�7w+�92��h����#Gn��=p�@���3i�Y�K^9�������"�rhhh�J�F�]�M`�G�i�=�rE������[9��HMM�����t�-*p�[���]��{���nQ�i��{��A~����/Z��_�j��~~~�V������nnn^E(<�\����0t����B���TTTVVV�v��7o1}����0???K������-k��Q�5�e�Z���@@�g�}�P(:T�Z�v��x��7�|3p�@�Fw��-~Y�H4��c��Q��qc�D�p���/<������f���w����;w��'��?����2���+�>**����eo�����������;w�\��
G�)��
��� �V~es������u���A��_d�ne�J5l��W6���?!d���s���(*��E@�[Y$8p�����/���X-<
��y7���/5j���s����~���rpp0�����?��c���/^�X�re~sX| �!����������}�]�-  `��W�^>|x��M/^�:���Z�����c���������_�v�����g��e�*
�U�V}��G�f����?���o�:w��v��
>����G������)y��z�*�0��5k�2d�����~8}�tLLL��_��f&�����S�N-�
��[���q�������G��_d�n������g��9.^������~�P�VNII������sqq7n\�WQh���n�6m�$%%]�|y����k���~���;v�����Q��l@@��
���������>�
�	<��HNN��q��q�v�����&M��n���}���y�Y�f�n�5j�k�����9s���===��;�/{���E�-]�t����v��9sf!7������J�YYY�+W��o�l�:u�u�v���^�z�>}z���~~~��//p���h4.Z�h��y��U�������M������V���������5kpE��
�e;t�p���W6��A��z�����c��L�R�M(<�n������������		)p�" w+���#��X�dIhh���S�Je���o���������/��O/�*

��r����S�f�����,^�X,�_��X�l�A����O?����=�������1j�����B@
B��c�n��9n�8�����?�n��}�����OMM3f��n��I��g�}&�G�E���Ow����g����.{�����}��A/^����%�����Y�f����e�m��I��!C���}�_���y����j������c��^���{������
���O�:5r�H��esO�T�;H=~�x�.]
\�����y�}es�������`0�1B*����Y��w+gdd��O�<��wo������������s��U���#=z�(��)F��ccc{��U�N������(p+���v����{��M��~��������]����>{�,EQ���?y�$�9
|��B@~<��������K�]�v�=E��Y)��H� ��������vI��o�~ �d@p����P���h\��.�
�.@�����������P�w(��%�~�wr�����@)@@@���   ���:t�������=<<Z�h�y�f�D���y���J��}�S�N���KJ��W�7.\�0m��C������7�W�^*�J�V���&$$�,;a�����3f���`�b��h

R(3f����


NII���W����o�i�a��Y�5s��-���S����o/_����0y���G��^�z����G�������sss���wuu�q�M�S�L�J������NW&7@�
E�g��`��e3f���}{RR�L&?~|HH�����NNNaaas��	������{�����]�  A@@���   A@@���   A@@���1$������u�{�r�aI�;�p�,!�1��[��{���aI�<�t�!�>$\������!@X p��aA��!@X p��aA��!@X p��aA��!@X p��aA��!@X p\��e��������'&''GGG{xx�h�b����������#*��}���N�rww�H$e2<��W�HLL�u����mbb��K���������W/�J�V�mmmX��0aBTT��3�drk @���+@�/_^�j���###CCC���SRR����j����i�f��i�e
���sKrk @�����R����{���������������0z�������877���xWW�7n�4=e��TJy���V�-�P,r������BS�V�=@��R��J�*�l�
�����+W�T*�$%%�d��������,���6g�������??��%?����� @� ,8���C�� @� ,8���C�� @� ,8���C�� @� ,8���C�� @� ,8���C�� @� ,8���C�� @� ,8���C�� @� ,8���C�� @� ,8���C�� @� ,8���C�� @� ,8���+n��������&MJNN������h������%�����#GT*U���O�:���.�HJrk @��w���p�������7/_�4o��^�z�T*�Zmkk�������	���f��Q&�X�?�
PXXX�~�6m��������V��o�N�4�04M�,k0���[�cX	W����h��u��]���V�^=�|���Ggff��������������q���)S�H�RBHJJ�N�+�[P\������!F?�E����8::*�"�V�8{����_����t��d&������l``���SXX��9s���322���J�V�0�����C�� @� ,8���C�� @� ,8���C��T��\���� ]��c��������
�R e���sB�������=Tl��aA��!@X p��aA��!@X p��aA��!@X p��aA��!@X���i��v"RQ�'�M���E��N����D�M�_N��]�##�$!@X��H��Bh���%������Q�!�R��]�s�n6�L�]�##�"}�	u�jBQ6��I[�-�q�b@���+d+�{�h�&<Y��2���Y�Y����s���|���GF�EY�C��~!�(Fz�L���������7�z��9�� �(GOPM�V��@��aA��V@�� �M�+ @X�&A� ,x� �
�I`�$��oX���7	��aA��V@�� �M�+ @Xe#c�,��St�Z��o�������+ @Xe���nD�%�T�~��F�����+ @XeP6`�l ��P��t�RFFF�n�>|�����E���7K$OO�#G��T�����:u���]"�����@�
@�@��F�q���������^^^AAA�������J�R�����			,�N�0!**j��e29�5e�"������&��CgJe��K��^J�Im]�I
��,�?�����������"�)$�?�;C���j��J'�x���%u�&U��0PQ�'���k��������


NII���W����o�i�a��Y�5s��-��������/f3���w�W��X��@932$��ID�F�D��W����A5q��{/V}��)���5+v����>��?k9�lL]"u;��[2
�;j��\����;i�F���}J&U�AX���EX�R6x���5~��������s�����E��QQ�����b��;@V��V������q��}MW�S]�'Q�k�A����C�-}�$iH�mR���liY�����	����R79�x!]�����O��
��_��m�6y����rxG����)����7�t��BH;���ZL ��r�v�K�o���BE�?�������������0z�������877���xWW�7n�4=e��T���RRt:���=He�I'����6}ERSX������T��c��g�Uy����k�r�i
FIZ��+���X���e����4][.}����`	��5���)����em��,���F��pr\�>���zG�8`��^�X����eY��K����O�)�T�9P�L��k����R]=�����"��L���y���HU[���n���}W�c�	s�$l�����.����u�?����w�~���Pn�,����l���e�I�s�y��$"$���OS7���[}�����W!�>�e5��j~%�d��?a��G�B�[�Z�]�)�[o��7d�������s$Fu�Q���jK��T�kXJ��������;��X��Z}��l����LM�3���iJ����V����X�2E#r�3v��#9Ft������[���k��6�v����s�������������1�zp�+s?�J_��rO���i6{�E\����t�C��������l,����f��D��p3��`�7,�q�E<y$e>0�XG\���8}�8����r��������;}�8�=����U�N��/�A�������KDK��7��<�q~OJ��lmu��C'��z�)�������%���x��ue�Q�
���2�?��7�6u:w{��EQ����f8T�3����E�n�!������n[����W�����p|:�JI�m�����P(�<[� 99y��Uqqq_|���;w���d2����CBBX�
trr
�3gNxxxFF�������wM>��w:5�5OJ�>���S���c��IC����������M�����?L"���~2B[u��g��M��9����n z��:g�/D��v�"�i`��8&r�4�����B$�xyg���_�w\�<��G�:�*m�b�9�d'RN}H���~�8M�)���\���?�]`���d�����=�e������\�j#nk���}/1e�{���gs���R�Eh���	ec�~]�%�A8��d����o��wT�1W�qOE��7�|3��|BHUE�����U_^8L�����&����C�_O�O3����]$5���`��$�������:E��y�;�i��`b���|���3rMvr� \3^^c�B�q���~�Z�|=�$��W����0��O�N�n�E�=�����[�o�����U�q.�~�����/�I0T�Z@zB���|6%�h�ct�����w��$B����\������������������������9{�&�Nu�jdu��$n�Z�����������q��y>�s�.I�l���_�L�#���n��4������>R�=�np?L<B\����SU�c���'!�U���������	����I���rq�q�������������wn�	y��W
��]���0���04�I�v��6m�9*;m�b�}��t�w�z��$������oU������/L��6<��;�����"�Jn������v��FM���2I�7�K0n=erl�j4�t&7#��N�q_��������3hC��m���#�x���x1-�2��[Y����z6wJ>�;��8$��M�$nD����.,�;�D	�����t��Qk�_d�����
����� ������@���y��"��wo��e�g/ ��XB�w�K�m~�@���	��i��P����l;J^��xbw����:#���QJ�������?����T8��%{g�������(�L��,qM�r�A�}��r���]��������F�����Z�g�\�'�,1��@!���Z��so%w~U�[KK���2��7	��!JOE�e{t��r3/!k��
����b�f�LJ���n�D���������R)4�V���BI(�]I���2�����^�TSf*G�+���H�����
���W�W�r!�M�({��S`�6�������{�n�V�O��t��3�:���OzB�2�J�gS���5����N9�#����6��Qg>e���p.�.��yP�;t&���� ��)�D�v(%x� ���<����V"�yHU?�5��>^9z���0��-Um<]�����	��P� ����s�
������cJ�n�R��7	��!J�� �7���,��Gj�
 �
��?[}�{�����w�����!�M�({�
M
9�����X�X�X�@X�(w��Ie/o�$������^�~d��G+�o�^�:��}� ���CH��d���-e�%�r�@�
��K�?�H�2J��uy�S��&A���@kf�Gft�-A�����>�aj���_'��s��8��)�/C�1p�n�5~/�2��+�O��Z����C�m�Z�j��w?<�
�@������v1�Y�w���1�7��R/���_���5�T�kuw�a0���T�O����~���O$����n?�l���gg����t���$Y�n��J��N*:�I*b�L5���9��"���	�9N����>��Z�ao��n�����F[5��8g�_����c�N�B��h���<r����4y��no��� ��{���)�R�0%/��v��B�sI���t��W�Ks"�.wE��������3�9u������I�bZ;t�d~���$�M�o������n�T�1�������~�
Y�������R�<K����$1����|��B9o� y}�%��}� !h����� ����!a�t������g�qa��kj�!a��K�4&�YZ�IK1�I�l�$
Y7X1e����<����-��y�T�� �?���*�i&�m��wE�[�L1�}�;��������?��b���U��������1z�;]���9+:�IePze3���a>HX<T�rg�hs����IU��7���F7�w�*
L�7�/����i�h�y������3U�(��R�������R��)�����HA�zS����iI��Y���������A%�x�C=�O�7�^=���lK':�����hG����9+:�IePz������M����
)0��\�p�7!�{t\�S��~��L�a9���[j�T��"����jN��<���@A ��5������{G��1�tc��]� �I��m���L]"/���N�oo�?�,��L����]r<�����oK+����-7jBI$%;�� E��Kf������g��({�������9�����34!1=���Y���^}��D��4�f���`����9��.V����
�d��
�B��������X�;��s�@�rn#�z�k�������_P.c���w��E�cC��\��ZDBR��O|��6����N]��y��<�����}����N���u]��w�1���)����������qO��O$���Y�IC��kL�%g#Su����~��d�T��i���������2���(�5����Q��W�x����IZ����������!��� �|�u��4��B���F~�O�{��<G(e����Y��(�I�a���[�'W�&��jKd���� e�Z>�W4m�6����������g���'��U�/'�H��f��a���|�4}�4����%~y���^}�L����EY�����;��q3��+m��>��fv�I���n��*b��m�g�K�i�����������
���su���5'�q�df�BH}�Z?���,i�]R�2}���_��u�uq�x�I�f����J�#Mwn�KI�R0��ZLWO�����
@�C������Si������n�]'�R!��k�`gw��w�f������gZ]���C]�g��/��������?b��N;TRN2�3��)�-U�SJl��3P���0�������WV���c�f^��~�in���%�+��.��0� �}�E�����Dw�3<�}�^�2��Q�3���Wt�������q����m��`��0�;~���)O8��4�0|���������y����P6�5��c�B,J��i���RJ���S�����CS}gzz�� �X�z���
]g��������k�%��d������������$�I���������D&&��k�����������:�d'�KQ�_�bB �4�����U���d����	�������~�!�]JP�$�zP�-n����M%��n����cX��$>��LV&��9�T��w &
�s�H*Q��#����B��{���8i�z��z�-�X���n��`��@�'L=���D�����~��k.,���c5Kodd�����i�t�����j�A�Z��`������z6���-��f��������~lN6�H�1w�nY������s��)�r��r�����&�w���;Rb��g#I���+:�yGZ��F(���"U����72^(Z�>��"��������(�%�72��������.-G6x���s(=��0H�a��D����{^��l���_�:w+�>w�xp��nV��I���gJ"�p�����Gg	!��Nv�0�3�;w�ydPH���x#`w:{����qy��]��PK�>r�:�I�(���sE�������^l�97����wuIa����dD*���7�W.�6�������"�"B�n�������6�u�&V\A��3����&5��m$y�}0�h��.j�������rd5w��>d��r��F�/i��'��{�z��:���PK�89_}1�%l{��Z�����N�
������W����SCH���H���<I���������!��tV&!s��V3�r/����k����	-���$V�X��C 
���Y��=�)rp�*e ��&��%e�}~��+��o_kF�����(��>V��	�����ST;�G�f%��-
�����������d�.8���A�k� _�s�t���"~h�Y�f�;������2��L�}����y���E��l��R�����(P��������_���" � �|�E{�>7���e�c�i��x���s&��%;���{�;�hS���z�I��q�P3c��������c����f3��F��BW��'BP �~�B��F�k�Z<����j[7s1!@����6��N�#���p�\��Z��9&��3����:Tvl����(�u������<@��B �5	�-O�_xq�2��;���`����@ C�M$<eN%UT�V��h�7k(��~�jQ�D��DBX�������_�|�i4���@��I���
�t�"wg��_�u����E�w���m�t�������f�a�m#"��RJ���3#���4(|��k�i\��T���!]fiSG4�])�;P�&��%�%e�D�w�����~o�?�h��9	O���1�z���O�����M��k��F�Y;��+�s���Q���M�7��KVk�pV\n����o����d���lV���u��@4��
�:�r���0/�?��4��Dk�`;�V��G�;�^��Q�f���gO���*��eF��Y5Q�E�����m$����[w'#�@�k"��@X�@XR��'C���H������S'�l��m~���7�Z�?q�{U�l���Y��*�Q��N������D���[<�y�bD���R��$�o���y��)�n������Y~���-�f�6��O��uPR�w��Z�r�����ck0����o~�~���[��Va�Y�@���(`	`I��L��G��h�����������W�Rxw�g.[�/k����h�p7�v[��Z�
w 5��7�zV��@���(`	`	�_��B���L�%
�T
��Q*�7��v�JFY1L�!� �`	`	�_���M�9p�8M�������b��@�^3b;�[r�;X�ep���p37�F{v��������/���y��h�Y
��T+����PKK����O/��>�3}]���������f�����r�*C*���(7s�*������%�y�b���M��WA7���{l��lG6xWB����%�r�@X�@X���D��x����<�M����t�����@!����������f���c�r��3.K�������������PKK����KK�������������PKK����KK�������������PKK����KK�������������PKK����KK�������������PKK����KK�������������P�a6o�,�H<==�9�R���o��)www�D���!� � �5�!� � 
�k�.�R�V�mmmX��0aBTT��V��������������P�����3g�������4�0M�,���s���d�%�%~M@~�%�%��Z�j��!���qqqnnn������7n��iz��)R��; ��������odE:F��#�"jy�Oa)J�T��i�y��JjZ��&���!S�r��l�m�Di�h�P�1�X1Cq���^j���<F�2J(�Do�u&Jc�%/�
���nf+�32nG���Z����i�B����B�������*FnkH�Y�<��F�0�g��������Yf��2�yf�$a��E����F�Q��3�Y����1eT2j��{�2"�^!U���|��j���~���e6�T����H4b�rfEd,%6��������f%�L��l4�l�gf��h�����J�3T&�T���L"�Q!R����Y��d����)Bl��>���Z��-�1��Dn2J��yf����g��4%a�b��5X��HL"��l!bkg�12#+6�AeTK�3�FN�Ejn~V�6�f�3��-#�1�|�gI����<��h`��3���I'6o-�%&~f�I���_���������d�YJ�����=Y�)+ge���$��M�Zo�i[F�2���S���F,��Q��d��61�
�^��3��F��{>���V��g�"�v�F����#�uJ�Zl�x�������������N����V����������2�������fV���yf�H$��"�q�3J��f���%2Q���`X�����tj����z���������g�2��,fE6&�4r�s%RKs����U
F��_S�h�3���dR���f�$:��g6Q"����MJ�)�sT:5m�S�L.��$����Qd�=��	73�0r�s#-VK���3+�Fw?s��Zi4�e��
�����c��(�|�����YczyE
:G����Ov�\!���)��I�R�"~f)+�1��&�y'&�H���i����4�������gV���y�l�L/",Ub3+�leN�����2�Zd�Sk�Q�@q���������2��N'�����g�#6i�����Xb#�!�g6R4k��,Ke3��CT"�R�&���J!���s��l������m�:�Ik���5R[-6?���5�
���$���yf��|
������zf��:���S���g��d���i�����Y�=s�������(L���u������:)�S����9#��_�?s�:F�2d���,I��<�gS�}]�����3�_�,�aE%����q3��3��u��g������mM�����Y�����Uq����5J���?����m���^nP9�U���er�T/1����B�Js���&�x���u��g��`�g��V���.G$�K�����T[����{��_���Y���		aY600���),,l��9���~~~��	����.@P�R!����j�������3$''���y��������:t8s���-[�z��U�
EY��Z3_�vm��Y�}�]��5�j^��	���9��s����F��{J��;w��x���/���k������`��6l8p���Kttt||��g�Jcf�Z}���w�y�����;w��������r�JKy��j�z���^^^s����d�|�I�f�������iii-[�,�����|||�v�:n�8�.��#""t:��������X63'$$\�pa���W�^��aCNN��E�j��e�����k����/�X�-�����k2�>��3�R��_���{��T���c�������^��,�.Y���������v��2��V�����h�"�X���������/��g���OE"��	��m[C�<y�v��5j�(��soQ�������g��'Nh����W��b��|���9s��:u�4����(�s���������������g^�|���.�H"""2224��e�JifK*d�X�����|�����������+ggg[[����F����-Y�d���:tx��������e�������Ds��]�t����E"Q�>}�e�����Y�����i��'N�(r����M�6
6��������>����D��������<y����p����#��>�Y���9s�,Y��Ex�����N�:]�|y��1�W��;sPP��E�!�=rqq���
+���������}��N����qqq�Dw��y������7o�\�j���{�����7�{�nDDD�G����o���xzzFFF���������j������+V�����O��5[�n�k���U��m�v��%�����|����w�����3����t����!C�l���83��������=[�v���w�3���\�v�������g��uvvn��YRRR��W�Z��K���4~������gO�X��m[~����e���;�O>�������������P(���W�F���;���m���x��Z��>}���s_s�u���9���)99���f��-��7���T�eb�����O���/}||�9���r���a��]�r�����:u��������u-==��~������?���q��y{{�����T�0����		���/�����.�J��9��{��7o�Au���[�n}����fe�ae3�����}�������X�~�Bf���2�L			����M������v��E�m��������������}��y���g�����L��@���RSSG�Yszyy5m���������������g��������322���/��u��p�����a�O�>��a����c�}q������z�j��A�������������_�����5�[�n���ygn���O?��}��Bf.��6��8x��������,Yb2�����f���3���>>>yg~��A�Z�f��^3�����s����e�}�������3��1#...###00p����g~��Q�*U�����[WJ3[RQ`��!k��i������7m�������E�thhh@@��	���{���A�}={��U�V�������{����U�V�o�����93!d����3g�\�re�6m�4ir���Wf�����{�F���k���m�&��
�_���}||F�u������F��+.\�_�����t�����K<Z�P(���c��Y�fMDD��u�
C��]����f��y��-^�8��S^^^�����;v�8}��G}���?W�Vm�������W��tpp���i����j��y�V����������?0}��Y�����������u�83��3g�������W_}�r���g&�\�x�����7o���}�QHH���_y��?������]��a��?�x��y�����/����gn��e��5M&�F�9p��k�����m�6oo�z��y��"��??88���?����w�S�N����������B����;7n��I��}2����p�5s���37h�`����V�Z�b������DDD�?���>��~��3���n��i�����������,�oss��FDDL�6-wf�R�r�J~g���������/o����?l��]��w���~��{��#Fl�����{�[�>~���1c4���cY��7o������#�j��c��]��7$�����w��[�.���������5k&M���o��;w���srrBBB��������b����f���S�v�:u�����.]Z�������M����+�����
����y��������L�4�Z�j�3��������]$��������;w����O�5W�Z���b����VQ�O�>vvv			����O���t"�h����/�0a���KCCC����/_�����~����?������o���C����[�2sBB�Z��:ujXX�V�-r�r�I�&U�\����^����W_��G�'O�1b��������w��{�R5`�������8x�`��'O������ 22���s����'O.�_���c��:t�����d���+V������/:u��o���������s��Y����������U�g�����k������\.���y��1�U~7������N�j����7&N�h�������?3���V�����8qb|||qf�(j���,������)|��
�����5k6i�d����&��������3g�����3O�8���)!!!''�����|����[�������={��V��^�W��QQQ�3�^��?���������k?y���������3�%,���[�h���+�N��r�JOO�q��U�^���s�
Z�n���Ki�

����(��������|�����*ei����{����K�.c��	[�bE��i�Gl�@��������,h��E!3gff.]�t��%��������w����f����j��������������k�������*m��������o������.���_,X���X�r�����#Gfffn������V�Z��h4=z�<�}q���/f���b�����������	��H�~���]��W_EDD�2��/�����y��g.����|}}/\�`i������o7j����q���E��a��V�Z�j%�2���[llldd���/_~��%�^��6
?�����q������K������9::�}���!hxx���Sg���n�:�X\�JLE
��c�����L����_�p����:t��+W<<<�9��E��x�b������7o�tpppqq��V���aC��u������Jd��G�&%%�;����E����?~���i��999����w����J���y���Wfvuu=w�����'L�`cc�w�m��%&r�3m��J�*�Gf>x�`�&MN�<�;s\\\��-�r9�����Q�F���u���������[�j��~��I�&nnn����j�JHHHKK{��Q�z��~���������7��cG��m���5Mjj���c��=����C��]�p�c����������?���Wz>������J��={:u�T�z�b�\�v���j��d2=~����o��������������
ERR���CM&S���={�����;����3�t��w��O����g���;x� EQ�����{�fV*�666&�I��������O�>MOOws�>�P��n��Y,�?��~=���+III
����#66���[��]����mmm?~��Q��g��������G�v���z>|���)S�(�����p������36�y���N������C������h4.[��a���������y��}���������
�9>>~���Eu�����KZ���F�Ru��5w�>f����>���������0aB)=�����G��]�6��k����_�������GGG���80&&���{'N��cG���O�<)�HF�u��a�w<���dddl���C�]�X��{��������T�Vm��MS�LIKK�?��M������2��]��^�J���J)�M����m�6�V���FFF9�7�|��9�ay�?;*��mllbbbF�������w�h��{��yg0`@LL�����A���q��#�����_;v���[���/L&���������]!T�����f���u�����_���������������t��\\\�R��,((�w��o��vq�����?��~�z��S�4hP��d���`�Dboo�������8������C���������^E���Q�����   A@@����l~8
endstream
endobj
34 0 obj
<</R20
20 0 R/R8
8 0 R>>
endobj
39 0 obj
<</R23
23 0 R>>
endobj
40 0 obj
<</R38
38 0 R>>
endobj
38 0 obj
<</Subtype/Image
/ColorSpace/DeviceRGB
/Width 1024
/Height 247
/BitsPerComponent 8
/Filter/FlateDecode
/DecodeParms<</Predictor 15
/Columns 1024
/Colors 3>>/Length 17825>>stream
x���X�W������"���wQQq�[������"*�V\����u�JQq��E��T��V�{����"`
7��������p�'����H$����� h@#A��F4��� h��p�����D77�������������������a���zzzC�������U�{�������jjj[�l�?���{KJJ���W�X��S'���������������U�={���k���5>>��������b������eee������gee
4�R^^.
)������l�	��7o
��M��={��������d2��Y�x�bUUU77��*))����"JKK�tA�����;��b2�5��VG����y���X,ww�]�v	�B������0��c��_�^"�l������i~�?<�@#A��F4��� h@#A��F4��� h@#A��F4��� h@#A�7������4��D���g���������uo���g�����Koo���k����g�/������1=M:t�_�����7�"�0�{#�����V�#v��l���H���=}!���%QY/�����]���V��s�~'���cq�{�ZY�p���U�����,��z�[z��S7���g�(xHJK���dj����+NT��d�\9��n�m��JE�z�_����U�}+���U���Pz�Z�*�`:��054e'*x��K�����{�R��E�z��YPv�������'*x�\�P��L��2�d'*z��*8r����{:��D�"@� ��PX�5
��1��&JXN$b��2�+J�uI�]f�����J��m���g�{�z��w������.z���jw{���
%��:��df�9�%=F?!��Ga���M ���RL=r�@�kS#�6�x����1�|X��a�������h��7�P�@/j �@
��B� ��PX�5
���@a!�^�@(,����zAP��@/j �@
��B� ��PX�5
���@a!�^�@(,����zAP��@/j �@
��B� ��PX�5
���@a!�^�@(,����zAP��j����{�~�����'�����w~~~dd$��7n��M��b��u�����~f��@
���ml��)))���300��������`�������x�3f<z���b999������U�������������._�<m�4OOO&��z�j__���??�������A��~\P\\L��uO���1�L�X\�S��o�oX�x��4�f-�7��oql���gyq��m�;�\��U�Nc��-���53�0�%:��D=w�#�����������b�O��(�0w�O���4	�gH���)�������/F}�P&����S��j����|������S����w�������s�.[������� <<|��!M�6%��D�o��C�)���G��=���a5k��r
y�Sz����Q��d��l�jC7���
e�/�]A�8x���Q�F�8$��%� !���?�'�{������j>DBM�'Ym�M�K�a�bp�����f�S��f�|��n����{��������S�������������~��=�������D
<@a��� ��PX�5
���@a!�^�@(,����zAP��@/j �@
��B� ��PX�5
���@a!�^�@(,����zAP��@/j �@
��B� ��PX�5
���@a!�^�@(,����zAP��@/j �@
��B� ��PX�5
���@a!�^�@(,����zAP��@/j �@
��B� ��PX�5
���@a!�^�@(,����zAP��@/5��s��B�z����+h����b�J����LQf�(5������l�� �%�xIl]�Y���f*�f���s��Isc�����V��,]=�f�9��$%�����P�8'�^S]��{�&"�X��y���8��5SU���h�RC�w ;���[��s\^lA�@c��M����=�W�i�$?���7�O�f�!���
�w�5} ��:��g�v�N��%dx{N�gg���a������l���B��Lz���Df�2�4��-l�_-�����y�7a��mK�1)��&@/5��=�=Xz�E�S�
��{�9)��|TN�r��7�HE�����Q6s�?�P%b!��w�I���,��.V�*%���f�|v�,�>SSKm��f�!��&�6���tl������U�����!
�)=��'w��%����L�B��4�zAP��@/j �@
��B� ��PX�5
���@a!�^�@(,����zAP��@/j �@
��B� ��PX�5
���@a!�^�@(,����zAP��@/j �@
���m��������'O��~XX���[YY�������������7t����� ��PX�
����7o����]�zu���{��-))qww_�bE�N�LLL?~��o_+++J����@(�/x
��e����777///����������������5h��hX^^.��xr���x����]������1}!"�3�oD������^�������y�3R���=��{���{"��_��?"(bu���3����C.�l�%���?�I��(OM?$q���OB�5�}�u�X(t^�����I�|&�@���.1!�����S��p��&w��f.�����!���x���Lz=7���*���V#�#-����"��M8��c�0'�ZUy����{���B��&Iz��l��e34t����|�Z��u�z��mbb����{xxxzz2��5k�,^�XUU�������JJJ={�$����y\
E�`��3�-��w�h�r~�X6�.���E�X����/�>��m�c��b�vh��3����O)-�NK�����.!dZ�b�}�q��}(�Yr��x�Z�+vTMqb
3��~
9�zL9�k������������N�����]-���R&�85+k��3��D]C<ee3��J�%�*�=��b��7^��.��r�XI�@�I��:B
F��Z�uD^��2����>!dT����)IsK���)���c2����5��VG�{�����w��������CB����%777,,LEEe�������H$7n�Ms�#<�x
���������zAP��@/j �@
��B� ��PX�5
������\���F�)����Y��@
|���t���J�I}�f�X�(��(����T��|����������W�Z���2��Yj%�$o���"��=������qjk�e�7�o��v���^��|���z$�H�;6�T\6�v������V��
;?����G����{��B4j �W��c���SlGN����g�����g��e([!5�B�Ax�^���w��~]�d��
%$��D��	Y]�����Ptj �'N;"~6�����i]�Ol z1M�����Sn7���"�(=ttx8.4���@PO�~����n(�Gn�SHL���%|�T�;� �^�@P@
@���zj 5�CP�� �����@
��@40j (S�ks��;����M�4����C����h�t}C	���G��R9��E��H��/����e0_��������:��@/j (�5���C�P�j
"�y����	!��p��W�M������h5���I$t���%����L5%F����maI������G?��n��5���t%�FU]2��������#L�[1���������S�E��*
�6��r�*!�^�@P&{��8+Ci����U���A,��1��G�J).Y��������E#��-Sp`���Hy�D��.5K
�����O���*����Ob�3����>!�lng� ��.�m�i�X�!D�}e3K`�����8m�kxVL-��d��|�����.-db����B�Pz��`�'�y=��{��l��A� ���L
�M��&,na@��;���i�MX�z��"u%2�KQ�?B����\���\s\YG��$��^�]�%�)!��u�GQJ�Yb+C�X�X�������24�f����O�� =����/2?
��|���D$�99���R6�� �^�@P��x��\ZI8����!�BfX��2u��D�J��]�
��J`�8n�6���l��A� ������
@
e�@4Tj (����BP@5
�� ��h�UxqY��6��%���62^�K���E�37Pj ��
�����F����[*���6�������'�6���n��Ur{��/� J�����BP�P�m��N�� \2��g���xMN���e�*����fFP@
@C����CP�P!>�h��@4T�O!85�6^]#wv�.��|
.��)@W�P�MTt�8p��e_�>�t�M�z�q��
��;���DU���'��?������������"�������u'��<b?�t���H *��?DE�0	K����J�6�-!)���l;p���y��:H2O1���NWG�����HD���qu���=�o&�i���V�{u@%t�R�$�!C�?C���'P���� I?L�J,�JG	@��cH�y�2ar>����>O�|
�0���V���0��8�O����0�����D^��T�d_fhtf�W*������ch91M�|�����`���Z\Pw��.���5���j�K(K��*
����_N���n2?���~Z��n�f��
%������j��QR4�����qC�{�([���~\]� w5)}����4u��	5A��+�mE���;�-��=��]�;���6	JL���|a�f��q���.��;��s��%����9�z�����P�=X-���+}�t@����JH���ze�j8�Q���O/e��7�#�!* _
��@|c
7��%��2M�:�!�����r�H"V��6�M����:G�o�n
(|l-(/r�e��2�� ��@A����$�x��ltcs��G�^�__ �,�0S3������!������=��I ,�+-�S�a3Y�=Km%�'xy�F��(���=Km����E�)���������8�p�o�� A�}GL!d��X���5�X�~���Y��!��DK�|�d��%O�^����j�}}UIY)Mf�9��4x��`#e�3�;�{�Hr���oa��f�t�����������\�p�o��uK��@��;����&��k� ����O�Y�=W
�w�Hi2�f������D���N$���/4SUVb����r��SR^.NKa38�Z|�=�I��C�1����7��Z�$9�4jE8�A	����sK���<I#-&�ID���S��7��@|c�/��[��������f�5����r��h�����K���?���&C�B���.��9�$��(9�?`0KG�^�^���g���G������,�#eEd����Zz��RS}�m���b	�����}�<�!�������2s�p��D!~��}fx4�Awmw4T��)�j���� �}�r�=�1x�z��@|c�/��[UQ%N�����	OS2JJ���4.6��L�EV���-� ��z����S/cW]<�r�����N����L}>o�C����.��s���%�����N�Y/3W�g��[����	�>~N9��M`�HZ��\Y��O!��y,fji���@=��@|)@���^�m�$Y����	�rd�H`������w�K���/b���,=������b��5j	��� _
P'���/���?HeLc����d�":���*��K.�+X�Bz�>R'�O�\���G���=�?�4��dqw�?"�H�k�e�|����@|c�/������GGrZ�Q��Y���<=Kl�`9�\��n��	!z��?2�����������Y���3~����_9�O}M0T�_|�bC�~I�o���m;F���}ajni���u}R[�����BVt��+H_ n���T_�*�X�U�
J$Ym����PVa���p*���Xd����c�Wu�]i�����f���i����}�XS����ggIE��M�{dE�@|c�/���
^G3I�A�b�4�t��&�:�����{�8��`�����7�����^=%��[,����D�@^3�1$�>�#��X������o�_7M�8�}NM.�����Gq1����9y�{[$���"-�u����1���XHQ6���T��!o�BIAQ���M��7N�JbN�?<u���L��Cw�I9��9/A�6[}���z��$/��5*����*�t�55?��J�4����w�������k�^�j�0�{�|�e��Ly�$N�z���@ �1��B�	y��I�a��c���qu�����������t%C����(~�z��L���t��O����
�XO�k�����s3�z���]��A �M�<_N�IY��2����D��s����}���P�<�maI���g?H�1tE-����2���q���q�iq��o��4������ihB�����'�T>t��-��V|���6���A �1��B����tAvQy���I
�����/��j].j(��3�T�u]��H�~���'Ho2:%W6M��%"zIrono(h���!���� _
P'�e�[��YL$?��z��������wV�Mf��5��N�^|{���;/@���k��q�3���#_,S}R��(�3X*W|�����%���	B��������>��Q_���fN`�4��	�RI����0Y��������R1��$�k�c��d� j����?�^ZH��v9���B �1��B�	�@���U�;��"/BV4�0�Y����UI����">��[����nJ�R�&�`������i��q��$+����.n�����Z��Z�5���0�4���>F��x�f���5��V|� qz�y��x��	�����O�=xgyqI���~K�8sC�@|c�/���7Fn �O��[���������b�Dz�"����7t7�%$��f�6T�3P����Q���\",m�m��������gn��o��u����=�Lxq��dM��9<�A@���D�U>������@���a��D�f;�&�@��� _J�i�&�X�n�:�z������=t��n]�G�
"�������{Tq�@ ��K�!��[��O?=z���b999�i��
��� ���@|)9��e�-Z���3h��F1&&&!!AN6H���Iz*10�86�|8!�����-
�����yO�';�?�/*��T������N��u9�����YS���]�T�T�U�{'Sz��V��<+��/����R�g�o_SI�#�H�'&M�'K���z>(7x��T��2�����u��2����-�%*,�@�	�Thn(4���� ��b���
�K/�j����e�Z\�N\]^���b���J-����|�1����W�K�X�3OPq=��!dP��1gXDZ�o[�p%+W�*:1��%Bb�.���c�|&������^a�"��KI�����g�.�QB�
��
�q����KJJ�v)�����O�v�|v�������3��t�v��N��(ifv�w"-��EK]������T�R�d������q��,a�����m�(Gz���F��%Fy�P������&.,��~�H��<����D1�,���xz��� �6j7>�C�to�<}�/)gH��{o�;u��]e�������q�I����O������P�o�%����X6c���8t��[P^�N����tN���H�����B����F������4��@�'���l�w��1S%W��C��FS��8����~I������r����x��$�y�H��Y}�q3�|��N>��V�e�b����E����6�d'O(l���`i�JcK�v�r����\b���X�)���4G_I{@�Rml]���x.ar��G�s����N��p7r</;C\(R�!�
�xLq��ve����z=��SP�f0F����J���d<e�$>_��5?���"�t�,*N/�Q��tQj�sY�������1I�%��X������~�h�^ y���Qo��g����{�m���T���������f�uc���0F~�DG��\�[t[zN�w>$	����Z���a�+�v������
�8� CPD������<��������IDmul���sbC����'(%��^C�-PK�e���.����}'ia�Q�c����Ry]d��v��������"U����y���`��R�����w��c��]�H������p^z�5�b{���\���};���!}t5�2�s����TF�t�K�w��o3D�"�Y"������oV^�gK����7j+��e�����d��	��97[�e�3�$�d��6U7�/Q�FK�d�G��������D#6Cz�p�,c��&��mf�2pJ�J�Ds����������H8����wWKJJJ}����lr��G����2�i������|�@������~wA�a0����a�@����N�X,==��&���X,��g������PXx��FX\�x����v���������?~�����w�'NDFFr���K�~�/�9;;�����'O~�����37m�diiI��]�~�[�n<�sg�df�H�y��y��=����~6lPUU�j^"�Z~�~~~s���<���k,X���PVV�����q���-11Q �h��sg

j�����]��,Y�j��*���������?s[�n������yzz2��������cbb�v���3����������r��������3���;vL�8QEE���S���AAA_3�L��n)))'N��5kV��O���^���~r�N�<ioooii�m��>p�\OO�*?P^3������y��-�~��'3W�����r�j��u���VWWW"�T�\��+s��]���������^�zU^^nggW��W��'Of��u��
6�M��������������+nSBBB�^�����}��zy��D"9y���a�j���6���'���yyyT��T���y���a��U>���Q��+�p����������]��k<g��N66���b�x�*/�l����Xx{{��w���%00���x���+V����������;u���p�������u��)S���/^�������x�b.�K���������l���3��[�d2g�������{wCC���/�������\�R���v�����;::R~~~�F���m������������������[���������e�BHxxxaaa��=����������
>����������7o�o�>--�)S����SSS�������:tx��=�#}aY||���;9NFF��&M�����b��;��y��;��l���iii!!!�����*22277w��5���S����|���G�7o�,((�4i�l#
8p�������S��Y�dff�����.����WNLL���rvv�r���������7o���_�x1l��={�())��=�y����Z�`A�F�����O��o����|��k�*++O�<y��E&&&��
����d��.\(}�na��_��������������������D������>>>?�������z��)���������u���X����l��e����/���3;t�`mm=f���3'''{zz|��"������`$''���
8����c��Y��M����������e��������7.]���f��1#88Xv�����6m�p8�_~���a055UWWw����7o��k�������k�'N���w�����KLLTRR����u�V��UTT&M�T�����~��g6l(//���wttLJJ��x���ZZZw�����{^^���G�+���0  @v����'N��z�������)������s����c�����y��K����+VT�MIII122Z�d��e�d�C�EFF.[��m��,���!22�����-:}�t^^^�^�|�q��OnS�����[7i�$cccj������p--�6m�DGG����6l`��|>�����d&&&r8��w�zxx�;wNvH

555��������=���E"�����/�������4iRZZ��rmll���#�rss�v�+;�gff�D�%K�P��e{�S�NYYY������k��]�pa��m��{W^^>x�`������BUUU�!���$""b��%���oo��2eJtt4�����Z�v��������;wJJJ~����������;���:::����g���K�4iR���B���C�9r�_�~?��Sdddfffzz���#��=�a���]+��l������-Z�z�j__���g��������m��M�}��H$j�������[������Y���]RRRy���!d��}�?�����k[�l��E����A�-_�\$U��~��%K�,]����m��y��i|���0��?""��Y)dkk{��33�!C���?���L&S6������SO�<�����Y�C�u�����Tvcs��ss���-44433���[�����������b��o�����������Y�.\�p���-[��6���iRR���^AA��RSS�]�6a�������/O�6�������kf�@vv������S}}}e����]\\RRRBBB�M����������|������m�Z�l9p���� �����3����m�&�*��odw�

~���M�6�f����2e������s����s����T|�������=����<�8qbpp�����i�|||d�Hv��C�D"Y�n]�>}Z�j��37o�<>>~��1{��Y�v����������!�_�v��!B�����=k��mtt��g����SSSe|���U�,�����_JJ��~a��m��a�6m� ��s��UVV^�z�l{T�������o�����s����N��<s�F���������Y�p�R_���������(///��]�v���:z���U��.]*���m���g���F�]/3n����fO�<y������.4hPfffFF������{e3W�M�z���3gV�X���E���=�{�n��/  �m��,pqq������	(//����;w���e�t��=�L�������X�������NJJ:x������{����e3���k��a7n�8<N�<Y����[��544BCC�r���#F\�p!22�U�VW�\�����i����/_����}��I��������h����[�����#�#�����u�������XXX����o����������111�?�8�����1���w����
������h�����}dd������O�
v����w�8p���k������;v����X�t���s�-[�u�V�g?~�������e?�WWW/))y����U�������w������i��O��1�
���t��
UU�������^�t)--M6s^^����}}}7l��`0��Y���E���={�l��:*������egg������^^^6l�D��e?�8s���i����*�������r�P���,{�W�BCC���?7������jnn���]�6md%�����-{Lz��)����������;t� 
��������g�F����{yy��7������_3sJJ���G;v�hhhx���������%K�p�\�T...�=j��UVVV��C+f���2d���[�������
}}���X�����3����d2+�b$�8p�����5kJJJ��];�|##����o��e���;v����_�;6o�<�����e�SSScc����-ZggggeeEEE�.�������o��=n�8����w�nkk+��]\\��o�����r}||���G����.����{��������dO�����eee:::%%%�����d2+f�|��k���e?`�X_?y��o������o��Z�j���K�.���444\�j�l{��[��\�`���K��W�1�z���WSS�r��lf���u����O�����Y����������_�Z�������o�����}������+f�|�r����o��;�^��-@Q�����|����e����M�5kVNNNZZZ�~�d��Rq�r��9�{��YYYu�,�_\����0 ==]v�SQQy����M�����v�����#G�,]�������+�	!�oo�f���#I�^��������z�����K�.���Ow�����puu��a������+{$��[�������o?o��i��yyy���c��]�/~�����Ommmy<�����S�d��s�����x{����dwDW�\9y�d��w�N�<9)))///;;[�/�����?�|H�=q�~�w���g�>z�h�f����<�����O��O�����z�*55���w}�����JKK+((���"�������������������<���?b��#G����c���y�������b�8!!��_JJJbbb��o/�cz��E����?kkkS9��������;��p&N�x������	&��wO63�����M��ow��������������	�4iB��AAA���QQQEEE'N|��QLL��q�"""d3s8���B��!C*�,�~��w�P8y��:��ahhhvvvnn��f���{��MQQ��	*��"�W##�������}}�.]�*))���ZYYU\��
�:u���W����'&&&IIIc��IJJ�}�����_�|)����*,,,''���V1s�f�N�8��� �bcc9��#�?.�>m��}XX��i*���{�&''7i�d���_yU{{{���;&&f���B��������YYY�������z�jVV������dr��=�����>|�U�V��}����/�D�z�z������������3;;[CCc���_?���������O==�����;W|�=z���9j�����<{�LWW7==]GG���3�����������w��9��C�������HHH���urr
		IKK{����_d�C���W��w�><b�������'O����prr���������suum�����'����y�����P&�����d2�����t���-[�trrz����+W��y������������+�����Bf����Y?���}�vKK�=z���?{�������������RvH�||������7ne�[��a.�+;�	�����A�EGG��qc���jjjAAA���+--MOOWVV�|H'�P��e{�u���#���epp�����O�vtt�������������5k&;�'$$�k����Q��e{x���������C�.]���3e����������u��u������m��9y�������.
uuu
�e{o�����������5�����}�&M�T\\|��AcccWW��� �����s���c�V>�7k����w�������R��|�>xxx��G�����H$;vlPPP�=FX�O��|�r�����X�����u��5�������?�����#����_GG�6������aaa�__�\�zq���/^���g�����]��}�������BTT����}||j<gzz���'e<x���Q�O��V�\�d2�5k&��w��u�V�@���1}�����������}+{�I}�R[!!!7o�,//������4������ h@#A��.Db�Y����� z�T������R�8�P���^R�H*��
�/���K)��|%���qq��H*��m&����K+����'��?���EKE�;M�!���ej��}%e����<��9%y���6_C�������DDM�[�-��x���ca�S��uk��;�'&&���[,����_����������������e��/����oXx����V>E"�T���C��n�|��w��cd���]�f
�����_�������6m���cbb8��)S�,Y2b�GG�*�'��W>�i��i:�6����/[���wV������V>e^�Ic�]j�x��}O�8��������5���?N�����6���7/55���#((�������}q�*�2��lhc��,���p��UCC�3f�xUd����+1��i�]���Y|������#�,^�XII������^z�B�S���8�f�=z��};::���S��������|
��'��km8p���'����|k�����O�U�����7j_��p������+VT���Q���?�qJ�)�~P5�7o�|���G�MJJ������U,�]��-��n�n������i��'Orss������V�^�`0���V�^�d2.\��[�&M�dff�\�RMM����������7n,//_�h���[544����/_.{Ki777ww��s�v�������[�����(�����������333SQQ	KLL��#l��uS�N�����J��Z�B����)L�_�Kk���	455<8p��*�y������W>��v�T�Q�\\[[{����F�244�b����*��#�S������c�����Ty�Lx��T����&��f�&M����XYY
0���C�	C��c�9�`;��k�x�F�JKKuuu]\\�\�$�ri��������N}ky��X���7���W��8��$�p�S����k�������ALLL�~�LLL�����w/��[������6�V��UTT�~������i����#I�T�v�����3g��;���w��-��g�X]�v����o�.�������=;***&&���N�0a��U,k��E9~�������/�w�~��=kkk�5z�����������>$$DIIi�����o��������x��5{����%K���8r�H�v������?������U1����'O�����X�����_�z���?V�x�P���?�|���455�X���~��M���ys��yUN^}T�x�~�����^����/^}T�8��h��}BB��_��x�P�u���;u�T��R}T�x�����?_^^>�|m�*��T}T�x�n���;���7b��.]�T1z�P��K�,Y�v�������e��������U^3��A�E�p��9]]�'O������?��:u���Gsss�����Q����<xp��9L&������s���</22R[[�Q�F/_�tpp`��zzz���������Wnn���=����w���g��UO�<y����������K�����m�6����_�j���w����~��~����knnn��J���������exx���y�v��X���~q''�����7o���X[[�{��������obbr���v������{��������[�j��E�/^}��[�������LfWxMP�����NNN���FFF:t����@��O�2��b;vl���U�E�P�������t���{�_�~M���2��tQ}����� }}}gg�G�EFF�?����N�ruu�������=t*^�|����f�SSS�����=kkk[PP 
SRR�z���n��=|����p���aaa��u�r���r��)�P8b��W�^]�~����g�X�rttTVV>u�!d�������^���~�����w���F�Q��:������8|������a��~YE�P�����������s��y�P��������W���@���h����c|>�����������7o������k��777?~�������C���V�/niihccS���5@��?����Gc��Y��}����o�.�.���X4�g��U����i��{+��:m.���\'�'��|
���i0\.������������0��i����B��+��8�u5i;����S���+�2��iO]�,�(^x����)�������ie����V���@n��rY\�}U����c����N.�?H{r�����4��F���r�/���LN��#V=��8|s4��� h@#A��F4��� h����x:
endstream
endobj
41 0 obj
<</R20
20 0 R/R8
8 0 R>>
endobj
46 0 obj
<</R23
23 0 R>>
endobj
47 0 obj
<</R45
45 0 R>>
endobj
45 0 obj
<</Subtype/Image
/ColorSpace/DeviceRGB
/Width 975
/Height 271
/BitsPerComponent 8
/Filter/FlateDecode
/DecodeParms<</Predictor 15
/Columns 975
/Colors 3>>/Length 24117>>stream
x���XSw�����L�KT���p���{�]�T����8�Dm�[q�(�-�-*���H�9���-�HD���z��!i�s{�'_�&��@�6����t�
�+�m]Am�
j@WP������6����t�
�+�m]Am����4u�E�!+&?���>9�0�F��I���%���)Q��0O��&G1-.�(&��JI��>���!oV&J��$��5�ri^-��x>�"���<��f|\,}r���1����FZ����+����R?|z?4���'��Qr�g�[����=Am���K47�z�W�	�fg���;N�>]�~}�Q�F������u�����l���+V�x��Q��=����%_K�RI$�/�������c`�u���x?����������cG�\�q�F����g�a�����?����;���.pWj��U#������������s��G���Y�y����������+������kaa���7>�?����f�:�6�\ep�
�������}���������&$$�]��X�bC�>|x\\��]�D"Q6�/�O��.�D��b��G���������g����{�,;y����O_�z�e��-Z�HMM�T���{��=,��^�IK�������VS�>~�^�z����V���]��}{cc�
6L�0!**j����������J�����.>�O�����&,�i�y���C����������o��C�MJJZ�`��k����_�tiOO�l�3L��a����������X�\�
��;���v��y�z�JII��s�/��"Bll��={�Y�Wr�[}��0p�X�yi��q������xx���'k����[��{���_�=����������I~?���^S��rG�&��w�NDK�,Y�|yxx����MMM����0��e�~��g�X�y�f��.��w��u�]�l:9U��P�T���C,o��}���IIIc��U�T��w�y������bdd�~��l8|?P�?�*SM
�j;UE��e�v�N�v��9h� "���
�R����O�����q��u�F��p�B"�T����#>\�NGG����'$$������DEE=y�D,�/_���w&&�]���_����7��S	�IQU?����k��F��x���]����\�re���:t(Y�dXX��7��i3{���'f�G��pW�+��Jb!��q�X�R����k���'O~�����[=<<Z�n=l�����)S���F$=}��t���\|��}�no�t3��K�I���+W
E��E�;�������W�Z�b��+W�=z��3

endobj
62 0 obj
<</R20
20 0 R/R8
8 0 R>>
endobj
66 0 obj
<</R23
23 0 R>>
endobj
67 0 obj
<</R20
20 0 R/R8
8 0 R>>
endobj
72 0 obj
<</R23
23 0 R>>
endobj
73 0 obj
<</R71
71 0 R>>
endobj
71 0 obj
<</Subtype/Image
/ColorSpace/DeviceRGB
/Width 975
/Height 271
/BitsPerComponent 8
/Filter/FlateDecode
/DecodeParms<</Predictor 15
/Columns 975
/Colors 3>>/Length 23595>>stream
�)�C�^C��S���6��`S.�=����L��7��(�P����k��G�c|��6~Qm���[�}m�lZc�<����k���:��c����|�����X4���-C������C����j�b6-�t�h��.��+�<<�g���z������Fg�X��-�L:ta�y�)g�-�,ES�"1wH�kR���_��Sw���j�M����
�}�����j��Bm���Qmw�.����:�[�K����H���x��w�rwl��Fg�X��u+���U`1?��=��[����l�sy~6u��G��oRg�e��O�5��j>��AOJZ^�4���+�\��+��j�$g�JI�A��z�&������d	)�s�Q��Am�X�m���������6����s�3��\����]��Y�B�������y�����g��g�'�
8wi�{�3Ws�qX+QFp�Y���[-(y��VC������$���Bs"�-x��=H������m<mV��o=����z#�q����!|/�8��A���
m'�����t�cbV��^������^.��n]����>��23��ix:�����_���������t�9w����j��Bm��Q�+����jh��
��No�MJ�&C�����lh�r;O���&�V��xp��.�E�DDa�b������o����bM�J��l�d6k���e����E��#����n]����X����"t��r=~����*������u ���|��������<�w�-P�o4��B����c�S��{�U�N�����`�8=h������?�\&q���Y�����j��4��Q��
Pb��Q������ e�I����om��p�����]�?��%gK���~N������E�y\5vhY����,��?���?j���|%��1�Ku��T�w�k\�}��";���l�}K���s:rW��O�>.P���g}��_d���0�gE�Z6�?4=�g��?FQ�=#]g^_��� 	���;���%��n�>>R=�����N��g�Ac��yUw��*�	g�U��
Pb��Q����������� mg�z��W��hW������4����M��vy���{y�=L���gs�}m�
�0�WE���$<!�;����v)��<��7~{��(eBzm�V|��y����H���D��"�����f���~k��B��d��������cb��7���sCU"�*S�D��;��lC�����IRk���[{V������.�Q	k�b^N���}6&�VbY�������n���4vg�6��(c|���y��Nl�~��Z>7|w�9U�V�j�����k���$q��}J'f��j6s��V+��������W6���%5�}�w��!)���%�����s�aY���*��I���'o;�����`	K}��7�C{��n�M��BY�����xc���������*�b���Fm�����J�j����������"���d}7��n"�����}m����W_�?fu������KyUw���������+�s��4������m
jN�TO�$VMA�d������U����.�P������G~�KxF�+����c��!Tb�f���`6�<���o���;��jD�~4���+765�a�9���S:)+tAnm�nM��Z�_y?|E�_����=���)xF�994E��$�id���{���w
�;:<��n�1��,����o�n�`B����#����A�_W�u�K�?�cv.&*����I��x"�<��d��*��IR�����
��
���������g�eD�d[?�'"���I�W��������!�����{��'������[�F����d�e
������)j���Zb:m��n'^�����xsyD��dI#2��
�(#���~�M�f"����� ����G7����+Y���:�+��=��t�g���lJ*#���:B��z��oB����+<�Ro��y&��L������%_��<=��R�B�P���Q������Q�������3J-�NZHEV��-


=p���[a������c���]as���������0#=BC����������k[lKN�!?���v��9!Gi�6��s�}�C(�����M��%�t��Th��go��7O���U��2�e�I�������u��{����H�?��H���x�%��
��CmK���w*�X6����d������w�;���ud9�����JeM�-�<�OncH�<A��PJ���^�~8:���q������*{�eF����n�l{�oK�O^���]�/��p��s':�'�)2����4��p�Q�����:�����Z��G#��D����_[�,������K~�lt�{4���vb���^b�7es�����eo��na>Dl����~��j^�v����Cj�����0�������X�}wmQ��<����?��3d|�zv7�~��AX��p���c�F�1uliQ�l	 �x�����d�
��r�&q��� �I>H������.��(K~���c]
�Q�'��R���J%%�Gt)D`���������I%R�,�^&��h�D, lv%����%�8[��	$F���Y�4?7Z�~�_S�ym���0����������4���?{��TZ2s�&��W�t*�6j�]0�6j�8�v�n��o���_�5k�|��V�l(����������oUl��'�.y�!Cd�Wh����3�����
�*��0���d�F�w�Q��.�i*��3�%�f���1��3����C���tR����������t+�fD������*D�z2������=jF��9[�:�I�Z<W>s{�%���r���[W��vR�/����[�QQ9>T��Y���%6������rf��n�$#D��g�O}�G�m^�6Ms7A������M�`�{S��wfH����?��6%�~%���a�]�[�:l��Q�_��|�X�j�%�xO�>�}n$���l�SW��}g�mu��~��9~�|�J��U>��~aVp�!���r��tYi��l=_5�Gq����W,�<}�9A�e',��x ��)�
d Ef��~�����5�;>B�I�����H��O�RU#�]
+-�\����!��q�^����!��&?X��>h�f�K��{�]���,���;��$�;�������M�����	uS3����e�S�m�)�O���(�,oN
%U:<cBi7_$��Mk�fs�
������Z�U��{��dy���'���E������U�5~��AQ
������s3&�t�5�d'���������T�������r���G�J���Z���Z��3Y��p������^*nE��e�f:j"�^F�����:7c���{- 5����f�ZU�O��B�s�tb{�=U����W7��D������x�*u_W�
����y�rE��y3�8O�|y�Ab�PfG���=4;	j�N��g�3M������h���g�����3�6#�1��M
�O��lZ��G�3Vr��$^uh�I���bV7CR}��S=�w ��DjAxFN��� �BU'&�����x<������T��n��pi#q ;���S�?k�`S^d~����O�m��I"���_$���=������%��H1vF�����=�cM���
k�4�I�%n�G^���]��i	��22�P���]����:�����f��C��O#���>��S%���;����]Yk�rt[���y�mr��y�����<{���K�32��H������M�L����E��D��I��R�vL�����>@�H�L#��g��#����j����X��J�a���-��Nw�l>�"1w������77���S������v�<�W�vs���9�A��C��L%k:�F�I����5���T;�L�C:�,��S}�+M��
v
�lQv�k2�>Y��=~���J�U;����O�5e���7��S6���m���S�W����v}�<lt3��TU.E/w������}�E7cIF���?�;���>����=����V�6���6����<4�u%{���:��1�h�Ks��'�Ndv��j��_SQ���T;�Z����*ac�3v�<���	2�"��#n?{�Tm�Uk�^���B�-i3s�_{�~U����b��e������F�q|�������f�,7�f���>+��[Mv�����\wB��G?����������-gO�l�8{�o�s����R��]�X�|~t����������_$����G&H���(��l��z���f�n�����I�/Y�ms����]E�����i>���A&�	\���{�N�0s^�1���H�S�w������q���H_0���}���v�8������v� ?S�g��.=#_�^�l8i�v@s����LZ�l�x��7���9�O�]p��;���#����4G��$�WP�
���dS��=F"Z/.w�����)o��\��n������'?k�(�����f���S�a�n�[�d*k/R�.i`��7k^��Gu��k[��E��6!������^����'���s�����~��/cM���emYg:74��o������'���v3�5�8�)x��d��7��I&�u*�+����\p�=u�z7s�;MC���d�q��_��Z�#wt���c�]�x6���T���.����%�Q���7�c��[~���(uX?��Q�q�v����2#j�����H!�1�P�Az��&u��k�:]��{�X6~{���9�b��'�]�'��s�S5���I�hj;;;{���K�.���{x��=�`����8��sM�L�������T�YO�7��|;��N��K��e���R�2��"!)kg��O2R>a��Z��c5B����(��>�F�L���l1_�a�-��r�V$�&l�Z'��3y"K��x�tK���l)�V���T""��iz�&S��1�{
?�0j���V��DYfn�i�j�^��bS4D�B�k�f��Z����%�9���`��R��)+����X�������R|
�Z���U��)*[*����K�QI5�Z��u,�R�)j%���(�mh*�%f�Z��Is�e��kXB5���n�Ww%�����,S]�T�M2�8�&gi$)�X�gL�BG��@9��P��Jb�����"D�c,���t�A.��c�i��E����gn�%"}��J�(�XJ3�<}\��R�kv����<s��4��GbGX��J��H,CZ���d�DD�})�&C���b��m��dku&!l���Z��i�#���\�X��xb�4U��z���)�gKSJB�X/71�KR����ky�6�j.�S�l��X+�Z����T��i�m��yfZ�����5�f�R@K)��!;��w%�"��Bb�'Yzk����D�����|{�2�&���X��YbJ�Y�0<"��"I��4Cj��<�REI�e�5��X6�%����S:V��Z
����5R�Z���}T����B�T�-��q��J�V�80)a�o��0+�'�"��6MN�	�Sh�z*U-b%<:]� ��`�Y����1bH�Ue3�f"���&dYf��4CY�En�[�M�^+��+�91<:�!B�\l�M�f�6(�*!e��K��5��\�{���V%(L2c3��R�@G[k�r��e-UfU�g��,7'�Q���&2���&d
��O2��x�B��eF�*A)�!�B�*����2��t�,���y�\��'���,K�{�g�QR&�8���L��6�%�	�2e "��B��U�
���V�K���R�b�i��g������$*���:�o����Y���0:�&��"��d�����u�����T�9�����<[�!�J��J)������Y&zC�j�%�C�5f
�F%d��P�����HK&z���.���Di�$Z1��b3�|{�he�$�Fg"s�'��X�N`�*�lR&���-B�83������Q$��{TO��L�%|��V���
�L���{���1���4�A�$������@�g$9"G�K�4RK��99��
�N�`�!Ykc��K{�����l��Pg�K��,�X���U����
yM1q9
��N�8�XBx��we�Og���S�)��Q��T��,����?���t����,%���D�DkuCQ9��$#&�Z)������`I2b���T��i>H��x;��d�P������D! �z+�V@�8�?�r#��(��Ir�.Q$�3K����Y=� S
HR���X��bqF
-u`����GH
a�����w%!�l��Hk����P���"C��*'](��sD|ml�U�cD-�0d�.�g����9Y|�"�m]�$[��{�htT�RDd<���j���3�'��<K��K���je�Pbf�ISL|�Bi���FH����g4e��b��D�T3*wF�JK��D�	����L�):��x�4<K����yY��4=�Q�7��)��;��4j!#��2}v�=Mg"��L�Ll����Q)��8��$P�my�J�O7��M$�L��:q�N�$9�R$)��3$,��!R{V�j��~��R�m^O�s�J^��N���
�z�����������WT�m7k���������;wn�P�Ym���?~�G�<���X���3I��KQ_�~���U���������
��%}��.�������$N���v��m������
������������{���(��;�9_yK�<].�_Q�y/��������T�t����5���[�n�;�e���b��eK#�����w�����oDD����7o���'!d���[�l��{�M�wv�L��S/��'M�4s��O���~�*U<X�T�i������K'�j�B�_�F�����5s��?~<!d��i+Vl������wb}����NN}����������Q�.\H8p���c����r#�o���/_�-[��t��a��}/9r�u��y�:��I�&�_�.x���[�z����{�����
6�x����|z�o��!  ��}��={��|��,�������5k��m������3���?,,���T�����|�&))������M}:B���������������??�N}��k����w����Z�����d��^�z)�O�tww�������?���������:u�\�v����y������f�h4.�0a��W.\�p������0��w��@������G������]d�
���/��'O��y��Y�w�,��U����p��z���e�B�������<x���v!��s������l�W�^����}�DDD��]�e7n���#�y��/_>y�d��=_�������W�����?>CBB��)��ukK�����=
0x��+V��6m�$����x���\�bjj����wtm��aLLL!��������+�-�6mz���|��]�v��O�t�����;u�TPP�_��R�v��U�o�u������D������������;99�Z�*99����������� ������~~~�/�^?a���������a����<y����s�CCC�n����?/�V�Z7n�`&  �}������V��_�s����������u�����7�A~�����?�U���0SS��~�y�������~\@bb���M��������������j�z��9�V�*`�/�o��������;v���M�<9::�Z�j������
_�ti����,Y�R��U���l����=�����<y�_�~F�w���[�f��c�N�8����U�V���?x���)S<==k��Q�B���������B���mBBB���;v��y_��G�m�����;�b����������{��moo(�J'L�0d���66l����L&c�����y/����iPP���srr~���]�v�\�r���=rtt<q�����G�������W��'�����E�R���f�������p����~�����i����q����W�._�����w��=x��OV�j�����_�.�����U������wTT���A�
j �h4XP�r� �����^��+���.e���Pd�0(]x�WAA�8��2�j�Jv�<�D�����/���73�9���.kfffee�����g�edd���HIIegg���HIIQ(YY��������������}������������F//������x#""���r�����\AA������Z�PZZ:99�B���J�F�.$$�D�������H���~!!!��leDVVV===___

��'O������yzz������C��?���-**M233n�����������ggg���+++322"""���i�v@@��������������������,//�p������<
566��}���������?���������|��eTT�����$%%mmmDoo/��ngg���Ruu������@KK���+]XX���������������c��������FFF�����466


fgg������333#�����nW������KLL������KOO���*,,������rrrB�P���rrr��������]\\:;;���sss���ddd8p��������d����iTTT������}}}������������Q(TII��{�LMM�T���IRRhB&�KJJ���+,������<22���YRR���|��U*����522^����GO�������:u���MZZ:  ���[���---h4:>>�����o�Y^^VTT$�H���~~~AAAX,6##�������,$������~�������'O	

��������DYY���?6������p�@ &&&�D������Qaa���-'''�LNHH������NOOG IIIX,������|:����xZZ���������\lllNN���~���C{{{yyy%%%���������������Wvw���G���eggW]]������������zzz@A���-00�@�;wntt���wc����������Q\\������%""���/,,|�����A��������?fff���:�������[[[�������D{{{<������������?������LNN*++��q���������g� �011���t��}��?_-����DjkkKJJ���*--%�jjjT*����Dfff...���rss;99������������***?~�@�����������?88��{|�����XPP���g����T����������v�������w���������I333p�����BUSS+++�
0Lii�����O755����={��}������������%**���v�������������O�vvvzzz������200,,,p�F �o�>|�0�j������0LUUUEEETTTeeePPP___qqqHHHcc#�H���F��)))"""������111�����v��
���tttMMM��[��MOO�����szz�����������������������+W�����;���$'''++���NLee���Bp����600X#�'��d�������


���eee(��������Z[[������DGG_�zuii�����}����HUU������~������������������������CCCKKKo�����AA�j�Ed�=�����������w�������,[����DA������;//oOOhkee��9�CCCMMMvvv������!!!zzz


###tttmmm ����G"����^^^�����{���g�~d���(++��x^^���xgg���>,�F�������-,,bbb�\�"))Y__O�PjjjV�N�>����I�%$$t��9fff//���F&&���//������$�DG���������_'O�����hMM������������\111gg���h77���T--���/����&?��Cmm���tee������������?�@x���N���������C7 ����D"����T*5<<���������������H$ff���y�$%%�F��:::SSSW�^E"�������			zzz �8x��{5'�m���677��{��}||�={F <==���������i}Zm;--���'555����������������=<<TUU

>X��������v���?����lmm���SSSCCCaaa���INNN�=#AA���f###;;;����w����_�KII��{wKK777GWW����YYYW�x���7P(Tnn������������6�,..������������M��BBBCCC}}}���111ccc���x<���������l������_^^����H$vtt���������566��*��������


���uuu�h����O?�$((������w�|�����9r�L&���a0oo�_~�%555$$���������LMM;vlqq�����������H���'��`��������}���Vy]9P�W�m
�������������������:,K�Y|���>�����S�� II�����w�������>z�H@@@KK�@ �D�c�����eddHJJ���������~��<9�����.\�PPP���n���������yyyYYYMMMjj����;v,/����*^�|�k�.---"��===>>>,+++���������D�����x�������{~~�����'O�?hll�������g�����/���:::���$$$h�
��%Deaa���Y-������������?_RR����������w����322:::!����k�����EGG[YY����899?�mKHHTUUIHH���xyyuww��sgaa��ZZZ��ys�����������������������pA��\�W+��.onn��;}}}zzz###��;wFGG��x0\/$$���fff6>>�c����___P���������L&c0�>i�KZZZ���GGG���0������;8��W_MNN�����YhhhLL������W)Jnn.8H||<-�G ���jn��)&&v��������qq���zgg���
����bee���RZZ�����H$������yxx�����S�]�mkkk::����+W����ttt���\�vMYY944���FSS���������+%%�`&���#�f�N���&&&���zzzg��������~�;YY�G����yxx���211�����<�0>��[YY������5::
F0�={fee���z��-aa���O���Y[[������KKK%$$^�xq��G6d�����������������ZTT���=66FGG���:44422��/_�����'O�|��
�������ccc���(*::���HXXXNN�������D���������a�X���������L����WJJ������0..n~~�������`&CII���������nYY���O���DDD@�uShhh�����r===���,���������.**F�PA�������]�p������BMM��={���������nss����{YYYeddBCCuuu������[������B�m�={vhh������%''�����1�Dz�����xUU�H���[XXHHH���%�����_����VPP�h.\��������o_xxxFF���GmmmNN������(h���[''���������kkkCC��G�����������\�@����;;;_�z���eee
����VSS��AAAh4���)???%%�6�&����<�����G������gOKK���]ss377�������H$����S���
���������Rxx���gA��������bbbx<~tt�69
��AAA'N�@"��/_NII������[^^vttLKK366������{/oSUUU���������_~H"�p8���Illlxx������x�+�mFFF11���������`++�����,...//����g'�'N���8::JKK������NNN��x��AAE4QPP�����p�G�H$mmm2����_������c�X0��������6�=z�(��:88�p�O��T*��Q(���X///+++���!*�ZWW���kdd���BK�@�@ ����}��ml����DBBBkk+;;���|\\\^^���fFFFNN;;;�DXZZJOOggg����@D���yy���j


ss������(
�"%%UPP���iiignn�
�*..vtt�X���_������URRJMM�`0���X,VZZ�k����Tjvv6-
uuu

Z���D==���kk���fFFFUU����'N����NMM���-,,TQQ���x�����OGG�{���������������^-"�J�q����8��������?~�����YYY������CCCAPP���K���w�����A"������===���=�azz�B555�]�?u����������7��f������+555���STT]uuuUU���J��4�XP^^��#G@�w�����D"qzzzdd���AAA���666QQQ222���'N�hnn���%%%���>loo


�6m���_~qss355%��


�ELHH���

���������
������������6&&FDD���p�~�!99�������&���'��������/t8���T����������|}}���#""

{{{###]\\P(���bii���������%����������6P�[��%���?�����WSSI��}�k������Yg��ZH722����_9�)++��6lO�S�N���n�����$���]{}*�m��|�qW�������
��/��K�����kzz��+.6��grrr�n�����-�;�����woHH������C��?�y���������Z�����{��,����v���������?������>���������������EV�����[��^������U��G���E�hg�M�����������-���xnn�j+��������X��m�4���XZZ���������P(7o���������(j�5�@wwwHH�Lnnn^XX���L�Gjhh�����o�~����K�v��}�vMM
��XDP��
��M]]������;h4:;��O�]���9�[��������������k2����w��}���zVT�?euu���/kkk���
���

]�����V}}������#()m���������97���=  �H$��h{�)++������������#G|||�-"�@(((���g`��S�6�����RSS���A���6�����������n��A�.�
&&&�TjKK������t��+�����a�tLLL/^���X+IJJ���l\� h+�l�6���LNN�����K��b�{<m�
�V�1�^����.((��Gh	

b0���)0�x{tuuqrr���k�"BA[
f�����{������	�
�����m���dffnsR8;;����m�tqqq��]k,�
lll�� �?f���\\\���*++iK���������XYY�������{:::������g����������M~~����D


��������QH"��-A��`�
AAA[f�AA�U`�
AAA[f�AA�U`�
AAA[f�AA�U`�
AAA[f�AA�U`�
AAA[f�AA�U`�
AAA[f�AA�U`�
AAA[f�AA�U���H
endstream
endobj
74 0 obj
<</R20
20 0 R/R8
8 0 R>>
endobj
79 0 obj
<</R23
23 0 R>>
endobj
80 0 obj
<</R78
78 0 R>>
endobj
78 0 obj
<</Subtype/Image
/ColorSpace/DeviceRGB
/Width 975
/Height 271
/BitsPerComponent 8
/Filter/FlateDecode
/DecodeParms<</Predictor 15
/Columns 975
/Colors 3>>/Length 24455>>stream
x���u\������l/K#e�]���q�w��gw���r*v`c��S��=<���Y:�g~�z'�������Y��gvv���u��8�(�6OP�|Am��
��6_P�|Am��
��6_P�|Am��
��6�����2��o=�S���fpN���>��h[����O ���cS%.�Wc���>\���\����I�KB��7��D$���9Bb�����,�B������H���p,3Y�(�pZ-�������<�'�T��>����*��N��86kS@1
�<��oj����_|�������/����Y��������;wN�0A�T�;644���k���]�t2d���k?5xl��\j��m��r�a�K������q��1�a���={��k����&&&�?>11q���,�~j|�q�>�&�V��y:7n�X�~������W��}9����/!d���K�.}���?�0~���G6p���Z���8^���|������_����^�fM�D�~�����K$���7S5r��Q�F�+W.��N�&c�����y��.Y�������G�)[����g�=h���}{�*U���&M����w�����a^��������9|������n��1<<�[�n�+W�<y�T*]�zu������v��!�H>��c��[�/i_C��,����W/�J�k��!C��Y�f=}�4  ����a��[�n}����;���/Ei������������c��'&&�5J��v�����d��E��o�}�v�lmmw������q��l8!�G#����mN��8��u���5k�H���+Wv�����t��-#G�LLL�9s����CCC�����_�l�Y�2h�6E�`q�)Y�������5j������A��r����3����������{����a�l�oj��2c�:������Dm��������;�������g�������B�����f����_���+V����u�K>XR8��r���K���a���Ys��]�L��a����^��_�V�Zm���d�������iS���OR�� ��zA�4��K�5j����X�����Y�f111�����wrr���?z�������c�����G6p��;�	�Y�$iS�����!b�����;wn����-^�x��-�7����J�.��������W�B�l�;5�L*F�<����:.k���{���c��)IIIk�����_;w�������t���m������=:�cr?)e����v3�_����~�����{�Z�*%%e��)%K�l�����k/\� �����/�H�u�����O��Wm��"h\��]p�����K�-[��������5j��]\\���W������E7n���K�<m�z��G^b2z"S����L�4�������+V|��Q�J�<<<T*������gjj�����C�0����'��]���H�V�-N�>�`��.]�8;;'$$l��Q�P2d��y<��m[�
>|���O��T��������6{���c���-[v��}���7n����qrrZ�h���3�[�D�7o�d;8|�P����k{��u�j����;v����E"��Y�����\�������s��.]�������gJ�R$U�^}�������j��y����K[�n]�b�l������_�x���omm��S'�Ry���z���(Qb��Q+V�X�vmll��i�>9������;w>t�����^�_�d��lmmk��)�<==Y�>|�P(,U�T6��cm����c����O�4I��/]�t��!+V3fL||��7��+gii�e��l�;��~������;u��d��a��S�T���k������J������i�PX�V��������[�N�R�J]�|Y*�V�^���-))��������<yrHH���U��\m2�����������*^�x�-�/�}���c��n�:""�����6m�i���s���m�?~|��uU�VussKHH����^�z��=�M�V�t�[�n]�t�����v�s������^��{��5k��4iRhh(M�������f��a���,Y������$�6��%��NIIy��aXXX�N�7n�d���w�����k�����)))
�'_��sm�:u�a��^^^K�.7n���EQ~~~;w����+S�L�V��B!�e�$���s���';v�8{��/^��sg��1~~~�
���2d��K�~����G�~<r��Mi��epp�o��V�N�.]�������+W�l��-��qqq�?677���_�9(9�6!$s��M������;w��Y���Saaa�o������7o~����={~<v��M�������o���`��3g�����{��?����������u�V;;�J�*e��9�vdd���G����5k6n��-[���u����[�n{��0`��^����>|�p�2e/^�~���.^����s��������c����m�6|���K��������G��b��������_������U�T1b�����'��������]�c��e��9�v�:u�_�����k�.SS���7g�o���#G�l��aHHH��}�n���#��6��%��&��y�F ������EGG����t���_;99i����BH����<��&��x�������R�T�T�%J���FGG�.]����:���(�l���3��upp�J�aaaB����6666--���I�VGDD����d��=����u�y��������itt�F�)Y�$!D��P%�>����V�������3�O�%D"����E"���]dd�N����c�����2�@xxx����sKKKss���(��������	!Y?����������q����%Itt���m�#��	!�_����VVVqqq)))NNN�&,,,�@�y�F.�[[[g;x���y2O��������������3���K���lm��/_f�*/_�411���~��)�0�$""�S�>��Bm|_���O��!��{�3p[���;d���_ u����
��
�W(�i�n�R�K��~�&�zBe�����8�n`���� �9��'���M
mJ><��z�>��d�(�����b��G����e��m�d�'�����_�1����:oN�����IV�R�|���>��+��grIW�oSB��p��L���*�:�����X�Mg�e��z�L�m�������j�/�m�������j�/�m�����?�����o�����k�4k�L��_�xq����Z����C�^���g��!w�=x�`ww����M�6�n����<22����"����Wu��)_�|�R���g��!�=u�T__���@__���o?x� ::�Y�f����W�NOO���*V����}@@��-[2�`0������(���������h\]]AXXX��}g���l��q��������EEE5m�t���3g��r�����������E}��(Dh����\3w�$IOO�q�F�&M��o��K�>}��5�`0�^�:--�������>>>=z�([��W�?���I��6_P�|Am��
��6_P�|Am��
��6_P�|Am��
��6_P�|Am��
��6_P�|Am��
��6_P�|Am��
��6_P�|Am_���DQ�{P�P�����U��B�����6���h���Y>C�����6��
�����P���6j���@m_P��m�j�
|Am���/�m�6��
�����P���6j���@m_P��m�j�
|Am���/�m�6��
�����rW����[�l���o�6m���S�l�}����][,�������q��������;y�m(�P����'O��-[���J����g��)_�|��%�J���[mmm_�xakk[�l���k��r��
7�&�U=��g)S������W�t���k���-���������I����u��EFF�)S������l��%AAAOlp\xx8EQ|�(��ZQ�_u�;8EA�
@�D666�\3�=d���k�6k��]�v<xp@@@pp�B����+S������9s�]�6s��c��}�M�"N��u�ks�11-�](0�������g��j����*((�|��*T8s���-[���t���%K��|�r�����*=��^�
��$���P���6j���@m_P��m�j�
|Am���/�m�6��
�����P���6j���@m_P��m�j�
|Am���/�m�6��
�����P���6j���@m_P��m�j�
|Am���/�m�6��
�����P���6j���@m_P��m�j�
|Am���/�m�6��
�����P���6j���@m_P��m�j�
|Am���/�m�6��
�����P���6j���@m_P��m�j�
|Am���/�m�<�m��(�����oj w������������s�-[��?�������wppX�v���c�����3u�T___�w
;�6@�j;11���|��A��??y����W���+U��T*W�^������K�Fcaa���������@�����+I��o���z��]�~���[j��A�J�r�����?�V�Z��U9��9s��'2_a������{������|3g�(�]�cb���'�sQ�c�����#�l���Q�F�:u�y���������kgbb���5w�����U�V���+���.L�������oy�mjV�����|B9����3�G ��z�-��@a���l�%4M��l��^I�O���P���6j���@m_P��m�j�
|Am���/�m�6��
�����P���6j���@m_P��m�j�
|����RS��u��}��3#�g����_���	#�Y3���3#�g���P��n&�����B�
�@m���/�m�6��
�����P���6j���@m_P��m�j�
|Am���/�m�6��
������h���N�o��M���������m(Z���J�,4��w�5�m�j��6��
�@mC��������m(ZP���6��
Ejx��^���hAmOP���6-�m�	jx�����
<Am/P�P����'�m�j��6��
�@mC��������m(ZP���6��
Ejx��^���hAmOP�D}d�!2\>xtA��7�
Ejx��&I�kB�;s��w�������
<Am��I�5�7���������1��U{�7��D�8��\$�Xj�y������Y�"�
Ejx��~_�\|0e����2�4��0�]�,y��Q����G�2�/h}�HR7���V��^/��.*�}(�P�P����'�������R�+e$\�?\���~��9�$���m:���*I�),_Y��*�������w���;�m��g��#%������6�$���qEQ�\X���p6
�}m�
�D���g��=JX���S�6@������WI�g�����6#��
<�Em/\����G�9z������#n��5~��r�����g���4MO�2e����{��y��@b:�f~�\y��#P�y.����)�&LS)�E�oW��!;��r���<���j����`�P�v�����������o����������n�2u��Q*�~~~�K����b�����U��=}q�8�{��:��_3^R:�z~������G~��������<g+:~���K��JK/)�B"X��:]����fCKB%�}�"���N[K,r#�e���y>#j��(�yy���g�i�M��6z6)������m�B$�f����'_R�����x���+W�Ri��5cccw��Y�T�������y{{�={6�&���4���1���?�^��)��[��n��������eH��s������q���E�SXn��4�y�V����q�c�u�V�����^,/r�,r6���j�)'���"�G����%����O��UL	�L[������������~��/F������lM�/������q��[_>y�Y�t��R�7^i�M��I��r�DJ`��k�q|�����G��OHV3e��I��R�A��#������&W���6��i�.�}k�'b�I�^7�A�&j%�g��zV�c
R�����L���@D�*-'Q�NG(�8C��5���T���H���'��WQ�#G�������aB���5��Z��8C��X�?�l��
���`a��#
'�����}b�������(���Z��T��|���oVn+�6E��0����=�e�d:����Z�I����K������?�wZ�����'�N*��k�&W�+�����$���^nuW��[��X}��^(�d2Y�r���f.j;,,l����/��URR���S���3x�`ggg///B����>k��}���������v����Mh'n|r��m���?E*����]�y��kJ�����
OW�n�������H��sh33Y�F��IL����7�0�Z��1�����O��$A�d&R������l�5�}��~��Dj�i��b	AKC���;7�Kd�`�2��+�v���?uW��I�8i���.B��\�~������u�[R�ON�5�M����n�d��36�n�O	��S_���d�d�5�y��I�R�=�����j�n���^�kM��iL�}S��s2u4ru������_�0��ed���6���'(���%��gf����t��4����-�^�l^o�.u�*��Q�4���f/6���)�|���i�Ky�v�����?������_���}�g�������m_���P�������K���<�O��vI�!�X�C��wn���v�
(��[�W�CJ	Zj����T#_�$��~��@mC~��m�#>��?m�\����[}^']���g�Ed�9��a���q��M�g�6Y>�����Z�}��T��uu��i�A���j^�6|�����9i2jg��Y�����j
�B�	�E��tpi����W��c/��sj���@�AmC����j;o��>��J=�����m�J���������NO��m���	�
Ej�W���UHj[k�V������>�������.��N��K��#���S�z�;�lig��P��7bB�m(ZP��Bm���mY���;��/���qB��^Dm�WBm@m���N���`0������NN|W��v���#�����J��3�k~t�u�5�/�4a����]���B^BmC����j;o�ms������@�Y��VZ���l��j�m���|�m�=jB�o�G��=���sO�S���J�~W�)R��#�W���4�U���U�O�i�����k��.9��n�u��j��6�P�y+�k�������f���X�+�V�I�!#��.���)��|��>�*����=������AI�WjMN������e�?�n�����l~���y�
Ej�W���U �=c���&�k�����u���.�I���#I�'�N��$i�f�+9�������_%Em}�m(ZP��Bm���\���T;�Y��u�>�&C���z����]��Q��
��}��&�\Zkh6����W��R����?d��p�WL��y9�7�
Ej�W����]�v������)���~�
��7^�z����n%&,K���y����A��J�� V�/Jiu2dZ��������k)���
'�
��1�6-�m^����wU�1
�������e?Ej����o��[Ug��c�5�w�+�����M9������]>%�Wd���U��>���Z��C��9�l���X�v�P�P���y���[E��O����D]���S{���zQ-�/:TPd�����~���:3U]���������lAk]����V������v�K��_��w�&Mo:�w6j*j;[�m(Z
��Y�PZ�6�ZQ��7�^KomZ�`�����������W�,��m�l����+�����Z�������8����e�����pZ-%z���ZBSt�
�<�:)z�B x�XR�����D���
Ja�mm:�2��H������H�N�������/{�1�-Q��;E���%|��X���$�6d�]�j�)1�@�(4���zc��1�k{�u��qm�tIt�
����+�e�,4������_6������^�i~*$�i�}2��mD���Y��|�p�giye"u�����0��@���@Z�Xx�y�Sv�'�$�P���=�5��)������=����ph��
������v(�� n�"�f�v���=	���{N2r����
�t�.kOO�4�!>AZ��f��yz�]��q���D�L��ow��Z�j��		����L\��l�|�y?"�d���jFB���j��&�g�9�P���zV�T�ja��?eU��%���\MW���i�I>����{GN���>�V�����H��3�
OG)CZ5��W=�����.>N�9=WS��'�k{�A��g�CcQ��k�m^%Mm����?�f�6j����oDh��_��	!�|�����~���h�Y;A�{��1���z��e�
U����R���������.Q�i@1���M��QB����
���k��e�\���t�������L�����,�{��.����bG��q��	#w�+����s{`�����j;o��y��5�a�eo���m���S���H�~�g�����C�NP�Z�k8��Z���J�T�Be���/=���]���q��6|�z���v��i-*	Fh������c�(�����Q�C
���-���s{`�����j;o}����n�Y�)/����q\������U�X��;��`9����������FmC�=��!6�z������ok�
j>��+�v��j;��n�im��k���N�����%���I�Q��
~Z��k5?}Qoy�u����Fm��Q�y��f�<����J���U������g}\��z�L�������wX����DeZ���|���I$�U����/�k\�e��BNUp`8���	�#���cV��c�4E1������H��^�����Fm��s':]i+�&zM���,\���n�%Gu�<�K��D��<O&��R8k�z���o�������e����B+L�>�0���D�����T�L���1W����7��'�H���t�/��P�y�;�m�����g�������~O������_���f@0�r��
I�1�l���nWJQ�v������������Pq����.������hZD�)��M��i-+&����j����f+l�sc��:#���]8k�)���2�Td�����3rIRq���>KPu������<�E"��������m���;/�����t��Ifs�2>d��	%0��
gm_���;�����f��9�P�\"S�,5�����.>$L� �����z9�������f�����Q%JV�v�eC��x��6�Mzo����I�_��'�6��\�o3�'�v���nH���Q����(�Z��w��������,�����~4�\t_D���2U�Kw���������b.=�����}X�j��_�d��i�>��naZ�2���OCm���sY��U�)��0�3fF�v!���+=
Q����n��'���{��gN16�<�b�������iy��
�N��y=�q.d[���3���$���x�����W���T���v���B*
m;�P}J�N��*���[�i�AR�������4��/��A%R�\�)��m�+�.������!��~���^��v��U��������j;?j{�B�9gA����d�v��d�Y��|�Nk����{�jgw��������������jA����H�Yi!���ZEh��G��?�]8k;�e��U?�Z������j��m�^3��0#g�K���&���Ae��+2_N,h�U�w:S&Ui����Q>�j]��$&V���q���V
N��Vp`�����7�c�<��[Uk;(�����+�L3��s{�Q������c�~��q���^�C�m���>��E�sN��mwi����=c������5�������V�2�N{����	�U0_�:W?�eP����Fm��sP���`�@6�S���X�����=�������������o�8D�R�����w_
=����clY{�g������V�O�S��(��������Q�Ej����-6l��=���P�}lI�vl����U�g~o�4�U����<5��Df`I\j��N�/�	�\��}����Cm���j;A�l!65�6��������~��f��Xn�C�N��4�=�_���Q���G�ng�v9{�'�W���fs��nG,��R�Q<lJ�03���j�]�������d6?����d�����|m�N�_s4Q6;��e��S����V�������DY]���t�{J����c�_q��t�e���m�6j;���K����nY�\|�:j���m���}�oJ�O��Wlc����eJ����U�����cr5#!�e,�d���K�$Y������.�����eJ�r��o���%M/��WX�1��$�3~
N�����e3�m�j�M<[��Fm�-��1�*�����E��Lm�y�d��^���������L3H����e�&'C<�+J
}[��&����@.�.Su_�f��m���	�[n~���I�3�"k3U�7��M�_��	���.t�Wj��k���Ei��o������k�Q�dQ�J�s���NL�:���o�\B�9�@��;�"�_�3�	�9���mz2�c	�:�����E=2d=���]$�S�@�f$����=�6q�vY������5����	�i��c)Fc��������\�������~e�>;U���[��o����7���}m�����S�-��������`v>y��W��3��������^�c�l��G���y��r�q�]��$�s���TY�v�����{	�����m�6j{S����FV�Y���QG�����NO�,�x<�iQ[X����>��--���x����C�����e��'C:����j����o~Am�_�^������1R�����)1����w�1��8W���2!�~�i��
%eO����2�1�p����<J��Q�|���v�fj;���mmG������|�xA�������3j�E���$������>N���}�J�	I��jO����\R���!*����p�S�`�}}����Q���ML�c�.�k^���s3��*j����\f��Tr����W������L9�Y=��������ra~o�*nfs�j������*����>N��~;�
]�)�����W����%x�>����4���S�l���o�Y��$������j{4�'�E�6U!�SG��D��7=��Q��_�t�����x����C-cfDm���Fm������j�e���N2�������)6��fz�����w6�����+����i�*�-x���t�,��������!WG�?�-q�����a�vY�g��O������\��Yo���Vt:j�]0��G�r���s���Fm������}��c��]P���Am{����������x��G�\j��'���U���f�3��]j��t�������2�6���['qO|�Y���#�����"���,H4r4#5�M��D����|�����u�4_�)�Bm��Q�<Bm��Q�9@m��Q�9���>��y5���R��;����Tml&��0�j����6����q�	VN����m�6j�G�m�6j;����6e���Y�����\s������Fm���o���2����m�v>�6We��+K&�`)6��
P����Fm��s��6���	��zrlyc?�������OF�6��NW�dW����������e�	�m�v~�����I�����uycfDm���Fm��]8k��z������}J�����j[� ���kTz�@� ����4y�TS��l�J����5r�,�m���V��+w�j]�����4�z���Z9��P��m�6j;{�mcj�Pe����+w�j]��IQ��Om�$�8yq^��O�+���d����fa��\���bF���b��O�v�6%^���(N.�%
�9��P������G��c!1���(#g4j���Fmg���Fm���k{�J�Q��%f-��vr�A%����j[�b�������Q���M����cU#G3j��]�����Vs��#���H�m�6j��=�v���T��%��0�����7Z�'�'������������g()�������5��m�6j����Q��m�v�P����G_�o*V�v3��Q��QmO�5+^����Pe�~��nRj�/S�E�������Q#'�����t�r�{����������3�6j��#�6j���6j;���\\`*2����j;��{�r�,�T}�Q�wg�`�%������g���\��3�����;��ri���j�k?2��l������%��~�
j����j����j����j���sa��]5������_���_����J�+{����D|�����nT6SL*]i�Q��^R��P��m*�/RV���(j����j����j����j���z���m��*v2rFcj����R��r�+NjxH�gM5h,*���=��!$wOw��Q��m��Q�����Q�����Qm���Q�e�q������f
����M#gDm��Q�<Bm��Q�9@m��Q�9@m��{an�2m�W7rF�6j��#�6j���6j���6j���6j��6j���6j���6j���6j��6j���6j���6j���6j��6j���6j���6j���6j��6j���6j���6j���v�����k����/����I�&��������k���Fm��s��Fm��s��Fm��s��.�������%K*���[�����x�����l���j����zp�PS_������T=`}�6L��7�����]?�n%!'>�{�4����������ID�����^����(���W7�T������R�����%}e����u��l({���%�>2^*��1��^��������2� �r��j}���'�V�*�R���w����K:�vo`g�"��_#��
�(�t�>z�{R�H��w^�����]��i��h�5s�=��������������+!������R����w60���Y,�Nun�4�ny����	�p2��s��mD���^#g�4����
��V>��E�K�������S|����A���i��^g�����7�7\����(�W6�)�{�}��n���������Z���>���Z����~��W����"���7�����7Y�;��ym��F�!��lU�������(�#��S6t�}���8n[�^dVU�J�����2�������.J��"��PXK��YO��9����%��T��1�=W�+�y�D���Db�m{��n������S�9DA6"���_����C���;�?:	�'�u�&py�E����2ob����bF��f<���H��c}�{N@���i�$t��_�d���
3d�NE�\vo8m�vps��!V#�4������v����]���>t�[v\}�[��W�|�?#������3n�����B�EL{������j�9Ck�d��T�O���nGJ�z����;K�xK(���u�
�)�X��f����U�����<B���qpy:��o�o���w�P�����L��S1������jsM�,�����=yXBZ�Q������',S�tno�J�U����
jf���^d9����_K:�e��_�,�����x 9�%_��u~���=*Y�`��?-V�)�L-v)q��U�u�E�0q��������O�14�w����2�`���vk�7����������/T���/�1���klm7�UM� �w�w��cV�ne��U���N��w�s���q��)q��������*���f�m$��r�'Q���\YS�����]��q�!����'��kZao�B%��h���d���j�be�u��y��3-����u)F�V�U����s�e�u�����z���#g��Fc.�V��������e����R+O}��6�s�����^qo8������jj�_Q�1����4yG��������O������������I�$��}�V\��]������<���[���d��Z��5�fX�3��1:���(6=���9�Wy��(3ju��+7wz��hy�4{J@�8���kW#o?�u�����;�^��-��:��vzW�h%7���=�@U=<��7{���w4~�]���1]���}�3�t�k��8��-���L+MVs�N_q�IV����vAwQ����FM�e�q���j�sqN�����V3r���i�Y�:*�Og�37��eU������w{MR�fcwU�������<wk[U6�:�>}"c������j��sa��~
������i���3ei��:t�����xq�4�x�<��y���d�����n�O]e�������&���$bHU���ym��F^�q_������&cu�_�b�f�L��/�-�j�;�w��8��������K���z����6\����9K��(�:��N����k2UL	��Qc �f���I+�m�7���=]�n��w`M���.'m�j]����X����qI;�C�6}��zmqsk��J�����|z����]
cf����\����8w7Q�j��k?��R�����4NF��#����.�r.^�P��M��_���P�=\�m}��c��V���Ecf$��]�nT���8�4k���k�k���K������:(���t3>q_�Z���[�J]��
7�����HU=2��,k��O��T����k��i\\��u����J�.���lnn�d���� B��`X�`�P��S�"�J`/dS,4�R�!F*s0U&�L��� �����!i
K��������U�ZKH�Z���D!e`�4�q7�V�D�x+M��5D�MMc�S-S%�3U��G�4�5�4zJb��E�������T���D���S��5��i6�p&�hK�&V,�6M4�L�J���M�E�$&�HbE�:�L�O��9m��>MFQ1�`�ML��B"�H���9J�a�%�;uz�@H�SIZd�u�L'�1Vz��SEP6S3�*M@�)�T�aOS�,ge���,"���'�N=��Sr�>��:=U dLtr�*2�:M����vlCog4��0"sB�@��d����X*6U�)T������oO�!�Z�Js\�Tl���M5OR��weB�V�YP�8]EImiv4�NL���xg&��,�2���8��>��:��H�Dl�HO�Q�h=m��Y��b�2�`����D
.����9"��L$l�RD��$MO�FU���N��L�q
�6F"�7�K�I��2�P/O:��b�(�4EoONo�iB�)��\�$�I3~�cO-��
�!�N�� ����I��$�T�N���
�>����(�������U��R$�%��Z���S������k���	,��#i�Q�����T�6D�Z��t�4�#Q�	��hL��DZZ��m�)EE����6M�$��D�p�:#o#!�Z`+5D��Uz�N�1����d��A`��4z���th5B=E����v4���\'��bb&��
���u��X@liNk�S��4�y	���,����Y���S&E���%)����S'��j!!�go��L�8]3���l���\]El��&!kuMP)(������5���J���1��5d�u+}T���RzcWW�-c�RbI�:����lM�	!)��2�T%�g���X��&�3V+T�[P$�%%�4���"$b�$���N�Qel(�5��Xj51bY1�x+�Q+��w�k*����>\]�4i���*Hcr����H�X���*"r���U�X����5=U��R�>���8bf��)�""
���r�����w�^)�:(����	m`8y������4Bq������16o(S�:�aKLY2:����_�k�Llk�L5O��L���VW��X]%���p[�N'���M�����\H��q�!���r]��F-�;��j��hM�])K9p�4"M[��Ir%�'��>UN�}���$��Me�<�1�Rc!�8]�T����
qzD�u�\+�2�:����q63sQZ
#� ��J����x����h���3(������\]��kz�Xbj�*h�������)a����5%��Xg,�oW������T�VDd�X]�6����`�Ib8.Q*�3��J�J�d���t��� V�9m��%1g��l)NM�Nh0��#�3N��J�P��_]u4��QV������S:�9�NY�]]�B=C����av����N,`#�����.W��F��3�i���k�F�dE:!����HRS�������?�+�r\�Q����}��=�����m�V{���r��y{{��3���k3g�<v���m���m�8q�e��m��t���0gg�.T�R���"�v�H*D�I�qE����������#|��_6��W��}��?����.O��F��^���y��w�������~?wM�����;��g����+�����ZP��E,�>��yx�QmB���?��			*����!���n��!�M����A��������j���(;�O�<==���S���K�%��/=z4!d��	�o��t���������p�^�zdm�x�����2d���S�N�81000W3������{N��N�.]�����GQ�����-[������y���\���h�bq3.Z�h���^�����;���7o�p�����	!*�J*�~�O�cOO�p���s�T��?�4n�8���J���u7��_��o�.�r�J�z���~�r�?~����������K���(��-[6f��'�y���	s��3f�\.��dY�c������^��(�a�O�m������?v��m��=�^����QQQ���Y�899�|����|x3?���������n���V����DFF���k���Y�0`���^'""��e�cs������G�u��1�Y��T�����.�6m�������p���������V���������###�6���/��j����K���]����9\'K�~��l�B���_���SSS[�ha�����KG�������������+W���/�����c��P�b��^3<<������Q�FM�8q����]�2q�h��I��zo��������{��J�*M�6����\i��epp�1��9s����	!<�H$_��f�T���������1c�-[���g�
����?\���l}�����ljj��uRRR�����/��=�R�d�����0��	/^l�����+�?�Z����O77����uk��}�����,Yk��)S~������c������^����*����?���o������g?}��K�O�gZ�Vs��#r�;w�T�Z5�?�w�^���3>���Y�f	������9,S�tsdd����@ ���R����������M0R�������_�>88800p��	O�<���w���{����S�L���{�*U
�F��8q��+BBB��W�^'��L�0���X�n]@@���w��P���'������(��-�j��rd�`D4�Ak�B�9�$�'L��b�P���c����u�������~����fQ\�~���k�����������}|


���233�t������s����������{wGG(K$�����k�x��������������������FSQQ���e2�X,�����������D��^<������/^LNN^V�[�n���LOO�h4,K���������o��`0lll�����?�~<O�P888�������htNN��"�����[ZZz���W�^uww������b0.\���TSS���}������=p�����������'''

���"����]����USS���URRr�����������WVV������T�������.]e�����}���I$������Pyy����F���������,//OMM���-**�����s]�p���������EEEyyy���T*5$$$::���XTT����@��������t@�q���s��qss��d999%%���n


yyy���������psss@@���KKKO�>mbbB$����m�&&&���������;<<nl������������nbbBUU���$  �������>>><<<$�U����������RRR������d0���rrr���G"�nUq���7�_fttt\\\aaarrrHHHEE������77wJJ�������������)���M&��x���������oZZ�������������]�n�������.--��@ 

=z����f�����_G��
�M�]�&**������$CCCP$33������GQQQ���TQQqwwwuu>~�xww�����UUU�m������������arr277���M^^>88���w���


zzz			�����������������J-,,�P(T*���������c


��W��������'N�(..��p���QQQ���7n�8r�������qll���Xbb���)�����������d>>���^ii������������Daa��~�-;;[UUuA�����999��m������������y��!���WA���U+��k`` ((���������)&&������������M���9
�{			***[�lYq��"((H&����AAA���J+++NN����|w``���W


o��577����"�����������dNNN
E *++mmm��
$���������_y�Q

���._�lee555���C
u��%uuucc���������j??���!##�������������D��yzzrqq�����e��}��(--�`0Z[[������


�������t������������;i477����������1LNNE�������dUUU�\#�@ �H�=���]�~��M��&�H���;v�PSSKOO�s�������SKKK[[��+W�o����c���������wqq�[MM����322���TUU###[ZZh4�������������L&����UUU�T*�H�~�������Vkk+h�;v��� m���jhh@����XZZ�k��C�1__������]]������T<��a"�($$���eff&''���;;������D>>>����������H�o������S�N-�����dee���b��������s�����(**��������G�6m���
G��***���


BCC#==$�===���FFF�D\��+��P(������������			(JBB���[(*%%ettt��]������---[�n��}{yy��u�N�>�_��giiy���3g�<{�LRR��d^�r���b������G�inn���


577����R��m��v||��������dr]]]qq���g���Q(TII	�FKLLd�w1LFF(�D"KJJ������_�^FF��������dff����������z�X}��-���})@����������_YY������E��srr���O$���A����L&���y���cUUU8����b�}}}���?��CMM��]����_�z�D"���������;Pv�61@"�555���1�����={rrr��������y���CSS�������'N�f�411���LKK#MMM�������bae�
%99ynnNRR���������"""����������������D���Thh��Q[[+!!������q�����kkk66��;����������o�����������������322������6o������MLL<v�Xbb"��������x����Hppp{{��}�>|��o����q�o����?***666�����)���t�����<?==-""������	x��>}��`ttt�H�����/_������������N�������800����w��%���IIIRRR222���nnnx<�J��"t5������_~��F�999
<x���IMM
��%&&�����hmm����D"EEE7l�p����7dff���egg���Q�Tyy���"Pdjj����333������


���/^�������QTT��E4��vPP�}�����'���������(������$��@ ;w����377_A%EEE��h����cll������offell<�o�3����<^^^<___�2NN���eee��ibbB&��H�
����qqq���/_�<y�$�N���hooWTT���
@ �N����_YU���.K'$$v��)������P�����;w


XYY!OOOkkkV���Lyy����
"������������{������������O�>�E�?>}�����7P*++KMMm�QBdd��'P(���������W���{zz�L&�B���;p����|�2�@���C�����s�rss�����������������������������AAA333
���4|||���Z[[���Q(T\\���~�������m�;��������������� 33����lllfffT*#������x<~����������#G��!��/_vtt���O��h���)))���}}}������[�n��i??cc���	tpp����������{����I��h��������T�����g��<u�TII	�����544������o��[j�0��������,H455��������.:t����������>>>���UUU���x<��������!���~[[�={��p�o��vA��������H�����������
OOOgg����������������(**��������������;�����YRR�F/w&�������D277���MHH���LMM���-**wwwwvv���mta��|�bll���,""���aq������fff������SD"���0&&FII�������_�

rpp�sk�7o�����QDD����}�;4jjj[�l			133���������������F"����O�<�����f��������vvv+��������7o��m�6����[�n=~�XDD����������
e�m][[;11Q^^�}Yjjjbcc����TjIIIzz���GMMMII���_[[K���l�26����)))��;~
�D�����aCll���!�F777���wtt�����h�������R �������022JKK���o<==MMM�?���$--]VV677���X��@DD���	��{��}NN���'�=z�����\aaaSSS%%������>{{{"����Q]]�F�������

��9���@-���v���W�^������q������g�677;88���/IIIt:=77������WHH(;;����U
��������?~���'lll���H$��`�l���������$������� %������aaa���G�"����		A��rrr�@ dfffee������7���8���g���������4MLL��_���)##~--����o��d�������4����|*�$�s����JkkkKK�)++SWW���rttlii�yy9//����k�m�'NHII]�|���sdd$33����������YYYaaa��?733c�_SSS������������eeeCCC���������L'���%���/_^�F [�n���#x<^[[{hh���W���FFF ���m�������d2��%}�i777���'))y��=Pg���k������ZMMM�d�r�V�����������������v�z�����2h��zIX��&&&���[A��:a������x55���H����7o�����P�k���������p���+��@ ��������9��������u:�^]]� �����w���������k���,"''gWWWqq����L���@��iii` QDDd�����n�E222ddd��e������{�]��F�kjj����,//�\CBB�������t���WJJ���tdd���{A�������4>>~��M��HD$���������geeE BCC7l����866���,//oaa�g�*��o�>%%%������d�m��
mtt�N�+((l���/X[[_�r����###?���������sss322Z�n���!�J
���������ijjrvv�������jii������g�DEE������??y�d��m���d2YCC#666''':::))IOOOPP0::������d��uJJJ����P�hjjj������vv��g
�������^������`RSS�XlYY�		!�H|||���~~~zzz(*55�������UUU� �	x222YYY������=}��B�HHH������O���111�111ihh����������x����4�H/��/_�����������+��P(����Von>kTq���|��,h���r�������LCC��C���K����`5�F ���YrFTD\U}}}����Y��-~�sz���ffK����x��%�����WV����4������;;;�Y�S)5�����t)Z[[EDD~�������D�������o��V�?_��p����������OV;��������7o���%%%�`e�����=D�kx���W~�����x�����l������t���~���%''/����������Z����1==mdd��Q�cM�Z3X,6>>~)��Vfnn��~j-����6l������X����kf�_���e0---(j����������[�W�%)))77�����d����8������������o�:::���TUUMNN�8"�N���S�����.��L Vqdd�����l{qX,���|pp�����q�_�[__���f�rmDEE.�!�"\\\W�m�����?���\]]Y�x��������
��+foo�������5�A��}E�6�����O�R�P��>���FFF���{�����w_1&�����������/hnn���iii/C\=�����Y3+^�A}m`�
A_�nihh�~���	ZWW711�`E�jsuu577_�z�������qvv>~���D���|����Y��jtttrrr�~U� hU�l��<�����-�DZd9��������_k�m{xx`��+W�����w�X�b������[�.�jf�=z���n��AA�
f���uuu��hee��{w����:�o1�N\�����{��9x��������W���~KK����Y��~~~���x<~-�[|��w���k� ZU0�� � ����m� � Z-0�� � ����m� � Z-0�� � ����m� � Z-0�� � ����m� � Z-0�� � ����m� � Z-0�� � ����m� � Z-0�� � ����m� � Z-���i
endstream
endobj
81 0 obj
<</R20
20 0 R/R8
8 0 R>>
endobj
86 0 obj
<</R23
23 0 R>>
endobj
87 0 obj
<</R85
85 0 R>>
endobj
85 0 obj
<</Subtype/Image
/ColorSpace/DeviceRGB
/Width 1024
/Height 247
/BitsPerComponent 8
/Filter/FlateDecode
/DecodeParms<</Predictor 15
/Columns 1024
/Colors 3>>/Length 20525>>stream
x���XT���w��LBkW��Xc��@Q�sCQwuQ1�U�[P@PAE%DAB:����{�,k��p���y�s�a�sf�y9_&`��b
�p @�h@
�A4 ��� @R�8�|jj������w###������������doo���;d��/�@zu
�W�^�8q�o�������m[�`APPPEE������o����������_?333�\m�F������/_f0...)))����/f�X~~~���������������8p�L�6H�Np�����2���OB<==����Lf@@���KUTT������{��I)**���2��_���D_�Z|)������������)*++����b�<<<~���@���\XX���<z���7�����7�x<��?���� @�h@
�A4 ��� @�h@
�A4 ��� @�h@
�A4 ��� @������6o�j(����W��7�N�s	!W��)���l.���2�i+hXi��xn���h��Q��L��G��!@����n���6w�a7Z���%]X�����j\�xfz{=��3d<w�y��jFk;����;���\��w������?����1#����#��������sy�>u���G:3����W
�Fk;{�x�����o�F|	��7.�7c�xw.(��W\8�4��C
�������Z��V��2���(���kC}(��'+��) �(��������T����PfPq�lI�*��'Y�dS��5V�	�>�Y���TRV�F}(��~_8g������/
r o����_#.�+	��J�#��Xm���p�;CA��@�n���9�@] ����]EA,���F�q�J��y|���\Q�:�Y���y���>��U,�����8�F)��(�%�A��@iG��������Y2qb'�N�P���@��V�d<�c��k�d��}EqCX�#�D�q�D	�m��� @��!J����y�NN�u$ue��;�$���F�h�S����d����������uoB����#����9�@] ��tA�@ �.��r�� @ @���9�@] ��tA�@ �.��r�� @ @���9�@] ��tA�@ �.��r���5JKK{��}��]///���w������X.�;f��-[��D�
6(**~��
t�kl��%==�������������������M�>���{,�����_ghp�K� ...%%%..n��q.\�:u�����\�v���_~~���ttt~~�����***dr��&�)�����R�o����ic�C��<����l����i���/>��5�������r���B���g�B7�r���d��,b��;�Z���b��� ;Ms�q_|���S���������S�N+x����$��x�|�R���0oj�Z�`����[�5����+�L����g��N�|xxxtt�����O�<9���~����b���+W�ttt������<x���	!D ��hT<f�Kw�A
����N����7�n���/=�����
�U�e�� Z}��n|�/l�/2x�t�f##%�V��n�v����F�����/>w�OAy~_������n�����v����
�=�)�m���3R/�#"`w��VZ&_z4�	�����H�k�7nt��="""22r��)���AAA}�������g�������)@x
��[|��=�r�� @ @���9�@] ��tA�@ �.��r�� @ @���9�@] ��tA�@ �.��r�� @ @���9�@] ��tA�@ �.��r�� @ @���9�@] ��tA�@ �.��r�� @ @���9�@] >-���H����
��T���W"n���~���������'�e�oY��������FR��P��X]*z��naF�"����k��%���HIA��I���I����**����h��p#�n�K���P�����_=��"8�cm�m���s?W6M��Ns�$mlc+2�d:(�.�����96������@�+�R������I&M����SK�c��V<���c��~2|Z�����g%�<�������m��/WF�R_�G�"D
$x2Y��ClJ/C]k������
|_/^�>
=z����gD~MV$�{^�ZK������%y�
����+�T�����#�f�[w�y�|O�VGo0�j3]X���!�^>LU���hh >�/��������Nh�D�X?:E�����o���+�C����d,kL:O9�z���Wa=;���e����be�Mu�:�� ���S�_��/{������	V�*�W-�u����_��"�� ] ] ] ] ] ] ] ] ] ] ] ] ] ] ] ] ] ]���Gyy���������^UU�o�>'''{{��� ]]�!C�|�+
��K� 99911������^�ti��AAA���;v466��~�~����dr��aA t��)@+W����������/^�b�������sss��[���;p�@B��/_��@R<f�Kw�A
����N����7�nf���U������kj�Q��*^�TQ�~5��A���?�����
�U�e�� Z}��n|�/l�/:�&������[���]���)�a������n���/[LJgM`62Rro�|�&j�p���mt��U���fV�|5�QFO=mk�������Y�v)�kON��4�]8��SP����x��u�[��&��~C���?����^C�����b������l�6�~���+Hx��������ZF�H������W���#�l|�����)?��
�6�"�����'<�0�0���/�cl���w��������������b2�K�.UQQqww���RTT���������������j���
��b2�"��k_�/���_&�6�8t��p�2��<ts��&N3^'+__��<{\|��������s9?o��+�oK��[W��
����r���B����q��|�}y����������t�u�N�+;����-����-&b�Y���;�Z���b�p���v�K**g<�<OI-���f%ht+�7�{���u���E���0�g�)pf�Py1������S������'�������X�(u`]}�0�_-h���V����?O +<�R?U�5b��������-�w�,T7��w��WY�[6�p_����;a�rX��l��R��u�
.���X,--��~Y�V����u��988���:tH 8;;FFF*++�=z���b�x���<����7<O�S���-���=�.��.��.��.��.��"	���\e����QH#s�f�]����)��V7V5����BC�/�t�%���������O��A����9�@ ��*a�W����Ud<z���n���C�s�=<p"����d<��p����d�a�]����S)W�+
'X
������!gg��8�_��2p������*����?��g����^�sT�K1M�=:&�������PW�v��|"(x-T�u�E�Gc���{-������kw���8D�����l�r��Q�:A �� .�e��Z�Q� |6O\p5����@P(���4����mW�&kv5B �[�@�W@\�&o��?����� a�8�J��m������
����;�L�C^I/_�� �2�����@FsR@ �C�h.@
�v��0w#YO�y���k
UA��L�T�Y���mj��p���A�@ �{ �������d�wnq[��z�8Dy����\k���P������?`�^��3q��`��������W�����<�%��V<���Uc��MI��@j<���" ������Y9��_t�*Y�Jq�hn�������/J6�U�?���F� *)��[�s)�Y�G������Z�%b�����k8�v����h����O�9v��o3�����}\l��_>�E���U#;qU�]��+L��C�����{��?��md�6�c��wQy;Z��V9��cyH���:gQ��Br]�(�����xq�\�R��x���if���:�����W��*���m�@���9�@� ��������������*BHe�'�B��,aF:�pY�z��+**��We�b���@r$�IJ�qj�C{�'	JU���5���s����4���?�����Gt]FV6#?���1�����n���1�9M���KU���a���e���t�+� '�����U"$�@4��E��#�.{������b��������v2U��F���'��v��4if�����v		�U2�Y?h�����f-�?��+��@�dc2�����B�x�����US��ZG�1T���X����@H?�O� �\�?!@ ��s��9�@�)�|�E?��s��� �OB�S������^���@�Pb7������� �O��H�K�[6O��4?��!�V��!����db�Ds�o4@�o$7��_%�\�4�O"=�s'�F��I�
2��Dsi���$.�L:.�\��
pe�����!�\��K��H��%-{I4�@ ����~�\@V�.'���r}(��G�!�'
M
��]
��@a:91��n"���!.�~����A�@ >�W���F��]
��	��H��MtM%MC���S��]DEW�����'������4��h��	��<2
�[^�.��X�v�w!���4rw)
-*JH�Ub+��a�'Bf�f��B�F�i1����g����=��o9������Yv��MuFC<�Bnl%c�e����'��"ZM��0{�(�O�v?�R$�
�a){�4��[�!�aB�%x]��)C�����]�������� ���x~�\������3Lg��k��U��=�`6[�PoOd:���,�]�.
�|�(�C���Ed\�|��,�B�!�#����K��Og���,��I��
�x��K�U(��'JZ�4Y�P���[v0B�|��7C��'� dL�����L�z�w!�X�\�h��d��3�u����]�l�a�/�o�.iq2�[����+�=��G��@��Q���i������Oe��_,%f��.Ev�W����G���������J���\e�
����F�������R��q���Q�\M��=�)�G����I�~oRu!M�X�R�-�1�@��"�	r@�x(.�-3�����r����2��������c�2
��a��
�;�x��-C �$u���&��1-]����8B��q��V����#��CH�*��D��I�q���^\�������h��7��*�+��_�b2�*�d�gdT
�����8�4^�>6h���V���!��]T��U5�I�q���o{���3�N-�qlO~�f�gh��8���,�.�k�N����w]����C�b�!����F�k(�,����xh�MH��Yv?��,�>�����Fg��2���@ j���3�MG@���]����O�:�T��3%��`M�0C�����Ge�M�&���C'�����@�M
,�+����v�*���z�&���;�Q;O9�j��'����\q�?���Y��z�>f�f.��@ tA�#J^����U���!�?��w�C�*���e����0w�SlF�l���=Cg@��o�NYn� �-C�Kw��`r	W���E���Ny���f0���W0��|HD\pC���9q��u��8���
�,���W��yqy9�q2\��@ ��xGp��l�|O���(�%*:B��\(Ta�K*��
qu5��4������{U��&��Y��:(�e/}]F���\�����In��,m	����24��)��..�!-{?�����7�����j��,��4��$��d?��>�ml�El\��`CB���OU8Z���ET\>��b-�s�Ec�w8�M����NK��b�c��K�*7��u�������P�����`�"��W���y�Z�Zw�rDG���S��/Tgy�8W��+z���HKx�[o��z����Fe�������%�]��q�q��0�I��9`1����N���p�bw��Ry�����_h��X������k����/:�� *o$���}�o������Dt�gs��S_�~4���C ]���
oh:�aa�6���*]�s,bB/��,�q.u�@ �4��~����%��see�~c�[�5�K��D$"<��(/|�V~����]����(iJ67'���}*3<����@�/�z�0�hty�)��"�O��zB^?$mGI4h�@ �4��������������R���SC	O������U��>n�"9��u�Ds+.�.Y����dG)����1��YL�u��\��$�r���h��i����$�����b���i�������Ds��ey�J���[&(�V��Mf�����L������a�}3E��+��NVE���B�@,eCT������#��bsO�E
���A�Y.�s8Eebu%�������h}U��!�.i@\�������xrf�;��1e������=��-$��G����(�'�v?�b"���t�v�V����s��m�8�w���w!��$?������X=:Ox*9��~{�6������"�J��������C�eI%S��
e����g�d��3%�t���/pn�&Q�*�\;Z� ��B\]�'�}����.{d� %Y��f�|�bb�&��,�P�N����Y����|A��B\��)�:3c��L���z7^����Uv��*|��yN��M�S���4�m.!��T�����O�yq��^k��.{`�J��������NW���iuL���:������@ � $��3EO&0�-e��'�
��]qi<�j����U���%La4�������@�p0C$�`��s��q�a���w!� �����y����������gB�=������V�"��1]X��OQ�\J�o������L����#	!�4v"�����Kt	!��-����������LMM��t>�J�t��0����w'�������M�~���|�W��?)O��n!����M\$�9�����s��gy=u��iiR��Y���ov���4��Sl�6:{�U-���T�~��/��@ ���<�L���+��T�j�/d��Ce]�@�z7~u�Vu�� ���iA�@]k����g����&��cc��Po�u��
#���{�?M�u<�$.���U�����2���]PG@�8�W�_Q4]�F��7��2r�KW�Ga2�S�g)�[�q\�4�����Vd�o�k��_n�����x��������CY@X�V��&^q���u������q��w�.�z�����>J#�1������e�����l�U&<�F�]�4|�G�~:��O!L������\���^)Q� ����E>���(m��$��'
y)D�^��,!�.��HnE~E�L[@Q�
k�i��!����HnER����C�2aVF����7��.�{���3Tl����]J�n��
i�V��%����h��V�P��\�������47��M�g1���1S���p9����C�;���l���R��%:���l�69����8w�vw��>B����E�]���a�Y,��������A���L���������"2(��[H4�kA ]����b��2��yw)��������4�SO���i�+�����<�Bnl%�vU}I7Y���D�����������F6��<�6�>�V��X����~{A.���O�p���b��T�����@ tA �� S���n�Ya�"�8���@Y��0G��TsB��c+fh�/�x�;mG��Z�5����6��|*	�~����@ �����[�l�D6lP��m�d�@��{�����]�Ox�8�<8����XA������,�x�3uW|�����~&�
�])�������@ t�!6l�0i��{���X,'''���� @��_�@ �BC�\�r��%�������}_��������t
�,��;�{w������*U���("?��d����fN:eT]��#�4-uJZ��c��uj�Su���5����*�1��S�����J����xp���1��P����cc���Zv}s�U�f��knq[%�V�ff)��n�j��2K[E�\Qfg���X;e��g\~���e=}��v�|�$�������BN���8�33J�\7�SV9UW�j�vs�2��x�����]t+�q�FU0�E����-4�V�eV���������6��y�����L��� N�ER��w�����V�������z��w-��`.!D�T����k������?�]o�h�Z�T�����y�8��������o��hn�i\�,|�������q��\�72m'P��r���$���,)V�cO���jJ'!q�B2.�h����C���v���9�+�f�j*�O�K:w�b��f<����{��i���rM�}��J��fFZ�(W����c�/"���k�}wK^�y��F���<�����
����������v.w�������b���oZr%x7j�+{�Sjt�[����h����u���*��������A����(~kU�(B�#��f��T
��I���?o�j�I�e�{s�y��=|��tWQ�h����Mfv�,y�A���U�����K4����Qq�4��zW���So�Q-x�v���H�f�Wvl�Y}����-~!�_�.�S]]-�����47c�c��}��ak��6����g�>;�1`���b%�J�VF	��Q���+�-��������i�lj���SY�c6D������s�c&kU���Y����5���f����t�����]G�����e-���^�t��W�G	���J����>J�zOC�X���k2�7R/���������
���������n_	1W�	4���{4���~���	�q��0>�����H����i����V�%��p��p�}Zj������P��6\��qoNO�
F\��H�Y�J�&�����X-*�{��q�"M���9�9�Y6n�eeg*��p��J�g����c���D�*j��L�\R>BS�i�����+A1�QI�����
V�xa��������������A������kf������}��&�	J'������w�}��TLH���,���
��o��<���F��\��+%)Zl�Ub��[��?�8����\q�OU���<�V=MV[�*��^�N����YO;/9^��^])?OK(f8(��v�qPS�#&�d�CT���jm�$�(h6�!�&s���<~X1q�n��b��Ky�ei3����.+�w;�n�+PV�|�i�W�4�.����M���N�[�|&��U���z)���ZQ����Dr����<�sf�*<W�W�|��@���lF^�vxUU=TM#"M�J����(fiI���v��BN����j��;SO;%L��=��;��l�JU��l�m�����e��M�r����6��+����+�z��:�g��_/��U�M��rRY��*~�Z��2���,���.�:w���/��8z����~tt����MLL��J,'�����
����W��|;?<����2��/�lv�f�>�e4��"�h��=������_�4 �J�;w���?�����'N��5��36lX�`!$  ������b�h���X���ooo��_�w��������{�z�j��[�l15��5(5RSS������>����vvv6665�Y�~���+**<==�\������o/�hI7y��]nnn�����m+..�r��-�b��+W�v�����87//���S'N;}������$�[\\��K�O}ADDD~~���s�g�����|�������`����t��}CCC�O}�����9sTT����e���Y���_baaagg����+%=|!�����MSS��_�}�Z������~��K�/B��ys���9s�|����+
�n�:����L����o���������a��0��b�?���Z��0�}����G/��w�O�]z��Yuu����X,��g��l��'bcc�\��+�2�S�?>t�P�)�>���@�h�"++���'�g��'\]]�2�f�\�z�H$222�����������{vn��y����������;w�������eee���]�v>|x��-��{��
kkk����Y�~���7o�����q�^s\s�~V��Q���Iq6Rs�~vn����\��F>���e�h)�Fj������N��+���X,

�����;�-���7��g#u��������������uJJJ�_�B���Fll������3��������g��?��.]�����~u���[�������|������3f����h��O�>AAA�������W�&�l��}���������!!!����@DDDVV���W555'O���O?ihhXZZ�;w���!--���,66���0  @GG���7aaa������,))IOO���t�������/

�������������_�z�~�zWW����7o����N�c,33SCCc���6l�b��������9�f�g���a�&�9c�����DEE5j�(--m��m�kK��k��y�C�%''[YY�q�~u���������o��/*))���&&&���'$$������GQQq����`��m5'�?����������Fpp�����3g���<<<���UTT����o����X^^���	!)))?������k��Qm���p8:::�_�.//����C6o������_���c����{W�6I�ne0�=��a���oUU���+����r������F���|8W���
��3g���^�~��d6k�,77��!���odd���B�������5������]S�L���,((�������KJJ����a6�=`��}����b///��L�:������7o��b�
6�=}��={�P;��M��`��y�W���Jz�P@��999\.WSS377WII���������`0jV������M�6UWW���8::����x���kkkMM���ow�����������P������K�\n]�|���W���...���������322�L����fjj��������d�l�@ x����;w�C�Y�`�������CBBZ�hQPP�����u�������w���W�^����_���Q���sO�8q��%55�&M�p�����[�n�=:99911�s������������L}}}OO�����uFR#G�?~����'L�p��a&�9`����������eff���kjj��K������;vtss��755�����K�'N�7����������������7n�������2���^�������=`��f
2d��9�����l��-���������_�x�����]������b��G��Y�&88�Z�7m�4n�8j�g#��jaaQ���_�~��Em��}��Uuu��A����q���@EE�Z��P(\�l�tg#��:c���w���lMM�#F\�x���,44�Z��m�&�u��s���K��;w���|���K�,Y�j����tg#�������M�����3gn����f+((0����T�s��mOO�3g�P�zXX�����1c�;���]�r���Kqq���RJJ����/^��o�[�nUTT��7/00����C������nee�l�2


)�F��<==������,,,LLL***�|��k�TTT�~�������-[������7$$�Z��O��t��n��I}6RG�P������L�0a��-�����3y���;w�5����s������������|����cE"������������['��!C�9r��������4iRlllNNNvv����O�>����i����_���n�:///�{���[�����q�F�f��`��E�1����oooss����H��_�C�u���I�&�����-[��^�Z��������"%%e��Q{��Y�~���K---���?~��m�={��lruu��;<<<�����SQQ�;w�s���]\\�����[7�Phbb���v�����dj�+**6n��l��+V�x��C���WL&�������o����s����STT�2e����w���������W�^�:u��;w�����@�Z�����Y���M�49z������������j�*�PH
RQQ�2eJhh���c�o�/^L�����o��q������S;d~~���)u�XYY��b������%00�����g�:u����nnn!!!��5�:u���u�������7W���
��R�jmm�������V�ZQ;��}�j���u���S?<����y���5kVXX��S��m��e��n��5i�$==]WW������5+�G7����-!!!5�c\\����mmm�2�/���%=|����C����FGG4(""b�����]�vOOO%%��+�g�9sFKK���������;w�\###ww�u���<�P�9K�,Y�v�������?��S]x{{�j��E��.]�6mZpp���}tt���������>}���Z��III�z�j=���-Z���q��aj%��k��]�x�����#G���S{�������uww����Q�=���{����u�/^L��]�t133�NIW�XA�����CBB��� ukQ�<x�`�|���"##���\�vm���JJJ���vvv5���OJJ�r�g�����gu�c�WW�-[�����������{�`����%}����{���.l����)SV�^�����W�%K��������'00������g��Aiii��^]]H��Jq6B����ZIBCC�
��U��/v��u���jjj.\���


�Y�'N�H�RC��ooowwwj�k���������lmm������tu��S����X,^�|���z�����I�����Pw�����u/00�m��-Z�X�p���s||���v�M=w��P�:�oS�g#�����m�����������[?~<??�E����wrr���OHH����b�Z�-,,�S��F���-Z���cgg7r����7����{����VVV�����f����kV���Dj��>��o(`kkPQQ�~��N�6m��m#G���k��y��;6����������[
��7o����Q��������zzz�o����666�E��=ruu���?###�����o�����r��]5�_= 55���6///--���'�/��i�P(\�f
u���RXXhcc��M��C���z���k���ze�Ny��o�����r}||�_�XZZ����i����7((���������[L&�z����;�L�2o���;wJ1����w��fff[�n�����VQQq��]j����V�Z����i�&�Q��W�^m����:u���U+I�fdd=z�C���KOO\�l����:;;��w�U�V�������7o�9s&u����&&&���u���eiiI�}ZZ�����������A��������S�R�l��4���KJJ��[G������haaA�)VVV���{$:|�o���m��&M�4j�(>>����Q�Fyyy���<�����ZZ���7�_�|8����-�S�����_�|y``��#�RII�f.������?�+����MM=�1y�����7j�h����@KK�������W�7����-�S����D"QUU���6��755�v�����M~o��M����M��?��I���o?�|cccj�?����������=���z���5k���;w��u&Lh����i���YC���9+V��;w���+�o�^�)�����m[����=z������%66���BAA�������jo-�d��5k����P�wc��9p����6v��9��j/�~~~&Lh��Q}�>~�8&&����k��������=z�?�`aaRsS�>�������w��9i���L���:thVVun���jbbr���q����{WAA�zb[�����I�/^���D��;w����������b�
///��������$��,Y"�9��%K-Z���N�{��?������Z�f�������������/\�p����������^�	!���Pw���)��L�4i����v����_�.]����G�Y[[����x<uu��'O�^�����>���rss344��u��Y�6l������}����6m����]�vM�:5??����VVV�Wu������������������������-[�,X�`���<8r���+��;C=�R{U'�Hw6���7n���Oj�[�j���===o��=q����������<��$%%=y����N-����xyy��=[(^�v���&,,��`�7���S?��caa��-[|}}����\���������:!D���:���w�����C�g�����onnn�v�"##���/]����������z����]�R����c����3*I���=����-[�trr:p����V�>}=zdii�������/^��KVVVII������s��1��3gjiiI:7"""//�����������w/>>~��1111NNNW�\���MLL���c��e05�����g�&����6y����n�������nHHH����ttt���


/_����3b����H��	�M}���>}��n-�{�S��������7l��#G����G_�~��d&����ahh���TQQO=����l��������b.����qzz��Q����o��9t���O�Rs���"##���'N�H=�X�w=��z*N][XX��w�������;w.33s������� ����O��!W�^��UYY9>>���H[[;22�e�����5w���n�+���[�������<|�p�V�
�B��w����U�^��>}������K����>�_��/_�d��������w�����Ovvv�\B��6Y���������������<����zzz�;w.--UTT���433�������-�?��U[[��?����~�����"�C>|��f�k�0�����p�999JJJ���b���:u���n�Yyyy���-Z�prr������MNNvqqi����SskoN~~~PPP��}����8��8````gg���������Q�F������������v���x���y���/��"�]\\�[�(<����p8NNN'O�d0�����w�6�C��=���C�3  @ 4m���%�Qee�O?�dgg������x�����Dwww��������������������w��)ccc������R����OOOg��:::������#G�L�0�_^��Y/^���������.++���'N�6�����RUUUDDD���=u�fee���P+��W����o��}���������S�N�������������W�������Pwk�n������z��=�'O���=|����M�.]�����y�6m����:99�^�y<��g#�N�244������>|���-,,7nRRR�������I�&���t������5�z�jo�Dg#����r�uO DFF80..���kPUU
����*++������j�����Fv����s�7n888t��y������///?p�@�F�\\\���y<����O?�4z���������g#��_g�X			�)��{������;�����c�&%%��y������-[�|��A�U���K���:�&���+			��HQ


�u�N�������K��M}��M_���|t�����������x?��kC �� >��d�����m�!.��.S�)�.�:72�[��S5#A�����U�s?��Lr��r}��1S�7V.�}��wuo\�2|,���G�#�V[�O�f;��������T�0���re�C���Dv���8|m���s�=�&���w�t���9�G�-���4���j�'��O�N k�y��AW�3�����+��+7�bw�����G�� +��v�zd��.S���yA��o�����2�$�]���J��8�^���U[y�n�Ok�H���q����x�����]d8��WV�y�[�=��z�W�5�+��0;���V2���!@V�#�e�d��=&C�d������E���JV1��w)��W�p.u�@ YA|�P��(&����*�RUA����d�@ dP}�5��
��"��)����cJ6v�����RQ"�OAVF��S���
��$��D!-u�������q�znt	����Hz.$f������ijjJ-J$�'!�?���&�\�!�@�J}��s����z�H�o���d��o.���q�Ym;��T�<n��x2��Xs+��Y�i}����@I6�c��?�*n\.�[[=�M	@���?���"�h�(^��	��_m����?3.�5x�J�H�a��t�X�-��E����6��O
���<:6�����_,�y�<!@V�{p����_�oDHq��J	���k�:FS�h��������r;�d<#.�b��,��h:}i��D���+N3�i9�TK<W�`�#����'>��"@I%w��c�m�j����#lk9�b��)�"F�!D����'x<B����d>7��`��37u_�C�����VTU�����W/wl&MXRLW�����}?���5�:q�$7��~AP��
��1�x��	���%�c����Kv�Y��7�.8�%g�s#�zo�eCL������3��>�qY�;�|�<!@V��B���f�Lc�}��1�
�E%���nB^���D�b��|���?���'Ci�c�Fc$�+a����O0��}�	
~i�������v_����D:3���?��Oo�8��HM�Xr�x�o+��$���\��������3�oK�F�.Pe�h�L[��Y9�K-$����2d�,�
+D��L^����`���2����';��F��"��$��/@m�������
������'}���7�x;�����1�V�J����I��q�U�2�i&��/$,�����uutu#��������?),�a�f�e�s%��`��$���d8W� +�@m�T�����gF�J���~�	����2@A��&��X������_gn���-m[��������x\~����J������xY��
���������d�#���D�2������),�������ce5�z�_->xos���)�.����|R�f��q!.��**���u���S��tu`�%�@��P�(�,�`��?Z|!_��H� 0Q)�T�(�\|s�N�����
��wTl����6�����V�����V��_&�;�}!�@ �+�mHO���\�����|���]���m�]�s�/�*Hv�"��U��vZ�z�1��%���7��?7Pw���@ YAH's+�+�����9��6{G��b@��r�����oW[
�s\��^}wk���:de0��45��J�,i����{�2�A��2il�>6G$����8�����=���4�6M���[)+ J�DI��,*	��2��V�_��k�!��u�~���bB���Xs��by�B��,#�
���|��%�mu�|����@ YA ?Zr���:im;@�t�{��W�����3>������-1m-���u�O�uw�=��4�������	���qH�9��c��2���l�R�su��!�������YH,�G�#�OG������=>�{ia����b���v�5�g
&��7[OS�{��~I�/��#���@�
������a��{��b@�M���h��[9>��~n'��s$	�^���t�4�A�M��I�l'c��i���i��
��H�]�A��Q��N���m������ @V���,�H/�6Q3����V��ou�@|=��@ � Sg������C��s\J�7Q3h��)�(i�v6��8'��9�����v=����d�[��| ��K��?[�P���@ �
P;��=O�����l���r������)�9��G�o��o�f��;>�����}D��'9�"���� @V�P��@RRR@@�P(��y���x����!@���
�@ dE�y���S�FEE1GGGm�� @���
�@ dE�v��+VDFF���<�������4m�������;k�A�|.g�5�qTa���#��ZTl=��A�����Z���X����E2�
�;���<]��QA���t�����~�T�/��4dtVR/=�^o��7�:h��)7]-�J��;�J:q��
���}�,�{��-�
�����J�o��Rm�2�����r�Xs/��o]+7U����nuGhw+7b^�1��.�6�u��O�\�=�[ot�t�,sc���q�;��N)u������z��0xy���G��^3W�����$�t��6���f��W�y�lhX)o��z�[���VEq�o�z�d�;�e����&��
R�5xu���{��x[�|>��oE�l�U�?O{EPf�Kk��{����o��F,��P�U��}�|���*4d6)W�f�'])xh�b��(W�}l��d�Y,Vee�X� O��b��+����%�<e���>�A���B����XT6H����������TyW��A�~��n/l����S�m�������k4��UzFj�Xg��5�����B~a7�}��y��	!�;[�q�f�"��B�BE�?�5�f�k]%�N<�f���#,1�0���^�9
���5�����O]�%?`����rV�.!������6�U#!�H��]�T3*4�Q�����oke�.*I�e�%$���#��AMEX�S��47��5��y'����T_���,�����0g^���<�g�I�BW�M���:�������m�_j��,��u�����@��{�<B��]JJJUUUbm�p~��oj�l�P/=�^o����m5�~��R�g��=��w���*t2J�&�wo�[X+�����i���:��� �|�Xs�	/�7�����S���m}O��FN�����\�(k��n�\�t=��o��j�Y���S��8�4��^���S������2x,���r��PS���@��`������
U���5�~����P	i��d��N�W��$O��_�&����
�2��\-�Til���R��m�"�\�s'������V��|m���':,������9���V(�����^3{4��
�^id�0�������*o^�f����e�Lk.#;S9���A#4�C��*�<�yc����_*:�6�o����V{���Y�v�Py���{/]QH�i������~Z�L��������A4������533�j� �|�"����E��)j6z6�w���XDH7���,�v
M��	Yl���VUh.k���M�Z�J��z/�j�PI��si��*��V��Vi��U�2IcY���=�����f����L�gMN~��@D������������^�v����4��������k���Sz��b�T��e�j��,e'���1p4]Uh7��f�\�`��fx���Aq��G�����f:U�������i��"
�������T�h��pV�����9�KU_)�{Z�z��`�kK�l�\�����j��VX���~���ziIE������z?(+������%�������e�������wlM����]��)����?m;�W�EU�]�����7�Pp�������$�Ia�|�8��j��XFUYns���SVV�����L���S�

###���Y����rr�����Y��h4����������p�|�pz�`2�M�4����`�
����FFF�������<���W�\�����6##���������6o��t�RB���_YY������x�q��\B�������~9����yyy}��y����9s�7�wPj������Y[[��?��U+���g��M��-���puue�X...:t�`����g���'�����������b-_�\��7o����;�����������O�2������������K�����[\\��[��}AXXX~~��A�j���.//��e�����6l�����}�|}},X�������Z������_lii��U��6��_)��C����5k����'����
		������.��Dy����',X����h�@ ��}��%K>�2	N����#""���j��Y��|D�]����[U��|�e��>�����}|����W�^UUU����D��~�^���={6&&��b�Y��.#>���3��������������/����6m�4s��=���\��5�����B����4����"##������5���W<x����{�ni�K)�]]VV�����{��#GZXXH<���;-[��������u����������3���q���E5�F�����j��a����s��s%������+kFKp5R��~q�����\��D�����<��|P���]�����j�������PVV�p8-[�LNN���@ ����������;w�����z��U�g���+}||>����s{�����4h��c�������3�������o��AAA���������'����s��i***����������@XXXVVVhh�����i�������eeeu����;����������������������&M��x%%%iii����5�������@ 022��h���EEE-[�LJJz����M�����6m��/�P
�9�������z����7K0����������L&s��9�7o���s��Y�x������QDDD�
RSSw��A�nI����6m��$%%Y[[�q��u���������_�k��q8uu����-Z�x����900PEEe������|>��5����:t�\��8~�������9����}}}���
�s�N{{���r.�KINN��w����]c��m��������^zzzyyy�N���i���?NOO��k����GV������h�g��m��������r���>>>�9g���;�h�b����������`����������o����M�4����������$''S�@dddiii�6�����k��VVV={��}�vII��I�N�8�d2x��!�H���F�"�����=}���m���Y���d��=;00�: �n�J�-���|4W���2p�@�q������`�X����������:>L��jV�/����[�VUU%''�������������-[jkk��������������P������+Y,V]����w��O����������?���_3f�

���h��y||��}��tz�����W�^=|�P�6l���K��_���lffVPP�����M�����M�v���w���O���t���x���g�_������Q#�UXXx��=��������]����yyy�^0333


]]]���k�q�3f��I7n��<y��'�t������gmm�f�333}}}���k����L���;O�8Q�gjj,X��[�ng���0a�������SRRTTTZ�hq���N�:���Rj�u)***;;{���b�6l����9�����];.���s�7n�y�FSS�}��			���zzz�N����>~�8��o��u��	�8	�F������f���i������k�������!C�P�i��
�|�������|�@�j�*��F��u��9�=b2�����F��v����yHH�
���������c���+W��r�Jyy����W�X�n�:[[[��F��522R[[�m��O�>�;w���[�L�������NOIIQRR���������U������������j$&&f���NNN���'99y��M�������w�^EE��E����y<^�N��������U���:::z��UZZZ\��������^�|���XZZ6k���������uKMMm����.]��x����
�V���g�\��G�_���w���%%%�'O�N==�i�����{������[�p������off��7��/
W�^�������q�F��6����?��Stt���Scbbrrr���G����&&&n��u��e�6m"�l���������J999w��i�����C����.]�|�r:�������acc��E����P���?���[�n�5"�dee�_�^��������299y�������6mZ�r������������kX��UUU�v�Z�x1��>>>jjj.�`������S```�=A�f�444������D�rEE��-[V�Z�f�6�]s��{��N�����:th����s����SQQ�1c���ut����
������

�9s����������g���[G�|�����7j����S�Z����NOO<x��u�5HMMm��!!!���5�����S�:i��;w����.Y��: ����7oN=(����7�#b�>�����S�:�����?z�h������M�4�9s���'�U���>�?�+��C��c��m������_�x��uk��<t�P����o�9s��/&�=}>�����w���������#  �G��5JKK���/))0`@�
��]���C}Kppp�������������x��: �����Z����]>|xLLLdd��!C����z��-�puu�p8�W�/�x������s������w���&&&��O��q���_�+5��b��
6�����?�s�;����G�������_�>k��������EFFZ[[�X������gkhh��������P)�z��/_�x��'NP+�H$��a�������O�<�`���fUU��������x�~��g��>}�q�Fwww��������9uI�f����}5���#???�[����C�R�����JJJ�n��4iRXX�����j��U����rY,��K�>��Y������9  �����p����G�b�����}����c�����-k����3��_o``��w�+V�������JJJ���UUU���C�IMM�V���*�JT���a����J2b����/����n����k��w�9s������WuttBBBjV�)S�POjHv5���1}�tj�����_����jkk���#6���_�I��/�D�W�����Y�333��U���a���_�u����]�vfff��-4hP\\���n�]�p���G�R�:ulS'�W#���;v�������/^�i����3���fffaaa:tptt���{��E\\c��niiI]rHv5����|�rOO�V�Z�3f��m��O�>������EEE5�lggW��'$$P�,��H}G0p�@[[[??����M�6-]����x��Y;v�3f��={-Zt���%K�;v�W�^&&�)�h���k�R?D%����lkk�����C[[���33�g��9;;_�p!<<|������<o��=5������bkk����������u�V�@���M]��������m���T���Z�r�����+;u���C777����I=mcee�����m�zyy��������w�N�S����{��3-Z�{�n	����������|�����!444***=zD�rQQ��u�|||�n�J��jv944�I�&W�^���K��������q���N�:�>}:--�����U,��;h�������[���:;;o��m��������0z���:~������������NNN��WVV5�z����3g�����7ixxx���l�������������zP���ko�G�:}�o���c��F�5h� ..����A�yyy�������V��������{�z��s�~�P�B���������W����5�: 9N�\�%uww��������^��6m����4h�}���;��|�����^a���u?}�o�����S�PXYY�����r�7oN����5���
S����G�L��d���S�v��a��%�����d�������K�.���_�v���{���7n�<yr�6mf��������M���;k��Y�p���kw��Y�)������];KK��������NNN111�������=���T{wd+V��������.���w��Q:�N����7��^0}||&O���Ai�>�<::�������������...�G�������k_���o�
��{���S�����6|����,��.%%�Y�fw���0a��G�����7���K�������5��$�������������f�777##��������$�V�X!�5��+�/_���F�{�_����tss���^�zubbbPP������W>|8{�������:!D���am��9��L�:���}��=���_�re||��g�Z�l����f�555��;W{U�����j���m��������o�7o�������;t������m[�H�g���3g����:u������N�uS�����w0 ;;�Z�TUU_�|�t��M�6����<yr��5W�\����^K���B$��r�[�l���/���[�n��)�������2eJZZZQQQ^^������k���b+�����������[�lll._�L��&L�p���_~����0  ���+""����������B$����(���7l�0>����~������o�><<�e����_����8q��������w�N}�����W�j����;��S�NYXX8::=zTGG�o����=���z��Uaa��7o��-++�������zs��+Wh4���sutt�����WXX��r'N�7n����hGG��7o���'$$p������h��]��������l��~����}����w��������egg��q#''g��QFFF����������K}��m�F�������N�4)$$���`��'O����...�o��v�N�gdd;::VTT���Qo+,++;x�������C%�K�`jj���6v������w�>�����\ss�������)S�P�?�]�_����S�A��������RRR�8q��+W233�����j���R~~>!������R����gbb���naaajjZ����������u�P�����_���N�8��u���Q�����������/uuu?y~�u�P���rk�����L&����y���=���ovvv�\B��vY��������844�����f?y�����k�����***<���������>��<���W]]�.�l�2==]EE�: �<yR���W���u�������999���4,,�c��"��:�t���a�Eyyy!!!fff������111IIINNN�����vjn�����

������]�|���G���Z�j���z�������c�������1�������������y������E"����d%666;;[II������s4-&&������N�:={��S�N�L???>���q�O��TG<o����Z�rttLHH�v�ZBB��������?^sW�^0���~���f��I��|���WSS355m��YPPPii)u������1�L===CC��G���'

&O�����Eo����������.+++>>��o��R�]{]�����S�O��������j%y��]vv�������?~<r�H.�{�����G'$$�������^�?Z��~5B=�=z��V��-[N�6-77���666��u;t�P��M��m���X{Ug��_��?���822���`�����]���l��apppII�������?>u����`�7n����:u���b]�P+����=>�>x���O���uk������������x<^vv6�����B$���{w��]�����c��]�<xp��I���G�m����������l������{]\\j�����_���}��`�x���$���RVVn����+W��y3~������w�N�8���;����W���.��F�����7o�x���_R����WW���y.!���K�H����=�������
����39�>t�����>�v��Y���34�����t�Rjj*!������N��?���ILL�q�������~Y\b�>����@  �t���&$V���������������Lmu?}���CBB��=d�cc��j�G�:}N�8QXXH������H\;w�,++����5k���U��/�|��-[�L�<G/]���f0�?>Z�k8~����o����97''g��u4���K[[[����_O��-,,F�!����w�MAA����N_�z���.��;|��qy����~�zQQ��N"..��b�
:�>~�x��C,���744l����Q���n)���_�~=**������B@|�N�j�������{��L�"��������P��bcc�U�V�]�V�'�@EEEyy���O����|!�p��f��������R�s���H������B@|����������g?���_���+�?������3F�sv��5~�x	>EJ���666�/��o�<�B�^����"��	��
DFFR����������I<x�������_\������n������.�7O:t���e��%�}p�4V�X���C���<�����?�|�4�"���������O�8Q�s��w#���7OKKK��O�<���=y���A��SA���Ey��^��A�#�zP� ������&�����G���TA~
n'�JO�q�T~�"To�$N2��(vx��Q��?�s=ow�aT.A�y����
�i�p�GQ��}�p.�i��m����y�?�7��'���@@0@@0@@0`����N��m���R7�Qq���>}:��a�����Sf�Y�#�1C���6����c�}(�OG�)����{(Z@�;��Mf�����6M-]��Qq��&���-"�Z���gk2OL�=�D��?��M���������&3;����x�;*N�&�����:�E���;�1�=)����U^�z||���#"����P�VS7<<<::�d����D�Z��lEE�X,����?5��I��k�!�����o�����\�R����^9;t�}P�=E~&�}H�ONN899������OOO:tH*�Z�O
6��[
���w����'���t������������Y�d��b��/..~�����{xoK����C?pO�G�&�.�'�z�j�V��������o6|����A#�D�p�
`��gee����>���$�u���Z�WW�'�o��������o����������p�8;z��������"�x{�)))�.]�������E�������|�{�-<U��3{�����F��WVV:99Y���sT�'���qOO����������+�f}~�p���(IuV���7"---���R�������9''�����������jww�m�����1�n�:�������j��k��Ii�����^����r�A%;�G��(���6�555������g��III��d���~{^=x�{����iA��{xx�8qb������<�~����?����=������:��PRR�{[��P?4�=��Q�mO<  ���G+W�LNN��_4^��b��8E2���/_������RSSy������_��H�?p�V�y[����-**���O4�Q�EG���<��������(�J??����}�����{���d������RiFFFgg���tt4�������z��
']y~~~]]]QQ�Z�&�B@h0��9��;'��]\\N�<���YUU�0������$I^^��;w���7n�h'r�������e��m����YYYw����s'o�<l�o�����VRR����M,�@����,[�L���^9y��J����M�����u�<�q��cbb���~��8y���===b�x��5���<�������/����������:N�����[ZZ&&&T*���ky.�8x������+**���%�B@h0��9������'�H\]]u:�����������=�P(�������Je@@�u�<������g���z}vv����w��������3�m!r<>>>44���+((H.�����OHH���N�����n�9�i�&???�N'��W�Ze'��/^���EEEEFFZ������:|�pii�H$����9������/��
�u�<����l�a���v����.��7����z\�z��
�����(����h4�T����������?{����9--���������{�O.�9n2����)����X�b�u�<x������qqq������T*�7u�9��h������i����9��}ygggll,�='{�\&�)�J�8y�����MMMb�8--��mc��-:�.>>��ls��AAA.\�J��7o��%r<44���>""�����o���������={�h4�+�f�����e1
�����O�~9��)��kA��b�7:o���;\�=a��3����x���O/pOJ����~��8��O�k�8�z�������<m��p$*?<8��1?N���Q����w��2���7`������{"y_���C����<j��D~�������o��i����B���S��F�Z\9��\����@`0@@0@@0@@0@@0@@0@@�
�k8�
endstream
endobj
114 0 obj
<</R20
20 0 R/R8
8 0 R>>
endobj
21 0 obj
<</BaseFont/PXGHNK+Arial-BoldMT/FontDescriptor 22 0 R/Type/Font
/CIDToGIDMap /Identity
/DW 556
/W[16[333]
18[278
556
556
556
556
556
556
556
556
556
556]
69[611
556]
74[611
611]
77[278]
79[278
889
611
611]
85[389
556
333]]
/CIDSystemInfo 19 0 R/Subtype/CIDFontType2>>
endobj
9 0 obj
<</BaseFont/SPSPNX+ArialMT/FontDescriptor 10 0 R/Type/Font
/CIDToGIDMap /Identity
/DW 556
/W[5[355
556]
8[889]
11[333
333]
14[584
278
333
278
278
556
556
556
556
556
556
556
556
556
556]
30[278]
32[584]
36[667
667
722
722
667]
42[778]
44[278]
48[833
722
778
667]
53[722
667
611
722
667
944]
63[278]
70[500
556
556
278
556
556
222
222
500
222
833
556
556
556
556
333
500
278
556
500
722
500
500
500]
95[260]]
/CIDSystemInfo 7 0 R/Subtype/CIDFontType2>>
endobj
117 0 obj
<</Filter/FlateDecode/Length 316>>stream
x�]�;n�0D{��7�(~d�6N�"A��2�T�d�����[�y�D`v9���r��f��uN���2���}~����������)m'a��K��_���ga�����������(�Ki�|_���X��9u]����R����}��u-���H�g���/P������t*���P���A
�t�@J���tj$����t�DJw�e0` /����^r����#�{		FD�����%��gR���2&QA�x�`I
�#
��>Hnp@� [�>���!Jk�������>���u��J'�*N�����������K�T
endstream
endobj
20 0 obj
<</BaseFont/PXGHNK+Arial-BoldMT/ToUnicode 117 0 R/Type/Font
/Encoding /Identity-H/DescendantFonts[21 0 R]/Subtype/Type0>>
endobj
118 0 obj
<</Filter/FlateDecode/Length 521>>stream
x�]�MjA��N17�t�tK`���x��\`4�c��H��"��W �T���+����//��>���.��~��������6���_���������O��6]w�������kdX��������Q�!F�_�/K�Ns�M�k�=��8�i]���-����������bT��rt@*��	�8������TN�He�Q���QA*�#A*+G�TG���"Uf,��*3��T��L����V�&��8�Z����Pe9%j��r���&$�Pc q��@�s��@Q�1��I�M��������H�G��qU-H�GSAj<�*R����:iE��
�3��:���B�����Q�3�	R��fH+�7GZ�����rUkH+W5�V�jG���V�j�3��?&��V^��u��#��}���2��H+��qAZyG����"m�������6��ic_��6��<j�����ac_����}y4E��ic_M���|E��x@?_��[�x�?�a����v��;����|�z��������o���/ef
endstream
endobj
8 0 obj
<</BaseFont/SPSPNX+ArialMT/ToUnicode 118 0 R/Type/Font
/Encoding /Identity-H/DescendantFonts[9 0 R]/Subtype/Type0>>
endobj
119 0 obj
<</Filter/FlateDecode/Length 20>>stream
x�c`������Ao�s�
endstream
endobj
22 0 obj
<</Type/FontDescriptor/FontName/PXGHNK+Arial-BoldMT/FontBBox[-627 -376 2000 1017]/Flags 65568
/Ascent 1017
/CapHeight 1017
/Descent -376
/ItalicAngle 0
/StemV 300
/CIDSet 119 0 R
/FontFile2 115 0 R>>
endobj
115 0 obj
<</Filter/FlateDecode/Length 12674>>stream
x��|y|TE��9U��������v���$������
Y3���F[���($&$qA6���;qf�t;,c�q������oDQ��a�!:j��_����������?����:���:uN��S	���^��^ts�����~����m��F���T*���n��Z������%�\������r��%���^��,������Gp�
+-L�k���oiW�q78�����]����#�oZ����?��L�C��vA������)�,y��$_`f*�?1�b�a�;|�N,�i���@a?��C`�9�0�!l0�!�N�=�x����p)��$_�;�{�[�u�2��8�*�sa.,�/�gN>*�Z��������A�5�M~X��j�j�/'�����O��68�����2��l�L������,�C8�M,���:����� �:�u�K0
��
X�a��fl�O�g�k��Ai�a|�8�<�������0o�S����v�W&B�'���^D
���2�����O':�A)\
3`\w���[����${`*��U�f�}��'YG��wa�B�C7��(��0�c��F�3�`:^���6�;����8}���q��6�B.@<��w�G�X�-x=����$��(��|���;��1��I�������_�j��m�,��x���P�����(���DM�I3i'���/�����UpS���;������Be�����L�>�b��@��A#,��a+</���>|�?3���8	����;q#>������������M&�z�LV���:ry�<H�'��qr�|L�D�B��<�����>M�4NO��9��q�R����%�2������������g5���v��;�w�~7V0�o	H,MD��=��0��m�<�����w�/���Y��=��Vc#6�t���%xn���|��_�klDI�I����d!YB�"���9@����I2J��v��ZJ����J��v�.���E��mt=N����t�Sqv.���Vs�r�����_�7�7���/������D�Rd(��+v+��T(+�-�M�����j�,@����8AYd�p=8
����A7�l4�@�&p)���V����B����@�=
B���+�^%�����Nn]��E<���A9J��8@j�<��w�g�n�q9v�^�K�6��x���l�j���8
��
j����p5��V������8=���!��|�G>�5���\X���z`�^>���9�Aq�@Y�����3���?���)���e�S����d����n���p�>�����KW�T�@�A-��|X�����&�H���5��F7|���=��������7a+|�[`.����_=��0_�s��F�(3��������Q��-p<��?�Y�����+�U8�PA���0Z����]��BT����t�-p�O��p����W���$�c!.��P��sat�NT��8�������'�.(	�������_�s8I�_�8	�q.�o�
X�VB$A5��z�;�rP�)����h�0@&T��"�����D��C$��0��R��Z0B����"1
�]�\� ��Q�$���J�o�pH���z�v��|i��K�'VU��JK�'
�����o��-fef�����jI3��A��j�*���(A(l�6�����(��N�Z������o�EmQ�Bw���4Qw�L��9�������R��J�5PST�n�����{�q�?�����[�
���r~������zo��)*t�K��Qls7Do^��������:o�MQ!h�u�:mQ!D����OF9C�
�P��
�.o}C���g=�������-3[��=�pQa�y���wJ��I�Nn&���*�f���h`�{�px�=q�i�{/��5J�Y�@�����W�r�T,*���Z7\�M������y�wtxf��X{�������(�ml����3���i����(Yn���pQ�����*5�%�V�v�;��N�.�|}�Bw��9
�n��\.i(9���9�^O4��
/�����Y�:%�����������8���/�,���s29�5�� Yd=�N�JmQ�"wf�z�$w"{-��ML��'�E�M��3[�E�um��KX=���������b�w����,�Q�
��2=��jQ\x>
�LE�uQ�d�\QTxs�Tz�w�T6�������K�E���-q	�)*�D{g���n�&=Rq %m3|c��0��1��������xkT����(���^E��^��7��6����n��6.��9?+��/��s���V�N�s$������������r�Q.W!+��(����@wcTh��z�5��W�.b�'�0.��6���%���'������6��9Q�G�����Y�3\���m��F��qs����d�5^���<Dv�]����Oh<yxKz���pTh[���2���3$�8{~����8�5F���M	�����CnI�%jY��J��M�ZcD%���$�^��ryQA�S��CX'�:A���E��^991�����Wr����r�|�;��VP�a.����<�)�������30��VO��~��1�}������
�7���Iq�o�l�a�(J�}�K�����z�������tol.��;(����|R
��0�J���2��E�B1����f����������^P�@��'t/$�^�t7}&�(Jq�|LUn����A���q�<$��� L����y��x
����u��ge�t�, �� �g��>���p�><�����,$��@a}��g��g��1Aj5�)��O�����0}dP�e���1�L��C�BQ:��t ��n��
Hr�6��Je6
jeB�@���n�^�(��-�rY�[d�-�i6��;cF���&VLeGYK���H������+ D����L�5t1��~J�F���V�!@�V���R��H���e���!�Nw,���VC��C&1R=A�*�����#T���qP�e���e��z����*cv�x�j��j���T���jut��9@@�+a������Z[Vk�
4l ��4� �F�%�]�ih�>9�����d���G�8��R���zC�p��N�(���}r�}���eP��yPB�����@	���~3����B7�J�z�f�A7��n��	���NWA];�jY��1� K���+�N������LS����z����d2���P:F;��v�v
�e+��y(���t��S���Q{jj��������Y�`2iV�*FkE�%+�H�"'�����}6��*����;��_R09LN��������� , ��3 �yJ@$�8���ADr��a1�d���Dr8�yS�����Mf����66X�j,P<�s�3�����VV�K^!/C��C�2��H^&��
"y��D2L��M�AR�@$��o�Q���Er&�Hc��hL������_� Uj)��_�����2�s�q�{��#��
@�<��e��Z
y[�,��N2f�L��}�/v�-�>�'9��\�H�IKrK�JvRw���]�����}����l![�
�d��M �M�G6���h��6.����\��v9���3r.D�C3Y��:�#=�Gz�������G�@Y}�6���H7��U@����v����e�v���h'k�]�h�[��v�����6�m��d�6���h#k�M�`�m#��&s��u�Bz���B���BVCY-d-��-�ZH7��Y����$sHd5Hd
Hd-H2�D�@"� �%d��(!�P"s���PB�@	Y%2G	���
%2���7�7����&��M�������&]�&���9���A��j��ZdA��nd��FH��^�9F�j!k`����c�t����j���}����N�^8!�� ��Y'�Z8!�� ]p�t����w�� 0L��0��a��2�0Y
�d
��0,����
�2o���(��(����%�!J�@�����%]%��9��:�'=�Oz�_��'�����~��e�~Yq��_��?W����!�c�
A$��/��Z����o����2\w�p5T�p�d�
nv����Xe���
h&��T�JR;H�'����;N*�RIR!esFe�r�r��%%�_9�$FE�b�b��%�_1� ��t����
@�*�{H��T�M�T@H��H�I� $H��i���<^�/���Z��jrr�J��*� b���MO�&�U>�d1N�;��]��*�8M�|)����U��Jq��R��W)V�*�2_�X��s}��(��ql���?y�7Y��&��d����
�&�4D��s�5=�Y;�<1�Gb�1����Y���1�5b���yEx�d/��?&�������q��b#1�1�W�����z�"�X���� �c���q���8b~�.?�����
�@dy�+'��7&N����
�l�QEr�xeH�#�_���CI+���_�����q�1�#w���x<7��$�x��)��X+�j5�D�Q�;s7��a+�'���UxP�W�7�M��;�q�WJ{������)^..g���8����nB[��CbK�&q����xYn\�b�x�(�~��}��&��[Ut�I�R��A� 7�t|nUMR����Oy�r�r����Vf)3��Y%�*�J�R�*NET����#R��3,
��������ao�1�����M�M�i�l�/��k��s��q�����S0jn��9S�MqerV�*�U�\�:�x_8:1%�sZ��dU��Y\da���C��\o8��!G�<�T�X�O^m���O���lf������=��h�$3�M��Y�d����~��q���0��s���pS��Lnbh�?��!�p32p��0�8���|���<�[�4z��t>�^����
�t7���2M.�I��d.\D3���k���d*�[�z�r����bC�@�(��D�C"��E�"�'��@R!�E�'1Ec�;Oc�������),�^�*D�y��y��[n^���^�v���P���Y����K���%��u�z�@����*C�z����9��JK�c�Ri�wa}x0T�Z���6]h����|��}�����'�Z���jY[������jX����u@S�uW�� �j�u�m�����>�)��$�c]�ap7h���;%��N�QE�E����2�h�8��n�'�0�G	�)Q�w
�-0��h����g��V�*Qi�?��N��h4,�oXV����%�����)���>]������d��@'@S�`vS�r�����!*����):�|�r��Z�O���
����5�r�]�*i@J���WvT�te��<FVAY��qdU�X>>�U������5X\��y2��<e���`�+S��)(��\�e}�}E}U���E�U�xr��NWf���m����������
C�u���t,#Sn��e�p�ey�ga�y�_l��W;��w���T}��G:�R�@�y��q&�-3�>�*]x��tu��
��PP��
e���4����K 8U
>A�Q����G@8W3V3C8[3}�B5c5��5c5�%����1y8��M��x���0�
jqYF��B��)A		VP��i	�h=/�J���{�G`�p*2]�<�����4��TK�p:�09�<M&����R& N#�B������O�N��Kg����M�PM�f?!p����%�#���&�9������<��a0���n�l���=�=�]��o�U�La[�5O������u���&��Ju�~��hP7�w��&�U�F��G�������M��� �Z�f[��S���Fb����q'��'�9�\���Z���W~�o��h�����@$�ev�IP*��`�*��
��$�l�e�U�&��#e����o��$��[�~y�-3�\���#m��i��:����snL$����97 A�r- <@}�0�a��^N��-�.���x����UjA��#�
�D$=����(�qN�a��� $R3���,�������x<&���2�������
$%�8o_C2��w3M)�t�0dbHZp�q�5����������KU�^�Q�9��8��c�3C�p�!OQ����9��u.U�#����67���x"���'2�d��T�!S�tg�f��yWf_���L6/6�5�I�1�� aZ(�*�mA���	�������b�����L��'m6l�h<)�"���xV������(���B5c�HG��\���@�-�����YbFH��S	���T��L�)k
(H��VI�Nw���4dw�&s��\	3�h��z��#�����������D���4WUVUV}�l�2�2���f�(�
N��t?�����dI�u�*��U���e���.�!���AT�q t����\�&��������kj����,MO����a�����
��Z���-d'}��e9D��,;��T�,x��^;�h���=����N�I��GI�UO�z���HF1�8��1���L�1��
��#N���v��Q����t6�d��U�!�_yqDy�9s\�I�*<	���tv���FFM����	a/V�E&��q�2���0�A$-W�-Yz�*�1N&�e��Ti�Y-���������W��iVe�x�-���^�M�������o���u����������t�]����~F�]qy�����;�����/w�]�r��g@������=|"U���j����f�����S��Q���\.O?U�~��E��z5�z%��������q������R���z�i@)���'�T�@����z���A
�����|II�.c��B�����*[���\d���l�!�l�X����\-��\��X�6�7F�����.�M���������F ��<Xn*�zMhB�nl7Y���C�3���?G����o�,�&��+���
~'d� ��n]���<����)�v�����f�����f8�kt:��d��~�f�[�Yca�Q$!m�>$?W�e�J���d��5��a���ia�����nK�E�PK�J�I�R,��f�
�T`m���(�(O���dG�K4��#�����	@	v�~��e�;���/�L9O����\
�j!��.�j�����6�7xA_e]�����<&������W�\����[W�����S}����^}���mM���5s���}6,����Gq��G�{�c pE�4W���pL�t��[�A������z��'m�p�t,m�t<Mo�+M��j�A���Ey�����N'�L�Yf[0}�Q/z�=����R��>�N��:�����	+[��9�I��Xy<	��N6�P��u�4;s��}�lj�=9��1�B��-G ��O6���2�l�`�����rSp������a����������w�q��u3g,m�3����5��lm��{�#q��	��G~�r�C��l]�Wp~'h��!��#R���h]�K�����<�����h?�j<Z�R%��b-)����Z�e#�fn�{���*�*����JT�I�d ��K����K���4��F��mV�U��=�f�X��=fs���U�n������sO#�������W�����0;�9g�Y\=JP�D�)}��w�
�b�
�i�*�n�f�]�A��&����"�u���U�
���G��_p�����������!��Y�Uiw��m�����P�2&<�C]@���tzhB�2�0�2���`s�3z3Hfd(�yL��f[�$O�#yqr��7�C���$�qg���O�WN6��*s���rO����y7���Hh,�} 0	�bcI����:"��|��������7����.R!zQ�����w����=���\���C�[�����Z�����uK�}��m�}��s��((=������'W��3�\�`N����K������n��y����9��5��Z'��LK��U�T/�/�vv��!z�7H{
�2
���i��a�����4��_e{��'��4
S�)���t)e���q{��	Y���D���O�g���?e����)��;N6����M���"�<{���/�����W�*�7�T����7���,��\#(h�4\��t��_���w>��������Y3�k�3s%��������~�	�����'^�^s�}�{�����mIPpI:<�����4(�l��P<}���Vx�i\_b��CL���_p����2���*]�>���<}j���?
�����F��k�������|���5�����t�B���N�_�o
;W���N�A��
�K����SVj*��H�	�);/�#93�+s~�!9��GXb0/�D��F3���efb9Hc�"���2L!��.�<����I�AN��k
��a
�0]HA�+X']�d�f��T��<}X���Q�I�$�-�s51��� by�g�?�c�}�}������j����������"���)���PMh,��N�e�J���L���
C��~1=38'gq���@ `2WS�PSS�R�H�+�����b�{|~�_��f3�����76�B�����%����7����WZAI�>y������-+�����_�T�����\��?Ox�����{��"���jl�m�oSF�;cf����e��ye���%@`C��A�0!�s�;�T[]�~Y:1�S�����s�3�U**�I�Ii�
\��)�!�A��j�����B����"M�5���Q���0K�'�gd!z�C/pf�R����>:V���c�0;
���@G#u���Z���km�:�e��0D�&���x)S�[�,����
��#�J"16t��dN�5r�]�-��?<v�����?g_~�<���c����d^���\��N���Vc��-5.3/�������l��.���P�����K��i_Z�S�ML�h��|����-�)/1W��t������������6d>dSd
M2x�l	���������F4�K2�� 9�� �-A(�C�����.�mW"�E�YF�i6�����X�����<����������FN������@�R;�����Z{�x������&�bX������-�Z�8������m�����e��l�s��+W���!�J���Z�����d�@>;B�;�Q[���1���n~�����>�R�,8�\�h�4g��Rr��yS�����;��z}�1�*�g���4Z_��R�,���
��5N��tF����f �Y���r�|��te������td�P2��~�������8����5G�[SB-�Z��2kG�e��$�1�rf��&��������!�YB)�r����.���O����`�%df��2�`� ��y����X,3@r8f�f�����5��b�'$�������7H����5j`�$�=$���5c��M�@
�������)w�}l��z�
*���Ra'���������_��wG��������[�]US�8����n��0�x#q��-S���=uu�L�$frm��R�3�kVen�$f����n}o)�F/��,'�T�:RG�4�-��y�������L���'��m�����������3�1��>��:��@��lvk�^g�q��e
�'�`��4���`^AJ��)XL)���.ox�^�F?MS�U�p*
��>��-Wj����Z���������Y�Zs�'�q�D��S����7�b���;'7S������l�}v��V���K^G]��_f\fY�{]���e�
���y���>P���46��
��b ^��"x��v+��2�����M�����k_z�����lM���?��v�}�.���?�����z�~|�����]�|������f�>���'�~���'�~.����E�f)L�)���Z����jZj�I��dN��F&�L�Tk4�����Tj�������.9�h��	�9���q�A��@X|6�l���`��X����R���)���3�H�M�s�%��P#��F��Mvy�U������E�&�)(/
����-|�93q�=����������vLm��ul)�5��~��c_��H���L��
V�"M�)}J��RyH����r<X-zA'�����O�P���X �p���x�][���v}��C��f�&��C��qd�9�����\$5���X*�f*�����r%X�j�����J���
������]�pb��kT��C�%~�����w]m4��������=���q�>9,��4!�o�I/����_y^�������|*M�S���	������A��M�^�:���7�:�#�d���!�OK�&�+:�����*�z�I���Q���iq�lu[~o������;uu���#��/�o����R�'��|�������H^����L3k����R�L&zF���AP��$��h3�J���Q2��}��������q�x ���c�c~y;R��*M��_�_��j+�RM�_����%���������/�{��n�?�����Ok������U��\�#P3Vs�#��I�Y�C��(n
0�i+����h��J����?��:�[��}���������L������tB�tV��viQ�4������s�,�����0Y�����[N���������1��&�
vHe�=m�j���sT�zU��K�W�FnR�
�V��!���l��II�/�\���L�z�����������,��d�LF5�Q��`�_d������'N�����H�F~�����E��o����b	��=q�n�]k��5+��7������/�����o�������SM����y��c;yD��j�.�~����W�����i�za��m�;����u|�j��.5�Z��3Y8��Z�jKoO'���qK����p���e���km��M#iAs*������V�Y���������������=�b��$�TlN����%^�+����������Y�Z�MvKy�yTp6-���bs�a����i�F��#[uI	��uD'��|�AC�B�jA]�nWsjW�y��,0����O�9�>�l����H�i

a�w�/L�9���#$�e��������MQ�lv�;�u@S61,��x�L��Jy�M��f�ny}[���.�4���m_^_����=��BI�N�&CyEV�^��n�{1?b���@��4Z���~����3�w�s��|�\��C���]�C:e�W���]�������rw��9�*_��Q{�������������t���
oE�R��Mj�C��egg{�9�Ra��������6Z�*x��P���^}/n���x����h����Io�&e�A�����V����n�%��#3��bG]�n��Z
����0�S"�P�_�dh��Rk�Zg��8���@@>��[Q���Z4
���
�m����4z�`��������sy�I^�^G�\8���-.t5�)Cc,�����)��>��������`v<92����#�bN��t�e)��
.�cevc�#����^�����s��=<(g����(����\��
2(e�2� �[�k�^<�P�lCN�L�1�(Mpg8��`���A[�]����Z3�.UT�,na�r��v����"��s]RvN���W�E�����s*��g�+<���0����CB��=YN���Zk����'�����Zg�U�lLW�O~5����K���h�����9�����O9�V������	����b��U�����+�}�����%�5-h]Y���������h����e��S��y�n>z����SD�7��q��Mw���(�_���l��'�#����^M������Q������`7�Y�n&tP�R��9���`��G����o�S{��Y�`��^�K2���bM1@1.��/8)�A}v�\k�����B�,��>�	���.,8��uK��
�)Z5�):i���!�$�'�k���������--��}��H�)��2�F����k��2�3����
����"�DVk��������j�������Fs�d�\P>���!<>��s�M����s>�b�d��\�A%{)�K�^4�<9�*��E����c*�V����V�3u�]�l(����A<yF�e����pjs�@�K�@�2����oi�� ��Fh b����N��Z�pZ�ZM*��QW�x����j���D/�9���4!M��j��D�r�Z���f�r�I	`�W2�*��~	����;N&��c�hdT�8g4,��\.�"�5B���������X��B<�I���V������y�]a�BOb>����[Q�������T�o~"����&�E��Wp��W�*V�?�yP���q�S���E�1�1�^m@�n�n�������?�~
�B�P-|
�`�7�����Su2�u�<���:�����\'��N����+��l�j��L�O%��}
�N�;�R�Oa�)���N>�J��2�q��6�S�u>�L�����%����~��5���k�Q9U��<�i���g��2�����8&�*���z�����
endstream
endobj
120 0 obj
<</Filter/FlateDecode/Length 21>>stream
x�c���W�������3��
endstream
endobj
10 0 obj
<</Type/FontDescriptor/FontName/SPSPNX+ArialMT/FontBBox[-664 -324 2000 1005]/Flags 65568
/Ascent 1005
/CapHeight 1005
/Descent -324
/ItalicAngle 0
/StemV 300
/CIDSet 120 0 R
/FontFile2 116 0 R>>
endobj
116 0 obj
<</Filter/FlateDecode/Length 25743>>stream
x���|SE�8~f����I�n��4MJIK�P�^�E�"oh�HJA�EQ�UT|��T���]X��,~�e�7����U�eQ�&���MJ�u���?��sg&���9�s��@m@@�u��i���/��^4��w�5m@�~������<p�o������_����`��ysf�*uE�EP0��%7i�Y���]8k�v�!�\7��E�^S��!P-l]��L��z��E��c���~�w���/�x���_�O�:uM���5���/�����m�^A'�
�at����:xn��`50
��{`L��!�IwB)<
��C���p;t�����rXI>���L��a,���e��0>���j���E�-���?�`�9x��7�=`/��Yp(����'P��ax>G�v�
S�
��'a1l$q����� ����1p�Q������n%#�?��M'����06B�.�A~zzL�8�n�6x�a��$�>BF�d���I�@1����	@H�gE�xpC�`4,����
��������U����C9L���t�W�G|;�/'�s��#�+�����yQ)���"�?E��P�0��{�Qx>CQ��a�,�wV�KK�A�<O���	���Z��(�+�g����!n+��8|p%\��K�#���h<��C�����:��C'�p<	/���y�������&r����*�^�D�1u0�?���U0n��<OA'����!|���GdFf�� ��nA����}��mE��=��}�~@�Bg1`���A���q/�7����0>����?�'Q2���&�����d=YOv��p^�0��+�
~�����������H��{���g)H����jOu����|�3a&���`<��dDn�E��0t�f������@w���y6���}���6a��<��c�X|%��[�z� ��G�"��O.&q2�,!��� ��O��49G��4��\>�����n)��5�5?���R��	����q�8L'���:q�xDj�]�*�������������q%�������\�����m��7	C�Pt9��"�!�:��O��dj@a>.��&����{��}x��	7	Ft;�^0B;@��2.J�����H����9=r�n�����0���	�
iA��N\�?+�EQt9zV�$T�~"i �r��j�W��?C7�w�#h67��Jt+|
��O�"�z���@o�k�58u���*@���](N6
��a)���y������1�I~���m�
Z�+`�����AS ����VR��!X�a:��.pC$a8��
��G��1��Qhn�k�S�z�t
�p��f4���IM�i�_�c��p}�A(I���[!	[�KX[���-���!|�.�G����t	^�?��������o�70��k�?�D�M�M����1�
.��p|_�%�T�.�;���"X���/�H����X���<����%���n�9xBz	����u0TXK��G9y�p�v�E5C��W���(/+PR�_��0.���?�����]N�=�f�-f����I��s#(��jV��	]rI	��T��}��L%1��1	��
S.��TW��HU���D�R5%�J}HI�)I4m|cHI�WjR����6������bE�w��S�Y�O��a������b���9G_R;�������bH�B�v �0�����d*)�OxCu�	O��� A��3g'��o���
�J�h���U	�HX�l�d�I#"{�r
}�W�Q|`���W5G��C�gNoL��M��h��K�n>�>YR���l\��7���w_���5kV+�����i���.).)N����5������7LT�7&����Z�TR��7�o����P=mi��$t��yk�7�T�5	��,����{���[����
&jsCM3�|;��f����x.�))�![���a�d��/0���Al8�&��,�3
�N��	e���	�������f��� �4����������$t#���Ch;�}��!e�� ��C���ef�E��
R<�E�����h��"���@�`�XR|C�B�d%��������4��]R�
�7��U%��D��F�Z��r�A-�6%p3�9��qL�=m����7��%��@�WGB��������yC����=G�o�j?�Q�_��Y��I\i��{�2P"gd#���������;�^4\8�������4�QkA����|�V6����������W�:���4C�^��������I	.�&M[�FA_���Gg��J&5��	��� �	'��oSnB��H{&5&pXk�\^0077555Q�,)��f���2jM����t�U!E���_���YT��E�d�������M	�yRR���!t��*�{���=2�r���v����M;
����(*k���6��^@j����%6>w�
��z9���g%�6)��`Vkmr�
��$�����~(�9��/�0�l*��@��F�pf���e�r����4�PO�d(�)�����60]Hm�����I�_G�xo)�c�0)j����B��>4�&I�����/!
 (e�BXH�N�O�`���a9�C��v�����?��z���������G|�J@^H<��x����{��4q�qA)q�X�����6lmYH\���`?q�I��W���j����eU��k+��L�rz�]vLm��1���n�6l�6��Jk0B�����h���Tq`��8�=���>� �������:
"����l��n�V��4�l���n�ww����_�����������_�����r|��9>��l��`?>��1������s��,�S(��B-�f�Oa���O�{�)��S�)���M��p-�0�2��� ��(}�^����hi�3�+7��I�~��E�=$B|���|�$�=\Hw{�5�$�k�
l^��@���������8|��X�����(,�G�
���(l�G!��R,�GA�GA�o����2|T|��� ����h ��GF�;����>��d���uV��_c�[�5�C��_o�`�����2~J�k���w���V��~(������x?���a�����gl�
x/�-p;|��_�3��jd��
U�Ed�E�2���M��V#�PZD��BUh�km���"r��
U�E��*T����+T��i3*T����*Te���$~jwAa�z������*�����7��3G��x{���$��F����P�>�6�=������Q�
�V���DmQ��Cm~�����h0 hCj��1����Fm�P[+j���0j+@m
�V�8�>��U���N�;.Van�A��AX��@`?�a�4�Rq�C��{�����_�]R�p�%�UX�_�u�U��
��W��
���@��_�Z�*�����*|�_�4~���u���|(��P��a���8��� ��|���aaf����J3�K����Uf�qP��}rT����!�������N�Y%k�v�h��G�����x�A������$���������9�
C
�Vv=|����_��h�M	$��=R�Bf��]��}�����
���
�IIr�=�G_��+p�wO�������"I��]
��78��m6tE$�6�n����m��|�c��qek�C�%0!2-p���@���������@���@�6j ���@�oo ���}SE>�����pru�S��
b�8V$V��bP�yb�h�l�,�%���$I�8	K ���cj�JM�������c��i�51����RH���0qjH�
W)��CI�?-��F���&�H�6$���Du�!!���qB�7%G��$�I�I��M+s������/���V���n�
��Z�0klT�/��2z����Klh���x1�)QA�t^SC�W�8��~@'�������q�~��@�������$������Q_�"�jj�#�A��@�������0���+�US���l\X�c�8D��h-���QP���hecZ]J�1o���v��l��
�fc�v��1�al��W_���cC�|l�y��)���f���;��$����icL��cL����������(��4k:5��C�s�C���{o��N�]�(;f5e,�H�U���z��DShN]bV�N�1t�/tO��CCu;`z�����9u�C�����uM�����Y��>�j�/�l�Y}�����]M�/���������X��=��k�!���������������Ny�0��C���s�8@[�mJC#���U2�d8���u�����r�>4����d�����54�K��.w�5u������%K[�,��������F��	uf]���D��
����w�b}Bm����m3���Z���
�!����������t�����K3�HJmxoR�h	�6���aN�&e��.���CkD���(j��#3�h�k����.Y��2k�$Sk��B�5�$��X��[��\�	�z	Z��p�D��9�8��p��+I����t@����p�L���G���^��C����;�$�A�j�]������d��}�W��j�V!���w�r�T�fLO������@m�\#�����5�j�����E�9d`��T�r}���O����:y2��V�y:�_sM��T����78�8]�j�����F��������t�`�$��3�r��(�n�E�]f�`u�s��evb��������z������m�V�) Xs����0��:��v��.����vB@�V!�v���f�I��A���r��a�[�afQ��L��La�j��5o2c�����\�C�aH��������Ow([�����x���|��>E�K+3+��rOMO�5VZ�S���M>�z��V�����x���X���$�[���#HB9A��	�����KS�������S�����L�;u����S��!
��<���������D�}8����)���\�7��6��\��P�DU������I��\�-Z����q�:,oy��o�F�I5��CP���G�h�����>���o��y�@���7�:�~K��Nc��}�[>�}�j�{�{W�����V��*/s�\�N@�E�h*��*�DPZP@&l	��s�Uo	@����y��"�/���,4 fd 'o �`�]��t���
��TY�t��P~����b���A�������f��U+w����i#���� ������G����z�4�6�7M��=��k���z�����Z�l���L����Gt�S��7��%��]B;�������d�H�Z�6Q��$���Mf)�*U���|%�(����O�e��s�..����P�Bl��T])��(8d�>�c��[����PES��������z����f����)����Q��p?��2���@���$���M�Js2}�S��ds2�S������&��Z�z<�b���e[f�t��m�9!�VF
#��N��!��(��������9��������a�����|�]���C77����z�Y"�����^������x,��c��w�|*>�TMw�|��JO����*/[���Jk��b�jA�^D�R-���-��������	�@���_�h�A�c�G5����C��X�a3��������|*~�����C7CuY�������� T������A����{#c<3��Q����b��/"x�1
�����[t}��q�+(�]^-(�30���Pr�N�$w��p�U7���������`3�fy:�$�M����CL�N�c�@`� ����1�L�*��#a��l'����0zr�	�D[wp7���7v��^S>g�����������F��3v*A&������![T��h4��l1#&��	�J�y7�eZ��V��Q���)�i2���hi1�R�i�S�����9��7�\�d��Y
���:+i8�F<��
J�����h��v���!���NuZ,��W�&�S=�@o)���FZ���[��g��xd_2}�+�����3}l�c`IS/����n������N4�q}�e�K=#s'�LwL�L�] .0����������(�`���ZxT� ����>�x{�KgKW��L��I��Eu	;]$W�N
���tt2�N�l�|t�n}����j�V�6Z[��T�`�J�3e#@�#�����7��x�F���h���f�[ ��
<����h�l�l�is�X�Frd���r$�/
�|����%#����e��z��[��~��q���E/��H�?J�R�n{t7z2���'�<4��kVQ,����>R����Y�go��ko�JBu��r�?&q/sH���GF�������U�T��1�#!��V�2��1�23�J�O��<Ya��5���B������G
�����P
Z	��h��>���Am�+��Ts�C<+��!� �
���g;�0��/J�p��5�����j8�?~��b����uFn�V(�ZQ%��:]i����&wN�0�Dq�je�w�N���&���Sr�~?�����0�o�o�O����].% [1VV[���!Z�R�"�ZZ����{h��0{���Xq�9�T�-O��i�w;������������Q����OcS
��
{���^�
�-�8��d�d^`�m��vs�=�}�/�_����v��\�'��~Y�]�$��c �O�.}R����$o��v��+��a��#&����u��"k�w�7:����,����������������������F��Z</��1��p�����T��R^��������'~�j��bs�V�D���5�Yq_����#�4�z�~B�Lri�� 
������+�����<v�O�=9?���/y��g���m^3�����z���X�s��o�5�����g�S
����s� �:w7c��R���6x�*�<������'����d1Z�z}�����E>��2�6E����vO�)���P)��Xm��-w�b������b��h��������z�*Wo�j�!�Lp^+���v.5-��2������I�+�!��h2s"
����)�'�E4���v��������~���L����[l������B+nJ"J��*f���C��c�9�#�����W��%�$���u������G�/N�3�ejy����{�r�qJ�2C
W���������&�^r���8�/V��Y��"�B��)���,���m���m��������w���Mo/�z��S'��>��t?�:q��O���7�6����Rv�1�}��'�{�����������.��	�����e�>&�!c�+u;g%����J���Y)J���Rd�K��v��+yV
���R�F�<�F�V�[F�2r��!����(���F���6���p��h�D��	0������h���\_����j^�J�zM�`�wL��'Q�j��P��(��`J��
��L0�1�
N������;�I�������r�%��+��
�T��k,�2V���e��SL���m1S2}D5T�H~I�pyy5�Mq��������!fl3���1�3�%1�;�z��>������Z�Y��
=w�'������h��d��K�O=�9�p��'�~�
~tN��a�ic��1
��(t���.
�0���J��*4���A�a����7�o:�m��\h���D���z����Qf�p��g2#����L�d�XT�P�C��+�9(�R&SdcB������5yr^�y�x��q���q��{���/3t��IS��������b7���S?�������fl6[T�Lj��CPV�Dn�J�g�!/����dN�_���17����,Sq�Wl��@�������w�s	\�\.`/qj����u�[c�q�[>�Z�o�AL���x4�n���Fs��dPe8�"��'�����wtn[;um����{v����HZr��7{P������ll[���x9u�����y���c���I�9���1���I�6�dP��%e��,�z�(��'�43db�����^3I��eZ��h"(��_5!����s�|����#��W�<��.����I43O\[���C�kY��w���JxP��:�<u�4�$e�-����f+K�����*�Q���**tc5�!��9�(d"�(TBJ�vX�,����d�E�`��y�.;gD�g�(���p/f�[e�Y2Y���L��Mu_y}�>��9~F�~�������(����f����83�\l]N�[�������^KC�c}Q�[jL����3x��Aq���R�"1R��A���F5�P>X�j*����������|�ong��������'�9�������_���gn��En�y~���a�}2�9���k�oG
�V���[^9�i����S�n�fpN�0�"�|��8�����	(�)���M��`��c��G��"���c�4�Z4-D��<�1R/HK��S������@�r���1M�C-� �� ��z&��6��0������{�3��>����&?���Eo<A���_s�� ed��eMu/c��;C;��|v]�����W���#��3�>�Q���������W��K�R�a��s�8����z����(�����F]�:"��l����T��rr��lA�����L~�Q0���]�o�PAi��54(�:-n�u���(�K���a����F�������B��N��b7������M�8�EX��?�|�F��sj>m����t0��`oz������Pif�/e+����9�9:����2^W�n�����������t��^X��r������L������c�O����/���V%J��O�b������>+�e#�,�S6����u����\�fE�����&�jQ�PU��E�`K=�z*�����q3�nMD�x��(`����u�:����P������lc�94N���^����E��Rg>(��Z�~Q)�8�#��Y�sVy��U�r�}xN_��/�����]p���R�?�0u���V-�������=d���+�l�c�$�����?�|����{_:���h�����1k�]��c���u�/n���(e��?��,�j�.����L�2�Q:[�`/);e����<
9m�2��I��{��)g�}��;���rF����t�m����d�4�o�&ipR�G�mene���Z5��S��X31��0!fH�d�0Q�(���t�F�LJT��+�F)5�T�~�Z�#��$�k|j��YQ���^��{Ti��UC��9�����������sC7{o���������������rZ�����s��)�-�B�-���TwP�~���T5��G��ib��N"���@�|Y/O�����6�e���l��Jk��j��b��.�f�ft�4�o��#�j������[>����*�YI
����M+�VR)�VAe��fe��bD��nms�:s�m��A{��u�������<��G�����������F����l���S ���#yc���R_�:~��T=���k�o�T��\=����4�������������S�s�;�_�E��,r��5���9�KzK�58=^�b
�l��r�V8�=m�ZjN�{X��k=Z�z/(������%�j�e�����s/�2���in'�8��j!/ N�A��0�b���pF��a��~�:���x�q-�o��3�a��������Jp`�rn0y�~��[z�<%��7�] ��QC�����\g���)�����r�y�\�4=3���9$�����3����p�S���8�K����@�g�a�����`���l��
�Cg���L���D/�<+K�2y�4O�,�M��o�����A���<N�gH��4���O��3r&�Lz�qF�YD���9I0��������hE#&��9hQ8����:?�K~�I�H��d�F��.d������L��>��zqI�T�8��s#YoDFz-[��"^.��X�����4D����������n�G��wm����x
�du�8N6��8
����4<�����2��0LlH��Ok��,D��'i��&��%�_�_4
�J"4�M"��Hp����O_�y���?�������B�Rux��������������C+��>�����
M	]j�����.��Z
w�w�B������y��>��`N�\��������/����V���t����e�T�&j��#���}B2����A�Q� t��J`�#�)V	�������N�9�n���G�f��{)�;��y~��`���~��L��4M�D�t2�A�C�Y�!:��W���P'���z���>�]�5=5�X)My�R2j�q|�y
i����4�
Vh^�H(h
VTS�M�
8�������\7���kS�B�|i��;�J}���22r��I�Mm��������������h.'�����^Xtv�h�`��e��|u�k�����w����0�����	u��0��E�$�
��[�����=�����=8���<���g��#��~��>%p�i�}�c�g� ������1�F���b��9����������~1�[����@\n��Hr��N�X.����\l�\l�\l�\EB������R��gM��u�q�����c�3�������i�,E��K�B�x`�����s�*���n����|����_v����v4��OP������j�_�_���g�������G���%����;%�|&��{S���G��S�4�,��<�M	���FV�o��������j~��t(���:��:=���YCh�R��<�����|������"�OY���Bg.�`qy��e� �+J��K��	������7���p��9N"�W��� rK� r��A`%������)���.��a�f
+2a��H(����i�7-i����L�@��/�����oK��w9�.�*u������������~�O�i<��3��T��$Q�������D�m���J�D.U��	��&N�5����0������.azE'3�:&�(�cN��P^��Yh�g��sO���?����m���S�m���������Uo�=�����R����\8~���A������I�/�����W����}�OT��`���y���wf�=�&���3�����xq[!*�c���/�-^^D�#=#O=�*����+. �1r�tfMd�(���k���0[L�����e�2�����>����vD)�EBa�)/NKNYm�E�B>"��Dv�+���`
H0��6�(�����!���0�K�C�4t������A�Vr�n]���Nm��@�>��������v-\������~����p�������=��?E��s��-j3���wo:���mf5�R�������|���2Vb�������3�'H���8��KKX������d����a�0��w����;*����L��iu(��y~�v{�xz�\�����<!
X�!\)�0>����X���ZT���;C+���wYZ���T5&Z���6��5#�7��B����5=��
�0���TZ+��_�

 ���g�w�F�9��A��@��E�-#a8I����3A���(�s�*��/�ZJ�^���]g�����������T���`1�2����O��&+���:)����~2�h7o1 ,�f�tXo�d:kC2}fe��F�2��S���u^��C]d����w�)�Qm}!���	�$��X��RJ�PC�L����y_�J1k�L5���<2*z[����2@�f;z7����S�2����=�����u�EO��bR���F{�xl�2s���-��J�
�U�7�Dgmm!E\�Tln$Wp7�n2�6I�K1� �X�@�DUca�?�#�
���(���l.�����d4��x����q�eR����&��,�}j�����o*o�)��U�Q�gF�fR�����]x
����Wp�-�nF;�Q�e���$��[���6��I���:�����x����%�FN�k�}.���][[�%�f�y�nf����)�o;X^�����1}��Q����iH'6$�����?�0�ik&�ydW0f.����������;Kb��L|3���%-q��J�A�t
�FAk��B��(*@W�9=���MM��j������%�'�����9;�;vV�����@��e�!�d��N�4���G��6:�����x��	QD�.�	�t�:Q�c��[d"mQu�I>��x�����i&�?�Fq���8C�a�����^U��T	��d�qT�/,Hc	�eT��T��C���h<Z�69��	S�h�����F]���j���F�4C��n��JR��*�F����49��SRG�H2}`����Vh`EL���|�]���_���5�e9B1�l���z}jWNL4�i`^L4;(��G,���C�
T"k�5��O�Ap��R|����3����m�\q�k�S��!9���e�=����q2g7���V�.��f�r�s���Z���u��F�Q7U�l��3�5�=�;5�^�cX��	��
:GV	rd��A)�	
GD��M�ci]t���<�X'�e�K��TE����z��P��I!���%B1@��c�}8���k��7c���=��3����U���R-CeX�p$����h�;h�K��]���������1�]��H�	���CO��z���S�w����.�c�0�B�������%���]`7�T�s���An�_!_a�F��l��*� ��KzA��.�0���ZY������(�����W�"���n[2[2[2��*�3Tc�P�����������-mfY��xO�f�3��vZ���^�=8����K��������e�w������9������y�������0Qk�i~�n��C���`F�A�i�������2J�YY�L`�e�\����,`K����y�l������*+��+��3�%S���?w�E�~�W��yY�:���
�/�]�L4L�]�[������R���VK�r���E6���b�Z-V�Qg��A�S/�h^!����.����3�X�T����-�n��,����7DuZ�1/L�����3��*��W�	��W
���|7��b���b��_�jh��_��2��9�����q�(�(�3�JY~����fn��@F�U��j�Y�!V�����d�9���������-�3����o��������:��8�\NWN����P�!&�o��k�{����7����W&_?�$�����
�?�l������'���._�jA�w�l{����e�cY���_s�?�2��(��/"�@�Mp�8 �YyX�e�)�
4fC�e��~���"�J��Yd�J�p\�p ��F���ey����Q�I�8=oj�{r�!�<t��@8D�@a���k@8D�@!E�Q�g��Rd�
���K�)SB���������q/3�l��r����5���1�cZc�O^Ypg�A���?�B�#���W)B�"���(���`*Y�{O.�
;M%��0
�N��?�9�/���N��{�j��5��Vq�QX����UK�f�������(p(\�o6��[�U)���"o�J��i+2R�8����H@I�P�%��������l�K��Q]oY�"TD����if�)5@�Y��jMF�A�
Fl(b�Jl�����co�D��S����g(�]V�e<wrO<z���*X]��u�6��M��~P�Ft9�~\Y��)�p1���x�v��s1�B����M3��m���M��v�5so���g^�wY�mM<�>ll�y��'�H��1�'�����h���r��V?;g��g_��
�����ble��~Cw���p��o(e�p],�����Y)0_�ov��,|��|+�x���'�tI���h����p)��#�e2�O���s�!k�~�����\1��Y������Z���x��s<���N�h��>6:�i��B9�<nM*�7m�v�������|>�vdT�K#�(�%qN���GU7T�]*�`�5�"[iR���g%�=���d��.����j��W3�r�CuL�,���D�s�7;9����
�Ve���I(���>���^1��2��&F��bDwP����\�c�x���������BUZ3��@k�UKV�r���N�=����E�\�������s;>M�{�~d���=���U� ; ���h~/�Vs�b�2��r��I�T6���Z�C����i�l��������o��=��nM
�ZbV�����2y�
����>�@b��x��#��~zH�����jkkYnU�ZP���������2c����z��I�A1�3b$�Im�!�$������r��N�H�]�x@H��c^�0���I7GBs��V�b�$�&���DC|&�����h��1�-V���2��_��O�<��ww��h�v=XB�nY;���t�k��k&�g?��,�d��:��d�-��Z��&6$1����b���e���lAT�����y�}t��@~	Z�z�+|��?�-��&����<�
T��������L�\��,VX���L��`6KR����	�'�<�p��Vbc�gZ�%�=��L1&�G:
Z���9�� b-�E�1d���G��������0��h�����d�Z����uT�U�:������O2^����6.p,��oqXxu+�$�,�aZm-�+��=���y,����
F��d�X�����t���d����Bk��Jku�C�)�c� �#n^�����p��:��a�;6��bQd�]��6�Qr;x�U6�F��e�E��$�v�lV+H^��+������*�h�.�F�<�$�w������1=^wO������~N�W���k��9��2�������F��������r��,��@
	�����b�MOC���'6$�������<~Zc�Q����.�C�s��rU�h�cs
���D!D#�=������������cC���^M]�7�N���������>���
�g=����yo'���Q\|�2����^x��2��|{���G!X��
�	��G/�h�n|���K!����<;x������D���������������F���Q^Fnk��;����������oQ�u&]��������4�Q�;����qS��|�5�����Vmt>��jr��b��K
���b�]���z�{��i?���h�M&�l��,R�7�BW�-�_�b���K��HM���k�7W�2�l�9j��2#N.-�rU��E�p���\k^g�dN��M��������Fb6fN1�=�|J
�3lf�Ffh��9B���|sf�����E��f"������w�P_���C�Lyf_��OY�u��F������T�g��kr�������h�h�&��9�	U|��l^Y5� ��P��*=
�D�"�#|��Tg�$�G��)��������*����1��D���.�h�
��K�\��a)c�������LB��#2L�33����
nf��TW��233}YM(���`'59��-Qvd��W�`����_RM�x���'z\;�������Vf������0������U��)�a�)�N���t�"D�XK�4����3���[/�������������������:W�>�����W\w��g"ywN����W\n7��a��%5��[�mPg^:���gW^4}��'�SzI�c/��R���	�z}e�Co�� �h)���<_Hp ������-
�Crj�5����y�R��h�;�����5��\���{ ���#�G�/r������y���G�K-��2�����Y��W����;#e��0���>������Y�F3[�0C���H6��fC���"�F1w&ds:��9�P��=m�g�^�u"�{�d���0Ck%�qp�8�Y���JM��m��G�Q�D\���� �R�A���(�#�z�i��FQ1�C�)n��N:=�f�d���<���/�t)Z-�b���x4���}���cmw-K}�8��-��%�VZUV��C�P~!���������x�U�[�����WM~����_z��|W����]�vk����'�����=��{��3���'H77���{��>���`[A+-���5��h�U��d1 �\8�� �}���DLd&��Uz�Oa:��f��W�/Uf/�Q�72g�kb�DWsN��q�8�hzN~�k�L�||
��/5.2��~m�����i4:������?�����B,���H��6�"&���I���b��s�YP��e��p�R`�_��ev��,W�N]�p��
 !�1�L�W�?��*����%	�	F�����9
�( ��X43����Ld�I���-��V�5-59����38��#���.>E��gS��R9~\�gno������@�b��{M}�S�fG����(���o���I`�g���_|���������Q������O�.�����r`�>A>�r��~�V=�LaS������M����}s�l~�n���w p��c���/s������K�[��@�KR��r'q.0
p�M
��4�>�7U?�4������:e����
�r}�
z��\�t~���N%��v�][-�"A!C��,�gE�U�6[��\@���q'���+��OY��V�������h5�}�f���l0�j3#�%6���v��a��@d���%��������Q�+�����=�����H�^�3��>��zy����zj��H����X��{����cp l����VZQ����\�����������Gyy�
�o����W=�����Y3~86��m�����?z� �H
����CO���x2��q�d���_��c���{aKu\�@�<;Lf��?c?���mC<����1�������	�����3}7	7�������\{�5k�3{.�~I�������[v�����"7J0QE�UCQ�x�{��*Z[@���� BD{L���r�S������mi9�)�����73a������Y{�evf����}���>N��L������$I��m&1M�L)"�O�c�����\*c�P0E��_�7L��5QG�+��7t�pVg;v�u�Ur}c�G�z���C�uy����AU�*����=hv���I;0kE��1�_k��� ]v�Sp�]_��+ej��a:s��?��iv�r���*�6���[�\m.��V���*�-��XH��\���5��d�__���7���psD�}��{? s�3����Y<?�d/���pC����fz����Cw�s�6�"���(�T�dl�G�G�����j����RBo�{��qg��IT�+$������$
RN@����p���p2��;���#b����9#�r.U���p�a�7��2� ����������}�a!!��y����F_�T�'c���~T��c�r����a#�f+������,�
�	��k�LK�EA�`�vY���9�k����:��f����%?�%e�"��0+s��uk0q��:�g��p��os����<�w�)�.���o� �&�p_������,RU>��fC�SB�\�hR�BM5u\�|���|q�z�zB�{80���~b����77mk�U���t�zn��U_4b���z��EM�>�?R�����[���#�zRA�b3�F�@����HD}�f��S)C��IiJ$��m.�������C�,��b3�DE7Di��jd^x�(x���7��7f������w�RN�G]�4N�}��s5X�jgQ�
��
��
����e�m�c�d�*���ep���j���`�bd�H/����`���x�iu5s�����i���7�=4�h���Ye�a�n���	�B�]QF�X���Ma^:��lAv����k�Nu�9�o�3�kz�p�{�����;�������7�{�g7�}fAbNv�����[?����[6|�������k|�������+��nB�;Y��^@�R��p4Bxei7���S����Q����cd�x1�*�?�j���y#�����<._�q��#0�F��6�:�.��"X��0���}2�~@)�.4 1Pm������^�]������|O�X�\����"4BBp�CpICp�CY��b6��c��4:�!
�����p��-��Bb?f��r���@ ���:;V>q��y������p���n�rV�'B@�-�u�Hb��GQ���4��A$�j,��B�������N��n���Z���_>���������n�{���/"��D�2^%��N�(���^���!.����o|��A!B�l�+�k�"�q_i��J��8�~9��[�G P�`-�Z�5v;�$����5��u�N�,y��M������c���*�Zy����uH.�J��)�d�w���Q�c$�U1�#��n�=���B?=F	�iz�����aG�.?��Ov|
�
��C�;f
t�t���Lv�s�:����1��P�e5[w����?����0�;���z�8O��i�3�Un��E+4�	�y�S�h7�`�������uXS>��I��;x>*��H)�|a]%\H���pJ��S�
b�26�p(Mh��U��*�R��Y*���xv
=�&�*�t�J��5vU	�|��j<�Y����(��[g�,����f��9d��j57o2%���L�N2�$�b���4";���a`S����xuf\��q��g?r���w�����y����^�q��w"�}�������������3��V��4e|}��	���2g���?.p�_���Y�!�
��9����W�-���y��]����&�v}��aG5�pn����<��S)���,P���S>�>�����T����Yz;m������Q�u�?���a�s��
_Ka[Qx��DDY���%)+
!Q8J���yE��2���RPU��>l82O!����)i��M��Y��
'Q�S]m�,r��5R��I����Y�0L���
�tj��,#$�C`#��Q\��X=uI�A�=y�5�wg0�@������H,��i���+��Z�V��O�q�|G���X��]������MY��hXiOEA�+��;
�]���
P@��'@,��wW�pw�m>�mw�4��R}
)v���C��P� K�P+�uhw�}�/����qg���<EpY�K��`���?�yq~����[�����{�k� U7�����������������������W��|$��]t�_��g�\�|��=������N6�|���������c<W�_�o�K<��im�=�&���-����cL������Q��l�w���$/8�)��O��?�I�;��Y
��������%�����,L��Q
��P�s)f��mX~��?})���	�Q����F�o$��������
�o����7l�a���?	4���7,���Q�o�}��tf�z>K����������i��5r,��9��2%���[�BM"n*�xsv[�d��D �����M7��@�&E����gH���cb�4��D�t*cR��5����I8@r�I8@�%-v�$����K2g������!�$;B"�5��5��j���d�A���B��q��*���`$EoF���8!���&p�`m���s�����!)�^w�Wm�����v�B,�mkc��de�eI-�iV�z�����6�aJ^�B��t�k!�-��n�c��G��������tL���.����h�C3/�|���{���V\:���!�������������D��� �	A���g~��)x�;('3�����[���C�R���P �S������-�m��q��5i^�61L�""u(�Q���x�Wm.�S!�Q�	�CU<���c�
���K*Vg���jbQM�X�\�����h�#���Ml(��������`F9-�����y�~�>=8����?7��g�^��+r��v�T�,Y�Q���Il(�g0����
�`�Vf�����E�g�Jo��i�O��GvN�v������[y��������#��ND:�����A1�*�����J���������4i��.]%,���9��i�M1���#Sb|�|��iwF.���W�W�+���+b���,����E�E���
�J�Je��DST�R�Z���y[_9AEk���H���CR�"d�=,����Q�]a�v�~'P���1M1-r�)��1'q��c%KR&q2��7P�EZ�%�����R`����xK�Y�����c"��x/h$�1	���d{�-���u�uv����2�l&#���������MW`��Y���*O|L~�����#7�����G_�����{n���q��k�<����X��[�y����Dm*.��t"�Q%~�Y���1�2���-��&U�ZM�����I��7��	�	����'�����hGr��\[j��.O���0�a����C�+�K�H
���p�`�K�7����(���")�
�T@E��oq� �^Q��y���
6GY�lPh�"�x\���w1�{�G
C�b.h�0�n�jl&���.�E���>j����0��� (,����US���0t�1�S{��������?D�!�a����+���	^}��e�}|�����v�Y���Ww�*.�~��9���<Y<y�OrOx����z�w�L+.����D)��Y��i��I��4�-���\��������������������E�*���?>����#HF���E;���-$K���?�>�|�,�
10�C��*�P��(
4#�z�#�^�1������
���g%\q���P�����iO�������j�����<�jOe�\w��[xS9�X���"�7�����CE���M/�Y���s������[9�KdC�������������������;�\�dq)����Gbq[����8���������o�������Dd�����������9t.]BWSA�$Y���%����
)r�f	K�tIf�(�{��y^p(B�����R������������W�Qj���9��0c�u�p ���_o
@
b�*&��
��u���]|��I���$TI��wM��t���E��E���p��D4���]�����a�P�F�6����zx�Q�72~��o0yK���eB�	�������+3�����jw��KN���Xt�����|��#���X��!w(j���/�����e�2m��[�[o0zz��q��v���+V����P��q���G������d�S
{�^��4�D�o���Z������/�?%����B�����P��j
M���TY�`G&�q�->+~i|g���`�����?������$�s2�r�G'��n�BX�a���P$�������XQA*Ra��-�:~�g?s���hj�Z������������s�������4�d<�>�K�C���y���b$������Fv��T�`#nd�d�o��}���rW�1��}Kc~Au}c~�����m���d,�WkQ�
M�����'�*�]`-U����
���.�v#�9�N�<(���^22�	��
���Q..�3s���f�$�\��a�e��Q��q�.���x��S����B���:�����CMu�i�A�2z:��1����I\J'Qu�&�25�&�P���^V�M�*���h]aXA����q#*��,�yj�0q������%?n���4���$�m�q�M���d�����>����7�|���u/]�,����G�/}�����J-_u���jb���m�9���������]�q���TEP�m>{]�����)����/I#�(��*��]�����E_�m~C!��<�GjnM]~C#��
�P��s�"DR�j��������NO4\�)��E���q�H����=b�xP����@c�>���H�kx5�'���t�M�X���5�����,C1<n����6p�9�F��mpw&���A7���f�.�����xP���b&.h�|E�m��y��`��r�Vs��O���bqE��{0�)��j�#�!Z�b��P����<I#�0���C�\�J����U@���T�����?�����l,��D�Q�K�6@�C�(^���4�@T[�B�"�jSi��R�Gqtf�]�z�&�%���m��D)A.��B.��e�����$-��T�Ryh ��>0>vD?e�Fe�����a�5?�v���@�k-x"����$��
����P'�TK"]�\t��q#�d��c���(������K��e�����9s�?����i+g�t���7f������N~�0�����:��qL��KO_.�MO/%!/�9$}���@�2.^���"	X*��*�Q���(���c��

�t� G�T^b+�W�b��bcl�u�+�����:�����Q$�e�>p�7���t�����@
r�R@-�44U����vi��/!K���Zt=��� ���W6�M��.�N�n�Gh����S���s�O�����|��S��>QN��IA�C�1��Y�Qd��#y�Q���V-#���V����\�}���{	�k*��~�;`��@�@�*��������,+�#$�V3����4A�C��a-#9���'�w�
<��p��������`fy4��L���t�6�!,�*W�`�z���_yu�_l��_�+��p�*�������������5�N�bv'd~�C6)�C�,���z��\��c!���+�h����n=�	#� x��'�<�	� ��m�i�Q�R�_������P&#P&��T���L�_�%
����*~���3����h�������7}���9���>����]���E���]>�+C�x~i�}��7�"���(��x�XF� 
X0����i&T���e����b��������W�&��I'b�p#�������-�J�wm�����`���\U�7S.������SA��BR��<ETPe5 �&
r!1%%��@-���R.�G-����dn���3���9�T�|�b�B{�x�t�}�p��ZzA�o���.��T�5�����:�������p�hO�g�3��y�W��}_��|�1�d'��
���M�-�pgs�u��vR	�F�$JY��Xz" r:��z_�}g<t������q(((�U������J���Zg�m)�B9���p/���N�r���Y������Yv!��8��yYQ$U��b���{xd��J�9K#�~���h�v�C</M�����$�0r�R�Gy�,�T2,-����uMc
l���S�PB�M/������������k�[���y�<���X�X�t���<^������������<8>c��36��5���<Qg�Fy�����G]�0[���l��S5wA�������!�K�P�t��6�v_���^u����\�u:�KdB���{��N�i�T:�KL�{mOa����k��wK}������w�3�~�HC_>��(|�*���i��]yB�����. ����'���AP�>��g�%�GF�^���������m���������;��/>;�w�n����7����u�,9�Y��7o���#���C&�L"������QN9�1?l`U�D��+H1 �3F���A�1������J������G��|��/�e��)$���f��n����(���]lW�[�-�>���J}3���=�������QUC�e�tSp���e��HQ�c��r9��p� p�$�Xd�r�jf@��a����Lt��LE0������ebz�d���������4�Sd���`���!e������ZF1.�������a��4m�qin=��B�g�{������Q�Ss�(����P��}�w'��al��������V�����@�����OEA�D\&
�wWL��p�� ;)_�"��`��i3��Q6��g�;W�
|[��?>92��������?�P��4���SGOj>Y����^�d�~uq�WZ�d�Se�\b�[y��
�A��(7��+�5�.P�
r����1j��X��N�Rq�T!��FZu��g��Q������D�d�g�0��c�X!�G��T�!4���p���H7���l%j��cv�Z�����qzK�QKm���"�v{�=��^\�AX��`��1|�~�u�}o�����%�Ek��O�������R��7�HPM%�1��������w�m�P-�x��L����C�`�VB���aiYU	��d%�������z9ER}��y#��N��\��m�c�K��mb��I{
�AS�
{	����Fk�4n�V����'���TH[o2�nI,����LE9���9p8���w43�B�X ���T�Ma��	���=���{bs.xi�#H-���1T�h����_0��p���U�Y(��:};-�����.u<��z�Sv���-�3�Z�E�:^-�����LU������kG���/^���P�\nT���G���n
Y~�W;'��e����.�C���xH��%��}����X;��N���v4�'V���������AepA9�K����g��"r��P�m����bi�|^-�$��o�����$���)'��-�����3�y�ddV�]c0� +DR�,&!�	f���2>'
�r������$PH6z%I����!����������������Z�R`5R�c��Y�TB��7����}�d��Tb7��j=���9�rX���m���BM�^��W=
C/�tN���G�:����'�s��J����";��hW;���J�6�I�6G�%�I������W%R �H�$"���`��)^`q\su��<���8��b������AN�~�����+�Rqn�Z��/ �wb��	���X��v��,K�1����+#������~T�#��W�B���R<O x�$��v�gqy�B�n�?M�C�.=&�������O�����:�a���ew�{����B]�<.��1�E|q|q�/ue���*�H�C����,}YF�I�N����_h3���6��tv���GG=<z�����9�2��/��;C�p��?k���'^=�~���/�w����l+���B�C�����"�y�M
h?) D��<��>����v���%�}3h7��v����y��z�}O���/���a�
�������`�P@+�B
�e���a�9��=.�b~>����h;?u�O�h�-��������B�K��.�|��������\����
�n4�v�M�
t'~��$El�n�C��~��&{�i�9t;E��P-�F�R��H��j�P[H��p�����s��F���������7��[���{b��U&�4���G�?�4�4
endstream
endobj
121 0 obj
<</Type/Metadata
/Subtype/XML/Length 1202>>stream
<?xpacket begin='���' id='W5M0MpCehiHzreSzNTczkc9d'?>
<?adobe-xap-filters esc="CRLF"?>
<x:xmpmeta xmlns:x='adobe:ns:meta/' x:xmptk='XMP toolkit 2.9.1-13, framework 1.6'>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
<rdf:Description rdf:about="" xmlns:pdf='http://ns.adobe.com/pdf/1.3/' pdf:Producer='GPL Ghostscript 10.02.1'/>
<rdf:Description rdf:about="" xmlns:xmp='http://ns.adobe.com/xap/1.0/'><xmp:ModifyDate>2024-07-12T17:30:25+02:00</xmp:ModifyDate>
<xmp:CreateDate>2024-07-12T17:30:25+02:00</xmp:CreateDate>
<xmp:CreatorTool>UnknownApplication</xmp:CreatorTool></rdf:Description>
<rdf:Description rdf:about="" xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' xapMM:DocumentID='uuid:35ac112e-7880-11fa-0000-41bb8ed79242'/>
<rdf:Description rdf:about="" xmlns:dc='http://purl.org/dc/elements/1.1/' dc:format='application/pdf'><dc:title><rdf:Alt><rdf:li xml:lang='x-default'>GIN parallel builds results</rdf:li></rdf:Alt></dc:title></rdf:Description>
</rdf:RDF>
</x:xmpmeta>
                                                                        
                                                                        
<?xpacket end='w'?>
endstream
endobj
2 0 obj
<</Producer(GPL Ghostscript 10.02.1)
/CreationDate(D:20240712173025+02'00')
/ModDate(D:20240712173025+02'00')
/Title(GIN parallel builds results)>>endobj
xref
0 122
0000000000 65535 f 
0000061945 00000 n 
0000385936 00000 n 
0000061769 00000 n 
0000059060 00000 n 
0000000139 00000 n 
0000002407 00000 n 
0000062011 00000 n 
0000345362 00000 n 
0000343782 00000 n 
0000358631 00000 n 
0000062083 00000 n 
0000059193 00000 n 
0000002427 00000 n 
0000004239 00000 n 
0000062113 00000 n 
0000059328 00000 n 
0000004260 00000 n 
0000007777 00000 n 
0000062143 00000 n 
0000344634 00000 n 
0000343498 00000 n 
0000345581 00000 n 
0000062216 00000 n 
0000062324 00000 n 
0000062260 00000 n 
0000062292 00000 n 
0000084217 00000 n 
0000059505 00000 n 
0000007798 00000 n 
0000013306 00000 n 
0000084322 00000 n 
0000084258 00000 n 
0000084290 00000 n 
0000103049 00000 n 
0000059682 00000 n 
0000013327 00000 n 
0000015687 00000 n 
0000103154 00000 n 
0000103090 00000 n 
0000103122 00000 n 
0000121184 00000 n 
0000059859 00000 n 
0000015708 00000 n 
0000019629 00000 n 
0000121289 00000 n 
0000121225 00000 n 
0000121257 00000 n 
0000145609 00000 n 
0000060036 00000 n 
0000019650 00000 n 
0000025332 00000 n 
0000145714 00000 n 
0000145650 00000 n 
0000145682 00000 n 
0000170689 00000 n 
0000060213 00000 n 
0000025353 00000 n 
0000027232 00000 n 
0000170794 00000 n 
0000170730 00000 n 
0000170762 00000 n 
0000189728 00000 n 
0000060390 00000 n 
0000027253 00000 n 
0000031699 00000 n 
0000189769 00000 n 
0000189801 00000 n 
0000060543 00000 n 
0000031720 00000 n 
0000037600 00000 n 
0000189906 00000 n 
0000189842 00000 n 
0000189874 00000 n 
0000213704 00000 n 
0000060720 00000 n 
0000037621 00000 n 
0000042105 00000 n 
0000213809 00000 n 
0000213745 00000 n 
0000213777 00000 n 
0000238467 00000 n 
0000060897 00000 n 
0000042126 00000 n 
0000043143 00000 n 
0000238572 00000 n 
0000238508 00000 n 
0000238540 00000 n 
0000259302 00000 n 
0000061074 00000 n 
0000043163 00000 n 
0000047613 00000 n 
0000259343 00000 n 
0000259375 00000 n 
0000061227 00000 n 
0000047634 00000 n 
0000053444 00000 n 
0000259480 00000 n 
0000259416 00000 n 
0000259448 00000 n 
0000287948 00000 n 
0000061405 00000 n 
0000053465 00000 n 
0000057990 00000 n 
0000288058 00000 n 
0000287990 00000 n 
0000288023 00000 n 
0000321963 00000 n 
0000061587 00000 n 
0000058012 00000 n 
0000059039 00000 n 
0000322073 00000 n 
0000322005 00000 n 
0000322038 00000 n 
0000343456 00000 n 
0000345797 00000 n 
0000358842 00000 n 
0000344249 00000 n 
0000344772 00000 n 
0000345493 00000 n 
0000358542 00000 n 
0000384656 00000 n 
trailer
<< /Size 122 /Root 1 0 R /Info 2 0 R
/ID [<0381EF0C519D468B66674C7429CE4007><0381EF0C519D468B66674C7429CE4007>]
>>
startxref
386098
%%EOF
i5.tgzapplication/x-compressed-tar; name=i5.tgzDownload
Attachment: xeon.tgz (application/x-compressed-tar)
Attachment: v20240712-0001-Allow-parallel-create-for-GIN-indexes.patch (text/x-patch)
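In outline, the leader's merge step in this patch is an accumulate-and-flush loop: read GinTuples in key order from the shared tuplesort, append their TIDs to a per-key buffer while the key stays the same, and on a key change sort the TIDs and insert one combined entry. Here is a minimal standalone sketch of that loop, with plain integers standing in for keys and TIDs (illustrative C only, not PostgreSQL API; the real implementation is GinBuffer and _gin_parallel_merge in the patch):

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	int			key;
	int			tid;
} Tuple;

/* comparator for qsort, mirroring tid_cmp in the patch */
static int
tid_cmp(const void *a, const void *b)
{
	return (*(const int *) a - *(const int *) b);
}

int
main(void)
{
	/*
	 * "Tuplesort" output: sorted by key, but TIDs for a key may arrive
	 * unordered (synchronized scans can wrap around, and multiple
	 * workers contribute tuples for the same key).
	 */
	Tuple		input[] = {{1, 30}, {1, 10}, {1, 20}, {2, 5}, {2, 1}};
	int			ninput = sizeof(input) / sizeof(input[0]);

	int			items[16];		/* per-key TID buffer (cf. GinBuffer.items) */
	int			nitems = 0;
	int			curkey = input[0].key;

	for (int i = 0; i <= ninput; i++)
	{
		/* key change (or end of input): sort the TIDs and flush */
		if (i == ninput || input[i].key != curkey)
		{
			qsort(items, nitems, sizeof(int), tid_cmp);
			printf("insert key %d:", curkey);
			for (int j = 0; j < nitems; j++)
				printf(" %d", items[j]);
			printf("\n");
			nitems = 0;
			if (i == ninput)
				break;
			curkey = input[i].key;
		}
		items[nitems++] = input[i].tid; /* cf. GinBufferStoreTuple */
	}
	return 0;
}

Run as-is, this prints one combined, TID-sorted entry per key (key 1: 10 20 30, key 2: 1 5), which is the invariant the patch enforces via GinBufferSortItems before each combined entry is inserted into the index.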
From ce3dc46be1ed3cf02e15941ead39f547fb856ebe Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 19 Jun 2024 12:42:24 +0200
Subject: [PATCH v20240712 01/10] Allow parallel create for GIN indexes

Add support for parallel create of GIN indexes, using an approach and
code very similar to what is used by BRIN indexes.

Each worker reads a subset of the table (from a parallel scan), and
accumulates index entries in memory. But instead of writing the results
into the index (after hitting the memory limit), the data are written
to a shared tuplesort (and sorted by index key).

The leader then reads data from the tuplesort, and combines them into
entries that get inserted into the index.
---
 src/backend/access/gin/gininsert.c         | 1454 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  203 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   31 +
 src/include/utils/tuplesort.h              |    8 +
 src/tools/pgindent/typedefs.list           |    4 +
 8 files changed, 1694 insertions(+), 16 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 71f38be90c..f3b51878d5 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,126 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/sortsupport.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields before heapdesc.
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinShared;
+
+/*
+ * Return pointer to a GinShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +144,48 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process.
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +464,109 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginBuildCallbackParallel
+ *		Callback for the parallel index build.
+ *
+ * This is very similar to the serial build callback ginBuildCallback,
+ * except that instead of writing the accumulated entries into the index,
+ * we write them into a tuplesort that is then processed by the leader.
+ *
+ * XXX Instead of writing the entries directly into the shared tuplesort,
+ * we might write them into a local one, do a sort in the worker, combine
+ * the results, and only then write the results into the shared tuplesort.
+ * For large tables with many different keys that's going to work better
+ * than the current approach where we don't get many matches in work_mem
+ * (maybe this should use 32MB, which is what we use when planning, but
+ * even that may not be sufficient). Which means we are likely to have
+ * many entries with a small number of TIDs, forcing the leader to merge
+ * the data, often amounting to ~50% of the serial part. By doing the
+ * first sort in the workers, the leader could then do fewer merges with
+ * longer TID lists, which is much cheaper. Also, the amount of data sent
+ * from workers to the leader would be lower.
+ *
+ * The disadvantage is increased disk space usage, possibly up to 2x, if
+ * no entries get combined at the worker level.
+ *
+ * It would be possible to partition the data into multiple tuplesorts
+ * per worker (by hashing) - we don't need the data produced by workers
+ * to be perfectly sorted, and we could even live with multiple entries
+ * for the same key (in case it has multiple binary representations with
+ * distinct hash values).
+ */
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort
+	 *
+	 * XXX It might seem this should set the memory limit to 32MB, same as
+	 * what plan_create_index_workers() uses to calculate the number of
+	 * parallel workers, but that's the limit for tuplesort. So it seems
+	 * better to keep using work_mem here.
+	 *
+	 * XXX But maybe we should calculate this as a per-worker fraction of
+	 * maintenance_work_mem. It's weird to use work_mem here, in a clearly
+	 * maintenance command.
+	 */
+	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&buildstate->accum);
+		while ((list = ginGetBAEntry(&buildstate->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the index key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			/* GIN tuple and tuple length that we'll use for tuplesort */
+			GinTuple   *tup;
+			Size		tuplen;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &tuplen);
+
+			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(buildstate->tmpCtx);
+		ginInitBA(&buildstate->accum);
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +584,15 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * Initialize all the fields, so as not to trip valgrind.
+	 *
+	 * XXX Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,25 +633,93 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. For GIN that's
+	 * reasonable too, because we sort the data just like btree. It does
+	 * ignore the memory used to accumulate data in memory (set by work_mem),
+	 * but there is no way to communicate that to plan_create_index_workers.
+	 */
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
+
+
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, (void *) &buildstate,
-									   NULL);
+	if (state->bs_leader)
+	{
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(heap, index,
+									  maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
+
+		/* scan the relation in parallel and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, (void *) &buildstate,
+										   NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -534,3 +859,1102 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	/*
+	 * Insert the shared state and the tuplesort state into the TOC, so
+	 * that workers can look them up.
+	 */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinShared  *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+/*
+ * Buffer used to accumulate TIDs from multiple GinTuples for the same key
+ * (we read these from the tuplesort, sorted by the key).
+ *
+ * This is similar to BuildAccumulator in that it's used to collect TIDs
+ * in memory before inserting them into the index, but it's much simpler
+ * as it only deals with a single index key at a time.
+ *
+ * XXX The TID values in the "items" array are not guaranteed to be sorted,
+ * so we have to sort them explicitly. This is because parallel scans may be
+ * synchronized (and thus wrap around), and because we combine values from
+ * multiple workers.
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	int			maxitems;
+	SortSupport ssup;			/* for sorting/comparing keys */
+	ItemPointerData *items;
+} GinBuffer;
+
+/*
+ * Check that the TID array contains valid values, and that it's sorted (if we
+ * expect it to be).
+ */
+static void
+AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+{
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 0; i < nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&items[i]));
+
+		if ((i == 0) || !sorted)
+			continue;
+
+		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+	}
+#endif
+}
+
+/* basic GinBuffer checks */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	Assert(buffer->nitems <= buffer->maxitems);
+
+	/* if we have any items, the array must exist */
+	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+
+	/*
+	 * we don't know if the TID array is expected to be sorted or not
+	 *
+	 * XXX maybe we can pass that to the AssertCheckGinBuffer() call?
+	 */
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+#endif
+}
+
+/*
+ * Initialize the buffer used to accumulate TIDs for a single key at a time
+ * (we process the data in sorted order), so we know when we have received
+ * all the data for a given key.
+ *
+ * Initializes sort support procedures for all index attributes.
+ */
+static GinBuffer *
+GinBufferInit(Relation index)
+{
+	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
+	int			i,
+				nKeys;
+	TupleDesc	desc = RelationGetDescr(index);
+
+	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
+
+	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
+
+	/*
+	 * Lookup ordering operator for the index key data type, and initialize
+	 * the sort support function.
+	 */
+	for (i = 0; i < nKeys; i++)
+	{
+		SortSupport sortKey = &buffer->ssup[i];
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = index->rd_indcollation[i];
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	return buffer;
+}
+
+/* Is the buffer empty, i.e. has no TID values in the array? */
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * Check whether the tuple matches the data already accumulated in the GIN
+ * buffer. Scalar fields are compared first, before the actual key.
+ *
+ * Returns true if the key matches, and the TIDs belong to the buffer.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	int			r;
+	Datum		tupkey;
+
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	/*
+	 * For NULL/empty keys, this means equality; for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * For the tuple, get either the first sizeof(Datum) bytes for byval
+	 * types, or a pointer to the beginning of the data array.
+	 */
+	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+
+	r = ApplySortComparator(buffer->key, false,
+							tupkey, false,
+							&buffer->ssup[buffer->attnum - 1]);
+
+	return (r == 0);
+}
+
+/*
+ * GinBufferStoreTuple
+ *		Add data (especially TID list) from a GIN tuple to the buffer.
+ *
+ * The buffer is expected to be empty (in which case it's initialized), or
+ * having the same key. The TID values from the tuple are simply appended
+ * to the array, without sorting.
+ *
+ * XXX We expect the tuples to contain sorted TID lists, so maybe we should
+ * check that's true with an assert. And we could also check if the values
+ * are already in sorted order, in which case we can skip the sort later.
+ * But it seems like a waste of time, because it won't be necessary after
+ * switching to mergesort in a later patch, and also because it's reasonable
+ * to expect the arrays to overlap.
+ */
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* enlarge the TID buffer, if needed */
+	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	{
+		/* 64 seems like a good init value */
+		buffer->maxitems = Max(buffer->maxitems, 64);
+
+		while (buffer->nitems + tup->nitems > buffer->maxitems)
+			buffer->maxitems *= 2;
+
+		if (buffer->items == NULL)
+			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 buffer->maxitems * sizeof(ItemPointerData));
+	}
+
+	/* now we should be guaranteed to have enough space for all the TIDs */
+	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+
+	/* copy the new TIDs into the buffer */
+	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
+	buffer->nitems += tup->nitems;
+
+	/* we simply append the TID values, so don't check sorting */
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+}
+
+/* TID comparator for qsort */
+static int
+tid_cmp(const void *a, const void *b)
+{
+	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+}
+
+/*
+ * GinBufferSortItems
+ *		Sort the TID values stored in the TID buffer.
+ */
+static void
+GinBufferSortItems(GinBuffer *buffer)
+{
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+}
+
+/*
+ * GinBufferReset
+ *		Reset the buffer into a state as if it contains no data.
+ *
+ * XXX Should we do something if the array of TIDs gets too large? It may
+ * grow too much, and we won't free it until the worker finishes building.
+ * It would be better to not let the array grow arbitrarily large, and to
+ * enforce work_mem as a memory limit by flushing the buffer into the
+ * tuplesort.
+ *
+ * XXX Might be better to have a separate memory context for the buffer.
+ */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/*
+	 * Not required, but makes it more likely to trigger a NULL dereference
+	 * if the value is used incorrectly, etc.
+	 */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+}
+
+/*
+ * GinBufferCanAddKey
+ *		Check if a given GIN tuple can be added to the current buffer.
+ *
+ * Returns true if the buffer is either empty or for the same index key.
+ *
+ * XXX This could / should also enforce a memory limit by checking the size of
+ * the TID array, and returning false if it's too large (more than work_mem,
+ * for example).
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, merge the per-worker results into
+ * the complete index. The results from each worker are sorted by category
+ * and key. While combining the per-worker results we accumulate the TID
+ * lists for GIN tuples with the same key, so that each key is inserted
+ * into the index just once.
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe should have local memory contexts similar to what
+ * _brin_parallel_merge does?
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us an order matching how the data is
+	 * organized in the index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert them at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in an order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinBufferSortItems(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
+													sortmem, coordinate,
+													TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write remaining accumulated entries */
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&state->accum);
+		while ((list = ginGetBAEntry(&state->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			GinTuple   *tup;
+			Size		len;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &len);
+
+			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(state->tmpCtx);
+		ginInitBA(&state->accum);
+	}
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
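+ *
+ * The serialized layout (as constructed below) is:
+ *
+ *    [GinTuple header | key data (keylen bytes) | alignment padding | TIDs]
+ *
+ * with the TID array starting at MAXALIGN(offsetof(GinTuple, data) + keylen).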
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have an actual non-empty key. We include varlena headers and \0 bytes
+	 * for strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to do the MAXALIGN to allow
+	 * direct access to items pointers.
+	 *
+	 * XXX Note that for byval types we store the whole datum, no matter what
+	 * the typlen value is.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "unexpected typlen value (%d)", typlen);
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * Allocate space for the whole GIN tuple.
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)	/* varlena (header included in keylen) */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)	/* cstring (keylen includes the \0 byte) */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplestore representation.
+ *
+ * Most of the fields are directly accessible; the only things that need
+ * more care are the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, the key value is
+ * compared last. The comparisons are done using type-specific sort support
+ * functions.
+ *
+ * XXX We might try using memcmp(), based on the assumption that if we get
+ * two keys that are two different representations of a logically equal
+ * value, it'll get merged by the index build. But it's not clear that's
+ * safe, because for keys with multiple binary representations we might
+ * end with overlapping lists. Which might affect performance by requiring
+ * full merge of the TID lists, and perhaps even failures (e.g. errors like
+ * "could not split GIN page; all old items didn't fit" when inserting data
+ * into the index).
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
+{
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if ((a->category == GIN_CAT_NORM_KEY) &&
+		(b->category == GIN_CAT_NORM_KEY))
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		return ApplySortComparator(keya, false,
+								   keyb, false,
+								   &ssup[a->attrnum - 1]);
+	}
+
+	return 0;
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 5747ae6a4c..dd22b44aca 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 8613fc6fb5..c9ea769afb 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 05a853caa3..ea15af104d 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,10 +20,12 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "catalog/index.h"
+#include "catalog/pg_collation.h"
 #include "executor/executor.h"
 #include "pg_trace.h"
 #include "utils/datum.h"
@@ -46,6 +48,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +78,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +88,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -580,6 +590,82 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+/*
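+ * tuplesort_begin_index_gin
+ *		Create a tuplesort for GIN tuples (GinTuple), as used by parallel
+ *		GIN index builds.
+ *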
+ * XXX Maybe we should pass the ordering functions, not the heap/index?
+ */
+Tuplesortstate *
+tuplesort_begin_index_gin(Relation heapRel,
+						  Relation indexRel,
+						  int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext;
+	int			i;
+	TupleDesc	desc = RelationGetDescr(indexRel);
+
+	oldcontext = MemoryContextSwitchTo(base->maincontext);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	/*
+	 * Multi-column GIN indexes expand the row into a separate index entry
+	 * for each attribute, and that's what we write into the tuplesort. But
+	 * we still need to initialize sortsupport for all the attributes.
+	 */
+	base->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
+
+	/* Prepare SortSupport data for each column */
+	base->sortKeys = (SortSupport) palloc0(base->nKeys *
+										   sizeof(SortSupportData));
+
+	for (i = 0; i < base->nKeys; i++)
+	{
+		SortSupport sortKey = base->sortKeys + i;
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = indexRel->rd_indcollation[i];
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		/*
+		 * Look up the ordering operator for the index key data type, and
+		 * then prepare the sort support function.
+		 *
+		 * XXX does this use the right opckeytype/opcintype for GIN?
+		 */
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -817,6 +903,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tuple, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -989,6 +1106,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1777,6 +1917,69 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+/*
+ * Routines specialized for GIN case
+ */
+
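+/*
+ * removeabbrev_index_gin
+ *		Not used - GIN tuplesorts set abbreviate = false for all keys
+ *		(see tuplesort_begin_index_gin), so this should never be reached.
+ */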
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple,
+							   base->sortKeys);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 25983b7a50..be76d8446f 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 0000000000..6f529a5aaf
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,31 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations of the GinTuple format used by parallel GIN index builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "storage/itemptr.h"
+#include "utils/sortsupport.h"
+
+/* XXX do we still need all the fields now that we use SortSupport? */
+typedef struct GinTuple
+{
+	Size		tuplen;			/* length of the whole tuple */
+	Size		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	OffsetNumber attrnum;		/* attnum of index key */
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f6201..0ed71ae922 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,10 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(Relation heapRel,
+												 Relation indexRel,
+												 int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +461,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +471,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e710fa48e5..cbcab21023 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1016,11 +1016,13 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1033,9 +1035,11 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinShared
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.39.2

Attachment: v20240712-0002-Use-mergesort-in-the-leader-process.patch (text/x-patch)
From 7ba7aac362152a07bb07be2cd38a94b09a288298 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:02:29 +0200
Subject: [PATCH v20240712 02/10] Use mergesort in the leader process

The leader process (executing the serial part of the index build) spent
a significant part of the time in pg_qsort, after combining the partial
results from the workers. But we can improve this and move some of the
costs to the parallel part in workers - if workers produce sorted TID
lists, the leader can combine them by mergesort.

But to make this really efficient, the mergesort must not be executed
too many times. The workers may easily produce very short TID lists, if
there are many different keys, hitting the memory limit often. So this
adds an intermediate tuplesort pass into each worker, to combine TIDs
for each key and only then write the result into the shared tuplesort.

This means the number of mergesort invocations for each key should be
about the same as the number of workers. We can't really do better, and
it's low enough to keep the mergesort approach efficient.

Note: If we introduce a memory limit on GinBuffer (to not accumulate too
many TIDs in memory), we could end up with more chunks, but it should
not be very common.
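
Schematically, the data flow (roughly) becomes:

    worker:  table scan -> ginBuildCallbackParallel -> worker tuplesort
             worker tuplesort -> merge TIDs per key -> shared tuplesort

    leader:  shared tuplesort -> merge TIDs per key -> ginEntryInsert()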
---
 src/backend/access/gin/gininsert.c | 200 +++++++++++++++++++++++------
 1 file changed, 162 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index f3b51878d5..feaa36fd5a 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -162,6 +162,14 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate used only within a single worker for the first merge pass
+	 * happening there. In principle it doesn't need to be part of the build
+	 * state and we could pass it around directly, but it's more convenient
+	 * this way. And it's part of the build state, after all.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -472,23 +480,23 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * except that instead of writing the accumulated entries into the index,
  * we write them into a tuplesort that is then processed by the leader.
  *
- * XXX Instead of writing the entries directly into the shared tuplesort,
- * we might write them into a local one, do a sort in the worker, combine
+ * Instead of writing the entries directly into the shared tuplesort, write
+ * them into a local one (in each worker), do a sort in the worker, combine
  * the results, and only then write the results into the shared tuplesort.
  * For large tables with many different keys that's going to work better
  * than the current approach where we don't get many matches in work_mem
  * (maybe this should use 32MB, which is what we use when planning, but
- * even that may not be sufficient). Which means we are likely to have
- * many entries with a small number of TIDs, forcing the leader to merge
- * the data, often amounting to ~50% of the serial part. By doing the
- * first sort workers, the leader then could do fewer merges with longer
- * TID lists, which is much cheaper. Also, the amount of data sent from
- * workers to the leader woiuld be lower.
+ * even that may not be sufficient). Which means we would end up with many
+ * entries with a small number of TIDs, forcing the leader to merge the data,
+ * often amounting to ~50% of the serial part. By doing the first sort in
+ * workers, this work is parallelized and the leader does fewer merges with
+ * longer TID lists, which is much cheaper and more efficient. Also, the
+ * amount of data sent from workers to the leader gets lower.
  *
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
- * It would be possible to partition the data into multiple tuplesorts
+ * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
  * for the same key (in case it has multiple binary representations with
@@ -548,7 +556,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
-			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
 
 			pfree(tup);
 		}
@@ -1146,7 +1154,6 @@ typedef struct GinBuffer
 
 	/* array of TID values */
 	int			nitems;
-	int			maxitems;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1176,8 +1183,6 @@ static void
 AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
-	Assert(buffer->nitems <= buffer->maxitems);
-
 	/* if we have any items, the array must exist */
 	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
 
@@ -1299,11 +1304,7 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  * to the array, without sorting.
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
- * check that's true with an assert. And we could also check if the values
- * are already in sorted order, in which case we can skip the sort later.
- * But it seems like a waste of time, because it won't be necessary after
- * switching to mergesort in a later patch, and also because it's reasonable
- * to expect the arrays to overlap.
+ * check that's true with an assert.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1331,28 +1332,22 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* enlarge the TID buffer, if needed */
-	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
-		/* 64 seems like a good init value */
-		buffer->maxitems = Max(buffer->maxitems, 64);
+		int			nnew;
+		ItemPointer new;
 
-		while (buffer->nitems + tup->nitems > buffer->maxitems)
-			buffer->maxitems *= 2;
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
 
-		if (buffer->items == NULL)
-			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
-		else
-			buffer->items = repalloc(buffer->items,
-									 buffer->maxitems * sizeof(ItemPointerData));
-	}
+		Assert(nnew == buffer->nitems + tup->nitems);
 
-	/* now we should be guaranteed to have enough space for all the TIDs */
-	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+		if (buffer->items)
+			pfree(buffer->items);
 
-	/* copy the new TIDs into the buffer */
-	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
-	buffer->nitems += tup->nitems;
+		buffer->items = new;
+		buffer->nitems = nnew;
+	}
 
 	/* we simply append the TID values, so don't check sorting */
 	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
@@ -1416,6 +1411,24 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferFree
+ *		Release memory associated with the GinBuffer (including TID array).
+ */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
 /*
  * GinBufferCanAddKey
  *		Check if a given GIN tuple can be added to the current buffer.
@@ -1497,7 +1510,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the index, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1514,7 +1527,7 @@ _gin_parallel_merge(GinBuildState *state)
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1524,6 +1537,9 @@ _gin_parallel_merge(GinBuildState *state)
 		GinBufferReset(buffer);
 	}
 
+	/* release all the memory */
+	GinBufferFree(buffer);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1562,6 +1578,102 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short (due to frequent evictions in the callback).
+ * But combining many tiny lists is expensive, so we try to do as much as
+ * possible in the workers and only then pass the results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker. Which means the leader will
+ * do a very limited number of mergesorts, which is good.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the local worker tuplesort, sorted by the
+	 * key, and merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the tuplesort, and start a new entry for current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		GinBufferSortItems(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1594,6 +1706,11 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													sortmem, coordinate,
 													TUPLESORT_NONE);
 
+	/* Local per-worker sort of raw-data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
+													  sortmem, NULL,
+													  TUPLESORT_NONE);
+
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1630,7 +1747,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
-			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
 
 			pfree(tup);
 		}
@@ -1639,6 +1756,13 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 		ginInitBA(&state->accum);
 	}
 
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced
+	 * by the callback, combine it into much larger chunks, and place those
+	 * into the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
-- 
2.39.2

Attachment: v20240712-0003-Remove-the-explicit-pg_qsort-in-workers.patch (text/x-patch)
From 572f4fd5679ba2e838e65a92008a726fb3676822 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:14:52 +0200
Subject: [PATCH v20240712 03/10] Remove the explicit pg_qsort in workers

We don't need to do the explicit sort before building the GIN tuple,
because the mergesort in GinBufferStoreTuple is already maintaining the
correct order (this was added in an earlier commit).

The commit also adds a field with the first TID, and modifies the
comparator to sort by it (for each key value). This helps workers to
build non-overlapping TID lists and simply append values instead of
having to do the actual mergesort to combine them. This is best-effort,
i.e. it's not guaranteed to eliminate the mergesort - in particular,
parallel scans are synchronized, and thus may start somewhere in the
middle of the table, and wrap around. In that case there may be a very
wide list (with both low and high TID values).

Note: There's an XXX comment with a couple ideas on how to improve this,
at the cost of more complexity.
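
As a (made-up) example of the best case, two sorted lists for the same key

    [(0,1) .. (100,5)]  and  [(100,7) .. (200,3)]

ordered by first TID can simply be concatenated, while lists overlapping
in TID range still fall back to the full merge in ginMergeItemPointers().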
---
 src/backend/access/gin/gininsert.c | 107 +++++++++++++++++------------
 src/include/access/gin_tuple.h     |  11 ++-
 2 files changed, 74 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index feaa36fd5a..93438ac216 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1161,19 +1161,27 @@ typedef struct GinBuffer
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
  * expect it to be).
+ *
+ * XXX At this point there are no places where "sorted=false" should be
+ * necessary, because we always use merge-sort to combine the old and new
+ * TID list. So maybe we should get rid of the argument entirely.
  */
 static void
-AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
-	for (int i = 0; i < nitems; i++)
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
 	{
-		Assert(ItemPointerIsValid(&items[i]));
+		Assert(ItemPointerIsValid(&buffer->items[i]));
 
 		if ((i == 0) || !sorted)
 			continue;
 
-		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
 	}
 #endif
 }
@@ -1190,8 +1198,10 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 	 * we don't know if the TID array is expected to be sorted or not
 	 *
 	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
+	 * XXX actually with the mergesort in GinBufferStoreTuple, we
+	 * should not need 'false' here. See AssertCheckItemPointers.
 	 */
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+	AssertCheckItemPointers(buffer, false);
 #endif
 }
 
@@ -1300,8 +1310,26 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  *		Add data (especially TID list) from a GIN tuple to the buffer.
  *
  * The buffer is expected to be empty (in which case it's initialized), or
- * having the same key. The TID values from the tuple are simply appended
- * to the array, without sorting.
+ * having the same key. The TID values from the tuple are combined with the
+ * stored values using a merge sort.
+ *
+ * The tuples (for the same key) are expected to be sorted by first TID. But
+ * this does not guarantee the lists do not overlap, especially in the leader,
+ * because the workers process interleaving data. But even in a single worker,
+ * lists can overlap - parallel scans require sync-scans, and if a scan wraps,
+ * one of the lists may be very wide (in terms of TID range).
+ *
+ * But ginMergeItemPointers() is already smart about detecting cases when it
+ * can simply concatenate the lists and when a full mergesort is needed, and
+ * it does the right thing.
+ *
+ * By keeping the first TID in the GinTuple and sorting by that, we make it
+ * more likely the lists won't overlap very often.
+ *
+ * XXX How frequent can the overlaps be? If the scan does not wrap around,
+ * there should be no overlapping lists, and thus no mergesort. After a
+ * wraparound, there can be many - the wrapped-around list will be very
+ * wide, with both a very low and a very high TID, and all the other lists
+ * will overlap with it.
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
  * check that's true with an assert.
@@ -1347,33 +1375,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		buffer->items = new;
 		buffer->nitems = nnew;
-	}
-
-	/* we simply append the TID values, so don't check sorting */
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
-}
-
-/* TID comparator for qsort */
-static int
-tid_cmp(const void *a, const void *b)
-{
-	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
-}
-
-/*
- * GinBufferSortItems
- *		Sort the TID values stored in the TID buffer.
- */
-static void
-GinBufferSortItems(GinBuffer *buffer)
-{
-	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
-
-	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
 
-	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
+	}
 }
 
 /*
@@ -1510,7 +1514,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the index, and start a new entry for current
 			 * GinTuple.
 			 */
-			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+			AssertCheckItemPointers(buffer, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1520,14 +1524,17 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1630,7 +1637,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 * the data into the tuplesort, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer, true);
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
@@ -1644,7 +1651,10 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
@@ -1654,7 +1664,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		GinTuple   *ntup;
 		Size		ntuplen;
 
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer, true);
 
 		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
@@ -1959,6 +1969,7 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
+	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -2042,6 +2053,12 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
  * compared last. The comparisons are done using type-specific sort support
  * functions.
  *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
+ *
  * XXX We might try using memcmp(), based on the assumption that if we get
  * two keys that are two different representations of a logically equal
  * value, it'll get merged by the index build. But it's not clear that's
@@ -2054,6 +2071,7 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 int
 _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 {
+	int			r;
 	Datum		keya,
 				keyb;
 
@@ -2075,10 +2093,13 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 		keya = _gin_parse_tuple(a, NULL);
 		keyb = _gin_parse_tuple(b, NULL);
 
-		return ApplySortComparator(keya, false,
-								   keyb, false,
-								   &ssup[a->attrnum - 1]);
+		r = ApplySortComparator(keya, false,
+								keyb, false,
+								&ssup[a->attrnum - 1]);
+
+		/* if the key is the same, consider the first TID in the array */
+		return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 	}
 
-	return 0;
+	return ItemPointerCompare(&a->first, &b->first);
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 6f529a5aaf..55dd8544b2 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -13,7 +13,15 @@
 #include "storage/itemptr.h"
 #include "utils/sortsupport.h"
 
-/* XXX do we still need all the fields now that we use SortSupport? */
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ *
+ * XXX do we still need all the fields now that we use SortSupport?
+ */
 typedef struct GinTuple
 {
 	Size		tuplen;			/* length of the whole tuple */
@@ -22,6 +30,7 @@ typedef struct GinTuple
 	bool		typbyval;		/* typbyval for key */
 	OffsetNumber attrnum;		/* attnum of index key */
 	signed char category;		/* category: normal or NULL? */
+	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
-- 
2.39.2
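
The comparison order established by _gin_compare_tuples above (key first,
then the first TID as a tie-break) can be summarized by this minimal
standalone sketch. TidStub and TupStub are simplified stand-ins for
ItemPointerData and GinTuple, and the integer comparison stands in for the
type-specific SortSupport call - it's an illustration, not code from the
patch:

#include <stdint.h>

/* Simplified stand-ins for ItemPointerData and GinTuple. */
typedef struct { uint32_t block; uint16_t offset; } TidStub;
typedef struct { int64_t key; TidStub first; } TupStub;

static int
tid_compare(const TidStub *a, const TidStub *b)
{
	if (a->block != b->block)
		return (a->block < b->block) ? -1 : 1;
	if (a->offset != b->offset)
		return (a->offset < b->offset) ? -1 : 1;
	return 0;
}

/* key first, then the first TID as a tie-break */
static int
tuple_compare(const TupStub *a, const TupStub *b)
{
	if (a->key != b->key)
		return (a->key < b->key) ? -1 : 1;
	return tid_compare(&a->first, &b->first);
}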

Attachment: v20240712-0004-Compress-TID-lists-before-writing-tuples-t.patch (text/x-patch)
From f21cec6defbf7e734dccb8fd7b4707777cba8677 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:39 +0200
Subject: [PATCH v20240712 04/10] Compress TID lists before writing tuples to
 disk

When serializing GIN tuples to tuplesorts, we can significantly reduce
the amount of data by compressing the TID lists. The GIN opclasses may
produce a lot of data (depending on how many keys are extracted from
each row), and the compression is very effective and efficient.

If the number of different keys is high, the first worker pass may not
benefit from the compression very much - the data will be spilled to
disk before the TID lists can grow long enough for the compression to
actually help. In the second pass the impact is much more significant.

For real-world data (full-text on mailing list archives), I usually see
the compression save only about ~15% in the first pass, but ~50% in
the second pass.
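
To illustrate (this sketch is not part of the patch itself), the
serialization step amounts to chopping the sorted TID array into
posting-list segments using the existing ginCompressPostingList and
SizeOfGinPostingList API that the patch reuses; compress_tids is a
hypothetical helper mirroring the loop added to _gin_build_tuple:

/* Illustration only: compress a sorted TID array into posting-list
 * segments (each at most UINT16_MAX bytes) and return the total
 * serialized size. */
static Size
compress_tids(ItemPointerData *items, int nitems)
{
	Size		total = 0;
	int			done = 0;

	while (done < nitems)
	{
		int				nwritten;
		GinPostingList *seg;

		seg = ginCompressPostingList(&items[done], nitems - done,
									 UINT16_MAX, &nwritten);
		total += SizeOfGinPostingList(seg);
		done += nwritten;
		pfree(seg);
	}

	return total;
}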
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 93438ac216..59e35fd1e0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -188,7 +188,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1342,7 +1344,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1378,6 +1381,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer, true);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /*
@@ -1896,6 +1902,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1908,6 +1923,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
  * like endianess etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized as compressed - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1921,6 +1941,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -1947,12 +1972,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "invalid typlen");
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * SHORTALIGN, but we do MAXALIGN anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -2002,37 +2049,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible; the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -2045,6 +2095,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array of TIDs decompressed from the tuple.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -2090,8 +2162,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	if ((a->category == GIN_CAT_NORM_KEY) &&
 		(b->category == GIN_CAT_NORM_KEY))
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		r = ApplySortComparator(keya, false,
 								keyb, false,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index cbcab21023..659699e815 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1035,6 +1035,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinShared
 GinState
 GinStatsData
-- 
2.39.2

Attachment: v20240712-0005-Collect-and-print-compression-stats.patch (text/x-patch)
From 04dc8612e9863f8a03833017e463310e2b43fb42 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:43 +0200
Subject: [PATCH v20240712 05/10] Collect and print compression stats

Allows evaluating the benefits of compressing the TID lists.
---
 src/backend/access/gin/gininsert.c | 42 +++++++++++++++++++++++-------
 src/include/access/gin.h           |  2 ++
 2 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 59e35fd1e0..7a2d377d94 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -191,7 +191,8 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 static ItemPointer _gin_parse_tuple_items(GinTuple *a);
 static Datum _gin_parse_tuple_key(GinTuple *a);
 
-static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+static GinTuple *_gin_build_tuple(GinBuildState *state,
+								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
 								  Size *len);
@@ -554,7 +555,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(buildstate, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
@@ -1199,9 +1200,9 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 	/*
 	 * we don't know if the TID array is expected to be sorted or not
 	 *
-	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
-	 * XXX actually with the mergesort in GinBufferStoreTuple, we
-	 * should not need 'false' here. See AssertCheckItemPointers.
+	 * XXX maybe we can pass that to the AssertCheckGinBuffer() call?
+	 * XXX actually, with the mergesort in GinBufferStoreTuple, we should
+	 * not need 'false' here. See AssertCheckItemPointers.
 	 */
 	AssertCheckItemPointers(buffer, false);
 #endif
@@ -1619,6 +1620,15 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* sort the raw per-worker data */
 	tuplesort_performsort(state->bs_worker_sort);
 
+	/* print some basic info */
+	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
+	/* reset before the second phase */
+	state->buildStats.sizeCompressed = 0;
+	state->buildStats.sizeRaw = 0;
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
 	 * merge them into larger chunks for the leader to combine.
@@ -1645,7 +1655,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 */
 			AssertCheckItemPointers(buffer, true);
 
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
 									buffer->items, buffer->nitems, &ntuplen);
 
@@ -1672,7 +1682,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 		AssertCheckItemPointers(buffer, true);
 
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
 								buffer->items, buffer->nitems, &ntuplen);
 
@@ -1687,6 +1697,11 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* relase all the memory */
 	GinBufferFree(buffer);
 
+	/* print some basic info */
+	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
 	tuplesort_end(worker_sort);
 }
 
@@ -1759,7 +1774,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(state, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
@@ -1853,6 +1868,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* initialize the GIN build state */
 	initGinState(&buildstate.ginstate, indexRel);
 	buildstate.indtuples = 0;
+	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
 	/*
@@ -1930,7 +1946,8 @@ typedef struct
  * of that into the GIN tuple.
  */
 static GinTuple *
-_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+_gin_build_tuple(GinBuildState *state,
+				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
 				 ItemPointerData *items, uint32 nitems,
 				 Size *len)
@@ -2064,6 +2081,13 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 		pfree(seginfo);
 	}
 
+	/* how large would the tuple be without compression? */
+	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		nitems * sizeof(ItemPointerData);
+
+	/* compressed size */
+	state->buildStats.sizeCompressed += tuplen;
+
 	return tuple;
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index be76d8446f..2b6633d068 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -49,6 +49,8 @@ typedef struct GinStatsData
 	BlockNumber nDataPages;
 	int64		nEntries;
 	int32		ginVersion;
+	Size		sizeRaw;
+	Size		sizeCompressed;
 } GinStatsData;
 
 /*
-- 
2.39.2

Attachment: v20240712-0006-Enforce-memory-limit-when-combining-tuples.patch (text/x-patch)
From e8c1cf00755cb4434134bcc63769987b1428a2a4 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:46:48 +0200
Subject: [PATCH v20240712 06/10] Enforce memory limit when combining tuples

When combining intermediate results during a parallel GIN index build, we
want to restrict the memory usage. In ginBuildCallbackParallel() this is
done simply by dumping working state into tuplesort after hitting the
memory limit.

This commit introduces memory limit to the following steps, merging the
intermediate results in both worker and leader. The merge only deals
with one key at a time, and the primary risk is the key might have too
many different TIDs. This is not very likely, because the TID array only
needs 6B per item, it's a potential issue.

We can't simply dump the whole current TID list - the index requires the
TID values to be inserted in the correct order, but if the lists overlap
(as they do between workers), the tail of the list may change during the
mergesort. But thanks to sorting GIN tuples by first TID, we can derive
a safe TID horizon - we know no future tuples will have TIDs from before
this value, so it's safe to output this part of the list.

This commit tracks the "frozen" part of the TID list, which is the part
we know won't change after merging additional TID lists. Then if the TID
list grows too large (more than 64kB), we try to trim it - write out the
frozen part of the list, and discard it from the buffer. We only do the
trimming if there's at least 1024 frozen items - we don't want to write
the data into the index in tiny chunks.

The freezing also allows us to skip the frozen part during mergesort.
The frozen part of the list is known to be fully sorted, so we can just
skip it and mergesort only the rest of the data.

Note: These limits (1024 and 64kB) are mostly arbitrary, but they seem high
enough to get good efficiency for compression/batching, yet low enough
to release memory early and work in small increments.
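
As an illustration (not part of the patch), the freezing rule reduces to a
linear scan of the sorted buffer against the first TID of the incoming
tuple; count_frozen below is a hypothetical helper, while the real logic
lives inline in GinBufferStoreTuple:

/* Illustration only: TIDs that sort at or before the first TID of the
 * incoming tuple cannot be reordered by any future merge, so they may be
 * written out and discarded from the buffer. */
static int
count_frozen(ItemPointerData *items, int nitems, int nfrozen,
			 ItemPointer first_of_new)
{
	while (nfrozen < nitems &&
		   ItemPointerCompare(&items[nfrozen], first_of_new) <= 0)
		nfrozen++;

	return nfrozen;
}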
---
 src/backend/access/gin/gininsert.c | 232 ++++++++++++++++++++++++++++-
 src/include/access/gin.h           |   1 +
 2 files changed, 225 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 7a2d377d94..60dca65d1b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1155,8 +1155,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempting to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1223,6 +1227,18 @@ GinBufferInit(Relation index)
 				nKeys;
 	TupleDesc	desc = RelationGetDescr(index);
 
+	/*
+	 * How many items can we fit into the memory limit? We don't want to end
+	 * up with too many TIDs, and 64kB seems more than enough. But maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 *
+	 * XXX This is not enough to prevent repeated merges after a wraparound
+	 * of the parallel scan, but it should be enough to make the merges cheap
+	 * because it quickly reaches the end of the second list and can just
+	 * memcpy the rest without walking it item by item.
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
 	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
 
 	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
@@ -1308,6 +1324,54 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (r == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we mean writing out and removing the TIDs that
+ * we know can't change by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding whether to trim - if the resulting
+ * list (after adding TIDs from the new tuple) would be too long, and if there
+ * are enough TIDs to trim (with values less than the "first" TID from the new
+ * tuple), we do the trim. By enough we mean at least 1024 TIDs (an arbitrary
+ * number).
+ *
+ * XXX This does help for the wraparound case too, because the "wide" TID list
+ * is essentially two ranges - one at the beginning of the table, one at the
+ * end. And all the other ranges (from GIN tuples) come in between, and also
+ * do not overlap. So by trimming up to the range we're about to add, this
+ * guarantees we'll be able to "concatenate" the two lists cheaply.
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* We won't hit the memory limit even after adding this tuple. */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1336,6 +1400,11 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
  * check that's true with an assert.
+ *
+ * XXX Maybe we could/should allocate the buffer once and then keep it
+ * without palloc/pfree. That won't help when just calling the mergesort,
+ * as that does palloc internally, but if we detected the append case,
+ * we could do without it. Not sure how much overhead it is, though.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1364,21 +1433,72 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now search the list linearly, to find the last frozen TID. If we found
+	 * the whole list is frozen, this just does nothing.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher.
+	 *
+	 * XXX Maybe this should do a binary search if the number of "non-frozen"
+	 * items is sufficiently high (enough to make linear search slower than
+	 * binsearch).
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		pfree(new);
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer, true);
 	}
@@ -1417,11 +1537,29 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /*
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
@@ -1489,7 +1627,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers have already completed.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/*
@@ -1531,6 +1674,34 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index,
+		 * if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The buffer has hit the memory limit - write the frozen part
+			 * of the TID list into the index, and keep only the tail that
+			 * may still change in the buffer.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1554,6 +1725,8 @@ _gin_parallel_merge(GinBuildState *state)
 	/* relase all the memory */
 	GinBufferFree(buffer);
 
+	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1614,7 +1787,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/* sort the raw per-worker data */
@@ -1667,6 +1846,41 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index,
+		 * if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The buffer has hit the memory limit - write the frozen part
+			 * of the TID list into the tuplesort, and keep only the tail
+			 * that may still change in the buffer.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1702,6 +1916,8 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
 		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
 
+	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(worker_sort);
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 2b6633d068..9381329fac 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -51,6 +51,7 @@ typedef struct GinStatsData
 	int32		ginVersion;
 	Size		sizeRaw;
 	Size		sizeCompressed;
+	int64		nTrims;
 } GinStatsData;
 
 /*
-- 
2.39.2

Attachment: v20240712-0007-Detect-wrap-around-in-parallel-callback.patch (text/x-patch)
From f400429c747ff7aadbad1a3165f62c4acc57507a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 20 Jun 2024 20:50:51 +0200
Subject: [PATCH v20240712 07/10] Detect wrap around in parallel callback

When the sync scan during an index build wraps around, some keys may end
up with very long TID lists, requiring "full" merge sort runs when
combining data in workers. It also causes problems with enforcing the
memory limit, because we can't just dump the data - the index build
requires append-only posting lists, and violating that may result in
errors like

  ERROR: could not split GIN page; all old items didn't fit

because after the scan wraps around some of the TIDs may belong to the
beginning of the list, affecting the compression.

But we can deal with this in the callback - if we see the TID jump
back, that must mean a wraparound happened. In that case we simply dump
all the data accumulated in memory, and start from scratch.

This means there won't be any tuples with very wide TID ranges, instead
there'll be one tuple with a range at the end of the table, and another
tuple at the beginning. And all the lists in the worker will be
non-overlapping, and sort nicely based on first TID.

For the leader, we still need to do the full merge - the lists may
overlap and interleave in various ways. But there should be very few
of those lists, about one per worker, so it's not an issue.
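
A standalone sketch of the detection itself (TidStub and detect_wraparound
are illustrative stand-ins; the real code compares ItemPointerData values
directly in ginBuildCallbackParallel):

#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for ItemPointerData. */
typedef struct { uint32_t block; uint16_t offset; } TidStub;

static bool
tid_precedes(TidStub a, TidStub b)
{
	return a.block < b.block ||
		   (a.block == b.block && a.offset < b.offset);
}

/*
 * Feed TIDs in the order the scan produces them. Returns true exactly
 * when the synchronized scan has wrapped around (the current TID sorts
 * before the previous one) - the point where the callback must flush
 * its accumulated state.
 */
static bool
detect_wraparound(TidStub *last, TidStub cur)
{
	bool		wrapped = tid_precedes(cur, *last);

	*last = cur;
	return wrapped;
}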
---
 src/backend/access/gin/gininsert.c | 132 ++++++++++++++---------------
 1 file changed, 63 insertions(+), 69 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 60dca65d1b..f79c9a7d83 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -144,6 +144,7 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
 
 	/* FIXME likely duplicate with indtuples */
 	double		bs_numtuples;
@@ -475,6 +476,47 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginFlushBuildState
+ *		Write all data from BuildAccumulator into the tuplesort.
+ */
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(buildstate, attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
 /*
  * ginBuildCallbackParallel
  *		Callback for the parallel index build.
@@ -499,6 +541,11 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
+ * To detect a wraparound (which can happen with sync scans), we remember the
+ * last TID seen by each worker - if the next TID seen by the worker is lower,
+ * the scan must have wrapped around. We handle that by flushing the current
+ * buildstate to the tuplesort, so that we don't end up with wide TID lists.
+ *
  * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
@@ -515,6 +562,16 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
 
+	/* scan wrapped around - flush accumulated entries and start anew */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+	{
+		elog(LOG, "calling ginFlushBuildState");
+		ginFlushBuildState(buildstate, index);
+	}
+
+	/* remember the TID we're about to process */
+	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
+
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
 		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
 							   values[i], isnull[i], tid);
@@ -533,40 +590,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 	 * maintenance command.
 	 */
 	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the index key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			/* GIN tuple and tuple length that we'll use for tuplesort */
-			GinTuple   *tup;
-			Size		tuplen;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(buildstate, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &tuplen);
-
-			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
-	}
+		ginFlushBuildState(buildstate, index);
 
 	MemoryContextSwitchTo(oldCtx);
 }
@@ -603,6 +627,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.bs_numtuples = 0;
 	buildstate.bs_reltuples = 0;
 	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
@@ -1232,8 +1257,8 @@ GinBufferInit(Relation index)
 	 * with too many TIDs. and 64kB seems more than enough. But maybe this
 	 * should be tied to maintenance_work_mem or something like that?
 	 *
-	 * XXX This is not enough to prevent repeated merges after a wraparound
-	 * of the parallel scan, but it should be enough to make the merges cheap
+	 * XXX This is not enough to prevent repeated merges after a wraparound of
+	 * the parallel scan, but it should be enough to make the merges cheap
 	 * because it quickly reaches the end of the second list and can just
 	 * memcpy the rest without walking it item by item.
 	 */
@@ -1969,39 +1994,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 									   ginBuildCallbackParallel, state, scan);
 
 	/* write remaining accumulated entries */
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&state->accum);
-		while ((list = ginGetBAEntry(&state->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			GinTuple   *tup;
-			Size		len;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(state, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &len);
-
-			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(state->tmpCtx);
-		ginInitBA(&state->accum);
-	}
+	ginFlushBuildState(state, index);
 
 	/*
 	 * Do the first phase of in-worker processing - sort the data produced by
@@ -2086,6 +2079,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	buildstate.indtuples = 0;
 	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/*
 	 * create a temporary memory context that is used to hold data not yet
-- 
2.39.2

Attachment: v20240712-0008-Use-a-single-GIN-tuplesort.patch (text/x-patch)
From 0fbf98e98a3c18730790c45f28a96d3f47d3b0c4 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 5 Jul 2024 19:22:32 +0200
Subject: [PATCH v20240712 08/10] Use a single GIN tuplesort

The previous approach was to sort the data in a private sort, then read it
back, merge the GinTuples, and write the result into the shared sort for
the leader to consume.

The new approach is to use a single sort, merging tuples as we write them to disk.
This reduces temporary disk space usage.

An optimization was added to GinBuffer in which we don't deserialize tuples unless
we need access to the itemIds.

This modifies Tuplesort to have a new flushwrites callback. A sort's writetup
can now decide to buffer writes until the next flushwrites() call.
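
A standalone sketch of that buffered-write pattern (all names here are
illustrative, not the tuplesort API itself; emit_to_tape and merge_tuples
are left as stubs - the real callbacks operate on GinTuples and tapes
inside tuplesort):

#include <stdbool.h>

typedef struct { int key; /* plus the TID list */ } TupStub;

static TupStub pending;
static bool have_pending = false;

static void emit_to_tape(TupStub t);	/* the real write to the tape */
static TupStub merge_tuples(TupStub a, TupStub b);	/* concat TID lists */

/* Called for every tuple leaving the sort; holds one tuple back. */
static void
writetup(TupStub t)
{
	if (have_pending && pending.key == t.key)
	{
		/* same key as the buffered tuple - merge instead of writing */
		pending = merge_tuples(pending, t);
		return;
	}

	if (have_pending)
		emit_to_tape(pending);

	pending = t;
	have_pending = true;
}

/* The new hook: invoked when the run ends, writes whatever is left. */
static void
flushwrites(void)
{
	if (have_pending)
		emit_to_tape(pending);
	have_pending = false;
}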
---
 src/backend/access/gin/gininsert.c         | 427 +++++++++------------
 src/backend/utils/sort/tuplesort.c         |   5 +
 src/backend/utils/sort/tuplesortvariants.c | 102 ++++-
 src/include/access/gin_private.h           |   3 +
 src/include/access/gin_tuple.h             |  10 +
 src/include/utils/tuplesort.h              |  10 +-
 6 files changed, 307 insertions(+), 250 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index f79c9a7d83..e02cb6d0e6 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -163,14 +163,6 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
-
-	/*
-	 * The sortstate used only within a single worker for the first merge pass
-	 * happenning there. In principle it doesn't need to be part of the build
-	 * state and we could pass it around directly, but it's more convenient
-	 * this way. And it's part of the build state, after all.
-	 */
-	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -195,8 +187,7 @@ static Datum _gin_parse_tuple_key(GinTuple *a);
 static GinTuple *_gin_build_tuple(GinBuildState *state,
 								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
-								  ItemPointerData *items, uint32 nitems,
-								  Size *len);
+								  ItemPointerData *items, uint32 nitems);
 
 /*
  * Adds array of item pointers to tuple's posting list, or
@@ -499,16 +490,15 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
-		Size		tuplen;
 
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
 		tup = _gin_build_tuple(buildstate, attnum, category,
 							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+		tuplesort_putgintuple(buildstate->bs_sortstate, tup);
 
 		pfree(tup);
 	}
@@ -1169,8 +1159,14 @@ _gin_parallel_heapscan(GinBuildState *state)
  * synchronized (and thus may wrap around), and when combininng values from
  * multiple workers.
  */
-typedef struct GinBuffer
+struct GinBuffer
 {
+	/*
+	 * The memory context holds the dynamic allocations of the items, the
+	 * key, and any produced GinTuples.
+	 */
+	MemoryContext context;
+	GinTuple   *cached; /* copy of previous GIN tuple */
 	OffsetNumber attnum;
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
@@ -1188,7 +1184,7 @@ typedef struct GinBuffer
 	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
-} GinBuffer;
+};
 
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
@@ -1203,8 +1199,7 @@ AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
+	Assert(buffer->nitems == 0 || buffer->items != NULL);
 
 	for (int i = 0; i < buffer->nitems; i++)
 	{
@@ -1224,7 +1219,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* if we have any items, the array must exist */
-	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+	Assert((buffer->nitems == 0) || (buffer->items != NULL));
 
 	/*
 	 * we don't know if the TID array is expected to be sorted or not
@@ -1244,7 +1239,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
  *
  * Initializes sort support procedures for all index attributes.
  */
-static GinBuffer *
+GinBuffer *
 GinBufferInit(Relation index)
 {
 	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
@@ -1294,15 +1289,18 @@ GinBufferInit(Relation index)
 
 		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
 	}
+	buffer->context = GenerationContextCreate(CurrentMemoryContext,
+											  "Gin Buffer",
+											  ALLOCSET_DEFAULT_SIZES);
 
 	return buffer;
 }
 
 /* Is the buffer empty, i.e. has no TID values in the array? */
-static bool
+bool
 GinBufferIsEmpty(GinBuffer *buffer)
 {
-	return (buffer->nitems == 0);
+	return (buffer->nitems == 0 && buffer->cached == NULL);
 }
 
 /*
@@ -1314,37 +1312,71 @@ GinBufferIsEmpty(GinBuffer *buffer)
 static bool
 GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	int			r;
+	AttrNumber	attnum;
 	Datum		tupkey;
+	Datum		bufkey;
 
 	AssertCheckGinBuffer(buffer);
+	if (buffer->cached)
+	{
+		GinTuple   *cached = buffer->cached;
 
-	if (tup->attrnum != buffer->attnum)
-		return false;
+		if (tup->attrnum != cached->attrnum)
+			return false;
 
-	/* same attribute should have the same type info */
-	Assert(tup->typbyval == buffer->typbyval);
-	Assert(tup->typlen == buffer->typlen);
+		Assert(tup->typbyval == cached->typbyval);
+		Assert(tup->typlen == cached->typlen);
 
-	if (tup->category != buffer->category)
-		return false;
+		if (tup->category != cached->category)
+			return false;
 
-	/*
-	 * For NULL/empty keys, this means equality, for normal keys we need to
-	 * compare the actual key value.
-	 */
-	if (buffer->category != GIN_CAT_NORM_KEY)
-		return true;
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need to
+		 * compare the actual key value.
+		 */
+		if (cached->category != GIN_CAT_NORM_KEY)
+			return true;
+
+		attnum = cached->attrnum;
+		bufkey = _gin_parse_tuple_key(cached);
+	}
+	else
+	{
+		if (tup->attrnum != buffer->attnum)
+			return false;
+
+		/* same attribute should have the same type info */
+		Assert(tup->typbyval == buffer->typbyval);
+		Assert(tup->typlen == buffer->typlen);
+
+		if (tup->category != buffer->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need to
+		 * compare the actual key value.
+		 */
+		if (buffer->category != GIN_CAT_NORM_KEY)
+			return true;
+		attnum = buffer->attnum;
+		bufkey = buffer->key;
+	}
 
 	/*
 	 * For the tuple, get either the first sizeof(Datum) bytes for byval
 	 * types, or a pointer to the beginning of the data array.
 	 */
-	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+	tupkey = _gin_parse_tuple_key(tup);
+
+	prev = MemoryContextSwitchTo(buffer->context);
 
-	r = ApplySortComparator(buffer->key, false,
+	r = ApplySortComparator(bufkey, false,
 							tupkey, false,
-							&buffer->ssup[buffer->attnum - 1]);
+							&buffer->ssup[attnum - 1]);
+
+	MemoryContextSwitchTo(prev);
 
 	return (r == 0);
 }
@@ -1397,6 +1429,55 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 	return true;
 }
 
+static void
+GinBufferUnpackCached(GinBuffer *buffer, int reserve_space)
+{
+	Datum		key;
+	ItemPointer	items;
+	GinTuple   *cached;
+	int			totitems;
+
+	cached = buffer->cached;
+	totitems = cached->nitems + reserve_space;
+	key = _gin_parse_tuple_key(cached);
+
+	buffer->category = cached->category;
+	buffer->keylen = cached->keylen;
+	buffer->attnum = cached->attrnum;
+
+	buffer->typlen = cached->typlen;
+	buffer->typbyval = cached->typbyval;
+
+	if (cached->category == GIN_CAT_NORM_KEY)
+		buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+	else
+		buffer->key = (Datum) 0;
+
+	items = _gin_parse_tuple_items(cached);
+
+	if (buffer->items == NULL)
+	{
+		buffer->items = palloc0(totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else if (buffer->maxitems < totitems)
+	{
+		buffer->items = repalloc(buffer->items,
+								 totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else {
+		Assert(PointerIsValid(buffer->items) &&
+			   buffer->maxitems >= totitems);
+	}
+	memcpy(buffer->items, items, cached->nitems * sizeof(ItemPointerData));
+	buffer->nitems = cached->nitems;
+
+	buffer->cached = NULL;
+	pfree(cached);
+	pfree(items);
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1431,32 +1512,28 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * as that does palloc internally, but if we detected the append case,
  * we could do without it. Not sure how much overhead it is, though.
  */
-static void
-GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+void
+GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	ItemPointerData *items;
-	Datum		key;
 
+	prev = MemoryContextSwitchTo(buffer->context);
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple_key(tup);
-	items = _gin_parse_tuple_items(tup);
-
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
 	{
-		buffer->category = tup->category;
-		buffer->keylen = tup->keylen;
-		buffer->attnum = tup->attrnum;
-
-		buffer->typlen = tup->typlen;
-		buffer->typbyval = tup->typbyval;
-
-		if (tup->category == GIN_CAT_NORM_KEY)
-			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
-		else
-			buffer->key = (Datum) 0;
+		GinTuple   *tuple = palloc(tup->tuplen);
+		memcpy(tuple, tup, tup->tuplen);
+		buffer->cached = tuple;
 	}
+	else if (buffer->cached != NULL)
+	{
+		GinBufferUnpackCached(buffer, tup->nitems);
+	}
+
+	items = _gin_parse_tuple_items(tup);
 
 	/*
 	 * Try freeze TIDs at the beginning of the list, i.e. exclude them from
@@ -1530,6 +1607,33 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	/* free the decompressed TID list */
 	pfree(items);
+
+	MemoryContextSwitchTo(prev);
+}
+
+GinTuple *
+GinBufferBuildTuple(GinBuffer *buffer)
+{
+	MemoryContext prev = MemoryContextSwitchTo(buffer->context);
+	GinTuple *result;
+
+	if (buffer->cached)
+	{
+		result = buffer->cached;
+		buffer->cached = NULL;
+	}
+	else
+	{
+		result = _gin_build_tuple(NULL, buffer->attnum, buffer->category,
+								  buffer->key, buffer->typlen,
+								  buffer->typbyval, buffer->items,
+								  buffer->nitems);
+	}
+
+	GinBufferReset(buffer);
+
+	MemoryContextSwitchTo(prev);
+	return result;
 }
 
 /*
@@ -1543,14 +1647,21 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
  *
  * XXX Might be better to have a separate memory context for the buffer.
  */
-static void
+void
 GinBufferReset(GinBuffer *buffer)
 {
 	Assert(!GinBufferIsEmpty(buffer));
 
-	/* release byref values, do nothing for by-val ones */
-	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
-		pfree(DatumGetPointer(buffer->key));
+	/* release cached buffer tuple, if present */
+	if (buffer->cached)
+		pfree(buffer->cached);
+	else
+	{
+		/* release byref values, do nothing for by-val ones */
+		if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval
+			&& PointerIsValid(DatumGetPointer(buffer->key)))
+			pfree(DatumGetPointer(buffer->key));
+	}
 
 	/*
 	 * Not required, but makes it more likely to trigger NULL derefefence if
@@ -1566,6 +1677,7 @@ GinBufferReset(GinBuffer *buffer)
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
+	/* Note that we deliberately don't reset the memory context */
 }
 
 /*
@@ -1589,7 +1701,7 @@ GinBufferTrim(GinBuffer *buffer)
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
  */
-static void
+void
 GinBufferFree(GinBuffer *buffer)
 {
 	if (buffer->items)
@@ -1600,6 +1712,7 @@ GinBufferFree(GinBuffer *buffer)
 		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
 		pfree(DatumGetPointer(buffer->key));
 
+	MemoryContextDelete(buffer->context);
 	pfree(buffer);
 }
 
@@ -1613,7 +1726,7 @@ GinBufferFree(GinBuffer *buffer)
  * the TID array, and returning false if it's too large (more thant work_mem,
  * for example).
  */
-static bool
+bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 {
 	/* empty buffer can accept data for any key */
@@ -1690,6 +1803,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1718,6 +1832,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1731,7 +1846,10 @@ _gin_parallel_merge(GinBuildState *state)
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
@@ -1739,6 +1857,7 @@ _gin_parallel_merge(GinBuildState *state)
 	{
 		AssertCheckItemPointers(buffer, true);
 
+		Assert(!PointerIsValid(buffer->cached));
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
 					   buffer->items, buffer->nitems, &state->buildStats);
@@ -1790,162 +1909,6 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
-/*
- * _gin_process_worker_data
- *		First phase of the key merging, happening in the worker.
- *
- * Depending on the number of distinct keys, the TID lists produced by the
- * callback may be very short (due to frequent evictions in the callback).
- * But combining many tiny lists is expensive, so we try to do as much as
- * possible in the workers and only then pass the results to the leader.
- *
- * We read the tuples sorted by the key, and merge them into larger lists.
- * At the moment there's no memory limit, so this will just produce one
- * huge (sorted) list per key in each worker. Which means the leader will
- * do a very limited number of mergesorts, which is good.
- */
-static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-
-	GinBuffer  *buffer;
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The workers are limited to the same amount of memory as during the sort
-	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
-	 * during planning, just like there.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/* sort the raw per-worker data */
-	tuplesort_performsort(state->bs_worker_sort);
-
-	/* print some basic info */
-	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
-		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
-		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
-
-	/* reset before the second phase */
-	state->buildStats.sizeCompressed = 0;
-	state->buildStats.sizeRaw = 0;
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
-	 * merge them into larger chunks for the leader to combine.
-	 */
-	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
-	{
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-
-			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			Assert(buffer->nfrozen > 0);
-
-			state->buildStats.nTrims++;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-
-			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append it to the existing data).
-		 */
-		GinBufferStoreTuple(buffer, tup);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		GinTuple   *ntup;
-		Size		ntuplen;
-
-		AssertCheckItemPointers(buffer, true);
-
-		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems, &ntuplen);
-
-		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-		pfree(ntup);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	/* print some basic info */
-	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
-		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
-		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
-
-	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
-
-	tuplesort_end(worker_sort);
-}
-
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1978,11 +1941,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													sortmem, coordinate,
 													TUPLESORT_NONE);
 
-	/* Local per-worker sort of raw-data */
-	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
-													  sortmem, NULL,
-													  TUPLESORT_NONE);
-
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1996,13 +1954,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* write remaining accumulated entries */
 	ginFlushBuildState(state, index);
 
-	/*
-	 * Do the first phase of in-worker processing - sort the data produced by
-	 * the callback, and combine them into much larger chunks and place that
-	 * into the shared tuplestore for leader to process.
-	 */
-	_gin_process_worker_data(state, state->bs_worker_sort);
-
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
@@ -2159,8 +2110,7 @@ static GinTuple *
 _gin_build_tuple(GinBuildState *state,
 				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
-				 ItemPointerData *items, uint32 nitems,
-				 Size *len)
+				 ItemPointerData *items, uint32 nitems)
 {
 	GinTuple   *tuple;
 	char	   *ptr;
@@ -2228,8 +2178,6 @@ _gin_build_tuple(GinBuildState *state,
 	 */
 	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
-	*len = tuplen;
-
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
@@ -2291,12 +2239,15 @@ _gin_build_tuple(GinBuildState *state,
 		pfree(seginfo);
 	}
 
-	/* how large would the tuple be without compression? */
-	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		nitems * sizeof(ItemPointerData);
+	if (state)
+	{
+		/* how large would the tuple be without compression? */
+		state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+									 nitems * sizeof(ItemPointerData);
 
-	/* compressed size */
-	state->buildStats.sizeCompressed += tuplen;
+		/* compressed size */
+		state->buildStats.sizeCompressed += tuplen;
+	}
 
 	return tuple;
 }
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index a3921373c5..cfaa17d9bc 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -399,6 +399,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2277,6 +2278,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2406,6 +2409,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index ea15af104d..516c85f80e 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "access/gin.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -90,6 +91,7 @@ static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
 static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
 							   SortTuple *stup);
+static void flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape);
 static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -101,6 +103,7 @@ static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static void freestate_cluster(Tuplesortstate *state);
+static void freestate_index_gin(Tuplesortstate *state);
 
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the CLUSTER case.  Set by
@@ -135,6 +138,16 @@ typedef struct
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
 } TuplesortIndexBTreeArg;
 
+/*
+ * Data structure pointed by "TuplesortPublic.arg" for the index_gin subcase.
+ */
+typedef struct
+{
+	TuplesortIndexArg index;
+	GinBuffer *buffer;
+} TuplesortIndexGinArg;
+
+
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the index_hash subcase.
  */
@@ -211,6 +224,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -289,6 +303,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -399,6 +414,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -480,6 +496,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -526,6 +543,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -583,6 +601,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -602,6 +621,7 @@ tuplesort_begin_index_gin(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   sortopt);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg;
 	MemoryContext oldcontext;
 	int			i;
 	TupleDesc	desc = RelationGetDescr(indexRel);
@@ -626,6 +646,10 @@ tuplesort_begin_index_gin(Relation heapRel,
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
 										   sizeof(SortSupportData));
+	arg = palloc0(sizeof(TuplesortIndexGinArg));
+	arg->index.indexRel = indexRel;
+	arg->index.heapRel = heapRel;
+	arg->buffer = GinBufferInit(indexRel);
 
 	for (i = 0; i < base->nKeys; i++)
 	{
@@ -657,9 +681,11 @@ tuplesort_begin_index_gin(Relation heapRel,
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
 	base->writetup = writetup_index_gin;
+	base->flushwrites = flushwrites_index_gin;
 	base->readtup = readtup_index_gin;
+	base->freestate = freestate_index_gin;
 	base->haveDatum1 = false;
-	base->arg = NULL;
+	base->arg = arg;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -702,6 +728,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -904,17 +931,17 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 }
 
 void
-tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple)
 {
 	SortTuple	stup;
 	GinTuple   *ctup;
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
-	Size		tuplen;
+	Size		tuplen = tuple->tuplen;
 
 	/* copy the GinTuple into the right memory context */
-	ctup = palloc(size);
-	memcpy(ctup, tuple, size);
+	ctup = palloc(tuplen);
+	memcpy(ctup, tuple, tuplen);
 
 	stup.tuple = ctup;
 	stup.datum1 = (Datum) 0;
@@ -922,7 +949,7 @@ tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
 
 	/* GetMemoryChunkSpace is not supported for bump contexts */
 	if (TupleSortUseBumpTupleCxt(base->sortopt))
-		tuplen = MAXALIGN(size);
+		tuplen = MAXALIGN(tuplen);
 	else
 		tuplen = GetMemoryChunkSpace(ctup);
 
@@ -1942,19 +1969,61 @@ comparetup_index_gin(const SortTuple *a, const SortTuple *b,
 }
 
 static void
-writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+_writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, GinTuple *tup)
 {
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
-	GinTuple   *tuple = (GinTuple *) stup->tuple;
-	unsigned int tuplen = tuple->tuplen;
-
+	unsigned int tuplen = tup->tuplen;
 	tuplen = tuplen + sizeof(tuplen);
+
 	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
-	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	LogicalTapeWrite(tape, tup, tup->tuplen);
+
 	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 }
 
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *otup;
+	GinTuple   *ntup = (GinTuple *) stup->tuple;
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(PointerIsValid(arg));
+
+	if (GinBufferCanAddKey(arg->buffer, ntup))
+	{
+		GinBufferMergeTuple(arg->buffer, ntup);
+		return;
+	}
+
+	otup = GinBufferBuildTuple(arg->buffer);
+
+	_writetup_index_gin(state, tape, otup);
+
+	pfree(otup);
+
+	Assert(GinBufferCanAddKey(arg->buffer, ntup));
+
+	GinBufferMergeTuple(arg->buffer, ntup);
+}
+
+static void
+flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	if (!GinBufferIsEmpty(arg->buffer))
+	{
+		GinTuple *tuple = GinBufferBuildTuple(arg->buffer);
+		_writetup_index_gin(state, tape, tuple);
+		pfree(tuple);
+		Assert(GinBufferIsEmpty(arg->buffer));
+	}
+}
+
 static void
 readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 				  LogicalTape *tape, unsigned int len)
@@ -1980,6 +2049,17 @@ readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = (Datum) 0;
 }
 
+static void
+freestate_index_gin(Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(arg != NULL);
+	Assert(GinBufferIsEmpty(arg->buffer));
+	GinBufferFree(arg->buffer);
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 3013a44bae..149191b7df 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -475,6 +475,9 @@ extern int	ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len, TI
 
 extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *segment, int len,
 												   int *ndecoded_out);
+extern bool ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+												ItemPointer into, int capacity,
+												int *ndecoded_out);
 extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded_out);
 extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
 										ItemPointerData *b, uint32 nb,
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 55dd8544b2..4ac8cfcc2b 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -35,6 +35,16 @@ typedef struct GinTuple
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
 
+typedef struct GinBuffer GinBuffer;
+
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
 
+extern GinBuffer *GinBufferInit(Relation index);
+extern bool GinBufferIsEmpty(GinBuffer *buffer);
+extern bool GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup);
+extern void GinBufferReset(GinBuffer *buffer);
+extern void GinBufferFree(GinBuffer *buffer);
+extern void GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup);
+extern GinTuple *GinBufferBuildTuple(GinBuffer *buffer);
+
 #endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 0ed71ae922..6c56e40bff 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -194,6 +194,14 @@ typedef struct
 	 */
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient
+	 * use of the tape's resources, e.g. when deduplicating or merging
+	 * values.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
@@ -461,7 +469,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
-extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, struct GinTuple *tuple);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
-- 
2.39.2
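
(To make the new callback's contract concrete, here is a condensed sketch
of the writetup/flushwrites pair from the diff above. The buffer_* helpers
and the buffer itself are stand-ins for the GinBuffer kept in base->arg,
not real functions:)

/* writetup variant that merges adjacent tuples with equal keys */
static void
writetup_buffered(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
{
	/* input is sorted, so duplicates of a key arrive adjacent */
	if (buffer_can_merge(buffer, stup->tuple))
	{
		buffer_merge(buffer, stup->tuple);	/* absorb, write nothing yet */
		return;
	}

	/* key changed - emit the combined tuple, then start a new one */
	write_one_tuple(state, tape, buffer_build(buffer));
	buffer_reset(buffer);
	buffer_merge(buffer, stup->tuple);
}

/*
 * flushwrites, called by dumptuples()/mergeonerun() before the
 * end-of-run marker, so the last buffered key is not lost.
 */
static void
flushwrites_buffered(Tuplesortstate *state, LogicalTape *tape)
{
	if (!buffer_is_empty(buffer))
		write_one_tuple(state, tape, buffer_build(buffer));
	buffer_reset(buffer);
}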

v20240712-0009-Reduce-the-size-of-GinTuple-by-12-bytes.patchtext/x-patch; charset=UTF-8; name=v20240712-0009-Reduce-the-size-of-GinTuple-by-12-bytes.patchDownload
From ca5aec2c40c33d0a8768d8d78b11872c56613d96 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 5 Jul 2024 20:58:37 +0200
Subject: [PATCH v20240712 09/10] Reduce the size of GinTuple by 12 bytes

The size of a Gin tuple can't be larger than what we can allocate, which is just
shy of 1GB; this reduces the number of useful bits in size fields to 30 bits; so
int will be enough here.

Next, a key must fit in a single page (up to 32KB), so uint16 should be enough for
the keylen attribute.

Then, re-organize the fields to minimize alignment losses, while maintaining an
order that does make logical grouping sense.

Finally, use the first posting list to get the first stored ItemPointer; this
deduplicates stored data and thus improves performance again. In passing, adjust the
alignment of the first GinPostingList in GinTuple from MAXALIGN to SHORTALIGN.
---
 src/backend/access/gin/gininsert.c | 21 ++++++++++++---------
 src/include/access/gin_tuple.h     | 19 +++++++++++++++----
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e02cb6d0e6..b9444b6db7 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1550,7 +1550,8 @@ GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 	 * when merging non-overlapping lists, e.g. in each parallel worker.
 	 */
 	if ((buffer->nitems > 0) &&
-		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+							GinTupleGetFirst(tup)) == 0))
 		buffer->nfrozen = buffer->nitems;
 
 	/*
@@ -1567,7 +1568,8 @@ GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
 	{
 		/* Is the TID after the first TID of the new tuple? Can't freeze. */
-		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+		if (ItemPointerCompare(&buffer->items[i],
+							   GinTupleGetFirst(tup)) > 0)
 			break;
 
 		buffer->nfrozen++;
@@ -2176,7 +2178,7 @@ _gin_build_tuple(GinBuildState *state,
 	 * alignment, to allow direct access to compressed segments (those require
 	 * SHORTALIGN, but we do MAXALING anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	/*
 	 * Allocate space for the whole GIN tuple.
@@ -2191,7 +2193,6 @@ _gin_build_tuple(GinBuildState *state,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
-	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -2222,7 +2223,7 @@ _gin_build_tuple(GinBuildState *state,
 	}
 
 	/* finally, copy the TIDs into the array */
-	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
 
 	/* copy in the compressed data, and free the segments */
 	dlist_foreach_modify(iter, &segments)
@@ -2292,8 +2293,8 @@ _gin_parse_tuple_items(GinTuple *a)
 	int			ndecoded;
 	ItemPointer items;
 
-	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	len = a->tuplen - SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
 
 	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
 
@@ -2355,8 +2356,10 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 								&ssup[a->attrnum - 1]);
 
 		/* if the key is the same, consider the first TID in the array */
-		return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
+		return (r != 0) ? r : ItemPointerCompare(GinTupleGetFirst(a),
+												 GinTupleGetFirst(b));
 	}
 
-	return ItemPointerCompare(&a->first, &b->first);
+	return ItemPointerCompare(GinTupleGetFirst(a),
+							  GinTupleGetFirst(b));
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 4ac8cfcc2b..f4dbdfd3f7 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -10,10 +10,12 @@
 #ifndef GIN_TUPLE_
 #define GIN_TUPLE_
 
+#include "access/ginblock.h"
 #include "storage/itemptr.h"
 #include "utils/sortsupport.h"
 
 /*
+ * XXX: Update description with new architecture
  * Each worker sees tuples in CTID order, so if we track the first TID and
  * compare that when combining results in the worker, we would not need to
  * do an expensive sort in workers (the mergesort is already smart about
@@ -24,17 +26,26 @@
  */
 typedef struct GinTuple
 {
-	Size		tuplen;			/* length of the whole tuple */
-	Size		keylen;			/* bytes in data for key value */
+	int			tuplen;			/* length of the whole tuple */
+	OffsetNumber attrnum;		/* attnum of index key */
+	uint16		keylen;			/* bytes in data for key value */
 	int16		typlen;			/* typlen for key */
 	bool		typbyval;		/* typbyval for key */
-	OffsetNumber attrnum;		/* attnum of index key */
 	signed char category;		/* category: normal or NULL? */
-	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
 
+static inline ItemPointer
+GinTupleGetFirst(GinTuple *tup)
+{
+	GinPostingList *list;
+
+	list = (GinPostingList *) SHORTALIGN(tup->data + tup->keylen);
+
+	return &list->first;
+}
+
 typedef struct GinBuffer GinBuffer;
 
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
-- 
2.39.2
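
(To make the alignment contract in 0009 concrete: the key bytes end at
offsetof(GinTuple, data) + keylen, and the compressed posting lists start
at the next SHORTALIGN'ed offset, so the writer and both readers must use
the same macro. A hypothetical accessor - not in the patch - doing the same
arithmetic as _gin_parse_tuple_items() and GinTupleGetFirst():)

static inline GinPostingList *
gin_tuple_posting_lists(GinTuple *tup, int *len)
{
	Size		off = SHORTALIGN(offsetof(GinTuple, data) + tup->keylen);

	*len = tup->tuplen - off;	/* bytes of compressed posting lists */
	return (GinPostingList *) ((char *) tup + off);
}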

v20240712-0010-WIP-parallel-inserts-into-GIN-index.patchtext/x-patch; charset=UTF-8; name=v20240712-0010-WIP-parallel-inserts-into-GIN-index.patchDownload
From ab893f39539913120ad7b170494c6daca8c441bd Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Tue, 9 Jul 2024 20:53:20 +0200
Subject: [PATCH v20240712 10/10] WIP: parallel inserts into GIN index

---
 src/backend/access/gin/gininsert.c            | 432 ++++++++++++------
 .../utils/activity/wait_event_names.txt       |   2 +
 2 files changed, 289 insertions(+), 145 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b9444b6db7..a6c7867a34 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -25,7 +25,9 @@
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "pgstat.h"
+#include "storage/barrier.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "storage/predicate.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/datum.h"
@@ -42,6 +44,11 @@
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
 
+/* The phases for parallel builds, used by build_barrier. */
+#define GIN_BUILD_INIT					0
+#define GIN_BUILD_SCAN					1
+#define GIN_BUILD_PARTITION				2
+
 /*
  * Status for index builds performed in parallel.  This is allocated in a
  * dynamic shared memory segment.
@@ -87,6 +94,9 @@ typedef struct GinShared
 	int			nparticipantsdone;
 	double		reltuples;
 	double		indtuples;
+ 
+	Barrier		build_barrier;
+	SharedFileSet fileset;		/* space for shared temporary files */
 
 	/*
 	 * ParallelTableScanDescData data follows. Can't directly embed here, as
@@ -172,7 +182,6 @@ static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relati
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
 static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
-static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
 											  Relation heap, Relation index);
 static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
@@ -189,6 +198,12 @@ static GinTuple *_gin_build_tuple(GinBuildState *state,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems);
 
+static double _gin_partition_sorted_data(GinBuildState *state);
+static void _gin_parallel_insert(GinBuildState *state,
+								 GinShared *gistshared,
+								 Relation heap, Relation index,
+								 bool progress);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -554,10 +569,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	/* scan wrapped around - flush accumulated entries and start anew */
 	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
-	{
-		elog(LOG, "calling ginFlushBuildState");
 		ginFlushBuildState(buildstate, index);
-	}
 
 	/* remember the TID we're about to process */
 	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
@@ -718,8 +730,12 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
 
-		/* scan the relation in parallel and merge per-worker results */
-		reltuples = _gin_parallel_merge(state);
+		/* partition the sorted data */
+		reltuples = _gin_partition_sorted_data(state);
+
+		/* do the insert for the leader's partition */
+		_gin_parallel_insert(state, state->bs_leader->ginshared,
+							 heap, index, true);
 
 		_gin_end_parallel(state->bs_leader, state);
 	}
@@ -1009,6 +1025,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	ginshared->reltuples = 0.0;
 	ginshared->indtuples = 0.0;
 
+	/* used to wait for data to insert */
+	BarrierInit(&ginshared->build_barrier, scantuplesortstates);
+
+	/* Set up the space we'll use for shared temporary files. */
+	SharedFileSetInit(&ginshared->fileset, pcxt->seg);
+
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinShared(ginshared),
 								  snapshot);
@@ -1080,6 +1102,11 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned, leader continues");
 }
 
 /*
@@ -1093,6 +1120,8 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	/* Shutdown worker processes */
 	WaitForParallelWorkersToFinish(ginleader->pcxt);
 
+	SharedFileSetDeleteAll(&ginleader->ginshared->fileset);
+
 	/*
 	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
 	 * or we might get incomplete data.)
@@ -1739,145 +1768,6 @@ GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 	return GinBufferKeyEquals(buffer, tup);
 }
 
-/*
- * Within leader, wait for end of heap scan and merge per-worker results.
- *
- * After waiting for all workers to finish, merge the per-worker results into
- * the complete index. The results from each worker are sorted by block number
- * (start of the page range). While combinig the per-worker results we merge
- * summaries for the same page range, and also fill-in empty summaries for
- * ranges without any tuples.
- *
- * Returns the total number of heap tuples scanned.
- *
- * FIXME Maybe should have local memory contexts similar to what
- * _brin_parallel_merge does?
- */
-static double
-_gin_parallel_merge(GinBuildState *state)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-	double		reltuples = 0;
-	GinBuffer  *buffer;
-
-	/* wait for workers to scan table and produce partial results */
-	reltuples = _gin_parallel_heapscan(state);
-
-	/* do the actual sort in the leader */
-	tuplesort_performsort(state->bs_sortstate);
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The leader is allowed to use the whole maintenance_work_mem buffer to
-	 * combine data. The parallel workers already completed.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by category and
-	 * key. That probably gives us order matching how data is organized in the
-	 * index.
-	 *
-	 * We don't insert the GIN tuples right away, but instead accumulate as
-	 * many TIDs for the same key as possible, and then insert that at once.
-	 * This way we don't need to decompress/recompress the posting lists, etc.
-	 *
-	 * XXX Maybe we should sort by key first, then by category? The idea is
-	 * that if this matches the order of the keys in the index, we'd insert
-	 * the entries in order better matching the index.
-	 */
-	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
-	{
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nitems, &state->buildStats);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			Assert(buffer->nfrozen > 0);
-
-			state->buildStats.nTrims++;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nfrozen, &state->buildStats);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferMergeTuple(buffer, tup);
-
-		if (buffer->cached)
-			GinBufferUnpackCached(buffer, 0);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		AssertCheckItemPointers(buffer, true);
-
-		Assert(!PointerIsValid(buffer->cached));
-		ginEntryInsert(&state->ginstate,
-					   buffer->attnum, buffer->key, buffer->category,
-					   buffer->items, buffer->nitems, &state->buildStats);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
-
-	tuplesort_end(state->bs_sortstate);
-
-	return reltuples;
-}
-
 /*
  * Returns size of shared memory required to store state for a parallel
  * gin index build based on the snapshot its parallel scan will use.
@@ -2061,6 +1951,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Prepare to track buffer usage during parallel execution */
 	InstrStartParallelQuery();
 
+	/* attach to the fileset too */
+	SharedFileSetAttach(&ginshared->fileset, seg);
+
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
 	 * (when requested number of workers were not launched, this will be
@@ -2071,6 +1964,20 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
 
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned by workers, leader continues");
+
+	/* leader sorts and partitions the data */
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned by leader, worker continues");
+
+	_gin_parallel_insert(&buildstate, ginshared, heapRel, indexRel, false);
+
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
@@ -2363,3 +2270,238 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	return ItemPointerCompare(GinTupleGetFirst(a),
 							  GinTupleGetFirst(b));
 }
+
+static double
+_gin_partition_sorted_data(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	GinShared  *shared = state->bs_leader->ginshared;
+	BufFile	  **files;
+	int64		fileidx = 0;
+	double		reltuples;
+
+	/* how many tuples per worker */
+	int64		worker_tuples = (state->indtuples / shared->scantuplesortstates) + 1;
+	int64		remaining = Min(worker_tuples, 1000);
+	int64		ntmp = 0;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* Allocate BufFiles, one for each participant. */
+	files = palloc0_array(BufFile *, shared->scantuplesortstates);
+
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		char	fname[128];
+		sprintf(fname, "worker-%d", i);
+
+		files[i] = BufFileCreateFileSet(&shared->fileset.fs, fname);
+	}
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us order matching how data is organized in the
+	 * index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		ntmp++;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * FIXME Maybe move to next partition only when the index key changes?
+		 * Otherwise we might have issues with 'could not fit onto page' when
+		 * adding overlapping TID lists to the index. But maybe it can't with
+		 * the merging of data in the tuplesort?
+		 */
+
+		BufFileWrite(files[fileidx], &tuplen, sizeof(tuplen));
+		BufFileWrite(files[fileidx], tup, tuplen);
+
+		remaining--;
+
+		/* move to the next file */
+		if (remaining == 0)
+		{
+			remaining = Min(worker_tuples, 1000);
+			fileidx++;
+			fileidx = fileidx % shared->scantuplesortstates;
+		}
+	}
+
+	/* close the files */
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		BufFileClose(files[i]);
+	}
+
+	/* and also close the tuplesort */
+	tuplesort_end(state->bs_sortstate);
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&shared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned, leader continues");
+
+	return reltuples;
+}
+
+static void
+_gin_parallel_insert(GinBuildState *state, GinShared *ginshared,
+					 Relation heap, Relation index, bool progress)
+{
+	GinBuffer  *buffer;
+	GinTuple   *tup;
+	Size		len;
+
+	BufFile *file;
+	char	fname[128];
+	char   *buff;
+	int64	ntuples = 0;
+	Size	maxlen;
+
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers already completed.
+	 */
+	buffer = GinBufferInit(state->ginstate.index);
+
+
+	sprintf(fname, "worker-%d", ParallelWorkerNumber + 1);
+	file = BufFileOpenFileSet(&ginshared->fileset.fs, fname, O_RDONLY, false);
+
+	/* 8kB seems like a reasonable starting point */
+	maxlen = 8192;
+	buff = palloc(maxlen);
+
+	while (true)
+	{
+		size_t		ret;
+
+		ret = BufFileRead(file, &len, sizeof(len));
+
+		if (ret == 0)
+			break;
+		if (ret != sizeof(len))
+			elog(ERROR, "incorrect data %zu %zu", ret, sizeof(len));
+
+		/* maybe resize the buffer */
+		if (maxlen < len)
+		{
+			while (maxlen < len)
+				maxlen *= 2;
+
+			buff = repalloc(buff, maxlen);
+		}
+
+		tup = (GinTuple *) buff;
+
+
+		BufFileReadExact(file, tup, len);
+
+		ntuples++;
+
+		if (ntuples % 100000 == 0)
+			elog(LOG, "inserted " INT64_FORMAT " tuples", ntuples);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The buffer needs trimming - flush the frozen part of the TID
+			 * list into the index, and keep the rest of the data for the
+			 * current key.
+			 */
+			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer, true);
+
+		Assert(!PointerIsValid(buffer->cached));
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	elog(LOG, "_gin_parallel_insert ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
+	BufFileClose(file);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index db37beeaae..3998cf33ec 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -115,6 +115,8 @@ CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from star
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
 CHECKPOINT_START	"Waiting for a checkpoint to start."
 EXECUTE_GATHER	"Waiting for activity from a child process while executing a <literal>Gather</literal> plan node."
+GIN_BUILD_PARTITION	"Waiting for partitioning of data during a parallel GIN index build."
+GIN_BUILD_SCAN	"Waiting for a scan of data during a parallel GIN index build."
 HASH_BATCH_ALLOCATE	"Waiting for an elected Parallel Hash participant to allocate a hash table."
 HASH_BATCH_ELECT	"Waiting to elect a Parallel Hash participant to allocate a hash table."
 HASH_BATCH_LOAD	"Waiting for other Parallel Hash participants to finish loading a hash table."
-- 
2.39.2
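
(The flow of the new build_barrier is easier to see condensed. This is a
sketch of what each participant does, pieced together from
_gin_parallel_build_main() above; it is not compilable on its own:)

/* phase 1: scan the heap, feed GinTuples into the shared tuplesort */
_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
							 heapRel, indexRel, sortmem, false);
BarrierArriveAndWait(&ginshared->build_barrier, WAIT_EVENT_GIN_BUILD_SCAN);

/* phase 2 happens in the leader only: it sorts the combined data, and
 * _gin_partition_sorted_data() spreads it into per-worker BufFiles */
BarrierArriveAndWait(&ginshared->build_barrier, WAIT_EVENT_GIN_BUILD_PARTITION);

/* phase 3: every participant inserts its own "worker-%d" partition */
_gin_parallel_insert(&buildstate, ginshared, heapRel, indexRel, false);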

#32Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andy Fan (#30)
Re: Parallel CREATE INDEX for GIN indexes

On Tue, 9 Jul 2024 at 03:18, Andy Fan <zhihuifan1213@163.com> wrote:

and later we called 'tuplesort_performsort(state->bs_sortstate);'. Even
though we have some CTID merge activity in '....(1)', the tuples are
still ordered, so the sorts (in both tuplesort_putgintuple and
tuplesort_performsort) are not necessary. What's more, each
'flush-memory-to-disk' in tuplesort creates a 'sorted run', and in this
case we actually need only 1 run, since all the input tuples in the
worker are already sorted. The reduction of sorted runs in the worker
will be helpful for the leader's final mergeruns. The 'sorted-run'
benefit doesn't exist for case 1 (RBTree -> worker_state).

If Matthias's proposal is adopted, my optimization will not be useful
anymore, and Matthias's proposal looks like a more natural and
efficient way.

I think they might be complementary. I don't think it's reasonable to
expect GIN's BuildAccumulator to buffer all the index tuples at the
same time (as I mentioned upthread: we are or should be limited by
work memory), but the BuildAccumulator will do a much better job at
combining tuples than the in-memory sort + merge-write done by
Tuplesort (because BA will use (much?) less memory for the same number
of stored values). So, the idea of making BuildAccumulator responsible
for providing the initial sorted runs does resonate with me, and can
also be worth pursuing.

I think it would indeed save time otherwise spent comparing if tuples
can be merged before they're first spilled to disk, when we already
have knowledge about which tuples are a sorted run. Afterwards, only
the phases where we merge sorted runs from disk would require my
buffered write approach that merges Gin tuples.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#33Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#31)
Re: Parallel CREATE INDEX for GIN indexes

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

Hi,

I got to do the detailed benchmarking on the latest version of the patch
series, so here's the results. My goal was to better understand the
impact of each patch individually - especially the two parts introduced
by Matthias, but not only - so I ran the test on a build with each of
the 0001-0009 patches.

This is the same test I did at the very beginning, but the basic details
are that I have a 22GB table with archives of our mailing lists (1.6M
messages, roughly), and I build a couple different GIN indexes on
that:

..

Very impressive testing!

And let's talk about the improvement by Matthias, namely:

* 0008 Use a single GIN tuplesort
* 0009 Reduce the size of GinTuple by 12 bytes

I haven't really seen any impact on duration - it seems more or less
within noise. Maybe it would be different on machines with less RAM, but
on my two systems it didn't really make a difference.

It did significantly reduce the amount of temporary data written, by
~40% or so. This is pretty nicely visible on the "trgm" case, which
generates the most temp files of the four indexes. An example from the
i5/32MB section looks like this:

label       0000  0001  0002  0003  0004  0005  0006  0007  0008  0009  0010
------------------------------------------------------------------------------
trgm / 3       0  2635  3690  3715  1177  1177  1179  1179   696   682  1016

After seeing the above data, I wanted to know where the time is spent
and why the ~40% IO reduction doesn't make a measurable duration
improvement, so I did the following test.

create table gin_t (a int[]);
insert into gin_t select * from rand_array(30000000, 0, 100, 0, 50); [1]
select pg_prewarm('gin_t');

postgres=# create index on gin_t using gin(a);
INFO: pid: 145078, stage 1 took 44476 ms
INFO: pid: 145177, stage 1 took 44474 ms
INFO: pid: 145078, stage 2 took 2662 ms
INFO: pid: 145177, stage 2 took 2611 ms
INFO: pid: 145177, stage 3 took 240 ms
INFO: pid: 145078, stage 3 took 239 ms

CREATE INDEX
Time: 79472.135 ms (01:19.472)

Then we can see that stage 1 takes 56% of the execution time, stages 2
and 3 take 3%, and the leader's work takes the remaining 41%. I think
that's why we didn't see much performance improvement from 0008, since
it improves stages 2 and 3.

==== Here are my definitions of stages 1/2/3.
stage 1:
reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
ginBuildCallbackParallel, state, scan);

/* write remaining accumulated entries */
ginFlushBuildState(state, index);

stage 2:
_gin_process_worker_data(state, state->bs_worker_sort)

stage 3:
tuplesort_performsort(state->bs_sortstate);

But 0008 still does a lot of good stuff:
1. Reduces the IO usage; this would be more useful on heavily IO-bound
workloads.
2. Simplifies the building logic by removing one stage.
3. Adds the 'buffer-writetup' to tuplesort.c. I don't have another use
case for now, but it looks like a reasonable design.

I think the current blocker is whether it is safe to hack tuplesort.c.
With my current knowledge it looks good to me, but it would be better to
open a dedicated thread to discuss this specifically; the review would
not take long if someone experienced in this area took a look.

Now, what's the 0010 patch about?

For some indexes (e.g. trgm), the parallel builds help a lot, because
they produce a lot of temporary data and the parallel sort is a
substantial part of the work. But for other indexes (especially the
"smaller" indexes on jsonb headers), it's not that great. For example
for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.

Clearly, this happens because a lot of time is spent outside the sort,
actually inserting data into the index.

You always focus on the most important part, which inspires me a lot.
Even in my simple testing, the "inserting data into index" stage takes
40% of the time.

So I was wondering if we might
parallelize that too, and how much time would it save - 0010 is an
experimental patch doing that. It splits the processing into 3 phases:

1. workers feeding data into tuplesort
2. leader finishes sort and "repartitions" the data
3. workers inserting their partition into index

The patch is far from perfect (more of a PoC) ..

This does help a little bit, reducing the duration by ~15-25%. I wonder
if this might be improved by partitioning the data differently - not by
shuffling everything from the tuplesort into a fileset (it increases the
amount of temporary data in the charts). And also by distributing the
data differently - right now it's a bit of a round robin, because it
wasn't clear how many entries there are.

Due to the complexity of the existing code, I would like to focus on
the existing patch first. So I vote for doing this optimization as a
dedicated patch.

and later we called 'tuplesort_performsort(state->bs_sortstate);'. Even
though we have some CTID merge activity in '....(1)', the tuples are
still ordered, so the sorts (in both tuplesort_putgintuple and
tuplesort_performsort) are not necessary,
..
If Matthias's proposal is adopted, my optimization will not be useful
anymore, and Matthias's proposal looks like a more natural and
efficient way.

I think they might be complementary. I don't think it's reasonable to
expect GIN's BuildAccumulator to buffer all the index tuples at the
same time (as I mentioned upthread: we are or should be limited by
work memory), but the BuildAccumulator will do a much better job at
combining tuples than the in-memory sort + merge-write done by
Tuplesort (because BA will use (much?) less memory for the same number
of stored values).

Thank you Matthias for valuing my point! And thanks for highlighting
the benefit that BuildAccumulator can do a better job of sorting in
memory (I think that is mainly because BuildAccumulator can merge at
run time as it accepts more tuples). But I'm still not willing to go
further in this direction. The reasons are: a) it probably can't make
a big difference in the final result; b) the best implementation of
this idea would allow the user of tuplesort.c to insert pre-sorted
tuples into the tape directly, rather than inserting them into
tuplesort's memory and dumping them to tape without a sort, but I
can't define a clean API for that; c) CREATE INDEX is maintenance
work - improving it by 30% would be good, but improving it by <3%
does not look very compelling in practice.

So my opinion is that if we can get agreement on 0008, then we can do
a final review/test of the existing code (including 0009), and leave
further improvements to a dedicated patch.

What do you think?

[1]: /messages/by-id/87le0iqrsu.fsf@163.com

--
Best Regards
Andy Fan

#34Tomas Vondra
tomas@vondra.me
In reply to: Andy Fan (#33)
Re: Parallel CREATE INDEX for GIN indexes

On 8/27/24 12:14, Andy Fan wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

Hi,

I got to do the detailed benchmarking on the latest version of the patch
series, so here's the results. My goal was to better understand the
impact of each patch individually - especially the two parts introduced
by Matthias, but not only - so I ran the test on a build with each of
the 0001-0009 patches.

This is the same test I did at the very beginning, but the basic details
are that I have a 22GB table with archives of our mailing lists (1.6M
messages, roughly), and I build a couple different GIN indexes on
that:

..

Very impressive testing!

And let's talk about the improvement by Matthias, namely:

* 0008 Use a single GIN tuplesort
* 0009 Reduce the size of GinTuple by 12 bytes

I haven't really seen any impact on duration - it seems more or less
within noise. Maybe it would be different on machines with less RAM, but
on my two systems it didn't really make a difference.

It did significantly reduce the amount of temporary data written, by
~40% or so. This is pretty nicely visible on the "trgm" case, which
generates the most temp files of the four indexes. An example from the
i5/32MB section looks like this:

label       0000  0001  0002  0003  0004  0005  0006  0007  0008  0009  0010
------------------------------------------------------------------------------
trgm / 3       0  2635  3690  3715  1177  1177  1179  1179   696   682  1016

After seeing the above data, I wanted to know where the time is spent
and why the ~40% IO reduction doesn't make a measurable duration
improvement, so I did the following test.

create table gin_t (a int[]);
insert into gin_t select * from rand_array(30000000, 0, 100, 0, 50); [1]
select pg_prewarm('gin_t');

postgres=# create index on gin_t using gin(a);
INFO: pid: 145078, stage 1 took 44476 ms
INFO: pid: 145177, stage 1 took 44474 ms
INFO: pid: 145078, stage 2 took 2662 ms
INFO: pid: 145177, stage 2 took 2611 ms
INFO: pid: 145177, stage 3 took 240 ms
INFO: pid: 145078, stage 3 took 239 ms

CREATE INDEX
Time: 79472.135 ms (01:19.472)

Then we can see that stage 1 takes 56% of the execution time, stages 2
and 3 take 3%, and the leader's work takes the remaining 41%. I think
that's why we didn't see much performance improvement from 0008, since
it improves stages 2 and 3.

Yes, that makes sense. It's such a small fraction of the computation
that it can't translate into a meaningful speedup.

==== Here are my definitions of stages 1/2/3.
stage 1:
reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
ginBuildCallbackParallel, state, scan);

/* write remaining accumulated entries */
ginFlushBuildState(state, index);

stage 2:
_gin_process_worker_data(state, state->bs_worker_sort)

stage 3:
tuplesort_performsort(state->bs_sortstate);

But 0008 still does a lot of good stuff:
1. Reduces the IO usage; this would be more useful on heavily IO-bound
workloads.
2. Simplifies the building logic by removing one stage.
3. Adds the 'buffer-writetup' to tuplesort.c. I don't have another use
case for now, but it looks like a reasonable design.

I think the current blocker is whether it is safe to hack tuplesort.c.
With my current knowledge it looks good to me, but it would be better to
open a dedicated thread to discuss this specifically; the review would
not take long if someone experienced in this area took a look.

I agree. I expressed the same impression earlier in this thread, IIRC.

Now, what's the 0010 patch about?

For some indexes (e.g. trgm), the parallel builds help a lot, because
they produce a lot of temporary data and the parallel sort is a
substantial part of the work. But for other indexes (especially the
"smaller" indexes on jsonb headers), it's not that great. For example
for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.

Clearly, this happens because a lot of time is spent outside the sort,
actually inserting data into the index.

You always focus on the most important part, which inspires me a lot.
Even in my simple testing, the "inserting data into index" stage takes
40% of the time.

So I was wondering if we might
parallelize that too, and how much time would it save - 0010 is an
experimental patch doing that. It splits the processing into 3 phases:

1. workers feeding data into tuplesort
2. leader finishes sort and "repartitions" the data
3. workers inserting their partition into index

The patch is far from perfect (more of a PoC) ..

This does help a little bit, reducing the duration by ~15-25%. I wonder
if this might be improved by partitioning the data differently - not by
shuffling everything from the tuplesort into a fileset (it increases the
amount of temporary data in the charts). And also by distributing the
data differently - right now it's a bit of a round robin, because it
wasn't clear how many entries there are.

Due to the complexity of the existing code, I would like to focus on
the existing patch first. So I vote for doing this optimization as a
dedicated patch.

I agree. Even if we decide to do these parallel inserts, it relies on
doing the parallel sort first. So it makes sense to leave that for
later, as an additional improvement.

and later we called 'tuplesort_performsort(state->bs_sortstate);'. Even
though we have some CTID merge activity in '....(1)', the tuples are
still ordered, so the sorts (in both tuplesort_putgintuple and
tuplesort_performsort) are not necessary,
..
If Matthias's proposal is adopted, my optimization will not be useful
anymore, and Matthias's proposal looks like a more natural and
efficient way.

I think they might be complementary. I don't think it's reasonable to
expect GIN's BuildAccumulator to buffer all the index tuples at the
same time (as I mentioned upthread: we are or should be limited by
work memory), but the BuildAccumulator will do a much better job at
combining tuples than the in-memory sort + merge-write done by
Tuplesort (because BA will use (much?) less memory for the same number
of stored values).

Thank you Matthias for valuing my point! And thanks for highlighting
the benefit that BuildAccumulator can do a better job of sorting in
memory (I think that is mainly because BuildAccumulator can merge at
run time as it accepts more tuples). But I'm still not willing to go
further in this direction. The reasons are: a) it probably can't make
a big difference in the final result; b) the best implementation of
this idea would allow the user of tuplesort.c to insert pre-sorted
tuples into the tape directly, rather than inserting them into
tuplesort's memory and dumping them to tape without a sort, but I
can't define a clean API for that; c) CREATE INDEX is maintenance
work - improving it by 30% would be good, but improving it by <3%
does not look very compelling in practice.

So my opinion is that if we can get agreement on 0008, then we can do
a final review/test of the existing code (including 0009), and leave
further improvements to a dedicated patch.

What do you think?

Yeah. I think we have agreement on 0001-0007. I'm a bit torn about 0008;
I had not expected to change tuplesort like this when I started working
on the patch, but I can't deny it's a massive speedup for some cases
(where the patch doesn't help otherwise). But then in other cases it
doesn't help at all, and 0010 does. I wonder if maybe there's a good
way to "flip" between those two approaches, by some heuristics.

regards

--
Tomas Vondra

#35Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andy Fan (#33)
Re: Parallel CREATE INDEX for GIN indexes

On Tue, 27 Aug 2024 at 12:15, Andy Fan <zhihuifan1213@163.com> wrote:

Tomas Vondra <tomas.vondra@enterprisedb.com> writes:

And let's talk about the improvement by Matthias, namely:

* 0008 Use a single GIN tuplesort
* 0009 Reduce the size of GinTuple by 12 bytes

I haven't really seen any impact on duration - it seems more or less
within noise. Maybe it would be different on machines with less RAM, but
on my two systems it didn't really make a difference.

It did significantly reduce the amount of temporary data written, by
~40% or so. This is pretty nicely visible on the "trgm" case, which
generates the most temp files of the four indexes. An example from the
i5/32MB section looks like this:

label        0000  0001  0002  0003  0004  0005  0006  0007  0008  0009  0010
------------------------------------------------------------------------------
trgm / 3        0  2635  3690  3715  1177  1177  1179  1179   696   682  1016

After seeing the above data, I wanted to know where the time is spent and
why the ~40% IO reduction doesn't make a measurable duration improvement,
so I did the following test.

[...]

==== Here is my definition for stage 1/2/3.
stage 1:
reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
ginBuildCallbackParallel, state, scan);

/* write remaining accumulated entries */
ginFlushBuildState(state, index);

stage 2:
_gin_process_worker_data(state, state->bs_worker_sort)

stage 3:
tuplesort_performsort(state->bs_sortstate);

Note that tuplesort does most of its sort and IO work while receiving
tuples, which in this case would be during table_index_build_scan.
tuplesort_performsort usually only needs to flush the last elements of
a sort that it still has in memory, which is thus generally a cheap
operation bound by maintenance work memory, and definitely not
representative of the total cost of sorting data. In certain rare
cases it may take a longer time as it may have to merge sorted runs,
but those cases are quite rare in my experience.
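
To make that concrete, here is a minimal sketch of the tuplesort
lifecycle as the patch uses it (the GIN-specific functions are the ones
added in 0001; setup and error handling omitted) - the point being that
sorted runs are built and dumped while tuples are still being fed in:

Tuplesortstate *ts;
GinTuple   *tup;
Size        tuplen;

ts = tuplesort_begin_index_gin(heap, index, maintenance_work_mem,
                               coordinate, TUPLESORT_NONE);

/* feeding phase: each put may sort the in-memory batch and dump a run */
tuplesort_putgintuple(ts, tup, tuplen);

/* usually just flushes the last in-memory run; merges runs only when
 * the data did not fit in memory as a single run */
tuplesort_performsort(ts);

/* read everything back in sorted order */
while ((tup = tuplesort_getgintuple(ts, &tuplen, true)) != NULL)
    ;                           /* combine / insert here */

tuplesort_end(ts);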

But 0008 still does many good things:
1. Reduce the IO usage; this would be more useful on some heavily
IO-bound workloads.
2. Simplify the building logic by removing one stage.
3. Add the 'buffer-writetup' to tuplesort.c; I don't have another use
case for now, but it looks like a reasonable design.

I'd imagine nbtree would like to use this too, for applying some
deduplication in the sort stage. The IO benefits are quite likely to
be worth it; a minimum space saving of 25% on duplicated key values in
tuple sorts sounds really great. And it doesn't even have to merge all
duplicates: even if you only merge 10 tuples at a time, the space
saving on those duplicates would be at least 47% on 64-bit systems.

I think the current blocker is whether it is safe to hack tuplesort.c.
With my current knowledge, it looks good to me, but it would be better
to open a dedicated thread to discuss this specifically; the review
would not take a long time if someone who is experienced in this area
took a look.

I could adapt the patch for nbtree use, to see if anyone's willing to
review that?

Now, what's the 0010 patch about?

For some indexes (e.g. trgm), the parallel builds help a lot, because
they produce a lot of temporary data and the parallel sort is a
substantial part of the work. But for other indexes (especially the
"smaller" indexes on jsonb headers), it's not that great. For example
for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.

Clearly, this happens because a lot of time is spent outside the sort,
actually inserting data into the index.

You always focus on the most important part, which inspires me a lot;
even with my simple testing, the "inserting data into the index" stage
takes 40% of the time.

nbtree does sorted insertions into the tree, constructing leaf pages
one at a time and adding separator keys in the page above when the
leaf page is filled, thus removing the need to descend the btree. I
imagine we can save some performance by mirroring that in GIN too,
with as an additional bonus that we'd be free to start logging completed
pages before we're done with the full index, reducing peak WAL
throughput in GIN index creation.
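
A rough sketch of that bulk-loading shape (purely illustrative - every
helper name below is hypothetical, this is not code from the patch set):

/* consume tuples in sorted order; never descend the tree */
while ((tup = next_sorted_tuple()) != NULL)
{
    if (!fits_on_page(curleaf, tup))
    {
        /* leaf is full: push its separator key one level up and start
         * a new leaf; the completed page could be WAL-logged right away */
        add_separator(parent_level, high_key(curleaf));
        finish_page(curleaf);
        curleaf = start_new_leaf();
    }
    append_to_page(curleaf, tup);
}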

I think they might be complementary. I don't think it's reasonable to
expect GIN's BuildAccumulator to buffer all the index tuples at the
same time (as I mentioned upthread: we are or should be limited by
work memory), but the BuildAccumulator will do a much better job at
combining tuples than the in-memory sort + merge-write done by
Tuplesort (because BA will use (much?) less memory for the same number
of stored values).

Thank you Matthias for valuing my point! and thanks for highlighting the
benefit that BuildAccumulator can do a better job for sorting in memory
(I think it is mainly because BuildAccumulator can do a run-time merge
when it accepts more tuples). But I am still not willing to go further in
this direction. Reasons are: a). It probably can't make a big difference
in the final result. b). The best implementation of this idea would be
allowing the user of tuplesort.c to insert the pre-sorted tuples into the
tape directly, rather than inserting them into tuplesort's memory and
dumping them into the tape without a sort.

You'd still need to keep track of sorted runs on those tapes, which is what
tuplesort.c does for us.

However, I can't define a clean API for
the former case.

I imagine a pair of tuplesort_beginsortedrun() /
tuplesort_endsortedrun() functions to help with this, but I'm not 100%
sure if we'd want to expose Tuplesort to non-PG sorting algorithms, as
it would be one easy way to create incorrect results if the sort used
in tuplesort isn't exactly equivalent to the sort used by the provider
of the tuples.
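
In rough C terms, the surface I have in mind would be something like
this (hypothetical - nothing like it exists in tuplesort.h today):

/*
 * Between these two calls the caller promises to feed tuples that are
 * already in the sort's order, so they can be written to tape directly,
 * skipping tuplesort's in-memory sort.
 */
extern void tuplesort_beginsortedrun(Tuplesortstate *state);
extern void tuplesort_endsortedrun(Tuplesortstate *state);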

c). CREATE INDEX is maintenance work; improving it by
30% would be good, but if we just improve it by <3%, it looks not very
charming in practice.

I think improving 3% on reindex operations can be well worth the effort.

Also, do note that the current patch does (still) not correctly handle
[maintenance_]work_mem: Every backend's BuildAccumulator uses up to
work_mem of memory here, while the launched tuplesorts use an
additional maintenance_work_mem of memory, for a total of (workers +
1) * work_mem + m_w_m of memory usage. The available memory should
instead be allocated between tuplesort and BuildAccumulator, but can
probably mostly be allocated to just BuildAccumulator if we can dump
the data into the tuplesort directly, as it'd reduce the overall
number of operations and memory allocations for the tuplesort. I think
that once we correctly account for memory allocations (and an improved
write path) we'll be able to see a meaningfully larger performance
improvement.
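
For illustration (my numbers, not measured anywhere in this thread):
with 4 workers, work_mem = 64MB and maintenance_work_mem = 512MB, that
adds up to (4 + 1) * 64MB + 512MB = 832MB, substantially more than the
maintenance_work_mem budget a serial build would respect.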

So my opinion is that if we can have agreement on 0008, then we can do a
final review/test of the existing code (including 0009), and leave further
improvements as a dedicated patch.

As mentioned above, I think I could update the patch for a btree
implementation that also has immediate benefits, if so desired?

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#36Andy Fan
zhihuifan1213@163.com
In reply to: Matthias van de Meent (#35)
Re: Parallel CREATE INDEX for GIN indexes

Matthias van de Meent <boekewurm+postgres@gmail.com> writes:

tuplesort_performsort usually only needs to flush the last elements of
... In certain rare
cases it may take a longer time as it may have to merge sorted runs,
but those cases are quite rare in my experience.

OK, I expect such cases are not rare. Suppose we have hundreds of GB of
heap tuples and a 64MB or 1GB maintenance_work_mem setup; that would
probably hit this situation. I don't mean to argue over whose experience
is the fact, I just want to highlight the gap in our minds. And thanks
for sharing this, I can pay more attention to this direction in my
future work. To be clear, my setup hit the 'mergeruns' case.

But 0008 still does many good things:
1. Reduce the IO usage; this would be more useful on some heavily
IO-bound workloads.
2. Simplify the building logic by removing one stage.
3. Add the 'buffer-writetup' to tuplesort.c; I don't have another use
case for now, but it looks like a reasonable design.

I'd imagine nbtree would like to use this too, for applying some
deduplication in the sort stage.

The current nbtree does the deduplication during insertion into the
nbtree IIUC; in your new strategy, we can move it to the "sort" stage,
which looks good to me [to confirm my understanding].

The IO benefits are quite likely to
be worth it; a minimum space saving of 25% on duplicated key values in
tuple sorts sounds real great.

Just to be clear on my understanding: the IO benefits can only be
gained when the tuplesort's memory can't hold all the tuples, in which
case tuplesort_performsort would run 'mergeruns' - otherwise we can't
get any benefit?

I think the current blocker is whether it is safe to hack tuplesort.c.
With my current knowledge, it looks good to me, but it would be better
to open a dedicated thread to discuss this specifically; the review
would not take a long time if someone who is experienced in this area
took a look.

I could adapt the patch for nbtree use, to see if anyone's willing to
review that?

I'm interested in it and can do some review & testing. But the key
point would be that we need some authorities who are willing to review
it; to make that happen to a bigger extent, a dedicated thread would be
helpful.

Now, what's the 0010 patch about?

For some indexes (e.g. trgm), the parallel builds help a lot, because
they produce a lot of temporary data and the parallel sort is a
substantial part of the work. But for other indexes (especially the
"smaller" indexes on jsonb headers), it's not that great. For example
for "jsonb", having 3 workers shaves off only ~25% of the time, not 75%.

Clearly, this happens because a lot of time is spent outside the sort,
actually inserting data into the index.

You always focus on the most important part, which inspires me a lot;
even with my simple testing, the "inserting data into the index" stage
takes 40% of the time.

nbtree does sorted insertions into the tree, constructing leaf pages
one at a time and adding separator keys in the page above when the
leaf page is filled, thus removing the need to descend the btree. I
imagine we can save some performance by mirroring that in GIN too,
with as an additional bonus that we'd be free to start logging completed
pages before we're done with the full index, reducing peak WAL
throughput in GIN index creation.

I agree this is a promising direction as well.

I think they might be complementary. I don't think it's reasonable to
expect GIN's BuildAccumulator to buffer all the index tuples at the
same time (as I mentioned upthread: we are or should be limited by
work memory), but the BuildAccumulator will do a much better job at
combining tuples than the in-memory sort + merge-write done by
Tuplesort (because BA will use (much?) less memory for the same number
of stored values).

Thank you Matthias for valuing my point! and thanks for highlighting the
benefit that BuildAccumulator can do a better job for sorting in memory
(I think it is mainly because BuildAccumulator can do a run-time merge
when it accepts more tuples). But I am still not willing to go further in
this direction. Reasons are: a). It probably can't make a big difference
in the final result. b). The best implementation of this idea would be
allowing the user of tuplesort.c to insert the pre-sorted tuples into the
tape directly, rather than inserting them into tuplesort's memory and
dumping them into the tape without a sort.

You'd still need to keep track of sorted runs on those tapes, which is what
tuplesort.c does for us.

However, I can't define a clean API for
the former case.

I imagine a pair of tuplesort_beginsortedrun() /
tuplesort_endsortedrun() functions to help with this.

These APIs are better than the ones in my mind :) During the range
between tuplesort_beginsortedrun() and tuplesort_endsortedrun(), we can
bypass the tuplesort's memory.

but I'm not 100%
sure if we'd want to expose Tuplesort to non-PG sorting algorithms, as
it would be one easy way to create incorrect results if the sort used
in tuplesort isn't exactly equivalent to the sort used by the provider
of the tuples.

OK.

c). CREATE INDEX is maintenance work; improving it by
30% would be good, but if we just improve it by <3%, it looks not very
charming in practice.

I think improving 3% on reindex operations can be well worth the effort.

Also, do note that the current patch does (still) not correctly handle
[maintenance_]work_mem: Every backend's BuildAccumulator uses up to
work_mem of memory here, while the launched tuplesorts use an
additional maintenance_work_mem of memory, for a total of (workers +
1) * work_mem + m_w_m of memory usage. The available memory should
instead be allocated between tuplesort and BuildAccumulator, but can
probably mostly be allocated to just BuildAccumulator if we can dump
the data into the tuplesort directly, as it'd reduce the overall
number of operations and memory allocations for the tuplesort. I think
that once we correctly account for memory allocations (and an improved
write path) we'll be able to see a meaningfully larger performance
improvement.

Personally I am more a fan of your "buffer writetup" idea, and not so
interested in the tuplesort_beginsortedrun / tuplesort_endsortedrun
one. I said the '3%' for the latter, and I guess you understood it as
the former.

So my opinion is that if we can have agreement on 0008, then we can do a
final review/test of the existing code (including 0009), and leave further
improvements as a dedicated patch.

As mentioned above, I think I could update the patch for a btree
implementation that also has immediate benefits, if so desired?

If you are talking about the buffered-writetup in tuplesort, then I
think it is great, and a dedicated thread would give it better exposure.

--
Best Regards
Andy Fan

#37Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Andy Fan (#36)
Re: Parallel CREATE INDEX for GIN indexes

On Wed, 28 Aug 2024 at 02:38, Andy Fan <zhihuifan1213@163.com> wrote:

Matthias van de Meent <boekewurm+postgres@gmail.com> writes:

tuplesort_performsort usually only needs to flush the last elements of
... In certain rare
cases it may take a longer time as it may have to merge sorted runs,
but those cases are quite rare in my experience.

OK, I expect such cases are not rare. Suppose we have hundreds of GB of
heap tuples and a 64MB or 1GB maintenance_work_mem setup; that would
probably hit this situation. I don't mean to argue over whose experience
is the fact, I just want to highlight the gap in our minds. And thanks
for sharing this, I can pay more attention to this direction in my
future work. To be clear, my setup hit the 'mergeruns' case.

Huh, I've never noticed the performsort phase of btree index creation
(as seen in pg_stat_progress_create_index) take much time if any,
especially when compared to the tuple loading phase, so I assumed it
didn't happen often. Hmm, maybe I've just been quite lucky.

But 0008 still does many good things:
1. Reduce the IO usage; this would be more useful on some heavily
IO-bound workloads.
2. Simplify the building logic by removing one stage.
3. Add the 'buffer-writetup' to tuplesort.c; I don't have another use
case for now, but it looks like a reasonable design.

I'd imagine nbtree would like to use this too, for applying some
deduplication in the sort stage.

The current nbtree does the deduplication during insertion into the
nbtree IIUC; in your new strategy, we can move it to the "sort" stage,
which looks good to me [to confirm my understanding].

Correct: We can do at least some deduplication in the sort stage. Not
all, because tuples need to fit on pages and we don't want to make the
tuples so large that we'd cause unnecessary splits while loading the
tree, but merging runs of 10-30 tuples should reduce IO requirements
by some margin for indexes where deduplication is important.
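
As a sketch, the merge at the point where tuplesort dumps tuples to
tape could look roughly like this (illustrative only - "pending" and
the helper names are made up, this is not the patch's code):

/* fold duplicate keys into one tuple instead of writing each separately */
if (pending != NULL && keys_equal(pending, tup) &&
    pending_size(pending) + tids_size(tup) <= merge_limit)
    append_tids(pending, tup);      /* deduplicate in the sort stage */
else
{
    if (pending != NULL)
        write_tuple(tape, pending); /* flush the previous key */
    pending = tup;
}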

The IO benefits are quite likely to
be worth it; a minimum space saving of 25% on duplicated key values in
tuple sorts sounds really great.

Just to be clear on my understanding: the IO benefits can only be
gained when the tuplesort's memory can't hold all the tuples, in which
case tuplesort_performsort would run 'mergeruns' - otherwise we can't
get any benefit?

It'd be when the tuplesort's memory can't hold all tuples, but
mergeruns isn't strictly required here, as dumptuples() would already
allow some tuple merging.

I think the current blocker is whether it is safe to hack tuplesort.c.
With my current knowledge, it looks good to me, but it would be better
to open a dedicated thread to discuss this specifically; the review
would not take a long time if someone who is experienced in this area
took a look.

I could adapt the patch for nbtree use, to see if anyone's willing to
review that?

I'm interested in it and can do some review & testing. But the key
point would be that we need some authorities who are willing to review
it; to make that happen to a bigger extent, a dedicated thread would be
helpful.

Then I'll split it off into a new thread sometime later this week.

nbtree does sorted insertions into the tree, constructing leaf pages
one at a time and adding separator keys in the page above when the
leaf page is filled, thus removing the need to descend the btree. I
imagine we can save some performance by mirroring that in GIN too,
with as an additional bonus that we'd be free to start logging completed
pages before we're done with the full index, reducing peak WAL
throughput in GIN index creation.

I agree this is a promising direction as well.

It'd be valuable to see if the current patch's "parallel sorted"
insertion is faster than the current GIN insertion path even if we use
only the primary process, as it could be competitive. Btree-like bulk
tree loading might even be meaningfully faster than GIN's current index
creation process. However, as I mentioned significantly upthread, I
don't expect that change to happen in this patch series.

I imagine a pair of tuplesort_beginsortedrun() /
tuplesort_endsortedrun() functions to help with this.

These APIs are better than the ones in my mind :) During the range
between tuplesort_beginsortedrun() and tuplesort_endsortedrun(), we can
bypass the tuplesort's memory.

Exactly: we'd have the user call tuplesort_beginsortedrun(), then
iteratively insert its sorted tuples using the usual
tuplesort_putYYY() API, and then call _endsortedrun() when the sorted
run is complete. It does need some work in tuplesort state handling
and internals, but I think that's quite achievable.
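
Roughly like this (again hypothetical - the begin/end functions don't
exist yet; tuplesort_putgintuple() is the GIN put function from this
patch series):

tuplesort_beginsortedrun(ts);

/* the caller guarantees these arrive in the sort's order */
for (i = 0; i < ntups; i++)
    tuplesort_putgintuple(ts, tups[i], tuplens[i]);

tuplesort_endsortedrun(ts);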

Kind regards,

Matthias van de Meent

#38Andy Fan
zhihuifan1213@163.com
In reply to: Tomas Vondra (#34)
Re: Parallel CREATE INDEX for GIN indexes

Tomas Vondra <tomas@vondra.me> writes:

Hi Tomas,

Yeah. I think we have agreement on 0001-0007.

Yes, the design of 0001-0007 looks good to me, and because of the
existing complexity, I want to focus on this part for now. I have been
doing code review since yesterday, and my work is now done. Just some
small questions:

1. In GinBufferStoreTuple,

/*
* Check if the last TID in the current list is frozen. This is the case
* when merging non-overlapping lists, e.g. in each parallel worker.
*/
if ((buffer->nitems > 0) &&
(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
buffer->nfrozen = buffer->nitems;

should we do (ItemPointerCompare(&buffer->items[buffer->nitems - 1],
&tup->first) "<=" 0), rather than "=="?

2. Given that the "non-overlap" case should be the major case in
GinBufferStoreTuple, does it deserve a fastpath before calling
ginMergeItemPointers, since ginMergeItemPointers unconditionally
allocates memory which we later pfree? (A sketch of what I mean follows
after these questions.)

new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfronzen */
(buffer->nitems - buffer->nfrozen), /* num of unfrozen */
items, tup->nitems, &nnew);

3. The following comment in index_build is out-of-date now :)

/*
* Determine worker process details for parallel CREATE INDEX. Currently,
* only btree has support for parallel builds.
*

4. Comments - "Buffer is not empty and it's storing a different key"
looks wrong to me; the key may be the same and we may just need to
flush because of memory usage. There is the same issue in both
_gin_process_worker_data and _gin_parallel_merge.

if (GinBufferShouldTrim(buffer, tup))
{
Assert(buffer->nfrozen > 0);

state->buildStats.nTrims++;

/*
* Buffer is not empty and it's storing a different key - flush
* the data into the insert, and start a new entry for current
* GinTuple.
*/
AssertCheckItemPointers(buffer, true);
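
The fastpath I have in mind for question (2) would be something like
this (a sketch only, assuming "non-overlapping" means all incoming TIDs
sort after the last buffered one):

if (buffer->nitems == 0 ||
    ItemPointerCompare(&buffer->items[buffer->nitems - 1], &items[0]) < 0)
{
    /* fastpath: append directly, no palloc/pfree cycle */
    memcpy(&buffer->items[buffer->nitems], items,
           sizeof(ItemPointerData) * tup->nitems);
    buffer->nitems += tup->nitems;
}
else
{
    /* overlapping lists: keep the existing ginMergeItemPointers() path */
}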

I also ran valgrind testing with some test cases; no memory issue was
found.

I'm a bit torn about 0008; I had not expected changing tuplesort like
this when I started working on the patch, but I can't deny it's a
massive speedup for some cases (where the patch doesn't help otherwise).
But then in other cases it doesn't help at all, and 0010 helps.

Yes, I'd like to see both of these improvements (0008 and 0010), each
as a dedicated patch.

--
Best Regards
Andy Fan

#39Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#31)
Re: Parallel CREATE INDEX for GIN indexes

On Fri, Jul 12, 2024 at 05:34:25PM +0200, Tomas Vondra wrote:

I got to do the detailed benchmarking on the latest version of the patch
series, so here are the results. My goal was to better understand the
impact of each patch individually - especially the two parts introduced
by Matthias, but not only those - so I ran the test on a build with each
of the 0001-0009 patches.

_gin_parallel_build_main() is introduced in 0001. Please make sure to
pass down a query ID.
--
Michael

#40Tomas Vondra
tomas@vondra.me
In reply to: Michael Paquier (#39)
10 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

On 10/8/24 04:03, Michael Paquier wrote:

_gin_parallel_build_main() is introduced in 0001. Please make sure to
pass down a query ID.

Thanks for the ping. Here's an updated patch doing that, and also fixing
a couple whitespace issues. No other changes, but I plan to get back to
this patch soon - before the next CF.

regards

--
Tomas Vondra

Attachments:

v20241008-0001-Allow-parallel-create-for-GIN-indexes.patch (text/x-patch)
From e53fc8b9e0600d19419da531cd87a4ee9f030f2e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 19 Jun 2024 12:42:24 +0200
Subject: [PATCH v20241008 01/10] Allow parallel create for GIN indexes

Add support for parallel create of GIN indexes, using an approach and
code very similar to the one used by BRIN indexes.

Each worker reads a subset of the table (from a parallel scan), and
accumulates index entries in memory. But instead of writing the results
into the index (after hitting the memory limit), the data are written
to a shared tuplesort (and sorted by index key).

The leader then reads data from the tuplesort, and combines them into
entries that get inserted into the index.
---
 src/backend/access/gin/gininsert.c         | 1454 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  203 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   31 +
 src/include/utils/tuplesort.h              |    8 +
 src/tools/pgindent/typedefs.list           |    4 +
 8 files changed, 1694 insertions(+), 16 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 71f38be90c3..f3b51878d52 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,126 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/sortsupport.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects the mutable fields that follow.
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinShared;
+
+/*
+ * Return pointer to a GinShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +144,48 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process.
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +464,109 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginBuildCallbackParallel
+ *		Callback for the parallel index build.
+ *
+ * This is very similar to the serial build callback ginBuildCallback,
+ * except that instead of writing the accumulated entries into the index,
+ * we write them into a tuplesort that is then processed by the leader.
+ *
+ * XXX Instead of writing the entries directly into the shared tuplesort,
+ * we might write them into a local one, do a sort in the worker, combine
+ * the results, and only then write the results into the shared tuplesort.
+ * For large tables with many different keys that's going to work better
+ * than the current approach where we don't get many matches in work_mem
+ * (maybe this should use 32MB, which is what we use when planning, but
+ * even that may not be sufficient). Which means we are likely to have
+ * many entries with a small number of TIDs, forcing the leader to merge
+ * the data, often amounting to ~50% of the serial part. By doing the
+ * first sort in the workers, the leader then could do fewer merges with
+ * longer TID lists, which is much cheaper. Also, the amount of data sent
+ * from workers to the leader would be lower.
+ *
+ * The disadvantage is increased disk space usage, possibly up to 2x, if
+ * no entries get combined at the worker level.
+ *
+ * It would be possible to partition the data into multiple tuplesorts
+ * per worker (by hashing) - we don't need the data produced by workers
+ * to be perfectly sorted, and we could even live with multiple entries
+ * for the same key (in case it has multiple binary representations with
+ * distinct hash values).
+ */
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort
+	 *
+	 * XXX It might seem this should set the memory limit to 32MB, same as
+	 * what plan_create_index_workers() uses to calculate the number of
+	 * parallel workers, but that's the limit for tuplesort. So it seems
+	 * better to keep using work_mem here.
+	 *
+	 * XXX But maybe we should calculate this as a per-worker fraction of
+	 * maintenance_work_mem. It's weird to use work_mem here, in a clearly
+	 * maintenance command.
+	 */
+	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&buildstate->accum);
+		while ((list = ginGetBAEntry(&buildstate->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the index key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			/* GIN tuple and tuple length that we'll use for tuplesort */
+			GinTuple   *tup;
+			Size		tuplen;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &tuplen);
+
+			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(buildstate->tmpCtx);
+		ginInitBA(&buildstate->accum);
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +584,15 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * Initialize all the fields, not to trip valgrind.
+	 *
+	 * XXX Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,25 +633,93 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. For GIN that's
+	 * reasonable too, because we sort the data just like btree. It does
+	 * ignore the memory used to accumulate data in memory (set by work_mem),
+	 * but there is no way to communicate that to plan_create_index_workers.
+	 */
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
+
+
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, (void *) &buildstate,
-									   NULL);
+	if (state->bs_leader)
+	{
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(heap, index,
+									  maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
+
+		/* scan the relation in parallel and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, (void *) &buildstate,
+										   NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -534,3 +859,1102 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	/*
+	 * Insert the shared state into the TOC, so that worker processes can
+	 * look it up by key.
+	 */
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinShared  *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+/*
+ * Buffer used to accumulate TIDs from multiple GinTuples for the same key
+ * (we read these from the tuplesort, sorted by the key).
+ *
+ * This is similar to BuildAccumulator in that it's used to collect TIDs
+ * in memory before inserting them into the index, but it's much simpler
+ * as it only deals with a single index key at a time.
+ *
+ * XXX The TID values in the "items" array are not guaranteed to be sorted,
+ * we have to sort them explicitly. This is due to parallel scans being
+ * synchronized (and thus may wrap around), and when combining values from
+ * multiple workers.
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	int			maxitems;
+	SortSupport ssup;			/* for sorting/comparing keys */
+	ItemPointerData *items;
+} GinBuffer;
+
+/*
+ * Check that TID array contains valid values, and that it's sorted (if we
+ * expect it to be).
+ */
+static void
+AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+{
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 0; i < nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&items[i]));
+
+		if ((i == 0) || !sorted)
+			continue;
+
+		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+	}
+#endif
+}
+
+/* basic GinBuffer checks */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	Assert(buffer->nitems <= buffer->maxitems);
+
+	/* if we have any items, the array must exist */
+	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+
+	/*
+	 * we don't know if the TID array is expected to be sorted or not
+	 *
+	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
+	 */
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+#endif
+}
+
+/*
+ * Initialize the buffer used to accumulate TIDs for a single key at a time
+ * (we process the data sorted), so we know when we received all data for
+ * a given key.
+ *
+ * Initializes sort support procedures for all index attributes.
+ */
+static GinBuffer *
+GinBufferInit(Relation index)
+{
+	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
+	int			i,
+				nKeys;
+	TupleDesc	desc = RelationGetDescr(index);
+
+	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
+
+	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
+
+	/*
+	 * Lookup ordering operator for the index key data type, and initialize
+	 * the sort support function.
+	 */
+	for (i = 0; i < nKeys; i++)
+	{
+		SortSupport sortKey = &buffer->ssup[i];
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = index->rd_indcollation[i];
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	return buffer;
+}
+
+/* Is the buffer empty, i.e. has no TID values in the array? */
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * Compare if the tuple matches the already accumulated data in the GIN
+ * buffer. Compare scalar fields first, before the actual key.
+ *
+ * Returns true if the key matches, and the TIDs belong to the buffer.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	int			r;
+	Datum		tupkey;
+
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	/*
+	 * For NULL/empty keys, this means equality, for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * For the tuple, get either the first sizeof(Datum) bytes for byval
+	 * types, or a pointer to the beginning of the data array.
+	 */
+	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+
+	r = ApplySortComparator(buffer->key, false,
+							tupkey, false,
+							&buffer->ssup[buffer->attnum - 1]);
+
+	return (r == 0);
+}
+
+/*
+ * GinBufferStoreTuple
+ *		Add data (especially TID list) from a GIN tuple to the buffer.
+ *
+ * The buffer is expected to be empty (in which case it's initialized), or
+ * having the same key. The TID values from the tuple are simply appended
+ * to the array, without sorting.
+ *
+ * XXX We expect the tuples to contain sorted TID lists, so maybe we should
+ * check that's true with an assert. And we could also check if the values
+ * are already in sorted order, in which case we can skip the sort later.
+ * But it seems like a waste of time, because it will be unnecessary after
+ * switching to mergesort in a later patch, and also because it's reasonable
+ * to expect the arrays to overlap.
+ */
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* enlarge the TID buffer, if needed */
+	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	{
+		/* 64 seems like a good init value */
+		buffer->maxitems = Max(buffer->maxitems, 64);
+
+		while (buffer->nitems + tup->nitems > buffer->maxitems)
+			buffer->maxitems *= 2;
+
+		if (buffer->items == NULL)
+			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 buffer->maxitems * sizeof(ItemPointerData));
+	}
+
+	/* now we should be guaranteed to have enough space for all the TIDs */
+	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+
+	/* copy the new TIDs into the buffer */
+	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
+	buffer->nitems += tup->nitems;
+
+	/* we simply append the TID values, so don't check sorting */
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+}
+
+/* TID comparator for qsort */
+static int
+tid_cmp(const void *a, const void *b)
+{
+	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+}
+
+/*
+ * GinBufferSortItems
+ *		Sort the TID values stored in the TID buffer.
+ */
+static void
+GinBufferSortItems(GinBuffer *buffer)
+{
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+}
+
+/*
+ * GinBufferReset
+ *		Reset the buffer into a state as if it contains no data.
+ *
+ * XXX Should we do something if the array of TIDs gets too large? It may
+ * grow too much, and we'll not free it until the worker finishes building.
+ * But it's better to not let the array grow arbitrarily large, and enforce
+ * work_mem as memory limit by flushing the buffer into the tuplestore.
+ *
+ * XXX Might be better to have a separate memory context for the buffer.
+ */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/*
+	 * Not required, but makes it more likely to trigger a NULL dereference if
+	 * using the value incorrectly, etc.
+	 */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+}
+
+/*
+ * GinBufferCanAddKey
+ *		Check if a given GIN tuple can be added to the current buffer.
+ *
+ * Returns true if the buffer is either empty or for the same index key.
+ *
+ * XXX This could / should also enforce a memory limit by checking the size of
+ * the TID array, and returning false if it's too large (more than work_mem,
+ * for example).
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, merge the per-worker results into
+ * the complete index. The results from each worker are sorted by the index
+ * key. While combining the per-worker results we accumulate the TID lists
+ * for the same key, and then insert the combined entries into the index
+ * at once.
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe should have local memory contexts similar to what
+ * _brin_parallel_merge does?
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That should roughly match how the data is organized in the
+	 * index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in an order that better matches the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinBufferSortItems(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
+													sortmem, coordinate,
+													TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write remaining accumulated entries */
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&state->accum);
+		while ((list = ginGetBAEntry(&state->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			GinTuple   *tup;
+			Size		len;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &len);
+
+			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(state->tmpCtx);
+		ginInitBA(&state->accum);
+	}
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have an actual non-empty key. We include varlena headers and \0 bytes
+	 * for strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to do the MAXALIGN to allow
+	 * direct access to item pointers.
+	 *
+	 * XXX Note that for byval types we store the whole datum, no matter what
+	 * the typlen value is.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "invalid typlen");
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * Allocate space for the whole GIN tuple.
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplestore representation.
+ *
+ * Most of the fields are actually directly accessible, the only thing that
+ * needs more care is the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, the key value is
+ * compared last. The comparisons are done using type-specific sort support
+ * functions.
+ *
+ * XXX We might try using memcmp(), based on the assumption that if we get
+ * two keys that are two different representations of a logically equal
+ * value, it'll get merged by the index build. But it's not clear that's
+ * safe, because for keys with multiple binary representations we might
+ * end up with overlapping lists. Which might affect performance by requiring
+ * a full merge of the TID lists, and perhaps even failures (e.g. errors like
+ * "could not split GIN page; all old items didn't fit" when inserting data
+ * into the index).
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
+{
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if ((a->category == GIN_CAT_NORM_KEY) &&
+		(b->category == GIN_CAT_NORM_KEY))
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		return ApplySortComparator(keya, false,
+								   keyb, false,
+								   &ssup[a->attrnum - 1]);
+	}
+
+	return 0;
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 830d67fbc20..bb2e3895b90 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index d4e84aabac7..7db50b3b4f6 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -146,6 +147,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 558309c9850..59df02c9481 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,10 +20,12 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "catalog/index.h"
+#include "catalog/pg_collation.h"
 #include "executor/executor.h"
 #include "pg_trace.h"
 #include "utils/datum.h"
@@ -46,6 +48,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +78,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +88,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -568,6 +578,82 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+/*
+ * XXX Maybe we should pass the ordering functions, not the heap/index?
+ */
+Tuplesortstate *
+tuplesort_begin_index_gin(Relation heapRel,
+						  Relation indexRel,
+						  int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext;
+	int			i;
+	TupleDesc	desc = RelationGetDescr(indexRel);
+
+	oldcontext = MemoryContextSwitchTo(base->maincontext);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	/*
+	 * Multi-column GIN indexes expand the row into a separate index entry
+	 * for each attribute, and that's what we write into the tuplesort. But
+	 * we still need to initialize sortsupport for all the attributes.
+	 */
+	base->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
+
+	/* Prepare SortSupport data for each column */
+	base->sortKeys = (SortSupport) palloc0(base->nKeys *
+										   sizeof(SortSupportData));
+
+	for (i = 0; i < base->nKeys; i++)
+	{
+		SortSupport sortKey = base->sortKeys + i;
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = indexRel->rd_indcollation[i];
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		/*
+		 * Look up an ordering operator for the index key data type, and then
+		 * the sort support function.
+		 *
+		 * XXX does this use the right opckeytype/opcintype for GIN?
+		 */
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -803,6 +889,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tuple, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -975,6 +1092,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1763,6 +1903,69 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple,
+							   base->sortKeys);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 25983b7a505..be76d8446f4 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 00000000000..6f529a5aaf0
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,31 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations for GIN tuples used by tuplesort during parallel builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "storage/itemptr.h"
+#include "utils/sortsupport.h"
+
+/* XXX do we still need all the fields now that we use SortSupport? */
+typedef struct GinTuple
+{
+	Size		tuplen;			/* length of the whole tuple */
+	Size		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	OffsetNumber attrnum;		/* attnum of index key */
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index cde83f62015..0ed71ae922a 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,10 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(Relation heapRel,
+												 Relation indexRel,
+												 int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +461,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +471,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a65e1c07c5d..ba9dc4df3a9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1020,11 +1020,13 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1037,9 +1039,11 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinShared
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.46.2

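To make the serialization logic in _gin_build_tuple / _gin_parse_tuple above
easier to follow, here is a minimal standalone sketch of the offset arithmetic
(an illustration only, not part of the patch; the helper name is made up):

    /*
     * Sketch of the serialized GinTuple layout from patch 0001:
     *
     *   | fixed fields | key bytes | MAXALIGN padding | TID array |
     *
     * Both serialization and parsing locate the TID array at the first
     * MAXALIGN boundary after the key, using the same expression.
     */
    #include "postgres.h"
    #include "access/gin_tuple.h"

    static ItemPointer
    sketch_gin_tuple_items(GinTuple *tup)
    {
        char       *ptr = (char *) tup +
            MAXALIGN(offsetof(GinTuple, data) + tup->keylen);

        return (ItemPointer) ptr;
    }
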
Attachment: v20241008-0002-Use-mergesort-in-the-leader-process.patch (text/x-patch)
From 5525308ff9eccc44596d3efebc6108ee8fa9e3ee Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:02:29 +0200
Subject: [PATCH v20241008 02/10] Use mergesort in the leader process

The leader process (executing the serial part of the index build) spent
a significant part of the time in pg_qsort, after combining the partial
results from the workers. But we can improve this and move some of the
costs to the parallel part in workers - if workers produce sorted TID
lists, the leader can combine them by mergesort.

But to make this really efficient, the mergesort must not be executed
too many times. The workers may easily produce very short TID lists, if
there are many different keys, hitting the memory limit often. So this
adds an intermediate tuplesort pass into each worker, to combine TIDs
for each key and only then write the result into the shared tuplesort.

This means the number of mergesort invocations for each key should be
about the same as the number of workers. We can't really do better, and
it's low enough to keep the mergesort approach efficient.

Note: If we introduce a memory limit on GinBuffer (to not accumulate too
many TIDs in memory), we could end up with more chunks, but it should
not be very common.
---
 src/backend/access/gin/gininsert.c | 200 +++++++++++++++++++++++------
 1 file changed, 162 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index f3b51878d52..feaa36fd5aa 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -162,6 +162,14 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate used only within a single worker for the first merge pass
+	 * happenning there. In principle it doesn't need to be part of the build
+	 * state and we could pass it around directly, but it's more convenient
+	 * this way. And it's part of the build state, after all.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -472,23 +480,23 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * except that instead of writing the accumulated entries into the index,
  * we write them into a tuplesort that is then processed by the leader.
  *
- * XXX Instead of writing the entries directly into the shared tuplesort,
- * we might write them into a local one, do a sort in the worker, combine
+ * Instead of writing the entries directly into the shared tuplesort, write
+ * them into a local one (in each worker), do a sort in the worker, combine
  * the results, and only then write the results into the shared tuplesort.
  * For large tables with many different keys that's going to work better
  * than the current approach where we don't get many matches in work_mem
  * (maybe this should use 32MB, which is what we use when planning, but
- * even that may not be sufficient). Which means we are likely to have
- * many entries with a small number of TIDs, forcing the leader to merge
- * the data, often amounting to ~50% of the serial part. By doing the
- * first sort workers, the leader then could do fewer merges with longer
- * TID lists, which is much cheaper. Also, the amount of data sent from
- * workers to the leader woiuld be lower.
+ * even that may not be sufficient). Which means we would end up with many
+ * entries with a small number of TIDs, forcing the leader to merge the data,
+ * often amounting to ~50% of the serial part. By doing the first sort in
+ * workers, this work is parallelized and the leader does fewer merges with
+ * longer TID lists, which is much cheaper and more efficient. Also, the
+ * amount of data sent from workers to the leader gets lower.
  *
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
- * It would be possible to partition the data into multiple tuplesorts
+ * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
  * for the same key (in case it has multiple binary representations with
@@ -548,7 +556,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
-			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
 
 			pfree(tup);
 		}
@@ -1146,7 +1154,6 @@ typedef struct GinBuffer
 
 	/* array of TID values */
 	int			nitems;
-	int			maxitems;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1176,8 +1183,6 @@ static void
 AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
-	Assert(buffer->nitems <= buffer->maxitems);
-
 	/* if we have any items, the array must exist */
 	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
 
@@ -1299,11 +1304,7 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  * to the array, without sorting.
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
- * check that's true with an assert. And we could also check if the values
- * are already in sorted order, in which case we can skip the sort later.
- * But it seems like a waste of time, because it won't be unnecessary after
- * switching to mergesort in a later patch, and also because it's reasonable
- * to expect the arrays to overlap.
+ * check that's true with an assert.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1331,28 +1332,22 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* enlarge the TID buffer, if needed */
-	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
-		/* 64 seems like a good init value */
-		buffer->maxitems = Max(buffer->maxitems, 64);
+		int			nnew;
+		ItemPointer new;
 
-		while (buffer->nitems + tup->nitems > buffer->maxitems)
-			buffer->maxitems *= 2;
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
 
-		if (buffer->items == NULL)
-			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
-		else
-			buffer->items = repalloc(buffer->items,
-									 buffer->maxitems * sizeof(ItemPointerData));
-	}
+		Assert(nnew == buffer->nitems + tup->nitems);
 
-	/* now we should be guaranteed to have enough space for all the TIDs */
-	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+		if (buffer->items)
+			pfree(buffer->items);
 
-	/* copy the new TIDs into the buffer */
-	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
-	buffer->nitems += tup->nitems;
+		buffer->items = new;
+		buffer->nitems = nnew;
+	}
 
 	/* we simply append the TID values, so don't check sorting */
 	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
@@ -1416,6 +1411,24 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferFree
+ *		Release memory associated with the GinBuffer (including TID array).
+ */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
 /*
  * GinBufferCanAddKey
  *		Check if a given GIN tuple can be added to the current buffer.
@@ -1497,7 +1510,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the index, and start a new entry for the current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1514,7 +1527,7 @@ _gin_parallel_merge(GinBuildState *state)
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1524,6 +1537,9 @@ _gin_parallel_merge(GinBuildState *state)
 		GinBufferReset(buffer);
 	}
 
+	/* release all the memory */
+	GinBufferFree(buffer);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1562,6 +1578,102 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short (due to frequent evictions in the callback).
+ * But combining many tiny lists is expensive, so we try to do as much as
+ * possible in the workers and only then pass the results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker. Which means the leader will
+ * do a very limited number of mergesorts, which is good.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the worker-local tuplesort, sorted by the
+	 * key, and merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the shared tuplesort, and start a new entry for
+			 * the current GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		GinBufferSortItems(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1594,6 +1706,11 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													sortmem, coordinate,
 													TUPLESORT_NONE);
 
+	/* Local per-worker sort of the raw data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
+													  sortmem, NULL,
+													  TUPLESORT_NONE);
+
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1630,7 +1747,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
-			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
 
 			pfree(tup);
 		}
@@ -1639,6 +1756,13 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 		ginInitBA(&state->accum);
 	}
 
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place those into
+	 * the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
-- 
2.46.2

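For reference, the merge step that GinBufferStoreTuple now delegates to
ginMergeItemPointers amounts to a standard two-way merge of sorted TID arrays.
A simplified standalone sketch follows (illustration only - the real function
additionally detects non-overlapping inputs and simply concatenates them, and
the helper name is made up):

    #include "postgres.h"
    #include "storage/itemptr.h"

    /*
     * Merge two sorted TID arrays into a newly palloc'd sorted array.
     * Assumes the inputs share no duplicate TIDs, which holds here
     * because each heap TID is seen by exactly one worker.
     */
    static ItemPointer
    sketch_merge_tids(ItemPointer a, int na, ItemPointer b, int nb,
                      int *nmerged)
    {
        ItemPointer dst = palloc((na + nb) * sizeof(ItemPointerData));
        int         i = 0, j = 0, k = 0;

        while (i < na && j < nb)
        {
            if (ItemPointerCompare(&a[i], &b[j]) < 0)
                dst[k++] = a[i++];
            else
                dst[k++] = b[j++];
        }
        while (i < na)
            dst[k++] = a[i++];
        while (j < nb)
            dst[k++] = b[j++];

        *nmerged = k;
        return dst;
    }
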
Attachment: v20241008-0003-Remove-the-explicit-pg_qsort-in-workers.patch (text/x-patch)
From e0b7783922af3ea5559f8b46797f924b0fb39f04 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:14:52 +0200
Subject: [PATCH v20241008 03/10] Remove the explicit pg_qsort in workers

We don't need to do the explicit sort before building the GIN tuple,
because the mergesort in GinBufferStoreTuple is already maintaining the
correct order (this was added in an earlier commit).

The patch also adds a field with the first TID, and modifies the
comparator to sort by it (for each key value). This helps workers to
build non-overlapping TID lists and simply append values instead of
having to do the actual mergesort to combine them. This is best-effort,
i.e. it's not guaranteed to eliminate the mergesort - in particular,
parallel scans are synchronized, and thus may start somewhere in the
middle of the table, and wrap around. In that case there may be a very
wide list (with low/high TID values).

Note: There's an XXX comment with a couple ideas on how to improve this,
at the cost of more complexity.
---
 src/backend/access/gin/gininsert.c | 107 +++++++++++++++++------------
 src/include/access/gin_tuple.h     |  11 ++-
 2 files changed, 74 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index feaa36fd5aa..93438ac216c 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1161,19 +1161,27 @@ typedef struct GinBuffer
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
  * expect it to be).
+ *
+ * XXX At this point there are no places where "sorted=false" should be
+ * necessary, because we always use merge-sort to combine the old and new
+ * TID list. So maybe we should get rid of the argument entirely.
  */
 static void
-AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
-	for (int i = 0; i < nitems; i++)
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
 	{
-		Assert(ItemPointerIsValid(&items[i]));
+		Assert(ItemPointerIsValid(&buffer->items[i]));
 
 		if ((i == 0) || !sorted)
 			continue;
 
-		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
 	}
 #endif
 }
@@ -1190,8 +1198,10 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 	 * we don't know if the TID array is expected to be sorted or not
 	 *
 	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
+	 * XXX actually with the mergesort in GinBufferStoreTuple, we
+	 * should not need 'false' here. See AssertCheckItemPointers.
 	 */
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+	AssertCheckItemPointers(buffer, false);
 #endif
 }
 
@@ -1300,8 +1310,26 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  *		Add data (especially TID list) from a GIN tuple to the buffer.
  *
  * The buffer is expected to be empty (in which case it's initialized), or
- * having the same key. The TID values from the tuple are simply appended
- * to the array, without sorting.
+ * having the same key. The TID values from the tuple are combined with the
+ * stored values using a merge sort.
+ *
+ * The tuples (for the same key) are expected to be sorted by first TID. But
+ * this does not guarantee the lists do not overlap, especially in the leader,
+ * because the workers process interleaving data. But even in a single worker,
+ * lists can overlap - parallel scans require sync-scans, and if a scan wraps,
+ * one of the lists may be very wide (in terms of TID range).
+ *
+ * But ginMergeItemPointers() is already smart about detecting cases when it
+ * can simply concatenate the lists, and when a full mergesort is needed, and
+ * it does the right thing.
+ *
+ * By keeping the first TID in the GinTuple and sorting by that, we make it
+ * more likely the lists won't overlap very often.
+ *
+ * XXX How frequent can the overlaps be? If the scan does not wrap around,
+ * there should be no overlapping lists, and thus no mergesort. After a
+ * wraparound, there can probably be many - one list will be very wide,
+ * with a very low and high TID, and all other lists will overlap with it.
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
  * check that's true with an assert.
@@ -1347,33 +1375,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		buffer->items = new;
 		buffer->nitems = nnew;
-	}
-
-	/* we simply append the TID values, so don't check sorting */
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
-}
-
-/* TID comparator for qsort */
-static int
-tid_cmp(const void *a, const void *b)
-{
-	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
-}
-
-/*
- * GinBufferSortItems
- *		Sort the TID values stored in the TID buffer.
- */
-static void
-GinBufferSortItems(GinBuffer *buffer)
-{
-	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
-
-	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
 
-	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
+	}
 }
 
 /*
@@ -1510,7 +1514,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the index, and start a new entry for the current
 			 * GinTuple.
 			 */
-			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+			AssertCheckItemPointers(buffer, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1520,14 +1524,17 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1630,7 +1637,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 * the data into the shared tuplesort, and start a new entry for
 			 * the current GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer, true);
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
@@ -1644,7 +1651,10 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
@@ -1654,7 +1664,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		GinTuple   *ntup;
 		Size		ntuplen;
 
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer, true);
 
 		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
@@ -1959,6 +1969,7 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
+	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -2042,6 +2053,12 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
  * compared last. The comparisons are done using type-specific sort support
  * functions.
  *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
+ *
  * XXX We might try using memcmp(), based on the assumption that if we get
  * two keys that are two different representations of a logically equal
  * value, it'll get merged by the index build. But it's not clear that's
@@ -2054,6 +2071,7 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 int
 _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 {
+	int			r;
 	Datum		keya,
 				keyb;
 
@@ -2075,10 +2093,13 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 		keya = _gin_parse_tuple(a, NULL);
 		keyb = _gin_parse_tuple(b, NULL);
 
-		return ApplySortComparator(keya, false,
-								   keyb, false,
-								   &ssup[a->attrnum - 1]);
+		r = ApplySortComparator(keya, false,
+								keyb, false,
+								&ssup[a->attrnum - 1]);
+
+		/* if the key is the same, consider the first TID in the array */
+		return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 	}
 
-	return 0;
+	return ItemPointerCompare(&a->first, &b->first);
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 6f529a5aaf0..55dd8544b21 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -13,7 +13,15 @@
 #include "storage/itemptr.h"
 #include "utils/sortsupport.h"
 
-/* XXX do we still need all the fields now that we use SortSupport? */
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ *
+ * XXX do we still need all the fields now that we use SortSupport?
+ */
 typedef struct GinTuple
 {
 	Size		tuplen;			/* length of the whole tuple */
@@ -22,6 +30,7 @@ typedef struct GinTuple
 	bool		typbyval;		/* typbyval for key */
 	OffsetNumber attrnum;		/* attnum of index key */
 	signed char category;		/* category: normal or NULL? */
+	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
-- 
2.46.2

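The point of sorting by the first TID is the fast path it enables: if the
incoming list starts after the buffered list ends, the lists can simply be
concatenated without any merge. A sketch of that check (illustration only;
ginMergeItemPointers performs a test along these lines internally, and the
helper name is made up):

    #include "postgres.h"
    #include "storage/itemptr.h"

    /*
     * Can 'incoming' simply be appended to 'buffered'? True whenever
     * the buffered list is empty, or ends before the incoming one
     * starts - the common case when tuples for the same key arrive
     * ordered by their first TID.
     */
    static bool
    sketch_can_concatenate(ItemPointer buffered, int nbuffered,
                           ItemPointer incoming, int nincoming)
    {
        if (nbuffered == 0 || nincoming == 0)
            return true;

        return ItemPointerCompare(&buffered[nbuffered - 1],
                                  &incoming[0]) < 0;
    }
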
Attachment: v20241008-0004-Compress-TID-lists-before-writing-tuples-t.patch (text/x-patch)
From de25b951648c0b3a099cc6084a92fd33ca9f3dd5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:39 +0200
Subject: [PATCH v20241008 04/10] Compress TID lists before writing tuples to
 disk

When serializing GIN tuples to tuplesorts, we can significantly reduce
the amount of data by compressing the TID lists. The GIN opclasses may
produce a lot of data (depending on how many keys are extracted from
each row), and the compression is very effective and efficient.

If the number of different keys is high, the first worker pass may not
benefit from the compression very much - the data will be spilled to
disk before the TID lists can grow long enough for the compression to
actually help. In the second pass the impact is much more significant.

For real-world data (full-text on mailing list archives), I usually see
the compression to save only about ~15% in the first pass, but ~50% on
the second pass.
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 93438ac216c..59e35fd1e0f 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -188,7 +188,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1342,7 +1344,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1378,6 +1381,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer, true);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /*
@@ -1896,6 +1902,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1908,6 +1923,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
  * like endianess etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized as compressed - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1921,6 +1941,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -1947,12 +1972,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "invalid typlen");
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * SHORTALIGN, but we do MAXALIGN anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -2002,37 +2049,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible, the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -2045,6 +2095,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array with the decompressed TIDs (item pointers).
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -2090,8 +2162,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	if ((a->category == GIN_CAT_NORM_KEY) &&
 		(b->category == GIN_CAT_NORM_KEY))
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		r = ApplySortComparator(keya, false,
 								keyb, false,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ba9dc4df3a9..f56d754d609 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1039,6 +1039,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinShared
 GinState
 GinStatsData
-- 
2.46.2

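For readers who want the shape of the serialization without digging into the
GIN internals, here is a minimal standalone sketch of the segment loop the
compression patch above adds to _gin_build_tuple. The tid_t type,
encode_segment() and the 64-byte cap are simplified stand-ins for
ItemPointerData, ginCompressPostingList() and UINT16_MAX - this models the
varbyte delta encoding, it is not the actual on-disk format:

/*
 * Standalone model of the segment loop in _gin_build_tuple.  A sorted
 * array of "TIDs" is delta-encoded into fixed-capacity segments, the
 * same way the patch calls ginCompressPostingList in a loop until all
 * items are consumed.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SEG_MAX_BYTES 64		/* stand-in for the UINT16_MAX cap */

typedef uint64_t tid_t;

/*
 * Varbyte-encode deltas of a sorted TID array into one segment; stop
 * when the segment is full.  Returns the number of bytes written and
 * sets *consumed to the number of TIDs encoded.
 */
static size_t
encode_segment(const tid_t *items, int nitems, uint8_t *out, int *consumed)
{
	size_t		len = 0;
	tid_t		prev = 0;

	*consumed = 0;
	for (int i = 0; i < nitems; i++)
	{
		tid_t		delta = items[i] - prev;
		uint8_t		buf[10];
		int			n = 0;

		do
		{
			buf[n++] = (delta & 0x7F) | (delta > 0x7F ? 0x80 : 0);
			delta >>= 7;
		} while (delta > 0);

		if (len + n > SEG_MAX_BYTES)
			break;				/* segment full, caller starts a new one */

		memcpy(out + len, buf, n);
		len += n;
		prev = items[i];
		(*consumed)++;
	}
	return len;
}

int
main(void)
{
	tid_t		items[1000];
	uint8_t		seg[SEG_MAX_BYTES];
	int			ncompressed = 0;
	size_t		total = 0;

	for (int i = 0; i < 1000; i++)
		items[i] = 10 + i;		/* dense, sorted "TIDs" */

	/* same shape as the while (ncompressed < nitems) loop in the patch */
	while (ncompressed < 1000)
	{
		int			cnt;

		total += encode_segment(&items[ncompressed], 1000 - ncompressed,
								seg, &cnt);
		ncompressed += cnt;
	}

	printf("raw %zu bytes, compressed %zu bytes\n",
		   1000 * sizeof(tid_t), total);
	return 0;
}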
Attachment: v20241008-0005-Collect-and-print-compression-stats.patch (text/x-patch)
From ad22c0eabdb446a0206025d47b320e2fbcf532c5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:43 +0200
Subject: [PATCH v20241008 05/10] Collect and print compression stats

Allows evaluating the benefits of compressing the TID lists.
---
 src/backend/access/gin/gininsert.c | 42 +++++++++++++++++++++++-------
 src/include/access/gin.h           |  2 ++
 2 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 59e35fd1e0f..7a2d377d941 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -191,7 +191,8 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 static ItemPointer _gin_parse_tuple_items(GinTuple *a);
 static Datum _gin_parse_tuple_key(GinTuple *a);
 
-static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+static GinTuple *_gin_build_tuple(GinBuildState *state,
+								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
 								  Size *len);
@@ -554,7 +555,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(buildstate, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
@@ -1199,9 +1200,9 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 	/*
 	 * we don't know if the TID array is expected to be sorted or not
 	 *
-	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
-	 * XXX actually with the mergesort in GinBufferStoreTuple, we
-	 * should not need 'false' here. See AssertCheckItemPointers.
+	 * XXX maybe we can pass that to AssertCheckGinBuffer() call? XXX actually
+	 * with the mergesort in GinBufferStoreTuple, we should not need 'false'
+	 * here. See AssertCheckItemPointers.
 	 */
 	AssertCheckItemPointers(buffer, false);
 #endif
@@ -1619,6 +1620,15 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* sort the raw per-worker data */
 	tuplesort_performsort(state->bs_worker_sort);
 
+	/* print some basic info */
+	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
+	/* reset before the second phase */
+	state->buildStats.sizeCompressed = 0;
+	state->buildStats.sizeRaw = 0;
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
 	 * merge them into larger chunks for the leader to combine.
@@ -1645,7 +1655,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 */
 			AssertCheckItemPointers(buffer, true);
 
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
 									buffer->items, buffer->nitems, &ntuplen);
 
@@ -1672,7 +1682,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 		AssertCheckItemPointers(buffer, true);
 
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
 								buffer->items, buffer->nitems, &ntuplen);
 
@@ -1687,6 +1697,11 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* relase all the memory */
 	GinBufferFree(buffer);
 
+	/* print some basic info */
+	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
 	tuplesort_end(worker_sort);
 }
 
@@ -1759,7 +1774,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(state, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
@@ -1853,6 +1868,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* initialize the GIN build state */
 	initGinState(&buildstate.ginstate, indexRel);
 	buildstate.indtuples = 0;
+	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
 	/*
@@ -1930,7 +1946,8 @@ typedef struct
  * of that into the GIN tuple.
  */
 static GinTuple *
-_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+_gin_build_tuple(GinBuildState *state,
+				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
 				 ItemPointerData *items, uint32 nitems,
 				 Size *len)
@@ -2064,6 +2081,13 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 		pfree(seginfo);
 	}
 
+	/* how large would the tuple be without compression? */
+	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		nitems * sizeof(ItemPointerData);
+
+	/* compressed size */
+	state->buildStats.sizeCompressed += tuplen;
+
 	return tuple;
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index be76d8446f4..2b6633d068a 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -49,6 +49,8 @@ typedef struct GinStatsData
 	BlockNumber nDataPages;
 	int64		nEntries;
 	int32		ginVersion;
+	Size		sizeRaw;
+	Size		sizeCompressed;
 } GinStatsData;
 
 /*
-- 
2.46.2

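For a rough sense of what those counters report: the "raw" size is the
MAXALIGNed tuple header plus sizeof(ItemPointerData) = 6 bytes per TID,
while the "compressed" size is the actual tuplen after the posting-list
encoding. A back-of-the-envelope sketch with made-up numbers (1M TIDs,
assuming ~1.5 bytes per TID for a reasonably dense varbyte-encoded list):

/*
 * Back-of-the-envelope version of the ratio the patch prints, using
 * made-up numbers rather than measured ones.
 */
#include <stdio.h>

int
main(void)
{
	size_t		sizeRaw = 1000000UL * 6;	/* 6B = sizeof(ItemPointerData) */
	size_t		sizeCompressed = 1500000;	/* assumed ~1.5 B per TID */

	printf("raw %zu compressed %zu ratio %.2f%%\n",
		   sizeRaw, sizeCompressed,
		   (100.0 * sizeCompressed) / sizeRaw); /* prints 25.00% */
	return 0;
}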
Attachment: v20241008-0006-Enforce-memory-limit-when-combining-tuples.patch (text/x-patch)
From 4a5e9001f85c63e395d777633b327c5372f3262b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:46:48 +0200
Subject: [PATCH v20241008 06/10] Enforce memory limit when combining tuples

When combining intermediate results during a parallel GIN index build, we
want to restrict the memory usage. In ginBuildCallbackParallel() this is
done simply by dumping working state into tuplesort after hitting the
memory limit.

This commit introduces a memory limit for the following steps, which merge
the intermediate results in both the worker and the leader. The merge only
deals with one key at a time, and the primary risk is that a key might have
too many different TIDs. This is not very likely - the TID array needs only
6B per item - but it's still a potential issue.

We can't simply dump the whole current TID list - the index requires the
TID values to be inserted in the correct order, but if the lists overlap
(as they do between workers), the tail of the list may change during the
mergesort. But thanks to sorting GIN tuples by first TID, we can derive
a safe TID horizon - we know no future tuples will have TIDs from before
this value, so it's safe to output this part of the list.

This commit tracks the "frozen" part of the TID list, which is the part
we know won't change after merging additional TID lists. Then if the TID
list grows too large (more than 64kB), we try to trim it - write out the
frozen part of the list, and discard it from the buffer. We only do the
trimming if there are at least 1024 frozen items - we don't want to write
the data into the index in tiny chunks.

The freezing also allows us to skip the frozen part during mergesort.
The frozen part of the list is known to be fully sorted, so we can just
skip it and mergesort only the rest of the data.

Note: These limits (1024 and 64kB) are mostly arbitrary - they seem high
enough to get good efficiency for compression/batching, but low enough
to release memory early and work in small increments.
---
 src/backend/access/gin/gininsert.c | 232 ++++++++++++++++++++++++++++-
 src/include/access/gin.h           |   1 +
 2 files changed, 225 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 7a2d377d941..60dca65d1b8 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1155,8 +1155,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempt to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1223,6 +1227,18 @@ GinBufferInit(Relation index)
 				nKeys;
 	TupleDesc	desc = RelationGetDescr(index);
 
+	/*
+	 * How many items can we fit into the memory limit? We don't want to end
+	 * with too many TIDs. and 64kB seems more than enough. But maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 *
+	 * XXX This is not enough to prevent repeated merges after a wraparound
+	 * of the parallel scan, but it should be enough to make the merges cheap
+	 * because it quickly reaches the end of the second list and can just
+	 * memcpy the rest without walking it item by item.
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
 	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
 
 	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
@@ -1308,6 +1324,54 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (r == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we understand writing out and removing the tuple IDs that
+ * we know can't change by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding if to trim - if the resulting list
+ * (after adding TIDs from the new tuple) would be too long, and if there is
+ * enough TIDs to trim (with values less than "first" TID from the new tuple),
+ * we do the trim. By enough we mean at least 128 TIDs (mostly an arbitrary
+ * number).
+ *
+ * XXX This does help for the wraparound case too, because the "wide" TID list
+ * is essentially two ranges - one at the beginning of the table, one at the
+ * end. And all the other ranges (from GIN tuples) come in between, and also
+ * do not overlap. So by trimming up to the range we're about to add, this
+ * guarantees we'll be able to "concatenate" the two lists cheaply.
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* We're not going to hit the memory limit after adding this tuple. */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1336,6 +1400,11 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
  * check that's true with an assert.
+ *
+ * XXX Maybe we could/should allocate the buffer once and then keep it
+ * without palloc/pfree. That won't help when just calling the mergesort,
+ * as that does palloc internally, but if we detected the append case,
+ * we could do without it. Not sure how much overhead it is, though.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1364,21 +1433,72 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now search the list linearly, to find the last frozen TID. If we found
+	 * the whole list is frozen, this just does nothing.
+	 *
+	 * Start with the first not-yet-frozen TID, and walk until we find the
+	 * first TID that's higher.
+	 *
+	 * XXX Maybe this should do a binary search if the number of "non-frozen"
+	 * items is sufficiently high (enough to make linear search slower than
+	 * binsearch).
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		pfree(new);
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer, true);
 	}
@@ -1417,11 +1537,29 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /*
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
@@ -1489,7 +1627,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers already completed.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/*
@@ -1531,6 +1674,34 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The frozen part of the TID list can't change anymore - write
+			 * it into the index, and discard it from the buffer before
+			 * merging in the current GinTuple.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* discard the frozen data we've just written out */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1554,6 +1725,8 @@ _gin_parallel_merge(GinBuildState *state)
 	/* release all the memory */
 	GinBufferFree(buffer);
 
+	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1614,7 +1787,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/* sort the raw per-worker data */
@@ -1667,6 +1846,41 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The frozen part of the TID list can't change anymore - write
+			 * it out as a GIN tuple into the shared tuplesort, and discard
+			 * it from the buffer before merging in the current GinTuple.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the frozen data we've just written out */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1702,6 +1916,8 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
 		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
 
+	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(worker_sort);
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 2b6633d068a..9381329fac5 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -51,6 +51,7 @@ typedef struct GinStatsData
 	int32		ginVersion;
 	Size		sizeRaw;
 	Size		sizeCompressed;
+	int64		nTrims;
 } GinStatsData;
 
 /*
-- 
2.46.2

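To illustrate the freeze/trim mechanism from 0006 outside the tree, here is
a small self-contained model. The uint64 TIDs, the 16-item limit and the
4-item frozen minimum are stand-ins for ItemPointerData, the 64kB maxitems
limit and the 1024-item threshold; flushing just prints instead of calling
ginEntryInsert / tuplesort_putgintuple:

/*
 * Standalone model of the freeze/trim logic.  Tuples arrive sorted by
 * their first TID, so everything below the incoming first TID is final
 * ("frozen") and can be written out once there's enough of it.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_ITEMS	16			/* stand-in for the 64kB limit */
#define MIN_FROZEN	4			/* stand-in for the 1024-item minimum */

typedef struct
{
	uint64_t	items[MAX_ITEMS * 2];
	int			nitems;
	int			nfrozen;
} Buffer;

/* Freeze TIDs at or below the horizon given by the incoming first TID. */
static void
freeze(Buffer *buf, uint64_t incoming_first)
{
	while (buf->nfrozen < buf->nitems &&
		   buf->items[buf->nfrozen] <= incoming_first)
		buf->nfrozen++;
}

/* Mirrors GinBufferShouldTrim: enough frozen TIDs, and the incoming
 * tuple would push us over the memory limit. */
static int
should_trim(const Buffer *buf, int nincoming)
{
	if (buf->nfrozen < MIN_FROZEN)
		return 0;
	return (buf->nitems + nincoming) >= MAX_ITEMS;
}

/* Write out the frozen prefix (here: print it) and shift the rest down. */
static void
trim(Buffer *buf)
{
	printf("flushing %d frozen TIDs up to %llu\n", buf->nfrozen,
		   (unsigned long long) buf->items[buf->nfrozen - 1]);
	for (int i = buf->nfrozen; i < buf->nitems; i++)
		buf->items[i - buf->nfrozen] = buf->items[i];
	buf->nitems -= buf->nfrozen;
	buf->nfrozen = 0;
}

int
main(void)
{
	Buffer		buf = {0};
	uint64_t	next = 1;

	/* merge ranges of 8 consecutive TIDs, trimming when needed */
	for (int tuple = 0; tuple < 6; tuple++)
	{
		freeze(&buf, next);
		if (should_trim(&buf, 8))
			trim(&buf);
		for (int i = 0; i < 8; i++)
			buf.items[buf.nitems++] = next++;
	}
	printf("%d TIDs left in buffer (%d frozen)\n", buf.nitems, buf.nfrozen);
	return 0;
}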
Attachment: v20241008-0007-Detect-wrap-around-in-parallel-callback.patch (text/x-patch)
From 27a35bcda6d2ab5aa5b4f779d8a586d03cb7e1a8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 20 Jun 2024 20:50:51 +0200
Subject: [PATCH v20241008 07/10] Detect wrap around in parallel callback

When a sync scan wraps around during the index build, some keys may end
up with very long TID lists, requiring "full" merge sort runs when
combining data in workers. It also causes problems with enforcing the
memory limit, because we can't just dump the data - the index build
requires append-only posting lists, and violating that may result in
errors like

  ERROR: could not split GIN page; all old items didn't fit

because after the scan wraps around, some of the TIDs may belong to the
beginning of the list, affecting the compression.

But we can deal with this in the callback - if we see the TID jump
backwards, a wraparound must have happened. In that case we simply dump
all the data accumulated in memory, and start from scratch.

This means there won't be any tuples with very wide TID ranges; instead
there'll be one tuple with a range at the end of the table, and another
tuple at the beginning. And all the lists in the worker will be
non-overlapping, and will sort nicely by first TID.

For the leader, we still need to do the full merge - the lists may
overlap and interleave in various ways. But there should be only a few
of those lists, about one per worker, so this is not an issue.
---
 src/backend/access/gin/gininsert.c | 132 ++++++++++++++---------------
 1 file changed, 63 insertions(+), 69 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 60dca65d1b8..f79c9a7d83f 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -144,6 +144,7 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
 
 	/* FIXME likely duplicate with indtuples */
 	double		bs_numtuples;
@@ -475,6 +476,47 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginFlushBuildState
+ *		Write all data from BuildAccumulator into the tuplesort.
+ */
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(buildstate, attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
 /*
  * ginBuildCallbackParallel
  *		Callback for the parallel index build.
@@ -499,6 +541,11 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
+ * To detect a wraparound (which can happen with sync scans), we remember the
+ * last TID seen by each worker - if the next TID seen by the worker is lower,
+ * the scan must have wrapped around. We handle that by flushing the current
+ * buildstate to the tuplesort, so that we don't end up with wide TID lists.
+ *
  * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
@@ -515,6 +562,16 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
 
+	/* scan wrapped around - flush accumulated entries and start anew */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+	{
+		elog(LOG, "calling ginFlushBuildState");
+		ginFlushBuildState(buildstate, index);
+	}
+
+	/* remember the TID we're about to process */
+	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
+
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
 		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
 							   values[i], isnull[i], tid);
@@ -533,40 +590,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 	 * maintenance command.
 	 */
 	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the index key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			/* GIN tuple and tuple length that we'll use for tuplesort */
-			GinTuple   *tup;
-			Size		tuplen;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(buildstate, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &tuplen);
-
-			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
-	}
+		ginFlushBuildState(buildstate, index);
 
 	MemoryContextSwitchTo(oldCtx);
 }
@@ -603,6 +627,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.bs_numtuples = 0;
 	buildstate.bs_reltuples = 0;
 	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
@@ -1232,8 +1257,8 @@ GinBufferInit(Relation index)
 	 * with too many TIDs. and 64kB seems more than enough. But maybe this
 	 * should be tied to maintenance_work_mem or something like that?
 	 *
-	 * XXX This is not enough to prevent repeated merges after a wraparound
-	 * of the parallel scan, but it should be enough to make the merges cheap
+	 * XXX This is not enough to prevent repeated merges after a wraparound of
+	 * the parallel scan, but it should be enough to make the merges cheap
 	 * because it quickly reaches the end of the second list and can just
 	 * memcpy the rest without walking it item by item.
 	 */
@@ -1969,39 +1994,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 									   ginBuildCallbackParallel, state, scan);
 
 	/* write remaining accumulated entries */
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&state->accum);
-		while ((list = ginGetBAEntry(&state->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			GinTuple   *tup;
-			Size		len;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(state, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &len);
-
-			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(state->tmpCtx);
-		ginInitBA(&state->accum);
-	}
+	ginFlushBuildState(state, index);
 
 	/*
 	 * Do the first phase of in-worker processing - sort the data produced by
@@ -2086,6 +2079,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	buildstate.indtuples = 0;
 	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/*
 	 * create a temporary memory context that is used to hold data not yet
-- 
2.46.2

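The detection itself is tiny - the interesting part is that a backwards TID
is a reliable wraparound signal, because the scan position otherwise only
moves forward. A self-contained model of the callback logic, with uint64
stand-ins for ItemPointerData and a printf in place of ginFlushBuildState():

/*
 * Standalone model of the wraparound detection in the build callback:
 * remember the last TID seen, and a backwards jump means the sync scan
 * wrapped around, so the accumulated state is flushed.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t tid_t;

typedef struct
{
	tid_t		last_tid;
	int			naccum;			/* entries accumulated in memory */
} BuildState;

static void
flush_build_state(BuildState *state)
{
	printf("wraparound: flushing %d accumulated entries\n", state->naccum);
	state->naccum = 0;
}

static void
build_callback(BuildState *state, tid_t tid)
{
	/* scan wrapped around - flush accumulated entries and start anew */
	if (tid < state->last_tid)
		flush_build_state(state);

	/* remember the TID we're about to process */
	state->last_tid = tid;
	state->naccum++;
}

int
main(void)
{
	BuildState	state = {0};

	/* a sync scan that starts at block 500 and wraps at 1000 */
	for (tid_t t = 500; t < 1000; t++)
		build_callback(&state, t);
	for (tid_t t = 0; t < 500; t++)		/* wrapped part of the scan */
		build_callback(&state, t);

	printf("%d entries left after the scan\n", state.naccum);
	return 0;
}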
Attachment: v20241008-0008-Use-a-single-GIN-tuplesort.patch (text/x-patch)
From 5d52f95c6fb5cdb6fb3066e881ab88a63407e205 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 5 Jul 2024 19:22:32 +0200
Subject: [PATCH v20241008 08/10] Use a single GIN tuplesort

The previous approach was to sort the data in a private sort, then read it
back, merge the GinTuples, and write them into the shared sort for the
leader to process.

The new approach is to use a single sort, merging tuples as we write them to disk.
This reduces temporary disk space.

An optimization was added to GinBuffer in which we don't deserialize tuples unless
we need access to the itemIds.

This modifies Tuplesort to have a new flushwrites callback. A sort's writetup
can now decide to buffer writes until the next flushwrites() call.
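
To sketch the writetup/flushwrites contract (a simplified model, not the
tuplesortvariants.c code - the real callbacks merge GinTuples through the
GinBuffer rather than counting string keys):

/*
 * Standalone model of the writetup/flushwrites contract: writetup may
 * buffer consecutive tuples with the same key instead of writing them,
 * and the sort calls flushwrites at the end of a run so nothing stays
 * buffered.
 */
#include <stdio.h>
#include <string.h>

typedef struct
{
	char		key[8];
	int			nbuffered;		/* tuples merged into the pending write */
	int			valid;
} WriteBuffer;

/* Buffer the tuple if it matches the pending key, else flush and restart. */
static void
writetup(WriteBuffer *buf, const char *key)
{
	if (buf->valid && strcmp(buf->key, key) == 0)
	{
		buf->nbuffered++;		/* same key: merge instead of writing */
		return;
	}
	if (buf->valid)
		printf("write key=%s (%d tuples merged)\n", buf->key, buf->nbuffered);
	snprintf(buf->key, sizeof(buf->key), "%s", key);
	buf->nbuffered = 1;
	buf->valid = 1;
}

/* Called by the sort at the end of a run (the FLUSHWRITES call sites). */
static void
flushwrites(WriteBuffer *buf)
{
	if (buf->valid)
		printf("write key=%s (%d tuples merged)\n", buf->key, buf->nbuffered);
	buf->valid = 0;
}

int
main(void)
{
	WriteBuffer buf = {0};
	const char *run[] = {"abc", "abc", "abc", "def", "ghi", "ghi"};

	for (int i = 0; i < 6; i++)
		writetup(&buf, run[i]);
	flushwrites(&buf);			/* without this, "ghi" would be lost */
	return 0;
}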
---
 src/backend/access/gin/gininsert.c         | 427 +++++++++------------
 src/backend/utils/sort/tuplesort.c         |   5 +
 src/backend/utils/sort/tuplesortvariants.c | 102 ++++-
 src/include/access/gin_private.h           |   3 +
 src/include/access/gin_tuple.h             |  10 +
 src/include/utils/tuplesort.h              |  10 +-
 6 files changed, 307 insertions(+), 250 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index f79c9a7d83f..e02cb6d0e67 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -163,14 +163,6 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
-
-	/*
-	 * The sortstate used only within a single worker for the first merge pass
-	 * happenning there. In principle it doesn't need to be part of the build
-	 * state and we could pass it around directly, but it's more convenient
-	 * this way. And it's part of the build state, after all.
-	 */
-	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -195,8 +187,7 @@ static Datum _gin_parse_tuple_key(GinTuple *a);
 static GinTuple *_gin_build_tuple(GinBuildState *state,
 								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
-								  ItemPointerData *items, uint32 nitems,
-								  Size *len);
+								  ItemPointerData *items, uint32 nitems);
 
 /*
  * Adds array of item pointers to tuple's posting list, or
@@ -499,16 +490,15 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
-		Size		tuplen;
 
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
 		tup = _gin_build_tuple(buildstate, attnum, category,
 							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+		tuplesort_putgintuple(buildstate->bs_sortstate, tup);
 
 		pfree(tup);
 	}
@@ -1169,8 +1159,14 @@ _gin_parallel_heapscan(GinBuildState *state)
  * synchronized (and thus may wrap around), and when combininng values from
  * multiple workers.
  */
-typedef struct GinBuffer
+struct GinBuffer
 {
+	/*
+	 * The memory context holds the dynamic allocation of items, key, and any
+	 * produced GinTuples.
+	 */
+	MemoryContext context;
+	GinTuple   *cached; /* copy of previous GIN tuple */
 	OffsetNumber attnum;
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
@@ -1188,7 +1184,7 @@ typedef struct GinBuffer
 	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
-} GinBuffer;
+};
 
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
@@ -1203,8 +1199,7 @@ AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
+	Assert(buffer->nitems == 0 || buffer->items != NULL);
 
 	for (int i = 0; i < buffer->nitems; i++)
 	{
@@ -1224,7 +1219,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* if we have any items, the array must exist */
-	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+	Assert((buffer->nitems == 0) || (buffer->items != NULL));
 
 	/*
 	 * we don't know if the TID array is expected to be sorted or not
@@ -1244,7 +1239,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
  *
  * Initializes sort support procedures for all index attributes.
  */
-static GinBuffer *
+GinBuffer *
 GinBufferInit(Relation index)
 {
 	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
@@ -1294,15 +1289,18 @@ GinBufferInit(Relation index)
 
 		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
 	}
+	buffer->context = GenerationContextCreate(CurrentMemoryContext,
+											  "Gin Buffer",
+											  ALLOCSET_DEFAULT_SIZES);
 
 	return buffer;
 }
 
 /* Is the buffer empty, i.e. has no TID values in the array? */
-static bool
+bool
 GinBufferIsEmpty(GinBuffer *buffer)
 {
-	return (buffer->nitems == 0);
+	return (buffer->nitems == 0 && buffer->cached == NULL);
 }
 
 /*
@@ -1314,37 +1312,71 @@ GinBufferIsEmpty(GinBuffer *buffer)
 static bool
 GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	int			r;
+	AttrNumber	attnum;
 	Datum		tupkey;
+	Datum		bufkey;
 
 	AssertCheckGinBuffer(buffer);
+	if (buffer->cached)
+	{
+		GinTuple   *cached = buffer->cached;
 
-	if (tup->attrnum != buffer->attnum)
-		return false;
+		if (tup->attrnum != cached->attrnum)
+			return false;
 
-	/* same attribute should have the same type info */
-	Assert(tup->typbyval == buffer->typbyval);
-	Assert(tup->typlen == buffer->typlen);
+		Assert(tup->typbyval == cached->typbyval);
+		Assert(tup->typlen == cached->typlen);
 
-	if (tup->category != buffer->category)
-		return false;
+		if (tup->category != cached->category)
+			return false;
 
-	/*
-	 * For NULL/empty keys, this means equality, for normal keys we need to
-	 * compare the actual key value.
-	 */
-	if (buffer->category != GIN_CAT_NORM_KEY)
-		return true;
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need to
+		 * compare the actual key value.
+		 */
+		if (cached->category != GIN_CAT_NORM_KEY)
+			return true;
+
+		attnum = cached->attrnum;
+		bufkey = _gin_parse_tuple_key(cached);
+	}
+	else
+	{
+		if (tup->attrnum != buffer->attnum)
+			return false;
+
+		/* same attribute should have the same type info */
+		Assert(tup->typbyval == buffer->typbyval);
+		Assert(tup->typlen == buffer->typlen);
+
+		if (tup->category != buffer->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need to
+		 * compare the actual key value.
+		 */
+		if (buffer->category != GIN_CAT_NORM_KEY)
+			return true;
+		attnum = buffer->attnum;
+		bufkey = buffer->key;
+	}
 
 	/*
 	 * For the tuple, get either the first sizeof(Datum) bytes for byval
 	 * types, or a pointer to the beginning of the data array.
 	 */
-	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+	tupkey = _gin_parse_tuple_key(tup);
+
+	prev = MemoryContextSwitchTo(buffer->context);
 
-	r = ApplySortComparator(buffer->key, false,
+	r = ApplySortComparator(bufkey, false,
 							tupkey, false,
-							&buffer->ssup[buffer->attnum - 1]);
+							&buffer->ssup[attnum - 1]);
+
+	MemoryContextSwitchTo(prev);
 
 	return (r == 0);
 }
@@ -1397,6 +1429,55 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 	return true;
 }
 
+static void
+GinBufferUnpackCached(GinBuffer *buffer, int reserve_space)
+{
+	Datum		key;
+	ItemPointer	items;
+	GinTuple   *cached;
+	int			totitems;
+
+	cached = buffer->cached;
+	totitems = cached->nitems + reserve_space;
+	key = _gin_parse_tuple_key(cached);
+
+	buffer->category = cached->category;
+	buffer->keylen = cached->keylen;
+	buffer->attnum = cached->attrnum;
+
+	buffer->typlen = cached->typlen;
+	buffer->typbyval = cached->typbyval;
+
+	if (cached->category == GIN_CAT_NORM_KEY)
+		buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+	else
+		buffer->key = (Datum) 0;
+
+	items = _gin_parse_tuple_items(cached);
+
+	if (buffer->items == NULL)
+	{
+		buffer->items = palloc0(totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else if (buffer->maxitems < totitems)
+	{
+		buffer->items = repalloc(buffer->items,
+								 totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else {
+		Assert(PointerIsValid(buffer->items) &&
+			   buffer->maxitems >= totitems);
+	}
+	memcpy(buffer->items, items, buffer->nitems * sizeof(ItemPointerData));
+	buffer->nitems = cached->nitems;
+
+	buffer->cached = NULL;
+	pfree(cached);
+	pfree(items);
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1431,32 +1512,28 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * as that does palloc internally, but if we detected the append case,
  * we could do without it. Not sure how much overhead it is, though.
  */
-static void
-GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+void
+GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	ItemPointerData *items;
-	Datum		key;
 
+	prev = MemoryContextSwitchTo(buffer->context);
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple_key(tup);
-	items = _gin_parse_tuple_items(tup);
-
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
 	{
-		buffer->category = tup->category;
-		buffer->keylen = tup->keylen;
-		buffer->attnum = tup->attrnum;
-
-		buffer->typlen = tup->typlen;
-		buffer->typbyval = tup->typbyval;
-
-		if (tup->category == GIN_CAT_NORM_KEY)
-			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
-		else
-			buffer->key = (Datum) 0;
+		GinTuple   *tuple = palloc(tup->tuplen);
+		memcpy(tuple, tup, tup->tuplen);
+		buffer->cached = tuple;
 	}
+	else if (buffer->cached != NULL)
+	{
+		GinBufferUnpackCached(buffer, tup->nitems);
+	}
+
+	items = _gin_parse_tuple_items(tup);
 
 	/*
 	 * Try freeze TIDs at the beginning of the list, i.e. exclude them from
@@ -1530,6 +1607,33 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	/* free the decompressed TID list */
 	pfree(items);
+
+	MemoryContextSwitchTo(prev);
+}
+
+GinTuple *
+GinBufferBuildTuple(GinBuffer *buffer)
+{
+	MemoryContext prev = MemoryContextSwitchTo(buffer->context);
+	GinTuple *result;
+
+	if (buffer->cached)
+	{
+		result = buffer->cached;
+		buffer->cached = NULL;
+	}
+	else
+	{
+		result = _gin_build_tuple(NULL, buffer->attnum, buffer->category,
+								  buffer->key, buffer->typlen,
+								  buffer->typbyval, buffer->items,
+								  buffer->nitems);
+	}
+
+	GinBufferReset(buffer);
+
+	MemoryContextSwitchTo(prev);
+	return result;
 }
 
 /*
@@ -1543,14 +1647,21 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
  *
  * XXX Might be better to have a separate memory context for the buffer.
  */
-static void
+void
 GinBufferReset(GinBuffer *buffer)
 {
 	Assert(!GinBufferIsEmpty(buffer));
 
-	/* release byref values, do nothing for by-val ones */
-	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
-		pfree(DatumGetPointer(buffer->key));
+	/* release cached buffer tuple, if present */
+	if (buffer->cached)
+		pfree(buffer->cached);
+	else
+	{
+		/* release byref values, do nothing for by-val ones */
+		if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval
+			&& PointerIsValid(DatumGetPointer(buffer->key)))
+			pfree(DatumGetPointer(buffer->key));
+	}
 
 	/*
 	 * Not required, but makes it more likely to trigger NULL dereference if
@@ -1566,6 +1677,7 @@ GinBufferReset(GinBuffer *buffer)
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
+	/* Note that we don't reset the memory context, this is deliberate */
 }
 
 /*
@@ -1589,7 +1701,7 @@ GinBufferTrim(GinBuffer *buffer)
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
  */
-static void
+void
 GinBufferFree(GinBuffer *buffer)
 {
 	if (buffer->items)
@@ -1600,6 +1712,7 @@ GinBufferFree(GinBuffer *buffer)
 		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
 		pfree(DatumGetPointer(buffer->key));
 
+	MemoryContextDelete(buffer->context);
 	pfree(buffer);
 }
 
@@ -1613,7 +1726,7 @@ GinBufferFree(GinBuffer *buffer)
  * the TID array, and returning false if it's too large (more thant work_mem,
  * for example).
  */
-static bool
+bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 {
 	/* empty buffer can accept data for any key */
@@ -1690,6 +1803,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1718,6 +1832,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1731,7 +1846,10 @@ _gin_parallel_merge(GinBuildState *state)
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
@@ -1739,6 +1857,7 @@ _gin_parallel_merge(GinBuildState *state)
 	{
 		AssertCheckItemPointers(buffer, true);
 
+		Assert(!PointerIsValid(buffer->cached));
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
 					   buffer->items, buffer->nitems, &state->buildStats);
@@ -1790,162 +1909,6 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
-/*
- * _gin_process_worker_data
- *		First phase of the key merging, happening in the worker.
- *
- * Depending on the number of distinct keys, the TID lists produced by the
- * callback may be very short (due to frequent evictions in the callback).
- * But combining many tiny lists is expensive, so we try to do as much as
- * possible in the workers and only then pass the results to the leader.
- *
- * We read the tuples sorted by the key, and merge them into larger lists.
- * At the moment there's no memory limit, so this will just produce one
- * huge (sorted) list per key in each worker. Which means the leader will
- * do a very limited number of mergesorts, which is good.
- */
-static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-
-	GinBuffer  *buffer;
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The workers are limited to the same amount of memory as during the sort
-	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
-	 * during planning, just like there.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/* sort the raw per-worker data */
-	tuplesort_performsort(state->bs_worker_sort);
-
-	/* print some basic info */
-	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
-		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
-		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
-
-	/* reset before the second phase */
-	state->buildStats.sizeCompressed = 0;
-	state->buildStats.sizeRaw = 0;
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
-	 * merge them into larger chunks for the leader to combine.
-	 */
-	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
-	{
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-
-			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			Assert(buffer->nfrozen > 0);
-
-			state->buildStats.nTrims++;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-
-			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferStoreTuple(buffer, tup);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		GinTuple   *ntup;
-		Size		ntuplen;
-
-		AssertCheckItemPointers(buffer, true);
-
-		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems, &ntuplen);
-
-		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-		pfree(ntup);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	/* print some basic info */
-	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
-		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
-		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
-
-	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
-
-	tuplesort_end(worker_sort);
-}
-
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1978,11 +1941,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													sortmem, coordinate,
 													TUPLESORT_NONE);
 
-	/* Local per-worker sort of raw-data */
-	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
-													  sortmem, NULL,
-													  TUPLESORT_NONE);
-
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1996,13 +1954,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* write remaining accumulated entries */
 	ginFlushBuildState(state, index);
 
-	/*
-	 * Do the first phase of in-worker processing - sort the data produced by
-	 * the callback, and combine them into much larger chunks and place that
-	 * into the shared tuplestore for leader to process.
-	 */
-	_gin_process_worker_data(state, state->bs_worker_sort);
-
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
@@ -2159,8 +2110,7 @@ static GinTuple *
 _gin_build_tuple(GinBuildState *state,
 				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
-				 ItemPointerData *items, uint32 nitems,
-				 Size *len)
+				 ItemPointerData *items, uint32 nitems)
 {
 	GinTuple   *tuple;
 	char	   *ptr;
@@ -2228,8 +2178,6 @@ _gin_build_tuple(GinBuildState *state,
 	 */
 	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
-	*len = tuplen;
-
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
@@ -2291,12 +2239,15 @@ _gin_build_tuple(GinBuildState *state,
 		pfree(seginfo);
 	}
 
-	/* how large would the tuple be without compression? */
-	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		nitems * sizeof(ItemPointerData);
+	if (state)
+	{
+		/* how large would the tuple be without compression? */
+		state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+									 nitems * sizeof(ItemPointerData);
 
-	/* compressed size */
-	state->buildStats.sizeCompressed += tuplen;
+		/* compressed size */
+		state->buildStats.sizeCompressed += tuplen;
+	}
 
 	return tuple;
 }
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index c960cfa8231..fd838a3b1bf 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -395,6 +395,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2244,6 +2245,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2369,6 +2372,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 59df02c9481..14be95ed2b0 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "access/gin.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -90,6 +91,7 @@ static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
 static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
 							   SortTuple *stup);
+static void flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape);
 static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -101,6 +103,7 @@ static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static void freestate_cluster(Tuplesortstate *state);
+static void freestate_index_gin(Tuplesortstate *state);
 
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the CLUSTER case.  Set by
@@ -135,6 +138,16 @@ typedef struct
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
 } TuplesortIndexBTreeArg;
 
+/*
+ * Data structure pointed by "TuplesortPublic.arg" for the index_gin subcase.
+ */
+typedef struct
+{
+	TuplesortIndexArg index;
+	GinBuffer *buffer;
+} TuplesortIndexGinArg;
+
+
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the index_hash subcase.
  */
@@ -209,6 +222,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -285,6 +299,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -393,6 +408,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -472,6 +488,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -516,6 +533,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -571,6 +589,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -590,6 +609,7 @@ tuplesort_begin_index_gin(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   sortopt);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg;
 	MemoryContext oldcontext;
 	int			i;
 	TupleDesc	desc = RelationGetDescr(indexRel);
@@ -614,6 +634,10 @@ tuplesort_begin_index_gin(Relation heapRel,
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
 										   sizeof(SortSupportData));
+	arg = palloc0(sizeof(TuplesortIndexGinArg));
+	arg->index.indexRel = indexRel;
+	arg->index.heapRel = heapRel;
+	arg->buffer = GinBufferInit(indexRel);
 
 	for (i = 0; i < base->nKeys; i++)
 	{
@@ -645,9 +669,11 @@ tuplesort_begin_index_gin(Relation heapRel,
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
 	base->writetup = writetup_index_gin;
+	base->flushwrites = flushwrites_index_gin;
 	base->readtup = readtup_index_gin;
+	base->freestate = freestate_index_gin;
 	base->haveDatum1 = false;
-	base->arg = NULL;
+	base->arg = arg;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -688,6 +714,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -890,17 +917,17 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 }
 
 void
-tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple)
 {
 	SortTuple	stup;
 	GinTuple   *ctup;
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
-	Size		tuplen;
+	Size		tuplen = tuple->tuplen;
 
 	/* copy the GinTuple into the right memory context */
-	ctup = palloc(size);
-	memcpy(ctup, tuple, size);
+	ctup = palloc(tuplen);
+	memcpy(ctup, tuple, tuplen);
 
 	stup.tuple = ctup;
 	stup.datum1 = (Datum) 0;
@@ -908,7 +935,7 @@ tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
 
 	/* GetMemoryChunkSpace is not supported for bump contexts */
 	if (TupleSortUseBumpTupleCxt(base->sortopt))
-		tuplen = MAXALIGN(size);
+		tuplen = MAXALIGN(tuplen);
 	else
 		tuplen = GetMemoryChunkSpace(ctup);
 
@@ -1928,19 +1955,61 @@ comparetup_index_gin(const SortTuple *a, const SortTuple *b,
 }
 
 static void
-writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+_writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, GinTuple *tup)
 {
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
-	GinTuple   *tuple = (GinTuple *) stup->tuple;
-	unsigned int tuplen = tuple->tuplen;
-
+	unsigned int tuplen = tup->tuplen;
 	tuplen = tuplen + sizeof(tuplen);
+
 	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
-	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	LogicalTapeWrite(tape, tup, tup->tuplen);
+
 	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 }
 
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *otup;
+	GinTuple   *ntup = (GinTuple *) stup->tuple;
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(PointerIsValid(arg));
+
+	if (GinBufferCanAddKey(arg->buffer, ntup))
+	{
+		GinBufferMergeTuple(arg->buffer, ntup);
+		return;
+	}
+
+	otup = GinBufferBuildTuple(arg->buffer);
+
+	_writetup_index_gin(state, tape, otup);
+
+	pfree(otup);
+
+	Assert(GinBufferCanAddKey(arg->buffer, ntup));
+
+	GinBufferMergeTuple(arg->buffer, ntup);
+}
+
+static void
+flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	if (!GinBufferIsEmpty(arg->buffer))
+	{
+		GinTuple *tuple = GinBufferBuildTuple(arg->buffer);
+		_writetup_index_gin(state, tape, tuple);
+		pfree(tuple);
+		Assert(GinBufferIsEmpty(arg->buffer));
+	}
+}
+
 static void
 readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 				  LogicalTape *tape, unsigned int len)
@@ -1966,6 +2035,17 @@ readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = (Datum) 0;
 }
 
+static void
+freestate_index_gin(Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(arg != NULL);
+	Assert(GinBufferIsEmpty(arg->buffer));
+	GinBufferFree(arg->buffer);
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 3013a44bae1..149191b7df2 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -475,6 +475,9 @@ extern int	ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len, TI
 
 extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *segment, int len,
 												   int *ndecoded_out);
+extern bool ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+												ItemPointer into, int capacity,
+												int *ndecoded_out);
 extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded_out);
 extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
 										ItemPointerData *b, uint32 nb,
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 55dd8544b21..4ac8cfcc2bf 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -35,6 +35,16 @@ typedef struct GinTuple
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
 
+typedef struct GinBuffer GinBuffer;
+
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
 
+extern GinBuffer *GinBufferInit(Relation index);
+extern bool GinBufferIsEmpty(GinBuffer *buffer);
+extern bool GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup);
+extern void GinBufferReset(GinBuffer *buffer);
+extern void GinBufferFree(GinBuffer *buffer);
+extern void GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup);
+extern GinTuple *GinBufferBuildTuple(GinBuffer *buffer);
+
 #endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 0ed71ae922a..6c56e40bff1 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -194,6 +194,14 @@ typedef struct
 	 */
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient
+	 * use of the tape's resources, e.g. when deduplicating or merging
+	 * values.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
@@ -461,7 +469,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
-extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, struct GinTuple *tuple);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
-- 
2.46.2
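
To make the contract of the new flushwrites callback concrete, here is a
minimal self-contained sketch - not from the patch - with an int key
standing in for a GinTuple and a FILE * for a LogicalTape. writetup() may
retain state across calls to merge consecutive equal keys; flushwrites()
must leave nothing buffered at a run boundary, which is why the patch
invokes FLUSHWRITES in dumptuples() and mergeonerun() right before the
end-of-run marker is written.

#include <stdbool.h>
#include <stdio.h>

typedef struct WriteBuffer
{
	bool		has_pending;	/* is a key buffered? */
	int			key;			/* the buffered key */
	int			count;			/* payload merged for equal keys */
} WriteBuffer;

/* May buffer instead of writing (the writetup side of the contract). */
static void
buffered_writetup(WriteBuffer *b, FILE *tape, int key)
{
	if (b->has_pending && b->key == key)
	{
		b->count++;				/* same key: merge, write nothing yet */
		return;
	}
	if (b->has_pending)
		fprintf(tape, "%d %d\n", b->key, b->count);
	b->has_pending = true;
	b->key = key;
	b->count = 1;
}

/* Must drain the buffer (the flushwrites side of the contract). */
static void
buffered_flushwrites(WriteBuffer *b, FILE *tape)
{
	if (b->has_pending)
	{
		fprintf(tape, "%d %d\n", b->key, b->count);
		b->has_pending = false;
	}
}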

v20241008-0009-Reduce-the-size-of-GinTuple-by-12-bytes.patch
From 0e2b1fa309daea440530536a8de5efe86038d31d Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 5 Jul 2024 20:58:37 +0200
Subject: [PATCH v20241008 09/10] Reduce the size of GinTuple by 12 bytes

The size of a Gin tuple can't be larger than what we can allocate, which is just
shy of 1GB; this reduces the number of useful bits in size fields to 30 bits, so
int will be enough here.

Next, a key must fit in a single page (up to 32KB), so uint16 should be enough for
the keylen attribute.

Then, re-organize the fields to minimize alignment losses, while maintaining an
order that still makes sense as a logical grouping.

Finally, use the first posting list to get the first stored ItemPointer; this
deduplicates stored data and thus improves performance again. In passing, adjust the
alignment of the first GinPostingList in GinTuple from MAXALIGN to SHORTALIGN.
---
 src/backend/access/gin/gininsert.c | 21 ++++++++++++---------
 src/include/access/gin_tuple.h     | 19 +++++++++++++++----
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e02cb6d0e67..b9444b6db7f 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1550,7 +1550,8 @@ GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 	 * when merging non-overlapping lists, e.g. in each parallel worker.
 	 */
 	if ((buffer->nitems > 0) &&
-		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+							GinTupleGetFirst(tup)) == 0))
 		buffer->nfrozen = buffer->nitems;
 
 	/*
@@ -1567,7 +1568,8 @@ GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
 	{
 		/* Is the TID after the first TID of the new tuple? Can't freeze. */
-		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+		if (ItemPointerCompare(&buffer->items[i],
+							   GinTupleGetFirst(tup)) > 0)
 			break;
 
 		buffer->nfrozen++;
@@ -2176,7 +2178,7 @@ _gin_build_tuple(GinBuildState *state,
 	 * alignment, to allow direct access to compressed segments (those require
 	 * SHORTALIGN, but we do MAXALIGN anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	/*
 	 * Allocate space for the whole GIN tuple.
@@ -2191,7 +2193,6 @@ _gin_build_tuple(GinBuildState *state,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
-	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -2222,7 +2223,7 @@ _gin_build_tuple(GinBuildState *state,
 	}
 
 	/* finally, copy the TIDs into the array */
-	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
 
 	/* copy in the compressed data, and free the segments */
 	dlist_foreach_modify(iter, &segments)
@@ -2292,8 +2293,8 @@ _gin_parse_tuple_items(GinTuple *a)
 	int			ndecoded;
 	ItemPointer items;
 
-	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	len = a->tuplen - SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
 
 	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
 
@@ -2355,8 +2356,10 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 								&ssup[a->attrnum - 1]);
 
 		/* if the key is the same, consider the first TID in the array */
-		return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
+		return (r != 0) ? r : ItemPointerCompare(GinTupleGetFirst(a),
+												 GinTupleGetFirst(b));
 	}
 
-	return ItemPointerCompare(&a->first, &b->first);
+	return ItemPointerCompare(GinTupleGetFirst(a),
+							  GinTupleGetFirst(b));
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 4ac8cfcc2bf..f4dbdfd3f7f 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -10,10 +10,12 @@
 #ifndef GIN_TUPLE_
 #define GIN_TUPLE_
 
+#include "access/ginblock.h"
 #include "storage/itemptr.h"
 #include "utils/sortsupport.h"
 
 /*
+ * XXX: Update description with new architecture
  * Each worker sees tuples in CTID order, so if we track the first TID and
  * compare that when combining results in the worker, we would not need to
  * do an expensive sort in workers (the mergesort is already smart about
@@ -24,17 +26,26 @@
  */
 typedef struct GinTuple
 {
-	Size		tuplen;			/* length of the whole tuple */
-	Size		keylen;			/* bytes in data for key value */
+	int			tuplen;			/* length of the whole tuple */
+	OffsetNumber attrnum;		/* attnum of index key */
+	uint16		keylen;			/* bytes in data for key value */
 	int16		typlen;			/* typlen for key */
 	bool		typbyval;		/* typbyval for key */
-	OffsetNumber attrnum;		/* attnum of index key */
 	signed char category;		/* category: normal or NULL? */
-	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
 
+static inline ItemPointer
+GinTupleGetFirst(GinTuple *tup)
+{
+	GinPostingList *list;
+
+	list = (GinPostingList *) SHORTALIGN(tup->data + tup->keylen);
+
+	return &list->first;
+}
+
 typedef struct GinBuffer GinBuffer;
 
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
-- 
2.46.2
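
If the padding math holds, the reordered header packs into 16 bytes on
common 64-bit platforms (4 + 2 + 2 + 2 + 1 + 1 + 4, with data starting at
offset 16). A guard along these lines could sit next to the struct -
illustrative only, not part of the patch, and the bound is deliberately
loose because exact padding is platform-dependent:

#include <stddef.h>
#include "access/gin_tuple.h"

/* Hypothetical guard, not in the patch: keep the GinTuple header small. */
StaticAssertDecl(offsetof(GinTuple, data) <= 16,
				 "GinTuple header larger than expected");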

v20241008-0010-WIP-parallel-inserts-into-GIN-index.patch
From 4f6de3719ac0330670f8a24a40d5fb91a8d92213 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Tue, 9 Jul 2024 20:53:20 +0200
Subject: [PATCH v20241008 10/10] WIP: parallel inserts into GIN index

---
 src/backend/access/gin/gininsert.c            | 432 ++++++++++++------
 .../utils/activity/wait_event_names.txt       |   2 +
 2 files changed, 289 insertions(+), 145 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b9444b6db7f..c2f6ba35230 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -25,7 +25,9 @@
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "pgstat.h"
+#include "storage/barrier.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "storage/predicate.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/datum.h"
@@ -42,6 +44,11 @@
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
 
+/* The phases for parallel builds, used by build_barrier. */
+#define GIN_BUILD_INIT					0
+#define GIN_BUILD_SCAN					1
+#define GIN_BUILD_PARTITION				2
+
 /*
  * Status for index builds performed in parallel.  This is allocated in a
  * dynamic shared memory segment.
@@ -88,6 +95,9 @@ typedef struct GinShared
 	double		reltuples;
 	double		indtuples;
 
+	Barrier		build_barrier;
+	SharedFileSet fileset;		/* space for shared temporary files */
+
 	/*
 	 * ParallelTableScanDescData data follows. Can't directly embed here, as
 	 * implementations of the parallel table scan desc interface might need
@@ -172,7 +182,6 @@ static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relati
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
 static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
-static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
 											  Relation heap, Relation index);
 static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
@@ -189,6 +198,12 @@ static GinTuple *_gin_build_tuple(GinBuildState *state,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems);
 
+static double _gin_partition_sorted_data(GinBuildState *state);
+static void _gin_parallel_insert(GinBuildState *state,
+								 GinShared *ginshared,
+								 Relation heap, Relation index,
+								 bool progress);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -554,10 +569,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	/* scan wrapped around - flush accumulated entries and start anew */
 	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
-	{
-		elog(LOG, "calling ginFlushBuildState");
 		ginFlushBuildState(buildstate, index);
-	}
 
 	/* remember the TID we're about to process */
 	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
@@ -718,8 +730,12 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
 
-		/* scan the relation in parallel and merge per-worker results */
-		reltuples = _gin_parallel_merge(state);
+		/* partition the sorted data */
+		reltuples = _gin_partition_sorted_data(state);
+
+		/* do the insert for the leader's partition */
+		_gin_parallel_insert(state, state->bs_leader->ginshared,
+							 heap, index, true);
 
 		_gin_end_parallel(state->bs_leader, state);
 	}
@@ -1009,6 +1025,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	ginshared->reltuples = 0.0;
 	ginshared->indtuples = 0.0;
 
+	/* used to wait for data to insert */
+	BarrierInit(&ginshared->build_barrier, scantuplesortstates);
+
+	/* Set up the space we'll use for shared temporary files. */
+	SharedFileSetInit(&ginshared->fileset, pcxt->seg);
+
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinShared(ginshared),
 								  snapshot);
@@ -1080,6 +1102,11 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned, leader continues");
 }
 
 /*
@@ -1093,6 +1120,8 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	/* Shutdown worker processes */
 	WaitForParallelWorkersToFinish(ginleader->pcxt);
 
+	SharedFileSetDeleteAll(&ginleader->ginshared->fileset);
+
 	/*
 	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
 	 * or we might get incomplete data.)
@@ -1739,145 +1768,6 @@ GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 	return GinBufferKeyEquals(buffer, tup);
 }
 
-/*
- * Within leader, wait for end of heap scan and merge per-worker results.
- *
- * After waiting for all workers to finish, merge the per-worker results into
- * the complete index. The results from each worker are sorted by block number
- * (start of the page range). While combinig the per-worker results we merge
- * summaries for the same page range, and also fill-in empty summaries for
- * ranges without any tuples.
- *
- * Returns the total number of heap tuples scanned.
- *
- * FIXME Maybe should have local memory contexts similar to what
- * _brin_parallel_merge does?
- */
-static double
-_gin_parallel_merge(GinBuildState *state)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-	double		reltuples = 0;
-	GinBuffer  *buffer;
-
-	/* wait for workers to scan table and produce partial results */
-	reltuples = _gin_parallel_heapscan(state);
-
-	/* do the actual sort in the leader */
-	tuplesort_performsort(state->bs_sortstate);
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The leader is allowed to use the whole maintenance_work_mem buffer to
-	 * combine data. The parallel workers already completed.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by category and
-	 * key. That probably gives us order matching how data is organized in the
-	 * index.
-	 *
-	 * We don't insert the GIN tuples right away, but instead accumulate as
-	 * many TIDs for the same key as possible, and then insert that at once.
-	 * This way we don't need to decompress/recompress the posting lists, etc.
-	 *
-	 * XXX Maybe we should sort by key first, then by category? The idea is
-	 * that if this matches the order of the keys in the index, we'd insert
-	 * the entries in order better matching the index.
-	 */
-	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
-	{
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nitems, &state->buildStats);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			Assert(buffer->nfrozen > 0);
-
-			state->buildStats.nTrims++;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nfrozen, &state->buildStats);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferMergeTuple(buffer, tup);
-
-		if (buffer->cached)
-			GinBufferUnpackCached(buffer, 0);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		AssertCheckItemPointers(buffer, true);
-
-		Assert(!PointerIsValid(buffer->cached));
-		ginEntryInsert(&state->ginstate,
-					   buffer->attnum, buffer->key, buffer->category,
-					   buffer->items, buffer->nitems, &state->buildStats);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
-
-	tuplesort_end(state->bs_sortstate);
-
-	return reltuples;
-}
-
 /*
  * Returns size of shared memory required to store state for a parallel
  * gin index build based on the snapshot its parallel scan will use.
@@ -2061,6 +1951,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Prepare to track buffer usage during parallel execution */
 	InstrStartParallelQuery();
 
+	/* attach to the fileset too */
+	SharedFileSetAttach(&ginshared->fileset, seg);
+
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
 	 * (when requested number of workers were not launched, this will be
@@ -2071,6 +1964,20 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
 
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned by workers, leader continues");
+
+	/* leader sorts and partitions the data */
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned by leader, worker continues");
+
+	_gin_parallel_insert(&buildstate, ginshared, heapRel, indexRel, false);
+
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
@@ -2363,3 +2270,238 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	return ItemPointerCompare(GinTupleGetFirst(a),
 							  GinTupleGetFirst(b));
 }
+
+static double
+_gin_partition_sorted_data(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	GinShared  *shared = state->bs_leader->ginshared;
+	BufFile	  **files;
+	int64		fileidx = 0;
+	double		reltuples;
+
+	/* how many tuples per worker */
+	int64		worker_tuples = (state->indtuples / shared->scantuplesortstates) + 1;
+	int64		remaining = Min(worker_tuples, 1000);
+	int64		ntmp = 0;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* Allocate BufFiles, one for each participant. */
+	files = palloc0_array(BufFile *, shared->scantuplesortstates);
+
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		char	fname[128];
+		sprintf(fname, "worker-%d", i);
+
+		files[i] = BufFileCreateFileSet(&shared->fileset.fs, fname);
+	}
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us order matching how data is organized in the
+	 * index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		ntmp++;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * FIXME Maybe move to next partition only when the index key changes?
+		 * Otherwise we might have issues with 'could not fit onto page' when
+		 * adding overlapping TID lists to the index. But maybe that can't
+		 * happen, given the merging of data in the tuplesort?
+		 */
+
+		BufFileWrite(files[fileidx], &tuplen, sizeof(tuplen));
+		BufFileWrite(files[fileidx], tup, tuplen);
+
+		remaining--;
+
+		/* move to the next file */
+		if (remaining == 0)
+		{
+			remaining = Min(worker_tuples, 1000);
+			fileidx++;
+			fileidx = fileidx % shared->scantuplesortstates;
+		}
+	}
+
+	/* close the files */
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		BufFileClose(files[i]);
+	}
+
+	/* and also close the tuplesort */
+	tuplesort_end(state->bs_sortstate);
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&shared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned, leader continues");
+
+	return reltuples;
+}
+
+static void
+_gin_parallel_insert(GinBuildState *state, GinShared *ginshared,
+					 Relation heap, Relation index, bool progress)
+{
+	GinBuffer  *buffer;
+	GinTuple   *tup;
+	Size		len;
+
+	BufFile *file;
+	char	fname[128];
+	char   *buff;
+	int64	ntuples = 0;
+	Size	maxlen;
+
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers already completed.
+	 */
+	buffer = GinBufferInit(state->ginstate.index);
+
+
+	sprintf(fname, "worker-%d", ParallelWorkerNumber + 1);
+	file = BufFileOpenFileSet(&ginshared->fileset.fs, fname, O_RDONLY, false);
+
+	/* 8kB seems like a reasonable starting point */
+	maxlen = 8192;
+	buff = palloc(maxlen);
+
+	while (true)
+	{
+		size_t		ret;
+
+		ret = BufFileRead(file, &len, sizeof(len));
+
+		if (ret == 0)
+			break;
+		if (ret != sizeof(len))
+			elog(ERROR, "incorrect data %zu %zu", ret, sizeof(len));
+
+		/* maybe resize the buffer */
+		if (maxlen < len)
+		{
+			while (maxlen < len)
+				maxlen *= 2;
+
+			buff = repalloc(buff, maxlen);
+		}
+
+		tup = (GinTuple *) buff;
+
+
+		BufFileReadExact(file, tup, len);
+
+		ntuples++;
+
+		if (ntuples % 100000 == 0)
+			elog(LOG, "inserted " INT64_FORMAT " tuples", ntuples);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append if to the existing data).
+		 * or append it to the existing data).
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer, true);
+
+		Assert(!PointerIsValid(buffer->cached));
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	elog(LOG, "_gin_parallel_insert ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
+	BufFileClose(file);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 8efb4044d6f..30864f8f3aa 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -116,6 +116,8 @@ CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from star
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
 CHECKPOINT_START	"Waiting for a checkpoint to start."
 EXECUTE_GATHER	"Waiting for activity from a child process while executing a <literal>Gather</literal> plan node."
+GIN_BUILD_PARTITION	"Waiting for the leader to partition data during a parallel GIN index build."
+GIN_BUILD_SCAN	"Waiting for the table scan to finish during a parallel GIN index build."
 HASH_BATCH_ALLOCATE	"Waiting for an elected Parallel Hash participant to allocate a hash table."
 HASH_BATCH_ELECT	"Waiting to elect a Parallel Hash participant to allocate a hash table."
 HASH_BATCH_LOAD	"Waiting for other Parallel Hash participants to finish loading a hash table."
-- 
2.46.2
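
The per-worker files written by _gin_partition_sorted_data() use a simple
framing: a Size length word followed by that many bytes of serialized
GinTuple, which _gin_parallel_insert() reads back into a geometrically
grown buffer. Here is a self-contained sketch of that read loop, with
stdio standing in for BufFile (the names are made up for illustration):

#include <stdio.h>
#include <stdlib.h>

/*
 * Read one length-prefixed record into *buf, growing the buffer as
 * needed. Returns 0 on clean EOF between records, 1 on success; a short
 * read in the middle of a record is an error.
 */
static int
read_record(FILE *f, char **buf, size_t *maxlen)
{
	size_t		len;
	size_t		ret = fread(&len, 1, sizeof(len), f);

	if (ret == 0)
		return 0;				/* EOF exactly on a record boundary */
	if (ret != sizeof(len))
	{
		fprintf(stderr, "incorrect length word: %zu of %zu bytes\n",
				ret, sizeof(len));
		exit(1);
	}
	if (*maxlen < len)
	{
		while (*maxlen < len)
			*maxlen *= 2;		/* grow geometrically, as the patch does */
		if ((*buf = realloc(*buf, *maxlen)) == NULL)
			exit(1);
	}
	if (fread(*buf, 1, len, f) != len)
	{
		fprintf(stderr, "truncated record\n");
		exit(1);
	}
	return 1;
}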

#41Kirill Reshke
reshkekirill@gmail.com
In reply to: Tomas Vondra (#40)
Re: Parallel CREATE INDEX for GIN indexes

On Tue, 8 Oct 2024 at 17:06, Tomas Vondra <tomas@vondra.me> wrote:

On 10/8/24 04:03, Michael Paquier wrote:

_gin_parallel_build_main() is introduced in 0001. Please make sure to
pass down a query ID.

Thanks for the ping. Here's an updated patch doing that, and also fixing
a couple whitespace issues. No other changes, but I plan to get back to
this patch soon - before the next CF.

regards

--
Tomas Vondra

Hi! I was looking through this series of patches because the thread of
the GIN&GIST amcheck patch references it.

I have spotted this in gininsert.c:
1)

/*
* Store shared tuplesort-private state, for which we reserved space.
* Then, initialize opaque state using tuplesort routine.
*/
sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
tuplesort_initialize_shared(sharedsort, scantuplesortstates,
							pcxt->seg);

/*
* Store shared tuplesort-private state, for which we reserved space.
* Then, initialize opaque state using tuplesort routine.
*/

Is it necessary to duplicate the entire comment?

And, while we are here, isn't it "initialize the opaque state"?

2) typo:
* the TID array, and returning false if it's too large (more thant work_mem,

3) in _gin_build_tuple:

....
else if (typlen == -2)
keylen = strlen(DatumGetPointer(key)) + 1;
else
elog(ERROR, "invalid typlen");

Maybe `elog(ERROR, "invalid typLen: %d", typLen); ` as in `datumGetSize`?

4) in _gin_compare_tuples:

if ((a->category == GIN_CAT_NORM_KEY) &&
    (b->category == GIN_CAT_NORM_KEY))

maybe just a->category == GIN_CAT_NORM_KEY? a->category is already
equal to b->category because of previous if statements.

5) In _gin_partition_sorted_data:

char fname[128];
sprintf(fname, "worker-%d", i);

Other places use MAXPGPATH in similar cases.

Also, the code `sprintf(fname, "worker-%d", ...);` is duplicated. This
might be error-prone. Should we have a macro/inline function for this?

I will take another look later, maybe reporting real problems, not nit-picks.

--
Best regards,
Kirill Reshke

#42Tomas Vondra
tomas@vondra.me
In reply to: Kirill Reshke (#41)
10 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

On 11/24/24 19:04, Kirill Reshke wrote:

On Tue, 8 Oct 2024 at 17:06, Tomas Vondra <tomas@vondra.me> wrote:

On 10/8/24 04:03, Michael Paquier wrote:

_gin_parallel_build_main() is introduced in 0001. Please make sure to
pass down a query ID.

Thanks for the ping. Here's an updated patch doing that, and also fixing
a couple whitespace issues. No other changes, but I plan to get back to
this patch soon - before the next CF.

regards

--
Tomas Vondra

Hi! I was looking through this series of patches because thread of
GIN&GIST amcheck patch references it.

I have spotted this in gininsert.c:
1)

/*
* Store shared tuplesort-private state, for which we reserved space.
* Then, initialize opaque state using tuplesort routine.
*/
sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
tuplesort_initialize_shared(sharedsort, scantuplesortstates,
							pcxt->seg);

/*
* Store shared tuplesort-private state, for which we reserved space.
* Then, initialize opaque state using tuplesort routine.
*/

Is it necessary to duplicate the entire comment?

Yes, that's a copy-paste mistake. Removed the second comment.

And, while we are here, isn't it " initialize the opaque state "?

Not sure, this is copy-pasted as-is from the nbtree code.

2) typo :
* the TID array, and returning false if it's too large (more thant work_mem,

Fixed.

3) in _gin_build_tuple:

....
else if (typlen == -2)
keylen = strlen(DatumGetPointer(key)) + 1;
else
elog(ERROR, "invalid typlen");

Maybe `elog(ERROR, "invalid typLen: %d", typLen); ` as in `datumGetSize`?

Makes sense, I reworded it a little bit. It is, however, supposed to be
a can't-happen condition.

4) in _gin_compare_tuples:

if ((a->category == GIN_CAT_NORM_KEY) &&
    (b->category == GIN_CAT_NORM_KEY))

maybe just a->category == GIN_CAT_NORM_KEY? a->category is already
equal to b->category because of previous if statements.

True. I've simplified the condition.

5) In _gin_partition_sorted_data:

char fname[128];
sprintf(fname, "worker-%d", i);

Other places use MAXPGPATH in similar cases.

OK, fixed the two places that format worker-%d.

Also, the code `sprintf(fname, "worker-%d", ...);` is duplicated. This
might be error-prone. Should we have a macro/inline function for this?

Maybe. I think using a constant might be a good idea, but anything more
complicated is not worth it. There are only two places using it, not very
far apart.
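
For what it's worth, a hypothetical shape of such a helper (not in the
posted patches) - one place to build the fileset name, so the writer in
_gin_partition_sorted_data() and the reader in _gin_parallel_insert()
cannot drift apart:

#include "c.h"					/* MAXPGPATH, snprintf */

/* Hypothetical helper, not part of the patch series. */
static inline void
gin_fileset_name(char *fname, int participant)
{
	snprintf(fname, MAXPGPATH, "worker-%d", participant);
}

The writer would pass the loop index i, the reader ParallelWorkerNumber + 1,
matching what the patches do today.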

I will take another look later, maybe reporting real problems, not nit-picks.

Thanks. Attached is a rebased patch series fixing those issues, and one
issue I found in an AssertCheckGinBuffer, which was calling the other
assert (AssertCheckItemPointers) even for empty buffers. I think this
part might need some more work, so that it's clear what the various
asserts assume (or rather to allow just calling AssertCheckGinBuffer
everywhere, with some flags).
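
To illustrate the "flags" idea, the shape could be something like this
(hypothetical, not what the attached series does):

/* Check basic invariants; with GIN_BUFFER_CHECK_ITEMS also check TIDs. */
#define GIN_BUFFER_CHECK_ITEMS	0x01

static void
AssertCheckGinBuffer(GinBuffer *buffer, int flags)
{
#ifdef USE_ASSERT_CHECKING
	/* invariants that hold even for an empty buffer */
	Assert(buffer != NULL);

	/* item-pointer checks only make sense for a non-empty buffer */
	if (flags & GIN_BUFFER_CHECK_ITEMS)
		AssertCheckItemPointers(buffer, true);
#endif
}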

I still need to go through the comments / questions by Matthias and Andy
Fan that I missed when they posted them in August.

My plan is to eventually commit the first couple patches, possibly up to
0007 or even 0009. The rest would be left as an improvement for the
future. I need to figure out how to squash the patches - I don't want to
squash this into a single much-harder-to-understand commit, but maybe it
has too many parts.

regards

--
Tomas Vondra

Attachments:

v20250104-0001-Allow-parallel-create-for-GIN-indexes.patch
From 4827067556bb35a56174f9fe09bf28294e350995 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 19 Jun 2024 12:42:24 +0200
Subject: [PATCH v20250104 01/10] Allow parallel create for GIN indexes

Add support for parallel create of GIN indexes, using an approach and
code very similar to the one used by BRIN indexes.

Each worker reads a subset of the table (from a parallel scan), and
accumulates index entries in memory. But instead of writing the results
into the index (after hitting the memory limit), the data are written
to a shared tuplesort (and sorted by index key).

The leader then reads data from the tuplesort, and combines them into
entries that get inserted into the index.
---
 src/backend/access/gin/gininsert.c         | 1446 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  203 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   31 +
 src/include/utils/tuplesort.h              |    8 +
 src/tools/pgindent/typedefs.list           |    4 +
 8 files changed, 1687 insertions(+), 15 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 8e1788dbcf7..21a3620f3ab 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,126 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/sortsupport.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields before heapdesc.
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinShared;
+
+/*
+ * Return pointer to a GinShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +144,48 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process.
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +464,109 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginBuildCallbackParallel
+ *		Callback for the parallel index build.
+ *
+ * This is very similar to the serial build callback ginBuildCallback,
+ * except that instead of writing the accumulated entries into the index,
+ * we write them into a tuplesort that is then processed by the leader.
+ *
+ * XXX Instead of writing the entries directly into the shared tuplesort,
+ * we might write them into a local one, do a sort in the worker, combine
+ * the results, and only then write the results into the shared tuplesort.
+ * For large tables with many different keys that's going to work better
+ * than the current approach where we don't get many matches in work_mem
+ * (maybe this should use 32MB, which is what we use when planning, but
+ * even that may not be sufficient). Which means we are likely to have
+ * many entries with a small number of TIDs, forcing the leader to merge
+ * the data, often amounting to ~50% of the serial part. By doing the
+ * first sort in the workers, the leader could then do fewer merges with
+ * longer TID lists, which is much cheaper. Also, the amount of data sent
+ * from workers to the leader would be lower.
+ *
+ * The disadvantage is increased disk space usage, possibly up to 2x, if
+ * no entries get combined at the worker level.
+ *
+ * It would be possible to partition the data into multiple tuplesorts
+ * per worker (by hashing) - we don't need the data produced by workers
+ * to be perfectly sorted, and we could even live with multiple entries
+ * for the same key (in case it has multiple binary representations with
+ * distinct hash values).
+ */
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort
+	 *
+	 * XXX It might seem this should set the memory limit to 32MB, same as
+	 * what plan_create_index_workers() uses to calculate the number of
+	 * parallel workers, but that's the limit for tuplesort. So it seems
+	 * better to keep using work_mem here.
+	 *
+	 * XXX But maybe we should calculate this as a per-worker fraction of
+	 * maintenance_work_mem. It's weird to use work_mem here, in a clearly
+	 * maintenance command.
+	 */
+	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&buildstate->accum);
+		while ((list = ginGetBAEntry(&buildstate->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the index key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			/* GIN tuple and tuple length that we'll use for tuplesort */
+			GinTuple   *tup;
+			Size		tuplen;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &tuplen);
+
+			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(buildstate->tmpCtx);
+		ginInitBA(&buildstate->accum);
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +584,15 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * Initialize all the fields, not to trip valgrind.
+	 *
+	 * XXX Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,24 +633,91 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. For GIN that's
+	 * reasonable too, because we sort the data just like btree. It does
+	 * ignore the memory used to accumulate data in memory (set by work_mem),
+	 * but there is no way to communicate that to plan_create_index_workers.
+	 */
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
+
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, &buildstate, NULL);
+	if (state->bs_leader)
+	{
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(heap, index,
+									  maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+		/* scan the relation in parallel and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
+
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, &buildstate, NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -533,3 +857,1097 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinShared  *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+/*
+ * Buffer used to accumulate TIDs from multiple GinTuples for the same key
+ * (we read these from the tuplesort, sorted by the key).
+ *
+ * This is similar to BuildAccumulator in that it's used to collect TIDs
+ * in memory before inserting them into the index, but it's much simpler
+ * as it only deals with a single index key at a time.
+ *
+ * XXX The TID values in the "items" array are not guaranteed to be sorted,
+ * so we have to sort them explicitly. This is because parallel scans may be
+ * synchronized (and thus may wrap around the end of the table), and because
+ * we combine values from multiple workers.
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	int			maxitems;
+	SortSupport ssup;			/* for sorting/comparing keys */
+	ItemPointerData *items;
+} GinBuffer;
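+
+/*
+ * Typical use of a GinBuffer (a sketch only; the actual flow is in
+ * _gin_parallel_merge below):
+ *
+ *	buffer = GinBufferInit(index);
+ *
+ *	while ((tup = tuplesort_getgintuple(sortstate, &len, true)) != NULL)
+ *	{
+ *		if (!GinBufferCanAddKey(buffer, tup))
+ *		{
+ *			GinBufferSortItems(buffer);
+ *			ginEntryInsert(..., buffer->items, buffer->nitems, ...);
+ *			GinBufferReset(buffer);
+ *		}
+ *		GinBufferStoreTuple(buffer, tup);
+ *	}
+ */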
+
+/*
+ * Check that TID array contains valid values, and that it's sorted (if we
+ * expect it to be).
+ */
+static void
+AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+{
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 0; i < nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&items[i]));
+
+		if ((i == 0) || !sorted)
+			continue;
+
+		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+	}
+#endif
+}
+
+/* basic GinBuffer checks */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	Assert(buffer->nitems <= buffer->maxitems);
+
+	/* if we have any items, the array must exist */
+	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+
+	/*
+	 * we don't know if the TID array is expected to be sorted or not
+	 *
+	 * XXX maybe we could pass that to the AssertCheckGinBuffer() call?
+	 */
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+#endif
+}
+
+/*
+ * Initialize the buffer used to accumulate TIDs for a single key at a time.
+ * We process the data in sorted order, so we know when we have received all
+ * the data for a given key.
+ *
+ * Initializes sort support procedures for all index attributes.
+ */
+static GinBuffer *
+GinBufferInit(Relation index)
+{
+	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
+	int			i,
+				nKeys;
+	TupleDesc	desc = RelationGetDescr(index);
+
+	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
+
+	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
+
+	/*
+	 * Lookup ordering operator for the index key data type, and initialize
+	 * the sort support function.
+	 */
+	for (i = 0; i < nKeys; i++)
+	{
+		SortSupport sortKey = &buffer->ssup[i];
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = index->rd_indcollation[i];
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	return buffer;
+}
+
+/* Is the buffer empty, i.e. has no TID values in the array? */
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * Check if the tuple matches the data already accumulated in the GIN
+ * buffer. The scalar fields are compared first, before the actual key.
+ *
+ * Returns true if the key matches, i.e. the TIDs belong to the buffer.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	int			r;
+	Datum		tupkey;
+
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	/*
+	 * For NULL/empty keys, this means equality, for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * For the tuple, get either the first sizeof(Datum) bytes for byval
+	 * types, or a pointer to the beginning of the data array.
+	 */
+	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+
+	r = ApplySortComparator(buffer->key, false,
+							tupkey, false,
+							&buffer->ssup[buffer->attnum - 1]);
+
+	return (r == 0);
+}
+
+/*
+ * GinBufferStoreTuple
+ *		Add data (especially TID list) from a GIN tuple to the buffer.
+ *
+ * The buffer is expected to be empty (in which case it's initialized), or
+ * to hold the same key. The TID values from the tuple are simply appended
+ * to the array, without sorting.
+ *
+ * XXX We expect the tuples to contain sorted TID lists, so maybe we should
+ * check that's true with an assert. And we could also check if the values
+ * are already in sorted order, in which case we can skip the sort later.
+ * But it seems like a waste of time, because it won't be unnecessary after
+ * switching to mergesort in a later patch, and also because it's reasonable
+ * to expect the arrays to overlap.
+ */
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* enlarge the TID buffer, if needed */
+	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	{
+		/* 64 seems like a good init value */
+		buffer->maxitems = Max(buffer->maxitems, 64);
+
+		while (buffer->nitems + tup->nitems > buffer->maxitems)
+			buffer->maxitems *= 2;
+
+		if (buffer->items == NULL)
+			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 buffer->maxitems * sizeof(ItemPointerData));
+	}
+
+	/* now we should be guaranteed to have enough space for all the TIDs */
+	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+
+	/* copy the new TIDs into the buffer */
+	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
+	buffer->nitems += tup->nitems;
+
+	/* we simply append the TID values, so don't check sorting */
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+}
+
+/* TID comparator for qsort */
+static int
+tid_cmp(const void *a, const void *b)
+{
+	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+}
+
+/*
+ * GinBufferSortItems
+ *		Sort the TID values stored in the TID buffer.
+ */
+static void
+GinBufferSortItems(GinBuffer *buffer)
+{
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+}
+
+/*
+ * GinBufferReset
+ *		Reset the buffer into a state as if it contains no data.
+ *
+ * XXX Should we do something if the array of TIDs gets too large? We don't
+ * free it until the worker finishes building, so it might be better not to
+ * let the array grow arbitrarily large, and instead enforce work_mem as a
+ * memory limit by flushing the buffer into the tuplesort.
+ *
+ * XXX Might be better to have a separate memory context for the buffer.
+ */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/*
+	 * Not required, but makes it more likely to trigger a NULL dereference
+	 * if the value is used incorrectly, etc.
+	 */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+}
+
+/*
+ * GinBufferCanAddKey
+ *		Check if a given GIN tuple can be added to the current buffer.
+ *
+ * Returns true if the buffer is either empty or for the same index key.
+ *
+ * XXX This could / should also enforce a memory limit by checking the size of
+ * the TID array, and returning false if it's too large (more than work_mem,
+ * for example).
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, merge the per-worker results into
+ * the complete index. The results from each worker are sorted by the index
+ * key (attribute number, category and key value). While combining the
+ * per-worker results, we accumulate the TIDs for each key and insert them
+ * into the index at once.
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe should have local memory contexts similar to what
+ * _brin_parallel_merge does?
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us an order matching how the data is organized
+	 * in the index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinBufferSortItems(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
+													sortmem, coordinate,
+													TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write remaining accumulated entries */
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&state->accum);
+		while ((list = ginGetBAEntry(&state->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			GinTuple   *tup;
+			Size		len;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &len);
+
+			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(state->tmpCtx);
+		ginInitBA(&state->accum);
+	}
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinShared  *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
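+	/* (e.g. with maintenance_work_mem = 64MB and 4 participants, each
+	 * participant's tuplesort gets 16MB to work with) */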
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
+	 * have actual non-empty key. We include varlena headers and \0 bytes for
+	 * strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to do the MAXALIGN to allow
+	 * direct access to items pointers.
+	 *
+	 * XXX Note that for byval types we store the whole datum, no matter what
+	 * the typlen value is.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "unexpected typlen value (%d)", typlen);
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * Allocate space for the whole GIN tuple.
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplesort representation.
+ *
+ * Most of the fields are actually directly accessible, the only thing that
+ * needs more care is the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, the key value is
+ * compared last. The comparisons are done using type-specific sort support
+ * functions.
+ *
+ * XXX We might try using memcmp(), based on the assumption that if we get
+ * two keys that are two different representations of a logically equal
+ * value, it'll get merged by the index build. But it's not clear that's
+ * safe, because for keys with multiple binary representations we might
+ * end with overlapping lists. Which might affect performance by requiring
+ * full merge of the TID lists, and perhaps even failures (e.g. errors like
+ * "could not split GIN page; all old items didn't fit" when inserting data
+ * into the index).
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
+{
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if (a->category == GIN_CAT_NORM_KEY)
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		return ApplySortComparator(keya, false,
+								   keyb, false,
+								   &ssup[a->attrnum - 1]);
+	}
+
+	return 0;
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 2500d16b7bc..f6021e46a13 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 7817bedc2ef..9a5e8a21899 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -148,6 +149,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 913c4ef455e..b65a3bc47bb 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,10 +20,12 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "catalog/index.h"
+#include "catalog/pg_collation.h"
 #include "executor/executor.h"
 #include "pg_trace.h"
 #include "utils/datum.h"
@@ -46,6 +48,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +78,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +88,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -568,6 +578,82 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+/*
+ * XXX Maybe we should pass the ordering functions, not the heap/index?
+ */
+Tuplesortstate *
+tuplesort_begin_index_gin(Relation heapRel,
+						  Relation indexRel,
+						  int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext;
+	int			i;
+	TupleDesc	desc = RelationGetDescr(indexRel);
+
+	oldcontext = MemoryContextSwitchTo(base->maincontext);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	/*
+	 * Multi-column GIN indexes expand the row into a separate index entry
+	 * for each attribute, and that's what we write into the tuplesort. But
+	 * we still need to initialize sortsupport for all the attributes.
+	 */
+	base->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
+
+	/* Prepare SortSupport data for each column */
+	base->sortKeys = (SortSupport) palloc0(base->nKeys *
+										   sizeof(SortSupportData));
+
+	for (i = 0; i < base->nKeys; i++)
+	{
+		SortSupport sortKey = base->sortKeys + i;
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = indexRel->rd_indcollation[i];
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		/*
+		 * Look up an ordering operator for the index key data type, and then
+		 * the sort support function.
+		 *
+		 * XXX does this use the right opckeytype/opcintype for GIN?
+		 */
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -803,6 +889,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tuple, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -975,6 +1092,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1763,6 +1903,69 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple,
+							   base->sortKeys);
+}
+
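+/*
+ * On-tape format written below (a sketch): an unsigned int length word
+ * (counting itself plus the tuple), then the serialized GinTuple of
+ * tuple->tuplen bytes, and - for TUPLESORT_RANDOMACCESS sorts - a trailing
+ * copy of the length word, matching the other writetup routines.
+ */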
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 9ed48dfde4b..2debdac0f43 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 00000000000..6f529a5aaf0
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,31 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations for serialized GIN tuples used by parallel index builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "storage/itemptr.h"
+#include "utils/sortsupport.h"
+
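+/*
+ * A sketch of the serialized layout (see _gin_build_tuple): the data[]
+ * array holds the key value first (a whole Datum for by-value types, the
+ * actual data for by-reference types), then padding up to a MAXALIGN
+ * boundary, then the array of ItemPointerData.
+ */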
+/* XXX do we still need all the fields now that we use SortSupport? */
+typedef struct GinTuple
+{
+	Size		tuplen;			/* length of the whole tuple */
+	Size		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	OffsetNumber attrnum;		/* attnum of index key */
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index c63f1e5d6da..ef79f259f93 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,10 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(Relation heapRel,
+												 Relation indexRel,
+												 int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +461,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +471,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e1c4f913f84..f159b409b8c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1022,11 +1022,13 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1039,9 +1041,11 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinShared
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.47.1

v20250104-0002-Use-mergesort-in-the-leader-process.patch (text/x-patch)
From 20b0a33a9944cb97b33e439a725dabf1c037e2fe Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:02:29 +0200
Subject: [PATCH v20250104 02/10] Use mergesort in the leader process

The leader process (executing the serial part of the index build) spent
a significant part of the time in pg_qsort, after combining the partial
results from the workers. But we can improve this and move some of the
costs to the parallel part in workers - if workers produce sorted TID
lists, the leader can combine them by mergesort.

But to make this really efficient, the mergesort must not be executed
too many times. The workers may easily produce very short TID lists, if
there are many different keys, hitting the memory limit often. So this
adds an intermediate tuplesort pass into each worker, to combine TIDs
for each key and only then write the result into the shared tuplesort.

This means the number of mergesort invocations for each key should be
about the same as the number of workers. We can't really do better, and
it's low enough to keep the mergesort approach efficient.

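A sketch of the resulting per-worker flow (names as in the patch):

    heap scan -> ginBuildCallbackParallel -> bs_worker_sort (local sort)
    _gin_process_worker_data: combine TIDs per key -> bs_sortstate (shared)
    leader: merge the per-worker results and insert into the index
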
Note: If we introduce a memory limit on GinBuffer (to not accumulate too
many TIDs in memory), we could end up with more chunks, but it should
not be very common.
---
 src/backend/access/gin/gininsert.c | 200 +++++++++++++++++++++++------
 1 file changed, 162 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 21a3620f3ab..91ee5cc07fa 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -162,6 +162,14 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate used only within a single worker for the first merge pass
+	 * happenning there. In principle it doesn't need to be part of the build
+	 * state and we could pass it around directly, but it's more convenient
+	 * this way. And it's part of the build state, after all.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -472,23 +480,23 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * except that instead of writing the accumulated entries into the index,
  * we write them into a tuplesort that is then processed by the leader.
  *
- * XXX Instead of writing the entries directly into the shared tuplesort,
- * we might write them into a local one, do a sort in the worker, combine
+ * Instead of writing the entries directly into the shared tuplesort, write
+ * them into a local one (in each worker), do a sort in the worker, combine
  * the results, and only then write the results into the shared tuplesort.
  * For large tables with many different keys that's going to work better
  * than the current approach where we don't get many matches in work_mem
  * (maybe this should use 32MB, which is what we use when planning, but
- * even that may not be sufficient). Which means we are likely to have
- * many entries with a small number of TIDs, forcing the leader to merge
- * the data, often amounting to ~50% of the serial part. By doing the
- * first sort workers, the leader then could do fewer merges with longer
- * TID lists, which is much cheaper. Also, the amount of data sent from
- * workers to the leader woiuld be lower.
+ * even that may not be sufficient). Which means we would end up with many
+ * entries with a small number of TIDs, forcing the leader to merge the data,
+ * often amounting to ~50% of the serial part. By doing the first sort in
+ * workers, this work is parallelized and the leader does fewer merges with
+ * longer TID lists, which is much cheaper and more efficient. Also, the
+ * amount of data sent from workers to the leader is lower.
  *
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
- * It would be possible to partition the data into multiple tuplesorts
+ * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
  * for the same key (in case it has multiple binary representations with
@@ -548,7 +556,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
-			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
 
 			pfree(tup);
 		}
@@ -1140,7 +1148,6 @@ typedef struct GinBuffer
 
 	/* array of TID values */
 	int			nitems;
-	int			maxitems;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1170,8 +1177,6 @@ static void
 AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
-	Assert(buffer->nitems <= buffer->maxitems);
-
 	/* if we have any items, the array must exist */
 	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
 
@@ -1293,11 +1298,7 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  * to the array, without sorting.
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
- * check that's true with an assert. And we could also check if the values
- * are already in sorted order, in which case we can skip the sort later.
- * But it seems like a waste of time, because it won't be unnecessary after
- * switching to mergesort in a later patch, and also because it's reasonable
- * to expect the arrays to overlap.
+ * check that's true with an assert.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1325,28 +1326,22 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* enlarge the TID buffer, if needed */
-	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
-		/* 64 seems like a good init value */
-		buffer->maxitems = Max(buffer->maxitems, 64);
+		int			nnew;
+		ItemPointer new;
 
-		while (buffer->nitems + tup->nitems > buffer->maxitems)
-			buffer->maxitems *= 2;
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
 
-		if (buffer->items == NULL)
-			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
-		else
-			buffer->items = repalloc(buffer->items,
-									 buffer->maxitems * sizeof(ItemPointerData));
-	}
+		Assert(nnew == buffer->nitems + tup->nitems);
 
-	/* now we should be guaranteed to have enough space for all the TIDs */
-	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+		if (buffer->items)
+			pfree(buffer->items);
 
-	/* copy the new TIDs into the buffer */
-	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
-	buffer->nitems += tup->nitems;
+		buffer->items = new;
+		buffer->nitems = nnew;
+	}
 
 	/* we simply append the TID values, so don't check sorting */
 	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
@@ -1410,6 +1405,24 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferFree
+ *		Release memory associated with the GinBuffer (including TID array).
+ */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
 /*
  * GinBufferCanAddKey
  *		Check if a given GIN tuple can be added to the current buffer.
@@ -1491,7 +1504,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the index, and start a new entry for the current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1508,7 +1521,7 @@ _gin_parallel_merge(GinBuildState *state)
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1518,6 +1531,9 @@ _gin_parallel_merge(GinBuildState *state)
 		GinBufferReset(buffer);
 	}
 
+	/* release all the memory */
+	GinBufferFree(buffer);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1556,6 +1572,102 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short (due to frequent evictions in the callback).
+ * But combining many tiny lists is expensive, so we try to do as much as
+ * possible in the workers and only then pass the results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker, which means the leader only
+ * needs to do a very limited number of mergesorts, which is good.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the local tuplesort, sorted by the key, and
+	 * merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the shared tuplesort, and start a new entry for
+			 * the current GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		GinBufferSortItems(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1588,6 +1700,11 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													sortmem, coordinate,
 													TUPLESORT_NONE);
 
+	/* Local per-worker sort of raw data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
+													  sortmem, NULL,
+													  TUPLESORT_NONE);
+
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1624,7 +1741,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
-			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
 
 			pfree(tup);
 		}
@@ -1633,6 +1750,13 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 		ginInitBA(&state->accum);
 	}
 
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place those
+	 * into the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
-- 
2.47.1

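The loop in _gin_process_worker_data above is a classic group-and-flush pass
over a key-sorted stream: accumulate TIDs while the key stays the same, emit
the combined list on every key change, and flush once more for the final
group. A minimal standalone sketch of just that control flow, with toy types
(Rec, flush) standing in for the real GinBuffer/GinTuple machinery:

#include <stdio.h>

typedef struct
{
	int			key;
	int			tid;
} Rec;

/* stands in for ginEntryInsert / tuplesort_putgintuple */
static void
flush(int key, const int *tids, int ntids)
{
	printf("key %d: %d TIDs\n", key, ntids);
}

int
main(void)
{
	Rec			recs[] = {{1, 10}, {1, 11}, {2, 3}, {2, 7}, {3, 5}};
	int			buf[8];
	int			nbuf = 0;
	int			curkey = recs[0].key;

	for (size_t i = 0; i < sizeof(recs) / sizeof(recs[0]); i++)
	{
		/* key changed - flush the accumulated TIDs and start over */
		if (recs[i].key != curkey)
		{
			flush(curkey, buf, nbuf);
			nbuf = 0;
			curkey = recs[i].key;
		}
		buf[nbuf++] = recs[i].tid;
	}

	/* don't forget the final group */
	if (nbuf > 0)
		flush(curkey, buf, nbuf);

	return 0;
}
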
v20250104-0003-Remove-the-explicit-pg_qsort-in-workers.patch
From b2f51e710c703d612c1d70e1511a192714e80b1b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:14:52 +0200
Subject: [PATCH v20250104 03/10] Remove the explicit pg_qsort in workers

We don't need to do the explicit sort before building the GIN tuple,
because the mergesort in GinBufferStoreTuple is already maintaining the
correct order (this was added in an earlier commit).

The commit also adds a field with the first TID, and modifies the
comparator to sort by it (for each key value). This helps workers to
build non-overlapping TID lists and simply append values instead of
having to do the actual mergesort to combine them. This is best-effort,
i.e. it's not guaranteed to eliminate the mergesort - in particular,
parallel scans are synchronized, and thus may start somewhere in the
middle of the table and wrap around, in which case there may be a very
wide list (with both low and high TID values).

Note: There's an XXX comment with a couple ideas on how to improve this,
at the cost of more complexity.
---
 src/backend/access/gin/gininsert.c | 118 ++++++++++++++++++-----------
 src/include/access/gin_tuple.h     |  11 ++-
 2 files changed, 85 insertions(+), 44 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 91ee5cc07fa..8fd5ffec844 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1155,19 +1155,27 @@ typedef struct GinBuffer
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
  * expect it to be).
+ *
+ * XXX At this point there are no places where "sorted=false" should be
+ * necessary, because we always use merge-sort to combine the old and new
+ * TID list. So maybe we should get rid of the argument entirely.
  */
 static void
-AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
-	for (int i = 0; i < nitems; i++)
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
 	{
-		Assert(ItemPointerIsValid(&items[i]));
+		Assert(ItemPointerIsValid(&buffer->items[i]));
 
 		if ((i == 0) || !sorted)
 			continue;
 
-		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
 	}
 #endif
 }
@@ -1180,12 +1188,25 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 	/* if we have any items, the array must exist */
 	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
 
+	/*
+	 * The buffer may be empty, in which case we must not call the
+	 * check of item pointers, because that assumes non-emptiness.
+	 *
+	 * XXX Would be better to have AssertCheckGinBuffer with flags,
+	 * instead of calling AssertCheckGinBuffer in some placess and then
+	 * AssertCheckItemPointers directly in others.
+	 */
+	if (buffer->nitems == 0)
+		return;
+
 	/*
 	 * we don't know if the TID array is expected to be sorted or not
 	 *
 	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
+	 * XXX actually with the mergesort in GinBufferStoreTuple, we
+	 * should not need 'false' here. See AssertCheckItemPointers.
 	 */
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+	AssertCheckItemPointers(buffer, false);
 #endif
 }
 
@@ -1294,8 +1315,26 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  *		Add data (especially TID list) from a GIN tuple to the buffer.
  *
  * The buffer is expected to be empty (in which case it's initialized), or
- * having the same key. The TID values from the tuple are simply appended
- * to the array, without sorting.
+ * having the same key. The TID values from the tuple are combined with the
+ * stored values using a merge sort.
+ *
+ * The tuples (for the same key) are expected to be sorted by first TID. But
+ * this does not guarantee the lists do not overlap, especially in the leader,
+ * because the workers process interleaving data. Even in a single worker,
+ * lists can overlap - parallel scans require sync-scans, and if a scan wraps
+ * around, one of the lists may be very wide (in terms of TID range).
+ *
+ * But ginMergeItemPointers() is already smart about detecting cases when it
+ * can simply concatenate the lists, and when a full mergesort is needed, and
+ * it does the right thing.
+ *
+ * By keeping the first TID in the GinTuple and sorting by that, we make it
+ * more likely the lists won't overlap very often.
+ *
+ * XXX How frequent can the overlaps be? If the scan does not wrap around,
+ * there should be no overlapping lists, and thus no mergesort. After a
+ * wraparound, there probably can be many - the one list will be very wide,
+ * with a very low and high TID, and all other lists will overlap with it.
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
  * check that's true with an assert.
@@ -1341,33 +1380,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		buffer->items = new;
 		buffer->nitems = nnew;
-	}
 
-	/* we simply append the TID values, so don't check sorting */
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
-}
-
-/* TID comparator for qsort */
-static int
-tid_cmp(const void *a, const void *b)
-{
-	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
-}
-
-/*
- * GinBufferSortItems
- *		Sort the TID values stored in the TID buffer.
- */
-static void
-GinBufferSortItems(GinBuffer *buffer)
-{
-	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
-
-	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
-
-	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
+	}
 }
 
 /*
@@ -1504,7 +1519,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+			AssertCheckItemPointers(buffer, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1514,14 +1529,17 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1624,7 +1642,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer, true);
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
@@ -1638,7 +1656,10 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
@@ -1648,7 +1669,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		GinTuple   *ntup;
 		Size		ntuplen;
 
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer, true);
 
 		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
@@ -1953,6 +1974,7 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
+	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -2036,6 +2058,12 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
  * compared last. The comparisons are done using type-specific sort support
  * functions.
  *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
+ *
  * XXX We might try using memcmp(), based on the assumption that if we get
  * two keys that are two different representations of a logically equal
  * value, it'll get merged by the index build. But it's not clear that's
@@ -2048,6 +2076,7 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 int
 _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 {
+	int			r;
 	Datum		keya,
 				keyb;
 
@@ -2068,10 +2097,13 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 		keya = _gin_parse_tuple(a, NULL);
 		keyb = _gin_parse_tuple(b, NULL);
 
-		return ApplySortComparator(keya, false,
-								   keyb, false,
-								   &ssup[a->attrnum - 1]);
+		r = ApplySortComparator(keya, false,
+								keyb, false,
+								&ssup[a->attrnum - 1]);
+
+		/* if the key is the same, consider the first TID in the array */
+		return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 	}
 
-	return 0;
+	return ItemPointerCompare(&a->first, &b->first);
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 6f529a5aaf0..55dd8544b21 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -13,7 +13,15 @@
 #include "storage/itemptr.h"
 #include "utils/sortsupport.h"
 
-/* XXX do we still need all the fields now that we use SortSupport? */
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ *
+ * XXX do we still need all the fields now that we use SortSupport?
+ */
 typedef struct GinTuple
 {
 	Size		tuplen;			/* length of the whole tuple */
@@ -22,6 +30,7 @@ typedef struct GinTuple
 	bool		typbyval;		/* typbyval for key */
 	OffsetNumber attrnum;		/* attnum of index key */
 	signed char category;		/* category: normal or NULL? */
+	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
-- 
2.47.1

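The payoff of the first-TID ordering introduced above is that merging two
sorted TID lists degenerates into a cheap concatenation whenever their ranges
do not overlap. A rough standalone sketch of that decision, with TIDs as
plain integers and a made-up merge_tids() - the real ginMergeItemPointers()
also removes duplicate TIDs:

#include <string.h>

/* Merge two sorted TID lists into dst (sized for na + nb entries). */
static int
merge_tids(int *dst, const int *a, int na, const int *b, int nb)
{
	if (na == 0 || nb == 0 || a[na - 1] < b[0])
	{
		/* ranges don't overlap - a plain concatenation is enough */
		memcpy(dst, a, na * sizeof(int));
		memcpy(dst + na, b, nb * sizeof(int));
	}
	else
	{
		/* overlapping ranges - fall back to a regular two-way merge */
		int			i = 0,
					j = 0,
					k = 0;

		while (i < na && j < nb)
			dst[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
		while (i < na)
			dst[k++] = a[i++];
		while (j < nb)
			dst[k++] = b[j++];
	}

	return na + nb;
}
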
v20250104-0004-Compress-TID-lists-before-writing-tuples-t.patch
From 384dc719f49e9aa7b45fa376f69b605b272676c0 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:39 +0200
Subject: [PATCH v20250104 04/10] Compress TID lists before writing tuples to
 disk

When serializing GIN tuples to tuplesorts, we can significantly reduce
the amount of data by compressing the TID lists. The GIN opclasses may
produce a lot of data (depending on how many keys are extracted from
each row), and the compression is very effective and efficient.

If the number of different keys is high, the first worker pass may not
benefit from the compression very much - the data will be spilled to
disk before the TID lists can grow long enough for the compression to
actualy help. In the second pass the impact is much more significant.

For real-world data (full-text on mailing list archives), I usually see
the compression save only about ~15% in the first pass, but ~50% in
the second pass.
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 8fd5ffec844..e0b12ecfb1d 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -188,7 +188,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1347,7 +1349,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1383,6 +1386,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer, true);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /*
@@ -1901,6 +1907,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1913,6 +1928,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
  * like endianess etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized in compressed form - it's highly compressible,
+ * and we already have ginCompressPostingList for this purpose. The list may
+ * be pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1926,6 +1946,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -1952,12 +1977,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "unexpected typlen value (%d)", typlen);
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * SHORTALIGN, but we do MAXALIGN anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -2007,37 +2054,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible, the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -2050,6 +2100,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array with the decompressed TIDs.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -2094,8 +2166,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 
 	if (a->category == GIN_CAT_NORM_KEY)
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		r = ApplySortComparator(keya, false,
 								keyb, false,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f159b409b8c..07a7f32f40d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1041,6 +1041,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinShared
 GinState
 GinStatsData
-- 
2.47.1

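The compression pays off because the TID lists are sorted, so the deltas
between consecutive values are small and a variable-byte encoding packs most
of them into one or two bytes. A simplified standalone sketch of the
underlying idea - the actual ginCompressPostingList() packs each ItemPointer
into a 6-byte integer and splits the output into bounded segments, which this
toy version (encode_varbyte, compress_tids) skips:

#include <stdint.h>

/* encode one delta, 7 bits per byte, high bit set means "more follows" */
static int
encode_varbyte(uint64_t val, unsigned char *out)
{
	int			len = 0;

	while (val >= 0x80)
	{
		out[len++] = (unsigned char) (val | 0x80);
		val >>= 7;
	}
	out[len++] = (unsigned char) val;

	return len;
}

/* compress a sorted TID array (TIDs as integers) into out, return length */
static int
compress_tids(const uint64_t *tids, int ntids, unsigned char *out)
{
	uint64_t	prev = 0;
	int			len = 0;

	for (int i = 0; i < ntids; i++)
	{
		/* sorted input makes the deltas small, hence few bytes each */
		len += encode_varbyte(tids[i] - prev, out + len);
		prev = tids[i];
	}

	return len;
}
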
v20250104-0005-Collect-and-print-compression-stats.patch
From 7bbfe011dbaec120bdb67d0b50937bc38f8d5e45 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:43 +0200
Subject: [PATCH v20250104 05/10] Collect and print compression stats

Allows evaluating the benefits of compressing the TID lists.
---
 src/backend/access/gin/gininsert.c | 42 +++++++++++++++++++++++-------
 src/include/access/gin.h           |  2 ++
 2 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e0b12ecfb1d..d6367ec1617 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -191,7 +191,8 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 static ItemPointer _gin_parse_tuple_items(GinTuple *a);
 static Datum _gin_parse_tuple_key(GinTuple *a);
 
-static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+static GinTuple *_gin_build_tuple(GinBuildState *state,
+								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
 								  Size *len);
@@ -554,7 +555,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(buildstate, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
@@ -1204,9 +1205,9 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 	/*
 	 * we don't know if the TID array is expected to be sorted or not
 	 *
-	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
-	 * XXX actually with the mergesort in GinBufferStoreTuple, we
-	 * should not need 'false' here. See AssertCheckItemPointers.
+	 * XXX maybe we can pass that to AssertCheckGinBuffer() call? XXX actually
+	 * with the mergesort in GinBufferStoreTuple, we should not need 'false'
+	 * here. See AssertCheckItemPointers.
 	 */
 	AssertCheckItemPointers(buffer, false);
 #endif
@@ -1624,6 +1625,15 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* sort the raw per-worker data */
 	tuplesort_performsort(state->bs_worker_sort);
 
+	/* print some basic info */
+	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
+	/* reset before the second phase */
+	state->buildStats.sizeCompressed = 0;
+	state->buildStats.sizeRaw = 0;
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
 	 * merge them into larger chunks for the leader to combine.
@@ -1650,7 +1660,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 */
 			AssertCheckItemPointers(buffer, true);
 
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
 									buffer->items, buffer->nitems, &ntuplen);
 
@@ -1677,7 +1687,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 		AssertCheckItemPointers(buffer, true);
 
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
 								buffer->items, buffer->nitems, &ntuplen);
 
@@ -1692,6 +1702,11 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* release all the memory */
 	GinBufferFree(buffer);
 
+	/* print some basic info */
+	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
 	tuplesort_end(worker_sort);
 }
 
@@ -1764,7 +1779,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(state, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
@@ -1858,6 +1873,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* initialize the GIN build state */
 	initGinState(&buildstate.ginstate, indexRel);
 	buildstate.indtuples = 0;
+	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
 	/*
@@ -1935,7 +1951,8 @@ typedef struct
  * of that into the GIN tuple.
  */
 static GinTuple *
-_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+_gin_build_tuple(GinBuildState *state,
+				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
 				 ItemPointerData *items, uint32 nitems,
 				 Size *len)
@@ -2069,6 +2086,13 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 		pfree(seginfo);
 	}
 
+	/* how large would the tuple be without compression? */
+	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		nitems * sizeof(ItemPointerData);
+
+	/* compressed size */
+	state->buildStats.sizeCompressed += tuplen;
+
 	return tuple;
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 2debdac0f43..c1938d0b9c6 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -49,6 +49,8 @@ typedef struct GinStatsData
 	BlockNumber nDataPages;
 	int64		nEntries;
 	int32		ginVersion;
+	Size		sizeRaw;
+	Size		sizeCompressed;
 } GinStatsData;
 
 /*
-- 
2.47.1

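One caveat with the stats patch above: if a worker serialized no data,
sizeRaw stays zero and the ratio printed by the elog() calls comes out as
NaN. A defensive variant might compute the ratio separately, e.g. (an
illustrative sketch reusing the fields this patch adds, not part of the
patch itself):

	double		ratio = (state->buildStats.sizeRaw > 0)
		? (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw
		: 0.0;

	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed, ratio);
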
v20250104-0006-Enforce-memory-limit-when-combining-tuples.patch
From 5db934f1ae23df083958c51f480260b24b069d6f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:46:48 +0200
Subject: [PATCH v20250104 06/10] Enforce memory limit when combining tuples

When combining intermediate results during parallel GIN index build, we
want to restrict the memory usage. In ginBuildCallbackParallel() this is
done simply by dumping working state into tuplesort after hitting the
memory limit.

This commit introduces a memory limit to the following steps, merging the
intermediate results in both worker and leader. The merge only deals
with one key at a time, and the primary risk is the key might have too
many different TIDs. While this is not very likely - the TID array only
needs 6B per item - it's still a potential issue.

We can't simply dump the whole current TID list - the index requires the
TID values to be inserted in the correct order, but if the lists overlap
(as they do between workers), the tail of the list may change during the
mergesort. But thanks to sorting GIN tuples by first TID, we can derive
a safe TID horizon - we know no future tuples will have TIDs from before
this value, so it's safe to output this part of the list.

This commit tracks the "frozen" part of the TID list, which is the part
we know won't change after merging additional TID lists. Then if the TID
list grows too large (more than 64kB), we try to trim it - write out the
frozen part of the list, and discard it from the buffer. We only do the
trimming if there are at least 1024 frozen items - we don't want to write
the data into the index in tiny chunks.

The freezing also allows us to skip the frozen part during mergesort.
The frozen part of the list is known to be fully sorted, so we can just
skip it and mergesort only the rest of the data.

Note: These limits (1024 and 64kB) are mostly arbitrary, but they seem
high enough to get good efficiency for compression/batching, yet low
enough to release memory early and work in small increments.
---
 src/backend/access/gin/gininsert.c | 232 ++++++++++++++++++++++++++++-
 src/include/access/gin.h           |   1 +
 2 files changed, 225 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d6367ec1617..c946c19573d 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1149,8 +1149,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempt to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1228,6 +1232,18 @@ GinBufferInit(Relation index)
 				nKeys;
 	TupleDesc	desc = RelationGetDescr(index);
 
+	/*
+	 * How many items can we fit into the memory limit? We don't want to end
+	 * with too many TIDs. and 64kB seems more than enough. But maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 *
+	 * XXX This is not enough to prevent repeated merges after a wraparound
+	 * of the parallel scan, but it should be enough to make the merges cheap
+	 * because it quickly reaches the end of the second list and can just
+	 * memcpy the rest without walking it item by item.
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
 	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
 
 	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
@@ -1313,6 +1329,54 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (r == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we mean writing out and removing the TIDs that we know
+ * can't change during future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding whether to trim - if the resulting
+ * list (after adding TIDs from the new tuple) would be too long, and if there
+ * are enough TIDs to trim (with values less than "first" TID from the new
+ * tuple), we do the trim. By enough we mean at least 1024 TIDs (mostly an
+ * arbitrary number, matching the check below).
+ *
+ * XXX This does help for the wraparound case too, because the "wide" TID list
+ * is essentially two ranges - one at the beginning of the table, one at the
+ * end. And all the other ranges (from GIN tuples) come in between, and also
+ * do not overlap. So by trimming up to the range we're about to add, this
+ * guarantees we'll be able to "concatenate" the two lists cheaply.
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* We're not going to hit the memory limit after adding this tuple. */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1341,6 +1405,11 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
  * check that's true with an assert.
+ *
+ * XXX Maybe we could/should allocate the buffer once and then keep it
+ * without palloc/pfree. That won't help when just calling the mergesort,
+ * as that does palloc internally, but if we detected the append case,
+ * we could do without it. Not sure how much overhead it is, though.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1369,21 +1438,72 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now search the list linearly, to find the last frozen TID. If the
+	 * whole list turns out to be frozen, this just does nothing.
+	 *
+	 * Start with the first not-yet-frozen TID, and walk until we find the
+	 * first TID that's higher.
+	 *
+	 * XXX Maybe this should do a binary search if the number of "non-frozen"
+	 * items is sufficiently high (enough to make linear search slower than
+	 * binsearch).
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		pfree(new);
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer, true);
 	}
@@ -1422,11 +1542,29 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /*
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
@@ -1494,7 +1632,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem budget to
+	 * combine data - the parallel workers have already completed.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/*
@@ -1536,6 +1679,34 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * We have enough frozen TIDs - write that part of the TID list
+			 * into the index, and remove it from the buffer. The rest of
+			 * the list may still change by future merges.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* discard the frozen TIDs we've just written out */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
@@ -1559,6 +1730,8 @@ _gin_parallel_merge(GinBuildState *state)
 	/* release all the memory */
 	GinBufferFree(buffer);
 
+	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1619,7 +1792,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/* sort the raw per-worker data */
@@ -1672,6 +1851,41 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * We have enough frozen TIDs - write that part of the TID list
+			 * into the tuplesort, and remove it from the buffer. The rest
+			 * of the list may still change by future merges.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the frozen TIDs we've just written out */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1707,6 +1921,8 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
 		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
 
+	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(worker_sort);
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index c1938d0b9c6..9ed3cf97ad0 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -51,6 +51,7 @@ typedef struct GinStatsData
 	int32		ginVersion;
 	Size		sizeRaw;
 	Size		sizeCompressed;
+	int64		nTrims;
 } GinStatsData;
 
 /*
-- 
2.47.1

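The freezing logic above boils down to: given the buffer's sorted TID list
and the first TID of the incoming tuple (the "horizon"), every buffered TID
at or before the horizon can no longer be reordered by future merges and is
safe to write out. A standalone sketch of that computation, with TIDs as
plain integers and a made-up count_frozen():

/*
 * Count how many leading TIDs are "frozen", i.e. cannot be reordered by
 * future merges because they are <= the incoming tuple's first TID.
 */
static int
count_frozen(const int *items, int nitems, int nfrozen, int horizon)
{
	/* fast path: the whole list is at or before the horizon */
	if (nitems > 0 && items[nitems - 1] <= horizon)
		return nitems;

	/* otherwise walk forward from the already-frozen prefix */
	while (nfrozen < nitems && items[nfrozen] <= horizon)
		nfrozen++;

	return nfrozen;
}
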
v20250104-0007-Detect-wrap-around-in-parallel-callback.patch
From 537dd29ac3e8d803f20b0884e287aeb2e1a91409 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 20 Jun 2024 20:50:51 +0200
Subject: [PATCH v20250104 07/10] Detect wrap around in parallel callback

When the sync scan wraps around during the index build, some keys may
end up with very long TID lists, requiring "full" mergesort runs when
combining data in workers. It also causes problems with enforcing the
memory limit, because we can't just dump the data - the index build
requires append-only posting lists, and violating that may result in
errors like

  ERROR: could not split GIN page; all old items didn't fit

because after the scan wraps around, some of the TIDs may belong to the
beginning of the list, affecting the compression.

But we can deal with this in the callback - if we see the TID jump
backwards, a wraparound must have happened. In that case we simply dump
all the data accumulated in memory, and start from scratch.

This means there won't be any tuples with very wide TID ranges, instead
there'll be one tuple with a range at the end of the table, and another
tuple at the beginning. And all the lists in the worker will be
non-overlapping, and sort nicely based on first TID.

For the leader, we still need to do the full merge - the lists may
overlap and interleave in various ways. But there should be only a few
of those lists, about one per worker, so it's not an issue.
---
 src/backend/access/gin/gininsert.c | 132 ++++++++++++++---------------
 1 file changed, 63 insertions(+), 69 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index c946c19573d..8852036d339 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -144,6 +144,7 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
 
 	/* FIXME likely duplicate with indtuples */
 	double		bs_numtuples;
@@ -475,6 +476,47 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginFlushBuildState
+ *		Write all data from BuildAccumulator into the tuplesort.
+ */
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(buildstate, attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
 /*
  * ginBuildCallbackParallel
  *		Callback for the parallel index build.
@@ -499,6 +541,11 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
+ * To detect a wraparound (which can happen with sync scans), we remember the
+ * last TID seen by each worker - if the next TID seen by the worker is lower,
+ * the scan must have wrapped around. We handle that by flushing the current
+ * buildstate to the tuplesort, so that we don't end up with wide TID lists.
+ *
  * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
@@ -515,6 +562,16 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
 
+	/* scan wrapped around - flush accumulated entries and start anew */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+	{
+		elog(LOG, "calling ginFlushBuildState");
+		ginFlushBuildState(buildstate, index);
+	}
+
+	/* remember the TID we're about to process */
+	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
+
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
 		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
 							   values[i], isnull[i], tid);
@@ -533,40 +590,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 	 * maintenance command.
 	 */
 	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the index key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			/* GIN tuple and tuple length that we'll use for tuplesort */
-			GinTuple   *tup;
-			Size		tuplen;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(buildstate, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &tuplen);
-
-			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
-	}
+		ginFlushBuildState(buildstate, index);
 
 	MemoryContextSwitchTo(oldCtx);
 }
@@ -603,6 +627,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.bs_numtuples = 0;
 	buildstate.bs_reltuples = 0;
 	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
@@ -1237,8 +1262,8 @@ GinBufferInit(Relation index)
 	 * with too many TIDs. and 64kB seems more than enough. But maybe this
 	 * should be tied to maintenance_work_mem or something like that?
 	 *
-	 * XXX This is not enough to prevent repeated merges after a wraparound
-	 * of the parallel scan, but it should be enough to make the merges cheap
+	 * XXX This is not enough to prevent repeated merges after a wraparound of
+	 * the parallel scan, but it should be enough to make the merges cheap
 	 * because it quickly reaches the end of the second list and can just
 	 * memcpy the rest without walking it item by item.
 	 */
@@ -1974,39 +1999,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 									   ginBuildCallbackParallel, state, scan);
 
 	/* write remaining accumulated entries */
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&state->accum);
-		while ((list = ginGetBAEntry(&state->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			GinTuple   *tup;
-			Size		len;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(state, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &len);
-
-			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(state->tmpCtx);
-		ginInitBA(&state->accum);
-	}
+	ginFlushBuildState(state, index);
 
 	/*
 	 * Do the first phase of in-worker processing - sort the data produced by
@@ -2091,6 +2084,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	buildstate.indtuples = 0;
 	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/*
 	 * create a temporary memory context that is used to hold data not yet
-- 
2.47.1

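The wraparound detection itself is just an ordering test on consecutive TIDs
coming from the scan. A minimal sketch of the callback-side logic -
BuildState, flush_accumulated() and accumulate() are hypothetical stand-ins
for the real build state and ginFlushBuildState():

#include "postgres.h"
#include "storage/itemptr.h"

/* hypothetical stand-in for the real GinBuildState */
typedef struct BuildState
{
	ItemPointerData last_tid;	/* TID of the previously processed row */
} BuildState;

static void flush_accumulated(BuildState *state); /* like ginFlushBuildState */
static void accumulate(BuildState *state, ItemPointer tid);

static void
on_heap_tuple(BuildState *state, ItemPointer tid)
{
	/* a TID lower than the previous one means the sync scan wrapped */
	if (ItemPointerCompare(tid, &state->last_tid) < 0)
		flush_accumulated(state);

	/* remember the TID we're about to process */
	state->last_tid = *tid;

	accumulate(state, tid);
}
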
v20250104-0008-Use-a-single-GIN-tuplesort.patch
From 5c478d1860281420b714d937f72088e2c90c53b4 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 5 Jul 2024 19:22:32 +0200
Subject: [PATCH v20250104 08/10] Use a single GIN tuplesort

The previous approach was to sort the data in a private sort, then read it
back, merge the GinTuples, and write the result into the shared sort used
by the leader.

The new approach is to use a single sort, merging tuples as we write them to disk.
This reduces temporary disk space usage.

An optimization was added to GinBuffer in which we don't deserialize tuples unless
we need access to the itemIds.

This modifies Tuplesort to have a new flushwrites callback. Sort's writetup can
now decide to buffer writes until the next flushwrites() callback.
---
 src/backend/access/gin/gininsert.c         | 429 +++++++++------------
 src/backend/utils/sort/tuplesort.c         |   5 +
 src/backend/utils/sort/tuplesortvariants.c | 102 ++++-
 src/include/access/gin_private.h           |   3 +
 src/include/access/gin_tuple.h             |  10 +
 src/include/utils/tuplesort.h              |  10 +-
 6 files changed, 308 insertions(+), 251 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 8852036d339..ccf4245b6eb 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -163,14 +163,6 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
-
-	/*
-	 * The sortstate used only within a single worker for the first merge pass
-	 * happenning there. In principle it doesn't need to be part of the build
-	 * state and we could pass it around directly, but it's more convenient
-	 * this way. And it's part of the build state, after all.
-	 */
-	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -195,8 +187,7 @@ static Datum _gin_parse_tuple_key(GinTuple *a);
 static GinTuple *_gin_build_tuple(GinBuildState *state,
 								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
-								  ItemPointerData *items, uint32 nitems,
-								  Size *len);
+								  ItemPointerData *items, uint32 nitems);
 
 /*
  * Adds array of item pointers to tuple's posting list, or
@@ -499,16 +490,15 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
-		Size		tuplen;
 
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
 		tup = _gin_build_tuple(buildstate, attnum, category,
 							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+		tuplesort_putgintuple(buildstate->bs_sortstate, tup);
 
 		pfree(tup);
 	}
@@ -1163,8 +1153,14 @@ _gin_parallel_heapscan(GinBuildState *state)
  * synchronized (and thus may wrap around), and when combininng values from
  * multiple workers.
  */
-typedef struct GinBuffer
+struct GinBuffer
 {
+	/*
+	 * The memory context holds the dynamic allocation of items, key, and any
+	 * produced GinTuples.
+	 */
+	MemoryContext context;
+	GinTuple   *cached; /* copy of previous GIN tuple */
 	OffsetNumber attnum;
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
@@ -1182,7 +1178,7 @@ typedef struct GinBuffer
 	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
-} GinBuffer;
+};
 
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
@@ -1197,8 +1193,7 @@ AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
+	Assert(buffer->nitems == 0 || buffer->items != NULL);
 
 	for (int i = 0; i < buffer->nitems; i++)
 	{
@@ -1218,7 +1213,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* if we have any items, the array must exist */
-	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+	Assert((buffer->nitems == 0) || (buffer->items != NULL));
 
 	/*
 	 * The buffer may be empty, in which case we must not call the
@@ -1249,7 +1244,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
  *
  * Initializes sort support procedures for all index attributes.
  */
-static GinBuffer *
+GinBuffer *
 GinBufferInit(Relation index)
 {
 	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
@@ -1299,15 +1294,18 @@ GinBufferInit(Relation index)
 
 		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
 	}
+	buffer->context = GenerationContextCreate(CurrentMemoryContext,
+											  "Gin Buffer",
+											  ALLOCSET_DEFAULT_SIZES);
 
 	return buffer;
 }
 
 /* Is the buffer empty, i.e. has no TID values in the array? */
-static bool
+bool
 GinBufferIsEmpty(GinBuffer *buffer)
 {
-	return (buffer->nitems == 0);
+	return (buffer->nitems == 0 && buffer->cached == NULL);
 }
 
 /*
@@ -1319,37 +1317,71 @@ GinBufferIsEmpty(GinBuffer *buffer)
 static bool
 GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	int			r;
+	AttrNumber	attnum;
 	Datum		tupkey;
+	Datum		bufkey;
 
 	AssertCheckGinBuffer(buffer);
+	if (buffer->cached)
+	{
+		GinTuple   *cached = buffer->cached;
 
-	if (tup->attrnum != buffer->attnum)
-		return false;
+		if (tup->attrnum != cached->attrnum)
+			return false;
 
-	/* same attribute should have the same type info */
-	Assert(tup->typbyval == buffer->typbyval);
-	Assert(tup->typlen == buffer->typlen);
+		Assert(tup->typbyval == cached->typbyval);
+		Assert(tup->typlen == cached->typlen);
 
-	if (tup->category != buffer->category)
-		return false;
+		if (tup->category != cached->category)
+			return false;
 
-	/*
-	 * For NULL/empty keys, this means equality, for normal keys we need to
-	 * compare the actual key value.
-	 */
-	if (buffer->category != GIN_CAT_NORM_KEY)
-		return true;
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need to
+		 * compare the actual key value.
+		 */
+		if (cached->category != GIN_CAT_NORM_KEY)
+			return true;
+
+		attnum = cached->attrnum;
+		bufkey = _gin_parse_tuple_key(cached);
+	}
+	else
+	{
+		if (tup->attrnum != buffer->attnum)
+			return false;
+
+		/* same attribute should have the same type info */
+		Assert(tup->typbyval == buffer->typbyval);
+		Assert(tup->typlen == buffer->typlen);
+
+		if (tup->category != buffer->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys this means equality; for normal keys we need to
+		 * compare the actual key value.
+		 */
+		if (buffer->category != GIN_CAT_NORM_KEY)
+			return true;
+		attnum = buffer->attnum;
+		bufkey = buffer->key;
+	}
 
 	/*
 	 * For the tuple, get either the first sizeof(Datum) bytes for byval
 	 * types, or a pointer to the beginning of the data array.
 	 */
-	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+	tupkey = _gin_parse_tuple_key(tup);
+
+	prev = MemoryContextSwitchTo(buffer->context);
 
-	r = ApplySortComparator(buffer->key, false,
+	r = ApplySortComparator(bufkey, false,
 							tupkey, false,
-							&buffer->ssup[buffer->attnum - 1]);
+							&buffer->ssup[attnum - 1]);
+
+	MemoryContextSwitchTo(prev);
 
 	return (r == 0);
 }
@@ -1402,6 +1434,55 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 	return true;
 }
 
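+/*
+ * GinBufferUnpackCached
+ *		Unpack the cached (serialized) GinTuple into the buffer fields.
+ *
+ * reserve_space is the number of extra TIDs the items array needs room
+ * for, so that the caller can merge another tuple without a repalloc.
+ */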
+static void
+GinBufferUnpackCached(GinBuffer *buffer, int reserve_space)
+{
+	Datum		key;
+	ItemPointer	items;
+	GinTuple   *cached;
+	int			totitems;
+
+	cached = buffer->cached;
+	totitems = cached->nitems + reserve_space;
+	key = _gin_parse_tuple_key(cached);
+
+	buffer->category = cached->category;
+	buffer->keylen = cached->keylen;
+	buffer->attnum = cached->attrnum;
+
+	buffer->typlen = cached->typlen;
+	buffer->typbyval = cached->typbyval;
+
+	if (cached->category == GIN_CAT_NORM_KEY)
+		buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+	else
+		buffer->key = (Datum) 0;
+
+	items = _gin_parse_tuple_items(cached);
+
+	if (buffer->items == NULL)
+	{
+		buffer->items = palloc0(totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else if (buffer->maxitems < totitems)
+	{
+		buffer->items = repalloc(buffer->items,
+								 totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else
+	{
+		Assert(PointerIsValid(buffer->items) &&
+			   buffer->maxitems >= totitems);
+	}
+	memcpy(buffer->items, items, cached->nitems * sizeof(ItemPointerData));
+	buffer->nitems = cached->nitems;
+
+	buffer->cached = NULL;
+	pfree(cached);
+	pfree(items);
+}
+
 /*
- * GinBufferStoreTuple
+ * GinBufferMergeTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1436,32 +1517,28 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * as that does palloc internally, but if we detected the append case,
  * we could do without it. Not sure how much overhead it is, though.
  */
-static void
-GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+void
+GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	ItemPointerData *items;
-	Datum		key;
 
+	prev = MemoryContextSwitchTo(buffer->context);
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple_key(tup);
-	items = _gin_parse_tuple_items(tup);
-
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
 	{
-		buffer->category = tup->category;
-		buffer->keylen = tup->keylen;
-		buffer->attnum = tup->attrnum;
-
-		buffer->typlen = tup->typlen;
-		buffer->typbyval = tup->typbyval;
-
-		if (tup->category == GIN_CAT_NORM_KEY)
-			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
-		else
-			buffer->key = (Datum) 0;
+		GinTuple   *tuple = palloc(tup->tuplen);
+		memcpy(tuple, tup, tup->tuplen);
+		buffer->cached = tuple;
 	}
+	else if (buffer->cached != NULL)
+	{
+		GinBufferUnpackCached(buffer, tup->nitems);
+	}
+
+	items = _gin_parse_tuple_items(tup);
 
 	/*
	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
@@ -1535,6 +1612,33 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	/* free the decompressed TID list */
 	pfree(items);
+
+	MemoryContextSwitchTo(prev);
+}
+
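+/*
+ * GinBufferBuildTuple
+ *		Serialize the buffer contents into a GinTuple, and reset the buffer.
+ *
+ * If the buffer still holds the cached serialized tuple, return it
+ * directly, otherwise build a new GinTuple from the accumulated TIDs.
+ */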
+GinTuple *
+GinBufferBuildTuple(GinBuffer *buffer)
+{
+	MemoryContext prev = MemoryContextSwitchTo(buffer->context);
+	GinTuple *result;
+
+	if (buffer->cached)
+	{
+		result = buffer->cached;
+		buffer->cached = NULL;
+	}
+	else
+	{
+		result = _gin_build_tuple(NULL, buffer->attnum, buffer->category,
+								  buffer->key, buffer->typlen,
+								  buffer->typbyval, buffer->items,
+								  buffer->nitems);
+	}
+
+	GinBufferReset(buffer);
+
+	MemoryContextSwitchTo(prev);
+	return result;
 }
 
 /*
@@ -1548,14 +1652,21 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
  *
- * XXX Might be better to have a separate memory context for the buffer.
  */
-static void
+void
 GinBufferReset(GinBuffer *buffer)
 {
 	Assert(!GinBufferIsEmpty(buffer));
 
-	/* release byref values, do nothing for by-val ones */
-	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
-		pfree(DatumGetPointer(buffer->key));
+	/* release cached buffer tuple, if present */
+	if (buffer->cached)
+		pfree(buffer->cached);
+	else
+	{
+		/* release byref values, do nothing for by-val ones */
+		if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval
+			&& PointerIsValid(DatumGetPointer(buffer->key)))
+			pfree(DatumGetPointer(buffer->key));
+	}
 
 	/*
	 * Not required, but makes it more likely to trigger NULL dereference if
@@ -1571,6 +1682,7 @@ GinBufferReset(GinBuffer *buffer)
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
+	/* Note that we deliberately don't reset the memory context */
 }
 
 /*
@@ -1594,7 +1706,7 @@ GinBufferTrim(GinBuffer *buffer)
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
  */
-static void
+void
 GinBufferFree(GinBuffer *buffer)
 {
 	if (buffer->items)
@@ -1605,6 +1717,7 @@ GinBufferFree(GinBuffer *buffer)
 		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
 		pfree(DatumGetPointer(buffer->key));
 
+	MemoryContextDelete(buffer->context);
 	pfree(buffer);
 }
 
@@ -1615,10 +1728,10 @@ GinBufferFree(GinBuffer *buffer)
  * Returns true if the buffer is either empty or for the same index key.
  *
  * XXX This could / should also enforce a memory limit by checking the size of
- * the TID array, and returning false if it's too large (more thant work_mem,
+ * the TID array, and returning false if it's too large (more than work_mem,
  * for example).
  */
-static bool
+bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 {
 	/* empty buffer can accept data for any key */
@@ -1695,6 +1808,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1723,6 +1837,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1736,7 +1851,10 @@ _gin_parallel_merge(GinBuildState *state)
 		 * Remember data for the current tuple (either remember the new key,
		 * or append it to the existing data).
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
@@ -1744,6 +1862,7 @@ _gin_parallel_merge(GinBuildState *state)
 	{
 		AssertCheckItemPointers(buffer, true);
 
+		Assert(!PointerIsValid(buffer->cached));
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
 					   buffer->items, buffer->nitems, &state->buildStats);
@@ -1795,162 +1914,6 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
-/*
- * _gin_process_worker_data
- *		First phase of the key merging, happening in the worker.
- *
- * Depending on the number of distinct keys, the TID lists produced by the
- * callback may be very short (due to frequent evictions in the callback).
- * But combining many tiny lists is expensive, so we try to do as much as
- * possible in the workers and only then pass the results to the leader.
- *
- * We read the tuples sorted by the key, and merge them into larger lists.
- * At the moment there's no memory limit, so this will just produce one
- * huge (sorted) list per key in each worker. Which means the leader will
- * do a very limited number of mergesorts, which is good.
- */
-static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-
-	GinBuffer  *buffer;
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The workers are limited to the same amount of memory as during the sort
-	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
-	 * during planning, just like there.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/* sort the raw per-worker data */
-	tuplesort_performsort(state->bs_worker_sort);
-
-	/* print some basic info */
-	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
-		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
-		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
-
-	/* reset before the second phase */
-	state->buildStats.sizeCompressed = 0;
-	state->buildStats.sizeRaw = 0;
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
-	 * merge them into larger chunks for the leader to combine.
-	 */
-	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
-	{
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-
-			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			Assert(buffer->nfrozen > 0);
-
-			state->buildStats.nTrims++;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-
-			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferStoreTuple(buffer, tup);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		GinTuple   *ntup;
-		Size		ntuplen;
-
-		AssertCheckItemPointers(buffer, true);
-
-		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems, &ntuplen);
-
-		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-		pfree(ntup);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	/* print some basic info */
-	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
-		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
-		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
-
-	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
-
-	tuplesort_end(worker_sort);
-}
-
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1983,11 +1946,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													sortmem, coordinate,
 													TUPLESORT_NONE);
 
-	/* Local per-worker sort of raw-data */
-	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
-													  sortmem, NULL,
-													  TUPLESORT_NONE);
-
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -2001,13 +1959,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* write remaining accumulated entries */
 	ginFlushBuildState(state, index);
 
-	/*
-	 * Do the first phase of in-worker processing - sort the data produced by
-	 * the callback, and combine them into much larger chunks and place that
-	 * into the shared tuplestore for leader to process.
-	 */
-	_gin_process_worker_data(state, state->bs_worker_sort);
-
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
@@ -2164,8 +2115,7 @@ static GinTuple *
 _gin_build_tuple(GinBuildState *state,
 				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
-				 ItemPointerData *items, uint32 nitems,
-				 Size *len)
+				 ItemPointerData *items, uint32 nitems)
 {
 	GinTuple   *tuple;
 	char	   *ptr;
@@ -2233,8 +2183,6 @@ _gin_build_tuple(GinBuildState *state,
 	 */
 	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
-	*len = tuplen;
-
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
@@ -2296,12 +2244,15 @@ _gin_build_tuple(GinBuildState *state,
 		pfree(seginfo);
 	}
 
-	/* how large would the tuple be without compression? */
-	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		nitems * sizeof(ItemPointerData);
+	if (state)
+	{
+		/* how large would the tuple be without compression? */
+		state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+									 nitems * sizeof(ItemPointerData);
 
-	/* compressed size */
-	state->buildStats.sizeCompressed += tuplen;
+		/* compressed size */
+		state->buildStats.sizeCompressed += tuplen;
+	}
 
 	return tuple;
 }
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index bda1bffa3cc..35dd9d8ec30 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -395,6 +395,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2244,6 +2245,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2369,6 +2372,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index b65a3bc47bb..38a6ac9ac5d 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "access/gin.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -90,6 +91,7 @@ static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
 static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
 							   SortTuple *stup);
+static void flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape);
 static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -101,6 +103,7 @@ static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static void freestate_cluster(Tuplesortstate *state);
+static void freestate_index_gin(Tuplesortstate *state);
 
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the CLUSTER case.  Set by
@@ -135,6 +138,16 @@ typedef struct
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
 } TuplesortIndexBTreeArg;
 
+/*
+ * Data structure pointed by "TuplesortPublic.arg" for the index_gin subcase.
+ */
+typedef struct
+{
+	TuplesortIndexArg index;
+	GinBuffer *buffer;
+} TuplesortIndexGinArg;
+
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the index_hash subcase.
  */
@@ -209,6 +222,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -285,6 +299,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -393,6 +408,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -472,6 +488,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -516,6 +533,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -571,6 +589,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -590,6 +609,7 @@ tuplesort_begin_index_gin(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   sortopt);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg;
 	MemoryContext oldcontext;
 	int			i;
 	TupleDesc	desc = RelationGetDescr(indexRel);
@@ -614,6 +634,10 @@ tuplesort_begin_index_gin(Relation heapRel,
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
 										   sizeof(SortSupportData));
+	arg = palloc0(sizeof(TuplesortIndexGinArg));
+	arg->index.indexRel = indexRel;
+	arg->index.heapRel = heapRel;
+	arg->buffer = GinBufferInit(indexRel);
 
 	for (i = 0; i < base->nKeys; i++)
 	{
@@ -645,9 +669,11 @@ tuplesort_begin_index_gin(Relation heapRel,
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
 	base->writetup = writetup_index_gin;
+	base->flushwrites = flushwrites_index_gin;
 	base->readtup = readtup_index_gin;
+	base->freestate = freestate_index_gin;
 	base->haveDatum1 = false;
-	base->arg = NULL;
+	base->arg = arg;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -688,6 +714,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -890,17 +917,17 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 }
 
 void
-tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple)
 {
 	SortTuple	stup;
 	GinTuple   *ctup;
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
-	Size		tuplen;
+	Size		tuplen = tuple->tuplen;
 
 	/* copy the GinTuple into the right memory context */
-	ctup = palloc(size);
-	memcpy(ctup, tuple, size);
+	ctup = palloc(tuplen);
+	memcpy(ctup, tuple, tuplen);
 
 	stup.tuple = ctup;
 	stup.datum1 = (Datum) 0;
@@ -908,7 +935,7 @@ tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
 
 	/* GetMemoryChunkSpace is not supported for bump contexts */
 	if (TupleSortUseBumpTupleCxt(base->sortopt))
-		tuplen = MAXALIGN(size);
+		tuplen = MAXALIGN(tuplen);
 	else
 		tuplen = GetMemoryChunkSpace(ctup);
 
@@ -1928,19 +1955,61 @@ comparetup_index_gin(const SortTuple *a, const SortTuple *b,
 }
 
 static void
-writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+_writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, GinTuple *tup)
 {
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
-	GinTuple   *tuple = (GinTuple *) stup->tuple;
-	unsigned int tuplen = tuple->tuplen;
-
+	unsigned int tuplen = tup->tuplen;
+
 	tuplen = tuplen + sizeof(tuplen);
+
 	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
-	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	LogicalTapeWrite(tape, tup, tup->tuplen);
+
 	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 }
 
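+/*
+ * writetup_index_gin
+ *		Write a GIN tuple to the tape, merging tuples for the same key.
+ *
+ * Instead of writing each tuple out immediately, accumulate TIDs for the
+ * same key in the GinBuffer, and write the combined tuple only once a
+ * tuple with a different key arrives. Whatever remains in the buffer at
+ * the end of a run is written out by flushwrites_index_gin.
+ */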
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *otup;
+	GinTuple   *ntup = (GinTuple *) stup->tuple;
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(PointerIsValid(arg));
+
+	if (GinBufferCanAddKey(arg->buffer, ntup))
+	{
+		GinBufferMergeTuple(arg->buffer, ntup);
+		return;
+	}
+
+	otup = GinBufferBuildTuple(arg->buffer);
+
+	_writetup_index_gin(state, tape, otup);
+
+	pfree(otup);
+
+	Assert(GinBufferCanAddKey(arg->buffer, ntup));
+
+	GinBufferMergeTuple(arg->buffer, ntup);
+}
+
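+/*
+ * flushwrites_index_gin
+ *		Write out the GIN tuple still buffered by writetup_index_gin, if any.
+ */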
+static void
+flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	if (!GinBufferIsEmpty(arg->buffer))
+	{
+		GinTuple *tuple = GinBufferBuildTuple(arg->buffer);
+		_writetup_index_gin(state, tape, tuple);
+		pfree(tuple);
+		Assert(GinBufferIsEmpty(arg->buffer));
+	}
+}
+
 static void
 readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 				  LogicalTape *tape, unsigned int len)
@@ -1966,6 +2035,17 @@ readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = (Datum) 0;
 }
 
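+/*
+ * freestate_index_gin
+ *		Release the GinBuffer used to merge tuples in writetup_index_gin.
+ */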
+static void
+freestate_index_gin(Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(arg != NULL);
+	Assert(GinBufferIsEmpty(arg->buffer));
+	GinBufferFree(arg->buffer);
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index dcd1ae3fc34..3faf6c80915 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -475,6 +475,9 @@ extern int	ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len, TI
 
 extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *segment, int len,
 												   int *ndecoded_out);
+extern bool ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+												ItemPointer into, int capacity,
+												int *ndecoded_out);
 extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded_out);
 extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
 										ItemPointerData *b, uint32 nb,
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 55dd8544b21..4ac8cfcc2bf 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -35,6 +35,16 @@ typedef struct GinTuple
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
 
+typedef struct GinBuffer GinBuffer;
+
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
 
+extern GinBuffer *GinBufferInit(Relation index);
+extern bool GinBufferIsEmpty(GinBuffer *buffer);
+extern bool GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup);
+extern void GinBufferReset(GinBuffer *buffer);
+extern void GinBufferFree(GinBuffer *buffer);
+extern void GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup);
+extern GinTuple *GinBufferBuildTuple(GinBuffer *buffer);
+
 #endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..238ef18656b 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -194,6 +194,14 @@ typedef struct
 	 */
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient
+	 * use of the tape's resources, e.g. when deduplicating or merging
+	 * values.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
 
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
@@ -461,7 +469,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
-extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, struct GinTuple *tuple);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
-- 
2.47.1

v20250104-0009-Reduce-the-size-of-GinTuple-by-12-bytes.patchtext/x-patch; charset=UTF-8; name=v20250104-0009-Reduce-the-size-of-GinTuple-by-12-bytes.patchDownload
From b77aef97c8f08eeba4fe3f2801787541c88810d3 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 5 Jul 2024 20:58:37 +0200
Subject: [PATCH v20250104 09/10] Reduce the size of GinTuple by 12 bytes

The size of a GIN tuple can't exceed what we can allocate, which is just shy
of 1GB, so size fields need no more than 30 useful bits and a plain int is
enough here.

Next, a key must fit on a single page (up to 32KB), so uint16 is enough for
the keylen attribute.

Then, re-organize the fields to minimize alignment losses, while keeping an
order that still groups the fields logically.

Finally, get the first stored ItemPointer from the first posting list instead
of storing it separately; this deduplicates stored data, shrinking the tuple
further. In passing, adjust the alignment of the first GinPostingList in
GinTuple from MAXALIGN to SHORTALIGN.
---
 src/backend/access/gin/gininsert.c | 21 ++++++++++++---------
 src/include/access/gin_tuple.h     | 19 +++++++++++++++----
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index ccf4245b6eb..65616e629c0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1555,7 +1555,8 @@ GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 	 * when merging non-overlapping lists, e.g. in each parallel worker.
 	 */
 	if ((buffer->nitems > 0) &&
-		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+							GinTupleGetFirst(tup)) == 0))
 		buffer->nfrozen = buffer->nitems;
 
 	/*
@@ -1572,7 +1573,8 @@ GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
 	{
 		/* Is the TID after the first TID of the new tuple? Can't freeze. */
-		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+		if (ItemPointerCompare(&buffer->items[i],
+							   GinTupleGetFirst(tup)) > 0)
 			break;
 
 		buffer->nfrozen++;
@@ -2181,7 +2183,7 @@ _gin_build_tuple(GinBuildState *state,
 	 * alignment, to allow direct access to compressed segments (those require
-	 * SHORTALIGN, but we do MAXALING anyway).
+	 * SHORTALIGN, which is what we do now).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	/*
 	 * Allocate space for the whole GIN tuple.
@@ -2196,7 +2198,6 @@ _gin_build_tuple(GinBuildState *state,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
-	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -2227,7 +2228,7 @@ _gin_build_tuple(GinBuildState *state,
 	}
 
 	/* finally, copy the TIDs into the array */
-	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
 
 	/* copy in the compressed data, and free the segments */
 	dlist_foreach_modify(iter, &segments)
@@ -2297,8 +2298,8 @@ _gin_parse_tuple_items(GinTuple *a)
 	int			ndecoded;
 	ItemPointer items;
 
-	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	len = a->tuplen - SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
 
 	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
 
@@ -2359,8 +2360,10 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 								&ssup[a->attrnum - 1]);
 
 		/* if the key is the same, consider the first TID in the array */
-		return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
+		return (r != 0) ? r : ItemPointerCompare(GinTupleGetFirst(a),
+												 GinTupleGetFirst(b));
 	}
 
-	return ItemPointerCompare(&a->first, &b->first);
+	return ItemPointerCompare(GinTupleGetFirst(a),
+							  GinTupleGetFirst(b));
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 4ac8cfcc2bf..f4dbdfd3f7f 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -10,10 +10,12 @@
 #ifndef GIN_TUPLE_
 #define GIN_TUPLE_
 
+#include "access/ginblock.h"
 #include "storage/itemptr.h"
 #include "utils/sortsupport.h"
 
 /*
+ * XXX: Update description with new architecture
  * Each worker sees tuples in CTID order, so if we track the first TID and
  * compare that when combining results in the worker, we would not need to
  * do an expensive sort in workers (the mergesort is already smart about
@@ -24,17 +26,26 @@
  */
 typedef struct GinTuple
 {
-	Size		tuplen;			/* length of the whole tuple */
-	Size		keylen;			/* bytes in data for key value */
+	int			tuplen;			/* length of the whole tuple */
+	OffsetNumber attrnum;		/* attnum of index key */
+	uint16		keylen;			/* bytes in data for key value */
 	int16		typlen;			/* typlen for key */
 	bool		typbyval;		/* typbyval for key */
-	OffsetNumber attrnum;		/* attnum of index key */
 	signed char category;		/* category: normal or NULL? */
-	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
 
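+/*
+ * Return a pointer to the first TID stored in the tuple, using the "first"
+ * field of the leading compressed posting list in the data area.
+ */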
+static inline ItemPointer
+GinTupleGetFirst(GinTuple *tup)
+{
+	GinPostingList *list;
+
+	list = (GinPostingList *) SHORTALIGN(tup->data + tup->keylen);
+
+	return &list->first;
+}
+
 typedef struct GinBuffer GinBuffer;
 
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
-- 
2.47.1

v20250104-0010-WIP-parallel-inserts-into-GIN-index.patchtext/x-patch; charset=UTF-8; name=v20250104-0010-WIP-parallel-inserts-into-GIN-index.patchDownload
From a1f629c3a04753b6b2370368afbad6a5faf72380 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Tue, 9 Jul 2024 20:53:20 +0200
Subject: [PATCH v20250104 10/10] WIP: parallel inserts into GIN index

---
 src/backend/access/gin/gininsert.c            | 432 ++++++++++++------
 .../utils/activity/wait_event_names.txt       |   2 +
 2 files changed, 289 insertions(+), 145 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 65616e629c0..c13e14d5714 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -25,7 +25,9 @@
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "pgstat.h"
+#include "storage/barrier.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "storage/predicate.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/datum.h"
@@ -42,6 +44,11 @@
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
 
+/* The phases for parallel builds, used by build_barrier. */
+#define GIN_BUILD_INIT					0
+#define GIN_BUILD_SCAN					1
+#define GIN_BUILD_PARTITION				2
+
 /*
  * Status for index builds performed in parallel.  This is allocated in a
  * dynamic shared memory segment.
@@ -88,6 +95,9 @@ typedef struct GinShared
 	double		reltuples;
 	double		indtuples;
 
+	Barrier		build_barrier;
+	SharedFileSet fileset;		/* space for shared temporary files */
+
 	/*
 	 * ParallelTableScanDescData data follows. Can't directly embed here, as
 	 * implementations of the parallel table scan desc interface might need
@@ -172,7 +182,6 @@ static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relati
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
 static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
-static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
 											  Relation heap, Relation index);
 static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
@@ -189,6 +198,12 @@ static GinTuple *_gin_build_tuple(GinBuildState *state,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems);
 
+static double _gin_partition_sorted_data(GinBuildState *state);
+static void _gin_parallel_insert(GinBuildState *state,
+								 GinShared *ginshared,
+								 Relation heap, Relation index,
+								 bool progress);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -554,10 +569,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	/* scan wrapped around - flush accumulated entries and start anew */
 	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
-	{
-		elog(LOG, "calling ginFlushBuildState");
 		ginFlushBuildState(buildstate, index);
-	}
 
 	/* remember the TID we're about to process */
 	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
@@ -717,8 +729,12 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
 
-		/* scan the relation in parallel and merge per-worker results */
-		reltuples = _gin_parallel_merge(state);
+		/* partition the sorted data */
+		reltuples = _gin_partition_sorted_data(state);
+
+		/* do the insert for the leader's partition */
+		_gin_parallel_insert(state, state->bs_leader->ginshared,
+							 heap, index, true);
 
 		_gin_end_parallel(state->bs_leader, state);
 	}
@@ -1007,6 +1023,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	ginshared->reltuples = 0.0;
 	ginshared->indtuples = 0.0;
 
+	/* used to wait for data to insert */
+	BarrierInit(&ginshared->build_barrier, scantuplesortstates);
+
+	/* Set up the space we'll use for shared temporary files. */
+	SharedFileSetInit(&ginshared->fileset, pcxt->seg);
+
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinShared(ginshared),
 								  snapshot);
@@ -1074,6 +1096,11 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned, leader continues");
 }
 
 /*
@@ -1087,6 +1114,8 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	/* Shutdown worker processes */
 	WaitForParallelWorkersToFinish(ginleader->pcxt);
 
+	SharedFileSetDeleteAll(&ginleader->ginshared->fileset);
+
 	/*
 	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
 	 * or we might get incomplete data.)
@@ -1744,145 +1773,6 @@ GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 	return GinBufferKeyEquals(buffer, tup);
 }
 
-/*
- * Within leader, wait for end of heap scan and merge per-worker results.
- *
- * After waiting for all workers to finish, merge the per-worker results into
- * the complete index. The results from each worker are sorted by block number
- * (start of the page range). While combinig the per-worker results we merge
- * summaries for the same page range, and also fill-in empty summaries for
- * ranges without any tuples.
- *
- * Returns the total number of heap tuples scanned.
- *
- * FIXME Maybe should have local memory contexts similar to what
- * _brin_parallel_merge does?
- */
-static double
-_gin_parallel_merge(GinBuildState *state)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-	double		reltuples = 0;
-	GinBuffer  *buffer;
-
-	/* wait for workers to scan table and produce partial results */
-	reltuples = _gin_parallel_heapscan(state);
-
-	/* do the actual sort in the leader */
-	tuplesort_performsort(state->bs_sortstate);
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The leader is allowed to use the whole maintenance_work_mem buffer to
-	 * combine data. The parallel workers already completed.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by category and
-	 * key. That probably gives us order matching how data is organized in the
-	 * index.
-	 *
-	 * We don't insert the GIN tuples right away, but instead accumulate as
-	 * many TIDs for the same key as possible, and then insert that at once.
-	 * This way we don't need to decompress/recompress the posting lists, etc.
-	 *
-	 * XXX Maybe we should sort by key first, then by category? The idea is
-	 * that if this matches the order of the keys in the index, we'd insert
-	 * the entries in order better matching the index.
-	 */
-	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
-	{
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nitems, &state->buildStats);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			Assert(buffer->nfrozen > 0);
-
-			state->buildStats.nTrims++;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nfrozen, &state->buildStats);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferMergeTuple(buffer, tup);
-
-		if (buffer->cached)
-			GinBufferUnpackCached(buffer, 0);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		AssertCheckItemPointers(buffer, true);
-
-		Assert(!PointerIsValid(buffer->cached));
-		ginEntryInsert(&state->ginstate,
-					   buffer->attnum, buffer->key, buffer->category,
-					   buffer->items, buffer->nitems, &state->buildStats);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
-
-	tuplesort_end(state->bs_sortstate);
-
-	return reltuples;
-}
-
 /*
  * Returns size of shared memory required to store state for a parallel
  * gin index build based on the snapshot its parallel scan will use.
@@ -2066,6 +1956,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Prepare to track buffer usage during parallel execution */
 	InstrStartParallelQuery();
 
+	/* attach to the fileset too */
+	SharedFileSetAttach(&ginshared->fileset, seg);
+
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
 	 * (when requested number of workers were not launched, this will be
@@ -2076,6 +1969,20 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
 
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned by workers, leader continues");
+
+	/* leader sorts and partitions the data */
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned by leader, worker continues");
+
+	_gin_parallel_insert(&buildstate, ginshared, heapRel, indexRel, false);
+
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
@@ -2367,3 +2274,238 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	return ItemPointerCompare(GinTupleGetFirst(a),
 							  GinTupleGetFirst(b));
 }
+
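+/*
+ * _gin_partition_sorted_data
+ *		Distribute the sorted GIN tuples into per-participant files.
+ *
+ * The leader reads the merged output of the shared tuplesort, and writes
+ * the GIN tuples into one temporary file per participant, switching to
+ * the next file after each chunk, so that every participant ends up with
+ * a roughly equal amount of data to insert.
+ */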
+static double
+_gin_partition_sorted_data(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	GinShared  *shared = state->bs_leader->ginshared;
+	BufFile	  **files;
+	int64		fileidx = 0;
+	double		reltuples;
+
+	/* how many tuples per worker */
+	int64		worker_tuples = (state->indtuples / shared->scantuplesortstates) + 1;
+	int64		remaining = Min(worker_tuples, 1000);
+	int64		ntmp = 0;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* Allocate BufFiles, one for each participants. */
+	files = palloc0_array(BufFile *, shared->scantuplesortstates);
+
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		char	fname[MAXPGPATH];
+		sprintf(fname, "worker-%d", i);
+
+		files[i] = BufFileCreateFileSet(&shared->fileset.fs, fname);
+	}
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us order matching how data is organized in the
+	 * index.
+	 *
+	 * We don't insert the GIN tuples here - instead we distribute them into
+	 * the per-participant files in chunks, so that each participant can
+	 * later insert its share of the sorted data independently.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		ntmp++;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * FIXME Maybe move to next partition only when the index key changes?
+		 * Otherwise we might have issues with 'could not fit onto page' when
+		 * adding overlapping TID lists to the index. But maybe it can't with
+		 * the merging of data in the tuplesort?
+		 */
+
+		BufFileWrite(files[fileidx], &tuplen, sizeof(tuplen));
+		BufFileWrite(files[fileidx], tup, tuplen);
+
+		remaining--;
+
+		/* move to the next file */
+		if (remaining == 0)
+		{
+			remaining = Min(worker_tuples, 1000);
+			fileidx++;
+			fileidx = fileidx % shared->scantuplesortstates;
+		}
+	}
+
+	/* close the files */
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		BufFileClose(files[i]);
+	}
+
+	/* and also close the tuplesort */
+	tuplesort_end(state->bs_sortstate);
+
+	/* signal the workers that the data has been partitioned */
+	if (BarrierArriveAndWait(&shared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned, leader continues");
+
+	return reltuples;
+}
+
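+/*
+ * _gin_parallel_insert
+ *		Read the GIN tuples from this participant's partition file, and
+ *		insert them into the index, merging TID lists for the same key.
+ */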
+static void
+_gin_parallel_insert(GinBuildState *state, GinShared *ginshared,
+					 Relation heap, Relation index, bool progress)
+{
+	GinBuffer  *buffer;
+	GinTuple   *tup;
+	Size		len;
+
+	BufFile *file;
+	char	fname[MAXPGPATH];
+	char   *buff;
+	int64	ntuples = 0;
+	Size	maxlen;
+
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * Used by each participant (both the workers and the leader) while
+	 * inserting its partition of the sorted data.
+	 */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	sprintf(fname, "worker-%d", ParallelWorkerNumber + 1);
+	file = BufFileOpenFileSet(&ginshared->fileset.fs, fname, O_RDONLY, false);
+
+	/* 8kB seems like a reasonable starting point */
+	maxlen = 8192;
+	buff = palloc(maxlen);
+
+	while (true)
+	{
+		size_t		ret;
+
+		ret = BufFileRead(file, &len, sizeof(len));
+
+		if (ret == 0)
+			break;
+		if (ret != sizeof(len))
+			elog(ERROR, "incorrect data %zu %zu", ret, sizeof(len));
+
+		/* maybe resize the buffer */
+		if (maxlen < len)
+		{
+			while (maxlen < len)
+				maxlen *= 2;
+
+			buff = repalloc(buff, maxlen);
+		}
+
+		tup = (GinTuple *) buff;
+
+		BufFileReadExact(file, tup, len);
+
+		ntuples++;
+
+		if (ntuples % 100000 == 0)
+			elog(LOG, "inserted " INT64_FORMAT " tuples", ntuples);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer, true);
+
+		Assert(!PointerIsValid(buffer->cached));
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	elog(LOG, "_gin_parallel_insert ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
+	BufFileClose(file);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..37db056bebd 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -115,6 +115,8 @@ CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from star
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
 CHECKPOINT_START	"Waiting for a checkpoint to start."
 EXECUTE_GATHER	"Waiting for activity from a child process while executing a <literal>Gather</literal> plan node."
+GIN_BUILD_PARTITION	"Waiting for the leader to partition the sorted data during a parallel GIN index build."
+GIN_BUILD_SCAN	"Waiting for participants to finish the table scan during a parallel GIN index build."
 HASH_BATCH_ALLOCATE	"Waiting for an elected Parallel Hash participant to allocate a hash table."
 HASH_BATCH_ELECT	"Waiting to elect a Parallel Hash participant to allocate a hash table."
 HASH_BATCH_LOAD	"Waiting for other Parallel Hash participants to finish loading a hash table."
-- 
2.47.1

#43Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#42)
Re: Parallel CREATE INDEX for GIN indexes

On Sat, 4 Jan 2025 at 17:58, Tomas Vondra <tomas@vondra.me> wrote:

On 11/24/24 19:04, Kirill Reshke wrote:

On Tue, 8 Oct 2024 at 17:06, Tomas Vondra <tomas@vondra.me> wrote:

On 10/8/24 04:03, Michael Paquier wrote:

_gin_parallel_build_main() is introduced in 0001. Please make sure to
pass down a query ID.

Thanks for the ping. Here's an updated patch doing that, and also fixing
a couple whitespace issues. No other changes, but I plan to get back to
this patch soon - before the next CF.

regards

--
Tomas Vondra

Hi! I was looking through this series of patches because the thread of
the GIN&GIST amcheck patch references it.

I have spotted this in gininsert.c:
1)

/*
* Store shared tuplesort-private state, for which we reserved space.
* Then, initialize opaque state using tuplesort routine.
*/
sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
tuplesort_initialize_shared(sharedsort, scantuplesortstates,
                            pcxt->seg);

/*
* Store shared tuplesort-private state, for which we reserved space.
* Then, initialize opaque state using tuplesort routine.
*/

Is it necessary to duplicate the entire comment?

Yes, that's a copy-paste mistake. Removed the second comment.

And, while we are here, isn't it " initialize the opaque state "?

Not sure, this is copy-pasted as-is from the nbtree code.

2) typo :
* the TID array, and returning false if it's too large (more thant work_mem,

Fixed.

3) in _gin_build_tuple:

....
else if (typlen == -2)
keylen = strlen(DatumGetPointer(key)) + 1;
else
elog(ERROR, "invalid typlen");

Maybe `elog(ERROR, "invalid typLen: %d", typLen); ` as in `datumGetSize`?

Makes sense, I reworded it a little bit. But it's supposed to be a
can't-happen condition anyway.

4) in _gin_compare_tuples:

if ((a->category == GIN_CAT_NORM_KEY) &&

(b->category == GIN_CAT_NORM_KEY))

maybe just a->category == GIN_CAT_NORM_KEY? a->category is already
equal to b->category because of previous if statements.

True. I've simplified the condition.

5) In _gin_partition_sorted_data:

char fname[128];
sprintf(fname, "worker-%d", i);

Other places use MAXPGPATH in similar cases.

OK, fixed the two places that format worker-%d.

Also, the code `sprintf(fname, "worker-%d",...);` is duplicated. This
might be error-prone. Should we have a macro/inline function for this?

Maybe. I think using a constant might be a good idea, but anything more
complicated is not worth it. There are only two places using it, not
very far apart.

I will take another look later, maybe reporting real problems, not nit-picks.

Thanks. Attached is a rebased patch series fixing those issues, and one
issue I found in an AssertCheckGinBuffer, which was calling the other
assert (AssertCheckItemPointers) even for empty buffers. I think this
part might need some more work, so that it's clear what the various
asserts assume (or rather to allow just calling AssertCheckGinBuffer
everywhere, with some flags).

Thanks for the rebase.

0001

In gininsert.c, I think I'd prefer GinBuildShared over GinShared.
While current GIN infrastructure doesn't do parallel index scans (and
I can't think of an easy way to parallelize it) I think this it's
better to make clear that this isn't related to index scan.

+ * mutex protects all fields before heapdesc.

This comment is still inaccurate.

+ /* FIXME likely duplicate with indtuples */

I think this doesn't have to be duplicate, as we can distinguish
between number of heap tuples and the number of GIN (key, TID) pairs
loaded. This distinction doesn't really exist anywhere else, though,
so to expose this to users we may need changes in
pg_stat_progress_create_index.

While I haven't checked if that distinction is being made in the code,
I think it would be a useful distinction to have.

GinBufferInit

This seems to depend on the btree operator classes to get sortsupport
functions, bypassing the GIN compare support function (support
function 1) and adding a dependency on the btree opclasses for
indexable types. This can cause "bad" ordering, or failure to build
the index when the parallel path is chosen and no default btree
opclass is defined for the type. I think it'd be better if we allowed
users to specify which sortsupport function to use, or at least use
the correct compare function when it's defined on the attribute's
operator class.

include/access/gin_tuple.h
+ OffsetNumber attrnum; /* attnum of index key */

I think this would best be AttrNumber-typed? Looks like I didn't
notice or fix that in 0009.

My plan is to eventually commit the first couple patches, possibly up
0007 or even 0009.

Sounds good. I'll see if I have some time to do some cleanup on my
patches (0008 and 0009), as they need some better polish on the
comments and commit messages.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

#44Tomas Vondra
tomas@vondra.me
In reply to: Matthias van de Meent (#43)
10 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

On 1/6/25 20:13, Matthias van de Meent wrote:

...

Thanks. Attached is a rebased patch series fixing those issues, and one
issue I found in AssertCheckGinBuffer, which was calling the other
assert (AssertCheckItemPointers) even for empty buffers. I think this
part might need some more work, so that it's clear what the various
asserts assume (or rather to allow just calling AssertCheckGinBuffer
everywhere, with some flags).

Thanks for the rebase.

0001

In gininsert.c, I think I'd prefer GinBuildShared over GinShared.
While the current GIN infrastructure doesn't do parallel index scans (and
I can't think of an easy way to parallelize them), I think it's better
to make clear that this isn't related to index scans.

Agreed, renamed to GinBuildShared.

+ * mutex protects all fields before heapdesc.

This comment is still inaccurate.

Hmm, yeah. But this comment originates from btree, so maybe it's wrong
there (and in BRIN too)? I believe it refers to the descriptors stored
after the struct, i.e. it means "all fields after the mutex".
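
If we keep the comment, maybe something along these lines would be
clearer (just a suggestion, not what the patch currently says):

/*
 * mutex protects the fields that follow it, i.e. the mutable state
 * (nparticipantsdone, reltuples, indtuples) updated by the workers.
 */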

+ /* FIXME likely duplicate with indtuples */

I think this doesn't have to be a duplicate, as we can distinguish
between the number of heap tuples and the number of GIN (key, TID) pairs
loaded. This distinction doesn't really exist anywhere else, though,
so to expose this to users we may need changes in
pg_stat_progress_create_index.

While I haven't checked if that distinction is being made in the code,
I think it would be a useful distinction to have.

I haven't done anything about this, but I'm not sure adding the number
of GIN tuples to pg_stat_progress_create_index would be very useful. We
don't know the total number of entries, so it can't show the progress.

GinBufferInit

This seems to depend on the btree operator classes to get sortsupport
functions, bypassing the GIN compare support function (support
function 1) and adding a dependency on the btree opclasses for
indexable types. This can cause "bad" ordering, or failure to build
the index when the parallel path is chosen and no default btree
opclass is defined for the type. I think it'd be better if we allowed
users to specify which sortsupport function to use, or at least use
the correct compare function when it's defined on the attribute's
operator class.

Good point! I fixed this by copying the logic from initGinState.
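
So GinBufferInit now tries the opclass compare proc first, and only falls
back to the default btree comparator when the opclass doesn't define one
(this is the relevant part of the attached 0001):

cmpFunc = index_getprocid(index, i + 1, GIN_COMPARE_PROC);
if (cmpFunc == InvalidOid)
{
    TypeCacheEntry *typentry;

    typentry = lookup_type_cache(att->atttypid,
                                 TYPECACHE_CMP_PROC_FINFO);
    if (!OidIsValid(typentry->cmp_proc_finfo.fn_oid))
        ereport(ERROR,
                (errcode(ERRCODE_UNDEFINED_FUNCTION),
                 errmsg("could not identify a comparison function for type %s",
                        format_type_be(att->atttypid))));

    cmpFunc = typentry->cmp_proc_finfo.fn_oid;
}

PrepareSortSupportComparisonShim(cmpFunc, sortKey);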

include/access/gin_tuple.h
+ OffsetNumber attrnum; /* attnum of index key */

I think this would best be AttrNumber-typed? Looks like I didn't
notice or fix that in 0009.

You're probably right, but I see the GIN code uses OffsetNumber for
attrnum in a number of places. I wonder why that is. I don't think it
can be harmful, because we can't have GIN on system columns, right?
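
For reference, the suggested declaration would simply be (AttrNumber and
OffsetNumber are both 16-bit integers, so the struct layout would not
change):

AttrNumber	attrnum;		/* attnum of index key */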

My plan is to eventually commit the first couple of patches, possibly up
to 0007 or even 0009.

Sounds good. I'll see if I have some time to do some cleanup on my
patches (0008 and 0009), as they need some better polish on the
comments and commit messages.

Thanks!

regards

--
Tomas Vondra

Attachments:

v20250107-0001-Allow-parallel-create-for-GIN-indexes.patch (text/x-patch)
From 47803eb69076d1a036be090f4212668fd6f194c5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Wed, 19 Jun 2024 12:42:24 +0200
Subject: [PATCH v20250107 01/10] Allow parallel create for GIN indexes

Add support for parallel create of GIN indexes, using an approach and
code very similar to the one used by BRIN indexes.

Each worker reads a subset of the table (from a parallel scan), and
accumulates index entries in memory. But instead of writing the results
into the index (after hitting the memory limit), the data are written
to a shared tuplesort (and sorted by index key).

The leader then reads data from the tuplesort, and combines them into
entries that get inserted into the index.
---
 src/backend/access/gin/gininsert.c         | 1464 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  203 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   31 +
 src/include/utils/tuplesort.h              |    8 +
 src/tools/pgindent/typedefs.list           |    4 +
 8 files changed, 1705 insertions(+), 15 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 8e1788dbcf7..e5047038dc8 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,126 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/sortsupport.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinBuildShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all fields before heapdesc.
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinBuildShared;
+
+/*
+ * Return pointer to a GinBuildShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinBuildShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinBuildShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinBuildShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +144,48 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process.
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinBuildShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +464,109 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginBuildCallbackParallel
+ *		Callback for the parallel index build.
+ *
+ * This is very similar to the serial build callback ginBuildCallback,
+ * except that instead of writing the accumulated entries into the index,
+ * we write them into a tuplesort that is then processed by the leader.
+ *
+ * XXX Instead of writing the entries directly into the shared tuplesort,
+ * we might write them into a local one, do a sort in the worker, combine
+ * the results, and only then write the results into the shared tuplesort.
+ * For large tables with many different keys that's going to work better
+ * than the current approach where we don't get many matches in work_mem
+ * (maybe this should use 32MB, which is what we use when planning, but
+ * even that may not be sufficient). Which means we are likely to have
+ * many entries with a small number of TIDs, forcing the leader to merge
+ * the data, often amounting to ~50% of the serial part. By doing the
+ * first sort in the workers, the leader could then do fewer merges with
+ * longer TID lists, which is much cheaper. Also, the amount of data sent
+ * from workers to the leader would be lower.
+ *
+ * The disadvantage is increased disk space usage, possibly up to 2x, if
+ * no entries get combined at the worker level.
+ *
+ * It would be possible to partition the data into multiple tuplesorts
+ * per worker (by hashing) - we don't need the data produced by workers
+ * to be perfectly sorted, and we could even live with multiple entries
+ * for the same key (in case it has multiple binary representations with
+ * distinct hash values).
+ */
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort
+	 *
+	 * XXX It might seem this should set the memory limit to 32MB, same as
+	 * what plan_create_index_workers() uses to calculate the number of
+	 * parallel workers, but that's the limit for tuplesort. So it seems
+	 * better to keep using work_mem here.
+	 *
+	 * XXX But maybe we should calculate this as a per-worker fraction of
+	 * maintenance_work_mem. It's weird to use work_mem here, in a clearly
+	 * maintenance command.
+	 */
+	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&buildstate->accum);
+		while ((list = ginGetBAEntry(&buildstate->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the index key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			/* GIN tuple and tuple length that we'll use for tuplesort */
+			GinTuple   *tup;
+			Size		tuplen;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &tuplen);
+
+			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(buildstate->tmpCtx);
+		ginInitBA(&buildstate->accum);
+	}
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +584,15 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * Initialize all the fields, not to trip valgrind.
+	 *
+	 * XXX Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,24 +633,91 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. For GIN that's
+	 * reasonable too, because we sort the data just like btree. It does
+	 * ignore the memory used to accumulate data in memory (set by work_mem),
+	 * but there is no way to communicate that to plan_create_index_workers.
+	 */
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
+
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, &buildstate, NULL);
+	if (state->bs_leader)
+	{
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(heap, index,
+									  maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+		/* scan the relation in parallel and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
+
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, &buildstate, NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -533,3 +857,1115 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinBuildShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinBuildShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinBuildShared *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+/*
+ * Buffer used to accumulate TIDs from multiple GinTuples for the same key
+ * (we read these from the tuplesort, sorted by the key).
+ *
+ * This is similar to BuildAccumulator in that it's used to collect TIDs
+ * in memory before inserting them into the index, but it's much simpler
+ * as it only deals with a single index key at a time.
+ *
+ * XXX The TID values in the "items" array are not guaranteed to be sorted,
+ * we have to sort them explicitly. This is because parallel scans may be
+ * synchronized (and thus wrap around), and because we combine values from
+ * multiple workers.
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	int			maxitems;
+	SortSupport ssup;			/* for sorting/comparing keys */
+	ItemPointerData *items;
+} GinBuffer;
+
+/*
+ * Check that TID array contains valid values, and that it's sorted (if we
+ * expect it to be).
+ */
+static void
+AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+{
+#ifdef USE_ASSERT_CHECKING
+	for (int i = 0; i < nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&items[i]));
+
+		if ((i == 0) || !sorted)
+			continue;
+
+		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+	}
+#endif
+}
+
+/* basic GinBuffer checks */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	Assert(buffer->nitems <= buffer->maxitems);
+
+	/* if we have any items, the array must exist */
+	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+
+	/*
+	 * we don't know if the TID array is expected to be sorted or not
+	 *
+	 * XXX maybe we can pass that to the AssertCheckGinBuffer() call?
+	 */
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+#endif
+}
+
+/*
+ * Initialize the buffer used to accumulate TIDs for a single key at a time
+ * (we process the data sorted), so we know when we have received all the
+ * data for a given key.
+ *
+ * Initializes sort support procedures for all index attributes.
+ */
+static GinBuffer *
+GinBufferInit(Relation index)
+{
+	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
+	int			i,
+				nKeys;
+	TupleDesc	desc = RelationGetDescr(index);
+
+	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
+
+	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
+
+	/*
+	 * Lookup ordering operator for the index key data type, and initialize
+	 * the sort support function.
+	 */
+	for (i = 0; i < nKeys; i++)
+	{
+		Oid			cmpFunc;
+		SortSupport sortKey = &buffer->ssup[i];
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = index->rd_indcollation[i];
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		/*
+		 * If the compare proc isn't specified in the opclass definition, look
+		 * up the index key type's default btree comparator.
+		 */
+		cmpFunc = index_getprocid(index, i + 1, GIN_COMPARE_PROC);
+		if (cmpFunc == InvalidOid)
+		{
+			TypeCacheEntry *typentry;
+
+			typentry = lookup_type_cache(att->atttypid,
+										 TYPECACHE_CMP_PROC_FINFO);
+			if (!OidIsValid(typentry->cmp_proc_finfo.fn_oid))
+				ereport(ERROR,
+						(errcode(ERRCODE_UNDEFINED_FUNCTION),
+						 errmsg("could not identify a comparison function for type %s",
+								format_type_be(att->atttypid))));
+
+			cmpFunc = typentry->cmp_proc_finfo.fn_oid;
+		}
+
+		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
+	}
+
+	return buffer;
+}
+
+/* Is the buffer empty, i.e. has no TID values in the array? */
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * Check whether the tuple matches the data already accumulated in the GIN
+ * buffer. Compare the scalar fields first, before the actual key.
+ *
+ * Returns true if the key matches, and the TID belongs to the buffer.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	int			r;
+	Datum		tupkey;
+
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	/*
+	 * For NULL/empty keys, this means equality, for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * For the tuple, get either the first sizeof(Datum) bytes for byval
+	 * types, or a pointer to the beginning of the data array.
+	 */
+	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+
+	r = ApplySortComparator(buffer->key, false,
+							tupkey, false,
+							&buffer->ssup[buffer->attnum - 1]);
+
+	return (r == 0);
+}
+
+/*
+ * GinBufferStoreTuple
+ *		Add data (especially TID list) from a GIN tuple to the buffer.
+ *
+ * The buffer is expected to be empty (in which case it's initialized), or
+ * having the same key. The TID values from the tuple are simply appended
+ * to the array, without sorting.
+ *
+ * XXX We expect the tuples to contain sorted TID lists, so maybe we should
+ * check that's true with an assert. And we could also check if the values
+ * are already in sorted order, in which case we can skip the sort later.
+ * But it seems like a waste of time, because it won't be necessary after
+ * switching to mergesort in a later patch, and also because it's reasonable
+ * to expect the arrays to overlap.
+ */
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* enlarge the TID buffer, if needed */
+	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	{
+		/* 64 seems like a good init value */
+		buffer->maxitems = Max(buffer->maxitems, 64);
+
+		while (buffer->nitems + tup->nitems > buffer->maxitems)
+			buffer->maxitems *= 2;
+
+		if (buffer->items == NULL)
+			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 buffer->maxitems * sizeof(ItemPointerData));
+	}
+
+	/* now we should be guaranteed to have enough space for all the TIDs */
+	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+
+	/* copy the new TIDs into the buffer */
+	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
+	buffer->nitems += tup->nitems;
+
+	/* we simply append the TID values, so don't check sorting */
+	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+}
+
+/* TID comparator for qsort */
+static int
+tid_cmp(const void *a, const void *b)
+{
+	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
+}
+
+/*
+ * GinBufferSortItems
+ *		Sort the TID values stored in the TID buffer.
+ */
+static void
+GinBufferSortItems(GinBuffer *buffer)
+{
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
+
+	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+}
+
+/*
+ * GinBufferReset
+ *		Reset the buffer into a state as if it contains no data.
+ *
+ * XXX Should we do something if the array of TIDs gets too large? It may
+ * grow too much, and we'll not free it until the worker finishes building.
+ * But it's better not to let the array grow arbitrarily large, and to
+ * enforce work_mem as a memory limit by flushing the buffer into the
+ * tuplestore.
+ *
+ * XXX Might be better to have a separate memory context for the buffer.
+ */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/*
+	 * Not required, but makes it more likely to trigger a NULL dereference if
+	 * using the value incorrectly, etc.
+	 */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+}
+
+/*
+ * GinBufferCanAddKey
+ *		Check if a given GIN tuple can be added to the current buffer.
+ *
+ * Returns true if the buffer is either empty or for the same index key.
+ *
+ * XXX This could / should also enforce a memory limit by checking the size of
+ * the TID array, and returning false if it's too large (more than work_mem,
+ * for example).
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, merge the per-worker results into
+ * the complete index. The results from each worker are sorted by the index
+ * key. While combining the per-worker results we accumulate the TID lists
+ * for matching keys, so each key is inserted into the index only once, with
+ * the complete (sorted) TID list.
+ *
+ * Returns the total number of heap tuples scanned.
+ *
+ * FIXME Maybe should have local memory contexts similar to what
+ * _brin_parallel_merge does?
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us an order matching how the data is
+	 * organized in the index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert them at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinBufferSortItems(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * Perform a worker's portion of a parallel sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinBuildShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
+													sortmem, coordinate,
+													TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinBuildShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write remaining accumulated entries */
+	{
+		ItemPointerData *list;
+		Datum		key;
+		GinNullCategory category;
+		uint32		nlist;
+		OffsetNumber attnum;
+		TupleDesc	tdesc = RelationGetDescr(index);
+
+		ginBeginBAScan(&state->accum);
+		while ((list = ginGetBAEntry(&state->accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* information about the key */
+			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+			GinTuple   *tup;
+			Size		len;
+
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+
+			tup = _gin_build_tuple(attnum, category,
+								   key, attr->attlen, attr->attbyval,
+								   list, nlist, &len);
+
+			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+
+			pfree(tup);
+		}
+
+		MemoryContextReset(state->tmpCtx);
+		ginInitBA(&state->accum);
+	}
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have an actual non-empty key. We include varlena headers and \0 bytes for
+	 * strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to do the MAXALIGN to allow
+	 * direct access to items pointers.
+	 *
+	 * XXX Note that for byval types we store the whole datum, no matter what
+	 * the typlen value is.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "unexpected typlen value (%d)", typlen);
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * Allocate space for the whole GIN tuple.
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplesort representation.
+ *
+ * Most of the fields are directly accessible; the only parts that need
+ * more care are the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, the key value is
+ * compared last. The comparisons are done using type-specific sort support
+ * functions.
+ *
+ * XXX We might try using memcmp(), based on the assumption that if we get
+ * two keys that are two different representations of a logically equal
+ * value, it'll get merged by the index build. But it's not clear that's
+ * safe, because for keys with multiple binary representations we might
+ * end up with overlapping lists, which might affect performance by requiring
+ * a full merge of the TID lists, and perhaps even failures (e.g. errors like
+ * "could not split GIN page; all old items didn't fit" when inserting data
+ * into the index).
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
+{
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if (a->category == GIN_CAT_NORM_KEY)
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		return ApplySortComparator(keya, false,
+								   keyb, false,
+								   &ssup[a->attrnum - 1]);
+	}
+
+	return 0;
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 2500d16b7bc..f6021e46a13 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 7817bedc2ef..9a5e8a21899 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -148,6 +149,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 913c4ef455e..28a50be82e3 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,10 +20,12 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "catalog/index.h"
+#include "catalog/pg_collation.h"
 #include "executor/executor.h"
 #include "pg_trace.h"
 #include "utils/datum.h"
@@ -46,6 +48,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +78,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +88,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -568,6 +578,82 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+/*
+ * XXX Maybe we should pass the ordering functions, not the heap/index?
+ */
+Tuplesortstate *
+tuplesort_begin_index_gin(Relation heapRel,
+						  Relation indexRel,
+						  int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext;
+	int			i;
+	TupleDesc	desc = RelationGetDescr(indexRel);
+
+	oldcontext = MemoryContextSwitchTo(base->maincontext);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	/*
+	 * Multi-column GIN indexes expand the row into a separate index entry for
+	 * each attribute, and that's what we write into the tuplesort. But we still
+	 * need to initialize sortsupport for all the attributes.
+	 */
+	base->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
+
+	/* Prepare SortSupport data for each column */
+	base->sortKeys = (SortSupport) palloc0(base->nKeys *
+										   sizeof(SortSupportData));
+
+	for (i = 0; i < base->nKeys; i++)
+	{
+		SortSupport sortKey = base->sortKeys + i;
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = indexRel->rd_indcollation[i];
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		/*
+	 * Look for an ordering operator for the index key data type, and then the sort
+		 * support function.
+		 *
+		 * XXX does this use the right opckeytype/opcintype for GIN?
+		 */
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -803,6 +889,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tuple, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -975,6 +1092,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1763,6 +1903,69 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple,
+							   base->sortKeys);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 9ed48dfde4b..2debdac0f43 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 00000000000..6f529a5aaf0
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,31 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations for the GinTuple format used by parallel GIN builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "storage/itemptr.h"
+#include "utils/sortsupport.h"
+
+/* XXX do we still need all the fields now that we use SortSupport? */
+typedef struct GinTuple
+{
+	Size		tuplen;			/* length of the whole tuple */
+	Size		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	OffsetNumber attrnum;		/* attnum of index key */
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index c63f1e5d6da..ef79f259f93 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,10 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(Relation heapRel,
+												 Relation indexRel,
+												 int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +461,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +471,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e1c4f913f84..a23896d42a0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1022,11 +1022,14 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
+GinBuildShared
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1042,6 +1045,7 @@ GinScanOpaqueData
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.47.1

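A note on the tape format used by writetup_index_gin/readtup_index_gin in
0001: each tuple is framed by a leading length word that counts itself plus
the tuple body, with a trailing copy of the length word written only for
random-access sorts (to allow backward scans). Here's a minimal standalone
sketch of that convention - plain FILE* stands in for LogicalTape and the
made-up GinTupleHdr stands in for GinTuple, so this is an illustration of
the framing only, not PostgreSQL code:

#include <stdio.h>
#include <stdlib.h>

typedef struct
{
	size_t		tuplen;			/* total length of the tuple, header included */
} GinTupleHdr;

static void
write_gin_tuple(FILE *tape, const GinTupleHdr *tup, int random_access)
{
	/* the leading length word counts itself plus the tuple body */
	unsigned int len = (unsigned int) (tup->tuplen + sizeof(unsigned int));

	fwrite(&len, sizeof(len), 1, tape);
	fwrite(tup, tup->tuplen, 1, tape);
	if (random_access)			/* trailing length word for backward scans */
		fwrite(&len, sizeof(len), 1, tape);
}

static GinTupleHdr *
read_gin_tuple(FILE *tape, unsigned int len, int random_access)
{
	/* 'len' is the leading length word the caller has already consumed */
	unsigned int tuplen = len - sizeof(unsigned int);
	GinTupleHdr *tup = malloc(tuplen);

	fread(tup, tuplen, 1, tape);
	if (random_access)
	{
		unsigned int trailer;

		fread(&trailer, sizeof(trailer), 1, tape);
	}
	return tup;
}

int
main(void)
{
	FILE	   *tape = tmpfile();
	GinTupleHdr tup = {sizeof(GinTupleHdr)};
	unsigned int len;

	write_gin_tuple(tape, &tup, 1);
	rewind(tape);
	fread(&len, sizeof(len), 1, tape);	/* consume the leading length word */
	free(read_gin_tuple(tape, len, 1));
	fclose(tape);
	return 0;
}
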
Attachment: v20250107-0002-Use-mergesort-in-the-leader-process.patch (text/x-patch)
From 26c454a459ad8fb1fbfd86e5662b85aba5c5ca09 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:02:29 +0200
Subject: [PATCH v20250107 02/10] Use mergesort in the leader process

The leader process (executing the serial part of the index build) spent
a significant part of the time in pg_qsort, after combining the partial
results from the workers. But we can improve this and move some of the
costs to the parallel part in workers - if workers produce sorted TID
lists, the leader can combine them by mergesort.

But to make this really efficient, the mergesort must not be executed
too many times. The workers may easily produce very short TID lists, if
there are many different keys, hitting the memory limit often. So this
adds an intermediate tuplesort pass into each worker, to combine TIDs
for each key and only then write the result into the shared tuplestore.

This means the number of mergesort invocations for each key should be
about the same as the number of workers. We can't really do better, and
it's low enough to keep the mergesort approach efficient.

Note: If we introduce a memory limit on GinBuffer (to not accumulate too
many TIDs in memory), we could end up with more chunks, but it should
not be very common.
---
 src/backend/access/gin/gininsert.c | 200 +++++++++++++++++++++++------
 1 file changed, 162 insertions(+), 38 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e5047038dc8..c65325b2831 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -162,6 +162,14 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate used only within a single worker for the first merge pass
+	 * happening there. In principle it doesn't need to be part of the build
+	 * state and we could pass it around directly, but it's more convenient
+	 * this way. And it's part of the build state, after all.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -472,23 +480,23 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * except that instead of writing the accumulated entries into the index,
  * we write them into a tuplesort that is then processed by the leader.
  *
- * XXX Instead of writing the entries directly into the shared tuplesort,
- * we might write them into a local one, do a sort in the worker, combine
+ * Instead of writing the entries directly into the shared tuplesort, write
+ * them into a local one (in each worker), do a sort in the worker, combine
  * the results, and only then write the results into the shared tuplesort.
  * For large tables with many different keys that's going to work better
  * than the current approach where we don't get many matches in work_mem
  * (maybe this should use 32MB, which is what we use when planning, but
- * even that may not be sufficient). Which means we are likely to have
- * many entries with a small number of TIDs, forcing the leader to merge
- * the data, often amounting to ~50% of the serial part. By doing the
- * first sort workers, the leader then could do fewer merges with longer
- * TID lists, which is much cheaper. Also, the amount of data sent from
- * workers to the leader woiuld be lower.
+ * even that may not be sufficient). Which means we would end up with many
+ * entries with a small number of TIDs, forcing the leader to merge the data,
+ * often amounting to ~50% of the serial part. By doing the first sort in
+ * workers, this work is parallelized and the leader does fewer merges with
+ * longer TID lists, which is much cheaper and more efficient. Also, the
+ * amount of data sent from workers to the leader gets lower.
  *
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
- * It would be possible to partition the data into multiple tuplesorts
+ * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
  * for the same key (in case it has multiple binary representations with
@@ -548,7 +556,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
-			tuplesort_putgintuple(buildstate->bs_sortstate, tup, tuplen);
+			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
 
 			pfree(tup);
 		}
@@ -1140,7 +1148,6 @@ typedef struct GinBuffer
 
 	/* array of TID values */
 	int			nitems;
-	int			maxitems;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1170,8 +1177,6 @@ static void
 AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
-	Assert(buffer->nitems <= buffer->maxitems);
-
 	/* if we have any items, the array must exist */
 	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
 
@@ -1311,11 +1316,7 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  * to the array, without sorting.
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
- * check that's true with an assert. And we could also check if the values
- * are already in sorted order, in which case we can skip the sort later.
- * But it seems like a waste of time, because it won't be unnecessary after
- * switching to mergesort in a later patch, and also because it's reasonable
- * to expect the arrays to overlap.
+ * check that's true with an assert.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1343,28 +1344,22 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
-	/* enlarge the TID buffer, if needed */
-	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
-		/* 64 seems like a good init value */
-		buffer->maxitems = Max(buffer->maxitems, 64);
+		int			nnew;
+		ItemPointer new;
 
-		while (buffer->nitems + tup->nitems > buffer->maxitems)
-			buffer->maxitems *= 2;
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
 
-		if (buffer->items == NULL)
-			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
-		else
-			buffer->items = repalloc(buffer->items,
-									 buffer->maxitems * sizeof(ItemPointerData));
-	}
+		Assert(nnew == buffer->nitems + tup->nitems);
 
-	/* now we should be guaranteed to have enough space for all the TIDs */
-	Assert(buffer->nitems + tup->nitems <= buffer->maxitems);
+		if (buffer->items)
+			pfree(buffer->items);
 
-	/* copy the new TIDs into the buffer */
-	memcpy(&buffer->items[buffer->nitems], items, sizeof(ItemPointerData) * tup->nitems);
-	buffer->nitems += tup->nitems;
+		buffer->items = new;
+		buffer->nitems = nnew;
+	}
 
 	/* we simply append the TID values, so don't check sorting */
 	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
@@ -1428,6 +1423,24 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferFree
+ *		Release memory associated with the GinBuffer (including TID array).
+ */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
 /*
  * GinBufferCanAddKey
  *		Check if a given GIN tuple can be added to the current buffer.
@@ -1509,7 +1522,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1526,7 +1539,7 @@ _gin_parallel_merge(GinBuildState *state)
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1536,6 +1549,9 @@ _gin_parallel_merge(GinBuildState *state)
 		GinBufferReset(buffer);
 	}
 
+	/* release all the memory */
+	GinBufferFree(buffer);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1574,6 +1590,102 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short (due to frequent evictions in the callback).
+ * But combining many tiny lists is expensive, so we try to do as much as
+ * possible in the workers and only then pass the results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker, which means the leader only
+ * needs to do a very limited number of mergesorts.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the local tuplesort, sorted by the key, and
+	 * merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the shared tuplesort first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the shared tuplesort, and start a new entry for
+			 * the current GinTuple.
+			 */
+			GinBufferSortItems(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/* now remember the new key */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		GinBufferSortItems(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -1606,6 +1718,11 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													sortmem, coordinate,
 													TUPLESORT_NONE);
 
+	/* Local per-worker sort of the raw data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
+													  sortmem, NULL,
+													  TUPLESORT_NONE);
+
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1642,7 +1759,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
-			tuplesort_putgintuple(state->bs_sortstate, tup, len);
+			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
 
 			pfree(tup);
 		}
@@ -1651,6 +1768,13 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 		ginInitBA(&state->accum);
 	}
 
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place those
+	 * into the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
-- 
2.47.1

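The gist of 0002 can be shown with a toy program: sort the raw (key, TID)
pairs in the worker, then emit one combined, sorted TID list per key. Here
plain ints stand in for keys and TIDs, qsort for the worker-local tuplesort,
and printf for writing into the shared tuplesort - none of this is the
actual API, just the shape of the first merge pass:

#include <stdio.h>
#include <stdlib.h>

typedef struct { int key; int tid; } Entry;

static int
entry_cmp(const void *a, const void *b)
{
	const Entry *ea = a;
	const Entry *eb = b;

	if (ea->key != eb->key)
		return (ea->key > eb->key) - (ea->key < eb->key);
	return (ea->tid > eb->tid) - (ea->tid < eb->tid);
}

int
main(void)
{
	/* raw per-row entries, as the build callback might produce them */
	Entry	entries[] = {{2, 7}, {1, 4}, {2, 3}, {1, 9}, {1, 1}};
	int		n = sizeof(entries) / sizeof(entries[0]);

	/* the "worker-local tuplesort": order by (key, TID) */
	qsort(entries, n, sizeof(Entry), entry_cmp);

	/* one pass emits a single merged, sorted TID list per key */
	for (int i = 0; i < n; i++)
	{
		if (i == 0 || entries[i].key != entries[i - 1].key)
			printf("%skey %d:", (i == 0) ? "" : "\n", entries[i].key);
		printf(" %d", entries[i].tid);
	}
	printf("\n");
	return 0;
}

The leader then mergesorts a handful of long runs per key instead of
thousands of short ones, which is where the speedup comes from.
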
Attachment: v20250107-0003-Remove-the-explicit-pg_qsort-in-workers.patch (text/x-patch)
From 7504c6e5e99f0f653258d18844704c9b043798ad Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:14:52 +0200
Subject: [PATCH v20250107 03/10] Remove the explicit pg_qsort in workers

We don't need to do the explicit sort before building the GIN tuple,
because the mergesort in GinBufferStoreTuple is already maintaining the
correct order (this was added in an earlier commit).

The commit also adds a field with the first TID, and modifies the
comparator to sort by it (for each key value). This helps workers to
build non-overlapping TID lists and simply append values instead of
having to do the actual mergesort to combine them. This is best-effort,
i.e. it's not guaranteed to eliminate the mergesort - in particular,
parallel scans are synchronized, and thus may start somewhere in the
middle of the table, and wrap around, in which case there may be a very
wide list (with both low and high TID values).

Note: There's an XXX comment with a couple ideas on how to improve this,
at the cost of more complexity.
---
 src/backend/access/gin/gininsert.c | 120 ++++++++++++++++++-----------
 src/include/access/gin_tuple.h     |  11 ++-
 2 files changed, 86 insertions(+), 45 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index c65325b2831..b452a8cae59 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1155,19 +1155,27 @@ typedef struct GinBuffer
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
  * expect it to be).
+ *
+ * XXX At this point there are no places where "sorted=false" should be
+ * necessary, because we always use merge-sort to combine the old and new
+ * TID list. So maybe we should get rid of the argument entirely.
  */
 static void
-AssertCheckItemPointers(ItemPointerData *items, int nitems, bool sorted)
+AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
-	for (int i = 0; i < nitems; i++)
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
 	{
-		Assert(ItemPointerIsValid(&items[i]));
+		Assert(ItemPointerIsValid(&buffer->items[i]));
 
 		if ((i == 0) || !sorted)
 			continue;
 
-		Assert(ItemPointerCompare(&items[i - 1], &items[i]) < 0);
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
 	}
 #endif
 }
@@ -1180,12 +1188,25 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 	/* if we have any items, the array must exist */
 	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
 
+	/*
+	 * The buffer may be empty, in which case we must not call the check of
+	 * item pointers, because that assumes non-emptiness.
+	 *
+	 * XXX Would be better to have AssertCheckGinBuffer with flags, instead of
+	 * calling AssertCheckGinBuffer in some places and then
+	 * AssertCheckItemPointers directly in others.
+	 */
+	if (buffer->nitems == 0)
+		return;
+
 	/*
 	 * we don't know if the TID array is expected to be sorted or not
 	 *
-	 * XXX maybe we can pass that to AssertCheckGinBuffer() call?
+	 * XXX Maybe we can pass that to the AssertCheckGinBuffer() call? Also,
+	 * with the mergesort in GinBufferStoreTuple, we should not need 'false'
+	 * here. See AssertCheckItemPointers.
 	 */
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
+	AssertCheckItemPointers(buffer, false);
 #endif
 }
 
@@ -1312,8 +1333,26 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  *		Add data (especially TID list) from a GIN tuple to the buffer.
  *
  * The buffer is expected to be empty (in which case it's initialized), or
- * having the same key. The TID values from the tuple are simply appended
- * to the array, without sorting.
+ * having the same key. The TID values from the tuple are combined with the
+ * stored values using a merge sort.
+ *
+ * The tuples (for the same key) are expected to be sorted by first TID. But
+ * this does not guarantee the lists do not overlap, especially in the leader,
+ * because the workers process interleaving data. But even in a single worker,
+ * lists can overlap - parallel scans require sync-scans, and if a scan wraps,
+ * one of the lists may be very wide (in terms of TID range).
+ *
+ * But ginMergeItemPointers() is already smart about detecting cases when
+ * it can simply concatenate the lists and when a full mergesort is needed,
+ * and does the right thing.
+ *
+ * By keeping the first TID in the GinTuple and sorting by that, we make it
+ * less likely that the lists overlap.
+ *
+ * XXX How frequent can the overlaps be? If the scan does not wrap around,
+ * there should be no overlapping lists, and thus no mergesort. After a
+ * wraparound, there probably can be many - the one list will be very wide,
+ * with a very low and high TID, and all other lists will overlap with it.
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
  * check that's true with an assert.
@@ -1359,33 +1398,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		buffer->items = new;
 		buffer->nitems = nnew;
-	}
 
-	/* we simply append the TID values, so don't check sorting */
-	AssertCheckItemPointers(buffer->items, buffer->nitems, false);
-}
-
-/* TID comparator for qsort */
-static int
-tid_cmp(const void *a, const void *b)
-{
-	return ItemPointerCompare((ItemPointer) a, (ItemPointer) b);
-}
-
-/*
- * GinBufferSortItems
- *		Sort the TID values stored in the TID buffer.
- */
-static void
-GinBufferSortItems(GinBuffer *buffer)
-{
-	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
-
-	pg_qsort(buffer->items, buffer->nitems, sizeof(ItemPointerData), tid_cmp);
-
-	AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
+	}
 }
 
 /*
@@ -1522,7 +1537,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+			AssertCheckItemPointers(buffer, true);
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1532,14 +1547,17 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
 	if (!GinBufferIsEmpty(buffer))
 	{
-		AssertCheckItemPointers(buffer->items, buffer->nitems, true);
+		AssertCheckItemPointers(buffer, true);
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1642,7 +1660,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 * the data into the insert, and start a new entry for current
 			 * GinTuple.
 			 */
-			GinBufferSortItems(buffer);
+			AssertCheckItemPointers(buffer, true);
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
@@ -1656,7 +1674,10 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
-		/* now remember the new key */
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
 		GinBufferStoreTuple(buffer, tup);
 	}
 
@@ -1666,7 +1687,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		GinTuple   *ntup;
 		Size		ntuplen;
 
-		GinBufferSortItems(buffer);
+		AssertCheckItemPointers(buffer, true);
 
 		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
@@ -1971,6 +1992,7 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
+	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -2054,6 +2076,12 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
  * compared last. The comparisons are done using type-specific sort support
  * functions.
  *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
+ *
  * XXX We might try using memcmp(), based on the assumption that if we get
  * two keys that are two different representations of a logically equal
  * value, it'll get merged by the index build. But it's not clear that's
@@ -2066,6 +2094,7 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 int
 _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 {
+	int			r;
 	Datum		keya,
 				keyb;
 
@@ -2086,10 +2115,13 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 		keya = _gin_parse_tuple(a, NULL);
 		keyb = _gin_parse_tuple(b, NULL);
 
-		return ApplySortComparator(keya, false,
-								   keyb, false,
-								   &ssup[a->attrnum - 1]);
+		r = ApplySortComparator(keya, false,
+								keyb, false,
+								&ssup[a->attrnum - 1]);
+
+		/* if the key is the same, consider the first TID in the array */
+		return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
 	}
 
-	return 0;
+	return ItemPointerCompare(&a->first, &b->first);
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 6f529a5aaf0..55dd8544b21 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -13,7 +13,15 @@
 #include "storage/itemptr.h"
 #include "utils/sortsupport.h"
 
-/* XXX do we still need all the fields now that we use SortSupport? */
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ *
+ * XXX do we still need all the fields now that we use SortSupport?
+ */
 typedef struct GinTuple
 {
 	Size		tuplen;			/* length of the whole tuple */
@@ -22,6 +30,7 @@ typedef struct GinTuple
 	bool		typbyval;		/* typbyval for key */
 	OffsetNumber attrnum;		/* attnum of index key */
 	signed char category;		/* category: normal or NULL? */
+	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
-- 
2.47.1

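The comparator after 0003 is a three-level ordering: attribute number first,
then the key value, then the first TID of the list. A self-contained toy
version (MiniTuple and its int fields are invented for the illustration, not
the actual GinTuple layout):

#include <stdio.h>

typedef struct { int attrnum; int key; int first_tid; } MiniTuple;

static int
mini_cmp(const MiniTuple *a, const MiniTuple *b)
{
	int		r;

	/* multi-column index: attribute number is the top-level key */
	if (a->attrnum != b->attrnum)
		return (a->attrnum > b->attrnum) - (a->attrnum < b->attrnum);

	/* then the key value itself */
	r = (a->key > b->key) - (a->key < b->key);
	if (r != 0)
		return r;

	/* same key: order by first TID, so lists mostly just concatenate */
	return (a->first_tid > b->first_tid) - (a->first_tid < b->first_tid);
}

int
main(void)
{
	MiniTuple	a = {1, 42, 10};
	MiniTuple	b = {1, 42, 500};

	/* same attribute and key, so a's list sorts first: prints -1 */
	printf("%d\n", mini_cmp(&a, &b));
	return 0;
}
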
Attachment: v20250107-0004-Compress-TID-lists-before-writing-tuples-to-disk.patch (text/x-patch)
From 8b0c8ff5dacf3e8e7a1d35f0e68407723670cbd5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:39 +0200
Subject: [PATCH v20250107 04/10] Compress TID lists before writing tuples to
 disk

When serializing GIN tuples to tuplesorts, we can significantly reduce
the amount of data by compressing the TID lists. The GIN opclasses may
produce a lot of data (depending on how many keys are extracted from
each row), and the compression is very effective and efficient.

If the number of different keys is high, the first worker pass may not
benefit from the compression very much - the data will be spilled to
disk before the TID lists can grow long enough for the compression to
actually help. In the second pass the impact is much more significant.

For real-world data (full-text on mailing list archives), I usually see
the compression save only about ~15% in the first pass, but ~50% on
the second pass.
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b452a8cae59..66f489abcab 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -188,7 +188,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1365,7 +1367,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1401,6 +1404,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer, true);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /*
@@ -1919,6 +1925,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1931,6 +1946,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 * like endianness etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized as compressed - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1944,6 +1964,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -1970,12 +1995,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "unexpected typlen value (%d)", typlen);
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * SHORTALIGN, but we do MAXALIGN anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -2025,37 +2072,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible, the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -2068,6 +2118,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array of TIDs decompressed from the tuple.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -2112,8 +2184,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 
 	if (a->category == GIN_CAT_NORM_KEY)
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		r = ApplySortComparator(keya, false,
 								keyb, false,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a23896d42a0..94424d2a05d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1042,6 +1042,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinState
 GinStatsData
 GinTernaryValue
-- 
2.47.1

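To see why the compression in 0004 is so effective: the TID lists are
sorted, so storing deltas in a variable-length encoding makes most items fit
into a byte or two. The sketch below uses a generic varint-of-deltas scheme -
this is not the actual format produced by ginCompressPostingList, just the
principle:

#include <stdio.h>
#include <stdint.h>

/* encode one value as a 7-bits-per-byte varint, return bytes written */
static int
encode_varint(uint64_t v, unsigned char *out)
{
	int		n = 0;

	do
	{
		out[n] = v & 0x7F;
		v >>= 7;
		if (v)
			out[n] |= 0x80;		/* continuation bit */
		n++;
	} while (v);
	return n;
}

int
main(void)
{
	uint64_t	tids[] = {1000, 1002, 1003, 1050, 1051};
	unsigned char buf[64];
	int			len = 0;
	uint64_t	prev = 0;

	/* sorted input, so we can store small deltas instead of raw TIDs */
	for (int i = 0; i < 5; i++)
	{
		len += encode_varint(tids[i] - prev, buf + len);
		prev = tids[i];
	}

	/* prints 6: vs. 30 bytes for five raw 6-byte ItemPointers */
	printf("encoded 5 TIDs in %d bytes\n", len);
	return 0;
}

The longer the runs, the better this works - consistent with the commit
message above, where the second pass (long combined lists) saves ~50% while
the first pass saves only ~15%.
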
Attachment: v20250107-0005-Collect-and-print-compression-stats.patch (text/x-patch)
From ac5bb4302d74efb0d4a69a7bd623206ab5c30e70 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 2 May 2024 15:21:43 +0200
Subject: [PATCH v20250107 05/10] Collect and print compression stats

Allows evaluating the benefits of compressing the TID lists.
---
 src/backend/access/gin/gininsert.c | 36 +++++++++++++++++++++++++-----
 src/include/access/gin.h           |  2 ++
 2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 66f489abcab..1fc67ca1b8f 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -191,7 +191,8 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 static ItemPointer _gin_parse_tuple_items(GinTuple *a);
 static Datum _gin_parse_tuple_key(GinTuple *a);
 
-static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+static GinTuple *_gin_build_tuple(GinBuildState *state,
+								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
 								  Size *len);
@@ -554,7 +555,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(buildstate, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &tuplen);
 
@@ -1642,6 +1643,15 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	/* sort the raw per-worker data */
 	tuplesort_performsort(state->bs_worker_sort);
 
+	/* print some basic info */
+	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
+	/* reset before the second phase */
+	state->buildStats.sizeCompressed = 0;
+	state->buildStats.sizeRaw = 0;
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
 	 * merge them into larger chunks for the leader to combine.
@@ -1668,7 +1678,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			 */
 			AssertCheckItemPointers(buffer, true);
 
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
 									buffer->items, buffer->nitems, &ntuplen);
 
@@ -1695,7 +1705,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 		AssertCheckItemPointers(buffer, true);
 
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
 								buffer->items, buffer->nitems, &ntuplen);
 
@@ -1710,6 +1720,11 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+	/* release all the memory */
 	GinBufferFree(buffer);
 
+	/* print some basic info */
+	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
+		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
+		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
+
 	tuplesort_end(worker_sort);
 }
 
@@ -1782,7 +1797,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 			/* there could be many entries, so be willing to abort here */
 			CHECK_FOR_INTERRUPTS();
 
-			tup = _gin_build_tuple(attnum, category,
+			tup = _gin_build_tuple(state, attnum, category,
 								   key, attr->attlen, attr->attbyval,
 								   list, nlist, &len);
 
@@ -1876,6 +1891,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* initialize the GIN build state */
 	initGinState(&buildstate.ginstate, indexRel);
 	buildstate.indtuples = 0;
+	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
 	/*
@@ -1953,7 +1969,8 @@ typedef struct
  * of that into the GIN tuple.
  */
 static GinTuple *
-_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+_gin_build_tuple(GinBuildState *state,
+				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
 				 ItemPointerData *items, uint32 nitems,
 				 Size *len)
@@ -2087,6 +2104,13 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 		pfree(seginfo);
 	}
 
+	/* how large would the tuple be without compression? */
+	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+		nitems * sizeof(ItemPointerData);
+
+	/* compressed size */
+	state->buildStats.sizeCompressed += tuplen;
+
 	return tuple;
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 2debdac0f43..c1938d0b9c6 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -49,6 +49,8 @@ typedef struct GinStatsData
 	BlockNumber nDataPages;
 	int64		nEntries;
 	int32		ginVersion;
+	Size		sizeRaw;
+	Size		sizeCompressed;
 } GinStatsData;
 
 /*
-- 
2.47.1

Attachment: v20250107-0006-Enforce-memory-limit-when-combining-tuples.patch (text/x-patch)
From b50e5b1600a461cf4ecfc3b98afeaa025aa912b3 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Mon, 24 Jun 2024 01:46:48 +0200
Subject: [PATCH v20250107 06/10] Enforce memory limit when combining tuples

When combining intermediate results during a parallel GIN index build, we
want to restrict the memory usage. In ginBuildCallbackParallel() this is
done simply by dumping working state into tuplesort after hitting the
memory limit.

This commit introduces a memory limit to the following steps, which merge
the intermediate results in both the workers and the leader. The merge
only deals with one key at a time, and the primary risk is that a key
might have too many different TIDs. While that's unlikely - the TID array
only needs 6 bytes per item - it's still a potential issue.

We can't simply dump the whole current TID list - the index requires the
TID values to be inserted in the correct order, but if the lists overlap
(as they do between workers), the tail of the list may change during the
mergesort. But thanks to sorting GIN tuples by first TID, we can derive
a safe TID horizon - we know no future tuples will have TIDs from before
this value, so it's safe to output this part of the list.

This commit tracks "frozen" part of the the TID list, which is the part
we know won't change after merging additional TID lists. Then if the TID
list grows too large (more than 64kB), we try to trim it - write out the
frozen part of the list, and discard it from the buffer. We only do the
trimming if there are at least 1024 frozen items - we don't want to write
the data into the index in tiny chunks.

The freezing also allows us to skip the frozen part during mergesort.
The frozen part of the list is known to be fully sorted, so we can just
skip it and mergesort only the rest of the data.

Note: These limits (1024 and 64kB) are mostly arbitrary, but seem high
enough to get good efficiency for compression/batching, but low enough
to release memory early and work in small increments.
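
A toy version of just the freezing rule described above: because the tuples
for a key arrive sorted by their first TID, every buffered TID at or below
the next tuple's first TID can never be reordered by later merges, and is
safe to flush. Plain ints stand in for ItemPointers, and count_frozen is
invented for the illustration:

#include <stdio.h>

/* how many leading TIDs are frozen, given the next tuple's first TID */
static int
count_frozen(const int *items, int nitems, int next_first)
{
	int		nfrozen = 0;

	while (nfrozen < nitems && items[nfrozen] <= next_first)
		nfrozen++;
	return nfrozen;
}

int
main(void)
{
	int		buffered[] = {3, 8, 15, 40, 90};
	int		incoming_first = 20;	/* first TID of the next sorted tuple */

	/* 3, 8, 15 can be flushed; 40 and 90 might still interleave: prints 3 */
	printf("frozen: %d of 5\n", count_frozen(buffered, 5, incoming_first));
	return 0;
}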
---
 src/backend/access/gin/gininsert.c | 232 ++++++++++++++++++++++++++++-
 src/include/access/gin.h           |   1 +
 2 files changed, 225 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 1fc67ca1b8f..3fbd7383e77 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1149,8 +1149,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempting to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1228,6 +1232,18 @@ GinBufferInit(Relation index)
 				nKeys;
 	TupleDesc	desc = RelationGetDescr(index);
 
+	/*
+	 * How many items can we fit into the memory limit? We don't want to end
+	 * with too many TIDs. and 64kB seems more than enough. But maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 *
+	 * XXX This is not enough to prevent repeated merges after a wraparound of
+	 * the parallel scan, but it should be enough to make the merges cheap
+	 * because it quickly reaches the end of the second list and can just
+	 * memcpy the rest without walking it item by item.
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
 	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
 
 	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
@@ -1331,6 +1347,54 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (r == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we mean writing out and removing the tuple IDs that
+ * we know can't change by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding whether to trim - if the resulting
+ * list (after adding TIDs from the new tuple) would be too long, and if
+ * there are enough TIDs to trim (with values less than the "first" TID from
+ * the new tuple), we do the trim. By enough we mean at least 1024 TIDs (a
+ * mostly arbitrary number).
+ *
+ * XXX This does help for the wraparound case too, because the "wide" TID list
+ * is essentially two ranges - one at the beginning of the table, one at the
+ * end. And all the other ranges (from GIN tuples) come in between, and also
+ * do not overlap. So by trimming up to the range we're about to add, this
+ * guarantees we'll be able to "concatenate" the two lists cheaply.
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* We won't hit the memory limit even after adding this tuple. */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1359,6 +1423,11 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
  *
  * XXX We expect the tuples to contain sorted TID lists, so maybe we should
  * check that's true with an assert.
+ *
+ * XXX Maybe we could/should allocate the buffer once and then keep it
+ * without palloc/pfree. That won't help when just calling the mergesort,
+ * as that does palloc internally, but if we detected the append case,
+ * we could do without it. Not sure how much overhead it is, though.
  */
 static void
 GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
@@ -1387,21 +1456,72 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now search the list linearly, to find the last frozen TID. If we found
+	 * the whole list is frozen, this just does nothing.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher.
+	 *
+	 * XXX Maybe this should do a binary search if the number of "non-frozen"
+	 * items is sufficiently high (enough to make linear search slower than
+	 * binsearch).
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		pfree(new);
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer, true);
 	}
@@ -1440,11 +1560,29 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /*
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
@@ -1512,7 +1650,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem budget to
+	 * combine data, because the parallel workers have already completed.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/*
@@ -1554,6 +1697,34 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The buffer has hit the memory limit - insert the frozen part
+			 * of the TID list into the index, and keep only the unfrozen
+			 * tail in the buffer.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
@@ -1577,6 +1748,8 @@ _gin_parallel_merge(GinBuildState *state)
 	/* relase all the memory */
 	GinBufferFree(buffer);
 
+	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(state->bs_sortstate);
 
 	return reltuples;
@@ -1637,7 +1810,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as the sort in
+	 * ginBuildCallbackParallel. But this probably should be the 32MB
+	 * assumed during planning, for consistency with that code.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/* sort the raw per-worker data */
@@ -1690,6 +1869,41 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * The buffer has hit the memory limit - write the frozen part
+			 * of the TID list into the tuplesort, and keep only the
+			 * unfrozen tail in the buffer.
+			 */
+			AssertCheckItemPointers(buffer, true);
+
+			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
@@ -1725,6 +1939,8 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
 		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
 
+	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
+
 	tuplesort_end(worker_sort);
 }
 
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index c1938d0b9c6..9ed3cf97ad0 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -51,6 +51,7 @@ typedef struct GinStatsData
 	int32		ginVersion;
 	Size		sizeRaw;
 	Size		sizeCompressed;
+	int64		nTrims;
 } GinStatsData;
 
 /*
-- 
2.47.1
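
To make the freeze/trim mechanism above (GinBufferShouldTrim, GinBufferTrim
and the frozen-prefix handling in GinBufferStoreTuple) easier to follow,
here is a minimal standalone sketch of the same idea. It uses plain ints as
stand-ins for ItemPointerData and printf in place of the actual index
insert; the names buf_freeze/buf_trim are illustrative, not from the patch:

#include <stdio.h>
#include <string.h>

#define NBUF 8

static int items[NBUF] = {1, 3, 5, 7, 9, 11, 13, 15};	/* sorted TID list */
static int nitems = NBUF;
static int nfrozen = 0;

/* freeze all TIDs up to the first TID of the incoming tuple */
static void
buf_freeze(int first)
{
	while (nfrozen < nitems && items[nfrozen] <= first)
		nfrozen++;
}

/* flush the frozen prefix (here: just print it) and keep the tail */
static void
buf_trim(void)
{
	for (int i = 0; i < nfrozen; i++)
		printf("flush %d\n", items[i]);

	memmove(&items[0], &items[nfrozen],
			sizeof(int) * (nitems - nfrozen));
	nitems -= nfrozen;
	nfrozen = 0;
}

int
main(void)
{
	/* the incoming tuple starts at TID 8, so items 1..7 can never change */
	buf_freeze(8);
	buf_trim();

	for (int i = 0; i < nitems; i++)
		printf("keep %d\n", items[i]);

	return 0;
}

The point is that once we know the first TID of the next chunk for the same
key, everything at or below it in the sorted buffer can never change, so
it's safe to flush that prefix early and bound the memory usage.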

Attachment: v20250107-0007-Detect-wrap-around-in-parallel-callback.patch (text/x-patch)
From c335765b0ab601084a9286a81474c636fddfbf90 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 20 Jun 2024 20:50:51 +0200
Subject: [PATCH v20250107 07/10] Detect wrap around in parallel callback

When a sync scan during an index build wraps around, some keys may end up
with very long TID lists, requiring "full" merge sort runs when combining
data in workers. It also causes problems with enforcing the memory limit,
because we can't just dump the data - the index build requires append-only
posting lists, and violating that may result in errors like

  ERROR: could not split GIN page; all old items didn't fit

because after the scan wraps around, some of the TIDs may belong to the
beginning of the list, affecting the compression.

But we can deal with this in the callback - if we see the TID jump back,
that must mean a wraparound happened. In that case we simply dump all the
data accumulated in memory, and start from scratch.

This means there won't be any tuples with very wide TID ranges; instead,
there'll be one tuple with a range at the end of the table, and another
tuple at the beginning. All the lists in the worker will be non-overlapping,
and will sort nicely based on the first TID.

For the leader, we still need to do the full merge - the lists may overlap
and interleave in various ways. But there should be only very few of those
lists, about one per worker, so this is not an issue.
---
 src/backend/access/gin/gininsert.c | 128 ++++++++++++++---------------
 1 file changed, 61 insertions(+), 67 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3fbd7383e77..549044174dc 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -144,6 +144,7 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
 
 	/* FIXME likely duplicate with indtuples */
 	double		bs_numtuples;
@@ -475,6 +476,47 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginFlushBuildState
+ *		Write all data from BuildAccumulator into the tuplesort.
+ */
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(buildstate, attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
 /*
  * ginBuildCallbackParallel
  *		Callback for the parallel index build.
@@ -499,6 +541,11 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
  * The disadvantage is increased disk space usage, possibly up to 2x, if
  * no entries get combined at the worker level.
  *
+ * To detect a wraparound (which can happen with sync scans), we remember the
+ * last TID seen by each worker - if the next TID seen by the worker is lower,
+ * the scan must have wrapped around. We handle that by flushing the current
+ * buildstate to the tuplesort, so that we don't end up with wide TID lists.
+ *
  * XXX It would be possible to partition the data into multiple tuplesorts
  * per worker (by hashing) - we don't need the data produced by workers
  * to be perfectly sorted, and we could even live with multiple entries
@@ -515,6 +562,16 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
 
+	/* scan wrapped around - flush accumulated entries and start anew */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+	{
+		elog(LOG, "calling ginFlushBuildState");
+		ginFlushBuildState(buildstate, index);
+	}
+
+	/* remember the TID we're about to process */
+	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
+
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
 		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
 							   values[i], isnull[i], tid);
@@ -533,40 +590,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 	 * maintenance command.
 	 */
 	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&buildstate->accum);
-		while ((list = ginGetBAEntry(&buildstate->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the index key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			/* GIN tuple and tuple length that we'll use for tuplesort */
-			GinTuple   *tup;
-			Size		tuplen;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(buildstate, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &tuplen);
-
-			tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(buildstate->tmpCtx);
-		ginInitBA(&buildstate->accum);
-	}
+		ginFlushBuildState(buildstate, index);
 
 	MemoryContextSwitchTo(oldCtx);
 }
@@ -603,6 +627,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.bs_numtuples = 0;
 	buildstate.bs_reltuples = 0;
 	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
@@ -1992,39 +2017,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 									   ginBuildCallbackParallel, state, scan);
 
 	/* write remaining accumulated entries */
-	{
-		ItemPointerData *list;
-		Datum		key;
-		GinNullCategory category;
-		uint32		nlist;
-		OffsetNumber attnum;
-		TupleDesc	tdesc = RelationGetDescr(index);
-
-		ginBeginBAScan(&state->accum);
-		while ((list = ginGetBAEntry(&state->accum,
-									 &attnum, &key, &category, &nlist)) != NULL)
-		{
-			/* information about the key */
-			Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
-
-			GinTuple   *tup;
-			Size		len;
-
-			/* there could be many entries, so be willing to abort here */
-			CHECK_FOR_INTERRUPTS();
-
-			tup = _gin_build_tuple(state, attnum, category,
-								   key, attr->attlen, attr->attbyval,
-								   list, nlist, &len);
-
-			tuplesort_putgintuple(state->bs_worker_sort, tup, len);
-
-			pfree(tup);
-		}
-
-		MemoryContextReset(state->tmpCtx);
-		ginInitBA(&state->accum);
-	}
+	ginFlushBuildState(state, index);
 
 	/*
 	 * Do the first phase of in-worker processing - sort the data produced by
@@ -2109,6 +2102,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	buildstate.indtuples = 0;
 	/* XXX Shouldn't this initialize the other fields too, like ginbuild()? */
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
 
 	/*
 	 * create a temporary memory context that is used to hold data not yet
-- 
2.47.1
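
The wraparound detection in 0007 boils down to a single comparison in the
callback. Here is a simplified standalone sketch, with block numbers
standing in for full TIDs (the scan array and names are made up for
illustration):

#include <stdio.h>

/* block numbers from a sync scan starting at block 5 of an 8-block table,
 * wrapping around to blocks 0..4 */
static int scan[] = {5, 6, 7, 0, 1, 2, 3, 4};

int
main(void)
{
	int last = -1;

	for (int i = 0; i < 8; i++)
	{
		int tid = scan[i];

		/* TID went backwards => the scan wrapped around, flush everything */
		if (tid < last)
			printf("wraparound at %d -> flush build state\n", tid);

		last = tid;
		printf("accumulate %d\n", tid);
	}

	return 0;
}

In the actual patch the flush goes through ginFlushBuildState(), so the
worker's tuplesort only ever sees TID lists from one monotonic stretch of
the scan.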

Attachment: v20250107-0008-Use-a-single-GIN-tuplesort.patch (text/x-patch)
From 969aa83fb093ae320ed7b4f37a67408896bc4268 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 5 Jul 2024 19:22:32 +0200
Subject: [PATCH v20250107 08/10] Use a single GIN tuplesort

The previous approach was to sort the data in a private per-worker sort, read
it back, merge the GinTuples, and write the merged tuples into the shared
sort for the leader to consume.

The new approach is to use a single sort, merging tuples as we write them to
disk. This reduces temporary disk space usage.

An optimization was added to GinBuffer in which we don't deserialize tuples
unless we need access to the item pointers.

This modifies Tuplesort to have a new flushwrites callback. A sort's writetup
can now decide to buffer writes until the next flushwrites() call.
---
 src/backend/access/gin/gininsert.c         | 429 +++++++++------------
 src/backend/utils/sort/tuplesort.c         |   5 +
 src/backend/utils/sort/tuplesortvariants.c | 102 ++++-
 src/include/access/gin_private.h           |   3 +
 src/include/access/gin_tuple.h             |  10 +
 src/include/utils/tuplesort.h              |  10 +-
 src/tools/pgindent/typedefs.list           |   1 +
 7 files changed, 311 insertions(+), 249 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 549044174dc..d76455f5e74 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -163,14 +163,6 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
-
-	/*
-	 * The sortstate used only within a single worker for the first merge pass
-	 * happenning there. In principle it doesn't need to be part of the build
-	 * state and we could pass it around directly, but it's more convenient
-	 * this way. And it's part of the build state, after all.
-	 */
-	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -195,8 +187,7 @@ static Datum _gin_parse_tuple_key(GinTuple *a);
 static GinTuple *_gin_build_tuple(GinBuildState *state,
 								  OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
-								  ItemPointerData *items, uint32 nitems,
-								  Size *len);
+								  ItemPointerData *items, uint32 nitems);
 
 /*
  * Adds array of item pointers to tuple's posting list, or
@@ -499,16 +490,15 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
-		Size		tuplen;
 
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
 		tup = _gin_build_tuple(buildstate, attnum, category,
 							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+		tuplesort_putgintuple(buildstate->bs_sortstate, tup);
 
 		pfree(tup);
 	}
@@ -1163,8 +1153,14 @@ _gin_parallel_heapscan(GinBuildState *state)
  * synchronized (and thus may wrap around), and when combininng values from
  * multiple workers.
  */
-typedef struct GinBuffer
+struct GinBuffer
 {
+	/*
+	 * The memory context holds the dynamic allocations for the items array,
+	 * the key, and any produced GinTuples.
+	 */
+	MemoryContext context;
+	GinTuple   *cached;			/* copy of previous GIN tuple */
 	OffsetNumber attnum;
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
@@ -1182,7 +1178,7 @@ typedef struct GinBuffer
 	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
-} GinBuffer;
+};
 
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
@@ -1197,8 +1193,7 @@ AssertCheckItemPointers(GinBuffer *buffer, bool sorted)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
+	Assert(buffer->nitems == 0 || buffer->items != NULL);
 
 	for (int i = 0; i < buffer->nitems; i++)
 	{
@@ -1218,7 +1213,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* if we have any items, the array must exist */
-	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+	Assert((buffer->nitems == 0) || (buffer->items != NULL));
 
 	/*
 	 * The buffer may be empty, in which case we must not call the check of
@@ -1249,7 +1244,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
  *
  * Initializes sort support procedures for all index attributes.
  */
-static GinBuffer *
+GinBuffer *
 GinBufferInit(Relation index)
 {
 	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
@@ -1317,15 +1312,18 @@ GinBufferInit(Relation index)
 
 		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
 	}
+	buffer->context = GenerationContextCreate(CurrentMemoryContext,
+											  "Gin Buffer",
+											  ALLOCSET_DEFAULT_SIZES);
 
 	return buffer;
 }
 
 /* Is the buffer empty, i.e. has no TID values in the array? */
-static bool
+bool
 GinBufferIsEmpty(GinBuffer *buffer)
 {
-	return (buffer->nitems == 0);
+	return (buffer->nitems == 0 && buffer->cached == NULL);
 }
 
 /*
@@ -1337,37 +1335,71 @@ GinBufferIsEmpty(GinBuffer *buffer)
 static bool
 GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	int			r;
+	AttrNumber	attnum;
 	Datum		tupkey;
+	Datum		bufkey;
 
 	AssertCheckGinBuffer(buffer);
+	if (buffer->cached)
+	{
+		GinTuple   *cached = buffer->cached;
 
-	if (tup->attrnum != buffer->attnum)
-		return false;
+		if (tup->attrnum != cached->attrnum)
+			return false;
 
-	/* same attribute should have the same type info */
-	Assert(tup->typbyval == buffer->typbyval);
-	Assert(tup->typlen == buffer->typlen);
+		Assert(tup->typbyval == cached->typbyval);
+		Assert(tup->typlen == cached->typlen);
 
-	if (tup->category != buffer->category)
-		return false;
+		if (tup->category != cached->category)
+			return false;
 
-	/*
-	 * For NULL/empty keys, this means equality, for normal keys we need to
-	 * compare the actual key value.
-	 */
-	if (buffer->category != GIN_CAT_NORM_KEY)
-		return true;
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (cached->category != GIN_CAT_NORM_KEY)
+			return true;
+
+		attnum = cached->attrnum;
+		bufkey = _gin_parse_tuple_key(cached);
+	}
+	else
+	{
+		if (tup->attrnum != buffer->attnum)
+			return false;
+
+		/* same attribute should have the same type info */
+		Assert(tup->typbyval == buffer->typbyval);
+		Assert(tup->typlen == buffer->typlen);
+
+		if (tup->category != buffer->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (buffer->category != GIN_CAT_NORM_KEY)
+			return true;
+		attnum = buffer->attnum;
+		bufkey = buffer->key;
+	}
 
 	/*
 	 * For the tuple, get either the first sizeof(Datum) bytes for byval
 	 * types, or a pointer to the beginning of the data array.
 	 */
-	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+	tupkey = _gin_parse_tuple_key(tup);
+
+	prev = MemoryContextSwitchTo(buffer->context);
 
-	r = ApplySortComparator(buffer->key, false,
+	r = ApplySortComparator(bufkey, false,
 							tupkey, false,
-							&buffer->ssup[buffer->attnum - 1]);
+							&buffer->ssup[attnum - 1]);
+
+	MemoryContextSwitchTo(prev);
 
 	return (r == 0);
 }
@@ -1420,6 +1452,56 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 	return true;
 }
 
+static void
+GinBufferUnpackCached(GinBuffer *buffer, int reserve_space)
+{
+	Datum		key;
+	ItemPointer items;
+	GinTuple   *cached;
+	int			totitems;
+
+	cached = buffer->cached;
+	totitems = cached->nitems + reserve_space;
+	key = _gin_parse_tuple_key(cached);
+
+	buffer->category = cached->category;
+	buffer->keylen = cached->keylen;
+	buffer->attnum = cached->attrnum;
+
+	buffer->typlen = cached->typlen;
+	buffer->typbyval = cached->typbyval;
+
+	if (cached->category == GIN_CAT_NORM_KEY)
+		buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+	else
+		buffer->key = (Datum) 0;
+
+	items = _gin_parse_tuple_items(cached);
+
+	if (buffer->items == NULL)
+	{
+		buffer->items = palloc0(totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else if (buffer->maxitems < totitems)
+	{
+		buffer->items = repalloc(buffer->items,
+								 totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else
+	{
+		Assert(PointerIsValid(buffer->items) &&
+			   buffer->maxitems >= totitems);
+	}
+	memcpy(buffer->items, items, cached->nitems * sizeof(ItemPointerData));
+	buffer->nitems = cached->nitems;
+
+	buffer->cached = NULL;
+	pfree(cached);
+	pfree(items);
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1454,33 +1536,30 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * as that does palloc internally, but if we detected the append case,
  * we could do without it. Not sure how much overhead it is, though.
  */
-static void
-GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+void
+GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	ItemPointerData *items;
-	Datum		key;
 
+	prev = MemoryContextSwitchTo(buffer->context);
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple_key(tup);
-	items = _gin_parse_tuple_items(tup);
-
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
 	{
-		buffer->category = tup->category;
-		buffer->keylen = tup->keylen;
-		buffer->attnum = tup->attrnum;
-
-		buffer->typlen = tup->typlen;
-		buffer->typbyval = tup->typbyval;
+		GinTuple   *tuple = palloc(tup->tuplen);
 
-		if (tup->category == GIN_CAT_NORM_KEY)
-			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
-		else
-			buffer->key = (Datum) 0;
+		memcpy(tuple, tup, tup->tuplen);
+		buffer->cached = tuple;
+	}
+	else if (buffer->cached != NULL)
+	{
+		GinBufferUnpackCached(buffer, tup->nitems);
 	}
 
+	items = _gin_parse_tuple_items(tup);
+
 	/*
 	 * Try freeze TIDs at the beginning of the list, i.e. exclude them from
 	 * the mergesort. We can do that with TIDs before the first TID in the new
@@ -1553,6 +1632,33 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	/* free the decompressed TID list */
 	pfree(items);
+
+	MemoryContextSwitchTo(prev);
+}
+
+GinTuple *
+GinBufferBuildTuple(GinBuffer *buffer)
+{
+	MemoryContext prev = MemoryContextSwitchTo(buffer->context);
+	GinTuple   *result;
+
+	if (buffer->cached)
+	{
+		result = buffer->cached;
+		buffer->cached = NULL;
+	}
+	else
+	{
+		result = _gin_build_tuple(NULL, buffer->attnum, buffer->category,
+								  buffer->key, buffer->typlen,
+								  buffer->typbyval, buffer->items,
+								  buffer->nitems);
+	}
+
+	GinBufferReset(buffer);
+
+	MemoryContextSwitchTo(prev);
+	return result;
 }
 
 /*
@@ -1566,14 +1672,21 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
  *
  * XXX Might be better to have a separate memory context for the buffer.
  */
-static void
+void
 GinBufferReset(GinBuffer *buffer)
 {
 	Assert(!GinBufferIsEmpty(buffer));
 
-	/* release byref values, do nothing for by-val ones */
-	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
-		pfree(DatumGetPointer(buffer->key));
+	/* release cached buffer tuple, if present */
+	if (buffer->cached)
+		pfree(buffer->cached);
+	else
+	{
+		/* release byref values, do nothing for by-val ones */
+		if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval
+			&& PointerIsValid(DatumGetPointer(buffer->key)))
+			pfree(DatumGetPointer(buffer->key));
+	}
 
 	/*
 	 * Not required, but makes it more likely to trigger NULL derefefence if
@@ -1589,6 +1702,7 @@ GinBufferReset(GinBuffer *buffer)
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
+	/* Note that we deliberately do not reset the memory context here */
 }
 
 /*
@@ -1612,7 +1726,7 @@ GinBufferTrim(GinBuffer *buffer)
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
  */
-static void
+void
 GinBufferFree(GinBuffer *buffer)
 {
 	if (buffer->items)
@@ -1623,6 +1737,7 @@ GinBufferFree(GinBuffer *buffer)
 		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
 		pfree(DatumGetPointer(buffer->key));
 
+	MemoryContextDelete(buffer->context);
 	pfree(buffer);
 }
 
@@ -1633,10 +1748,10 @@ GinBufferFree(GinBuffer *buffer)
  * Returns true if the buffer is either empty or for the same index key.
  *
  * XXX This could / should also enforce a memory limit by checking the size of
- * the TID array, and returning false if it's too large (more thant work_mem,
+ * the TID array, and returning false if it's too large (more than work_mem,
  * for example).
  */
-static bool
+bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 {
 	/* empty buffer can accept data for any key */
@@ -1713,6 +1828,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1741,6 +1857,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1754,7 +1871,10 @@ _gin_parallel_merge(GinBuildState *state)
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
@@ -1762,6 +1882,7 @@ _gin_parallel_merge(GinBuildState *state)
 	{
 		AssertCheckItemPointers(buffer, true);
 
+		Assert(!PointerIsValid(buffer->cached));
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
 					   buffer->items, buffer->nitems, &state->buildStats);
@@ -1813,162 +1934,6 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
-/*
- * _gin_process_worker_data
- *		First phase of the key merging, happening in the worker.
- *
- * Depending on the number of distinct keys, the TID lists produced by the
- * callback may be very short (due to frequent evictions in the callback).
- * But combining many tiny lists is expensive, so we try to do as much as
- * possible in the workers and only then pass the results to the leader.
- *
- * We read the tuples sorted by the key, and merge them into larger lists.
- * At the moment there's no memory limit, so this will just produce one
- * huge (sorted) list per key in each worker. Which means the leader will
- * do a very limited number of mergesorts, which is good.
- */
-static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-
-	GinBuffer  *buffer;
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The workers are limited to the same amount of memory as during the sort
-	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
-	 * during planning, just like there.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/* sort the raw per-worker data */
-	tuplesort_performsort(state->bs_worker_sort);
-
-	/* print some basic info */
-	elog(LOG, "_gin_parallel_scan_and_build raw %zu compressed %zu ratio %.2f%%",
-		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
-		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
-
-	/* reset before the second phase */
-	state->buildStats.sizeCompressed = 0;
-	state->buildStats.sizeRaw = 0;
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
-	 * merge them into larger chunks for the leader to combine.
-	 */
-	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
-	{
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-
-			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			Assert(buffer->nfrozen > 0);
-
-			state->buildStats.nTrims++;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-
-			ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferStoreTuple(buffer, tup);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		GinTuple   *ntup;
-		Size		ntuplen;
-
-		AssertCheckItemPointers(buffer, true);
-
-		ntup = _gin_build_tuple(state, buffer->attnum, buffer->category,
-								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems, &ntuplen);
-
-		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-		pfree(ntup);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	/* print some basic info */
-	elog(LOG, "_gin_process_worker_data raw %zu compressed %zu ratio %.2f%%",
-		 state->buildStats.sizeRaw, state->buildStats.sizeCompressed,
-		 (100.0 * state->buildStats.sizeCompressed) / state->buildStats.sizeRaw);
-
-	elog(LOG, "_gin_process_worker_data trims " INT64_FORMAT, state->buildStats.nTrims);
-
-	tuplesort_end(worker_sort);
-}
-
 /*
  * Perform a worker's portion of a parallel sort.
  *
@@ -2001,11 +1966,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													sortmem, coordinate,
 													TUPLESORT_NONE);
 
-	/* Local per-worker sort of raw-data */
-	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
-													  sortmem, NULL,
-													  TUPLESORT_NONE);
-
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -2019,13 +1979,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* write remaining accumulated entries */
 	ginFlushBuildState(state, index);
 
-	/*
-	 * Do the first phase of in-worker processing - sort the data produced by
-	 * the callback, and combine them into much larger chunks and place that
-	 * into the shared tuplestore for leader to process.
-	 */
-	_gin_process_worker_data(state, state->bs_worker_sort);
-
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
@@ -2182,8 +2135,7 @@ static GinTuple *
 _gin_build_tuple(GinBuildState *state,
 				 OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
-				 ItemPointerData *items, uint32 nitems,
-				 Size *len)
+				 ItemPointerData *items, uint32 nitems)
 {
 	GinTuple   *tuple;
 	char	   *ptr;
@@ -2251,8 +2203,6 @@ _gin_build_tuple(GinBuildState *state,
 	 */
 	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
-	*len = tuplen;
-
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
@@ -2314,12 +2264,15 @@ _gin_build_tuple(GinBuildState *state,
 		pfree(seginfo);
 	}
 
-	/* how large would the tuple be without compression? */
-	state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
-		nitems * sizeof(ItemPointerData);
+	if (state)
+	{
+		/* how large would the tuple be without compression? */
+		state->buildStats.sizeRaw += MAXALIGN(offsetof(GinTuple, data) + keylen) +
+			nitems * sizeof(ItemPointerData);
 
-	/* compressed size */
-	state->buildStats.sizeCompressed += tuplen;
+		/* compressed size */
+		state->buildStats.sizeCompressed += tuplen;
+	}
 
 	return tuple;
 }
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index bda1bffa3cc..35dd9d8ec30 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -395,6 +395,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2244,6 +2245,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2369,6 +2372,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 28a50be82e3..0b43493b9e8 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "access/gin.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -90,6 +91,7 @@ static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
 static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
 							   SortTuple *stup);
+static void flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape);
 static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -101,6 +103,7 @@ static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static void freestate_cluster(Tuplesortstate *state);
+static void freestate_index_gin(Tuplesortstate *state);
 
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the CLUSTER case.  Set by
@@ -135,6 +138,16 @@ typedef struct
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
 } TuplesortIndexBTreeArg;
 
+/*
+ * Data structure pointed by "TuplesortPublic.arg" for the index_gin subcase.
+ */
+typedef struct
+{
+	TuplesortIndexArg index;
+	GinBuffer  *buffer;
+} TuplesortIndexGinArg;
+
+
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the index_hash subcase.
  */
@@ -209,6 +222,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -285,6 +299,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -393,6 +408,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -472,6 +488,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -516,6 +533,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -571,6 +589,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -590,6 +609,7 @@ tuplesort_begin_index_gin(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   sortopt);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg;
 	MemoryContext oldcontext;
 	int			i;
 	TupleDesc	desc = RelationGetDescr(indexRel);
@@ -614,6 +634,10 @@ tuplesort_begin_index_gin(Relation heapRel,
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
 										   sizeof(SortSupportData));
+	arg = palloc0(sizeof(TuplesortIndexGinArg));
+	arg->index.indexRel = indexRel;
+	arg->index.heapRel = heapRel;
+	arg->buffer = GinBufferInit(indexRel);
 
 	for (i = 0; i < base->nKeys; i++)
 	{
@@ -645,9 +669,11 @@ tuplesort_begin_index_gin(Relation heapRel,
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
 	base->writetup = writetup_index_gin;
+	base->flushwrites = flushwrites_index_gin;
 	base->readtup = readtup_index_gin;
+	base->freestate = freestate_index_gin;
 	base->haveDatum1 = false;
-	base->arg = NULL;
+	base->arg = arg;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -688,6 +714,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -890,17 +917,17 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 }
 
 void
-tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple)
 {
 	SortTuple	stup;
 	GinTuple   *ctup;
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
-	Size		tuplen;
+	Size		tuplen = tuple->tuplen;
 
 	/* copy the GinTuple into the right memory context */
-	ctup = palloc(size);
-	memcpy(ctup, tuple, size);
+	ctup = palloc(tuplen);
+	memcpy(ctup, tuple, tuplen);
 
 	stup.tuple = ctup;
 	stup.datum1 = (Datum) 0;
@@ -908,7 +935,7 @@ tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
 
 	/* GetMemoryChunkSpace is not supported for bump contexts */
 	if (TupleSortUseBumpTupleCxt(base->sortopt))
-		tuplen = MAXALIGN(size);
+		tuplen = MAXALIGN(tuplen);
 	else
 		tuplen = GetMemoryChunkSpace(ctup);
 
@@ -1928,19 +1955,63 @@ comparetup_index_gin(const SortTuple *a, const SortTuple *b,
 }
 
 static void
-writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+_writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, GinTuple *tup)
 {
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
-	GinTuple   *tuple = (GinTuple *) stup->tuple;
-	unsigned int tuplen = tuple->tuplen;
+	unsigned int tuplen = tup->tuplen;
 
 	tuplen = tuplen + sizeof(tuplen);
+
 	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
-	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	LogicalTapeWrite(tape, tup, tup->tuplen);
+
 	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 }
 
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *otup;
+	GinTuple   *ntup = (GinTuple *) stup->tuple;
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(PointerIsValid(arg));
+
+	if (GinBufferCanAddKey(arg->buffer, ntup))
+	{
+		GinBufferMergeTuple(arg->buffer, ntup);
+		return;
+	}
+
+	otup = GinBufferBuildTuple(arg->buffer);
+
+	_writetup_index_gin(state, tape, otup);
+
+	pfree(otup);
+
+	Assert(GinBufferCanAddKey(arg->buffer, ntup));
+
+	GinBufferMergeTuple(arg->buffer, ntup);
+}
+
+static void
+flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	if (!GinBufferIsEmpty(arg->buffer))
+	{
+		GinTuple   *tuple = GinBufferBuildTuple(arg->buffer);
+
+		_writetup_index_gin(state, tape, tuple);
+		pfree(tuple);
+		Assert(GinBufferIsEmpty(arg->buffer));
+	}
+}
+
 static void
 readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 				  LogicalTape *tape, unsigned int len)
@@ -1966,6 +2037,17 @@ readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = (Datum) 0;
 }
 
+static void
+freestate_index_gin(Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(arg != NULL);
+	Assert(GinBufferIsEmpty(arg->buffer));
+	GinBufferFree(arg->buffer);
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index dcd1ae3fc34..3faf6c80915 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -475,6 +475,9 @@ extern int	ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len, TI
 
 extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *segment, int len,
 												   int *ndecoded_out);
+extern bool ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+												ItemPointer into, int capacity,
+												int *ndecoded_out);
 extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded_out);
 extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
 										ItemPointerData *b, uint32 nb,
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 55dd8544b21..4ac8cfcc2bf 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -35,6 +35,16 @@ typedef struct GinTuple
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
 
+typedef struct GinBuffer GinBuffer;
+
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
 
+extern GinBuffer *GinBufferInit(Relation index);
+extern bool GinBufferIsEmpty(GinBuffer *buffer);
+extern bool GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup);
+extern void GinBufferReset(GinBuffer *buffer);
+extern void GinBufferFree(GinBuffer *buffer);
+extern void GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup);
+extern GinTuple *GinBufferBuildTuple(GinBuffer *buffer);
+
 #endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..64176b23cbe 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -195,6 +195,14 @@ typedef struct
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
 
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient use of
+	 * the tape's resources, e.g. when deduplicating or merging values.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
+
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
 	 * the already-read length of the stored tuple.  The tuple is allocated
@@ -461,7 +469,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
-extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, struct GinTuple *tuple);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 94424d2a05d..1296dd6004b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3008,6 +3008,7 @@ TuplesortClusterArg
 TuplesortDatumArg
 TuplesortIndexArg
 TuplesortIndexBTreeArg
+TuplesortIndexGinArg
 TuplesortIndexHashArg
 TuplesortInstrumentation
 TuplesortMethod
-- 
2.47.1
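
The writetup/flushwrites interplay from 0008 can be sketched independently
of tuplesort. Below is a toy version using (key, count) pairs instead of
GinTuples, assuming the input arrives sorted by key; emit() stands in for
the LogicalTapeWrite calls, and all names are illustrative:

#include <stdio.h>
#include <stdbool.h>

/* the pending "tuple": a key plus an accumulated item count */
static int pend_key;
static int pend_count;
static bool pend_valid = false;

static void
emit(int key, int count)
{
	printf("write key=%d nitems=%d\n", key, count);
}

/* writetup: merge into the pending tuple if the key matches, otherwise
 * emit the pending tuple and start buffering the new one */
static void
writetup(int key, int count)
{
	if (pend_valid && key == pend_key)
	{
		pend_count += count;
		return;
	}

	if (pend_valid)
		emit(pend_key, pend_count);

	pend_key = key;
	pend_count = count;
	pend_valid = true;
}

/* flushwrites: emit whatever is still buffered */
static void
flushwrites(void)
{
	if (pend_valid)
		emit(pend_key, pend_count);
	pend_valid = false;
}

int
main(void)
{
	writetup(1, 10);
	writetup(1, 5);				/* merged with the previous call */
	writetup(2, 7);				/* emits key 1, starts buffering key 2 */
	flushwrites();				/* emits key 2 */

	return 0;
}

This is why the flushwrites callback is needed at all: the last buffered
tuple has no successor to force it out, so dumptuples()/mergeonerun() must
explicitly flush at the end of each run.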

Attachment: v20250107-0009-Reduce-the-size-of-GinTuple-by-12-bytes.patch (text/x-patch)
From cba472c7333bbe45f666b1eeecb4b9940bd00a3a Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Fri, 5 Jul 2024 20:58:37 +0200
Subject: [PATCH v20250107 09/10] Reduce the size of GinTuple by 12 bytes

The size of a GIN tuple can't exceed what we can allocate, which is just shy
of 1GB; that leaves at most 30 useful bits in the size fields, so a plain int
is enough here.

Next, a key must fit in a single page (up to 32KB), so uint16 is enough for
the keylen attribute.

Then, re-organize the fields to minimize alignment losses, while keeping an
order that still groups them logically.

Finally, use the first posting list to get the first stored ItemPointer; this
deduplicates stored data and thus improves performance again. In passing, adjust the
alignment of the first GinPostingList in GinTuple from MAXALIGN to SHORTALIGN.
---
 src/backend/access/gin/gininsert.c | 21 ++++++++++++---------
 src/include/access/gin_tuple.h     | 19 +++++++++++++++----
 2 files changed, 27 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d76455f5e74..cb3558ada55 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1575,7 +1575,8 @@ GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 	 * when merging non-overlapping lists, e.g. in each parallel worker.
 	 */
 	if ((buffer->nitems > 0) &&
-		(ItemPointerCompare(&buffer->items[buffer->nitems - 1], &tup->first) == 0))
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+							GinTupleGetFirst(tup)) == 0))
 		buffer->nfrozen = buffer->nitems;
 
 	/*
@@ -1592,7 +1593,8 @@ GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
 	{
 		/* Is the TID after the first TID of the new tuple? Can't freeze. */
-		if (ItemPointerCompare(&buffer->items[i], &tup->first) > 0)
+		if (ItemPointerCompare(&buffer->items[i],
+							   GinTupleGetFirst(tup)) > 0)
 			break;
 
 		buffer->nfrozen++;
@@ -2201,7 +2203,7 @@ _gin_build_tuple(GinBuildState *state,
 	 * alignment, to allow direct access to compressed segments (those require
 	 * SHORTALIGN, but we do MAXALING anyway).
 	 */
-	tuplen = MAXALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	/*
 	 * Allocate space for the whole GIN tuple.
@@ -2216,7 +2218,6 @@ _gin_build_tuple(GinBuildState *state,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
-	tuple->first = items[0];
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -2247,7 +2248,7 @@ _gin_build_tuple(GinBuildState *state,
 	}
 
 	/* finally, copy the TIDs into the array */
-	ptr = (char *) tuple + MAXALIGN(offsetof(GinTuple, data) + keylen);
+	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
 
 	/* copy in the compressed data, and free the segments */
 	dlist_foreach_modify(iter, &segments)
@@ -2317,8 +2318,8 @@ _gin_parse_tuple_items(GinTuple *a)
 	int			ndecoded;
 	ItemPointer items;
 
-	len = a->tuplen - MAXALIGN(offsetof(GinTuple, data) + a->keylen);
-	ptr = (char *) a + MAXALIGN(offsetof(GinTuple, data) + a->keylen);
+	len = a->tuplen - SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
 
 	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
 
@@ -2379,8 +2380,10 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 								&ssup[a->attrnum - 1]);
 
 		/* if the key is the same, consider the first TID in the array */
-		return (r != 0) ? r : ItemPointerCompare(&a->first, &b->first);
+		return (r != 0) ? r : ItemPointerCompare(GinTupleGetFirst(a),
+												 GinTupleGetFirst(b));
 	}
 
-	return ItemPointerCompare(&a->first, &b->first);
+	return ItemPointerCompare(GinTupleGetFirst(a),
+							  GinTupleGetFirst(b));
 }
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index 4ac8cfcc2bf..f4dbdfd3f7f 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -10,10 +10,12 @@
 #ifndef GIN_TUPLE_
 #define GIN_TUPLE_
 
+#include "access/ginblock.h"
 #include "storage/itemptr.h"
 #include "utils/sortsupport.h"
 
 /*
+ * XXX: Update description with new architecture
  * Each worker sees tuples in CTID order, so if we track the first TID and
  * compare that when combining results in the worker, we would not need to
  * do an expensive sort in workers (the mergesort is already smart about
@@ -24,17 +26,26 @@
  */
 typedef struct GinTuple
 {
-	Size		tuplen;			/* length of the whole tuple */
-	Size		keylen;			/* bytes in data for key value */
+	int			tuplen;			/* length of the whole tuple */
+	OffsetNumber attrnum;		/* attnum of index key */
+	uint16		keylen;			/* bytes in data for key value */
 	int16		typlen;			/* typlen for key */
 	bool		typbyval;		/* typbyval for key */
-	OffsetNumber attrnum;		/* attnum of index key */
 	signed char category;		/* category: normal or NULL? */
-	ItemPointerData first;		/* first TID in the array */
 	int			nitems;			/* number of TIDs in the data */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
 
+static inline ItemPointer
+GinTupleGetFirst(GinTuple *tup)
+{
+	GinPostingList *list;
+
+	list = (GinPostingList *) SHORTALIGN(tup->data + tup->keylen);
+
+	return &list->first;
+}
+
 typedef struct GinBuffer GinBuffer;
 
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
-- 
2.47.1
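
For 0009, here is a rough before/after of the GinTuple header layout, with
plain C types standing in for the PostgreSQL typedefs. This is only meant
to demonstrate the effect of shrinking and reordering the fields - the
exact saving depends on platform padding and on whether you count sizeof()
or the offset of the data array, so the printed numbers need not match the
12-byte figure exactly:

#include <stdio.h>
#include <stdint.h>

/* the old layout, per the removed lines in the diff */
typedef struct OldHeader
{
	size_t		tuplen;			/* 8 bytes */
	size_t		keylen;			/* 8 bytes */
	int16_t		typlen;
	char		typbyval;
	uint16_t	attrnum;
	signed char category;
	char		first[6];		/* ItemPointerData, now derived instead */
	int			nitems;
} OldHeader;

/* the new layout: smaller fields, ordered to minimize padding */
typedef struct NewHeader
{
	int			tuplen;			/* 4 bytes is enough below 1GB */
	uint16_t	attrnum;
	uint16_t	keylen;			/* keys fit in a 32KB page */
	int16_t		typlen;
	char		typbyval;
	signed char category;
	int			nitems;			/* "first" now read from the posting list */
} NewHeader;

int
main(void)
{
	printf("old: %zu bytes, new: %zu bytes\n",
		   sizeof(OldHeader), sizeof(NewHeader));
	return 0;
}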

Attachment: v20250107-0010-WIP-parallel-inserts-into-GIN-index.patch (text/x-patch)
From 950c960972e135ab0b5ae17d51d1349789573167 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@2ndquadrant.com>
Date: Tue, 9 Jul 2024 20:53:20 +0200
Subject: [PATCH v20250107 10/10] WIP: parallel inserts into GIN index

---
 src/backend/access/gin/gininsert.c            | 433 ++++++++++++------
 .../utils/activity/wait_event_names.txt       |   2 +
 2 files changed, 290 insertions(+), 145 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cb3558ada55..404ed2ecaa5 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -25,7 +25,9 @@
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "pgstat.h"
+#include "storage/barrier.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "storage/predicate.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/datum.h"
@@ -42,6 +44,11 @@
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
 
+/* The phases for parallel builds, used by build_barrier. */
+#define GIN_BUILD_INIT					0
+#define GIN_BUILD_SCAN					1
+#define GIN_BUILD_PARTITION				2
+
 /*
  * Status for index builds performed in parallel.  This is allocated in a
  * dynamic shared memory segment.
@@ -88,6 +95,9 @@ typedef struct GinBuildShared
 	double		reltuples;
 	double		indtuples;
 
+	Barrier		build_barrier;
+	SharedFileSet fileset;		/* space for shared temporary files */
+
 	/*
 	 * ParallelTableScanDescData data follows. Can't directly embed here, as
 	 * implementations of the parallel table scan desc interface might need
@@ -172,7 +182,6 @@ static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relati
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
 static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
-static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
 											  Relation heap, Relation index);
 static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
@@ -189,6 +198,12 @@ static GinTuple *_gin_build_tuple(GinBuildState *state,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems);
 
+static double _gin_partition_sorted_data(GinBuildState *state);
+static void _gin_parallel_insert(GinBuildState *state,
+								 GinBuildShared *ginshared,
+								 Relation heap, Relation index,
+								 bool progress);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -554,10 +569,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	/* scan wrapped around - flush accumulated entries and start anew */
 	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
-	{
-		elog(LOG, "calling ginFlushBuildState");
 		ginFlushBuildState(buildstate, index);
-	}
 
 	/* remember the TID we're about to process */
 	memcpy(&buildstate->tid, tid, sizeof(ItemPointerData));
@@ -717,8 +729,12 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
 
-		/* scan the relation in parallel and merge per-worker results */
-		reltuples = _gin_parallel_merge(state);
+		/* partition the sorted data */
+		reltuples = _gin_partition_sorted_data(state);
+
+		/* do the insert for the leader's partition */
+		_gin_parallel_insert(state, state->bs_leader->ginshared,
+							 heap, index, true);
 
 		_gin_end_parallel(state->bs_leader, state);
 	}
@@ -1007,6 +1023,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	ginshared->reltuples = 0.0;
 	ginshared->indtuples = 0.0;
 
+	/* used to wait for data to insert */
+	BarrierInit(&ginshared->build_barrier, scantuplesortstates);
+
+	/* Set up the space we'll use for shared temporary files. */
+	SharedFileSetInit(&ginshared->fileset, pcxt->seg);
+
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
 								  snapshot);
@@ -1074,6 +1096,11 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned, leader continues");
 }
 
 /*
@@ -1087,6 +1114,8 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	/* Shutdown worker processes */
 	WaitForParallelWorkersToFinish(ginleader->pcxt);
 
+	SharedFileSetDeleteAll(&ginleader->ginshared->fileset);
+
 	/*
 	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
 	 * or we might get incomplete data.)
@@ -1764,145 +1793,6 @@ GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 	return GinBufferKeyEquals(buffer, tup);
 }
 
-/*
- * Within leader, wait for end of heap scan and merge per-worker results.
- *
- * After waiting for all workers to finish, merge the per-worker results into
- * the complete index. The results from each worker are sorted by block number
- * (start of the page range). While combinig the per-worker results we merge
- * summaries for the same page range, and also fill-in empty summaries for
- * ranges without any tuples.
- *
- * Returns the total number of heap tuples scanned.
- *
- * FIXME Maybe should have local memory contexts similar to what
- * _brin_parallel_merge does?
- */
-static double
-_gin_parallel_merge(GinBuildState *state)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-	double		reltuples = 0;
-	GinBuffer  *buffer;
-
-	/* wait for workers to scan table and produce partial results */
-	reltuples = _gin_parallel_heapscan(state);
-
-	/* do the actual sort in the leader */
-	tuplesort_performsort(state->bs_sortstate);
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The leader is allowed to use the whole maintenance_work_mem buffer to
-	 * combine data. The parallel workers already completed.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by category and
-	 * key. That probably gives us order matching how data is organized in the
-	 * index.
-	 *
-	 * We don't insert the GIN tuples right away, but instead accumulate as
-	 * many TIDs for the same key as possible, and then insert that at once.
-	 * This way we don't need to decompress/recompress the posting lists, etc.
-	 *
-	 * XXX Maybe we should sort by key first, then by category? The idea is
-	 * that if this matches the order of the keys in the index, we'd insert
-	 * the entries in order better matching the index.
-	 */
-	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
-	{
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nitems, &state->buildStats);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			Assert(buffer->nfrozen > 0);
-
-			state->buildStats.nTrims++;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer, true);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nfrozen, &state->buildStats);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferMergeTuple(buffer, tup);
-
-		if (buffer->cached)
-			GinBufferUnpackCached(buffer, 0);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		AssertCheckItemPointers(buffer, true);
-
-		Assert(!PointerIsValid(buffer->cached));
-		ginEntryInsert(&state->ginstate,
-					   buffer->attnum, buffer->key, buffer->category,
-					   buffer->items, buffer->nitems, &state->buildStats);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	elog(LOG, "_gin_parallel_merge ntrims " INT64_FORMAT, state->buildStats.nTrims);
-
-	tuplesort_end(state->bs_sortstate);
-
-	return reltuples;
-}
-
 /*
  * Returns size of shared memory required to store state for a parallel
  * gin index build based on the snapshot its parallel scan will use.
@@ -2086,6 +1976,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Prepare to track buffer usage during parallel execution */
 	InstrStartParallelQuery();
 
+	/* attach to the fileset too */
+	SharedFileSetAttach(&ginshared->fileset, seg);
+
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
 	 * (when requested number of workers were not launched, this will be
@@ -2096,6 +1989,20 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
 
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned by workers, leader continues");
+
+	/* leader sorts and partitions the data */
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned by leader, worker continues");
+
+	_gin_parallel_insert(&buildstate, ginshared, heapRel, indexRel, false);
+
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
@@ -2387,3 +2294,239 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	return ItemPointerCompare(GinTupleGetFirst(a),
 							  GinTupleGetFirst(b));
 }
+
+static double
+_gin_partition_sorted_data(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	GinBuildShared *shared = state->bs_leader->ginshared;
+	BufFile   **files;
+	int64		fileidx = 0;
+	double		reltuples;
+
+	/* how many tuples per worker */
+	int64		worker_tuples = (state->indtuples / shared->scantuplesortstates) + 1;
+	int64		remaining = Min(worker_tuples, 1000);
+	int64		ntmp = 0;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* Allocate BufFiles, one for each participant. */
+	files = palloc0_array(BufFile *, shared->scantuplesortstates);
+
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		char		fname[MAXPGPATH];
+
+		sprintf(fname, "worker-%d", i);
+
+		files[i] = BufFileCreateFileSet(&shared->fileset.fs, fname);
+	}
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us order matching how data is organized in the
+	 * index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		ntmp++;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * FIXME Maybe move to next partition only when the index key changes?
+		 * Otherwise we might have issues with 'could not fit onto page' when
+		 * adding overlapping TID lists to the index. But maybe that can't
+		 * happen, thanks to the merging of data in the tuplesort?
+		 */
+
+		BufFileWrite(files[fileidx], &tuplen, sizeof(tuplen));
+		BufFileWrite(files[fileidx], tup, tuplen);
+
+		remaining--;
+
+		/* move to the next file */
+		if (remaining == 0)
+		{
+			remaining = Min(worker_tuples, 1000);
+			fileidx++;
+			fileidx = fileidx % shared->scantuplesortstates;
+		}
+	}
+
+	/* close the files */
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		BufFileClose(files[i]);
+	}
+
+	/* and also close the tuplesort */
+	tuplesort_end(state->bs_sortstate);
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&shared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned, leader continues");
+
+	return reltuples;
+}
+
+static void
+_gin_parallel_insert(GinBuildState *state, GinBuildShared *ginshared,
+					 Relation heap, Relation index, bool progress)
+{
+	GinBuffer  *buffer;
+	GinTuple   *tup;
+	Size		len;
+
+	BufFile    *file;
+	char		fname[MAXPGPATH];
+	char	   *buff;
+	int64		ntuples = 0;
+	Size		maxlen;
+
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * XXX Stale comment: the leader could use the whole maintenance_work_mem
+	 * only when the workers had already completed. With parallel inserts the
+	 * leader and workers insert concurrently, so that no longer holds.
+	 */
+	buffer = GinBufferInit(state->ginstate.index);
+
+
+	sprintf(fname, "worker-%d", ParallelWorkerNumber + 1);
+	file = BufFileOpenFileSet(&ginshared->fileset.fs, fname, O_RDONLY, false);
+
+	/* 8kB seems like a reasonable starting point */
+	maxlen = 8192;
+	buff = palloc(maxlen);
+
+	while (true)
+	{
+		size_t		ret;
+
+		ret = BufFileRead(file, &len, sizeof(len));
+
+		if (ret == 0)
+			break;
+		if (ret != sizeof(len))
+			elog(ERROR, "could not read tuple length: read %zu of %zu bytes", ret, sizeof(len));
+
+		/* maybe resize the buffer */
+		if (maxlen < len)
+		{
+			while (maxlen < len)
+				maxlen *= 2;
+
+			buff = repalloc(buff, maxlen);
+		}
+
+		tup = (GinTuple *) buff;
+
+
+		BufFileReadExact(file, tup, len);
+
+		ntuples++;
+
+		if (ntuples % 100000 == 0)
+			elog(LOG, "inserted " INT64_FORMAT " tuples", ntuples);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			state->buildStats.nTrims++;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer, true);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer, true);
+
+		Assert(!PointerIsValid(buffer->cached));
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	elog(LOG, "_gin_parallel_insert ntrims " INT64_FORMAT, state->buildStats.nTrims);
+
+	BufFileClose(file);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..37db056bebd 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -115,6 +115,8 @@ CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from star
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
 CHECKPOINT_START	"Waiting for a checkpoint to start."
 EXECUTE_GATHER	"Waiting for activity from a child process while executing a <literal>Gather</literal> plan node."
+GIN_BUILD_PARTITION	"Waiting for the leader to partition data during a parallel GIN index build."
+GIN_BUILD_SCAN	"Waiting for the table scan to finish during a parallel GIN index build."
 HASH_BATCH_ALLOCATE	"Waiting for an elected Parallel Hash participant to allocate a hash table."
 HASH_BATCH_ELECT	"Waiting to elect a Parallel Hash participant to allocate a hash table."
 HASH_BATCH_LOAD	"Waiting for other Parallel Hash participants to finish loading a hash table."
-- 
2.47.1
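
A condensed sketch of the coordination this WIP patch adds, with the
per-phase work reduced to placeholder functions (the Barrier API and the
two wait events are the ones introduced above; everything else here is
illustrative, not the patch code itself): all participants scan and sort,
the leader then partitions the sorted stream into per-participant
BufFiles, and finally each participant inserts its own partition.

#include "postgres.h"
#include "storage/barrier.h"
#include "utils/wait_event.h"

extern void scan_and_sort(void);			/* parallel scan + tuplesort */
extern void partition_sorted_data(void);	/* leader: split into BufFiles */
extern void insert_partition(void);			/* insert own partition */

static void
participant_main(Barrier *build_barrier, bool is_leader)
{
	/* phase 1: all participants scan and write sorted GIN tuples */
	scan_and_sort();

	/* wait until every participant has finished scanning */
	BarrierArriveAndWait(build_barrier, WAIT_EVENT_GIN_BUILD_SCAN);

	/* phase 2: only the leader partitions the sorted data */
	if (is_leader)
		partition_sorted_data();

	/* wait until the leader has partitioned the data */
	BarrierArriveAndWait(build_barrier, WAIT_EVENT_GIN_BUILD_PARTITION);

	/* phase 3: each participant inserts its own partition */
	insert_partition();
}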

#45Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#44)
Re: Parallel CREATE INDEX for GIN indexes

On Tue, 7 Jan 2025 at 12:59, Tomas Vondra <tomas@vondra.me> wrote:

On 1/6/25 20:13, Matthias van de Meent wrote:

...

Thanks. Attached is a rebased patch series fixing those issues, and one
issue I found in an AssertCheckGinBuffer, which was calling the other
assert (AssertCheckItemPointers) even for empty buffers. I think this
part might need some more work, so that it's clear what the various
asserts assume (or rather to allow just calling AssertCheckGinBuffer
everywhere, with some flags).

Thanks for the rebase.

0001
+ * mutex protects all fields before heapdesc.

This comment is still inaccurate.

Hmm, yeah. But this comment originates from btree, so maybe it's wrong
there (and in BRIN too)? I believe it refers to the descriptors stored
after the struct, i.e. it means "all fields after the mutex".

Yeah, I think that's just the comment that needs updating.

+ /* FIXME likely duplicate with indtuples */

I think this doesn't have to be duplicate, as we can distinguish
between number of heap tuples and the number of GIN (key, TID) pairs
loaded. This distinction doesn't really exist anywhere else, though,
so to expose this to users we may need changes in
pg_stat_progress_create_index.

While I haven't checked if that distinction is being made in the code,
I think it would be a useful distinction to have.

I haven't done anything about this, but I'm not sure adding the number
of GIN tuples to pg_stat_progress_create_index would be very useful. We
don't know the total number of entries, so it can't show the progress.

For btree scans, we update the number of to-be-inserted tuples
together with the number of blocks scanned. Can we do something
similar with GIN?

Can we track data for pg_stat_progress_create_index?
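
For illustration, a minimal sketch of what that might look like
(pgstat_progress_update_param() and PROGRESS_CREATEIDX_TUPLES_DONE
already exist; counting flushed GIN entries in bs_numtuples is an
assumption here, and the helper name is hypothetical):

#include "commands/progress.h"
#include "pgstat.h"

/* hypothetical: called after each flush of accumulated entries */
static void
gin_report_entries_flushed(GinBuildState *buildstate, int64 nentries)
{
	buildstate->bs_numtuples += nentries;
	pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
								 (int64) buildstate->bs_numtuples);
}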

GinBufferInit

This seems to depend on the btree operator classes to get sortsupport
functions, bypassing the GIN compare support function (support
function 1) and adding a dependency on the btree opclasses for
indexable types. This can cause "bad" ordering, or failure to build
the index when the parallel path is chosen and no default btree
opclass is defined for the type. I think it'd be better if we allowed
users to specify which sortsupport function to use, or at least use
the correct compare function when it's defined on the attribute's
operator class.

Good point! I fixed this by copying the logic from initGinState.

include/access/gin_tuple.h
+ OffsetNumber attrnum; /* attnum of index key */

I think this would best be AttrNumber-typed? Looks like I didn't
notice or fix that in 0009.

You're probably right, but I see the GIN code uses OffsetNumber for
attrnum in a number of places. I wonder why that is. I don't think it
can be harmful, because we can't have GIN on system columns, right?

Indeed, indexes on system columns are not supported, which includes GIN indexes.

I need to figure out how to squash the patches - I don't want to
squash this into a single much-harder-to-understand commit, but maybe it
has too many parts.

I think the following would be good:

Commits:
1.) 0001 (parallel create) + 0009 (reduce the size of ...) + 0002
(mergesort) + 0003 (remove explicit pg_qsort) + 0007 (detect
wrap-around)
2.) 0004 (compress) + 0006 (enforce memory limit)
3.) 0008 (single tuplesort)

Note that 0009 is a drop-in improvement, so I don't think order makes
much of a difference there.

IIUC, 0005 was only for development insights, and not proposed to get
committed. If that was wrong, I'd squash it into the second commit,
together with 0004/0006.

I'll try to provide a more polished version of 0008 soon, with
improved comments/commit message, however that'll depend on me not
getting distracted with $job items first; it's taken quite some time
recently.

Kind regards,

Matthias van de Meent

#46Tomas Vondra
tomas@vondra.me
In reply to: Matthias van de Meent (#45)
Re: Parallel CREATE INDEX for GIN indexes

On 2/12/25 15:59, Matthias van de Meent wrote:

On Tue, 7 Jan 2025 at 12:59, Tomas Vondra <tomas@vondra.me> wrote:

On 1/6/25 20:13, Matthias van de Meent wrote:

...

Thanks. Attached is a rebased patch series fixing those issues, and one
issue I found in an AssertCheckGinBuffer, which was calling the other
assert (AssertCheckItemPointers) even for empty buffers. I think this
part might need some more work, so that it's clear what the various
asserts assume (or rather to allow just calling AssertCheckGinBuffer
everywhere, with some flags).

Thanks for the rebase.

0001
+ * mutex protects all fields before heapdesc.

This comment is still inaccurate.

Hmm, yeah. But this comment originates from btree, so maybe it's wrong
there (and in BRIN too)? I believe it refers to the descriptors stored
after the struct, i.e. it means "all fields after the mutex".

Yeah, I think that's just the comment that needs updating.

+ /* FIXME likely duplicate with indtuples */

I think this doesn't have to be duplicate, as we can distinguish
between number of heap tuples and the number of GIN (key, TID) pairs
loaded. This distinction doesn't really exist anywhere else, though,
so to expose this to users we may need changes in
pg_stat_progress_create_index.

While I haven't checked if that distinction is being made in the code,
I think it would be a useful distinction to have.

I haven't done anything about this, but I'm not sure adding the number
of GIN tuples to pg_stat_progress_create_index would be very useful. We
don't know the total number of entries, so it can't show the progress.

For btree scans, we update the number of to-be-inserted tuples
together with the number of blocks scanned. Can we do something
similar with GIN?

Can we track data for pg_stat_progress_create_index?

GinBufferInit

This seems to depend on the btree operator classes to get sortsupport
functions, bypassing the GIN compare support function (support
function 1) and adding a dependency on the btree opclasses for
indexable types. This can cause "bad" ordering, or failure to build
the index when the parallel path is chosen and no default btree
opclass is defined for the type. I think it'd be better if we allowed
users to specify which sortsupport function to use, or at least use
the correct compare function when it's defined on the attribute's
operator class.

Good point! I fixed this by copying the logic from initGinState.

include/access/gin_tuple.h
+ OffsetNumber attrnum; /* attnum of index key */

I think this would best be AttrNumber-typed? Looks like I didn't
notice or fix that in 0009.

You're probably right, but I see the GIN code uses OffsetNumber for
attrnum in a number of places. I wonder why that is. I don't think it
can be harmful, because we can't have GIN on system columns, right?

Indeed, indexes on system columns are not supported, which includes GIN indexes.

I need to figure out how to squash the patches - I don't want to
squash this into a single much-harder-to-understand commit, but maybe it
has too many parts.

I think the following would be good:

Commits:
1.) 0001 (parallel create) + 0009 (reduce the size of ...) + 0002
(mergesort) + 0003 (remove explicit pg_qsort) + 0007 (detect
wrap-around)
2.) 0004 (compress) + 0006 (enforce memory limit)
3.) 0008 (single tuplesort)

Thanks. It's been so long since I looked at the patches that I don't
quite recall all the details - it's almost as if it was authored by
someone else ;-)

But your proposal makes sense. I was torn between committing this in
smaller "increments" and squashing it like this. The smaller steps are
easier to follow and debug. But that mattered during the development
and evaluation of those changes. It's not that useful for commit,
because the parts will get pushed over a relatively short time period
anyway.

Note that 0009 is a drop-in improvement, so I don't think order makes
much of a difference there.

True.

IIUC, 0005 was only for development insights, and not proposed to get
committed. If that was wrong, I'd squash it into the second commit,
together with 0004/0006.

Nope, it was for development only. I'll however consider keeping 0004
and 0006 separate, because those seem like pretty separate changes. I
don't see much point in keeping them out of (1) only to then squash
them together into a single commit.

I'll try to provide a more polished version of 0008 soon, with
improved comments/commit message, however that'll depend on me not
getting distracted with $job items first; it's taken quite some time
recently.

Cool, thanks!

regards

--
Tomas Vondra

#47Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#46)
5 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

Hi,

Attached is a cleaned up version of the patch series, squashed into
fewer patches as discussed. I also went through all the comments, and
removed/updated some obsolete ones. I also updated the commit messages;
it'd be nice if someone could read through those, to make sure they're
clear enough.

While cleaning the comments, I realized there's a couple remaining XXX
and FIXME comments, with some valid open questions.

1) There are two places that explicitly zero memory, suggesting it's
because of padding causing issues in valgrind (in tuplesort). I need to
check if that's still true, but I wonder how the other tuplesort
variants manage to write stuff without tripping valgrind. Maybe the
GinTuple is too unique.
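
If padding is indeed the culprit, the obvious sketch of a fix would be
to zero the whole allocation before filling it, so the alignment gaps
(e.g. the SHORTALIGN padding before the posting lists) contain defined
bytes when tuplesort writes the full length to a tape. A hypothetical
helper, not what the patch currently does:

#include "postgres.h"
#include "access/gin_tuple.h"

/* allocate the serialized tuple zeroed, so padding bytes are defined */
static GinTuple *
gin_alloc_serialized_tuple(Size tuplen)
{
	return (GinTuple *) palloc0(tuplen);
}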

2) ginBuildCallbackParallel says this about memory limits:

* XXX It might seem this should set the memory limit to 32MB, same as
* what plan_create_index_workers() uses to calculate the number of
* parallel workers, but that's the limit for tuplesort. So it seems
* better to keep using work_mem here.
*
* XXX But maybe we should calculate this as a per-worker fraction of
* maintenance_work_mem. It's weird to use work_mem here, in a clearly
* maintenance command.

The function uses work_mem to limit the amount of memory used by each
worker, which seems a bit strange - it's a maintenance operation, so it
would be more appropriate to use maintenance_work_mem I guess.

I see the btree code also uses work_mem in some cases when building the
index, although that uses it to size the tuplesort. And here we have
both the tuplesorts (sized just like in the nbtree code) and the
buffer used to accumulate entries.

I wonder if maybe the right solution would be to use half the allowance
for tuplesort and half for the buffer. In the workers the allowance is

maintenance_work_mem / ginleader->nparticipanttuplesorts

while in the leader it's maintenance_work_mem. Opinions?
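
To be concrete, the proposal might look like this sketch (the function
name is hypothetical; maintenance_work_mem is in kilobytes):

#include "postgres.h"
#include "miscadmin.h"

/* split the per-participant allowance between tuplesort and accumulator */
static void
gin_memory_allowances(int nparticipants, int *sortmem_kb, int *accummem_kb)
{
	int			allowance = maintenance_work_mem / nparticipants;

	*sortmem_kb = allowance / 2;	/* passed to tuplesort_begin_index_gin */
	*accummem_kb = allowance / 2;	/* limit for the BuildAccumulator */
}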

3) There's an XXX comment suggesting to use a separate memory context for
the GinBuffer, but I decided it doesn't seem really necessary. We're not
running any complex function or anything like that in this code, so I
don't see a huge benefit of a separate context.

I know the patch reworking this to use a single tuplesort actually adds
the memory context, maybe it's helpful for that patch. But for now I
don't see the point.

4) The patch saving 12B in the GinTuple also added this comment:

* XXX: Update description with new architecture

but I'm a bit unsure what exactly is meant by "architecture" or what
I should add a description for.

regards

--
Tomas Vondra

Attachments:

v20250216-0001-Allow-parallel-CREATE-INDEX-for-GIN-indexe.patch
From 232ae1160d006b53b04f54afb73c860e4d1d74c2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 20:57:52 +0100
Subject: [PATCH v20250216 1/5] Allow parallel CREATE INDEX for GIN indexes

Allow using multiple worker processes to build a GIN index, similarly to
BTREE and BRIN indexes. For large tables this may result in significant
speedup when the build is CPU-bound.

The work is divided so that each worker builds index entries on a subset
of the table, determined by the regular parallel scan used to read the
data. Each worker uses a local tuplesort to sort and merge the entries
for the same key. The entries are then written into a shared tuplesort.
Finally, the leader merges entries from all workers, and writes them
into the index.

This minimizes the amount of sorting and merging that needs to happen
in the leader process. The entries still need to be merged, but doing
most of that in the workers means the work is parallelized. The leader
then needs to merge fewer, larger entries, which is cheaper / more
efficient.

The workers build entries so that the TID lists do not overlap (for a
given key), so merging them simply means appending one list to the
other. In the leader, a full mergesort is needed.

Most of the parallelism infrastructure is a simplified copy of the code
used by BTREE indexes, omitting the parts irrelevant for GIN indexes
(e.g. uniqueness checks).

Original patch by me, with reviews and substantial reworks by Matthias
van de Meent, certainly enough to make him a co-author.

Author: Tomas Vondra, Matthias van de Meent
Reviewed-by: Matthias van de Meent
Discussion:
---
 src/backend/access/gin/gininsert.c         | 1611 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  200 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   51 +
 src/include/utils/tuplesort.h              |    8 +
 src/tools/pgindent/typedefs.list           |    4 +
 8 files changed, 1869 insertions(+), 15 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d1b5e8f0dd1..2a44482b4be 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,126 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/sortsupport.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinBuildShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all following fields
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinBuildShared;
+
+/*
+ * Return pointer to a GinBuildShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinBuildShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinBuildShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinBuildShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +144,57 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process.
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate used only within a single worker for the first merge pass
+	 * happening there. In principle it doesn't need to be part of the build
+	 * state and we could pass it around directly, but it's more convenient
+	 * this way. And it's part of the build state, after all.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinBuildShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +473,122 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginFlushBuildState
+ *		Write all data from BuildAccumulator into the tuplesort.
+ */
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
+/*
+ * ginBuildCallbackParallel
+ *		Callback for the parallel index build.
+ *
+ * This is similar to the serial build callback ginBuildCallback, but
+ * instead of writing the accumulated entries into the index, each worker
+ * writes them into a (local) tuplesort.
+ *
+ * The worker then sorts and combines these entries, before writing them
+ * into a shared tuplesort for the leader (see _gin_parallel_scan_and_build
+ * for the whole process).
+ */
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	/*
+	 * if scan wrapped around - flush accumulated entries and start anew
+	 *
+	 * With parallel scans, we don't have a guarantee the scan does not start
+	 * half-way through the relation (serial builds disable sync scans and
+	 * always start from block 0, parallel scans require allow_sync=true).
+	 *
+	 * Building the posting lists assumes the TIDs are monotonic and never go
+	 * back, and the wrap around would break that. We handle that by detecting
+	 * the wraparound, and flushing all entries. This means we'll later see
+	 * two separate entries with non-overlapping TID lists (which can be
+	 * combined by merge sort).
+	 *
+	 * To detect a wraparound, we remember the last TID seen by each worker
+	 * (for any key). If the next TID seen by the worker is lower, the scan
+	 * must have wrapped around.
+	 */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+		ginFlushBuildState(buildstate, index);
+
+	/* remember the TID we're about to process */
+	buildstate->tid = *tid;
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort.
+	 *
+	 * XXX It might seem this should set the memory limit to 32MB, same as
+	 * what plan_create_index_workers() uses to calculate the number of
+	 * parallel workers, but that's the limit for tuplesort. So it seems
+	 * better to keep using work_mem here.
+	 *
+	 * XXX But maybe we should calculate this as a per-worker fraction of
+	 * maintenance_work_mem. It's weird to use work_mem here, in a clearly
+	 * maintenance command.
+	 */
+	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+		ginFlushBuildState(buildstate, index);
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +606,16 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * Initialize all the fields, not to trip valgrind.
+	 *
+	 * XXX Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,24 +656,91 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. For GIN that's
+	 * reasonable too, because we sort the data just like btree. It does
+	 * ignore the memory used to accumulate data in memory (set by work_mem),
+	 * but there is no way to communicate that to plan_create_index_workers.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, &buildstate, NULL);
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
+	 */
+	if (state->bs_leader)
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(heap, index,
+									  maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
+
+		/* scan the relation in parallel and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
+
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
+	{
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, &buildstate, NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -533,3 +880,1239 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinBuildShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinBuildShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinBuildShared *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+/*
+ * Buffer used to accumulate TIDs from multiple GinTuples for the same key
+ * (we read these from the tuplesort, sorted by the key).
+ *
+ * This is similar to BuildAccumulator in that it's used to collect TIDs
+ * in memory before inserting them into the index, but it's much simpler
+ * as it only deals with a single index key at a time.
+ *
+ * When adding TIDs to the buffer, we make sure to keep them sorted, both
+ * during the initial table scan (and detecting when the scan wraps around),
+ * and during merging (where we do mergesort).
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	SortSupport ssup;			/* for sorting/comparing keys */
+	ItemPointerData *items;
+} GinBuffer;
+
+/*
+ * Check that TID array contains valid values, and that it's sorted (if we
+ * expect it to be).
+ */
+static void
+AssertCheckItemPointers(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&buffer->items[i]));
+
+		/* don't check ordering for the first TID item */
+		if (i == 0)
+			continue;
+
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
+	}
+#endif
+}
+
+/*
+ * GinBuffer checks
+ *
+ * XXX Maybe it would be better to have AssertCheckGinBuffer with flags, instead
+ * of having to call AssertCheckItemPointers in some places, if we require the
+ * items to not be empty?
+ */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* if we have any items, the array must exist */
+	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+
+	/*
+	 * The buffer may be empty, in which case we must not call the check of
+	 * item pointers, because that assumes non-emptiness.
+	 */
+	if (buffer->nitems == 0)
+		return;
+
+	/* Make sure the item pointers are valid and sorted. */
+	AssertCheckItemPointers(buffer);
+#endif
+}
+
+/*
+ * GinBufferInit
+ *		Initialize buffer to store tuples for a GIN index.
+ *
+ * Initialize the buffer used to accumulate TID for a single key at a time
+ * (we process the data sorted), so we know when we received all data for
+ * a given key.
+ *
+ * Initializes sort support procedures for all index attributes.
+ */
+static GinBuffer *
+GinBufferInit(Relation index)
+{
+	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
+	int			i,
+				nKeys;
+	TupleDesc	desc = RelationGetDescr(index);
+
+	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
+
+	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
+
+	/*
+	 * Lookup ordering operator for the index key data type, and initialize
+	 * the sort support function.
+	 */
+	for (i = 0; i < nKeys; i++)
+	{
+		Oid			cmpFunc;
+		SortSupport sortKey = &buffer->ssup[i];
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = index->rd_indcollation[i];
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		/*
+		 * If the compare proc isn't specified in the opclass definition, look
+		 * up the index key type's default btree comparator.
+		 */
+		cmpFunc = index_getprocid(index, i + 1, GIN_COMPARE_PROC);
+		if (cmpFunc == InvalidOid)
+		{
+			TypeCacheEntry *typentry;
+
+			typentry = lookup_type_cache(att->atttypid,
+										 TYPECACHE_CMP_PROC_FINFO);
+			if (!OidIsValid(typentry->cmp_proc_finfo.fn_oid))
+				ereport(ERROR,
+						(errcode(ERRCODE_UNDEFINED_FUNCTION),
+						 errmsg("could not identify a comparison function for type %s",
+								format_type_be(att->atttypid))));
+
+			cmpFunc = typentry->cmp_proc_finfo.fn_oid;
+		}
+
+		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
+	}
+
+	return buffer;
+}
+
+/* Is the buffer empty, i.e. has no TID values in the array? */
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * GinBufferKeyEquals
+ *		Can the buffer store TIDs for the provided GIN tuple (same key)?
+ *
+ * Check if the tuple matches the already accumulated data in the GIN
+ * buffer. Compare scalar fields first, before the actual key.
+ *
+ * Returns true if the key matches (so the TIDs belong to the buffer), or
+ * false if the key does not match.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	int			r;
+	Datum		tupkey;
+
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	/*
+	 * For NULL/empty keys this means equality; for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * For the tuple, get either the first sizeof(Datum) bytes for byval
+	 * types, or a pointer to the beginning of the data array.
+	 */
+	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+
+	r = ApplySortComparator(buffer->key, false,
+							tupkey, false,
+							&buffer->ssup[buffer->attnum - 1]);
+
+	return (r == 0);
+}
+
+/*
+ * GinBufferStoreTuple
+ *		Add data (especially TID list) from a GIN tuple to the buffer.
+ *
+ * The buffer is expected to be either empty (in which case it's initialized),
+ * or to hold data for the same key. The TID values from the tuple are
+ * combined with the stored values using a merge sort.
+ *
+ * The tuples (for the same key) are expected to be sorted by first TID. But
+ * this does not guarantee the lists do not overlap, especially in the leader,
+ * because the workers process interleaving data. There should be no overlaps
+ * in a single worker - it could happen when the parallel scan wraps around,
+ * but we detect that and flush the data (see ginBuildCallbackParallel).
+ *
+ * By sorting the GinTuples not only by key but also by the first TID, we make
+ * it less likely the lists will overlap during merge. When they don't
+ * overlap, the mergesort just appends one list to the other, which is much
+ * cheaper.
+ *
+ * How often can the lists overlap? There should be no overlaps in workers,
+ * and in the leader we can see overlaps between lists built by different
+ * workers. But the workers merge the items as much as possible, so there
+ * should not be too many.
+ */
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* add the new TIDs into the buffer, combine using merge-sort */
+	{
+		int			nnew;
+		ItemPointer new;
+
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
+
+		Assert(nnew == buffer->nitems + tup->nitems);
+
+		if (buffer->items)
+			pfree(buffer->items);
+
+		buffer->items = new;
+		buffer->nitems = nnew;
+
+		AssertCheckItemPointers(buffer);
+	}
+}
+
+/*
+ * GinBufferReset
+ *		Reset the buffer into a state as if it contains no data.
+ *
+ * XXX Should we do something if the array of TIDs gets too large? It may
+ * grow too much, and we won't free it until the worker finishes building.
+ * It would be better not to let the array grow arbitrarily large, and to
+ * enforce work_mem as a memory limit by flushing the buffer into the
+ * tuplesort.
+ */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/*
+	 * Not required, but makes it more likely to trigger a NULL dereference
+	 * if the value is used incorrectly, etc.
+	 */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+}
+
+/*
+ * GinBufferFree
+ *		Release memory associated with the GinBuffer (including TID array).
+ */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
+/*
+ * GinBufferCanAddKey
+ *		Check if a given GIN tuple can be added to the current buffer.
+ *
+ * Returns true if the buffer is either empty or for the same index key.
+ *
+ * XXX This could / should also enforce a memory limit by checking the size of
+ * the TID array, and returning false if it's too large (more than work_mem,
+ * for example). But in the leader we need to be careful not to force flushing
+ * data too early, which might break the monotonicity of the TID list.
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, merge the per-worker results into
+ * the complete index. The results from each worker are sorted by the index
+ * key; while combining the per-worker results we accumulate the TIDs for
+ * each key and insert them into the index at once.
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That should roughly match the order in which the data is
+	 * organized in the index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short (due to frequent evictions in the callback).
+ * But combining many tiny lists is expensive, so we try to do as much as
+ * possible in the workers and only then pass the results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker. Which means the leader will
+ * do a very limited number of mergesorts, which is good.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the worker's private tuplesort, sorted by the
+	 * key, and merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		AssertCheckItemPointers(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
+/*
+ * Perform a worker's portion of a parallel GIN index build sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ *
+ * Before feeding data into a shared tuplesort (for the leader process),
+ * the workers process data in two phases.
+ *
+ * 1) A worker reads a portion of rows from the table, accumulates entries
+ * in memory, and flushes them into a private tuplesort (e.g. because of
+ * using too much memory).
+ *
+ * 2) The private tuplesort gets sorted (by key and TID), the worker reads
+ * the data again, and combines the entries as much as possible. This has
+ * to happen eventually, and this way it's done in workers in parallel.
+ *
+ * Finally, the combined entries are written into the shared tuplesort, so
+ * that the leader can process them.
+ *
+ * How well this works (compared to just writing entries into the shared
+ * tuplesort) depends on the data set. For large tables with many distinct
+ * keys this helps a lot. With many distinct keys it's likely the buffer has
+ * to be flushed often, generating many entries with the same key and short
+ * TID lists. These entries need to be sorted and merged at some point,
+ * before writing them to the index. The merging is quite expensive, it can
+ * easily be ~50% of a serial build, and doing as much of it in the workers
+ * means it's parallelized. The leader still has to merge results from the
+ * workers, but it's much more efficient to merge a few large entries than
+ * many tiny ones.
+ *
+ * This also reduces the amount of data the workers pass to the leader through
+ * the shared tuplesort. OTOH the workers need more space for the private sort,
+ * possibly up to 2x the data, if no entries get merged in a worker. But this
+ * is very unlikely, and the only consequence is inefficiency, so we ignore it.
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinBuildShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
+													sortmem, coordinate,
+													TUPLESORT_NONE);
+
+	/* Local per-worker sort of raw-data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
+													  sortmem, NULL,
+													  TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinBuildShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write remaining accumulated entries */
+	ginFlushBuildState(state, index);
+
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place that into
+	 * the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to SHORTALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have an actual non-empty key. We include varlena headers and \0 bytes
+	 * for strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to do the SHORTALIGN to allow
+	 * direct access to item pointers.
+	 *
+	 * XXX Note that for byval types we store the whole datum, no matter what
+	 * the typlen value is.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "unexpected typlen value (%d)", typlen);
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * Allocate space for the whole GIN tuple.
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplestore representation.
+ *
+ * Most of the fields are actually directly accessible, the only thing that
+ * needs more care is the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
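+	/*
+	 * By-value keys were serialized by copying the whole Datum, so reading
+	 * back keylen (= sizeof(Datum)) bytes reconstructs the value exactly.
+	 */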
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, the key value is
+ * compared last. The comparisons are done using type-specific sort support
+ * functions.
+ *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
+{
+	int			r;
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if (a->category == GIN_CAT_NORM_KEY)
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		r = ApplySortComparator(keya, false,
+								keyb, false,
+								&ssup[a->attrnum - 1]);
+
+		/* if the key is the same, consider the first TID in the array */
+		return (r != 0) ? r : ItemPointerCompare(GinTupleGetFirst(a),
+												 GinTupleGetFirst(b));
+	}
+
+	return ItemPointerCompare(GinTupleGetFirst(a),
+							  GinTupleGetFirst(b));
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 1f9e58c4f1f..6b2dd40fa0f 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 4ab5df92133..f6d81d6e1fc 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -148,6 +149,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 913c4ef455e..4d3114076b3 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,10 +20,12 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "catalog/index.h"
+#include "catalog/pg_collation.h"
 #include "executor/executor.h"
 #include "pg_trace.h"
 #include "utils/datum.h"
@@ -46,6 +48,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +78,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +88,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -568,6 +578,79 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+Tuplesortstate *
+tuplesort_begin_index_gin(Relation heapRel,
+						  Relation indexRel,
+						  int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext;
+	int			i;
+	TupleDesc	desc = RelationGetDescr(indexRel);
+
+	oldcontext = MemoryContextSwitchTo(base->maincontext);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	/*
+	 * Multi-column GIN indexes expand the row into a separate index entry
+	 * for each attribute, and that's what we write into the tuplesort. But
+	 * we still need to initialize sortsupport for all the attributes.
+	 */
+	base->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
+
+	/* Prepare SortSupport data for each column */
+	base->sortKeys = (SortSupport) palloc0(base->nKeys *
+										   sizeof(SortSupportData));
+
+	for (i = 0; i < base->nKeys; i++)
+	{
+		SortSupport sortKey = base->sortKeys + i;
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = indexRel->rd_indcollation[i];
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		/*
+		 * Look up an ordering operator for the index key data type, and then
+		 * the sort support function.
+		 *
+		 * XXX does this use the right opckeytype/opcintype for GIN?
+		 */
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -803,6 +886,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tuple, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -975,6 +1089,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1763,6 +1900,69 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple,
+							   base->sortKeys);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
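+	/*
+	 * Note: the length word written to the tape counts itself, and
+	 * readtup_index_gin subtracts it again to get the GinTuple length.
+	 */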
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 9ed48dfde4b..2debdac0f43 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 00000000000..fb8f982a2a0
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,51 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations for GIN tuples used by tuplesort in parallel index builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "access/ginblock.h"
+#include "storage/itemptr.h"
+#include "utils/sortsupport.h"
+
+/*
+ * XXX: Update description with new architecture
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ *
+ * XXX do we still need all the fields now that we use SortSupport?
+ */
+typedef struct GinTuple
+{
+	int			tuplen;			/* length of the whole tuple */
+	OffsetNumber attrnum;		/* attnum of index key */
+	uint16		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
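+/*
+ * GinTupleGetFirst
+ *		Return the first TID from the tuple's TID list.
+ *
+ * The TID list is stored SHORTALIGN'ed right after the key data. Casting it
+ * to GinPostingList works because that struct starts with the first
+ * (unpacked) ItemPointerData, whether or not the list is compressed.
+ */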
+static inline ItemPointer
+GinTupleGetFirst(GinTuple *tup)
+{
+	GinPostingList *list;
+
+	list = (GinPostingList *) SHORTALIGN(tup->data + tup->keylen);
+
+	return &list->first;
+}
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index c63f1e5d6da..ef79f259f93 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,10 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(Relation heapRel,
+												 Relation indexRel,
+												 int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +461,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +471,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b6c170ac249..5be7c379926 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1026,11 +1026,14 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
+GinBuildShared
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1046,6 +1049,7 @@ GinScanOpaqueData
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.48.1

Attachment: v20250216-0002-Compress-TID-lists-when-writing-GIN-tuples.patch (text/x-patch)
From 1e5b6112f6bdeabbc8ec3894259188a301bba732 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:01:43 +0100
Subject: [PATCH v20250216 2/5] Compress TID lists when writing GIN tuples to
 disk

When serializing GIN tuples to tuplesorts during parallel index builds,
we can significantly reduce the amount of data by compressing the TID
lists. The GIN opclasses may produce a lot of data (depending on how
many keys are extracted from each row), and the TID compression is very
efficient and effective.

If the number of distinct keys is high, the first worker pass (reading
data from the table and writing them into a private tuplesort) may not
benefit from the compression very much. It is likely to spill data to
disk before the TID lists get long enough for the compression to help.
The second pass (writing the merged data into the shared tuplesort) is
more likely to benefit from compression.

The compression can be seen as a way to reduce the amount of disk space
needed by the parallel builds, because the data is written twice - first
into the per-worker tuplesorts, then into the shared tuplesort.

Author: Tomas Vondra
Reviewed-by: Matthias van de Meent
Discussion:
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 2a44482b4be..0860b89f74e 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -189,7 +189,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1373,7 +1375,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1409,6 +1412,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /*
@@ -1918,6 +1924,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1930,6 +1945,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
  * like endianness etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to SHORTALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized as compressed - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1943,6 +1963,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
 	 * have an actual non-empty key. We include varlena headers and \0 bytes
@@ -1969,12 +1994,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "unexpected typlen value (%d)", typlen);
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * only SHORTALIGN).
 	 */
-	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -2023,37 +2070,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible, the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -2066,6 +2116,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a pointer to a palloc'd array of decompressed TIDs.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -2101,8 +2173,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 
 	if (a->category == GIN_CAT_NORM_KEY)
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		r = ApplySortComparator(keya, false,
 								keyb, false,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5be7c379926..28d39648c04 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1046,6 +1046,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinState
 GinStatsData
 GinTernaryValue
-- 
2.48.1

Attachment: v20250216-0003-Enforce-memory-limit-during-parallel-GIN-b.patch (text/x-patch)
From 295bfba2ff1d5aacc6836bd06cfdb530ec07f06a Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:02:45 +0100
Subject: [PATCH v20250216 3/5] Enforce memory limit during parallel GIN builds

Index builds are expected to respect maintenance_work_mem, just like
other maintenance operations. For serial builds this is done simply by
flushing the buffer in ginBuildCallback() into the index. But with
parallel builds it's more complicated, because there are multiple places
that can allocate memory.

ginBuildCallbackParallel() does the same thing as ginBuildCallback(),
except that the accumulated items are written into tuplesort. Then the
entries with the same key get merged - first in the worker, then in the
leader - and the TID lists may get (arbitrarily) long. It's unlikely it
would exceed the memory limit, but it's possible. We address this by
evicting some of the data if the list gets too long.

We can't simply dump the whole in-memory TID list. The GIN index bulk
insert code expects to see TIDs in monotonic order; it may fail if the
TIDs go backwards. If the TID lists overlap, evicting the whole current
TID list would break this (a later entry might add "old" TID values into
the already-written part).

In the workers this is not an issue, because the lists never overlap.
But the leader may see overlapping lists produced by the workers.

We can however derive a safe "horizon" TID - the entries (for a given
key) are sorted by (key, first TID), which means no future list can add
values before the last "first TID" we've seen. This patch tracks the
"frozen" part of the TID list, which we know can't change by merging
additional TID lists. If needed, we can evict this part of the list.

We don't want to do this too often - the smaller the lists we evict, the
more expensive it'll be to merge them in the next step (especially in
the leader). Therefore we only trim the list if we have at least 1024
frozen items, and if the whole list is at least 64kB large.

These limits are somewhat arbitrary and fairly low. We might calculate
some limits from maintenance_work_mem, but judging by experiments that
does not really improve anything (time, compression ratio, ...). So we
stick to these conservative limits to release memory faster.

Author: Tomas Vondra
Reviewed-by: Matthias van de Meent
Discussion:
---
 src/backend/access/gin/gininsert.c | 212 +++++++++++++++++++++++++++--
 1 file changed, 204 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0860b89f74e..2f0559dd8ec 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1162,8 +1162,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempting to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1237,6 +1241,13 @@ GinBufferInit(Relation index)
 				nKeys;
 	TupleDesc	desc = RelationGetDescr(index);
 
+	/*
+	 * How many items can we fit into the memory limit? We don't want to end
+	 * up with too many TIDs, and 64kB seems more than enough. But maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 */
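+	/* 64kB of 6-byte ItemPointerData entries works out to roughly 10900 TIDs */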
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
 	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
 
 	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
@@ -1344,6 +1355,48 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (r == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we mean writing out and removing the tuple IDs that we know
+ * can't change in future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding whether to trim - if the resulting
+ * list (after adding TIDs from the new tuple) would be too long, and if there
+ * are enough TIDs to trim (with values less than the "first" TID from the new
+ * tuple), we do the trim. By enough we mean at least 1024 TIDs (mostly an
+ * arbitrary number).
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* no need to trim if we have not hit the memory limit yet */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1394,21 +1447,76 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
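+	/*
+	 * Example (made-up TID values): if the buffer holds [10, 20, 30, 40] and
+	 * the new tuple's first TID is 25, no future tuple for this key can add
+	 * a TID below 25 (tuples are sorted by first TID), so the items 10 and
+	 * 20 become frozen (nfrozen = 2).
+	 */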
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+							GinTupleGetFirst(tup)) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now find the last TID we know to be frozen, i.e. the last TID right
+	 * before the new GIN tuple.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher. If we already know the whole list is frozen
+	 * (i.e. nfrozen == nitems), this does nothing.
+	 *
+	 * XXX This might do a binary search for sufficiently long lists, but it
+	 * does not seem worth the complexity. Overlapping lists should be rare
+	 * common, TID comparisons are cheap, and we should quickly freeze most of
+	 * the list.
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i],
+							   GinTupleGetFirst(tup)) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		pfree(new);
+
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer);
 	}
@@ -1445,11 +1553,29 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /*
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
@@ -1515,7 +1641,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers already completed.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/*
@@ -1553,6 +1684,32 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * The buffer is for the same key, but we've hit the memory limit
+			 * - insert the frozen part of the TID list into the index, and
+			 * keep only the unfrozen tail in the buffer.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1636,7 +1793,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/* sort the raw per-worker data */
@@ -1680,6 +1843,39 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * The buffer is for the same key, but we've hit the memory limit
+			 * - write the frozen part of the TID list into the shared
+			 * tuplesort, and keep only the unfrozen tail in the buffer.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
-- 
2.48.1
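
The freeze/trim logic above, in a nutshell: TIDs at or below the first TID
of the incoming list can never be reordered by later merges, so they can be
flushed and excluded from future mergesorts. Here is a minimal,
self-contained C sketch of the idea - plain integers stand in for
ItemPointerData and all names are made up, so this is an illustration of
the technique, not the patch code:

#include <stdint.h>

typedef struct
{
	uint64_t   *items;			/* sorted TIDs accumulated so far */
	int			nitems;
	int			nfrozen;		/* prefix no future merge can change */
	int			maxitems;		/* memory limit, in TIDs */
} tid_buffer;

/*
 * Grow the frozen prefix: every buffered TID at or below the first TID
 * of the incoming list is final, because the tuples arrive sorted by
 * (key, first TID).
 */
static void
freeze_prefix(tid_buffer *buf, uint64_t incoming_first)
{
	while (buf->nfrozen < buf->nitems &&
		   buf->items[buf->nfrozen] <= incoming_first)
		buf->nfrozen++;
}

/*
 * Trim only when the frozen prefix is long enough to compress well
 * (1024, matching GinBufferShouldTrim) and adding the incoming TIDs
 * would exceed the memory limit.
 */
static int
should_trim(const tid_buffer *buf, int nincoming)
{
	return (buf->nfrozen >= 1024) &&
		   (buf->nitems + nincoming >= buf->maxitems);
}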

v20250216-0004-Use-a-single-GIN-tuplesort.patch
From e2b9ee0781a885bb5b45853480f8adfb8ea235e6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:02:55 +0100
Subject: [PATCH v20250216 4/5] Use a single GIN tuplesort

The previous approach was to sort the data on a private sort, then read
it back, merge the GinTuples, and write it into the shared sort, to
later be used by the shared tuple sort.

The new approach is to use a single sort, merging tuples as we write
them to disk.  This reduces temporary disk space.

An optimization was added to GinBuffer in which we don't deserialize
tuples unless we need access to the item pointers.

This modifies Tuplesort to have a new flushwrites callback. A sort's
writetup implementation can now decide to buffer writes until the next
flushwrites() call.
---
 src/backend/access/gin/gininsert.c         | 396 ++++++++++-----------
 src/backend/utils/sort/tuplesort.c         |   5 +
 src/backend/utils/sort/tuplesortvariants.c | 102 +++++-
 src/include/access/gin_private.h           |   3 +
 src/include/access/gin_tuple.h             |  10 +
 src/include/utils/tuplesort.h              |  10 +-
 src/tools/pgindent/typedefs.list           |   1 +
 7 files changed, 302 insertions(+), 225 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 2f0559dd8ec..7414f1b6a74 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -163,14 +163,6 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
-
-	/*
-	 * The sortstate used only within a single worker for the first merge pass
-	 * happening there. In principle it doesn't need to be part of the build
-	 * state and we could pass it around directly, but it's more convenient
-	 * this way. And it's part of the build state, after all.
-	 */
-	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -194,8 +186,7 @@ static Datum _gin_parse_tuple_key(GinTuple *a);
 
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
-								  ItemPointerData *items, uint32 nitems,
-								  Size *len);
+								  ItemPointerData *items, uint32 nitems);
 
 /*
  * Adds array of item pointers to tuple's posting list, or
@@ -498,16 +489,15 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
-		Size		tuplen;
 
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
 		tup = _gin_build_tuple(attnum, category,
 							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+		tuplesort_putgintuple(buildstate->bs_sortstate, tup);
 
 		pfree(tup);
 	}
@@ -1151,8 +1141,14 @@ _gin_parallel_heapscan(GinBuildState *state)
  * during the initial table scan (and detecting when the scan wraps around),
  * and during merging (where we do mergesort).
  */
-typedef struct GinBuffer
+struct GinBuffer
 {
+	/*
+	 * The memory context holds the dynamic allocation of items, key, and any
+	 * produced GinTuples.
+	 */
+	MemoryContext context;
+	GinTuple   *cached;			/* copy of previous GIN tuple */
 	OffsetNumber attnum;
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
@@ -1170,7 +1166,7 @@ typedef struct GinBuffer
 	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
-} GinBuffer;
+};
 
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
@@ -1181,8 +1177,7 @@ AssertCheckItemPointers(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
+	Assert(buffer->nitems == 0 || buffer->items != NULL);
 
 	for (int i = 0; i < buffer->nitems; i++)
 	{
@@ -1209,7 +1204,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* if we have any items, the array must exist */
-	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+	Assert((buffer->nitems == 0) || (buffer->items != NULL));
 
 	/*
 	 * The buffer may be empty, in which case we must not call the check of
@@ -1233,7 +1228,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
  *
  * Initializes sort support procedures for all index attributes.
  */
-static GinBuffer *
+GinBuffer *
 GinBufferInit(Relation index)
 {
 	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
@@ -1296,15 +1291,18 @@ GinBufferInit(Relation index)
 
 		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
 	}
+	buffer->context = GenerationContextCreate(CurrentMemoryContext,
+											  "Gin Buffer",
+											  ALLOCSET_DEFAULT_SIZES);
 
 	return buffer;
 }
 
 /* Is the buffer empty, i.e. has no TID values in the array? */
-static bool
+bool
 GinBufferIsEmpty(GinBuffer *buffer)
 {
-	return (buffer->nitems == 0);
+	return (buffer->nitems == 0 && buffer->cached == NULL);
 }
 
 /*
@@ -1320,37 +1318,71 @@ GinBufferIsEmpty(GinBuffer *buffer)
 static bool
 GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	int			r;
+	AttrNumber	attnum;
 	Datum		tupkey;
+	Datum		bufkey;
 
 	AssertCheckGinBuffer(buffer);
+	if (buffer->cached)
+	{
+		GinTuple   *cached = buffer->cached;
 
-	if (tup->attrnum != buffer->attnum)
-		return false;
+		if (tup->attrnum != cached->attrnum)
+			return false;
 
-	/* same attribute should have the same type info */
-	Assert(tup->typbyval == buffer->typbyval);
-	Assert(tup->typlen == buffer->typlen);
+		Assert(tup->typbyval == cached->typbyval);
+		Assert(tup->typlen == cached->typlen);
 
-	if (tup->category != buffer->category)
-		return false;
+		if (tup->category != cached->category)
+			return false;
 
-	/*
-	 * For NULL/empty keys, this means equality, for normal keys we need to
-	 * compare the actual key value.
-	 */
-	if (buffer->category != GIN_CAT_NORM_KEY)
-		return true;
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (cached->category != GIN_CAT_NORM_KEY)
+			return true;
+
+		attnum = cached->attrnum;
+		bufkey = _gin_parse_tuple_key(cached);
+	}
+	else
+	{
+		if (tup->attrnum != buffer->attnum)
+			return false;
+
+		/* same attribute should have the same type info */
+		Assert(tup->typbyval == buffer->typbyval);
+		Assert(tup->typlen == buffer->typlen);
+
+		if (tup->category != buffer->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (buffer->category != GIN_CAT_NORM_KEY)
+			return true;
+		attnum = buffer->attnum;
+		bufkey = buffer->key;
+	}
 
 	/*
 	 * For the tuple, get either the first sizeof(Datum) bytes for byval
 	 * types, or a pointer to the beginning of the data array.
 	 */
-	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+	tupkey = _gin_parse_tuple_key(tup);
+
+	prev = MemoryContextSwitchTo(buffer->context);
 
-	r = ApplySortComparator(buffer->key, false,
+	r = ApplySortComparator(bufkey, false,
 							tupkey, false,
-							&buffer->ssup[buffer->attnum - 1]);
+							&buffer->ssup[attnum - 1]);
+
+	MemoryContextSwitchTo(prev);
 
 	return (r == 0);
 }
@@ -1397,6 +1429,56 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 	return true;
 }
 
+static void
+GinBufferUnpackCached(GinBuffer *buffer, int reserve_space)
+{
+	Datum		key;
+	ItemPointer items;
+	GinTuple   *cached;
+	int			totitems;
+
+	cached = buffer->cached;
+	totitems = cached->nitems + reserve_space;
+	key = _gin_parse_tuple_key(cached);
+
+	buffer->category = cached->category;
+	buffer->keylen = cached->keylen;
+	buffer->attnum = cached->attrnum;
+
+	buffer->typlen = cached->typlen;
+	buffer->typbyval = cached->typbyval;
+
+	if (cached->category == GIN_CAT_NORM_KEY)
+		buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+	else
+		buffer->key = (Datum) 0;
+
+	items = _gin_parse_tuple_items(cached);
+
+	if (buffer->items == NULL)
+	{
+		buffer->items = palloc0(totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else if (buffer->maxitems < totitems)
+	{
+		buffer->items = repalloc(buffer->items,
+								 totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else
+	{
+		Assert(PointerIsValid(buffer->items) &&
+			   buffer->maxitems >= totitems);
+	}
+	memcpy(buffer->items, items, cached->nitems * sizeof(ItemPointerData));
+	buffer->nitems = cached->nitems;
+
+	buffer->cached = NULL;
+	pfree(cached);
+	pfree(items);
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1420,33 +1502,30 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * workers. But the workers merge the items as much as possible, so there
  * should not be too many.
  */
-static void
-GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+void
+GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	ItemPointerData *items;
-	Datum		key;
 
+	prev = MemoryContextSwitchTo(buffer->context);
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple_key(tup);
-	items = _gin_parse_tuple_items(tup);
-
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
 	{
-		buffer->category = tup->category;
-		buffer->keylen = tup->keylen;
-		buffer->attnum = tup->attrnum;
-
-		buffer->typlen = tup->typlen;
-		buffer->typbyval = tup->typbyval;
+		GinTuple   *tuple = palloc(tup->tuplen);
 
-		if (tup->category == GIN_CAT_NORM_KEY)
-			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
-		else
-			buffer->key = (Datum) 0;
+		memcpy(tuple, tup, tup->tuplen);
+		buffer->cached = tuple;
+	}
+	else if (buffer->cached != NULL)
+	{
+		GinBufferUnpackCached(buffer, tup->nitems);
 	}
 
+	items = _gin_parse_tuple_items(tup);
+
 	/*
 	 * Try freeze TIDs at the beginning of the list, i.e. exclude them from
 	 * the mergesort. We can do that with TIDs before the first TID in the new
@@ -1523,6 +1602,33 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	/* free the decompressed TID list */
 	pfree(items);
+
+	MemoryContextSwitchTo(prev);
+}
+
+GinTuple *
+GinBufferBuildTuple(GinBuffer *buffer)
+{
+	MemoryContext prev = MemoryContextSwitchTo(buffer->context);
+	GinTuple   *result;
+
+	if (buffer->cached)
+	{
+		result = buffer->cached;
+		buffer->cached = NULL;
+	}
+	else
+	{
+		result = _gin_build_tuple(buffer->attnum, buffer->category,
+								  buffer->key, buffer->typlen,
+								  buffer->typbyval, buffer->items,
+								  buffer->nitems);
+	}
+
+	GinBufferReset(buffer);
+
+	MemoryContextSwitchTo(prev);
+	return result;
 }
 
 /*
@@ -1534,14 +1640,21 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
  * But it's better to not let the array grow arbitrarily large, and enforce
  * work_mem as memory limit by flushing the buffer into the tuplestore.
  */
-static void
+void
 GinBufferReset(GinBuffer *buffer)
 {
 	Assert(!GinBufferIsEmpty(buffer));
 
-	/* release byref values, do nothing for by-val ones */
-	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
-		pfree(DatumGetPointer(buffer->key));
+	/* release cached buffer tuple, if present */
+	if (buffer->cached)
+		pfree(buffer->cached);
+	else
+	{
+		/* release byref values, do nothing for by-val ones */
+		if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval
+			&& PointerIsValid(DatumGetPointer(buffer->key)))
+			pfree(DatumGetPointer(buffer->key));
+	}
 
 	/*
 	 * Not required, but makes it more likely to trigger NULL dereference if
@@ -1557,6 +1670,7 @@ GinBufferReset(GinBuffer *buffer)
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
+	/* Note that we don't reset the memory context; this is deliberate */
 }
 
 /*
@@ -1580,7 +1694,7 @@ GinBufferTrim(GinBuffer *buffer)
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
  */
-static void
+void
 GinBufferFree(GinBuffer *buffer)
 {
 	if (buffer->items)
@@ -1591,6 +1705,7 @@ GinBufferFree(GinBuffer *buffer)
 		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
 		pfree(DatumGetPointer(buffer->key));
 
+	MemoryContextDelete(buffer->context);
 	pfree(buffer);
 }
 
@@ -1605,7 +1720,7 @@ GinBufferFree(GinBuffer *buffer)
  * for example). But in the leader we need to be careful not to force flushing
  * data too early, which might break the monotonicity of TID list.
  */
-static bool
+bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 {
 	/* empty buffer can accept data for any key */
@@ -1675,6 +1790,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1701,6 +1817,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * keep only the unfrozen tail in the buffer.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1714,7 +1831,10 @@ _gin_parallel_merge(GinBuildState *state)
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
@@ -1722,6 +1842,7 @@ _gin_parallel_merge(GinBuildState *state)
 	{
 		AssertCheckItemPointers(buffer);
 
+		Assert(!PointerIsValid(buffer->cached));
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
 					   buffer->items, buffer->nitems, &state->buildStats);
@@ -1771,144 +1892,6 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
-/*
- * _gin_process_worker_data
- *		First phase of the key merging, happening in the worker.
- *
- * Depending on the number of distinct keys, the TID lists produced by the
- * callback may be very short (due to frequent evictions in the callback).
- * But combining many tiny lists is expensive, so we try to do as much as
- * possible in the workers and only then pass the results to the leader.
- *
- * We read the tuples sorted by the key, and merge them into larger lists.
- * At the moment there's no memory limit, so this will just produce one
- * huge (sorted) list per key in each worker. Which means the leader will
- * do a very limited number of mergesorts, which is good.
- */
-static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-
-	GinBuffer  *buffer;
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The workers are limited to the same amount of memory as during the sort
-	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
-	 * during planning, just like there.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/* sort the raw per-worker data */
-	tuplesort_performsort(state->bs_worker_sort);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
-	 * merge them into larger chunks for the leader to combine.
-	 */
-	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
-	{
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * The buffer is for the same key, but we've hit the memory limit
-			 * - write the frozen part of the TID list into the shared
-			 * tuplesort, and keep only the unfrozen tail in the buffer.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append it to the existing data).
-		 */
-		GinBufferStoreTuple(buffer, tup);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		GinTuple   *ntup;
-		Size		ntuplen;
-
-		AssertCheckItemPointers(buffer);
-
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems, &ntuplen);
-
-		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-		pfree(ntup);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* release all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(worker_sort);
-}
-
 /*
  * Perform a worker's portion of a parallel GIN index build sort.
  *
@@ -1971,11 +1954,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													sortmem, coordinate,
 													TUPLESORT_NONE);
 
-	/* Local per-worker sort of raw-data */
-	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
-													  sortmem, NULL,
-													  TUPLESORT_NONE);
-
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -1989,13 +1967,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* write remaining accumulated entries */
 	ginFlushBuildState(state, index);
 
-	/*
-	 * Do the first phase of in-worker processing - sort the data produced by
-	 * the callback, and combine them into much larger chunks and place that
-	 * into the shared tuplestore for leader to process.
-	 */
-	_gin_process_worker_data(state, state->bs_worker_sort);
-
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
@@ -2150,8 +2121,7 @@ typedef struct
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
-				 ItemPointerData *items, uint32 nitems,
-				 Size *len)
+				 ItemPointerData *items, uint32 nitems)
 {
 	GinTuple   *tuple;
 	char	   *ptr;
@@ -2219,8 +2189,6 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	 */
 	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
-	*len = tuplen;
-
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2ef32d53a43..7f346325678 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -395,6 +395,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2244,6 +2245,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2369,6 +2372,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 4d3114076b3..4d75a097617 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "access/gin.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -90,6 +91,7 @@ static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
 static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
 							   SortTuple *stup);
+static void flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape);
 static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -101,6 +103,7 @@ static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static void freestate_cluster(Tuplesortstate *state);
+static void freestate_index_gin(Tuplesortstate *state);
 
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the CLUSTER case.  Set by
@@ -135,6 +138,16 @@ typedef struct
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
 } TuplesortIndexBTreeArg;
 
+/*
+ * Data structure pointed by "TuplesortPublic.arg" for the index_gin subcase.
+ */
+typedef struct
+{
+	TuplesortIndexArg index;
+	GinBuffer  *buffer;
+} TuplesortIndexGinArg;
+
+
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the index_hash subcase.
  */
@@ -209,6 +222,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -285,6 +299,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -393,6 +408,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -472,6 +488,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -516,6 +533,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -571,6 +589,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -587,6 +606,7 @@ tuplesort_begin_index_gin(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   sortopt);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg;
 	MemoryContext oldcontext;
 	int			i;
 	TupleDesc	desc = RelationGetDescr(indexRel);
@@ -611,6 +631,10 @@ tuplesort_begin_index_gin(Relation heapRel,
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
 										   sizeof(SortSupportData));
+	arg = palloc0(sizeof(TuplesortIndexGinArg));
+	arg->index.indexRel = indexRel;
+	arg->index.heapRel = heapRel;
+	arg->buffer = GinBufferInit(indexRel);
 
 	for (i = 0; i < base->nKeys; i++)
 	{
@@ -642,9 +666,11 @@ tuplesort_begin_index_gin(Relation heapRel,
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
 	base->writetup = writetup_index_gin;
+	base->flushwrites = flushwrites_index_gin;
 	base->readtup = readtup_index_gin;
+	base->freestate = freestate_index_gin;
 	base->haveDatum1 = false;
-	base->arg = NULL;
+	base->arg = arg;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -685,6 +711,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -887,17 +914,17 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 }
 
 void
-tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple)
 {
 	SortTuple	stup;
 	GinTuple   *ctup;
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
-	Size		tuplen;
+	Size		tuplen = tuple->tuplen;
 
 	/* copy the GinTuple into the right memory context */
-	ctup = palloc(size);
-	memcpy(ctup, tuple, size);
+	ctup = palloc(tuplen);
+	memcpy(ctup, tuple, tuplen);
 
 	stup.tuple = ctup;
 	stup.datum1 = (Datum) 0;
@@ -905,7 +932,7 @@ tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
 
 	/* GetMemoryChunkSpace is not supported for bump contexts */
 	if (TupleSortUseBumpTupleCxt(base->sortopt))
-		tuplen = MAXALIGN(size);
+		tuplen = MAXALIGN(tuplen);
 	else
 		tuplen = GetMemoryChunkSpace(ctup);
 
@@ -1925,19 +1952,63 @@ comparetup_index_gin(const SortTuple *a, const SortTuple *b,
 }
 
 static void
-writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+_writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, GinTuple *tup)
 {
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
-	GinTuple   *tuple = (GinTuple *) stup->tuple;
-	unsigned int tuplen = tuple->tuplen;
+	unsigned int tuplen = tup->tuplen;
 
 	tuplen = tuplen + sizeof(tuplen);
+
 	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
-	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	LogicalTapeWrite(tape, tup, tup->tuplen);
+
 	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 }
 
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *otup;
+	GinTuple   *ntup = (GinTuple *) stup->tuple;
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(PointerIsValid(arg));
+
+	if (GinBufferCanAddKey(arg->buffer, ntup))
+	{
+		GinBufferMergeTuple(arg->buffer, ntup);
+		return;
+	}
+
+	otup = GinBufferBuildTuple(arg->buffer);
+
+	_writetup_index_gin(state, tape, otup);
+
+	pfree(otup);
+
+	Assert(GinBufferCanAddKey(arg->buffer, ntup));
+
+	GinBufferMergeTuple(arg->buffer, ntup);
+}
+
+static void
+flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	if (!GinBufferIsEmpty(arg->buffer))
+	{
+		GinTuple   *tuple = GinBufferBuildTuple(arg->buffer);
+
+		_writetup_index_gin(state, tape, tuple);
+		pfree(tuple);
+		Assert(GinBufferIsEmpty(arg->buffer));
+	}
+}
+
 static void
 readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 				  LogicalTape *tape, unsigned int len)
@@ -1963,6 +2034,17 @@ readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = (Datum) 0;
 }
 
+static void
+freestate_index_gin(Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(arg != NULL);
+	Assert(GinBufferIsEmpty(arg->buffer));
+	GinBufferFree(arg->buffer);
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index dcd1ae3fc34..3faf6c80915 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -475,6 +475,9 @@ extern int	ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len, TI
 
 extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *segment, int len,
 												   int *ndecoded_out);
+extern bool ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+												ItemPointer into, int capacity,
+												int *ndecoded_out);
 extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded_out);
 extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
 										ItemPointerData *b, uint32 nb,
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index fb8f982a2a0..f4dbdfd3f7f 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -46,6 +46,16 @@ GinTupleGetFirst(GinTuple *tup)
 	return &list->first;
 }
 
+typedef struct GinBuffer GinBuffer;
+
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
 
+extern GinBuffer *GinBufferInit(Relation index);
+extern bool GinBufferIsEmpty(GinBuffer *buffer);
+extern bool GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup);
+extern void GinBufferReset(GinBuffer *buffer);
+extern void GinBufferFree(GinBuffer *buffer);
+extern void GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup);
+extern GinTuple *GinBufferBuildTuple(GinBuffer *buffer);
+
 #endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..64176b23cbe 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -195,6 +195,14 @@ typedef struct
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
 
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient use of
+	 * the tape's resources, e.g. when deduplicating or merging values.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
+
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
 	 * the already-read length of the stored tuple.  The tuple is allocated
@@ -461,7 +469,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
-extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, struct GinTuple *tuple);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 28d39648c04..99921597ca3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3020,6 +3020,7 @@ TuplesortClusterArg
 TuplesortDatumArg
 TuplesortIndexArg
 TuplesortIndexBTreeArg
+TuplesortIndexGinArg
 TuplesortIndexHashArg
 TuplesortInstrumentation
 TuplesortMethod
-- 
2.48.1
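
The writetup/flushwrites contract introduced here, in a nutshell: writetup
may absorb a tuple into a pending buffer instead of emitting it, and
flushwrites forces the pending tuple out before a run boundary. A
standalone sketch of that pattern follows (made-up types; the real code
operates on Tuplesortstate, LogicalTape and GinBuffer):

#include <stdio.h>
#include <string.h>

typedef struct
{
	char		key[32];		/* pending key, "" if nothing buffered */
	long		count;			/* merged payload for that key */
} pending_write;

static void
emit(FILE *tape, pending_write *p)
{
	fprintf(tape, "%s %ld\n", p->key, p->count);
	p->key[0] = '\0';
	p->count = 0;
}

/* writetup: merge runs of equal keys, emit only on a key change */
static void
writetup(FILE *tape, pending_write *p, const char *key, long value)
{
	if (p->key[0] != '\0' && strcmp(p->key, key) == 0)
	{
		p->count += value;		/* same key - merge, write nothing yet */
		return;
	}
	if (p->key[0] != '\0')
		emit(tape, p);			/* key changed - flush the pending tuple */
	snprintf(p->key, sizeof(p->key), "%s", key);
	p->count = value;
}

/* flushwrites: force out whatever is still pending */
static void
flushwrites(FILE *tape, pending_write *p)
{
	if (p->key[0] != '\0')
		emit(tape, p);
}

This is also why dumptuples() and mergeonerun() invoke FLUSHWRITES just
before writing the end-of-run marker: anything still pending belongs to
the run being closed.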

v20250216-0005-WIP-parallel-inserts-into-GIN-index.patch
From ccfccbf9858e89b5da9e6813599a45a11bf93260 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:03:15 +0100
Subject: [PATCH v20250216 5/5] WIP: parallel inserts into GIN index

---
 src/backend/access/gin/gininsert.c            | 415 ++++++++++++------
 .../utils/activity/wait_event_names.txt       |   2 +
 2 files changed, 286 insertions(+), 131 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 7414f1b6a74..67d1a512d0a 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -25,7 +25,9 @@
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "pgstat.h"
+#include "storage/barrier.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "storage/predicate.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/datum.h"
@@ -42,6 +44,11 @@
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
 
+/* The phases for parallel builds, used by build_barrier. */
+#define GIN_BUILD_INIT					0
+#define GIN_BUILD_SCAN					1
+#define GIN_BUILD_PARTITION				2
+
 /*
  * Status for index builds performed in parallel.  This is allocated in a
  * dynamic shared memory segment.
@@ -88,6 +95,9 @@ typedef struct GinBuildShared
 	double		reltuples;
 	double		indtuples;
 
+	Barrier		build_barrier;
+	SharedFileSet fileset;		/* space for shared temporary files */
+
 	/*
 	 * ParallelTableScanDescData data follows. Can't directly embed here, as
 	 * implementations of the parallel table scan desc interface might need
@@ -172,7 +182,6 @@ static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relati
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
 static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
-static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
 											  Relation heap, Relation index);
 static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
@@ -188,6 +197,12 @@ static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems);
 
+static double _gin_partition_sorted_data(GinBuildState *state);
+static void _gin_parallel_insert(GinBuildState *state,
+								 GinBuildShared *ginshared,
+								 Relation heap, Relation index,
+								 bool progress);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -706,8 +721,12 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
 
-		/* scan the relation in parallel and merge per-worker results */
-		reltuples = _gin_parallel_merge(state);
+		/* partition the sorted data */
+		reltuples = _gin_partition_sorted_data(state);
+
+		/* do the insert for the leader's partition */
+		_gin_parallel_insert(state, state->bs_leader->ginshared,
+							 heap, index, true);
 
 		_gin_end_parallel(state->bs_leader, state);
 	}
@@ -996,6 +1015,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	ginshared->reltuples = 0.0;
 	ginshared->indtuples = 0.0;
 
+	/* used to wait for data to insert */
+	BarrierInit(&ginshared->build_barrier, scantuplesortstates);
+
+	/* Set up the space we'll use for shared temporary files. */
+	SharedFileSetInit(&ginshared->fileset, pcxt->seg);
+
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
 								  snapshot);
@@ -1063,6 +1088,11 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned, leader continues");
 }
 
 /*
@@ -1076,6 +1106,8 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	/* Shutdown worker processes */
 	WaitForParallelWorkersToFinish(ginleader->pcxt);
 
+	SharedFileSetDeleteAll(&ginleader->ginshared->fileset);
+
 	/*
 	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
 	 * or we might get incomplete data.)
@@ -1731,134 +1763,6 @@ GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 	return GinBufferKeyEquals(buffer, tup);
 }
 
-/*
- * Within leader, wait for end of heap scan and merge per-worker results.
- *
- * After waiting for all workers to finish, merge the per-worker results into
- * the complete index. The results from each worker are sorted by block number
- * (start of the page range). While combining the per-worker results we merge
- * summaries for the same page range, and also fill-in empty summaries for
- * ranges without any tuples.
- *
- * Returns the total number of heap tuples scanned.
- */
-static double
-_gin_parallel_merge(GinBuildState *state)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-	double		reltuples = 0;
-	GinBuffer  *buffer;
-
-	/* wait for workers to scan table and produce partial results */
-	reltuples = _gin_parallel_heapscan(state);
-
-	/* do the actual sort in the leader */
-	tuplesort_performsort(state->bs_sortstate);
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The leader is allowed to use the whole maintenance_work_mem buffer to
-	 * combine data. The parallel workers already completed.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by category and
-	 * key. That probably gives us order matching how data is organized in the
-	 * index.
-	 *
-	 * We don't insert the GIN tuples right away, but instead accumulate as
-	 * many TIDs for the same key as possible, and then insert that at once.
-	 * This way we don't need to decompress/recompress the posting lists, etc.
-	 */
-	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
-	{
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nitems, &state->buildStats);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * The buffer is for the same key, but we've hit the memory limit
-			 * - insert the frozen part of the TID list into the index, and
-			 * keep only the unfrozen tail in the buffer.
-			 */
-			AssertCheckItemPointers(buffer);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nfrozen, &state->buildStats);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append it to the existing data).
-		 */
-		GinBufferMergeTuple(buffer, tup);
-
-		if (buffer->cached)
-			GinBufferUnpackCached(buffer, 0);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		AssertCheckItemPointers(buffer);
-
-		Assert(!PointerIsValid(buffer->cached));
-		ginEntryInsert(&state->ginstate,
-					   buffer->attnum, buffer->key, buffer->category,
-					   buffer->items, buffer->nitems, &state->buildStats);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* release all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(state->bs_sortstate);
-
-	return reltuples;
-}
-
 /*
  * Returns size of shared memory required to store state for a parallel
  * gin index build based on the snapshot its parallel scan will use.
@@ -2071,6 +1975,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Prepare to track buffer usage during parallel execution */
 	InstrStartParallelQuery();
 
+	/* attach to the fileset too */
+	SharedFileSetAttach(&ginshared->fileset, seg);
+
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
 	 * (when requested number of workers were not launched, this will be
@@ -2081,6 +1988,20 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
 
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned by workers, leader continues");
+
+	/* leader sorts and partitions the data */
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned by leader, worker continues");
+
+	_gin_parallel_insert(&buildstate, ginshared, heapRel, indexRel, false);
+
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
@@ -2352,3 +2273,235 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	return ItemPointerCompare(GinTupleGetFirst(a),
 							  GinTupleGetFirst(b));
 }
+
+static double
+_gin_partition_sorted_data(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	GinBuildShared *shared = state->bs_leader->ginshared;
+	BufFile   **files;
+	int64		fileidx = 0;
+	double		reltuples;
+
+	/* how many tuples per worker */
+	int64		worker_tuples = (state->indtuples / shared->scantuplesortstates) + 1;
+	int64		remaining = Min(worker_tuples, 1000);
+	int64		ntmp = 0;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* Allocate BufFiles, one for each participant. */
+	files = palloc0_array(BufFile *, shared->scantuplesortstates);
+
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		char		fname[MAXPGPATH];
+
+		sprintf(fname, "worker-%d", i);
+
+		files[i] = BufFileCreateFileSet(&shared->fileset.fs, fname);
+	}
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us order matching how data is organized in the
+	 * index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		ntmp++;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * FIXME Maybe move to next partition only when the index key changes?
+		 * Otherwise we might have issues with 'could not fit onto page' when
+		 * adding overlapping TID lists to the index. But maybe that can't
+		 * happen, given the merging of data in the tuplesort?
+		 */
+
+		BufFileWrite(files[fileidx], &tuplen, sizeof(tuplen));
+		BufFileWrite(files[fileidx], tup, tuplen);
+
+		remaining--;
+
+		/* move to the next file */
+		if (remaining == 0)
+		{
+			remaining = Min(worker_tuples, 1000);
+			fileidx++;
+			fileidx = fileidx % shared->scantuplesortstates;
+		}
+	}
+
+	/* close the files */
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		BufFileClose(files[i]);
+	}
+
+	/* and also close the tuplesort */
+	tuplesort_end(state->bs_sortstate);
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&shared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned, leader continues");
+
+	return reltuples;
+}
+
+static void
+_gin_parallel_insert(GinBuildState *state, GinBuildShared *ginshared,
+					 Relation heap, Relation index, bool progress)
+{
+	GinBuffer  *buffer;
+	GinTuple   *tup;
+	Size		len;
+
+	BufFile    *file;
+	char		fname[MAXPGPATH];
+	char	   *buff;
+	int64		ntuples = 0;
+	Size		maxlen;
+
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * Each participant (leader or worker) runs this for its own partition,
+	 * after the shared sort has completed.
+	 */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	sprintf(fname, "worker-%d", ParallelWorkerNumber + 1);
+	file = BufFileOpenFileSet(&ginshared->fileset.fs, fname, O_RDONLY, false);
+
+	/* 8kB seems like a reasonable starting point */
+	maxlen = 8192;
+	buff = palloc(maxlen);
+
+	while (true)
+	{
+		size_t		ret;
+
+		ret = BufFileRead(file, &len, sizeof(len));
+
+		if (ret == 0)
+			break;
+		if (ret != sizeof(len))
+			elog(ERROR, "incorrect data %zu %zu", ret, sizeof(len));
+
+		/* maybe resize the buffer */
+		if (maxlen < len)
+		{
+			while (maxlen < len)
+				maxlen *= 2;
+
+			buff = repalloc(buff, maxlen);
+		}
+
+		tup = (GinTuple *) buff;
+
+		BufFileReadExact(file, tup, len);
+
+		ntuples++;
+
+		if (ntuples % 100000 == 0)
+			elog(LOG, "inserted " INT64_FORMAT " tuples", ntuples);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * The buffer is for the same key, but we've hit the memory limit
+			 * - insert the frozen part of the TID list into the index, and
+			 * keep only the unfrozen tail in the buffer.
+			 */
+			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer);
+
+		Assert(!PointerIsValid(buffer->cached));
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	BufFileClose(file);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..afb9be848a0 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -116,6 +116,8 @@ CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from star
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
 CHECKPOINT_START	"Waiting for a checkpoint to start."
 EXECUTE_GATHER	"Waiting for activity from a child process while executing a <literal>Gather</literal> plan node."
+GIN_BUILD_PARTITION	"Waiting for the leader to partition data during a parallel GIN index build."
+GIN_BUILD_SCAN	"Waiting for participants to finish scanning data during a parallel GIN index build."
 HASH_BATCH_ALLOCATE	"Waiting for an elected Parallel Hash participant to allocate a hash table."
 HASH_BATCH_ELECT	"Waiting to elect a Parallel Hash participant to allocate a hash table."
 HASH_BATCH_LOAD	"Waiting for other Parallel Hash participants to finish loading a hash table."
-- 
2.48.1
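
To make the partitioning step concrete: _gin_partition_sorted_data deals
the sorted tuple stream into per-participant files in chunks of up to
1000 tuples (the patch caps each chunk at Min(per-participant share,
1000)), rotating through the files round-robin. A standalone sketch of
that loop (plain stdio instead of BufFile; made-up names, not the patch
code):

#include <stdio.h>
#include <stddef.h>

#define CHUNK_SIZE 1000

static void
partition_stream(FILE **files, int nfiles,
				 const void **tuples, const size_t *lens, int ntuples)
{
	int			fileidx = 0;
	int			remaining = CHUNK_SIZE;

	for (int i = 0; i < ntuples; i++)
	{
		/* length prefix, then the tuple itself, as in the patch */
		fwrite(&lens[i], sizeof(size_t), 1, files[fileidx]);
		fwrite(tuples[i], lens[i], 1, files[fileidx]);

		/* after a full chunk, move to the next participant's file */
		if (--remaining == 0)
		{
			remaining = CHUNK_SIZE;
			fileidx = (fileidx + 1) % nfiles;
		}
	}
}

As the FIXME in the patch notes, rotating files mid-key means one key's
TID lists can end up split across participants; switching partitions only
on a key change would avoid that.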

#48Tomas Vondra
tomas@vondra.me
In reply to: Matthias van de Meent (#45)
Re: Parallel CREATE INDEX for GIN indexes

On 2/12/25 15:59, Matthias van de Meent wrote:

On Tue, 7 Jan 2025 at 12:59, Tomas Vondra <tomas@vondra.me> wrote:

...

I haven't done anything about this, but I'm not sure adding the number
of GIN tuples to pg_stat_progress_create_index would be very useful. We
don't know the total number of entries, so it can't show the progress.

For btree scans, we update the number of to-be-inserted tuples
together with the number of blocks scanned. Can we do something
similar with GIN?

I've been thinking about this, but I'm not quite sure how that should
work. The problem is that in btree we have a 1:1 mapping to heap tuples,
but in GIN it's not that simple. Not only do we generate multiple GIN
entries for each heap row, we also combine / merge those tuples at
various levels.

But I think it might look like this:

1) Each worker counts the number of GinTuples written to the shared
tuplesort, after the in-worker merge phase (i.e. it would not be the number
of GIN entries generated in ginBuildCallbackParallel).

2) The leader then counts the number of entries it loaded from the
tuplesort, before merging/writing them into the index.

I think this would work as a measure of progress, even though it does
not really match the number of index tuples.
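
A rough sketch of what I mean (the shared counter is a hypothetical
field, and we may need a new progress parameter instead of reusing
PROGRESS_CREATEIDX_TUPLES_DONE):

/* worker: after writing a GinTuple to the shared tuplesort */
SpinLockAcquire(&ginshared->mutex);
ginshared->ntuplesmerged++;	/* hypothetical counter in GinBuildShared */
SpinLockRelease(&ginshared->mutex);

/* leader: while reading the merged entries back from the tuplesort */
pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
                             ++ntuplesdone);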

One thing I'm not sure about is how this would work with the "single
tuplesort" patch. That patch moves the merging to the tuplesort code,
and there doesn't seem to be a nice way to pass the number of merged
tuples outside.

Can we track data for pg_stat_progress_create_index?

Which data? I think progress for the CREATE INDEX would be nice, ofc.

regards

--
Tomas Vondra

#49Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#47)
Re: Parallel CREATE INDEX for GIN indexes

On Sun, 16 Feb 2025 at 04:47, Tomas Vondra <tomas@vondra.me> wrote:

Hi,

Attached is a cleaned up version of the patch series, squashed into
fewer patches as discussed. I also went through all the comments, and
removed/updated some obsolete ones. I also updated the commit messages,
it'd be nice if someone could read through those, to make sure it's
clear enough.

While cleaning the comments, I realized there's a couple remaining XXX
and FIXME comments, with some valid open questions.

1) There are two places that explicitly zero memory, suggesting it's
because of padding causing issues in valgrind (in tuplesort). I need to
check if that's still true, but I wonder how the other tuplesort
variants manage to write stuff without tripping valgrind. Maybe the GinTuple is
too unique.
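
The pattern in question is roughly this (just a sketch):

/* zero the whole chunk, so any padding bytes are defined when the
 * tuple later gets written out by the tuplesort */
tup = palloc0(tuplen);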

2) ginBuildCallbackParallel says this about memory limits:

* XXX It might seem this should set the memory limit to 32MB, same as
* what plan_create_index_workers() uses to calculate the number of
* parallel workers, but that's the limit for tuplesort. So it seems
* better to keep using work_mem here.
*
* XXX But maybe we should calculate this as a per-worker fraction of
* maintenance_work_mem. It's weird to use work_mem here, in a clearly
* maintenance command.

The function uses work_mem to limit the amount of memory used by each
worker, which seems a bit strange - it's a maintenance operation, so it
would be more appropriate to use maintenance_work_mem I guess.

I see the btree code also uses work_mem in some cases when building the
index, although that uses it to size the tuplesort. And here we have
both the tuplesorts (sized just like in nbtree code), but also the
buffer used to accumulate entries.

I think that's a bug in btree code.

I wonder if maybe the right solution would be to use half the allowance
for tuplesort and half for the buffer. In the workers the allowance is

maintenance_work_mem / ginleader->nparticipanttuplesorts

while in the leader it's maintenance_work_mem. Opinions?

Why is the allowance in the leader not affected by memory usage of
parallel workers? Shouldn't that also be m_w_m / nparticipants?

IIRC, in nbtree, the leader will use (n_planned_part -
n_launched_part) * (m_w_m / n_planned_part), which in practice is 1 *
(m_w_m / n_planned_part).
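
(E.g. with m_w_m = 64MB and 4 planned but 3 launched participants,
that's 1 * (64MB / 4) = 16MB for the leader's sort.)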

3) There's a XXX comment suggesting to use a separate memory context for
the GinBuffer, but I decided it doesn't seem really necessary. We're not
running any complex function or anything like that in this code, so I
don't see a huge benefit of a separate context.

I know the patch reworking this to use a single tuplesort actually adds
the memory context, maybe it's helpful for that patch. But for now I
don't see the point.

I think I added that for the reduced cost of memory cleanup using mctx
resets vs repeated pfree(), when we're in the tuplesort merge phase.

4) The patch saving 12B in the GinTuple also added this comment:

* XXX: Update description with new architecture

but I'm a bit unsure what exactly is meant by "architecture" or what
I should add a description for.

I worked on both 'smaller GinTuple' and the 'single tuplesort' patch
in one go, without any intermediate commits to delineate work on
either patch, and picked the changes for 'smaller GinTuple' from that
large pile of changes. That XXX was supposed to go into the second
patch, and was there to signal me to update the overarching
architecture documentation for Parallel GIN index builds (which I
subsequently forgot to do with the 'single tuplesort' patch). So, it's
not relevant to the 12B patch, but could be relevant to what is now
patch 0004.

Kind regards,

Matthias van de Meent

#50Tomas Vondra
tomas@vondra.me
In reply to: Matthias van de Meent (#49)
6 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

On 2/17/25 14:01, Matthias van de Meent wrote:

On Sun, 16 Feb 2025 at 04:47, Tomas Vondra <tomas@vondra.me> wrote:

Hi,

Attached is a cleaned up version of the patch series, squashed into
fewer patches as discussed. I also went through all the comments, and
removed/updated some obsolete ones. I also updated the commit messages,
it'd be nice if someone could read through those, to make sure it's
clear enough.

While cleaning the comments, I realized there's a couple remaining XXX
and FIXME comments, with some valid open questions.

1) There are two places that explicitly zero memory, suggesting it's
because of padding causing issues in valgrind (in tuplesort). I need to
check if that's still true, but I wonder how the other tuplesort
variants manage to write stuff without tripping valgrind. Maybe the GinTuple is
too unique.

2) ginBuildCallbackParallel says this about memory limits:

* XXX It might seem this should set the memory limit to 32MB, same as
* what plan_create_index_workers() uses to calculate the number of
* parallel workers, but that's the limit for tuplesort. So it seems
* better to keep using work_mem here.
*
* XXX But maybe we should calculate this as a per-worker fraction of
* maintenance_work_mem. It's weird to use work_mem here, in a clearly
* maintenance command.

The function uses work_mem to limit the amount of memory used by each
worker, which seems a bit strange - it's a maintenance operation, so it
would be more appropriate to use maintenance_work_mem I guess.

I see the btree code also uses work_mem in some cases when building the
index, although that uses it to size the tuplesort. And here we have
both the tuplesorts (sized just like in nbtree code), but also the
buffer used to accumulate entries.

I think that's a bug in btree code.

Not sure I'd call it a bug, but it's certainly a bit confusing. A
maintenance operation using a mix of maintenance_work_mem and work_mem
seems a bit unexpected.

I kinda understand the logic (which predates parallel builds), based on
the assumption that [m_w_m >> w_m]. Which is generally true, but with
parallel builds this is [(m_w_m/k) >> w_m], and that's less likely. The
code actually uses Min(sortmem, work_mem), so it'll likely use sortmem
anyway, and the whole work_mem discussion seems a bit moot.

Anyway, it's not my ambition to rethink this part of nbtree builds. It's
a long-standing behavior, no one ever complained about it (AFAIK).

For GIN I propose to do roughly what the attached 0002 does - split the
sortmem and use half for the tuplesort, half for accumulating entries.
That's the best idea I have, and by default this will use ~16MB for each
(because of how plan_create_index_workers picks worker count), which
seems reasonable. I can imagine being smarter and capping the tuplesort
memory (to give preference to accumulating more entries in the buffer
before having to flush them into the tuplesort).
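
In code the split would be roughly this (a sketch, names made up - the
attached 0002 has the details):

/* per-worker allowance (in kB), split between tuplesort and buffer */
sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
tuplesort_mem = sortmem / 2;
buffer_mem = sortmem - tuplesort_mem;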

There are ways to force more clients than would be allowed by the usual
logic in plan_create_index_workers (e.g. by setting it for the relation)
but I'd say that's user's choice.

I wonder if maybe the right solution would be to use half the allowance
for tuplesort and half for the buffer. In the workers the allowance is

maintenance_work_mem / ginleader->nparticipanttuplesorts

while in the leader it's maintenance_work_mem. Opinions?

Why is the allowance in the leader not affected by memory usage of
parallel workers? Shouldn't that also be m_w_m / nparticipants?

IIRC, in nbtree, the leader will use (n_planned_part -
n_launched_part) * (m_w_m / n_planned_part), which in practice is 1 *
(m_w_m / n_planned_part).

Sorry, I didn't write it very clearly. For the parallel part (when
acting as a worker), the leader will use the same amount of memory as
any other worker.

What I meant is that the shared tuplesort is allocated like this:

state->bs_sortstate =
tuplesort_begin_index_gin(heap, index,
maintenance_work_mem, coordinate,
TUPLESORT_NONE);

so the final sort will use m_w_m. But that's fine, the leader does not
need to accumulate a lot of entries anyway - it'll keep a single key and
then flush it when the TID list gets too long.

3) There's a XXX comment suggesting to use a separate memory context for
the GinBuffer, but I decided it doesn't seem really necessary. We're not
running any complex function or anything like that in this code, so I
don't see a huge benefit of a separate context.

I know the patch reworking this to use a single tuplesort actually adds
the memory context, maybe it's helpful for that patch. But for now I
don't see the point.

I think I added that for the reduced cost of memory cleanup using mctx
resets vs repeated pfree(), when we're in the tuplesort merge phase.

Hmm ... I don't think I've ever seen this as a very expensive part of
the build. It's not like we have a huge number of pointers to free,
pretty much just the "items" array of TIDs, right? But that'll be either
small (and we'll just shove it into the cache), or large (and then we'll
do a "full" free, swamping any other costs).

I plan to do nothing about this for now. We may rethink later, if it
ever happens to be an issue.

4) The patch saving 12B in the GinTuple also added this comment:

* XXX: Update description with new architecture

but I'm a bit unsure what exactly is meant by "architecture" or what
I should add a description for.

I worked on both 'smaller GinTuple' and the 'single tuplesort' patch
in one go, without any intermediate commits to delineate work on
either patch, and picked the changes for 'smaller GinTuple' from that
large pile of changes. That XXX was supposed to go into the second
patch, and was there to signal me to update the overarching
architecture documentation for Parallel GIN index builds (which I
subsequently forgot to do with the 'single tuplesort' patch). So, it's
not relevant to the 12B patch, but could be relevant to what is now
patch 0004.

OK, thanks. I've removed the comment from the 0001 patch.

Also, while stress-testing the patches, I ran into a bug in the part

WIP: parallel inserts into GIN index

The patch adds a barrier into _gin_begin_parallel(), and it initializes
it like this:

BarrierInit(&ginshared->build_barrier, scantuplesortstates);

The trouble is this is before we launch the workers, and the count is
just what we ask for - but we may not actually get that many workers. In
which case the BarrierArriveAndWait() at the end hangs forever.
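
If the barrier stays, I guess the count would have to be set only once
we know how many workers actually launched, something like (untested
sketch):

BarrierInit(&ginshared->build_barrier,
            pcxt->nworkers_launched + (leaderparticipates ? 1 : 0));

But that has a race of its own - a worker might reach the barrier before
the leader re-initializes it - so maybe each participant should
BarrierAttach() instead.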

Why does this even need the barrier? Doesn't _gin_parallel_heapscan() do
exactly the wait this is meant to do?

regards

--
Tomas Vondra

Attachments:

v20250217-0001-Allow-parallel-CREATE-INDEX-for-GIN-indexe.patch (text/x-patch)
From e7b8856976d142c76058afa4b336fc49b497a25e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 20:57:52 +0100
Subject: [PATCH v20250217 1/6] Allow parallel CREATE INDEX for GIN indexes

Allow using multiple worker processes to build a GIN index, similarly to
BTREE and BRIN indexes. For large tables this may result in significant
speedup when the build is CPU-bound.

The work is divided so that each worker builds index entries on a subset
of the table, determined by the regular parallel scan used to read the
data. Each worker uses a local tuplesort to sort and merge the entries
for the same key. The entries are then written into a shared tuplesort.
Finally, the leader merges entries from all workers, and writes them
into the index.

This minimizes the amount of sorting and merging that needs to happen
in the leader process. The entries still need to be merged, but by doing
most of that in workers means it's parallelized. The leader then needs
to merge fewer large entries, which is cheaper / more efficient.

The workers build entries so that the TID lists do not overlap (for a
given key), so merging them is simply appending one list to the other.
In the leader, where lists from different workers may interleave, a
full mergesort is needed.

Most of the parallelism infrastructure is a simplified copy of the code
used by BTREE indexes, omitting the parts irrelevant for GIN indexes
(e.g. uniqueness checks).

Original patch by me, with reviews and substantial reworks by Matthias
van de Meent, certainly enough to make him a co-author.

Author: Tomas Vondra, Matthias van de Meent
Reviewed-by: Matthias van de Meent
Discussion:
---
 src/backend/access/gin/gininsert.c         | 1611 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  200 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   50 +
 src/include/utils/tuplesort.h              |    8 +
 src/tools/pgindent/typedefs.list           |    4 +
 8 files changed, 1868 insertions(+), 15 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d1b5e8f0dd1..2a44482b4be 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,126 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/sortsupport.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinBuildShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all following fields
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinBuildShared;
+
+/*
+ * Return pointer to a GinBuildShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinBuildShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinBuildShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinBuildShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +144,57 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process.
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate used only within a single worker for the first merge pass
+	 * happening there. In principle it doesn't need to be part of the build
+	 * state and we could pass it around directly, but it's more convenient
+	 * this way. And it's part of the build state, after all.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinBuildShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +473,122 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginFlushBuildState
+ *		Write all data from BuildAccumulator into the tuplesort.
+ */
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
+/*
+ * ginBuildCallbackParallel
+ *		Callback for the parallel index build.
+ *
+ * This is similar to the serial build callback ginBuildCallback, but
+ * instead of writing the accumulated entries into the index, each worker
+ * writes them into a (local) tuplesort.
+ *
+ * The worker then sorts and combines these entries, before writing them
+ * into a shared tuplesort for the leader (see _gin_parallel_scan_and_build
+ * for the whole process).
+ */
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	/*
+	 * if scan wrapped around - flush accumulated entries and start anew
+	 *
+	 * With parallel scans, we don't have a guarantee the scan does not start
+	 * half-way through the relation (serial builds disable sync scans and
+	 * always start from block 0, parallel scans require allow_sync=true).
+	 *
+	 * Building the posting lists assumes the TIDs are monotonic and never go
+	 * back, and the wrap around would break that. We handle that by detecting
+	 * the wraparound, and flushing all entries. This means we'll later see
+	 * two separate entries with non-overlapping TID lists (which can be
+	 * combined by merge sort).
+	 *
+	 * To detect a wraparound, we remember the last TID seen by each worker
+	 * (for any key). If the next TID seen by the worker is lower, the scan
+	 * must have wrapped around.
+	 */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+		ginFlushBuildState(buildstate, index);
+
+	/* remember the TID we're about to process */
+	buildstate->tid = *tid;
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort.
+	 *
+	 * XXX It might seem this should set the memory limit to 32MB, same as
+	 * what plan_create_index_workers() uses to calculate the number of
+	 * parallel workers, but that's the limit for tuplesort. So it seems
+	 * better to keep using work_mem here.
+	 *
+	 * XXX But maybe we should calculate this as a per-worker fraction of
+	 * maintenance_work_mem. It's weird to use work_mem here, in a clearly
+	 * maintenance command.
+	 */
+	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+		ginFlushBuildState(buildstate, index);
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +606,16 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * Initialize all the fields, not to trip valgrind.
+	 *
+	 * XXX Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,24 +656,91 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. For GIN that's
+	 * reasonable too, because we sort the data just like btree. It does
+	 * ignore the memory used to accumulate data in memory (set by work_mem),
+	 * but there is no way to communicate that to plan_create_index_workers.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, &buildstate, NULL);
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
+	 */
+	if (state->bs_leader)
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(heap, index,
+									  maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
+
+		/* scan the relation in parallel and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
+
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
+	{
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, &buildstate, NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -533,3 +880,1239 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinBuildShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinBuildShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinBuildShared *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+/*
+ * Buffer used to accumulate TIDs from multiple GinTuples for the same key
+ * (we read these from the tuplesort, sorted by the key).
+ *
+ * This is similar to BuildAccumulator in that it's used to collect TIDs
+ * in memory before inserting them into the index, but it's much simpler
+ * as it only deals with a single index key at a time.
+ *
+ * When adding TIDs to the buffer, we make sure to keep them sorted, both
+ * during the initial table scan (and detecting when the scan wraps around),
+ * and during merging (where we do mergesort).
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	SortSupport ssup;			/* for sorting/comparing keys */
+	ItemPointerData *items;
+} GinBuffer;
+
+/*
+ * Check that TID array contains valid values, and that it's sorted (if we
+ * expect it to be).
+ */
+static void
+AssertCheckItemPointers(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&buffer->items[i]));
+
+		/* don't check ordering for the first TID item */
+		if (i == 0)
+			continue;
+
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
+	}
+#endif
+}
+
+/*
+ * GinBuffer checks
+ *
+ * XXX Maybe it would be better to have AssertCheckGinBuffer with flags, instead
+ * of having to call AssertCheckItemPointers in some places, if we require the
+ * items to not be empty?
+ */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* if we have any items, the array must exist */
+	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+
+	/*
+	 * The buffer may be empty, in which case we must not call the check of
+	 * item pointers, because that assumes non-emptiness.
+	 */
+	if (buffer->nitems == 0)
+		return;
+
+	/* Make sure the item pointers are valid and sorted. */
+	AssertCheckItemPointers(buffer);
+#endif
+}
+
+/*
+ * GinBufferInit
+ *		Initialize buffer to store tuples for a GIN index.
+ *
+ * Initialize the buffer used to accumulate TID for a single key at a time
+ * (we process the data sorted), so we know when we received all data for
+ * a given key.
+ *
+ * Initializes sort support procedures for all index attributes.
+ */
+static GinBuffer *
+GinBufferInit(Relation index)
+{
+	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
+	int			i,
+				nKeys;
+	TupleDesc	desc = RelationGetDescr(index);
+
+	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
+
+	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
+
+	/*
+	 * Lookup ordering operator for the index key data type, and initialize
+	 * the sort support function.
+	 */
+	for (i = 0; i < nKeys; i++)
+	{
+		Oid			cmpFunc;
+		SortSupport sortKey = &buffer->ssup[i];
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = index->rd_indcollation[i];
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		/*
+		 * If the compare proc isn't specified in the opclass definition, look
+		 * up the index key type's default btree comparator.
+		 */
+		cmpFunc = index_getprocid(index, i + 1, GIN_COMPARE_PROC);
+		if (cmpFunc == InvalidOid)
+		{
+			TypeCacheEntry *typentry;
+
+			typentry = lookup_type_cache(att->atttypid,
+										 TYPECACHE_CMP_PROC_FINFO);
+			if (!OidIsValid(typentry->cmp_proc_finfo.fn_oid))
+				ereport(ERROR,
+						(errcode(ERRCODE_UNDEFINED_FUNCTION),
+						 errmsg("could not identify a comparison function for type %s",
+								format_type_be(att->atttypid))));
+
+			cmpFunc = typentry->cmp_proc_finfo.fn_oid;
+		}
+
+		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
+	}
+
+	return buffer;
+}
+
+/* Is the buffer empty, i.e. has no TID values in the array? */
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * GinBufferKeyEquals
+ *		Can the buffer store TIDs for the provided GIN tuple (same key)?
+ *
+ * Compare if the tuple matches the already accumulated data in the GIN
+ * buffer. Compare scalar fields first, before the actual key.
+ *
+ * Returns true if the key matches, and the TIDs belong in the buffer, or
+ * false if the key does not match.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	int			r;
+	Datum		tupkey;
+
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	/*
+	 * For NULL/empty keys, this means equality, for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * For the tuple, get either the first sizeof(Datum) bytes for byval
+	 * types, or a pointer to the beginning of the data array.
+	 */
+	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+
+	r = ApplySortComparator(buffer->key, false,
+							tupkey, false,
+							&buffer->ssup[buffer->attnum - 1]);
+
+	return (r == 0);
+}
+
+/*
+ * GinBufferStoreTuple
+ *		Add data (especially TID list) from a GIN tuple to the buffer.
+ *
+ * The buffer is expected to be empty (in which case it's initialized), or
+ * having the same key. The TID values from the tuple are combined with the
+ * stored values using a merge sort.
+ *
+ * The tuples (for the same key) are expected to be sorted by first TID. But
+ * this does not guarantee the lists do not overlap, especially in the leader,
+ * because the workers process interleaving data. There should be no overlaps
+ * in a single worker - it could happen when the parallel scan wraps around,
+ * but we detect that and flush the data (see ginBuildCallbackParallel).
+ *
+ * By sorting the GinTuples not only by key, but also by the first TID, we
+ * make it less likely the lists will overlap during merge. We can merge them
+ * using mergesort, but it's cheaper to just append one list to the other.
+ *
+ * How often can the lists overlap? There should be no overlaps in workers,
+ * and in the leader we can see overlaps between lists built by different
+ * workers. But the workers merge the items as much as possible, so there
+ * should not be too many.
+ */
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* add the new TIDs into the buffer, combine using merge-sort */
+	{
+		int			nnew;
+		ItemPointer new;
+
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
+
+		Assert(nnew == buffer->nitems + tup->nitems);
+
+		if (buffer->items)
+			pfree(buffer->items);
+
+		buffer->items = new;
+		buffer->nitems = nnew;
+
+		AssertCheckItemPointers(buffer);
+	}
+}
+
+/*
+ * GinBufferReset
+ *		Reset the buffer into a state as if it contains no data.
+ *
+ * XXX Should we do something if the array of TIDs gets too large? It may
+ * grow too much, and we'll not free it until the worker finishes building.
+ * But it's better to not let the array grow arbitrarily large, and enforce
+ * work_mem as a memory limit by flushing the buffer into the tuplesort.
+ */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/*
+	 * Not required, but makes it more likely to trigger a NULL dereference
+	 * if the value gets used incorrectly, etc.
+	 */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+}
+
+/*
+ * GinBufferFree
+ *		Release memory associated with the GinBuffer (including TID array).
+ */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
+/*
+ * GinBufferCanAddKey
+ *		Check if a given GIN tuple can be added to the current buffer.
+ *
+ * Returns true if the buffer is either empty or for the same index key.
+ *
+ * XXX This could / should also enforce a memory limit by checking the size of
+ * the TID array, and returning false if it's too large (more than work_mem,
+ * for example). But in the leader we need to be careful not to force flushing
+ * data too early, which might break the monotonicity of the TID list.
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, merge the per-worker results into
+ * the complete index. The results from each worker are sorted by the index
+ * key (the same order the shared tuplesort produces). While combining the
+ * per-worker results we merge the TID lists for the same key, and write
+ * the combined entries into the index.
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us an order matching how the data is
+	 * organized in the index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short (due to frequent evictions in the callback).
+ * But combining many tiny lists is expensive, so we try to do as much as
+ * possible in the workers and only then pass the results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker. Which means the leader will
+ * do a very limited number of mergesorts, which is good.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
+	 * merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		AssertCheckItemPointers(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
+/*
+ * Perform a worker's portion of a parallel GIN index build sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ *
+ * Before feeding data into a shared tuplesort (for the leader process),
+ * the workers process data in two phases.
+ *
+ * 1) A worker reads a portion of rows from the table, accumulates entries
+ * in memory, and flushes them into a private tuplesort (e.g. because of
+ * using too much memory).
+ *
+ * 2) The private tuplesort gets sorted (by key and TID), the worker reads
+ * the data again, and combines the entries as much as possible. This has
+ * to happen eventually, and this way it's done in workers in parallel.
+ *
+ * Finally, the combined entries are written into the shared tuplesort, so
+ * that the leader can process them.
+ *
+ * How well this works (compared to just writing entries into the shared
+ * tuplesort) depends on the data set. For large tables with many distinct
+ * keys this helps a lot. With many distinct keys it's likely the buffer has
+ * to be flushed often, generating many entries with the same key and short
+ * TID lists. These entries need to be sorted and merged at some point,
+ * before writing them to the index. The merging is quite expensive, it can
+ * easily be ~50% of a serial build, and doing as much of it in the workers
+ * means it's parallelized. The leader still has to merge results from the
+ * workers, but it's much more efficient to merge few large entries than
+ * many tiny ones.
+ *
+ * This also reduces the amount of data the workers pass to the leader through
+ * the shared tuplesort. OTOH the workers need more space for the private sort,
+ * possibly up to 2x of the data, if no entries get merged in a worker. But this
+ * is very unlikely, and the only consequence is inefficiency, so we ignore it.
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinBuildShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
+													sortmem, coordinate,
+													TUPLESORT_NONE);
+
+	/* Local per-worker sort of raw-data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
+													  sortmem, NULL,
+													  TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinBuildShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write remaining accumulated entries */
+	ginFlushBuildState(state, index);
+
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place those into
+	 * the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use a reliable figure when doling out maintenance_work_mem
+	 * (when the requested number of workers was not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to SHORTALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have actual non-empty key. We include varlena headers and \0 bytes for
+	 * strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to do the SHORTALIGN to allow
+	 * direct access to item pointers.
+	 *
+	 * XXX Note that for byval types we store the whole datum, no matter what
+	 * the typlen value is.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "unexpected typlen value (%d)", typlen);
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * Allocate space for the whole GIN tuple.
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplestore representation.
+ *
+ * Most of the fields are actually directly accessible, the only thing that
+ * needs more care is the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, the key value is
+ * compared last. The comparisons are done using type-specific sort support
+ * functions.
+ *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to establish a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
+{
+	int			r;
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if (a->category == GIN_CAT_NORM_KEY)
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		r = ApplySortComparator(keya, false,
+								keyb, false,
+								&ssup[a->attrnum - 1]);
+
+		/* if the key is the same, consider the first TID in the array */
+		return (r != 0) ? r : ItemPointerCompare(GinTupleGetFirst(a),
+												 GinTupleGetFirst(b));
+	}
+
+	return ItemPointerCompare(GinTupleGetFirst(a),
+							  GinTupleGetFirst(b));
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 1f9e58c4f1f..6b2dd40fa0f 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 4ab5df92133..f6d81d6e1fc 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -148,6 +149,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 913c4ef455e..4d3114076b3 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,10 +20,12 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "catalog/index.h"
+#include "catalog/pg_collation.h"
 #include "executor/executor.h"
 #include "pg_trace.h"
 #include "utils/datum.h"
@@ -46,6 +48,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +78,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +88,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -568,6 +578,79 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+Tuplesortstate *
+tuplesort_begin_index_gin(Relation heapRel,
+						  Relation indexRel,
+						  int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext;
+	int			i;
+	TupleDesc	desc = RelationGetDescr(indexRel);
+
+	oldcontext = MemoryContextSwitchTo(base->maincontext);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	/*
+	 * Multi-column GIN indexes expand the row into a separate index entry for
+	 * each attribute, and that's what we write into the tuplesort. But we still
+	 * need to initialize sortsupport for all the attributes.
+	 */
+	base->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
+
+	/* Prepare SortSupport data for each column */
+	base->sortKeys = (SortSupport) palloc0(base->nKeys *
+										   sizeof(SortSupportData));
+
+	for (i = 0; i < base->nKeys; i++)
+	{
+		SortSupport sortKey = base->sortKeys + i;
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = indexRel->rd_indcollation[i];
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		/*
+		 * Look for a ordering for the index key data type, and then the sort
+		 * support function.
+		 *
+		 * XXX does this use the right opckeytype/opcintype for GIN?
+		 */
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -803,6 +886,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tuple, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -975,6 +1089,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1763,6 +1900,69 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple,
+							   base->sortKeys);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 9ed48dfde4b..2debdac0f43 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 00000000000..c8fe1130aa4
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,50 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations for GIN tuples used in parallel index builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "access/ginblock.h"
+#include "storage/itemptr.h"
+#include "utils/sortsupport.h"
+
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ *
+ * XXX do we still need all the fields now that we use SortSupport?
+ */
+typedef struct GinTuple
+{
+	int			tuplen;			/* length of the whole tuple */
+	OffsetNumber attrnum;		/* attnum of index key */
+	uint16		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
+static inline ItemPointer
+GinTupleGetFirst(GinTuple *tup)
+{
+	GinPostingList *list;
+
+	list = (GinPostingList *) SHORTALIGN(tup->data + tup->keylen);
+
+	return &list->first;
+}
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index c63f1e5d6da..ef79f259f93 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,10 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(Relation heapRel,
+												 Relation indexRel,
+												 int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +461,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +471,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b6c170ac249..5be7c379926 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1026,11 +1026,14 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
+GinBuildShared
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1046,6 +1049,7 @@ GinScanOpaqueData
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.48.1
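
To make the serialization format in the 0001 patch easier to follow: the
GinTuple that _gin_build_tuple produces is a flat blob - a few fixed
header fields, then the key bytes, then a SHORTALIGN'ed array of TIDs -
and _gin_parse_tuple simply recomputes the same offset to find the TID
array again. Here is a standalone sketch of that layout math, using
simplified stand-in types rather than the actual PostgreSQL structs:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* stand-in for PostgreSQL's SHORTALIGN (align up to 2 bytes) */
#define SHORTALIGN(x) (((uintptr_t) (x) + 1) & ~((uintptr_t) 1))

/* stand-in for ItemPointerData: 6 bytes, 2-byte alignment */
typedef struct
{
	uint16_t	blkhi, blklo, offset;
} TidStandin;

/* stand-in for GinTuple: header, then key, then aligned TID array */
typedef struct
{
	int			tuplen;			/* total length of the serialized tuple */
	uint16_t	keylen;			/* bytes used by the key */
	int			nitems;			/* number of TIDs after the key */
	char		data[];			/* key bytes + SHORTALIGN'ed TID array */
} TupStandin;

int
main(void)
{
	const char	key[] = "word";	/* includes the trailing \0, like typlen -2 */
	TidStandin	tids[] = {{0, 1, 1}, {0, 1, 2}, {0, 2, 5}};
	uint16_t	keylen = sizeof(key);
	char	   *ptr;
	TidStandin *items;

	/* same length math as _gin_build_tuple */
	size_t		tuplen = SHORTALIGN(offsetof(TupStandin, data) + keylen)
		+ sizeof(tids);

	TupStandin *tup = calloc(1, tuplen);

	tup->tuplen = (int) tuplen;
	tup->keylen = keylen;
	tup->nitems = 3;
	memcpy(tup->data, key, keylen);

	/* the TID array starts at the first short-aligned offset after the key */
	ptr = (char *) tup + SHORTALIGN(offsetof(TupStandin, data) + keylen);
	memcpy(ptr, tids, sizeof(tids));

	/* the "parse" side recomputes the same offset from the header fields */
	items = (TidStandin *) ((char *) tup
		+ SHORTALIGN(offsetof(TupStandin, data) + tup->keylen));
	printf("key=%s first TID=(%u,%u)\n", tup->data,
		   (unsigned) items[0].blklo, (unsigned) items[0].offset);

	free(tup);
	return 0;
}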

v20250217-0002-WIP-don-t-use-work_mem-to-size-buffers.patch
From fac87c8b7f263ef1c9e087b57eebdee5382ce16d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 17 Feb 2025 15:26:24 +0100
Subject: [PATCH v20250217 2/6] WIP: don't use work_mem to size buffers

---
 src/backend/access/gin/gininsert.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 2a44482b4be..34f88d16473 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -145,6 +145,7 @@ typedef struct
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
 	ItemPointerData tid;
+	int				work_mem;
 
 	/* FIXME likely duplicate with indtuples */
 	double		bs_numtuples;
@@ -576,7 +577,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 	 * maintenance_work_mem. It's weird to use work_mem here, in a clearly
 	 * maintenance command.
 	 */
-	if (buildstate->accum.allocatedMemory >= (Size) work_mem * 1024L)
+	if (buildstate->accum.allocatedMemory >= buildstate->work_mem * 1024L)
 		ginFlushBuildState(buildstate, index);
 
 	MemoryContextSwitchTo(oldCtx);
@@ -1766,14 +1767,19 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 
 	/* Begin "partial" tuplesort */
 	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
-													sortmem, coordinate,
+													(sortmem / 2),
+													coordinate,
 													TUPLESORT_NONE);
 
 	/* Local per-worker sort of raw-data */
 	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
-													  sortmem, NULL,
+													  (sortmem / 2),
+													  NULL,
 													  TUPLESORT_NONE);
 
+	/* remember how much space is allowed for the accumulated entries */
+	state->work_mem = (sortmem / 2);
+
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
-- 
2.48.1
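
To spell out the memory budget change in this WIP patch: each
participant's share of maintenance_work_mem is now split between the
shared tuplesort, the private worker tuplesort, and the in-memory
accumulator flushed by ginBuildCallbackParallel, instead of the
accumulator being capped by work_mem. A toy illustration of the
arithmetic (the numbers are made up):

#include <stdio.h>

int
main(void)
{
	int			maintenance_work_mem = 65536;	/* in kB, i.e. 64MB */
	int			nparticipants = 4;	/* processes feeding the shared sort */

	/* each participant gets an equal share, as in the 0001 patch */
	int			sortmem = maintenance_work_mem / nparticipants;

	printf("per-worker share:   %d kB\n", sortmem);
	printf("shared tuplesort:   %d kB\n", sortmem / 2);
	printf("private tuplesort:  %d kB\n", sortmem / 2);
	printf("accumulator limit:  %d kB\n", sortmem / 2);
	return 0;
}

Note the three halves add up to 1.5x the per-worker share, which is
presumably part of why this piece is still marked WIP.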

v20250217-0003-Compress-TID-lists-when-writing-GIN-tuples.patch
From 50fde5c2578e75226296eb1c1d4e99469688c20b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:01:43 +0100
Subject: [PATCH v20250217 3/6] Compress TID lists when writing GIN tuples to
 disk

When serializing GIN tuples to tuplesorts during parallel index builds,
we can significantly reduce the amount of data by compressing the TID
lists. The GIN opclasses may produce a lot of data (depending on how
many keys are extracted from each row), and the TID compression is very
efficient and effective.

If the number of distinct keys is high, the first worker pass (reading
data from the table and writing them into a private tuplesort) may not
benefit from the compression very much. It is likely to spill data to
disk before the TID lists get long enough for the compression to help.
The second pass (writing the merged data into the shared tuplesort) is
more likely to benefit from compression.

The compression can be seen as a way to reduce the amount of disk space
needed by the parallel builds, because the data is written twice - first
into the per-worker tuplesorts, then into the shared tuplesort.

Author: Tomas Vondra
Reviewed-by: Matthias van de Meent
Discussion:
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 34f88d16473..37d1bc35f23 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -190,7 +190,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1374,7 +1376,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1410,6 +1413,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /*
@@ -1924,6 +1930,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1936,6 +1951,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 * like endianness etc. We could make it a little bit smaller, but it's not
 * worth it - it's a tiny fraction of the data, and we need to SHORTALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized in compressed form - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1949,6 +1969,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -1975,12 +2000,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "unexpected typlen value (%d)", typlen);
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * only SHORTALIGN).
 	 */
-	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -2029,37 +2076,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible, the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -2072,6 +2122,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array of decompressed TIDs.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -2107,8 +2179,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 
 	if (a->category == GIN_CAT_NORM_KEY)
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		r = ApplySortComparator(keya, false,
 								keyb, false,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 5be7c379926..28d39648c04 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1046,6 +1046,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinState
 GinStatsData
 GinTernaryValue
-- 
2.48.1
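
The compress/copy/decode cycle that this patch splits between
_gin_build_tuple and _gin_parse_tuple_items can be summarized in a
single function. A sketch only - it assumes a backend context and the
existing GIN internals declared in access/gin_private.h
(ginCompressPostingList, ginPostingListDecodeAllSegments), and omits
error handling:

#include "postgres.h"
#include "access/gin_private.h"
#include "lib/ilist.h"

/* like the patch's GinSegmentInfo: one compressed segment in a list */
typedef struct
{
	dlist_node	node;
	GinPostingList *seg;
} SegInfo;

static ItemPointer
roundtrip_tids(ItemPointerData *items, int nitems, int *ndecoded)
{
	dlist_head	segments;
	dlist_mutable_iter iter;
	int			ncompressed = 0;
	Size		compresslen = 0;
	char	   *buf;
	char	   *ptr;

	dlist_init(&segments);

	/* phase 1: compress the sorted TID array into varbyte segments */
	while (ncompressed < nitems)
	{
		int			cnt;
		SegInfo    *info = palloc(sizeof(SegInfo));

		info->seg = ginCompressPostingList(&items[ncompressed],
										   nitems - ncompressed,
										   UINT16_MAX, &cnt);
		ncompressed += cnt;
		compresslen += SizeOfGinPostingList(info->seg);
		dlist_push_tail(&segments, &info->node);
	}

	/* phase 2: copy the segments into one contiguous buffer, in order */
	buf = ptr = palloc(compresslen);
	dlist_foreach_modify(iter, &segments)
	{
		SegInfo    *info = dlist_container(SegInfo, node, iter.cur);
		Size		seglen = SizeOfGinPostingList(info->seg);

		memcpy(ptr, info->seg, seglen);
		ptr += seglen;
		dlist_delete(&info->node);
		pfree(info->seg);
		pfree(info);
	}

	/* phase 3: decode it all back, as _gin_parse_tuple_items does */
	return ginPostingListDecodeAllSegments((GinPostingList *) buf,
										   (int) compresslen, ndecoded);
}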

v20250217-0004-Enforce-memory-limit-during-parallel-GIN-b.patch
From 25ff4779e6b745f24bfb0da6a03078e9c2db4793 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:02:45 +0100
Subject: [PATCH v20250217 4/6] Enforce memory limit during parallel GIN builds

Index builds are expected to respect maintenance_work_mem, just like
other maintenance operations. For serial builds this is done simply by
flushing the buffer in ginBuildCallback() into the index. But with
parallel builds it's more complicated, because there are multiple places
that can allocate memory.

ginBuildCallbackParallel() does the same thing as ginBuildCallback(),
except that the accumulated items are written into tuplesort. Then the
entries with the same key get merged - first in the worker, then in the
leader - and the TID lists may get (arbitrarily) long. It's unlikely it
would exceed the memory limit, but it's possible. We address this by
evicting some of the data if the list gets too long.

We can't simply dump the whole in-memory TID list. The GIN index bulk
insert code expects to see TIDs in monotonic order; it may fail if the
TIDs go backwards. If the TID lists overlap, evicting the whole current
TID list would break this (a later entry might add "old" TID values into
the already-written part).

In the workers this is not an issue, because the lists never overlap.
But the leader may see overlapping lists produced by the workers.

We can however derive a safe "horizon" TID - the entries (for a given
key) are sorted by (key, first TID), which means no future list can add
values before the last "first TID" we've seen. This patch tracks the
"frozen" part of the TID list, which we know can't change by merging
additional TID lists. If needed, we can evict this part of the list.

We don't want to do this too often - the smaller the lists we evict, the
more expensive it'll be to merge them in the next step (especially in
the leader). Therefore we only trim the list if we have at least 1024
frozen items, and if the whole list is at least 64kB large.

These limits are somewhat arbitrary and fairly low. We might calculate
some limits from maintenance_work_mem, but judging by experiments that
does not really improve anything (time, compression ratio, ...). So we
stick to these conservative limits to release memory faster.

Author: Tomas Vondra
Reviewed-by: Matthias van de Meent
Discussion:
---
 src/backend/access/gin/gininsert.c | 212 +++++++++++++++++++++++++++--
 1 file changed, 204 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 37d1bc35f23..16afa16d96b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1163,8 +1163,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempt to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1238,6 +1242,13 @@ GinBufferInit(Relation index)
 				nKeys;
 	TupleDesc	desc = RelationGetDescr(index);
 
+	/*
+	 * How many items can we fit into the memory limit? We don't want to end
+	 * up with too many TIDs, and 64kB seems more than enough. But maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
 	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
 
 	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
@@ -1345,6 +1356,48 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (r == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we mean writing out and removing the TIDs that we know
+ * can't be changed by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding whether to trim - if the resulting
+ * list (after adding TIDs from the new tuple) would be too long, and if there
+ * are enough TIDs to trim (with values less than the "first" TID from the new
+ * tuple), we do the trim. By enough we mean at least 1024 TIDs (mostly an
+ * arbitrary number).
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* no need to trim if we have not hit the memory limit yet */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1395,21 +1448,76 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+							GinTupleGetFirst(tup)) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now find the last TID we know to be frozen, i.e. the last TID right
+	 * before the new GIN tuple.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher. If we already know the whole list is frozen
+	 * (i.e. nfrozen == nitems), this does nothing.
+	 *
+	 * XXX This might do a binary search for sufficiently long lists, but it
+	 * does not seem worth the complexity. Overlapping lists should be rare
+	 * common, TID comparisons are cheap, and we should quickly freeze most of
+	 * the list.
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i],
+							   GinTupleGetFirst(tup)) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as the number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		pfree(new);
+
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer);
 	}
@@ -1446,11 +1554,29 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /*
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
@@ -1516,7 +1642,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers already completed.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/*
@@ -1554,6 +1685,32 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index,
+		 * if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * The buffer is storing the same key, but we've accumulated
+			 * enough frozen TIDs - write the frozen part into the index
+			 * and trim it from the buffer.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1637,7 +1794,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/* sort the raw per-worker data */
@@ -1681,6 +1844,39 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the shared
+		 * tuplesort, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * The buffer is storing the same key, but we've accumulated
+			 * enough frozen TIDs - write the frozen part into the shared
+			 * tuplesort and trim it from the buffer.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
-- 
2.48.1
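
The "freeze horizon" invariant is the crux of this patch: tuples for
the same key arrive sorted by their first TID, so any buffered TID that
does not exceed the incoming tuple's first TID can never be reordered
by a later merge and is safe to write out early. A minimal standalone
sketch of that, with a simplified stand-in TID type rather than
ItemPointerData:

#include <stdio.h>

typedef struct
{
	unsigned	block, offset;	/* simplified stand-in for a TID */
} Tid;

static int
tid_cmp(Tid a, Tid b)
{
	if (a.block != b.block)
		return (a.block < b.block) ? -1 : 1;
	if (a.offset != b.offset)
		return (a.offset < b.offset) ? -1 : 1;
	return 0;
}

/* how many leading TIDs in the (sorted) buffer are "frozen"? */
static int
freeze_prefix(Tid *buffer, int nitems, Tid incoming_first)
{
	int			nfrozen = 0;

	while (nfrozen < nitems &&
		   tid_cmp(buffer[nfrozen], incoming_first) <= 0)
		nfrozen++;
	return nfrozen;
}

int
main(void)
{
	Tid			buf[] = {{1, 1}, {1, 5}, {2, 3}, {7, 1}, {9, 2}};
	Tid			first = {2, 3};	/* first TID of the next tuple, same key */

	/* (1,1), (1,5), (2,3) can be flushed; (7,1) and (9,2) must wait */
	printf("frozen prefix: %d of 5\n", freeze_prefix(buf, 5, first));
	return 0;
}

GinBufferShouldTrim then only acts on this frozen prefix once it is
long enough (at least 1024 TIDs) and the whole list has hit the 64kB
budget, so the trimming stays cheap.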

v20250217-0005-Use-a-single-GIN-tuplesort.patch
From 9a3e21109eefa8b9abed5a913328c3924811c844 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:02:55 +0100
Subject: [PATCH v20250217 5/6] Use a single GIN tuplesort

The previous approach was to sort the data in a private sort, then read
it back, merge the GinTuples, and write the merged tuples into the
shared sort, for the leader to process.

The new approach is to use a single sort, merging tuples as we write
them to disk.  This reduces temporary disk space usage.

An optimization was added to GinBuffer in which we don't deserialize
tuples unless we need access to the item pointers.

This modifies Tuplesort to have a new flushwrites callback. The sort's
writetup can now decide to buffer writes until the next flushwrites()
call.
---
 src/backend/access/gin/gininsert.c         | 391 ++++++++++-----------
 src/backend/utils/sort/tuplesort.c         |   5 +
 src/backend/utils/sort/tuplesortvariants.c | 102 +++++-
 src/include/access/gin_private.h           |   3 +
 src/include/access/gin_tuple.h             |  10 +
 src/include/utils/tuplesort.h              |  10 +-
 src/tools/pgindent/typedefs.list           |   1 +
 7 files changed, 302 insertions(+), 220 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 16afa16d96b..29e001c1930 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -164,14 +164,6 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
-
-	/*
-	 * The sortstate used only within a single worker for the first merge pass
-	 * happenning there. In principle it doesn't need to be part of the build
-	 * state and we could pass it around directly, but it's more convenient
-	 * this way. And it's part of the build state, after all.
-	 */
-	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -195,8 +187,7 @@ static Datum _gin_parse_tuple_key(GinTuple *a);
 
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
-								  ItemPointerData *items, uint32 nitems,
-								  Size *len);
+								  ItemPointerData *items, uint32 nitems);
 
 /*
  * Adds array of item pointers to tuple's posting list, or
@@ -499,16 +490,15 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
-		Size		tuplen;
 
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
 		tup = _gin_build_tuple(attnum, category,
 							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+		tuplesort_putgintuple(buildstate->bs_sortstate, tup);
 
 		pfree(tup);
 	}
@@ -1152,8 +1142,14 @@ _gin_parallel_heapscan(GinBuildState *state)
  * during the initial table scan (and detecting when the scan wraps around),
  * and during merging (where we do mergesort).
  */
-typedef struct GinBuffer
+struct GinBuffer
 {
+	/*
+	 * The memory context holds the dynamic allocations of the items, the
+	 * key, and any produced GinTuples.
+	 */
+	MemoryContext context;
+	GinTuple   *cached;			/* copy of previous GIN tuple */
 	OffsetNumber attnum;
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
@@ -1171,7 +1167,7 @@ typedef struct GinBuffer
 	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
-} GinBuffer;
+};
 
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
@@ -1182,8 +1178,7 @@ AssertCheckItemPointers(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
+	Assert(buffer->nitems == 0 || buffer->items != NULL);
 
 	for (int i = 0; i < buffer->nitems; i++)
 	{
@@ -1210,7 +1205,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* if we have any items, the array must exist */
-	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+	Assert((buffer->nitems == 0) || (buffer->items != NULL));
 
 	/*
 	 * The buffer may be empty, in which case we must not call the check of
@@ -1234,7 +1229,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
  *
  * Initializes sort support procedures for all index attributes.
  */
-static GinBuffer *
+GinBuffer *
 GinBufferInit(Relation index)
 {
 	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
@@ -1297,15 +1292,18 @@ GinBufferInit(Relation index)
 
 		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
 	}
+	buffer->context = GenerationContextCreate(CurrentMemoryContext,
+											  "Gin Buffer",
+											  ALLOCSET_DEFAULT_SIZES);
 
 	return buffer;
 }
 
 /* Is the buffer empty, i.e. has no TID values in the array? */
-static bool
+bool
 GinBufferIsEmpty(GinBuffer *buffer)
 {
-	return (buffer->nitems == 0);
+	return (buffer->nitems == 0 && buffer->cached == NULL);
 }
 
 /*
@@ -1321,37 +1319,71 @@ GinBufferIsEmpty(GinBuffer *buffer)
 static bool
 GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	int			r;
+	AttrNumber	attnum;
 	Datum		tupkey;
+	Datum		bufkey;
 
 	AssertCheckGinBuffer(buffer);
+	if (buffer->cached)
+	{
+		GinTuple   *cached = buffer->cached;
 
-	if (tup->attrnum != buffer->attnum)
-		return false;
+		if (tup->attrnum != cached->attrnum)
+			return false;
 
-	/* same attribute should have the same type info */
-	Assert(tup->typbyval == buffer->typbyval);
-	Assert(tup->typlen == buffer->typlen);
+		Assert(tup->typbyval == cached->typbyval);
+		Assert(tup->typlen == cached->typlen);
 
-	if (tup->category != buffer->category)
-		return false;
+		if (tup->category != cached->category)
+			return false;
 
-	/*
-	 * For NULL/empty keys, this means equality, for normal keys we need to
-	 * compare the actual key value.
-	 */
-	if (buffer->category != GIN_CAT_NORM_KEY)
-		return true;
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (cached->category != GIN_CAT_NORM_KEY)
+			return true;
+
+		attnum = cached->attrnum;
+		bufkey = _gin_parse_tuple_key(cached);
+	}
+	else
+	{
+		if (tup->attrnum != buffer->attnum)
+			return false;
+
+		/* same attribute should have the same type info */
+		Assert(tup->typbyval == buffer->typbyval);
+		Assert(tup->typlen == buffer->typlen);
+
+		if (tup->category != buffer->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (buffer->category != GIN_CAT_NORM_KEY)
+			return true;
+		attnum = buffer->attnum;
+		bufkey = buffer->key;
+	}
 
 	/*
 	 * For the tuple, get either the first sizeof(Datum) bytes for byval
 	 * types, or a pointer to the beginning of the data array.
 	 */
-	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+	tupkey = _gin_parse_tuple_key(tup);
+
+	prev = MemoryContextSwitchTo(buffer->context);
 
-	r = ApplySortComparator(buffer->key, false,
+	r = ApplySortComparator(bufkey, false,
 							tupkey, false,
-							&buffer->ssup[buffer->attnum - 1]);
+							&buffer->ssup[attnum - 1]);
+
+	MemoryContextSwitchTo(prev);
 
 	return (r == 0);
 }
@@ -1398,6 +1430,56 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 	return true;
 }
 
+static void
+GinBufferUnpackCached(GinBuffer *buffer, int reserve_space)
+{
+	Datum		key;
+	ItemPointer items;
+	GinTuple   *cached;
+	int			totitems;
+
+	cached = buffer->cached;
+	totitems = cached->nitems + reserve_space;
+	key = _gin_parse_tuple_key(cached);
+
+	buffer->category = cached->category;
+	buffer->keylen = cached->keylen;
+	buffer->attnum = cached->attrnum;
+
+	buffer->typlen = cached->typlen;
+	buffer->typbyval = cached->typbyval;
+
+	if (cached->category == GIN_CAT_NORM_KEY)
+		buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+	else
+		buffer->key = (Datum) 0;
+
+	items = _gin_parse_tuple_items(cached);
+
+	if (buffer->items == NULL)
+	{
+		buffer->items = palloc0(totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else if (buffer->maxitems < totitems)
+	{
+		buffer->items = repalloc(buffer->items,
+								 totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else
+	{
+		Assert(PointerIsValid(buffer->items) &&
+			   buffer->maxitems >= totitems);
+	}
+	memcpy(buffer->items, items, buffer->nitems * sizeof(ItemPointerData));
+	buffer->nitems = cached->nitems;
+
+	buffer->cached = NULL;
+	pfree(cached);
+	pfree(items);
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1421,33 +1503,30 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * workers. But the workers merge the items as much as possible, so there
  * should not be too many.
  */
-static void
-GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+void
+GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	ItemPointerData *items;
-	Datum		key;
 
+	prev = MemoryContextSwitchTo(buffer->context);
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple_key(tup);
-	items = _gin_parse_tuple_items(tup);
-
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
 	{
-		buffer->category = tup->category;
-		buffer->keylen = tup->keylen;
-		buffer->attnum = tup->attrnum;
-
-		buffer->typlen = tup->typlen;
-		buffer->typbyval = tup->typbyval;
+		GinTuple   *tuple = palloc(tup->tuplen);
 
-		if (tup->category == GIN_CAT_NORM_KEY)
-			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
-		else
-			buffer->key = (Datum) 0;
+		memcpy(tuple, tup, tup->tuplen);
+		buffer->cached = tuple;
+	}
+	else if (buffer->cached != NULL)
+	{
+		GinBufferUnpackCached(buffer, tup->nitems);
 	}
 
+	items = _gin_parse_tuple_items(tup);
+
 	/*
 	 * Try freeze TIDs at the beginning of the list, i.e. exclude them from
 	 * the mergesort. We can do that with TIDs before the first TID in the new
@@ -1524,6 +1603,33 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	/* free the decompressed TID list */
 	pfree(items);
+
+	MemoryContextSwitchTo(prev);
+}
+
+GinTuple *
+GinBufferBuildTuple(GinBuffer *buffer)
+{
+	MemoryContext prev = MemoryContextSwitchTo(buffer->context);
+	GinTuple   *result;
+
+	if (buffer->cached)
+	{
+		result = buffer->cached;
+		buffer->cached = NULL;
+	}
+	else
+	{
+		result = _gin_build_tuple(buffer->attnum, buffer->category,
+								  buffer->key, buffer->typlen,
+								  buffer->typbyval, buffer->items,
+								  buffer->nitems);
+	}
+
+	GinBufferReset(buffer);
+
+	MemoryContextSwitchTo(prev);
+	return result;
 }
 
 /*
@@ -1535,14 +1641,21 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
  * But it's better to not let the array grow arbitrarily large, and enforce
  * work_mem as memory limit by flushing the buffer into the tuplestore.
  */
-static void
+void
 GinBufferReset(GinBuffer *buffer)
 {
 	Assert(!GinBufferIsEmpty(buffer));
 
-	/* release byref values, do nothing for by-val ones */
-	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
-		pfree(DatumGetPointer(buffer->key));
+	/* release cached buffer tuple, if present */
+	if (buffer->cached)
+		pfree(buffer->cached);
+	else
+	{
+		/* release byref values, do nothing for by-val ones */
+		if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval
+			&& PointerIsValid(DatumGetPointer(buffer->key)))
+			pfree(DatumGetPointer(buffer->key));
+	}
 
 	/*
 	 * Not required, but makes it more likely to trigger NULL derefefence if
@@ -1558,6 +1671,7 @@ GinBufferReset(GinBuffer *buffer)
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
+	/* Note that we don't reset the memory context; this is deliberate */
 }
 
 /*
@@ -1581,7 +1695,7 @@ GinBufferTrim(GinBuffer *buffer)
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
  */
-static void
+void
 GinBufferFree(GinBuffer *buffer)
 {
 	if (buffer->items)
@@ -1592,6 +1706,7 @@ GinBufferFree(GinBuffer *buffer)
 		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
 		pfree(DatumGetPointer(buffer->key));
 
+	MemoryContextDelete(buffer->context);
 	pfree(buffer);
 }
 
@@ -1606,7 +1721,7 @@ GinBufferFree(GinBuffer *buffer)
  * for example). But in the leader we need to be careful not to force flushing
  * data too early, which might break the monotonicity of TID list.
  */
-static bool
+bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 {
 	/* empty buffer can accept data for any key */
@@ -1676,6 +1791,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1702,6 +1818,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1715,7 +1832,10 @@ _gin_parallel_merge(GinBuildState *state)
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
@@ -1723,6 +1843,7 @@ _gin_parallel_merge(GinBuildState *state)
 	{
 		AssertCheckItemPointers(buffer);
 
+		Assert(!PointerIsValid(buffer->cached));
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
 					   buffer->items, buffer->nitems, &state->buildStats);
@@ -1772,144 +1893,6 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
-/*
- * _gin_process_worker_data
- *		First phase of the key merging, happening in the worker.
- *
- * Depending on the number of distinct keys, the TID lists produced by the
- * callback may be very short (due to frequent evictions in the callback).
- * But combining many tiny lists is expensive, so we try to do as much as
- * possible in the workers and only then pass the results to the leader.
- *
- * We read the tuples sorted by the key, and merge them into larger lists.
- * At the moment there's no memory limit, so this will just produce one
- * huge (sorted) list per key in each worker. Which means the leader will
- * do a very limited number of mergesorts, which is good.
- */
-static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-
-	GinBuffer  *buffer;
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The workers are limited to the same amount of memory as during the sort
-	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
-	 * during planning, just like there.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/* sort the raw per-worker data */
-	tuplesort_performsort(state->bs_worker_sort);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
-	 * merge them into larger chunks for the leader to combine.
-	 */
-	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
-	{
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferStoreTuple(buffer, tup);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		GinTuple   *ntup;
-		Size		ntuplen;
-
-		AssertCheckItemPointers(buffer);
-
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems, &ntuplen);
-
-		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-		pfree(ntup);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(worker_sort);
-}
-
 /*
  * Perform a worker's portion of a parallel GIN index build sort.
  *
@@ -1995,13 +1978,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* write remaining accumulated entries */
 	ginFlushBuildState(state, index);
 
-	/*
-	 * Do the first phase of in-worker processing - sort the data produced by
-	 * the callback, and combine them into much larger chunks and place that
-	 * into the shared tuplestore for leader to process.
-	 */
-	_gin_process_worker_data(state, state->bs_worker_sort);
-
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
@@ -2156,8 +2132,7 @@ typedef struct
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
-				 ItemPointerData *items, uint32 nitems,
-				 Size *len)
+				 ItemPointerData *items, uint32 nitems)
 {
 	GinTuple   *tuple;
 	char	   *ptr;
@@ -2225,8 +2200,6 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	 */
 	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
-	*len = tuplen;
-
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2ef32d53a43..7f346325678 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -395,6 +395,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2244,6 +2245,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2369,6 +2372,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 4d3114076b3..4d75a097617 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "access/gin.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -90,6 +91,7 @@ static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
 static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
 							   SortTuple *stup);
+static void flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape);
 static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -101,6 +103,7 @@ static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static void freestate_cluster(Tuplesortstate *state);
+static void freestate_index_gin(Tuplesortstate *state);
 
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the CLUSTER case.  Set by
@@ -135,6 +138,16 @@ typedef struct
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
 } TuplesortIndexBTreeArg;
 
+/*
+ * Data structure pointed by "TuplesortPublic.arg" for the index_gin subcase.
+ */
+typedef struct
+{
+	TuplesortIndexArg index;
+	GinBuffer  *buffer;
+} TuplesortIndexGinArg;
+
+
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the index_hash subcase.
  */
@@ -209,6 +222,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -285,6 +299,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -393,6 +408,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -472,6 +488,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -516,6 +533,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -571,6 +589,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -587,6 +606,7 @@ tuplesort_begin_index_gin(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   sortopt);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg;
 	MemoryContext oldcontext;
 	int			i;
 	TupleDesc	desc = RelationGetDescr(indexRel);
@@ -611,6 +631,10 @@ tuplesort_begin_index_gin(Relation heapRel,
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
 										   sizeof(SortSupportData));
+	arg = palloc0(sizeof(TuplesortIndexGinArg));
+	arg->index.indexRel = indexRel;
+	arg->index.heapRel = heapRel;
+	arg->buffer = GinBufferInit(indexRel);
 
 	for (i = 0; i < base->nKeys; i++)
 	{
@@ -642,9 +666,11 @@ tuplesort_begin_index_gin(Relation heapRel,
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
 	base->writetup = writetup_index_gin;
+	base->flushwrites = flushwrites_index_gin;
 	base->readtup = readtup_index_gin;
+	base->freestate = freestate_index_gin;
 	base->haveDatum1 = false;
-	base->arg = NULL;
+	base->arg = arg;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -685,6 +711,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -887,17 +914,17 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 }
 
 void
-tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple)
 {
 	SortTuple	stup;
 	GinTuple   *ctup;
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
-	Size		tuplen;
+	Size		tuplen = tuple->tuplen;
 
 	/* copy the GinTuple into the right memory context */
-	ctup = palloc(size);
-	memcpy(ctup, tuple, size);
+	ctup = palloc(tuplen);
+	memcpy(ctup, tuple, tuplen);
 
 	stup.tuple = ctup;
 	stup.datum1 = (Datum) 0;
@@ -905,7 +932,7 @@ tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
 
 	/* GetMemoryChunkSpace is not supported for bump contexts */
 	if (TupleSortUseBumpTupleCxt(base->sortopt))
-		tuplen = MAXALIGN(size);
+		tuplen = MAXALIGN(tuplen);
 	else
 		tuplen = GetMemoryChunkSpace(ctup);
 
@@ -1925,19 +1952,63 @@ comparetup_index_gin(const SortTuple *a, const SortTuple *b,
 }
 
 static void
-writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+_writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, GinTuple *tup)
 {
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
-	GinTuple   *tuple = (GinTuple *) stup->tuple;
-	unsigned int tuplen = tuple->tuplen;
+	unsigned int tuplen = tup->tuplen;
 
 	tuplen = tuplen + sizeof(tuplen);
+
 	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
-	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	LogicalTapeWrite(tape, tup, tup->tuplen);
+
 	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 }
 
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *otup;
+	GinTuple   *ntup = (GinTuple *) stup->tuple;
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(PointerIsValid(arg));
+
+	if (GinBufferCanAddKey(arg->buffer, ntup))
+	{
+		GinBufferMergeTuple(arg->buffer, ntup);
+		return;
+	}
+
+	otup = GinBufferBuildTuple(arg->buffer);
+
+	_writetup_index_gin(state, tape, otup);
+
+	pfree(otup);
+
+	Assert(GinBufferCanAddKey(arg->buffer, ntup));
+
+	GinBufferMergeTuple(arg->buffer, ntup);
+}
+
+static void
+flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	if (!GinBufferIsEmpty(arg->buffer))
+	{
+		GinTuple   *tuple = GinBufferBuildTuple(arg->buffer);
+
+		_writetup_index_gin(state, tape, tuple);
+		pfree(tuple);
+		Assert(GinBufferIsEmpty(arg->buffer));
+	}
+}
+
 static void
 readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 				  LogicalTape *tape, unsigned int len)
@@ -1963,6 +2034,17 @@ readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = (Datum) 0;
 }
 
+static void
+freestate_index_gin(Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(arg != NULL);
+	Assert(GinBufferIsEmpty(arg->buffer));
+	GinBufferFree(arg->buffer);
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index dcd1ae3fc34..3faf6c80915 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -475,6 +475,9 @@ extern int	ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len, TI
 
 extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *segment, int len,
 												   int *ndecoded_out);
+extern bool ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+												ItemPointer into, int capacity,
+												int *ndecoded_out);
 extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded_out);
 extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
 										ItemPointerData *b, uint32 nb,
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index c8fe1130aa4..66a93894958 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -45,6 +45,16 @@ GinTupleGetFirst(GinTuple *tup)
 	return &list->first;
 }
 
+typedef struct GinBuffer GinBuffer;
+
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
 
+extern GinBuffer *GinBufferInit(Relation index);
+extern bool GinBufferIsEmpty(GinBuffer *buffer);
+extern bool GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup);
+extern void GinBufferReset(GinBuffer *buffer);
+extern void GinBufferFree(GinBuffer *buffer);
+extern void GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup);
+extern GinTuple *GinBufferBuildTuple(GinBuffer *buffer);
+
 #endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..64176b23cbe 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -195,6 +195,14 @@ typedef struct
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
 
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient use of
+	 * the tape's resources, e.g. when deduplicating or merging values.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
+
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
 	 * the already-read length of the stored tuple.  The tuple is allocated
@@ -461,7 +469,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
-extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, struct GinTuple *tuple);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 28d39648c04..99921597ca3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3020,6 +3020,7 @@ TuplesortClusterArg
 TuplesortDatumArg
 TuplesortIndexArg
 TuplesortIndexBTreeArg
+TuplesortIndexGinArg
 TuplesortIndexHashArg
 TuplesortInstrumentation
 TuplesortMethod
-- 
2.48.1

v20250217-0006-WIP-parallel-inserts-into-GIN-index.patch (text/x-patch)
From 14b35f651f3758e6affcc3dc72471f3922a25184 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:03:15 +0100
Subject: [PATCH v20250217 6/6] WIP: parallel inserts into GIN index

---
 src/backend/access/gin/gininsert.c            | 415 ++++++++++++------
 .../utils/activity/wait_event_names.txt       |   2 +
 2 files changed, 286 insertions(+), 131 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 29e001c1930..1f4ea7ada25 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -25,7 +25,9 @@
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "pgstat.h"
+#include "storage/barrier.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "storage/predicate.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/datum.h"
@@ -42,6 +44,11 @@
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
 
+/* The phases for parallel builds, used by build_barrier. */
+#define GIN_BUILD_INIT					0
+#define GIN_BUILD_SCAN					1
+#define GIN_BUILD_PARTITION				2
+
 /*
  * Status for index builds performed in parallel.  This is allocated in a
  * dynamic shared memory segment.
@@ -88,6 +95,9 @@ typedef struct GinBuildShared
 	double		reltuples;
 	double		indtuples;
 
+	Barrier		build_barrier;
+	SharedFileSet fileset;		/* space for shared temporary files */
+
 	/*
 	 * ParallelTableScanDescData data follows. Can't directly embed here, as
 	 * implementations of the parallel table scan desc interface might need
@@ -173,7 +183,6 @@ static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relati
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
 static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
-static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
 											  Relation heap, Relation index);
 static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
@@ -189,6 +198,12 @@ static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems);
 
+static double _gin_partition_sorted_data(GinBuildState *state);
+static void _gin_parallel_insert(GinBuildState *state,
+								 GinBuildShared *ginshared,
+								 Relation heap, Relation index,
+								 bool progress);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -707,8 +722,12 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
 
-		/* scan the relation in parallel and merge per-worker results */
-		reltuples = _gin_parallel_merge(state);
+		/* partition the sorted data */
+		reltuples = _gin_partition_sorted_data(state);
+
+		/* do the insert for the leader's partition */
+		_gin_parallel_insert(state, state->bs_leader->ginshared,
+							 heap, index, true);
 
 		_gin_end_parallel(state->bs_leader, state);
 	}
@@ -997,6 +1016,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	ginshared->reltuples = 0.0;
 	ginshared->indtuples = 0.0;
 
+	/* used to wait for data to insert */
+	BarrierInit(&ginshared->build_barrier, scantuplesortstates);
+
+	/* Set up the space we'll use for shared temporary files. */
+	SharedFileSetInit(&ginshared->fileset, pcxt->seg);
+
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
 								  snapshot);
@@ -1064,6 +1089,11 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned, leader continues");
 }
 
 /*
@@ -1077,6 +1107,8 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	/* Shutdown worker processes */
 	WaitForParallelWorkersToFinish(ginleader->pcxt);
 
+	SharedFileSetDeleteAll(&ginleader->ginshared->fileset);
+
 	/*
 	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
 	 * or we might get incomplete data.)
@@ -1732,134 +1764,6 @@ GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 	return GinBufferKeyEquals(buffer, tup);
 }
 
-/*
- * Within leader, wait for end of heap scan and merge per-worker results.
- *
- * After waiting for all workers to finish, merge the per-worker results into
- * the complete index. The results from each worker are sorted by block number
- * (start of the page range). While combinig the per-worker results we merge
- * summaries for the same page range, and also fill-in empty summaries for
- * ranges without any tuples.
- *
- * Returns the total number of heap tuples scanned.
- */
-static double
-_gin_parallel_merge(GinBuildState *state)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-	double		reltuples = 0;
-	GinBuffer  *buffer;
-
-	/* wait for workers to scan table and produce partial results */
-	reltuples = _gin_parallel_heapscan(state);
-
-	/* do the actual sort in the leader */
-	tuplesort_performsort(state->bs_sortstate);
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The leader is allowed to use the whole maintenance_work_mem buffer to
-	 * combine data. The parallel workers already completed.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by category and
-	 * key. That probably gives us order matching how data is organized in the
-	 * index.
-	 *
-	 * We don't insert the GIN tuples right away, but instead accumulate as
-	 * many TIDs for the same key as possible, and then insert that at once.
-	 * This way we don't need to decompress/recompress the posting lists, etc.
-	 */
-	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
-	{
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nitems, &state->buildStats);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nfrozen, &state->buildStats);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferMergeTuple(buffer, tup);
-
-		if (buffer->cached)
-			GinBufferUnpackCached(buffer, 0);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		AssertCheckItemPointers(buffer);
-
-		Assert(!PointerIsValid(buffer->cached));
-		ginEntryInsert(&state->ginstate,
-					   buffer->attnum, buffer->key, buffer->category,
-					   buffer->items, buffer->nitems, &state->buildStats);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(state->bs_sortstate);
-
-	return reltuples;
-}
-
 /*
  * Returns size of shared memory required to store state for a parallel
  * gin index build based on the snapshot its parallel scan will use.
@@ -2082,6 +1986,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Prepare to track buffer usage during parallel execution */
 	InstrStartParallelQuery();
 
+	/* attach to the fileset too */
+	SharedFileSetAttach(&ginshared->fileset, seg);
+
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
 	 * (when requested number of workers were not launched, this will be
@@ -2092,6 +1999,20 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
 
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned by workers, leader continues");
+
+	/* leader sorts and partitions the data */
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned by leader, worker continues");
+
+	_gin_parallel_insert(&buildstate, ginshared, heapRel, indexRel, false);
+
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
@@ -2363,3 +2284,235 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	return ItemPointerCompare(GinTupleGetFirst(a),
 							  GinTupleGetFirst(b));
 }
+
+static double
+_gin_partition_sorted_data(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	GinBuildShared *shared = state->bs_leader->ginshared;
+	BufFile   **files;
+	int64		fileidx = 0;
+	double		reltuples;
+
+	/* how many tuples per worker */
+	int64		worker_tuples = (state->indtuples / shared->scantuplesortstates) + 1;
+	int64		remaining = Min(worker_tuples, 1000);
+	int64		ntmp = 0;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* Allocate BufFiles, one for each participant. */
+	files = palloc0_array(BufFile *, shared->scantuplesortstates);
+
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		char		fname[MAXPGPATH];
+
+		sprintf(fname, "worker-%d", i);
+
+		files[i] = BufFileCreateFileSet(&shared->fileset.fs, fname);
+	}
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us order matching how data is organized in the
+	 * index.
+	 *
+	 * We don't insert the GIN tuples into the index right away. Instead, we
+	 * write them into per-worker temp files, round-robin in chunks, and each
+	 * participant then inserts its own partition into the index.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		ntmp++;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * FIXME Maybe move to next partition only when the index key changes?
+		 * Otherwise we might have issues with 'could not fit onto page' when
+		 * adding overlapping TID lists to the index. But maybe that can't
+		 * happen, thanks to the merging of data in the tuplesort?
+		 */
+
+		BufFileWrite(files[fileidx], &tuplen, sizeof(tuplen));
+		BufFileWrite(files[fileidx], tup, tuplen);
+
+		remaining--;
+
+		/* move to the next file */
+		if (remaining == 0)
+		{
+			remaining = Min(worker_tuples, 1000);
+			fileidx++;
+			fileidx = fileidx % shared->scantuplesortstates;
+		}
+	}
+
+	/* close the files */
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		BufFileClose(files[i]);
+	}
+
+	/* and also close the tuplesort */
+	tuplesort_end(state->bs_sortstate);
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&shared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned, leader continues");
+
+	return reltuples;
+}
+
+static void
+_gin_parallel_insert(GinBuildState *state, GinBuildShared *ginshared,
+					 Relation heap, Relation index, bool progress)
+{
+	GinBuffer  *buffer;
+	GinTuple   *tup;
+	Size		len;
+
+	BufFile    *file;
+	char		fname[MAXPGPATH];
+	char	   *buff;
+	int64		ntuples = 0;
+	Size		maxlen;
+
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * This runs in the leader as well as in the workers, with each process
+	 * inserting its own partition of the sorted data.
+	 */
+	buffer = GinBufferInit(state->ginstate.index);
+
+
+	sprintf(fname, "worker-%d", ParallelWorkerNumber + 1);
+	file = BufFileOpenFileSet(&ginshared->fileset.fs, fname, O_RDONLY, false);
+
+	/* 8kB seems like a reasonable starting point */
+	maxlen = 8192;
+	buff = palloc(maxlen);
+
+	while (true)
+	{
+		size_t		ret;
+
+		ret = BufFileRead(file, &len, sizeof(len));
+
+		if (ret == 0)
+			break;
+		if (ret != sizeof(len))
+			elog(ERROR, "could not read tuple length: %zu of %zu bytes", ret, sizeof(len));
+
+		/* maybe resize the buffer */
+		if (maxlen < len)
+		{
+			while (maxlen < len)
+				maxlen *= 2;
+
+			buff = repalloc(buff, maxlen);
+		}
+
+		tup = (GinTuple *) buff;
+
+
+		BufFileReadExact(file, tup, len);
+
+		ntuples++;
+
+		if (ntuples % 100000 == 0)
+			elog(LOG, "inserted " INT64_FORMAT " tuples", ntuples);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer);
+
+		Assert(!PointerIsValid(buffer->cached));
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	BufFileClose(file);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..afb9be848a0 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -116,6 +116,8 @@ CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from star
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
 CHECKPOINT_START	"Waiting for a checkpoint to start."
 EXECUTE_GATHER	"Waiting for activity from a child process while executing a <literal>Gather</literal> plan node."
+GIN_BUILD_PARTITION	"Waiting for partitioning of data during parallel GIN index build."
+GIN_BUILD_SCAN	"Waiting for scan of data during parallel GIN index build."
 HASH_BATCH_ALLOCATE	"Waiting for an elected Parallel Hash participant to allocate a hash table."
 HASH_BATCH_ELECT	"Waiting to elect a Parallel Hash participant to allocate a hash table."
 HASH_BATCH_LOAD	"Waiting for other Parallel Hash participants to finish loading a hash table."
-- 
2.48.1

#51Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#50)
4 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

Hi,

After stress-testing all the patches (which yielded no issues except for
the barrier hang in 0005, which is not for commit yet), I proceeded to
do some basic perf testing.

I simply built a bunch of GIN indexes on a database with current mailing
list archives. The database is ~23GB, and the indexes were these:

CREATE INDEX headers_jsonb_path_idx
ON messages USING gin (msg_headers jsonb_path_ops);

CREATE INDEX headers_jsonb_idx
ON messages USING gin (msg_headers);

CREATE INDEX subject_trgm_idx
ON messages USING gin (msg_subject gin_trgm_ops);

CREATE INDEX body_tsvector_idx
ON messages USING gin (msg_body_tsvector);

CREATE INDEX subject_tsvector_idx
ON messages USING gin (msg_subject_tsvector);

So the indexes are on different data types, columns of different size,
etc. I did this on my two machines:

1) xeon - 44 cores, but old (~2016)
2) ryzen - 12 cores, brand new CPU (2024)

And I ran the CREATE INDEX with a range of worker counts (0, 1, 4, ...).
The count was set using ALTER TABLE, which sets the worker count directly,
without the additional plan_create_index_workers() heuristics. There were
always enough workers available to satisfy this.

The m_w_m was set to 1GB for all runs, which should leave "enough"
memory for up to 32 workers (plan_create_index_workers leaves at least 32MB
per worker).

The results are in the attached PDF tables, and I think they are mostly
as expected ...

timing
------

For the "timing" charts, there are two colored sections. The first shows
the comparison to 0 workers (i.e. a serial build), the second the comparison
to the ideal speedup (essentially time/(N+1), where N is the number of
workers). In both cases green=good, red=bad.

The "patch" is the number of patch in the patch series, without the "0"
prefix. Patch "0" means "master" without patches.

How much the parallelism helps depends on the column. For some columns
(body_trgm, subject_trgm, subject_tsvector) it helps a lot, for others
it's less beneficial. But in all cases it helps, cutting the duration
(at least) in half.

On both machines the performance stops improving at ~4 workers. I guess
that's expected, and AFAICS we wouldn't really try to use more workers
for these index builds anyway.

One thing I don't quite understand is that on the ryzen machine, this
also seems to speed up patch "0" (i.e. master with no parallel builds).
At first I thought it was just random run-to-run noise, but looking at
those results it doesn't seem to be the case. E.g. for body_trgm_idx it
changes from ~686 seconds to ~634 seconds. For the other columns it's
less significant, but still pretty consistent.

On the xeon machine this doesn't happen at all.

I don't have a great explanation for this, because the patch does not
modify serial builds at all. The only idea I have is a change in binary
layout between builds, but that's just "I don't know" in disguise.

temporary files
---------------

The other set of charts, "temporary MB", shows the amount of temporary file
data produced with each of the patches. It's not showing "patch 0" (aka
master) because serial builds don't use temp files at all. The % values
are relative to "patch 1".

The 0002 patch is the compression, and that helps a lot (but depends on
the column). 0003 is just about enforcing the memory limit; it does not
affect the temporary files at all.

Then 0004 "single tuplesort" does help a lot too, sometimes cutting the
amount in half. Which makes sense, because we suddenly don't need to
shuffle data between two tuplesorts.

But the results of 0005 are a bit bizarre - it mostly undoes the 0004
benefits, for some reason. I wonder why.

Anyway, I'm mostly happy with how this performs for 0001-0003, which
are the parts I plan to push in the coming days.

regards

--
Tomas Vondra

Attachments:

ryzen-temporary-mb.pdf (application/pdf)
ryzen-timing.pdf (application/pdf)
xeon-temporary-mb.pdf (application/pdf)
xeon-timing.pdf (application/pdf)
#52Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#51)
5 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

Hi,

here's a rebased version of the patch series. I realized I made a silly
merge mistake in the 0004 patch during the last rebase, so cfbot was not
happy. So here's a fixed version.

This also squashes the memory size adjustments (the 0002 patch in the
previous series) into 0001.

regards

--
Tomas Vondra

Attachments:

v20250220-0001-Allow-parallel-CREATE-INDEX-for-GIN-indexe.patch (text/x-patch)
From 14ad828222db47ba86c93fe52029c3150a6523f1 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 20:57:52 +0100
Subject: [PATCH v20250220 1/5] Allow parallel CREATE INDEX for GIN indexes

Allow using multiple worker processes to build a GIN index, similarly to
BTREE and BRIN indexes. For large tables this may result in significant
speedup when the build is CPU-bound.

The work is divided so that each worker builds index entries on a subset
of the table, determined by the regular parallel scan used to read the
data. Each worker uses a local tuplesort to sort and merge the entries
for the same key. The entries are then written into a shared tuplesort.
Finally, the leader merges entries from all workers, and writes them
into the index.

This minimizes the amount of sorting and merging that needs to happen
in the leader process. The entries still need to be merged, but doing
most of that in the workers means it's parallelized. The leader then needs
to merge fewer large entries, which is cheaper / more efficient.

The workers build entries so that the TID lists do not overlap (for a
given key), so that merging two lists simply appends them. In the leader
a full mergesort is needed, as lists from different workers may interleave.

Most of the parallelism infrastructure is a simplified copy of the code
used by BTREE indexes, omitting the parts irrelevant for GIN indexes
(e.g. uniqueness checks).

Original patch by me, with reviews and substantial reworks by Matthias
van de Meent, certainly enough to make him a co-author.

Author: Tomas Vondra, Matthias van de Meent
Reviewed-by: Matthias van de Meent
Discussion:
---
 src/backend/access/gin/gininsert.c         | 1617 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  200 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   50 +
 src/include/utils/tuplesort.h              |    8 +
 src/tools/pgindent/typedefs.list           |    4 +
 8 files changed, 1874 insertions(+), 15 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d1b5e8f0dd1..34f88d16473 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,126 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/sortsupport.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinBuildShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all following fields
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinBuildShared;
+
+/*
+ * Return pointer to a GinBuildShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinBuildShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinBuildShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinBuildShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +144,58 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
+	int				work_mem;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process.
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate used only within a single worker for the first merge pass
+	 * happening there. In principle it doesn't need to be part of the build
+	 * state and we could pass it around directly, but it's more convenient
+	 * this way. And it's part of the build state, after all.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinBuildShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +474,122 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginFlushBuildState
+ *		Write all data from BuildAccumulator into the tuplesort.
+ */
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
+/*
+ * ginBuildCallbackParallel
+ *		Callback for the parallel index build.
+ *
+ * This is similar to the serial build callback ginBuildCallback, but
+ * instead of writing the accumulated entries into the index, each worker
+ * writes them into a (local) tuplesort.
+ *
+ * The worker then sorts and combines these entries, before writing them
+ * into a shared tuplesort for the leader (see _gin_parallel_scan_and_build
+ * for the whole process).
+ */
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	/*
+	 * if scan wrapped around - flush accumulated entries and start anew
+	 *
+	 * With parallel scans, we don't have a guarantee the scan does not start
+	 * half-way through the relation (serial builds disable sync scans and
+	 * always start from block 0, parallel scans require allow_sync=true).
+	 *
+	 * Building the posting lists assumes the TIDs are monotonic and never go
+	 * back, and the wrap around would break that. We handle that by detecting
+	 * the wraparound, and flushing all entries. This means we'll later see
+	 * two separate entries with non-overlapping TID lists (which can be
+	 * combined by merge sort).
+	 *
+	 * To detect a wraparound, we remember the last TID seen by each worker
+	 * (for any key). If the next TID seen by the worker is lower, the scan
+	 * must have wrapped around.
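+	 *
+	 * Illustrative example (hypothetical block numbers): a worker whose
+	 * parallel chunk starts at block 600 of a 1000-block table may see
+	 * TIDs (600,1) ... (999,n) and then (0,1) once the scan wraps around;
+	 * the decrease from (999,n) to (0,1) is what we detect here.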
+	 */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+		ginFlushBuildState(buildstate, index);
+
+	/* remember the TID we're about to process */
+	buildstate->tid = *tid;
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort.
+	 *
+	 * XXX It might seem this should set the memory limit to 32MB, same as
+	 * what plan_create_index_workers() uses to calculate the number of
+	 * parallel workers, but that's the limit for tuplesort. So it seems
+	 * better to keep using work_mem here.
+	 *
+	 * XXX But maybe we should calculate this as a per-worker fraction of
+	 * maintenance_work_mem. It's weird to use work_mem here, in a clearly
+	 * maintenance command.
+	 */
+	if (buildstate->accum.allocatedMemory >= buildstate->work_mem * 1024L)
+		ginFlushBuildState(buildstate, index);
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +607,16 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * Initialize all the fields, so as not to trip valgrind.
+	 *
+	 * XXX Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,24 +657,91 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. For GIN that's
+	 * reasonable too, because we sort the data just like btree. It does
+	 * ignore the memory used to accumulate data in memory (set by work_mem),
+	 * but there is no way to communicate that to plan_create_index_workers.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, &buildstate, NULL);
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
+	 */
+	if (state->bs_leader)
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(heap, index,
+									  maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
+
+		/* wait for workers to finish the scan, then merge their results */
+		reltuples = _gin_parallel_merge(state);
+
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
+	{
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, &buildstate, NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -533,3 +881,1244 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinBuildShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinBuildShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinBuildShared *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+/*
+ * Buffer used to accumulate TIDs from multiple GinTuples for the same key
+ * (we read these from the tuplesort, sorted by the key).
+ *
+ * This is similar to BuildAccumulator in that it's used to collect TIDs
+ * in memory before inserting them into the index, but it's much simpler
+ * as it only deals with a single index key at a time.
+ *
+ * When adding TIDs to the buffer, we make sure to keep them sorted, both
+ * during the initial table scan (by detecting when the scan wraps around),
+ * and during merging (where we do a mergesort).
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	SortSupport ssup;			/* for sorting/comparing keys */
+	ItemPointerData *items;
+} GinBuffer;
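+
+/*
+ * Typical lifecycle of the buffer (a sketch of how the functions below fit
+ * together): GinBufferInit() once per merge pass; then, for each tuple read
+ * from a tuplesort, either GinBufferStoreTuple() while GinBufferCanAddKey()
+ * holds, or flush the accumulated data and GinBufferReset() on a key change;
+ * finally GinBufferFree(). See _gin_process_worker_data() and
+ * _gin_parallel_merge().
+ */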
+
+/*
+ * Check that TID array contains valid values, and that it's sorted (if we
+ * expect it to be).
+ */
+static void
+AssertCheckItemPointers(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&buffer->items[i]));
+
+		/* don't check ordering for the first TID item */
+		if (i == 0)
+			continue;
+
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
+	}
+#endif
+}
+
+/*
+ * GinBuffer checks
+ *
+ * XXX Maybe it would be better to have AssertCheckGinBuffer accept flags,
+ * instead of having to call AssertCheckItemPointers separately in the places
+ * that require the items to not be empty?
+ */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* if we have any items, the array must exist */
+	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+
+	/*
+	 * The buffer may be empty, in which case we must not call the check of
+	 * item pointers, because that assumes non-emptiness.
+	 */
+	if (buffer->nitems == 0)
+		return;
+
+	/* Make sure the item pointers are valid and sorted. */
+	AssertCheckItemPointers(buffer);
+#endif
+}
+
+/*
+ * GinBufferInit
+ *		Initialize buffer to store tuples for a GIN index.
+ *
+ * Initialize the buffer used to accumulate TIDs for a single key at a time
+ * (we process the data sorted), so we know when we received all data for
+ * a given key.
+ *
+ * Initializes sort support procedures for all index attributes.
+ */
+static GinBuffer *
+GinBufferInit(Relation index)
+{
+	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
+	int			i,
+				nKeys;
+	TupleDesc	desc = RelationGetDescr(index);
+
+	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
+
+	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
+
+	/*
+	 * Lookup ordering operator for the index key data type, and initialize
+	 * the sort support function.
+	 */
+	for (i = 0; i < nKeys; i++)
+	{
+		Oid			cmpFunc;
+		SortSupport sortKey = &buffer->ssup[i];
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = index->rd_indcollation[i];
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		/*
+		 * If the compare proc isn't specified in the opclass definition, look
+		 * up the index key type's default btree comparator.
+		 */
+		cmpFunc = index_getprocid(index, i + 1, GIN_COMPARE_PROC);
+		if (cmpFunc == InvalidOid)
+		{
+			TypeCacheEntry *typentry;
+
+			typentry = lookup_type_cache(att->atttypid,
+										 TYPECACHE_CMP_PROC_FINFO);
+			if (!OidIsValid(typentry->cmp_proc_finfo.fn_oid))
+				ereport(ERROR,
+						(errcode(ERRCODE_UNDEFINED_FUNCTION),
+						 errmsg("could not identify a comparison function for type %s",
+								format_type_be(att->atttypid))));
+
+			cmpFunc = typentry->cmp_proc_finfo.fn_oid;
+		}
+
+		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
+	}
+
+	return buffer;
+}
+
+/* Is the buffer empty, i.e. does it hold no TID values in the array? */
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * GinBufferKeyEquals
+ *		Can the buffer store TIDs for the provided GIN tuple (same key)?
+ *
+ * Check if the tuple matches the data already accumulated in the GIN
+ * buffer. Compare the scalar fields first, before the actual key.
+ *
+ * Returns true if the key matches (so the TIDs belong in the buffer), or
+ * false if the key does not match.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	int			r;
+	Datum		tupkey;
+
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	/*
+	 * For NULL/empty keys, this means equality, for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * For the tuple, get either the first sizeof(Datum) bytes for byval
+	 * types, or a pointer to the beginning of the data array.
+	 */
+	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+
+	r = ApplySortComparator(buffer->key, false,
+							tupkey, false,
+							&buffer->ssup[buffer->attnum - 1]);
+
+	return (r == 0);
+}
+
+/*
+ * GinBufferStoreTuple
+ *		Add data (especially TID list) from a GIN tuple to the buffer.
+ *
+ * The buffer is expected to be empty (in which case it's initialized), or
+ * to hold the same key. The TID values from the tuple are combined with the
+ * stored values using a merge sort.
+ *
+ * The tuples (for the same key) are expected to be sorted by first TID. But
+ * this does not guarantee the lists do not overlap, especially in the leader,
+ * because the workers process interleaving data. There should be no overlaps
+ * in a single worker - it could happen when the parallel scan wraps around,
+ * but we detect that and flush the data (see ginBuildCallbackParallel).
+ *
+ * By sorting the GinTuples not only by key, but also by the first TID, we
+ * make it less likely the lists will overlap during the merge. We combine
+ * them using a mergesort, but it's cheaper when one list can simply be
+ * appended to the other.
+ *
+ * How often can the lists overlap? There should be no overlaps in workers,
+ * and in the leader we can see overlaps between lists built by different
+ * workers. But the workers merge the items as much as possible, so there
+ * should not be too many.
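+ *
+ * For example, merging buffered TIDs [(1,1), (2,3)] with tuple TIDs
+ * [(1,2), (3,1)] yields [(1,1), (1,2), (2,3), (3,1)]; in the common
+ * non-overlapping case the mergesort effectively just concatenates the
+ * two lists.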
+ */
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* add the new TIDs into the buffer, combine using merge-sort */
+	{
+		int			nnew;
+		ItemPointer new;
+
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
+
+		Assert(nnew == buffer->nitems + tup->nitems);
+
+		if (buffer->items)
+			pfree(buffer->items);
+
+		buffer->items = new;
+		buffer->nitems = nnew;
+
+		AssertCheckItemPointers(buffer);
+	}
+}
+
+/*
+ * GinBufferReset
+ *		Reset the buffer into a state as if it contains no data.
+ *
+ * XXX Should we do something if the array of TIDs gets too large? It may
+ * grow too much, and we'll not free it until the worker finishes building.
+ * But it's better to not let the array grow arbitrarily large, and enforce
+ * work_mem as a memory limit by flushing the buffer into the tuplesort.
+ */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/*
+	 * Not required, but makes it more likely to trigger a NULL dereference
+	 * if the value is used incorrectly, etc.
+	 */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+}
+
+/*
+ * GinBufferFree
+ *		Release memory associated with the GinBuffer (including TID array).
+ */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
+/*
+ * GinBufferCanAddKey
+ *		Check if a given GIN tuple can be added to the current buffer.
+ *
+ * Returns true if the buffer is either empty or for the same index key.
+ *
+ * XXX This could / should also enforce a memory limit by checking the size of
+ * the TID array, and returning false if it's too large (more than work_mem,
+ * for example). But in the leader we need to be careful not to force flushing
+ * data too early, which might break the monotonicity of the TID list.
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, read the GIN tuples from the
+ * shared tuplesort (sorted by key and first TID), accumulate the TID lists
+ * for each key in a buffer, and insert each completed key into the index.
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by attribute,
+	 * category and key. That probably matches the order in which the data
+	 * is organized in the index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert them at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the accumulated data into the index, and start a new entry for
+			 * the current GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short (due to frequent evictions in the callback).
+ * But combining many tiny lists is expensive, so we try to do as much as
+ * possible in the workers and only then pass the results to the leader.
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker. That means the leader only
+ * needs to do a very limited number of mergesorts, which is good.
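+ *
+ * For instance, a worker that flushed its accumulator ten times may have
+ * written ten short GinTuples for a frequent key; this pass merges them
+ * into a single GinTuple with one combined (sorted) TID list.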
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the worker's private tuplesort, sorted by the
+	 * key, and merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the accumulated data into the shared tuplesort, and start a new
+			 * entry for the current GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		AssertCheckItemPointers(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
+/*
+ * Perform a worker's portion of a parallel GIN index build sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ *
+ * Before feeding data into a shared tuplesort (for the leader process),
+ * the workers process data in two phases.
+ *
+ * 1) A worker reads a portion of rows from the table, accumulates entries
+ * in memory, and flushes them into a private tuplesort (e.g. when the
+ * accumulated entries use too much memory).
+ *
+ * 2) The private tuplesort gets sorted (by key and TID), the worker reads
+ * the data again, and combines the entries as much as possible. This has
+ * to happen eventually, and this way it's done in workers in parallel.
+ *
+ * Finally, the combined entries are written into the shared tuplesort, so
+ * that the leader can process them.
+ *
+ * How well this works (compared to just writing entries into the shared
+ * tuplesort) depends on the data set. For large tables with many distinct
+ * keys this helps a lot. With many distinct keys the accumulator is likely
+ * to be flushed often, generating many entries with the same key and short
+ * TID lists. These entries need to be sorted and merged at some point,
+ * before writing them to the index. The merging is quite expensive, it can
+ * easily be ~50% of a serial build, and doing as much of it in the workers
+ * means it's parallelized. The leader still has to merge results from the
+ * workers, but it's much more efficient to merge few large entries than
+ * many tiny ones.
+ *
+ * This also reduces the amount of data the workers pass to the leader through
+ * the shared tuplesort. OTOH the workers need more space for the private sort,
+ * possibly up to 2x of the data, if no entries get merged in a worker. But this
+ * is very unlikely, and the only consequence is inefficiency, so we ignore it.
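+ *
+ * Worked example (illustrative numbers): with maintenance_work_mem = 64MB
+ * and 4 participants, each participant receives sortmem = 16MB. Below,
+ * that is split into 8MB for the shared tuplesort and 8MB for the private
+ * per-worker tuplesort; the same 8MB value also caps the in-memory
+ * accumulator (the work_mem field of the build state).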
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinBuildShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
+													(sortmem / 2),
+													coordinate,
+													TUPLESORT_NONE);
+
+	/* Local per-worker sort of raw-data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
+													  (sortmem / 2),
+													  NULL,
+													  TUPLESORT_NONE);
+
+	/* remember how much space is allowed for the accumulated entries */
+	state->work_mem = (sortmem / 2);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinBuildShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write remaining accumulated entries */
+	ginFlushBuildState(state, index);
+
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place those into
+	 * the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to SHORTALIGN the
+ * start of the TID list anyway. So we wouldn't save anything.
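+ *
+ * Serialized layout (sketch): [GinTuple header][key, keylen bytes]
+ * [padding to a SHORTALIGN boundary][nitems x ItemPointerData]. The TID
+ * array therefore starts SHORTALIGN(offsetof(GinTuple, data) + keylen)
+ * bytes from the start of the tuple.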
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have actual non-empty key. We include varlena headers and \0 bytes for
+	 * strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to SHORTALIGN the data to
+	 * allow direct access to the item pointers.
+	 *
+	 * XXX Note that for byval types we store the whole datum, no matter what
+	 * the typlen value is.
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "unexpected typlen value (%d)", typlen);
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
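+
+	/*
+	 * For example, on a typical 64-bit build offsetof(GinTuple, data) is 16
+	 * and sizeof(ItemPointerData) is 6, so keylen = 5 and nitems = 3 give
+	 * tuplen = SHORTALIGN(16 + 5) + 3 * 6 = 22 + 18 = 40 bytes.
+	 */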
+
+	*len = tuplen;
+
+	/*
+	 * Allocate space for the whole GIN tuple.
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplestore representation.
+ *
+ * Most of the fields are directly accessible; the only things that need
+ * more care are the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, the key value is
+ * compared last. The comparisons are done using type-specific sort support
+ * functions.
+ *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
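+ *
+ * For example, two tuples for the same key whose first TIDs are (10,1) and
+ * (42,3) compare in that order, so the runs for each key arrive at the
+ * leader ordered by their starting TID.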
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
+{
+	int			r;
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if (a->category == GIN_CAT_NORM_KEY)
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		r = ApplySortComparator(keya, false,
+								keyb, false,
+								&ssup[a->attrnum - 1]);
+
+		/* if the key is the same, consider the first TID in the array */
+		return (r != 0) ? r : ItemPointerCompare(GinTupleGetFirst(a),
+												 GinTupleGetFirst(b));
+	}
+
+	return ItemPointerCompare(GinTupleGetFirst(a),
+							  GinTupleGetFirst(b));
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 1f9e58c4f1f..6b2dd40fa0f 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 4ab5df92133..f6d81d6e1fc 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -148,6 +149,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 913c4ef455e..4d3114076b3 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,10 +20,12 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "catalog/index.h"
+#include "catalog/pg_collation.h"
 #include "executor/executor.h"
 #include "pg_trace.h"
 #include "utils/datum.h"
@@ -46,6 +48,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +78,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +88,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -568,6 +578,79 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+Tuplesortstate *
+tuplesort_begin_index_gin(Relation heapRel,
+						  Relation indexRel,
+						  int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext;
+	int			i;
+	TupleDesc	desc = RelationGetDescr(indexRel);
+
+	oldcontext = MemoryContextSwitchTo(base->maincontext);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	/*
+	 * Multi-column GIN indexes expand the row into a separate index entry
+	 * per attribute, and that's what we write into the tuplesort. But we
+	 * still need to initialize sortsupport for all the attributes.
+	 */
+	base->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
+
+	/* Prepare SortSupport data for each column */
+	base->sortKeys = (SortSupport) palloc0(base->nKeys *
+										   sizeof(SortSupportData));
+
+	for (i = 0; i < base->nKeys; i++)
+	{
+		SortSupport sortKey = base->sortKeys + i;
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = indexRel->rd_indcollation[i];
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		/*
+		 * Look up an ordering operator for the index key data type, and
+		 * initialize the sort support function from it.
+		 *
+		 * XXX does this use the right opckeytype/opcintype for GIN?
+		 */
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -803,6 +886,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tuple, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -975,6 +1089,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1763,6 +1900,69 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple,
+							   base->sortKeys);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
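+/*
+ * On-tape format produced by writetup_index_gin above: an unsigned int
+ * length word (the tuple length plus the length word itself), then the
+ * GinTuple bytes, and - for TUPLESORT_RANDOMACCESS - a trailing copy of
+ * the length word. readtup_index_gin below reads that back.
+ */
+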
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 9ed48dfde4b..2debdac0f43 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 00000000000..c8fe1130aa4
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,50 @@
+/*--------------------------------------------------------------------------
+ * gin.h
+ *	  Public header file for Generalized Inverted Index access method.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "access/ginblock.h"
+#include "storage/itemptr.h"
+#include "utils/sortsupport.h"
+
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ *
+ * XXX do we still need all the fields now that we use SortSupport?
+ */
+typedef struct GinTuple
+{
+	int			tuplen;			/* length of the whole tuple */
+	OffsetNumber attrnum;		/* attnum of index key */
+	uint16		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
+static inline ItemPointer
+GinTupleGetFirst(GinTuple *tup)
+{
+	GinPostingList *list;
+
+	list = (GinPostingList *) SHORTALIGN(tup->data + tup->keylen);
+
+	return &list->first;
+}
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index c63f1e5d6da..ef79f259f93 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,10 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(Relation heapRel,
+												 Relation indexRel,
+												 int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +461,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +471,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 98ab45adfa3..967e17cae82 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1027,11 +1027,14 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
+GinBuildShared
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1047,6 +1050,7 @@ GinScanOpaqueData
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.48.1

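For orientation, here is roughly how a worker drives the tuplesort API
added by the patch above (a sketch assembled from the declarations in
the patch; heapRel, indexRel, coordinate and make_gin_tuple() are
stand-ins, not code from the patch):

    static void
    gin_sort_sketch(Relation heapRel, Relation indexRel,
                    SortCoordinate coordinate)
    {
        Tuplesortstate *ts;
        GinTuple   *tup;
        Size        len;

        ts = tuplesort_begin_index_gin(heapRel, indexRel,
                                       maintenance_work_mem,
                                       coordinate, TUPLESORT_NONE);

        /* feed serialized GIN tuples, e.g. from the build callback */
        tup = make_gin_tuple();     /* stand-in for _gin_build_tuple() */
        tuplesort_putgintuple(ts, tup, tup->tuplen);
        pfree(tup);

        tuplesort_performsort(ts);

        /* read back, with equal keys adjacent (ordered by first TID) */
        while ((tup = tuplesort_getgintuple(ts, &len, true)) != NULL)
        {
            /* merge TID lists for equal keys, insert into the index */
        }

        tuplesort_end(ts);
    }
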
Attachment: v20250220-0002-Compress-TID-lists-when-writing-GIN-tuples.patch (text/x-patch)
From c78c95953ad0645926bde838c81cb8f17c689ab2 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:01:43 +0100
Subject: [PATCH v20250220 2/5] Compress TID lists when writing GIN tuples to
 disk

When serializing GIN tuples to tuplesorts during parallel index builds,
we can significantly reduce the amount of data by compressing the TID
lists. The GIN opclasses may produce a lot of data (depending on how
many keys are extracted from each row), and the TID compression is very
efficient and effective.

If the number of distinct keys is high, the first worker pass (reading
data from the table and writing them into a private tuplesort) may not
benefit from the compression very much. It is likely to spill data to
disk before the TID lists get long enough for the compression to help.
The second pass (writing the merged data into the shared tuplesort) is
more likely to benefit from compression.

The compression can be seen as a way to reduce the amount of disk space
needed by the parallel builds, because the data is written twice - first
into the per-worker tuplesorts, then into the shared tuplesort.

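To illustrate why this works so well: the compressed posting-list format
stores varbyte-encoded deltas between consecutive TID values, so a sorted
TID list with small gaps needs only a byte or two per TID instead of the
six bytes of a raw ItemPointerData. A toy encoder in the same spirit (for
illustration only - not the actual ginpostinglist.c code):

    #include <stdint.h>
    #include <stddef.h>

    /* Toy varbyte delta encoder; input values must be ascending. */
    static size_t
    encode_deltas(const uint64_t *vals, int n, unsigned char *out)
    {
        unsigned char *p = out;
        uint64_t    prev = 0;

        for (int i = 0; i < n; i++)
        {
            uint64_t    delta = vals[i] - prev;

            prev = vals[i];
            while (delta > 0x7F)
            {
                /* high bit set: more bytes follow */
                *p++ = (unsigned char) (0x80 | (delta & 0x7F));
                delta >>= 7;
            }
            *p++ = (unsigned char) delta;   /* high bit clear: last byte */
        }
        return (size_t) (p - out);          /* bytes written */
    }
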
Author: Tomas Vondra
Reviewed-by: Matthias van de Meent
Discussion:
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 34f88d16473..37d1bc35f23 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -190,7 +190,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1374,7 +1376,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1410,6 +1413,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /*
@@ -1924,6 +1930,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1936,6 +1951,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
  * like endianness etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized as compressed - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1949,6 +1969,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -1975,12 +2000,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "unexpected typlen value (%d)", typlen);
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * only SHORTALIGN).
 	 */
-	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -2029,37 +2076,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible; the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -2072,6 +2122,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array of TIDs decompressed from the tuple.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -2107,8 +2179,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 
 	if (a->category == GIN_CAT_NORM_KEY)
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		r = ApplySortComparator(keya, false,
 								keyb, false,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 967e17cae82..a6e2ccef36b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1047,6 +1047,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinState
 GinStatsData
 GinTernaryValue
-- 
2.48.1

Attachment: v20250220-0003-Enforce-memory-limit-during-parallel-GIN-b.patch (text/x-patch)
From 5c8712892b867f65bef061559cdada900cb10787 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:02:45 +0100
Subject: [PATCH v20250220 3/5] Enforce memory limit during parallel GIN builds

Index builds are expected to respect maintenance_work_mem, just like
other maintenance operations. For serial builds this is done simply by
flushing the buffer in ginBuildCallback() into the index. But with
parallel builds it's more complicated, because there are multiple places
that can allocate memory.

ginBuildCallbackParallel() does the same thing as ginBuildCallback(),
except that the accumulated items are written into tuplesort. Then the
entries with the same key get merged - first in the worker, then in the
leader - and the TID lists may get (arbitrarily) long. It's unlikely it
would exceed the memory limit, but it's possible. We address this by
evicting some of the data if the list gets too long.

We can't simply dump the whole in-memory TID list. The GIN index bulk
insert code expects to see TIDs in monotonic order; it may fail if the
TIDs go backwards. If the TID lists overlap, evicting the whole current
TID list would break this (a later entry might add "old" TID values into
the already-written part).

In the workers this is not an issue, because the lists never overlap.
But the leader may see overlapping lists produced by the workers.

We can however derive a safe "horizon" TID - the entries (for a given
key) are sorted by (key, first TID), which means no future list can add
values before the last "first TID" we've seen. This patch tracks the
"frozen" part of the TID list, which we know can't change by merging
additional TID lists. If needed, we can evict this part of the list.

We don't want to do this too often - the smaller the lists we evict,
the more expensive it'll be to merge them in the next step (especially in
the leader). Therefore we only trim the list if we have at least 1024
frozen items, and if the whole list is at least 64kB large.

These limits are somewhat arbitrary and fairly low. We might calculate
some limits from maintenance_work_mem, but judging by experiments that
does not really improve anything (time, compression ratio, ...). So we
stick to these conservative limits to release memory faster.

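As a concrete example of the "horizon" rule: the buffer's TID list is
sorted, and GIN tuples for a key arrive ordered by their first TID, so
any buffered TID not exceeding the incoming tuple's first TID can never
be disturbed by a later merge. With buffered TIDs (1,1) (1,5) (2,3) and
an incoming tuple whose first TID is (2,1), the first two entries become
frozen. A sketch of the rule (count_frozen() is a hypothetical helper,
mirroring the loop the patch adds to GinBufferStoreTuple):

    static int
    count_frozen(ItemPointerData *items, int nitems,
                 ItemPointer first_of_new)
    {
        int     nfrozen = 0;

        /* TIDs up to (and including) the new tuple's first TID are safe */
        while (nfrozen < nitems &&
               ItemPointerCompare(&items[nfrozen], first_of_new) <= 0)
            nfrozen++;

        return nfrozen;
    }
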
Author: Tomas Vondra
Reviewed-by: Matthias van de Meent
Discussion:
---
 src/backend/access/gin/gininsert.c | 212 +++++++++++++++++++++++++++--
 1 file changed, 204 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 37d1bc35f23..16afa16d96b 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1163,8 +1163,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempt to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1238,6 +1242,13 @@ GinBufferInit(Relation index)
 				nKeys;
 	TupleDesc	desc = RelationGetDescr(index);
 
+	/*
+	 * How many items can we fit into the memory limit? We don't want to end
+	 * up with too many TIDs, and 64kB (about 10900 six-byte TIDs) seems more
+	 * than enough. But maybe this should be tied to maintenance_work_mem or
+	 * something like that?
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
 	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
 
 	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
@@ -1345,6 +1356,48 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (r == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we mean writing out and removing the tuple IDs that
+ * we know can't change by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding whether to trim - if the resulting
+ * list (after adding TIDs from the new tuple) would be too long, and if there
+ * are enough TIDs to trim (with values less than the "first" TID from the new
+ * tuple), we do the trim. By enough we mean at least 1024 TIDs (mostly an
+ * arbitrary number).
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* no need to trim if we have not hit the memory limit yet */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1395,21 +1448,76 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+							GinTupleGetFirst(tup)) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now find the last TID we know to be frozen, i.e. the last TID right
+	 * before the new GIN tuple.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher. If we already know the whole list is frozen
+	 * (i.e. nfrozen == nitems), this does nothing.
+	 *
+	 * XXX This might do a binary search for sufficiently long lists, but it
+	 * does not seem worth the complexity. Overlapping lists should be rare,
+	 * TID comparisons are cheap, and we should quickly freeze most of
+	 * the list.
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i],
+							   GinTupleGetFirst(tup)) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		pfree(new);
+
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer);
 	}
@@ -1446,11 +1554,29 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /*
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
@@ -1516,7 +1642,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem budget to
+	 * combine data, as the parallel workers have already completed.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/*
@@ -1554,6 +1685,32 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * The buffer holds a long list for the current key - flush the
+			 * frozen part of it into the index, and remove it from the
+			 * in-memory buffer.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
@@ -1637,7 +1794,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/* sort the raw per-worker data */
@@ -1681,6 +1844,39 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * The buffer holds a long list for the current key - write the
+			 * frozen part of it into the shared tuplesort, and remove it
+			 * from the in-memory buffer.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
-- 
2.48.1

Attachment: v20250220-0004-Use-a-single-GIN-tuplesort.patch (text/x-patch)
From 6338461644ffa52034f4ada18fb731a688edf566 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:02:55 +0100
Subject: [PATCH v20250220 4/5] Use a single GIN tuplesort

The previous approach was to sort the data in a private per-worker sort,
then read it back, merge the GinTuples, and write them into the shared
sort for the leader to process.

The new approach is to use a single sort, merging tuples as we write
them to disk.  This reduces temporary disk space.

An optimization was added to GinBuffer in which we don't deserialize
tuples unless we need access to the TID list.

This modifies Tuplesort to have a new flushwrites callback. A sort's
writetup can now decide to buffer writes until the next flushwrites()
callback.
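
The contract is easiest to see in a self-contained toy (plain C, nothing
from the patch): while writing out a sorted stream, writetup may hold the
current group back to merge duplicates, so something must emit the last
held-back group before the run ends - which is what flushwrites() is for.

    #include <stdio.h>

    typedef struct { int key; int count; int have; } Pending;

    static void
    flush_pending(Pending *p, FILE *out)
    {
        if (p->have)
            fprintf(out, "%d x%d\n", p->key, p->count);
        p->have = 0;
    }

    static void
    write_merged(Pending *p, int key, FILE *out)
    {
        if (p->have && p->key == key)
            p->count++;             /* buffer: merge into pending group */
        else
        {
            flush_pending(p, out);  /* emit the previous group */
            p->key = key;
            p->count = 1;
            p->have = 1;
        }
    }

    int
    main(void)
    {
        int     sorted[] = {1, 1, 2, 5, 5, 5};
        Pending p = {0, 0, 0};

        for (int i = 0; i < 6; i++)
            write_merged(&p, sorted[i], stdout);
        flush_pending(&p, stdout);  /* without this, "5 x3" is lost */
        return 0;
    }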
---
 src/backend/access/gin/gininsert.c         | 397 ++++++++++-----------
 src/backend/utils/sort/tuplesort.c         |   5 +
 src/backend/utils/sort/tuplesortvariants.c | 102 +++++-
 src/include/access/gin_private.h           |   3 +
 src/include/access/gin_tuple.h             |  10 +
 src/include/utils/tuplesort.h              |  10 +-
 src/tools/pgindent/typedefs.list           |   1 +
 7 files changed, 302 insertions(+), 226 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 16afa16d96b..c6d0dc1fea9 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -164,14 +164,6 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
-
-	/*
-	 * The sortstate used only within a single worker for the first merge pass
-	 * happenning there. In principle it doesn't need to be part of the build
-	 * state and we could pass it around directly, but it's more convenient
-	 * this way. And it's part of the build state, after all.
-	 */
-	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -195,8 +187,7 @@ static Datum _gin_parse_tuple_key(GinTuple *a);
 
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
-								  ItemPointerData *items, uint32 nitems,
-								  Size *len);
+								  ItemPointerData *items, uint32 nitems);
 
 /*
  * Adds array of item pointers to tuple's posting list, or
@@ -499,16 +490,15 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
-		Size		tuplen;
 
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
 		tup = _gin_build_tuple(attnum, category,
 							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+		tuplesort_putgintuple(buildstate->bs_sortstate, tup);
 
 		pfree(tup);
 	}
@@ -1152,8 +1142,14 @@ _gin_parallel_heapscan(GinBuildState *state)
  * during the initial table scan (and detecting when the scan wraps around),
  * and during merging (where we do mergesort).
  */
-typedef struct GinBuffer
+struct GinBuffer
 {
+	/*
+	 * The memory context holds the dynamically allocated TID array, the key,
+	 * and any produced GinTuples.
+	 */
+	MemoryContext context;
+	GinTuple   *cached;			/* copy of previous GIN tuple */
 	OffsetNumber attnum;
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
@@ -1171,7 +1167,7 @@ typedef struct GinBuffer
 	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
-} GinBuffer;
+};
 
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
@@ -1182,8 +1178,7 @@ AssertCheckItemPointers(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
+	Assert(buffer->nitems == 0 || buffer->items != NULL);
 
 	for (int i = 0; i < buffer->nitems; i++)
 	{
@@ -1210,7 +1205,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* if we have any items, the array must exist */
-	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+	Assert((buffer->nitems == 0) || (buffer->items != NULL));
 
 	/*
 	 * The buffer may be empty, in which case we must not call the check of
@@ -1234,7 +1229,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
  *
  * Initializes sort support procedures for all index attributes.
  */
-static GinBuffer *
+GinBuffer *
 GinBufferInit(Relation index)
 {
 	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
@@ -1297,15 +1292,18 @@ GinBufferInit(Relation index)
 
 		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
 	}
+	buffer->context = GenerationContextCreate(CurrentMemoryContext,
+											  "Gin Buffer",
+											  ALLOCSET_DEFAULT_SIZES);
 
 	return buffer;
 }
 
 /* Is the buffer empty, i.e. has no TID values in the array? */
-static bool
+bool
 GinBufferIsEmpty(GinBuffer *buffer)
 {
-	return (buffer->nitems == 0);
+	return (buffer->nitems == 0 && buffer->cached == NULL);
 }
 
 /*
@@ -1321,37 +1319,71 @@ GinBufferIsEmpty(GinBuffer *buffer)
 static bool
 GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	int			r;
+	AttrNumber	attnum;
 	Datum		tupkey;
+	Datum		bufkey;
 
 	AssertCheckGinBuffer(buffer);
+	if (buffer->cached)
+	{
+		GinTuple   *cached = buffer->cached;
 
-	if (tup->attrnum != buffer->attnum)
-		return false;
+		if (tup->attrnum != cached->attrnum)
+			return false;
 
-	/* same attribute should have the same type info */
-	Assert(tup->typbyval == buffer->typbyval);
-	Assert(tup->typlen == buffer->typlen);
+		Assert(tup->typbyval == cached->typbyval);
+		Assert(tup->typlen == cached->typlen);
 
-	if (tup->category != buffer->category)
-		return false;
+		if (tup->category != cached->category)
+			return false;
 
-	/*
-	 * For NULL/empty keys, this means equality, for normal keys we need to
-	 * compare the actual key value.
-	 */
-	if (buffer->category != GIN_CAT_NORM_KEY)
-		return true;
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (cached->category != GIN_CAT_NORM_KEY)
+			return true;
+
+		attnum = cached->attrnum;
+		bufkey = _gin_parse_tuple_key(cached);
+	}
+	else
+	{
+		if (tup->attrnum != buffer->attnum)
+			return false;
+
+		/* same attribute should have the same type info */
+		Assert(tup->typbyval == buffer->typbyval);
+		Assert(tup->typlen == buffer->typlen);
+
+		if (tup->category != buffer->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (buffer->category != GIN_CAT_NORM_KEY)
+			return true;
+		attnum = buffer->attnum;
+		bufkey = buffer->key;
+	}
 
 	/*
 	 * For the tuple, get either the first sizeof(Datum) bytes for byval
 	 * types, or a pointer to the beginning of the data array.
 	 */
-	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+	tupkey = _gin_parse_tuple_key(tup);
+
+	prev = MemoryContextSwitchTo(buffer->context);
 
-	r = ApplySortComparator(buffer->key, false,
+	r = ApplySortComparator(bufkey, false,
 							tupkey, false,
-							&buffer->ssup[buffer->attnum - 1]);
+							&buffer->ssup[attnum - 1]);
+
+	MemoryContextSwitchTo(prev);
 
 	return (r == 0);
 }
@@ -1398,6 +1430,56 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 	return true;
 }
 
+static void
+GinBufferUnpackCached(GinBuffer *buffer, int reserve_space)
+{
+	Datum		key;
+	ItemPointer items;
+	GinTuple   *cached;
+	int			totitems;
+
+	cached = buffer->cached;
+	totitems = cached->nitems + reserve_space;
+	key = _gin_parse_tuple_key(cached);
+
+	buffer->category = cached->category;
+	buffer->keylen = cached->keylen;
+	buffer->attnum = cached->attrnum;
+
+	buffer->typlen = cached->typlen;
+	buffer->typbyval = cached->typbyval;
+
+	if (cached->category == GIN_CAT_NORM_KEY)
+		buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+	else
+		buffer->key = (Datum) 0;
+
+	items = _gin_parse_tuple_items(cached);
+
+	if (buffer->items == NULL)
+	{
+		buffer->items = palloc0(totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else if (buffer->maxitems < totitems)
+	{
+		buffer->items = repalloc(buffer->items,
+								 totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else
+	{
+		Assert(PointerIsValid(buffer->items) &&
+			   buffer->maxitems >= totitems);
+	}
+	memcpy(buffer->items, items, buffer->nitems * sizeof(ItemPointerData));
+	buffer->nitems = cached->nitems;
+
+	buffer->cached = NULL;
+	pfree(cached);
+	pfree(items);
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1421,33 +1503,30 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * workers. But the workers merge the items as much as possible, so there
  * should not be too many.
  */
-static void
-GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+void
+GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	ItemPointerData *items;
-	Datum		key;
 
+	prev = MemoryContextSwitchTo(buffer->context);
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple_key(tup);
-	items = _gin_parse_tuple_items(tup);
-
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
 	{
-		buffer->category = tup->category;
-		buffer->keylen = tup->keylen;
-		buffer->attnum = tup->attrnum;
-
-		buffer->typlen = tup->typlen;
-		buffer->typbyval = tup->typbyval;
+		GinTuple   *tuple = palloc(tup->tuplen);
 
-		if (tup->category == GIN_CAT_NORM_KEY)
-			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
-		else
-			buffer->key = (Datum) 0;
+		memcpy(tuple, tup, tup->tuplen);
+		buffer->cached = tuple;
+	}
+	else if (buffer->cached != NULL)
+	{
+		GinBufferUnpackCached(buffer, tup->nitems);
 	}
 
+	items = _gin_parse_tuple_items(tup);
+
 	/*
 	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
 	 * the mergesort. We can do that with TIDs before the first TID in the new
@@ -1524,6 +1603,33 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	/* free the decompressed TID list */
 	pfree(items);
+
+	MemoryContextSwitchTo(prev);
+}
+
+GinTuple *
+GinBufferBuildTuple(GinBuffer *buffer)
+{
+	MemoryContext prev = MemoryContextSwitchTo(buffer->context);
+	GinTuple   *result;
+
+	if (buffer->cached)
+	{
+		result = buffer->cached;
+		buffer->cached = NULL;
+	}
+	else
+	{
+		result = _gin_build_tuple(buffer->attnum, buffer->category,
+								  buffer->key, buffer->typlen,
+								  buffer->typbyval, buffer->items,
+								  buffer->nitems);
+	}
+
+	GinBufferReset(buffer);
+
+	MemoryContextSwitchTo(prev);
+	return result;
 }
 
 /*
@@ -1535,14 +1641,21 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
  * But it's better to not let the array grow arbitrarily large, and enforce
  * work_mem as memory limit by flushing the buffer into the tuplestore.
  */
-static void
+void
 GinBufferReset(GinBuffer *buffer)
 {
 	Assert(!GinBufferIsEmpty(buffer));
 
-	/* release byref values, do nothing for by-val ones */
-	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
-		pfree(DatumGetPointer(buffer->key));
+	/* release cached buffer tuple, if present */
+	if (buffer->cached)
+		pfree(buffer->cached);
+	else
+	{
+		/* release byref values, do nothing for by-val ones */
+		if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval
+			&& PointerIsValid(DatumGetPointer(buffer->key)))
+			pfree(DatumGetPointer(buffer->key));
+	}
 
 	/*
 	 * Not required, but makes it more likely to trigger NULL dereference if
@@ -1558,6 +1671,7 @@ GinBufferReset(GinBuffer *buffer)
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
+	/* Note that we don't reset the memory context; this is deliberate */
 }
 
 /*
@@ -1581,7 +1695,7 @@ GinBufferTrim(GinBuffer *buffer)
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
  */
-static void
+void
 GinBufferFree(GinBuffer *buffer)
 {
 	if (buffer->items)
@@ -1592,6 +1706,7 @@ GinBufferFree(GinBuffer *buffer)
 		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
 		pfree(DatumGetPointer(buffer->key));
 
+	MemoryContextDelete(buffer->context);
 	pfree(buffer);
 }
 
@@ -1606,7 +1721,7 @@ GinBufferFree(GinBuffer *buffer)
  * for example). But in the leader we need to be careful not to force flushing
  * data too early, which might break the monotonicity of TID list.
  */
-static bool
+bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 {
 	/* empty buffer can accept data for any key */
@@ -1676,6 +1791,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1702,6 +1818,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1715,7 +1832,10 @@ _gin_parallel_merge(GinBuildState *state)
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append it to the existing data).
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
@@ -1723,6 +1843,7 @@ _gin_parallel_merge(GinBuildState *state)
 	{
 		AssertCheckItemPointers(buffer);
 
+		Assert(!PointerIsValid(buffer->cached));
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
 					   buffer->items, buffer->nitems, &state->buildStats);
@@ -1772,144 +1893,6 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 ginleader->sharedsort, heap, index, sortmem, true);
 }
 
-/*
- * _gin_process_worker_data
- *		First phase of the key merging, happening in the worker.
- *
- * Depending on the number of distinct keys, the TID lists produced by the
- * callback may be very short (due to frequent evictions in the callback).
- * But combining many tiny lists is expensive, so we try to do as much as
- * possible in the workers and only then pass the results to the leader.
- *
- * We read the tuples sorted by the key, and merge them into larger lists.
- * At the moment there's no memory limit, so this will just produce one
- * huge (sorted) list per key in each worker. Which means the leader will
- * do a very limited number of mergesorts, which is good.
- */
-static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-
-	GinBuffer  *buffer;
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The workers are limited to the same amount of memory as during the sort
-	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
-	 * during planning, just like there.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/* sort the raw per-worker data */
-	tuplesort_performsort(state->bs_worker_sort);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
-	 * merge them into larger chunks for the leader to combine.
-	 */
-	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
-	{
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferStoreTuple(buffer, tup);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		GinTuple   *ntup;
-		Size		ntuplen;
-
-		AssertCheckItemPointers(buffer);
-
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems, &ntuplen);
-
-		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-		pfree(ntup);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(worker_sort);
-}
-
 /*
  * Perform a worker's portion of a parallel GIN index build sort.
  *
@@ -1973,12 +1956,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													coordinate,
 													TUPLESORT_NONE);
 
-	/* Local per-worker sort of raw-data */
-	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
-													  (sortmem / 2),
-													  NULL,
-													  TUPLESORT_NONE);
-
 	/* remember how much space is allowed for the accumulated entries */
 	state->work_mem = (sortmem / 2);
 
@@ -1995,13 +1972,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* write remaining accumulated entries */
 	ginFlushBuildState(state, index);
 
-	/*
-	 * Do the first phase of in-worker processing - sort the data produced by
-	 * the callback, and combine them into much larger chunks and place that
-	 * into the shared tuplestore for leader to process.
-	 */
-	_gin_process_worker_data(state, state->bs_worker_sort);
-
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
@@ -2156,8 +2126,7 @@ typedef struct
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
-				 ItemPointerData *items, uint32 nitems,
-				 Size *len)
+				 ItemPointerData *items, uint32 nitems)
 {
 	GinTuple   *tuple;
 	char	   *ptr;
@@ -2225,8 +2194,6 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	 */
 	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
-	*len = tuplen;
-
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2ef32d53a43..7f346325678 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -395,6 +395,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2244,6 +2245,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2369,6 +2372,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 4d3114076b3..4d75a097617 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "access/gin.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -90,6 +91,7 @@ static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
 static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
 							   SortTuple *stup);
+static void flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape);
 static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -101,6 +103,7 @@ static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static void freestate_cluster(Tuplesortstate *state);
+static void freestate_index_gin(Tuplesortstate *state);
 
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the CLUSTER case.  Set by
@@ -135,6 +138,16 @@ typedef struct
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
 } TuplesortIndexBTreeArg;
 
+/*
+ * Data structure pointed by "TuplesortPublic.arg" for the index_gin subcase.
+ */
+typedef struct
+{
+	TuplesortIndexArg index;
+	GinBuffer  *buffer;
+} TuplesortIndexGinArg;
+
+
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the index_hash subcase.
  */
@@ -209,6 +222,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -285,6 +299,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -393,6 +408,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -472,6 +488,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -516,6 +533,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -571,6 +589,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -587,6 +606,7 @@ tuplesort_begin_index_gin(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   sortopt);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg;
 	MemoryContext oldcontext;
 	int			i;
 	TupleDesc	desc = RelationGetDescr(indexRel);
@@ -611,6 +631,10 @@ tuplesort_begin_index_gin(Relation heapRel,
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
 										   sizeof(SortSupportData));
+	arg = palloc0(sizeof(TuplesortIndexGinArg));
+	arg->index.indexRel = indexRel;
+	arg->index.heapRel = heapRel;
+	arg->buffer = GinBufferInit(indexRel);
 
 	for (i = 0; i < base->nKeys; i++)
 	{
@@ -642,9 +666,11 @@ tuplesort_begin_index_gin(Relation heapRel,
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
 	base->writetup = writetup_index_gin;
+	base->flushwrites = flushwrites_index_gin;
 	base->readtup = readtup_index_gin;
+	base->freestate = freestate_index_gin;
 	base->haveDatum1 = false;
-	base->arg = NULL;
+	base->arg = arg;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -685,6 +711,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -887,17 +914,17 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 }
 
 void
-tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple)
 {
 	SortTuple	stup;
 	GinTuple   *ctup;
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
-	Size		tuplen;
+	Size		tuplen = tuple->tuplen;
 
 	/* copy the GinTuple into the right memory context */
-	ctup = palloc(size);
-	memcpy(ctup, tuple, size);
+	ctup = palloc(tuplen);
+	memcpy(ctup, tuple, tuplen);
 
 	stup.tuple = ctup;
 	stup.datum1 = (Datum) 0;
@@ -905,7 +932,7 @@ tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
 
 	/* GetMemoryChunkSpace is not supported for bump contexts */
 	if (TupleSortUseBumpTupleCxt(base->sortopt))
-		tuplen = MAXALIGN(size);
+		tuplen = MAXALIGN(tuplen);
 	else
 		tuplen = GetMemoryChunkSpace(ctup);
 
@@ -1925,19 +1952,63 @@ comparetup_index_gin(const SortTuple *a, const SortTuple *b,
 }
 
 static void
-writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+_writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, GinTuple *tup)
 {
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
-	GinTuple   *tuple = (GinTuple *) stup->tuple;
-	unsigned int tuplen = tuple->tuplen;
+	unsigned int tuplen = tup->tuplen;
 
 	tuplen = tuplen + sizeof(tuplen);
+
 	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
-	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	LogicalTapeWrite(tape, tup, tup->tuplen);
+
 	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 }
 
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *otup;
+	GinTuple   *ntup = (GinTuple *) stup->tuple;
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(PointerIsValid(arg));
+
+	if (GinBufferCanAddKey(arg->buffer, ntup))
+	{
+		GinBufferMergeTuple(arg->buffer, ntup);
+		return;
+	}
+
+	otup = GinBufferBuildTuple(arg->buffer);
+
+	_writetup_index_gin(state, tape, otup);
+
+	pfree(otup);
+
+	Assert(GinBufferCanAddKey(arg->buffer, ntup));
+
+	GinBufferMergeTuple(arg->buffer, ntup);
+}
+
+static void
+flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	if (!GinBufferIsEmpty(arg->buffer))
+	{
+		GinTuple   *tuple = GinBufferBuildTuple(arg->buffer);
+
+		_writetup_index_gin(state, tape, tuple);
+		pfree(tuple);
+		Assert(GinBufferIsEmpty(arg->buffer));
+	}
+}
+
 static void
 readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 				  LogicalTape *tape, unsigned int len)
@@ -1963,6 +2034,17 @@ readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = (Datum) 0;
 }
 
+static void
+freestate_index_gin(Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(arg != NULL);
+	Assert(GinBufferIsEmpty(arg->buffer));
+	GinBufferFree(arg->buffer);
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index dcd1ae3fc34..3faf6c80915 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -475,6 +475,9 @@ extern int	ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len, TI
 
 extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *segment, int len,
 												   int *ndecoded_out);
+extern bool ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+												ItemPointer into, int capacity,
+												int *ndecoded_out);
 extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded_out);
 extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
 										ItemPointerData *b, uint32 nb,
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index c8fe1130aa4..66a93894958 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -45,6 +45,16 @@ GinTupleGetFirst(GinTuple *tup)
 	return &list->first;
 }
 
+typedef struct GinBuffer GinBuffer;
+
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
 
+extern GinBuffer *GinBufferInit(Relation index);
+extern bool GinBufferIsEmpty(GinBuffer *buffer);
+extern bool GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup);
+extern void GinBufferReset(GinBuffer *buffer);
+extern void GinBufferFree(GinBuffer *buffer);
+extern void GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup);
+extern GinTuple *GinBufferBuildTuple(GinBuffer *buffer);
+
 #endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..64176b23cbe 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -195,6 +195,14 @@ typedef struct
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
 
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient use of
+	 * the tape's resources, e.g. when deduplicating or merging values.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
+
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
 	 * the already-read length of the stored tuple.  The tuple is allocated
@@ -461,7 +469,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
-extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, struct GinTuple *tuple);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a6e2ccef36b..9f1fff70b28 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3025,6 +3025,7 @@ TuplesortClusterArg
 TuplesortDatumArg
 TuplesortIndexArg
 TuplesortIndexBTreeArg
+TuplesortIndexGinArg
 TuplesortIndexHashArg
 TuplesortInstrumentation
 TuplesortMethod
-- 
2.48.1

v20250220-0005-WIP-parallel-inserts-into-GIN-index.patch (text/x-patch)
From 82ad4b5631f5366b584dbe29fc719a6ae2c6b38b Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:03:15 +0100
Subject: [PATCH v20250220 5/5] WIP: parallel inserts into GIN index

---
 src/backend/access/gin/gininsert.c            | 415 ++++++++++++------
 .../utils/activity/wait_event_names.txt       |   2 +
 2 files changed, 286 insertions(+), 131 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index c6d0dc1fea9..6b35572f3d2 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -25,7 +25,9 @@
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "pgstat.h"
+#include "storage/barrier.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "storage/predicate.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/datum.h"
@@ -42,6 +44,11 @@
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
 
+/* The phases for parallel builds, used by build_barrier. */
+#define GIN_BUILD_INIT					0
+#define GIN_BUILD_SCAN					1
+#define GIN_BUILD_PARTITION				2
+
 /*
  * Status for index builds performed in parallel.  This is allocated in a
  * dynamic shared memory segment.
@@ -88,6 +95,9 @@ typedef struct GinBuildShared
 	double		reltuples;
 	double		indtuples;
 
+	Barrier		build_barrier;
+	SharedFileSet fileset;		/* space for shared temporary files */
+
 	/*
 	 * ParallelTableScanDescData data follows. Can't directly embed here, as
 	 * implementations of the parallel table scan desc interface might need
@@ -173,7 +183,6 @@ static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relati
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
 static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
-static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
 											  Relation heap, Relation index);
 static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
@@ -189,6 +198,12 @@ static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems);
 
+static double _gin_partition_sorted_data(GinBuildState *state);
+static void _gin_parallel_insert(GinBuildState *state,
+								 GinBuildShared *ginshared,
+								 Relation heap, Relation index,
+								 bool progress);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -707,8 +722,12 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
 
-		/* scan the relation in parallel and merge per-worker results */
-		reltuples = _gin_parallel_merge(state);
+		/* partition the sorted data */
+		reltuples = _gin_partition_sorted_data(state);
+
+		/* do the insert for the leader's partition */
+		_gin_parallel_insert(state, state->bs_leader->ginshared,
+							 heap, index, true);
 
 		_gin_end_parallel(state->bs_leader, state);
 	}
@@ -997,6 +1016,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	ginshared->reltuples = 0.0;
 	ginshared->indtuples = 0.0;
 
+	/* used to wait for data to insert */
+	BarrierInit(&ginshared->build_barrier, scantuplesortstates);
+
+	/* Set up the space we'll use for shared temporary files. */
+	SharedFileSetInit(&ginshared->fileset, pcxt->seg);
+
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
 								  snapshot);
@@ -1064,6 +1089,11 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+
+	/* wait for workers to read the data and add it to the tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned, leader continues");
 }
 
 /*
@@ -1077,6 +1107,8 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	/* Shutdown worker processes */
 	WaitForParallelWorkersToFinish(ginleader->pcxt);
 
+	SharedFileSetDeleteAll(&ginleader->ginshared->fileset);
+
 	/*
 	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
 	 * or we might get incomplete data.)
@@ -1732,134 +1764,6 @@ GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 	return GinBufferKeyEquals(buffer, tup);
 }
 
-/*
- * Within leader, wait for end of heap scan and merge per-worker results.
- *
- * After waiting for all workers to finish, merge the per-worker results into
- * the complete index. The results from each worker are sorted by block number
- * (start of the page range). While combinig the per-worker results we merge
- * summaries for the same page range, and also fill-in empty summaries for
- * ranges without any tuples.
- *
- * Returns the total number of heap tuples scanned.
- */
-static double
-_gin_parallel_merge(GinBuildState *state)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-	double		reltuples = 0;
-	GinBuffer  *buffer;
-
-	/* wait for workers to scan table and produce partial results */
-	reltuples = _gin_parallel_heapscan(state);
-
-	/* do the actual sort in the leader */
-	tuplesort_performsort(state->bs_sortstate);
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The leader is allowed to use the whole maintenance_work_mem buffer to
-	 * combine data. The parallel workers already completed.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by category and
-	 * key. That probably gives us order matching how data is organized in the
-	 * index.
-	 *
-	 * We don't insert the GIN tuples right away, but instead accumulate as
-	 * many TIDs for the same key as possible, and then insert that at once.
-	 * This way we don't need to decompress/recompress the posting lists, etc.
-	 */
-	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
-	{
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nitems, &state->buildStats);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nfrozen, &state->buildStats);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferMergeTuple(buffer, tup);
-
-		if (buffer->cached)
-			GinBufferUnpackCached(buffer, 0);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		AssertCheckItemPointers(buffer);
-
-		Assert(!PointerIsValid(buffer->cached));
-		ginEntryInsert(&state->ginstate,
-					   buffer->attnum, buffer->key, buffer->category,
-					   buffer->items, buffer->nitems, &state->buildStats);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(state->bs_sortstate);
-
-	return reltuples;
-}
-
 /*
  * Returns size of shared memory required to store state for a parallel
  * gin index build based on the snapshot its parallel scan will use.
@@ -2076,6 +1980,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Prepare to track buffer usage during parallel execution */
 	InstrStartParallelQuery();
 
+	/* attach to the fileset too */
+	SharedFileSetAttach(&ginshared->fileset, seg);
+
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
 	 * (when requested number of workers were not launched, this will be
@@ -2086,6 +1993,20 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
 
+	/* wait for workers to read the data and add it to the tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned by workers, leader continues");
+
+	/* leader sorts and partitions the data */
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned by leader, worker continues");
+
+	_gin_parallel_insert(&buildstate, ginshared, heapRel, indexRel, false);
+
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
@@ -2357,3 +2278,235 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	return ItemPointerCompare(GinTupleGetFirst(a),
 							  GinTupleGetFirst(b));
 }
+
+static double
+_gin_partition_sorted_data(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	GinBuildShared *shared = state->bs_leader->ginshared;
+	BufFile   **files;
+	int64		fileidx = 0;
+	double		reltuples;
+
+	/* how many tuples per worker */
+	int64		worker_tuples = (state->indtuples / shared->scantuplesortstates) + 1;
+	int64		remaining = Min(worker_tuples, 1000);
+	int64		ntmp = 0;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* Allocate BufFiles, one for each participant. */
+	files = palloc0_array(BufFile *, shared->scantuplesortstates);
+
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		char		fname[MAXPGPATH];
+
+		sprintf(fname, "worker-%d", i);
+
+		files[i] = BufFileCreateFileSet(&shared->fileset.fs, fname);
+	}
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us order matching how data is organized in the
+	 * index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		ntmp++;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * FIXME Maybe move to next partition only when the index key changes?
+		 * Otherwise we might have issues with 'could not fit onto page' when
+		 * adding overlapping TID lists to the index. But maybe that can't
+		 * happen, given the merging of data in the tuplesort?
+		 */
+
+		BufFileWrite(files[fileidx], &tuplen, sizeof(tuplen));
+		BufFileWrite(files[fileidx], tup, tuplen);
+
+		remaining--;
+
+		/* move to the next file */
+		if (remaining == 0)
+		{
+			remaining = Min(worker_tuples, 1000);
+			fileidx++;
+			fileidx = fileidx % shared->scantuplesortstates;
+		}
+	}
+
+	/* close the files */
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		BufFileClose(files[i]);
+	}
+
+	/* and also close the tuplesort */
+	tuplesort_end(state->bs_sortstate);
+
+	/* let the workers know the data has been partitioned */
+	if (BarrierArriveAndWait(&shared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned, leader continues");
+
+	return reltuples;
+}
+
+static void
+_gin_parallel_insert(GinBuildState *state, GinBuildShared *ginshared,
+					 Relation heap, Relation index, bool progress)
+{
+	GinBuffer  *buffer;
+	GinTuple   *tup;
+	Size		len;
+
+	BufFile    *file;
+	char		fname[MAXPGPATH];
+	char	   *buff;
+	int64		ntuples = 0;
+	Size		maxlen;
+
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * Each participant processes its own partition of the sorted data, so
+	 * it can use its whole memory allowance to combine the entries.
+	 */
+	buffer = GinBufferInit(state->ginstate.index);
+
+
+	sprintf(fname, "worker-%d", ParallelWorkerNumber + 1);
+	file = BufFileOpenFileSet(&ginshared->fileset.fs, fname, O_RDONLY, false);
+
+	/* 8kB seems like a reasonable starting point */
+	maxlen = 8192;
+	buff = palloc(maxlen);
+
+	while (true)
+	{
+		size_t		ret;
+
+		ret = BufFileRead(file, &len, sizeof(len));
+
+		if (ret == 0)
+			break;
+		if (ret != sizeof(len))
+			elog(ERROR, "could not read tuple length: read %zu of %zu bytes", ret, sizeof(len));
+
+		/* maybe resize the buffer */
+		if (maxlen < len)
+		{
+			while (maxlen < len)
+				maxlen *= 2;
+
+			buff = repalloc(buff, maxlen);
+		}
+
+		tup = (GinTuple *) buff;
+
+
+		BufFileReadExact(file, tup, len);
+
+		ntuples++;
+
+		if (ntuples % 100000 == 0)
+			elog(LOG, "inserted " INT64_FORMAT " tuples", ntuples);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * The buffer hit the memory limit and has enough frozen TIDs -
+			 * flush the frozen part of the TID list into the index, and
+			 * truncate the in-memory list.
+			 */
+			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer);
+
+		Assert(!PointerIsValid(buffer->cached));
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	BufFileClose(file);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..afb9be848a0 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -116,6 +116,8 @@ CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from star
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
 CHECKPOINT_START	"Waiting for a checkpoint to start."
 EXECUTE_GATHER	"Waiting for activity from a child process while executing a <literal>Gather</literal> plan node."
+GIN_BUILD_PARTITION	"Waiting for data to be partitioned during a parallel GIN index build."
+GIN_BUILD_SCAN	"Waiting for the data scan to finish during a parallel GIN index build."
 HASH_BATCH_ALLOCATE	"Waiting for an elected Parallel Hash participant to allocate a hash table."
 HASH_BATCH_ELECT	"Waiting to elect a Parallel Hash participant to allocate a hash table."
 HASH_BATCH_LOAD	"Waiting for other Parallel Hash participants to finish loading a hash table."
-- 
2.48.1

#53Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#52)
7 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

One more patch version / rebase. I've been planning to get 0001
committed, but I realized there's one more loose end - progress reporting.

I could have committed it without that, I guess, but Matthias mentioned
it a couple of days ago, so I took a stab at it. The build goes through
these five stages (on top of "INITIALIZE"):

PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN
PROGRESS_GIN_PHASE_PERFORMSORT_1
PROGRESS_GIN_PHASE_MERGE_1
PROGRESS_GIN_PHASE_PERFORMSORT_2
PROGRESS_GIN_PHASE_MERGE_2

The phases up to PROGRESS_GIN_PHASE_MERGE_1 happen in the workers, i.e.
that part ends with the workers feeding the sorted/merged data into the
shared tuplesort. The last two phases run in the leader, which merges
the data and actually inserts it into the GIN index.

The "parallel" part has the blocks_done/blocks_total showing progress,
per the parallel scan. The "leader" phases use tuples_done/tuples_total,
where "tuple" is the GIN tuple produced by workers (each worker reports
the number of "tuples" it writes into the shared tuplesort, the leader
then tracks how many it processed).
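
For illustration, here's a minimal sketch (not the patch code) of what
the leader-side reporting might look like. It assumes the
PROGRESS_GIN_PHASE_* constants above; pgstat_progress_update_param()
and the PROGRESS_CREATEIDX_* counters are the existing progress API,
and the helper name is made up:

    /* hypothetical helper sketching the leader-side merge reporting */
    static void
    report_merge_progress(int64 tuples_total)
    {
        int64       tuples_done = 0;

        pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
                                     PROGRESS_GIN_PHASE_MERGE_2);
        pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_TOTAL,
                                     tuples_total);

        /* ... then, for each GIN tuple read from the shared tuplesort ... */
        pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
                                     ++tuples_done);
    }

Those counters then surface as tuples_done/tuples_total in the
pg_stat_progress_create_index view.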

I think this works pretty nicely. I'm not entirely sure we need all the
phases - maybe it'd be fine to have the sort+merge as a single phase? Or
maybe there should be one extra "sort" phase? Workers do two sorts:
first on their "private" tuplesort, then on the "shared" one.

What annoys me a little bit is that we only see those stages if the
leader participates as a worker. With parallel_leader_participation=off
none of the phases is visible (we still see the blocks from the scan).

regards

--
Tomas Vondra

Attachments:

v20250225-0005-Enforce-memory-limit-during-parallel-GIN-b.patch (text/x-patch)
From 7f94e97180a5e44540bb1be9f471e6f73671cc31 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:02:45 +0100
Subject: [PATCH v20250225 5/7] Enforce memory limit during parallel GIN builds

Index builds are expected to respect maintenance_work_mem, just like
other maintenance operations. For serial builds this is done simply by
flushing the buffer in ginBuildCallback() into the index. But with
parallel builds it's more complicated, because there are multiple places
that can allocate memory.

ginBuildCallbackParallel() does the same thing as ginBuildCallback(),
except that the accumulated items are written into tuplesort. Then the
entries with the same key get merged - first in the worker, then in the
leader - and the TID lists may get (arbitrarily) long. It's unlikely it
would exceed the memory limit, but it's possible. We address this by
evicting some of the data if the list gets too long.

We can't simply dump the whole in-memory TID list. The GIN index bulk
insert code expects to see TIDs in monotonic order; it may fail if the
TIDs go backwards. If the TID lists overlap, evicting the whole current
TID list would break this (a later entry might add "old" TID values into
the already-written part).

In the workers this is not an issue, because the lists never overlap.
But the leader may see overlapping lists produced by the workers.

We can however derive a safe "horizon" TID - the entries (for a given
key) are sorted by (key, first TID), which means no future list can add
values before the last "first TID" we've seen. This patch tracks the
"frozen" part of the TID list, which we know can't change by merging
additional TID lists. If needed, we can evict this part of the list.

We don't want to do this too often - the smaller lists we evict, the
more expensive it'll be to merge them in the next step (especially in
the leader). Therefore we only trim the list if we have at least 1024
frozen items, and if the whole list is at least 64kB large.

These limits are somewhat arbitrary and fairly low. We might calculate
some limits from maintenance_work_mem, but judging by experiments that
does not really improve anything (time, compression ratio, ...). So we
stick to these conservative limits to release memory faster.
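
To make the "horizon" rule concrete, here is a toy standalone program
(plain integers standing in for TIDs; none of this is patch code) that
shows which prefix of a buffered list becomes frozen when the first
value of the next list arrives:

    #include <stdio.h>

    /*
     * Count the leading values at or below the incoming list's first
     * value; later merges can never reorder those, so they are safe
     * to flush.
     */
    static int
    count_frozen(const int *items, int nitems, int incoming_first)
    {
        int     nfrozen = 0;

        while (nfrozen < nitems && items[nfrozen] <= incoming_first)
            nfrozen++;

        return nfrozen;
    }

    int
    main(void)
    {
        int     items[] = {3, 7, 12, 20, 31, 45};   /* stand-ins for TIDs */

        /* the next merged list starts at 20, so 3..20 are frozen */
        printf("frozen: %d of 6\n", count_frozen(items, 6, 20));
        return 0;
    }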

Author: Tomas Vondra
Reviewed-by: Matthias van de Meent
Discussion:
---
 src/backend/access/gin/gininsert.c | 212 +++++++++++++++++++++++++++--
 1 file changed, 204 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index a10266b39e1..e0938a71112 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1156,8 +1156,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempting to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1230,6 +1234,13 @@ GinBufferInit(Relation index)
 				nKeys;
 	TupleDesc	desc = RelationGetDescr(index);
 
+	/*
+	 * How many items can we fit into the memory limit? We don't want to end
+	 * up with too many TIDs, and 64kB seems more than enough. But maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
 	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
 
 	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
@@ -1337,6 +1348,48 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (r == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we understand writing out and removing the tuple IDs that
+ * we know can't change by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding whether to trim - whether the
+ * resulting list (after adding TIDs from the new tuple) would be too long,
+ * and whether there are enough TIDs to trim (with values less than the
+ * "first" TID from the new tuple). If both hold, we do the trim. By enough
+ * we mean at least 1024 TIDs (mostly an arbitrary number).
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* no need to trim if we have not hit the memory limit yet */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1387,21 +1440,76 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+							GinTupleGetFirst(tup)) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now find the last TID we know to be frozen, i.e. the last TID right
+	 * before the new GIN tuple.
+	 *
+	 * Start with the first not-yet-frozen tuple, and walk until we find the
+	 * first TID that's higher. If we already know the whole list is frozen
+	 * (i.e. nfrozen == nitems), this does nothing.
+	 *
+	 * XXX This might do a binary search for sufficiently long lists, but it
+	 * does not seem worth the complexity. Overlapping lists should be rare
+	 * does not seem worth the complexity. Overlapping lists should be rare,
+	 * TID comparisons are cheap, and we should quickly freeze most of
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i],
+							   GinTupleGetFirst(tup)) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		pfree(new);
+
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer);
 	}
@@ -1433,11 +1541,29 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /*
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
@@ -1505,7 +1631,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers already completed.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/*
@@ -1563,6 +1694,32 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * The buffer hit the memory limit and has enough frozen TIDs -
+			 * flush the frozen part of the TID list into the index, and
+			 * truncate the in-memory list.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
@@ -1656,7 +1813,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/* sort the raw per-worker data */
@@ -1712,6 +1875,39 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * The buffer hit the memory limit and has enough frozen TIDs -
+			 * write the frozen part of the TID list into the shared
+			 * tuplesort, and truncate the in-memory list.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
-- 
2.48.1

v20250225-0001-Allow-parallel-CREATE-INDEX-for-GIN-indexe.patch (text/x-patch)
From dcffd8a0203c3056e6bc4bc3881da4f8f6744721 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Mon, 24 Feb 2025 16:48:48 +0100
Subject: [PATCH v20250225 1/7] Allow parallel CREATE INDEX for GIN indexes

Allow using parallel workers to build a GIN index, similarly to BTREE
and BRIN. For large tables this may result in significant speedup when
the build is CPU-bound.

The work is divided so that each worker builds index entries on a subset
of the table, determined by the regular parallel scan used to read the
data. Each worker uses a local tuplesort to sort and merge the entries
for the same key. The TID lists do not overlap (for a given key), which
means the merge sort simply concatenates the two lists. The merged
entries are written into a shared tuplesort for the leader.
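
As a small illustration of the concatenation point (a sketch only, not
patch code; list_a/list_b are hypothetical per-worker TID arrays for
one key), merging with the existing ginMergeItemPointers() degenerates
to an append when the inputs don't overlap:

    int         nmerged;
    ItemPointer merged;

    merged = ginMergeItemPointers(list_a, na,
                                  list_b, nb,
                                  &nmerged);

    /* no TID appears in both lists, so nothing gets deduplicated */
    Assert(nmerged == na + nb);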

The leader needs to merge the sorted entries again, before writing them
into the index. But this way a significant part of the work happens in
the workers, and the leader is left with merging fewer large entries,
which is more efficient.

Most of the parallelism infrastructure is a simplified copy of the code
used by BTREE indexes, omitting the parts irrelevant for GIN indexes
(e.g. uniqueness checks).

Original patch by me, with reviews and substantial improvements by
Matthias van de Meent, certainly enough to make him a co-author.

Author: Tomas Vondra, Matthias van de Meent
Reviewed-by: Matthias van de Meent, Andy Fan, Kirill Reshke
Discussion: https://postgr.es/m/6ab4003f-a8b8-4d75-a67f-f25ad98582dc%40enterprisedb.com
---
 src/backend/access/gin/gininsert.c         | 1617 +++++++++++++++++++-
 src/backend/access/gin/ginutil.c           |    2 +-
 src/backend/access/transam/parallel.c      |    4 +
 src/backend/utils/sort/tuplesortvariants.c |  200 +++
 src/include/access/gin.h                   |    4 +
 src/include/access/gin_tuple.h             |   50 +
 src/include/utils/tuplesort.h              |    8 +
 src/tools/pgindent/typedefs.list           |    4 +
 8 files changed, 1874 insertions(+), 15 deletions(-)
 create mode 100644 src/include/access/gin_tuple.h

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d1b5e8f0dd1..a23b457bba3 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -15,14 +15,126 @@
 #include "postgres.h"
 
 #include "access/gin_private.h"
+#include "access/gin_tuple.h"
+#include "access/table.h"
 #include "access/tableam.h"
 #include "access/xloginsert.h"
+#include "catalog/index.h"
+#include "catalog/pg_collation.h"
+#include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
+#include "pgstat.h"
 #include "storage/bufmgr.h"
 #include "storage/predicate.h"
+#include "tcop/tcopprot.h"		/* pgrminclude ignore */
+#include "utils/datum.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
+#include "utils/builtins.h"
+#include "utils/sortsupport.h"
+
+
+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED			UINT64CONST(0xB000000000000001)
+#define PARALLEL_KEY_TUPLESORT			UINT64CONST(0xB000000000000002)
+#define PARALLEL_KEY_QUERY_TEXT			UINT64CONST(0xB000000000000003)
+#define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
+#define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
+
+/*
+ * Status for index builds performed in parallel.  This is allocated in a
+ * dynamic shared memory segment.
+ */
+typedef struct GinBuildShared
+{
+	/*
+	 * These fields are not modified during the build.  They primarily exist
+	 * for the benefit of worker processes that need to create state
+	 * corresponding to that used by the leader.
+	 */
+	Oid			heaprelid;
+	Oid			indexrelid;
+	bool		isconcurrent;
+	int			scantuplesortstates;
+
+	/*
+	 * workersdonecv is used to monitor the progress of workers.  All parallel
+	 * participants must indicate that they are done before leader can use
+	 * results built by the workers (and before leader can write the data into
+	 * the index).
+	 */
+	ConditionVariable workersdonecv;
+
+	/*
+	 * mutex protects all following fields
+	 *
+	 * These fields contain status information of interest to GIN index builds
+	 * that must work just the same when an index is built in parallel.
+	 */
+	slock_t		mutex;
+
+	/*
+	 * Mutable state that is maintained by workers, and reported back to
+	 * leader at end of the scans.
+	 *
+	 * nparticipantsdone is number of worker processes finished.
+	 *
+	 * reltuples is the total number of input heap tuples.
+	 *
+	 * indtuples is the total number of tuples that made it into the index.
+	 */
+	int			nparticipantsdone;
+	double		reltuples;
+	double		indtuples;
+
+	/*
+	 * ParallelTableScanDescData data follows. Can't directly embed here, as
+	 * implementations of the parallel table scan desc interface might need
+	 * stronger alignment.
+	 */
+} GinBuildShared;
+
+/*
+ * Return pointer to a GinBuildShared's parallel table scan.
+ *
+ * c.f. shm_toc_allocate as to why BUFFERALIGN is used, rather than just
+ * MAXALIGN.
+ */
+#define ParallelTableScanFromGinBuildShared(shared) \
+	(ParallelTableScanDesc) ((char *) (shared) + BUFFERALIGN(sizeof(GinBuildShared)))
+
+/*
+ * Status for leader in parallel index build.
+ */
+typedef struct GinLeader
+{
+	/* parallel context itself */
+	ParallelContext *pcxt;
+
+	/*
+	 * nparticipanttuplesorts is the exact number of worker processes
+	 * successfully launched, plus one leader process if it participates as a
+	 * worker (only DISABLE_LEADER_PARTICIPATION builds avoid leader
+	 * participating as a worker).
+	 */
+	int			nparticipanttuplesorts;
+
+	/*
+	 * Leader process convenience pointers to shared state (leader avoids TOC
+	 * lookups).
+	 *
+	 * GinBuildShared is the shared state for entire build.  sharedsort is the
+	 * shared, tuplesort-managed state passed to each process tuplesort.
+	 * snapshot is the snapshot used by the scan iff an MVCC snapshot is
+	 * required.
+	 */
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	Snapshot	snapshot;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+} GinLeader;
 
 typedef struct
 {
@@ -32,9 +144,58 @@ typedef struct
 	MemoryContext tmpCtx;
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
+	ItemPointerData tid;
+	int				work_mem;
+
+	/* FIXME likely duplicate with indtuples */
+	double		bs_numtuples;
+	double		bs_reltuples;
+
+	/*
+	 * bs_leader is only present when a parallel index build is performed, and
+	 * only in the leader process.
+	 */
+	GinLeader  *bs_leader;
+	int			bs_worker_id;
+
+	/*
+	 * The sortstate is used by workers (including the leader). It has to be
+	 * part of the build state, because that's the only thing passed to the
+	 * build callback etc.
+	 */
+	Tuplesortstate *bs_sortstate;
+
+	/*
+	 * The sortstate used only within a single worker for the first merge pass
+	 * happening there. In principle it doesn't need to be part of the build
+	 * state and we could pass it around directly, but it's more convenient
+	 * this way. And it's part of the build state, after all.
+	 */
+	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
+/* parallel index builds */
+static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+								bool isconcurrent, int request);
+static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
+static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
+static double _gin_parallel_heapscan(GinBuildState *buildstate);
+static double _gin_parallel_merge(GinBuildState *buildstate);
+static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
+											  Relation heap, Relation index);
+static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
+										 GinBuildShared *ginshared,
+										 Sharedsort *sharedsort,
+										 Relation heap, Relation index,
+										 int sortmem, bool progress);
+
+static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+								  Datum key, int16 typlen, bool typbyval,
+								  ItemPointerData *items, uint32 nitems,
+								  Size *len);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -313,12 +474,122 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	MemoryContextSwitchTo(oldCtx);
 }
 
+/*
+ * ginFlushBuildState
+ *		Write all data from BuildAccumulator into the tuplesort.
+ */
+static void
+ginFlushBuildState(GinBuildState *buildstate, Relation index)
+{
+	ItemPointerData *list;
+	Datum		key;
+	GinNullCategory category;
+	uint32		nlist;
+	OffsetNumber attnum;
+	TupleDesc	tdesc = RelationGetDescr(index);
+
+	ginBeginBAScan(&buildstate->accum);
+	while ((list = ginGetBAEntry(&buildstate->accum,
+								 &attnum, &key, &category, &nlist)) != NULL)
+	{
+		/* information about the key */
+		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
+		/* GIN tuple and tuple length */
+		GinTuple   *tup;
+		Size		tuplen;
+
+		/* there could be many entries, so be willing to abort here */
+		CHECK_FOR_INTERRUPTS();
+
+		tup = _gin_build_tuple(attnum, category,
+							   key, attr->attlen, attr->attbyval,
+							   list, nlist, &tuplen);
+
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+
+		pfree(tup);
+	}
+
+	MemoryContextReset(buildstate->tmpCtx);
+	ginInitBA(&buildstate->accum);
+}
+
+/*
+ * ginBuildCallbackParallel
+ *		Callback for the parallel index build.
+ *
+ * This is similar to the serial build callback ginBuildCallback, but
+ * instead of writing the accumulated entries into the index, each worker
+ * writes them into a (local) tuplesort.
+ *
+ * The worker then sorts and combines these entries, before writing them
+ * into a shared tuplesort for the leader (see _gin_parallel_scan_and_build
+ * for the whole process).
+ */
+static void
+ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
+						 bool *isnull, bool tupleIsAlive, void *state)
+{
+	GinBuildState *buildstate = (GinBuildState *) state;
+	MemoryContext oldCtx;
+	int			i;
+
+	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
+
+	/*
+	 * if scan wrapped around - flush accumulated entries and start anew
+	 *
+	 * With parallel scans, we don't have a guarantee the scan does not start
+	 * half-way through the relation (serial builds disable sync scans and
+	 * always start from block 0, parallel scans require allow_sync=true).
+	 *
+	 * Building the posting lists assumes the TIDs are monotonic and never go
+	 * back, and a wraparound would break that. We handle that by detecting
+	 * the wraparound, and flushing all entries. This means we'll later see
+	 * two separate entries with non-overlapping TID lists (which can be
+	 * combined by merge sort).
+	 *
+	 * To detect a wraparound, we remember the last TID seen by each worker
+	 * (for any key). If the next TID seen by the worker is lower, the scan
+	 * must have wrapped around.
+	 */
+	if (ItemPointerCompare(tid, &buildstate->tid) < 0)
+		ginFlushBuildState(buildstate, index);
+
+	/* remember the TID we're about to process */
+	buildstate->tid = *tid;
+
+	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
+		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+							   values[i], isnull[i], tid);
+
+	/*
+	 * If we've maxed out our available memory, dump everything to the
+	 * tuplesort.
+	 *
+	 * XXX It might seem this should set the memory limit to 32MB, same as
+	 * what plan_create_index_workers() uses to calculate the number of
+	 * parallel workers, but that's the limit for tuplesort. So it seems
+	 * better to keep using work_mem here.
+	 *
+	 * XXX But maybe we should calculate this as a per-worker fraction of
+	 * maintenance_work_mem. It's weird to use work_mem here, in a clearly
+	 * maintenance_work_mem. It's weird to use work_mem here, in what is
+	 * clearly a maintenance command.
+	if (buildstate->accum.allocatedMemory >= buildstate->work_mem * (Size) 1024)
+		ginFlushBuildState(buildstate, index);
+
+	MemoryContextSwitchTo(oldCtx);
+}
+
 IndexBuildResult *
 ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 {
 	IndexBuildResult *result;
 	double		reltuples;
 	GinBuildState buildstate;
+	GinBuildState *state = &buildstate;
 	Buffer		RootBuffer,
 				MetaBuffer;
 	ItemPointerData *list;
@@ -336,6 +607,16 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
+	/*
+	 * Initialize all the fields, not to trip valgrind.
+	 *
+	 * XXX Maybe there should be an "init" function for build state?
+	 */
+	buildstate.bs_numtuples = 0;
+	buildstate.bs_reltuples = 0;
+	buildstate.bs_leader = NULL;
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
+
 	/* initialize the meta page */
 	MetaBuffer = GinNewBuffer(index);
 
@@ -376,24 +657,91 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	ginInitBA(&buildstate.accum);
 
 	/*
-	 * Do the heap scan.  We disallow sync scan here because dataPlaceToPage
-	 * prefers to receive tuples in TID order.
+	 * Attempt to launch parallel worker scan when required
+	 *
+	 * XXX plan_create_index_workers makes the number of workers dependent on
+	 * maintenance_work_mem, requiring 32MB for each worker. For GIN that's
+	 * reasonable too, because we sort the data just like btree. It does
+	 * ignore the memory used to accumulate data in memory (set by work_mem),
+	 * but there is no way to communicate that to plan_create_index_workers.
 	 */
-	reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
-									   ginBuildCallback, &buildstate, NULL);
+	if (indexInfo->ii_ParallelWorkers > 0)
+		_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
+							indexInfo->ii_ParallelWorkers);
 
-	/* dump remaining entries to the index */
-	oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
-	ginBeginBAScan(&buildstate.accum);
-	while ((list = ginGetBAEntry(&buildstate.accum,
-								 &attnum, &key, &category, &nlist)) != NULL)
+	/*
+	 * If parallel build requested and at least one worker process was
+	 * successfully launched, set up coordination state, wait for workers to
+	 * complete. Then read all tuples from the shared tuplesort and insert
+	 * them into the index.
+	 *
+	 * In serial mode, simply scan the table and build the index one index
+	 * tuple at a time.
+	 */
+	if (state->bs_leader)
 	{
-		/* there could be many entries, so be willing to abort here */
-		CHECK_FOR_INTERRUPTS();
-		ginEntryInsert(&buildstate.ginstate, attnum, key, category,
-					   list, nlist, &buildstate.buildStats);
+		SortCoordinate coordinate;
+
+		coordinate = (SortCoordinate) palloc0(sizeof(SortCoordinateData));
+		coordinate->isWorker = false;
+		coordinate->nParticipants =
+			state->bs_leader->nparticipanttuplesorts;
+		coordinate->sharedsort = state->bs_leader->sharedsort;
+
+		/*
+		 * Begin leader tuplesort.
+		 *
+		 * In cases where parallelism is involved, the leader receives the
+		 * same share of maintenance_work_mem as a serial sort (it is
+		 * generally treated in the same way as a serial sort once we return).
+		 * Parallel worker Tuplesortstates will have received only a fraction
+		 * of maintenance_work_mem, though.
+		 *
+		 * We rely on the lifetime of the Leader Tuplesortstate almost not
+		 * overlapping with any worker Tuplesortstate's lifetime.  There may
+		 * be some small overlap, but that's okay because we rely on leader
+		 * Tuplesortstate only allocating a small, fixed amount of memory
+		 * here. When its tuplesort_performsort() is called (by our caller),
+		 * and significant amounts of memory are likely to be used, all
+		 * workers must have already freed almost all memory held by their
+		 * Tuplesortstates (they are about to go away completely, too).  The
+		 * overall effect is that maintenance_work_mem always represents an
+		 * absolute high watermark on the amount of memory used by a CREATE
+		 * INDEX operation, regardless of the use of parallelism or any other
+		 * factor.
+		 */
+		state->bs_sortstate =
+			tuplesort_begin_index_gin(heap, index,
+									  maintenance_work_mem, coordinate,
+									  TUPLESORT_NONE);
+
+		/* scan the relation in parallel and merge per-worker results */
+		reltuples = _gin_parallel_merge(state);
+
+		_gin_end_parallel(state->bs_leader, state);
+	}
+	else						/* no parallel index build */
+	{
+		/*
+		 * Do the heap scan.  We disallow sync scan here because
+		 * dataPlaceToPage prefers to receive tuples in TID order.
+		 */
+		reltuples = table_index_build_scan(heap, index, indexInfo, false, true,
+										   ginBuildCallback, &buildstate, NULL);
+
+		/* dump remaining entries to the index */
+		oldCtx = MemoryContextSwitchTo(buildstate.tmpCtx);
+		ginBeginBAScan(&buildstate.accum);
+		while ((list = ginGetBAEntry(&buildstate.accum,
+									 &attnum, &key, &category, &nlist)) != NULL)
+		{
+			/* there could be many entries, so be willing to abort here */
+			CHECK_FOR_INTERRUPTS();
+			ginEntryInsert(&buildstate.ginstate, attnum, key, category,
+						   list, nlist, &buildstate.buildStats);
+		}
+		MemoryContextSwitchTo(oldCtx);
 	}
-	MemoryContextSwitchTo(oldCtx);
 
 	MemoryContextDelete(buildstate.funcCtx);
 	MemoryContextDelete(buildstate.tmpCtx);
@@ -533,3 +881,1244 @@ gininsert(Relation index, Datum *values, bool *isnull,
 
 	return false;
 }
+
+/*
+ * Create parallel context, and launch workers for leader.
+ *
+ * buildstate argument should be initialized (with the exception of the
+ * tuplesort states, which may later be created based on shared
+ * state initially set up here).
+ *
+ * isconcurrent indicates if operation is CREATE INDEX CONCURRENTLY.
+ *
+ * request is the target number of parallel worker processes to launch.
+ *
+ * Sets buildstate's GinLeader, which caller must use to shut down parallel
+ * mode by passing it to _gin_end_parallel() at the very end of its index
+ * build.  If not even a single worker process can be launched, this is
+ * never set, and caller should proceed with a serial index build.
+ */
+static void
+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
+					bool isconcurrent, int request)
+{
+	ParallelContext *pcxt;
+	int			scantuplesortstates;
+	Snapshot	snapshot;
+	Size		estginshared;
+	Size		estsort;
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	GinLeader  *ginleader = (GinLeader *) palloc0(sizeof(GinLeader));
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	bool		leaderparticipates = true;
+	int			querylen;
+
+#ifdef DISABLE_LEADER_PARTICIPATION
+	leaderparticipates = false;
+#endif
+
+	/*
+	 * Enter parallel mode, and create context for parallel build of gin index
+	 */
+	EnterParallelMode();
+	Assert(request > 0);
+	pcxt = CreateParallelContext("postgres", "_gin_parallel_build_main",
+								 request);
+
+	scantuplesortstates = leaderparticipates ? request + 1 : request;
+
+	/*
+	 * Prepare for scan of the base relation.  In a normal index build, we use
+	 * SnapshotAny because we must retrieve all tuples and do our own time
+	 * qual checks (because we have to index RECENTLY_DEAD tuples).  In a
+	 * concurrent build, we take a regular MVCC snapshot and index whatever's
+	 * live according to that.
+	 */
+	if (!isconcurrent)
+		snapshot = SnapshotAny;
+	else
+		snapshot = RegisterSnapshot(GetTransactionSnapshot());
+
+	/*
+	 * Estimate size for our own PARALLEL_KEY_GIN_SHARED workspace.
+	 */
+	estginshared = _gin_parallel_estimate_shared(heap, snapshot);
+	shm_toc_estimate_chunk(&pcxt->estimator, estginshared);
+	estsort = tuplesort_estimate_shared(scantuplesortstates);
+	shm_toc_estimate_chunk(&pcxt->estimator, estsort);
+
+	shm_toc_estimate_keys(&pcxt->estimator, 2);
+
+	/*
+	 * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
+	 * and PARALLEL_KEY_BUFFER_USAGE.
+	 *
+	 * If there are no extensions loaded that care, we could skip this.  We
+	 * have no way of knowing whether anyone's looking at pgWalUsage or
+	 * pgBufferUsage, so do it unconditionally.
+	 */
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+	shm_toc_estimate_chunk(&pcxt->estimator,
+						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+	/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
+	if (debug_query_string)
+	{
+		querylen = strlen(debug_query_string);
+		shm_toc_estimate_chunk(&pcxt->estimator, querylen + 1);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+	else
+		querylen = 0;			/* keep compiler quiet */
+
+	/* Everyone's had a chance to ask for space, so now create the DSM */
+	InitializeParallelDSM(pcxt);
+
+	/* If no DSM segment was available, back out (do serial build) */
+	if (pcxt->seg == NULL)
+	{
+		if (IsMVCCSnapshot(snapshot))
+			UnregisterSnapshot(snapshot);
+		DestroyParallelContext(pcxt);
+		ExitParallelMode();
+		return;
+	}
+
+	/* Store shared build state, for which we reserved space */
+	ginshared = (GinBuildShared *) shm_toc_allocate(pcxt->toc, estginshared);
+	/* Initialize immutable state */
+	ginshared->heaprelid = RelationGetRelid(heap);
+	ginshared->indexrelid = RelationGetRelid(index);
+	ginshared->isconcurrent = isconcurrent;
+	ginshared->scantuplesortstates = scantuplesortstates;
+
+	ConditionVariableInit(&ginshared->workersdonecv);
+	SpinLockInit(&ginshared->mutex);
+
+	/* Initialize mutable state */
+	ginshared->nparticipantsdone = 0;
+	ginshared->reltuples = 0.0;
+	ginshared->indtuples = 0.0;
+
+	table_parallelscan_initialize(heap,
+								  ParallelTableScanFromGinBuildShared(ginshared),
+								  snapshot);
+
+	/*
+	 * Store shared tuplesort-private state, for which we reserved space.
+	 * Then, initialize opaque state using tuplesort routine.
+	 */
+	sharedsort = (Sharedsort *) shm_toc_allocate(pcxt->toc, estsort);
+	tuplesort_initialize_shared(sharedsort, scantuplesortstates,
+								pcxt->seg);
+
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_GIN_SHARED, ginshared);
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_TUPLESORT, sharedsort);
+
+	/* Store query string for workers */
+	if (debug_query_string)
+	{
+		char	   *sharedquery;
+
+		sharedquery = (char *) shm_toc_allocate(pcxt->toc, querylen + 1);
+		memcpy(sharedquery, debug_query_string, querylen + 1);
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_QUERY_TEXT, sharedquery);
+	}
+
+	/*
+	 * Allocate space for each worker's WalUsage and BufferUsage; no need to
+	 * initialize.
+	 */
+	walusage = shm_toc_allocate(pcxt->toc,
+								mul_size(sizeof(WalUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
+	bufferusage = shm_toc_allocate(pcxt->toc,
+								   mul_size(sizeof(BufferUsage), pcxt->nworkers));
+	shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+
+	/* Launch workers, saving status for leader/caller */
+	LaunchParallelWorkers(pcxt);
+	ginleader->pcxt = pcxt;
+	ginleader->nparticipanttuplesorts = pcxt->nworkers_launched;
+	if (leaderparticipates)
+		ginleader->nparticipanttuplesorts++;
+	ginleader->ginshared = ginshared;
+	ginleader->sharedsort = sharedsort;
+	ginleader->snapshot = snapshot;
+	ginleader->walusage = walusage;
+	ginleader->bufferusage = bufferusage;
+
+	/* If no workers were successfully launched, back out (do serial build) */
+	if (pcxt->nworkers_launched == 0)
+	{
+		_gin_end_parallel(ginleader, NULL);
+		return;
+	}
+
+	/* Save leader state now that it's clear build will be parallel */
+	buildstate->bs_leader = ginleader;
+
+	/* Join heap scan ourselves */
+	if (leaderparticipates)
+		_gin_leader_participate_as_worker(buildstate, heap, index);
+
+	/*
+	 * Caller needs to wait for all launched workers when we return.  Make
+	 * sure that the failure-to-start case will not hang forever.
+	 */
+	WaitForParallelWorkersToAttach(pcxt);
+}
+
+/*
+ * Shut down workers, destroy parallel context, and end parallel mode.
+ */
+static void
+_gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
+{
+	int			i;
+
+	/* Shutdown worker processes */
+	WaitForParallelWorkersToFinish(ginleader->pcxt);
+
+	/*
+	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
+	 * or we might get incomplete data.)
+	 */
+	for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
+		InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+
+	/* Free last reference to MVCC snapshot, if one was used */
+	if (IsMVCCSnapshot(ginleader->snapshot))
+		UnregisterSnapshot(ginleader->snapshot);
+	DestroyParallelContext(ginleader->pcxt);
+	ExitParallelMode();
+}
+
+/*
+ * Within leader, wait for end of heap scan.
+ *
+ * When called, parallel heap scan started by _gin_begin_parallel() will
+ * already be underway within worker processes (when leader participates
+ * as a worker, we should end up here just as workers are finishing).
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_heapscan(GinBuildState *state)
+{
+	GinBuildShared *ginshared = state->bs_leader->ginshared;
+	int			nparticipanttuplesorts;
+
+	nparticipanttuplesorts = state->bs_leader->nparticipanttuplesorts;
+	for (;;)
+	{
+		SpinLockAcquire(&ginshared->mutex);
+		if (ginshared->nparticipantsdone == nparticipanttuplesorts)
+		{
+			/* copy the data into leader state */
+			state->bs_reltuples = ginshared->reltuples;
+			state->bs_numtuples = ginshared->indtuples;
+
+			SpinLockRelease(&ginshared->mutex);
+			break;
+		}
+		SpinLockRelease(&ginshared->mutex);
+
+		ConditionVariableSleep(&ginshared->workersdonecv,
+							   WAIT_EVENT_PARALLEL_CREATE_INDEX_SCAN);
+	}
+
+	ConditionVariableCancelSleep();
+
+	return state->bs_reltuples;
+}
+
+/*
+ * Buffer used to accumulate TIDs from multiple GinTuples for the same key
+ * (we read these from the tuplesort, sorted by the key).
+ *
+ * This is similar to BuildAccumulator in that it's used to collect TIDs
+ * in memory before inserting them into the index, but it's much simpler
+ * as it only deals with a single index key at a time.
+ *
+ * When adding TIDs to the buffer, we make sure to keep them sorted - both
+ * during the initial table scan (where we detect when the parallel scan
+ * wraps around), and during merging (where we use mergesort).
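+ *
+ * For example (illustrative): with buffer contents {key = 'foo', items =
+ * [(1,1), (1,5)]}, a GinTuple for 'foo' with items [(1,7)] is merged into
+ * the buffer, while a GinTuple for a different key 'bar' forces the buffer
+ * to be flushed into the index first.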
+ */
+typedef struct GinBuffer
+{
+	OffsetNumber attnum;
+	GinNullCategory category;
+	Datum		key;			/* 0 if no key (and keylen == 0) */
+	Size		keylen;			/* number of bytes (not typlen) */
+
+	/* type info */
+	int16		typlen;
+	bool		typbyval;
+
+	/* array of TID values */
+	int			nitems;
+	SortSupport ssup;			/* for sorting/comparing keys */
+	ItemPointerData *items;
+} GinBuffer;
+
+/*
+ * Check that TID array contains valid values, and that it's sorted (if we
+ * expect it to be).
+ */
+static void
+AssertCheckItemPointers(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* we should not have a buffer with no TIDs to sort */
+	Assert(buffer->items != NULL);
+	Assert(buffer->nitems > 0);
+
+	for (int i = 0; i < buffer->nitems; i++)
+	{
+		Assert(ItemPointerIsValid(&buffer->items[i]));
+
+		/* don't check ordering for the first TID item */
+		if (i == 0)
+			continue;
+
+		Assert(ItemPointerCompare(&buffer->items[i - 1], &buffer->items[i]) < 0);
+	}
+#endif
+}
+
+/*
+ * GinBuffer checks
+ *
+ * XXX Maybe it would be better to have AssertCheckGinBuffer with flags, instead
+ * of having to call AssertCheckItemPointers in some places, if we require the
+ * items to not be empty?
+ */
+static void
+AssertCheckGinBuffer(GinBuffer *buffer)
+{
+#ifdef USE_ASSERT_CHECKING
+	/* if we have any items, the array must exist */
+	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+
+	/*
+	 * The buffer may be empty, in which case we must not call the check of
+	 * item pointers, because that assumes non-emptiness.
+	 */
+	if (buffer->nitems == 0)
+		return;
+
+	/* Make sure the item pointers are valid and sorted. */
+	AssertCheckItemPointers(buffer);
+#endif
+}
+
+/*
+ * GinBufferInit
+ *		Initialize buffer to store tuples for a GIN index.
+ *
+ * Initialize the buffer used to accumulate TIDs for a single key at a time
+ * (we process the data sorted by key), so we know when we have received all
+ * the data for a given key.
+ *
+ * Initializes sort support procedures for all index attributes.
+ */
+static GinBuffer *
+GinBufferInit(Relation index)
+{
+	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
+	int			i,
+				nKeys;
+	TupleDesc	desc = RelationGetDescr(index);
+
+	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
+
+	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
+
+	/*
+	 * Look up the comparison function for the index key data type, and use
+	 * it to initialize the sort support function.
+	 */
+	for (i = 0; i < nKeys; i++)
+	{
+		Oid			cmpFunc;
+		SortSupport sortKey = &buffer->ssup[i];
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = index->rd_indcollation[i];
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		/*
+		 * If the compare proc isn't specified in the opclass definition, look
+		 * up the index key type's default btree comparator.
+		 */
+		cmpFunc = index_getprocid(index, i + 1, GIN_COMPARE_PROC);
+		if (cmpFunc == InvalidOid)
+		{
+			TypeCacheEntry *typentry;
+
+			typentry = lookup_type_cache(att->atttypid,
+										 TYPECACHE_CMP_PROC_FINFO);
+			if (!OidIsValid(typentry->cmp_proc_finfo.fn_oid))
+				ereport(ERROR,
+						(errcode(ERRCODE_UNDEFINED_FUNCTION),
+						 errmsg("could not identify a comparison function for type %s",
+								format_type_be(att->atttypid))));
+
+			cmpFunc = typentry->cmp_proc_finfo.fn_oid;
+		}
+
+		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
+	}
+
+	return buffer;
+}
+
+/* Is the buffer empty, i.e. has no TID values in the array? */
+static bool
+GinBufferIsEmpty(GinBuffer *buffer)
+{
+	return (buffer->nitems == 0);
+}
+
+/*
+ * GinBufferKeyEquals
+ *		Can the buffer store TIDs for the provided GIN tuple (same key)?
+ *
+ * Check if the tuple matches the data already accumulated in the GIN
+ * buffer. Scalar fields are compared first, before the actual key.
+ *
+ * Returns true if the key matches (and thus the TIDs belong in the buffer),
+ * or false if the key does not match.
+ */
+static bool
+GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
+{
+	int			r;
+	Datum		tupkey;
+
+	AssertCheckGinBuffer(buffer);
+
+	if (tup->attrnum != buffer->attnum)
+		return false;
+
+	/* same attribute should have the same type info */
+	Assert(tup->typbyval == buffer->typbyval);
+	Assert(tup->typlen == buffer->typlen);
+
+	if (tup->category != buffer->category)
+		return false;
+
+	/*
+	 * For NULL/empty keys, this means equality, for normal keys we need to
+	 * compare the actual key value.
+	 */
+	if (buffer->category != GIN_CAT_NORM_KEY)
+		return true;
+
+	/*
+	 * For the tuple, get either the first sizeof(Datum) bytes for byval
+	 * types, or a pointer to the beginning of the data array.
+	 */
+	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+
+	r = ApplySortComparator(buffer->key, false,
+							tupkey, false,
+							&buffer->ssup[buffer->attnum - 1]);
+
+	return (r == 0);
+}
+
+/*
+ * GinBufferStoreTuple
+ *		Add data (especially TID list) from a GIN tuple to the buffer.
+ *
+ * The buffer is expected to be empty (in which case it's initialized), or
+ * to already contain data for the same key. The TID values from the tuple
+ * are combined with the stored values using a merge sort.
+ *
+ * The tuples (for the same key) are expected to be sorted by first TID. But
+ * this does not guarantee the lists do not overlap, especially in the leader,
+ * because the workers process interleaving data. There should be no overlaps
+ * in a single worker - it could happen when the parallel scan wraps around,
+ * but we detect that and flush the data (see ginBuildCallbackParallel).
+ *
+ * By sorting the GinTuples not only by key, but also by the first TID, we
+ * make it less likely the lists will overlap during merge. We merge them
+ * using mergesort, but when the lists don't overlap this degenerates into
+ * simply appending one list to the other, which is much cheaper.
+ *
+ * How often can the lists overlap? There should be no overlaps in workers,
+ * and in the leader we can see overlaps between lists built by different
+ * workers. But the workers merge the items as much as possible, so there
+ * should not be too many overlaps.
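+ *
+ * For example (illustrative): the lists [(1,1), (1,3)] and [(1,2), (2,1)]
+ * overlap, so ginMergeItemPointers() has to interleave them, while the
+ * lists [(1,1), (1,3)] and [(2,1), (2,2)] do not overlap, and the merge
+ * degenerates into a cheap append.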
+ */
+static void
+GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+{
+	ItemPointerData *items;
+	Datum		key;
+
+	AssertCheckGinBuffer(buffer);
+
+	key = _gin_parse_tuple(tup, &items);
+
+	/* if the buffer is empty, set the fields (and copy the key) */
+	if (GinBufferIsEmpty(buffer))
+	{
+		buffer->category = tup->category;
+		buffer->keylen = tup->keylen;
+		buffer->attnum = tup->attrnum;
+
+		buffer->typlen = tup->typlen;
+		buffer->typbyval = tup->typbyval;
+
+		if (tup->category == GIN_CAT_NORM_KEY)
+			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+		else
+			buffer->key = (Datum) 0;
+	}
+
+	/* add the new TIDs into the buffer, combine using merge-sort */
+	{
+		int			nnew;
+		ItemPointer new;
+
+		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+								   items, tup->nitems, &nnew);
+
+		Assert(nnew == buffer->nitems + tup->nitems);
+
+		if (buffer->items)
+			pfree(buffer->items);
+
+		buffer->items = new;
+		buffer->nitems = nnew;
+
+		AssertCheckItemPointers(buffer);
+	}
+}
+
+/*
+ * GinBufferReset
+ *		Reset the buffer into a state as if it contains no data.
+ *
+ * XXX Should we do something if the array of TIDs gets too large? It may
+ * grow too much, and we'll not free it until the worker finishes building.
+ * But it's better to not let the array grow arbitrarily large, and enforce
+ * work_mem as memory limit by flushing the buffer into the tuplesort.
+ */
+static void
+GinBufferReset(GinBuffer *buffer)
+{
+	Assert(!GinBufferIsEmpty(buffer));
+
+	/* release byref values, do nothing for by-val ones */
+	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	/*
+	 * Not required, but makes it more likely to trigger a NULL dereference
+	 * if the value is used incorrectly, etc.
+	 */
+	buffer->key = (Datum) 0;
+
+	buffer->attnum = 0;
+	buffer->category = 0;
+	buffer->keylen = 0;
+	buffer->nitems = 0;
+
+	buffer->typlen = 0;
+	buffer->typbyval = 0;
+}
+
+/*
+ * GinBufferFree
+ *		Release memory associated with the GinBuffer (including TID array).
+ */
+static void
+GinBufferFree(GinBuffer *buffer)
+{
+	if (buffer->items)
+		pfree(buffer->items);
+
+	/* release byref values, do nothing for by-val ones */
+	if (!GinBufferIsEmpty(buffer) &&
+		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
+		pfree(DatumGetPointer(buffer->key));
+
+	pfree(buffer);
+}
+
+/*
+ * GinBufferCanAddKey
+ *		Check if a given GIN tuple can be added to the current buffer.
+ *
+ * Returns true if the buffer is either empty or for the same index key.
+ *
+ * XXX This could / should also enforce a memory limit by checking the size of
+ * the TID array, and returning false if it's too large (more than work_mem,
+ * for example). But in the leader we need to be careful not to force flushing
+ * data too early, which might break the monotonicity of the TID list.
+ */
+static bool
+GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
+{
+	/* empty buffer can accept data for any key */
+	if (GinBufferIsEmpty(buffer))
+		return true;
+
+	/* otherwise just data for the same key */
+	return GinBufferKeyEquals(buffer, tup);
+}
+
+/*
+ * Within leader, wait for end of heap scan and merge per-worker results.
+ *
+ * After waiting for all workers to finish, read the GIN tuples from the
+ * shared tuplesort (sorted by category and key) and merge the per-worker
+ * results into the complete index. While combining the per-worker results
+ * we accumulate the TID lists for the same key, so that each key gets
+ * inserted into the index at once, with the complete TID list.
+ *
+ * Returns the total number of heap tuples scanned.
+ */
+static double
+_gin_parallel_merge(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	double		reltuples = 0;
+	GinBuffer  *buffer;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us an order matching how the data is organized
+	 * in the index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert them at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
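+	 *
+	 * For example (illustrative): if the workers produced three GinTuples
+	 * for the key 'abc', all three are accumulated in the buffer first, and
+	 * the key is inserted into the index only once, with the combined list.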
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the accumulated data into the index, and start a new entry
+			 * for the current GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer);
+
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(state->bs_sortstate);
+
+	return reltuples;
+}
+
+/*
+ * Returns size of shared memory required to store state for a parallel
+ * gin index build based on the snapshot its parallel scan will use.
+ */
+static Size
+_gin_parallel_estimate_shared(Relation heap, Snapshot snapshot)
+{
+	/* c.f. shm_toc_allocate as to why BUFFERALIGN is used */
+	return add_size(BUFFERALIGN(sizeof(GinBuildShared)),
+					table_parallelscan_estimate(heap, snapshot));
+}
+
+/*
+ * Within leader, participate as a parallel worker.
+ */
+static void
+_gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Relation index)
+{
+	GinLeader  *ginleader = buildstate->bs_leader;
+	int			sortmem;
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginleader->nparticipanttuplesorts;
+
+	/* Perform work common to all participants */
+	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
+								 ginleader->sharedsort, heap, index, sortmem, true);
+}
+
+/*
+ * _gin_process_worker_data
+ *		First phase of the key merging, happening in the worker.
+ *
+ * Depending on the number of distinct keys, the TID lists produced by the
+ * callback may be very short (due to frequent evictions in the callback).
+ * But combining many tiny lists is expensive, so we try to do as much as
+ * possible in the workers and only then pass the results to the leader.
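+ *
+ * For example (illustrative): if the callback flushed the key 'abc' ten
+ * times, the private sort contains ten short GinTuples for 'abc'. This
+ * pass combines them into a single GinTuple with one long TID list, so
+ * the leader sees (at most) one tuple per key from this worker.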
+ *
+ * We read the tuples sorted by the key, and merge them into larger lists.
+ * At the moment there's no memory limit, so this will just produce one
+ * huge (sorted) list per key in each worker. The leader then only needs to
+ * do a very limited number of mergesorts, which is good.
+ */
+static void
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+
+	GinBuffer  *buffer;
+
+	/* initialize buffer to combine entries for the same key */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	/* sort the raw per-worker data */
+	tuplesort_performsort(state->bs_worker_sort);
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
+	 * merge them into larger chunks for the leader to combine.
+	 */
+	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
+	{
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the accumulated data into the shared tuplesort, and start a
+			 * new entry for the current GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nitems, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferStoreTuple(buffer, tup);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		GinTuple   *ntup;
+		Size		ntuplen;
+
+		AssertCheckItemPointers(buffer);
+
+		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+								buffer->key, buffer->typlen, buffer->typbyval,
+								buffer->items, buffer->nitems, &ntuplen);
+
+		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+		pfree(ntup);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	tuplesort_end(worker_sort);
+}
+
+/*
+ * Perform a worker's portion of a parallel GIN index build sort.
+ *
+ * This generates a tuplesort for the worker portion of the table.
+ *
+ * sortmem is the amount of working memory to use within each worker,
+ * expressed in KBs.
+ *
+ * When this returns, workers are done, and need only release resources.
+ *
+ * Before feeding data into a shared tuplesort (for the leader process),
+ * the workers process data in two phases.
+ *
+ * 1) A worker reads a portion of rows from the table, accumulates entries
+ * in memory, and flushes them into a private tuplesort (e.g. because of
+ * using too much memory).
+ *
+ * 2) The private tuplesort gets sorted (by key and TID), the worker reads
+ * the data again, and combines the entries as much as possible. This has
+ * to happen eventually, and this way it's done in workers in parallel.
+ *
+ * Finally, the combined entries are written into the shared tuplesort, so
+ * that the leader can process them.
+ *
+ * How well this works (compared to just writing entries into the shared
+ * tuplesort) depends on the data set. For large tables with many distinct
+ * keys this helps a lot. With many distinct keys the buffer has to be
+ * flushed often, generating many entries with the same key and short TID
+ * lists. These entries need to be sorted and merged at some point before
+ * writing them to the index. The merging is quite expensive - it can easily
+ * be ~50% of a serial build - and doing as much of it in the workers means
+ * it's parallelized. The leader still has to merge results from the
+ * workers, but it's much more efficient to merge a few large entries than
+ * many tiny ones.
+ *
+ * This also reduces the amount of data the workers pass to the leader through
+ * the shared tuplesort. OTOH the workers need more space for the private sort,
+ * possibly up to 2x of the data, if no entries can be merged in a worker. But
+ * this is very unlikely, and the only consequence is inefficiency, so we
+ * ignore it.
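+ *
+ * Schematically (illustrative):
+ *
+ *   phase 1: table scan -> accumulate in memory -> private tuplesort
+ *   phase 2: private tuplesort -> merge TIDs per key -> shared tuplesort
+ *
+ * with the leader then merging the per-worker results from the shared
+ * tuplesort and inserting the keys into the index.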
+ */
+static void
+_gin_parallel_scan_and_build(GinBuildState *state,
+							 GinBuildShared *ginshared, Sharedsort *sharedsort,
+							 Relation heap, Relation index,
+							 int sortmem, bool progress)
+{
+	SortCoordinate coordinate;
+	TableScanDesc scan;
+	double		reltuples;
+	IndexInfo  *indexInfo;
+
+	/* Initialize local tuplesort coordination state */
+	coordinate = palloc0(sizeof(SortCoordinateData));
+	coordinate->isWorker = true;
+	coordinate->nParticipants = -1;
+	coordinate->sharedsort = sharedsort;
+
+	/* remember how much space is allowed for the accumulated entries */
+	state->work_mem = (sortmem / 2);
+
+	/* Begin "partial" tuplesort */
+	state->bs_sortstate = tuplesort_begin_index_gin(heap, index,
+													state->work_mem,
+													coordinate,
+													TUPLESORT_NONE);
+
+	/* Local per-worker sort of the raw data */
+	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
+													  state->work_mem,
+													  NULL,
+													  TUPLESORT_NONE);
+
+	/* Join parallel scan */
+	indexInfo = BuildIndexInfo(index);
+	indexInfo->ii_Concurrent = ginshared->isconcurrent;
+
+	scan = table_beginscan_parallel(heap,
+									ParallelTableScanFromGinBuildShared(ginshared));
+
+	reltuples = table_index_build_scan(heap, index, indexInfo, true, progress,
+									   ginBuildCallbackParallel, state, scan);
+
+	/* write remaining accumulated entries */
+	ginFlushBuildState(state, index);
+
+	/*
+	 * Do the first phase of in-worker processing - sort the data produced by
+	 * the callback, combine it into much larger chunks, and place those into
+	 * the shared tuplesort for the leader to process.
+	 */
+	_gin_process_worker_data(state, state->bs_worker_sort);
+
+	/* sort the GIN tuples built by this worker */
+	tuplesort_performsort(state->bs_sortstate);
+
+	state->bs_reltuples += reltuples;
+
+	/*
+	 * Done.  Record ambuild statistics.
+	 */
+	SpinLockAcquire(&ginshared->mutex);
+	ginshared->nparticipantsdone++;
+	ginshared->reltuples += state->bs_reltuples;
+	ginshared->indtuples += state->bs_numtuples;
+	SpinLockRelease(&ginshared->mutex);
+
+	/* Notify leader */
+	ConditionVariableSignal(&ginshared->workersdonecv);
+
+	tuplesort_end(state->bs_sortstate);
+}
+
+/*
+ * Perform work within a launched parallel process.
+ */
+void
+_gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
+{
+	char	   *sharedquery;
+	GinBuildShared *ginshared;
+	Sharedsort *sharedsort;
+	GinBuildState buildstate;
+	Relation	heapRel;
+	Relation	indexRel;
+	LOCKMODE	heapLockmode;
+	LOCKMODE	indexLockmode;
+	WalUsage   *walusage;
+	BufferUsage *bufferusage;
+	int			sortmem;
+
+	/*
+	 * The only possible status flag that can be set to the parallel worker is
+	 * PROC_IN_SAFE_IC.
+	 */
+	Assert((MyProc->statusFlags == 0) ||
+		   (MyProc->statusFlags == PROC_IN_SAFE_IC));
+
+	/* Set debug_query_string for individual workers first */
+	sharedquery = shm_toc_lookup(toc, PARALLEL_KEY_QUERY_TEXT, true);
+	debug_query_string = sharedquery;
+
+	/* Report the query string from leader */
+	pgstat_report_activity(STATE_RUNNING, debug_query_string);
+
+	/* Look up gin shared state */
+	ginshared = shm_toc_lookup(toc, PARALLEL_KEY_GIN_SHARED, false);
+
+	/* Open relations using lock modes known to be obtained by index.c */
+	if (!ginshared->isconcurrent)
+	{
+		heapLockmode = ShareLock;
+		indexLockmode = AccessExclusiveLock;
+	}
+	else
+	{
+		heapLockmode = ShareUpdateExclusiveLock;
+		indexLockmode = RowExclusiveLock;
+	}
+
+	/* Open relations within worker */
+	heapRel = table_open(ginshared->heaprelid, heapLockmode);
+	indexRel = index_open(ginshared->indexrelid, indexLockmode);
+
+	/* initialize the GIN build state */
+	initGinState(&buildstate.ginstate, indexRel);
+	buildstate.indtuples = 0;
+	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
+	memset(&buildstate.tid, 0, sizeof(ItemPointerData));
+
+	/*
+	 * create a temporary memory context that is used to hold data not yet
+	 * dumped out to the index
+	 */
+	buildstate.tmpCtx = AllocSetContextCreate(CurrentMemoryContext,
+											  "Gin build temporary context",
+											  ALLOCSET_DEFAULT_SIZES);
+
+	/*
+	 * create a temporary memory context that is used for calling
+	 * ginExtractEntries(), and can be reset after each tuple
+	 */
+	buildstate.funcCtx = AllocSetContextCreate(CurrentMemoryContext,
+											   "Gin build temporary context for user-defined function",
+											   ALLOCSET_DEFAULT_SIZES);
+
+	buildstate.accum.ginstate = &buildstate.ginstate;
+	ginInitBA(&buildstate.accum);
+
+
+	/* Look up shared state private to tuplesort.c */
+	sharedsort = shm_toc_lookup(toc, PARALLEL_KEY_TUPLESORT, false);
+	tuplesort_attach_shared(sharedsort, seg);
+
+	/* Prepare to track buffer usage during parallel execution */
+	InstrStartParallelQuery();
+
+	/*
+	 * Might as well use reliable figure when doling out maintenance_work_mem
+	 * (when requested number of workers were not launched, this will be
+	 * somewhat higher than it is for other workers).
+	 */
+	sortmem = maintenance_work_mem / ginshared->scantuplesortstates;
+
+	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
+								 heapRel, indexRel, sortmem, false);
+
+	/* Report WAL/buffer usage during parallel execution */
+	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
+	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+						  &walusage[ParallelWorkerNumber]);
+
+	index_close(indexRel, indexLockmode);
+	table_close(heapRel, heapLockmode);
+}
+
+/*
+ * _gin_build_tuple
+ *		Serialize the state for an index key into a tuple for tuplesort.
+ *
+ * The tuple has a number of scalar fields (mostly matching the build state),
+ * and then a data array that stores the key first, and then the TID list.
+ *
+ * For by-reference data types, we store the actual data. For by-val types
+ * we simply copy the whole Datum, so that we don't have to care about stuff
+ * like endianness etc. We could make it a little bit smaller, but it's not
+ * worth it - it's a tiny fraction of the data, and we need to align the
+ * start of the TID list anyway. So we wouldn't save anything.
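+ *
+ * The resulting layout is (illustrative):
+ *
+ *   [scalar fields][key bytes][padding to SHORTALIGN][TID array]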
+ */
+static GinTuple *
+_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
+				 Datum key, int16 typlen, bool typbyval,
+				 ItemPointerData *items, uint32 nitems,
+				 Size *len)
+{
+	GinTuple   *tuple;
+	char	   *ptr;
+
+	Size		tuplen;
+	int			keylen;
+
+	/*
+	 * Calculate how long the key value is. Only keys with GIN_CAT_NORM_KEY
+	 * have an actual non-empty key. We include varlena headers and \0 bytes
+	 * for strings, to make it easier to access the data in-line.
+	 *
+	 * For byval types we simply copy the whole Datum. We could store just the
+	 * necessary bytes, but this is simpler to work with and not worth the
+	 * extra complexity. Moreover we still need to align the data to allow
+	 * direct access to the item pointers.
+	 *
+	 * XXX Note that for byval types we store the whole datum, no matter what
+	 * the typlen value is.
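+	 *
+	 * Illustrative examples: an int4 key (byval) stores sizeof(Datum) bytes,
+	 * a text key (typlen -1) stores the whole varlena including the header,
+	 * and a cstring key (typlen -2) stores strlen + 1 bytes, including the
+	 * terminating '\0'.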
+	 */
+	if (category != GIN_CAT_NORM_KEY)
+		keylen = 0;
+	else if (typbyval)
+		keylen = sizeof(Datum);
+	else if (typlen > 0)
+		keylen = typlen;
+	else if (typlen == -1)
+		keylen = VARSIZE_ANY(key);
+	else if (typlen == -2)
+		keylen = strlen(DatumGetPointer(key)) + 1;
+	else
+		elog(ERROR, "unexpected typlen value (%d)", typlen);
+
+	/*
+	 * Determine GIN tuple length with all the data included. Be careful about
+	 * alignment, to allow direct access to item pointers.
+	 */
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) +
+		(sizeof(ItemPointerData) * nitems);
+
+	*len = tuplen;
+
+	/*
+	 * Allocate space for the whole GIN tuple.
+	 *
+	 * XXX palloc0 so that valgrind does not complain about uninitialized
+	 * bytes in writetup_index_gin, likely because of padding
+	 */
+	tuple = palloc0(tuplen);
+
+	tuple->tuplen = tuplen;
+	tuple->attrnum = attrnum;
+	tuple->category = category;
+	tuple->keylen = keylen;
+	tuple->nitems = nitems;
+
+	/* key type info */
+	tuple->typlen = typlen;
+	tuple->typbyval = typbyval;
+
+	/*
+	 * Copy the key and items into the tuple. First the key value, which we
+	 * can simply copy right at the beginning of the data array.
+	 */
+	if (category == GIN_CAT_NORM_KEY)
+	{
+		if (typbyval)
+		{
+			memcpy(tuple->data, &key, sizeof(Datum));
+		}
+		else if (typlen > 0)	/* byref, fixed length */
+		{
+			memcpy(tuple->data, DatumGetPointer(key), typlen);
+		}
+		else if (typlen == -1)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+		else if (typlen == -2)
+		{
+			memcpy(tuple->data, DatumGetPointer(key), keylen);
+		}
+	}
+
+	/* finally, copy the TIDs into the array */
+	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
+
+	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+
+	return tuple;
+}
+
+/*
+ * _gin_parse_tuple
+ *		Deserialize the tuple from the tuplesort representation.
+ *
+ * Most of the fields are directly accessible; the only things that need
+ * more care are the key and the TID list.
+ *
+ * For the key, this returns a regular Datum representing it. It's either the
+ * actual key value, or a pointer to the beginning of the data array (which is
+ * where the data was copied by _gin_build_tuple).
+ *
+ * The pointer to the TID list is returned through 'items' (which is simply
+ * a pointer to the data array).
+ */
+static Datum
+_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+{
+	Datum		key;
+
+	if (items)
+	{
+		char	   *ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+
+		*items = (ItemPointerData *) ptr;
+	}
+
+	if (a->category != GIN_CAT_NORM_KEY)
+		return (Datum) 0;
+
+	if (a->typbyval)
+	{
+		memcpy(&key, a->data, a->keylen);
+		return key;
+	}
+
+	return PointerGetDatum(a->data);
+}
+
+/*
+ * _gin_compare_tuples
+ *		Compare GIN tuples, used by tuplesort during parallel index build.
+ *
+ * The scalar fields (attrnum, category) are compared first, and the key
+ * value last. The comparisons are done using type-specific sort support
+ * functions.
+ *
+ * If the key value matches, we compare the first TID value in the TID list,
+ * which means the tuples are merged in an order in which they are most
+ * likely to be simply concatenated. (This "first" TID will also allow us
+ * to determine a point up to which the list is fully determined and can be
+ * written into the index to enforce a memory limit etc.)
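+ *
+ * Illustrative example: the tuples effectively sort by (attrnum, category,
+ * key, first TID), so two tuples for the same key 'abc' with first TIDs
+ * (1,1) and (2,1) end up next to each other, in TID order.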
+ */
+int
+_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
+{
+	int			r;
+	Datum		keya,
+				keyb;
+
+	if (a->attrnum < b->attrnum)
+		return -1;
+
+	if (a->attrnum > b->attrnum)
+		return 1;
+
+	if (a->category < b->category)
+		return -1;
+
+	if (a->category > b->category)
+		return 1;
+
+	if (a->category == GIN_CAT_NORM_KEY)
+	{
+		keya = _gin_parse_tuple(a, NULL);
+		keyb = _gin_parse_tuple(b, NULL);
+
+		r = ApplySortComparator(keya, false,
+								keyb, false,
+								&ssup[a->attrnum - 1]);
+
+		/* if the key is the same, consider the first TID in the array */
+		return (r != 0) ? r : ItemPointerCompare(GinTupleGetFirst(a),
+												 GinTupleGetFirst(b));
+	}
+
+	return ItemPointerCompare(GinTupleGetFirst(a),
+							  GinTupleGetFirst(b));
+}
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 1f9e58c4f1f..6b2dd40fa0f 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -53,7 +53,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amclusterable = false;
 	amroutine->ampredlocks = true;
 	amroutine->amcanparallel = false;
-	amroutine->amcanbuildparallel = false;
+	amroutine->amcanbuildparallel = true;
 	amroutine->amcaninclude = false;
 	amroutine->amusemaintenanceworkmem = true;
 	amroutine->amsummarizing = false;
diff --git a/src/backend/access/transam/parallel.c b/src/backend/access/transam/parallel.c
index 4ab5df92133..f6d81d6e1fc 100644
--- a/src/backend/access/transam/parallel.c
+++ b/src/backend/access/transam/parallel.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include "access/brin.h"
+#include "access/gin.h"
 #include "access/nbtree.h"
 #include "access/parallel.h"
 #include "access/session.h"
@@ -148,6 +149,9 @@ static const struct
 	{
 		"_brin_parallel_build_main", _brin_parallel_build_main
 	},
+	{
+		"_gin_parallel_build_main", _gin_parallel_build_main
+	},
 	{
 		"parallel_vacuum_main", parallel_vacuum_main
 	}
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 913c4ef455e..4d3114076b3 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -20,10 +20,12 @@
 #include "postgres.h"
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/hash.h"
 #include "access/htup_details.h"
 #include "access/nbtree.h"
 #include "catalog/index.h"
+#include "catalog/pg_collation.h"
 #include "executor/executor.h"
 #include "pg_trace.h"
 #include "utils/datum.h"
@@ -46,6 +48,8 @@ static void removeabbrev_index(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static void removeabbrev_index_brin(Tuplesortstate *state, SortTuple *stups,
 									int count);
+static void removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups,
+								   int count);
 static void removeabbrev_datum(Tuplesortstate *state, SortTuple *stups,
 							   int count);
 static int	comparetup_heap(const SortTuple *a, const SortTuple *b,
@@ -74,6 +78,8 @@ static int	comparetup_index_hash_tiebreak(const SortTuple *a, const SortTuple *b
 										   Tuplesortstate *state);
 static int	comparetup_index_brin(const SortTuple *a, const SortTuple *b,
 								  Tuplesortstate *state);
+static int	comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+								 Tuplesortstate *state);
 static void writetup_index(Tuplesortstate *state, LogicalTape *tape,
 						   SortTuple *stup);
 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
@@ -82,6 +88,10 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
+static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
+							   SortTuple *stup);
+static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
 							 Tuplesortstate *state);
 static int	comparetup_datum_tiebreak(const SortTuple *a, const SortTuple *b,
@@ -568,6 +578,79 @@ tuplesort_begin_index_brin(int workMem,
 	return state;
 }
 
+Tuplesortstate *
+tuplesort_begin_index_gin(Relation heapRel,
+						  Relation indexRel,
+						  int workMem, SortCoordinate coordinate,
+						  int sortopt)
+{
+	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
+												   sortopt);
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext;
+	int			i;
+	TupleDesc	desc = RelationGetDescr(indexRel);
+
+	oldcontext = MemoryContextSwitchTo(base->maincontext);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "begin index sort: workMem = %d, randomAccess = %c",
+			 workMem,
+			 sortopt & TUPLESORT_RANDOMACCESS ? 't' : 'f');
+#endif
+
+	/*
+	 * Multi-column GIN indexes expand the row into a separate index entry
+	 * for each attribute, and that's what we write into the tuplesort. But
+	 * we still need to initialize sortsupport for all the attributes.
+	 */
+	base->nKeys = IndexRelationGetNumberOfKeyAttributes(indexRel);
+
+	/* Prepare SortSupport data for each column */
+	base->sortKeys = (SortSupport) palloc0(base->nKeys *
+										   sizeof(SortSupportData));
+
+	for (i = 0; i < base->nKeys; i++)
+	{
+		SortSupport sortKey = base->sortKeys + i;
+		Form_pg_attribute att = TupleDescAttr(desc, i);
+		TypeCacheEntry *typentry;
+
+		sortKey->ssup_cxt = CurrentMemoryContext;
+		sortKey->ssup_collation = indexRel->rd_indcollation[i];
+		sortKey->ssup_nulls_first = false;
+		sortKey->ssup_attno = i + 1;
+		sortKey->abbreviate = false;
+
+		Assert(sortKey->ssup_attno != 0);
+
+		if (!OidIsValid(sortKey->ssup_collation))
+			sortKey->ssup_collation = DEFAULT_COLLATION_OID;
+
+		/*
+		 * Look for a ordering for the index key data type, and then the sort
+		 * support function.
+		 *
+		 * XXX does this use the right opckeytype/opcintype for GIN?
+		 */
+		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
+		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
+	}
+
+	base->removeabbrev = removeabbrev_index_gin;
+	base->comparetup = comparetup_index_gin;
+	base->writetup = writetup_index_gin;
+	base->readtup = readtup_index_gin;
+	base->haveDatum1 = false;
+	base->arg = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	return state;
+}
+
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag, int workMem,
@@ -803,6 +886,37 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 	MemoryContextSwitchTo(oldcontext);
 }
 
+void
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+{
+	SortTuple	stup;
+	GinTuple   *ctup;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		tuplen;
+
+	/* copy the GinTuple into the right memory context */
+	ctup = palloc(size);
+	memcpy(ctup, tuple, size);
+
+	stup.tuple = ctup;
+	stup.datum1 = (Datum) 0;
+	stup.isnull1 = false;
+
+	/* GetMemoryChunkSpace is not supported for bump contexts */
+	if (TupleSortUseBumpTupleCxt(base->sortopt))
+		tuplen = MAXALIGN(size);
+	else
+		tuplen = GetMemoryChunkSpace(ctup);
+
+	tuplesort_puttuple_common(state, &stup,
+							  base->sortKeys &&
+							  base->sortKeys->abbrev_converter &&
+							  !stup.isnull1, tuplen);
+
+	MemoryContextSwitchTo(oldcontext);
+}
+
 /*
  * Accept one Datum while collecting input data for sort.
  *
@@ -975,6 +1089,29 @@ tuplesort_getbrintuple(Tuplesortstate *state, Size *len, bool forward)
 	return &btup->tuple;
 }
 
+GinTuple *
+tuplesort_getgintuple(Tuplesortstate *state, Size *len, bool forward)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	MemoryContext oldcontext = MemoryContextSwitchTo(base->sortcontext);
+	SortTuple	stup;
+	GinTuple   *tup;
+
+	if (!tuplesort_gettuple_common(state, forward, &stup))
+		stup.tuple = NULL;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	if (!stup.tuple)
+		return NULL;
+
+	tup = (GinTuple *) stup.tuple;
+
+	*len = tup->tuplen;
+
+	return tup;
+}
+
 /*
  * Fetch the next Datum in either forward or back direction.
  * Returns false if no more datums.
@@ -1763,6 +1900,69 @@ readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = tuple->tuple.bt_blkno;
 }
 
+/*
+ * Routines specialized for GIN case
+ */
+
+static void
+removeabbrev_index_gin(Tuplesortstate *state, SortTuple *stups, int count)
+{
+	Assert(false);
+	elog(ERROR, "removeabbrev_index_gin not implemented");
+}
+
+static int
+comparetup_index_gin(const SortTuple *a, const SortTuple *b,
+					 Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+
+	Assert(!TuplesortstateGetPublic(state)->haveDatum1);
+
+	return _gin_compare_tuples((GinTuple *) a->tuple,
+							   (GinTuple *) b->tuple,
+							   base->sortKeys);
+}
+
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *tuple = (GinTuple *) stup->tuple;
+	unsigned int tuplen = tuple->tuplen;
+
+	tuplen = tuplen + sizeof(tuplen);
+	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
+}
+
+static void
+readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
+				  LogicalTape *tape, unsigned int len)
+{
+	GinTuple   *tuple;
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	unsigned int tuplen = len - sizeof(unsigned int);
+
+	/*
+	 * Allocate space for the GIN sort tuple, which already has the proper
+	 * length included in the header.
+	 */
+	tuple = (GinTuple *) tuplesort_readtup_alloc(state, tuplen);
+
+	tuple->tuplen = tuplen;
+
+	LogicalTapeReadExact(tape, tuple, tuplen);
+	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
+		LogicalTapeReadExact(tape, &tuplen, sizeof(tuplen));
+	stup->tuple = (void *) tuple;
+
+	/* no abbreviations (FIXME maybe use attrnum for this?) */
+	stup->datum1 = (Datum) 0;
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 9ed48dfde4b..2debdac0f43 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -12,6 +12,8 @@
 
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
+#include "nodes/execnodes.h"
+#include "storage/shm_toc.h"
 #include "storage/block.h"
 #include "utils/relcache.h"
 
@@ -88,4 +90,6 @@ extern void ginGetStats(Relation index, GinStatsData *stats);
 extern void ginUpdateStats(Relation index, const GinStatsData *stats,
 						   bool is_build);
 
+extern void _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+
 #endif							/* GIN_H */
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
new file mode 100644
index 00000000000..c8fe1130aa4
--- /dev/null
+++ b/src/include/access/gin_tuple.h
@@ -0,0 +1,50 @@
+/*--------------------------------------------------------------------------
+ * gin_tuple.h
+ *	  Declarations for the GinTuple format used by parallel GIN builds.
+ *
+ *	Copyright (c) 2006-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/access/gin_tuple.h
+ *--------------------------------------------------------------------------
+ */
+#ifndef GIN_TUPLE_H
+#define GIN_TUPLE_H
+
+#include "access/ginblock.h"
+#include "storage/itemptr.h"
+#include "utils/sortsupport.h"
+
+/*
+ * Each worker sees tuples in CTID order, so if we track the first TID and
+ * compare that when combining results in the worker, we would not need to
+ * do an expensive sort in workers (the mergesort is already smart about
+ * detecting this and just concatenating the lists). We'd still need the
+ * full mergesort in the leader, but that's much cheaper.
+ *
+ * XXX do we still need all the fields now that we use SortSupport?
+ */
+typedef struct GinTuple
+{
+	int			tuplen;			/* length of the whole tuple */
+	OffsetNumber attrnum;		/* attnum of index key */
+	uint16		keylen;			/* bytes in data for key value */
+	int16		typlen;			/* typlen for key */
+	bool		typbyval;		/* typbyval for key */
+	signed char category;		/* category: normal or NULL? */
+	int			nitems;			/* number of TIDs in the data */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} GinTuple;
+
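+/*
+ * Return the first TID from the tuple's TID list. The list is stored right
+ * after the (SHORTALIGNed) key. Note: a GinPostingList begins with its
+ * first TID, so the cast works whether the data is a plain ItemPointerData
+ * array or a GinPostingList.
+ */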
+static inline ItemPointer
+GinTupleGetFirst(GinTuple *tup)
+{
+	GinPostingList *list;
+
+	list = (GinPostingList *) SHORTALIGN(tup->data + tup->keylen);
+
+	return &list->first;
+}
+
+extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
+
+#endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index c63f1e5d6da..ef79f259f93 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -22,6 +22,7 @@
 #define TUPLESORT_H
 
 #include "access/brin_tuple.h"
+#include "access/gin_tuple.h"
 #include "access/itup.h"
 #include "executor/tuptable.h"
 #include "storage/dsm.h"
@@ -443,6 +444,10 @@ extern Tuplesortstate *tuplesort_begin_index_gist(Relation heapRel,
 												  int sortopt);
 extern Tuplesortstate *tuplesort_begin_index_brin(int workMem, SortCoordinate coordinate,
 												  int sortopt);
+extern Tuplesortstate *tuplesort_begin_index_gin(Relation heapRel,
+												 Relation indexRel,
+												 int workMem, SortCoordinate coordinate,
+												 int sortopt);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 											 Oid sortOperator, Oid sortCollation,
 											 bool nullsFirstFlag,
@@ -456,6 +461,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
@@ -465,6 +471,8 @@ extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
 extern IndexTuple tuplesort_getindextuple(Tuplesortstate *state, bool forward);
 extern BrinTuple *tuplesort_getbrintuple(Tuplesortstate *state, Size *len,
 										 bool forward);
+extern GinTuple *tuplesort_getgintuple(Tuplesortstate *state, Size *len,
+									   bool forward);
 extern bool tuplesort_getdatum(Tuplesortstate *state, bool forward, bool copy,
 							   Datum *val, bool *isNull, Datum *abbrev);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e3e09a2207e..a884c771fa4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1030,11 +1030,14 @@ GinBtreeData
 GinBtreeDataLeafInsertData
 GinBtreeEntryInsertData
 GinBtreeStack
+GinBuffer
+GinBuildShared
 GinBuildState
 GinChkVal
 GinEntries
 GinEntryAccumulator
 GinIndexStat
+GinLeader
 GinMetaPageData
 GinNullCategory
 GinOptions
@@ -1050,6 +1053,7 @@ GinScanOpaqueData
 GinState
 GinStatsData
 GinTernaryValue
+GinTuple
 GinTupleCollector
 GinVacuumState
 GistBuildMode
-- 
2.48.1

v20250225-0002-cleanup.patch
From ba74bb6adbd52a960ed2d805591da378643c6639 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 24 Feb 2025 23:14:17 +0100
Subject: [PATCH v20250225 2/7] cleanup

---
 src/backend/access/gin/gininsert.c         | 40 +++++-----------------
 src/backend/utils/sort/tuplesortvariants.c |  2 --
 src/include/access/gin_tuple.h             |  8 +----
 3 files changed, 10 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index a23b457bba3..7c2f46b9541 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -145,7 +145,7 @@ typedef struct
 	MemoryContext funcCtx;
 	BuildAccumulator accum;
 	ItemPointerData tid;
-	int				work_mem;
+	int			work_mem;
 
 	/* FIXME likely duplicate with indtuples */
 	double		bs_numtuples;
@@ -566,16 +566,8 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 
 	/*
 	 * If we've maxed out our available memory, dump everything to the
-	 * tuplesort.
-	 *
-	 * XXX It might seem this should set the memory limit to 32MB, same as
-	 * what plan_create_index_workers() uses to calculate the number of
-	 * parallel workers, but that's the limit for tuplesort. So it seems
-	 * better to keep using work_mem here.
-	 *
-	 * XXX But maybe we should calculate this as a per-worker fraction of
-	 * maintenance_work_mem. It's weird to use work_mem here, in a clearly
-	 * maintenance command.
+	 * tuplesort. We use half the per-worker fraction of maintenance_work_mem
+	 * for the accumulator; the other half is used for the tuplesort.
 	 */
 	if (buildstate->accum.allocatedMemory >= buildstate->work_mem * (Size) 1024)
 		ginFlushBuildState(buildstate, index);
@@ -607,11 +599,7 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	memset(&buildstate.buildStats, 0, sizeof(GinStatsData));
 
-	/*
-	 * Initialize all the fields, not to trip valgrind.
-	 *
-	 * XXX Maybe there should be an "init" function for build state?
-	 */
+	/* Initialize fields for parallel build too. */
 	buildstate.bs_numtuples = 0;
 	buildstate.bs_reltuples = 0;
 	buildstate.bs_leader = NULL;
@@ -1195,9 +1183,8 @@ AssertCheckItemPointers(GinBuffer *buffer)
 /*
  * GinBuffer checks
  *
- * XXX Maybe it would be better to have AssertCheckGinBuffer with flags, instead
- * of having to call AssertCheckItemPointers in some places, if we require the
- * items to not be empty?
+ * Make sure the nitems/items fields are consistent (either the array is empty
+ * or not empty, the fields need to agree). If there are items, check ordering.
  */
 static void
 AssertCheckGinBuffer(GinBuffer *buffer)
@@ -1415,11 +1402,6 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 /*
  * GinBufferReset
  *		Reset the buffer into a state as if it contains no data.
- *
- * XXX Should we do something if the array of TIDs gets too large? It may
- * grow too much, and we'll not free it until the worker finishes building.
- * But it's better to not let the array grow arbitrarily large, and enforce
- * work_mem as memory limit by flushing the buffer into the tuplesort.
  */
 static void
 GinBufferReset(GinBuffer *buffer)
@@ -1468,11 +1450,6 @@ GinBufferFree(GinBuffer *buffer)
  *		Check if a given GIN tuple can be added to the current buffer.
  *
  * Returns true if the buffer is either empty or for the same index key.
- *
- * XXX This could / should also enforce a memory limit by checking the size of
- * the TID array, and returning false if it's too large (more than work_mem,
- * for example). But in the leader we need to be careful not to force flushing
- * data too early, which might break the monotonicity of the TID list.
  */
 static bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
@@ -1987,8 +1964,9 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
-	 * XXX palloc0 so that valgrind does not complain about uninitialized
-	 * bytes in writetup_index_gin, likely because of padding
+	 * The palloc0 is needed - writetup_index_gin will write the whole tuple
+	 * to disk, so we need to make sure the padding bytes are defined
+	 * (otherwise valgrind would report this).
 	 */
 	tuple = palloc0(tuplen);
 
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 4d3114076b3..eb8601e2257 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -632,8 +632,6 @@ tuplesort_begin_index_gin(Relation heapRel,
 		/*
 		 * Look up an ordering operator for the index key data type, and
 		 * use it to initialize the sort support function.
-		 *
-		 * XXX does this use the right opckeytype/opcintype for GIN?
 		 */
 		typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
 		PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index c8fe1130aa4..ce555031335 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -15,13 +15,7 @@
 #include "utils/sortsupport.h"
 
 /*
- * Each worker sees tuples in CTID order, so if we track the first TID and
- * compare that when combining results in the worker, we would not need to
- * do an expensive sort in workers (the mergesort is already smart about
- * detecting this and just concatenating the lists). We'd still need the
- * full mergesort in the leader, but that's much cheaper.
- *
- * XXX do we still need all the fields now that we use SortSupport?
+ * Data for one key in a GIN index.
  */
 typedef struct GinTuple
 {
-- 
2.48.1

v20250225-0003-progress.patch
From 8df37528e0ccab135fad3a9182fb448183a413ee Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 25 Feb 2025 15:40:59 +0100
Subject: [PATCH v20250225 3/7] progress

---
 src/backend/access/gin/gininsert.c | 61 ++++++++++++++++++++++++++++--
 src/backend/access/gin/ginutil.c   | 28 +++++++++++++-
 src/include/access/gin.h           | 11 ++++++
 src/include/access/gin_private.h   |  1 +
 4 files changed, 97 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 7c2f46b9541..7286432698e 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -21,6 +21,7 @@
 #include "access/xloginsert.h"
 #include "catalog/index.h"
 #include "catalog/pg_collation.h"
+#include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
@@ -644,6 +645,10 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.accum.ginstate = &buildstate.ginstate;
 	ginInitBA(&buildstate.accum);
 
+	/* Report table scan phase started */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
+								 PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN);
+
 	/*
 	 * Attempt to launch parallel worker scan when required
 	 *
@@ -1481,15 +1486,42 @@ _gin_parallel_merge(GinBuildState *state)
 	double		reltuples = 0;
 	GinBuffer  *buffer;
 
+	/* GIN tuples from workers, merged by leader */
+	double		numtuples = 0;
+
 	/* wait for workers to scan table and produce partial results */
 	reltuples = _gin_parallel_heapscan(state);
 
+	/* Execute the sort */
+	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
+								 PROGRESS_GIN_PHASE_PERFORMSORT_2);
+
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
 	/* initialize buffer to combine entries for the same key */
 	buffer = GinBufferInit(state->ginstate.index);
 
+	/*
+	 * Set the progress target for the next phase.  Reset the block number
+	 * values set by table_index_build_scan
+	 */
+	{
+		const int	progress_index[] = {
+			PROGRESS_CREATEIDX_SUBPHASE,
+			PROGRESS_CREATEIDX_TUPLES_TOTAL,
+			PROGRESS_SCAN_BLOCKS_TOTAL,
+			PROGRESS_SCAN_BLOCKS_DONE
+		};
+		const int64 progress_vals[] = {
+			PROGRESS_GIN_PHASE_MERGE_2,
+			state->bs_numtuples,
+			0, 0
+		};
+
+		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
+	}
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by category and
 	 * key. That probably gives us an order matching how the data is organized
@@ -1530,6 +1562,10 @@ _gin_parallel_merge(GinBuildState *state)
 		 * or append it to the existing data).
 		 */
 		GinBufferStoreTuple(buffer, tup);
+
+		/* Report progress */
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+									 ++numtuples);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
@@ -1543,6 +1579,10 @@ _gin_parallel_merge(GinBuildState *state)
 
 		/* discard the existing data */
 		GinBufferReset(buffer);
+
+		/* Report progress */
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+									 ++numtuples);
 	}
 
 	/* release all the memory */
@@ -1583,7 +1623,8 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 
 	/* Perform work common to all participants */
 	_gin_parallel_scan_and_build(buildstate, ginleader->ginshared,
-								 ginleader->sharedsort, heap, index, sortmem, true);
+								 ginleader->sharedsort, heap, index,
+								 sortmem, true);
 }
 
 /*
@@ -1601,7 +1642,8 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
  * do a very limited number of mergesorts, which is good.
  */
 static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
+_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
+						 bool progress)
 {
 	GinTuple   *tup;
 	Size		tuplen;
@@ -1612,8 +1654,19 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/* sort the raw per-worker data */
+	if (progress)
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
+									 PROGRESS_GIN_PHASE_PERFORMSORT_1);
+
 	tuplesort_performsort(state->bs_worker_sort);
 
+	/* reset the number of GIN tuples produced by this worker */
+	state->bs_numtuples = 0;
+
+	if (progress)
+		pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
+									 PROGRESS_GIN_PHASE_MERGE_1);
+
 	/*
 	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
 	 * merge them into larger chunks for the leader to combine.
@@ -1645,6 +1698,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 									buffer->items, buffer->nitems, &ntuplen);
 
 			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+			state->bs_numtuples++;
 
 			pfree(ntup);
 
@@ -1672,6 +1726,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort)
 								buffer->items, buffer->nitems, &ntuplen);
 
 		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+		state->bs_numtuples++;
 
 		pfree(ntup);
 
@@ -1775,7 +1830,7 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	 * the callback, and combine them into much larger chunks and place that
 	 * into the shared tuplestore for leader to process.
 	 */
-	_gin_process_worker_data(state, state->bs_worker_sort);
+	_gin_process_worker_data(state, state->bs_worker_sort, progress);
 
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6b2dd40fa0f..a61532538c0 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -20,6 +20,7 @@
 #include "access/xloginsert.h"
 #include "catalog/pg_collation.h"
 #include "catalog/pg_type.h"
+#include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
@@ -72,7 +73,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->amgettreeheight = NULL;
 	amroutine->amoptions = ginoptions;
 	amroutine->amproperty = NULL;
-	amroutine->ambuildphasename = NULL;
+	amroutine->ambuildphasename = ginbuildphasename;
 	amroutine->amvalidate = ginvalidate;
 	amroutine->amadjustmembers = ginadjustmembers;
 	amroutine->ambeginscan = ginbeginscan;
@@ -700,3 +701,28 @@ ginUpdateStats(Relation index, const GinStatsData *stats, bool is_build)
 
 	END_CRIT_SECTION();
 }
+
+/*
+ *	ginbuildphasename() -- Return name of index build phase.
+ */
+char *
+ginbuildphasename(int64 phasenum)
+{
+	switch (phasenum)
+	{
+		case PROGRESS_CREATEIDX_SUBPHASE_INITIALIZE:
+			return "initializing";
+		case PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN:
+			return "scanning table";
+		case PROGRESS_GIN_PHASE_PERFORMSORT_1:
+			return "sorting tuples (workers)";
+		case PROGRESS_GIN_PHASE_MERGE_1:
+			return "merging tuples (workers)";
+		case PROGRESS_GIN_PHASE_PERFORMSORT_2:
+			return "sorting tuples";
+		case PROGRESS_GIN_PHASE_MERGE_2:
+			return "merging tuples";
+		default:
+			return NULL;
+	}
+}
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index 2debdac0f43..2e1076a0499 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -38,6 +38,17 @@
 #define GIN_SEARCH_MODE_ALL				2
 #define GIN_SEARCH_MODE_EVERYTHING		3	/* for internal use only */
 
+/*
+ * Constant definitions for progress reporting.  Phase numbers must match
+ * ginbuildphasename.
+ */
+/* PROGRESS_CREATEIDX_SUBPHASE_INITIALIZE is 1 (see progress.h) */
+#define PROGRESS_GIN_PHASE_INDEXBUILD_TABLESCAN		2
+#define PROGRESS_GIN_PHASE_PERFORMSORT_1			3
+#define PROGRESS_GIN_PHASE_MERGE_1					4
+#define PROGRESS_GIN_PHASE_PERFORMSORT_2			5
+#define PROGRESS_GIN_PHASE_MERGE_2					6
+
 /*
  * GinStatsData represents stats data for planner use
  */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 50478db9820..95d8805b66f 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -109,6 +109,7 @@ extern Datum *ginExtractEntries(GinState *ginstate, OffsetNumber attnum,
 extern OffsetNumber gintuple_get_attrnum(GinState *ginstate, IndexTuple tuple);
 extern Datum gintuple_get_key(GinState *ginstate, IndexTuple tuple,
 							  GinNullCategory *category);
+extern char *ginbuildphasename(int64 phasenum);
 
 /* gininsert.c */
 extern IndexBuildResult *ginbuild(Relation heap, Relation index,
-- 
2.48.1
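
The subphase constants surface through the ambuildphasename hook wired
up in ginhandler() above; pg_stat_progress_create_index resolves the
reported number to text via pg_indexam_progress_phasename(). A tiny
illustration of the mapping:

    /* resolve a GIN build subphase number to its display text */
    const char *phase = ginbuildphasename(PROGRESS_GIN_PHASE_MERGE_1);
    /* phase == "merging tuples (workers)" */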

Attachment: v20250225-0004-Compress-TID-lists-when-writing-GIN-tuples.patch (text/x-patch; charset=UTF-8)
From d3d06e1c5e34a20d2fd0cc110266556e6b31c59f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:01:43 +0100
Subject: [PATCH v20250225 4/7] Compress TID lists when writing GIN tuples to
 disk

When serializing GIN tuples to tuplesorts during parallel index builds,
we can significantly reduce the amount of data by compressing the TID
lists. The GIN opclasses may produce a lot of data (depending on how
many keys are extracted from each row), and the TID compression is very
efficient and effective.

If the number of distinct keys is high, the first worker pass (reading
data from the table and writing them into a private tuplesort) may not
benefit from the compression very much. It is likely to spill data to
disk before the TID lists get long enough for the compression to help.
The second pass (writing the merged data into the shared tuplesort) is
more likely to benefit from compression.

The compression can be seen as a way to reduce the amount of disk space
needed by the parallel builds, because the data is written twice - first
into the per-worker tuplesorts, then into the shared tuplesort.

Author: Tomas Vondra
Reviewed-by: Matthias van de Meent
Discussion:
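
The compress/decode pair this relies on already exists in GIN
(ginCompressPostingList / ginPostingListDecodeAllSegments); the diff
below loops over the input, producing as many segments as needed, and
copies them into the GinTuple. A minimal single-segment sketch of the
round trip (tidlist_roundtrip is only an illustrative name; 'items' is
assumed sorted):

    #include "postgres.h"
    #include "access/gin_private.h"

    static void
    tidlist_roundtrip(ItemPointerData *items, int nitems)
    {
        GinPostingList *seg;
        ItemPointer decoded;
        int         nwritten;
        int         ndecoded;

        /* compress as many TIDs as fit into one UINT16_MAX-byte segment */
        seg = ginCompressPostingList(items, nitems, UINT16_MAX, &nwritten);

        /* decode the segment back into a palloc'd TID array */
        decoded = ginPostingListDecodeAllSegments(seg,
                                                  SizeOfGinPostingList(seg),
                                                  &ndecoded);

        /* the TIDs survive the round trip */
        Assert(ndecoded == nwritten);

        pfree(decoded);
        pfree(seg);
    }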
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 7286432698e..a10266b39e1 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -191,7 +191,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1366,7 +1368,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1402,6 +1405,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /*
@@ -1956,6 +1962,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1968,6 +1983,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
  * like endianess etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized as compressed - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1981,6 +2001,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -2007,12 +2032,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "unexpected typlen value (%d)", typlen);
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * only SHORTALIGN).
 	 */
-	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -2062,37 +2109,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible, the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -2105,6 +2155,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array of TIDs decompressed from the tuple.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -2140,8 +2212,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 
 	if (a->category == GIN_CAT_NORM_KEY)
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		r = ApplySortComparator(keya, false,
 								keyb, false,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a884c771fa4..c34358d20bc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1050,6 +1050,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinState
 GinStatsData
 GinTernaryValue
-- 
2.48.1

Attachment: v20250225-0006-Use-a-single-GIN-tuplesort.patch (text/x-patch; charset=UTF-8)
From d9f6e45e906205b36269a4a635f06377e3e83639 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 25 Feb 2025 16:12:37 +0100
Subject: [PATCH v20250225 6/7] Use a single GIN tuplesort

The previous approach was to sort the data in a private per-worker
sort, read it back, merge the GinTuples, and write them into the
shared sort for the leader to process.

The new approach is to use a single sort, merging tuples as we write
them to disk.  This reduces temporary disk space.

An optimization was added to GinBuffer: we don't deserialize tuples
unless we need access to their item pointers.

This modifies tuplesort to add a new flushwrites callback. A sort's
writetup can now decide to buffer writes until the next flushwrites()
callback.
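
The heart of the change is the writetup_index_gin() /
flushwrites_index_gin() pair in the diff below. A distilled sketch of
the contract, using the GinBuffer API this patch exports from
gin_tuple.h (write_or_merge and flush_pending are illustrative names;
the tape I/O is elided):

    #include "postgres.h"
    #include "access/gin_tuple.h"

    /* absorb equal-key tuples; emit a merged tuple when the key changes */
    static void
    write_or_merge(GinBuffer *buffer, GinTuple *tup)
    {
        if (!GinBufferCanAddKey(buffer, tup))
        {
            /* key changed: build the merged tuple for the previous key */
            GinTuple   *out = GinBufferBuildTuple(buffer);

            /* ... write 'out' to the tape ... */
            pfree(out);
        }

        /* an empty (just-reset) buffer accepts any key */
        GinBufferMergeTuple(buffer, tup);
    }

    /* called through the new flushwrites hook at the end of each run */
    static void
    flush_pending(GinBuffer *buffer)
    {
        if (!GinBufferIsEmpty(buffer))
        {
            GinTuple   *out = GinBufferBuildTuple(buffer);

            /* ... write 'out' to the tape ... */
            pfree(out);
        }
    }

Note that tuplesort_putgintuple() also loses its separate Size
argument, since GinTuple carries tuplen itself.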
---
 src/backend/access/gin/gininsert.c         | 411 +++++++++------------
 src/backend/utils/sort/tuplesort.c         |   5 +
 src/backend/utils/sort/tuplesortvariants.c | 102 ++++-
 src/include/access/gin_private.h           |   3 +
 src/include/access/gin_tuple.h             |  10 +
 src/include/utils/tuplesort.h              |  10 +-
 src/tools/pgindent/typedefs.list           |   1 +
 7 files changed, 302 insertions(+), 240 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e0938a71112..ba03c81b729 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -165,14 +165,6 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
-
-	/*
-	 * The sortstate used only within a single worker for the first merge pass
-	 * happenning there. In principle it doesn't need to be part of the build
-	 * state and we could pass it around directly, but it's more convenient
-	 * this way. And it's part of the build state, after all.
-	 */
-	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -196,8 +188,7 @@ static Datum _gin_parse_tuple_key(GinTuple *a);
 
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
-								  ItemPointerData *items, uint32 nitems,
-								  Size *len);
+								  ItemPointerData *items, uint32 nitems);
 
 /*
  * Adds array of item pointers to tuple's posting list, or
@@ -500,16 +491,15 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
-		Size		tuplen;
 
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
 		tup = _gin_build_tuple(attnum, category,
 							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+		tuplesort_putgintuple(buildstate->bs_sortstate, tup);
 
 		pfree(tup);
 	}
@@ -1145,8 +1135,14 @@ _gin_parallel_heapscan(GinBuildState *state)
  * during the initial table scan (and detecting when the scan wraps around),
  * and during merging (where we do mergesort).
  */
-typedef struct GinBuffer
+struct GinBuffer
 {
+	/*
+	 * The memory context holds the dynamic allocation of items, key, and any
+	 * produced GinTuples.
+	 */
+	MemoryContext context;
+	GinTuple   *cached;			/* copy of previous GIN tuple */
 	OffsetNumber attnum;
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
@@ -1164,7 +1160,7 @@ typedef struct GinBuffer
 	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
-} GinBuffer;
+};
 
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
@@ -1175,8 +1171,7 @@ AssertCheckItemPointers(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
+	Assert(buffer->nitems == 0 || buffer->items != NULL);
 
 	for (int i = 0; i < buffer->nitems; i++)
 	{
@@ -1202,7 +1197,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* if we have any items, the array must exist */
-	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+	Assert((buffer->nitems == 0) || (buffer->items != NULL));
 
 	/*
 	 * The buffer may be empty, in which case we must not call the check of
@@ -1226,7 +1221,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
  *
  * Initializes sort support procedures for all index attributes.
  */
-static GinBuffer *
+GinBuffer *
 GinBufferInit(Relation index)
 {
 	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
@@ -1289,15 +1284,18 @@ GinBufferInit(Relation index)
 
 		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
 	}
+	buffer->context = GenerationContextCreate(CurrentMemoryContext,
+											  "Gin Buffer",
+											  ALLOCSET_DEFAULT_SIZES);
 
 	return buffer;
 }
 
 /* Is the buffer empty, i.e. has no TID values in the array? */
-static bool
+bool
 GinBufferIsEmpty(GinBuffer *buffer)
 {
-	return (buffer->nitems == 0);
+	return (buffer->nitems == 0 && buffer->cached == NULL);
 }
 
 /*
@@ -1313,37 +1311,71 @@ GinBufferIsEmpty(GinBuffer *buffer)
 static bool
 GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	int			r;
+	AttrNumber	attnum;
 	Datum		tupkey;
+	Datum		bufkey;
 
 	AssertCheckGinBuffer(buffer);
+	if (buffer->cached)
+	{
+		GinTuple   *cached = buffer->cached;
 
-	if (tup->attrnum != buffer->attnum)
-		return false;
+		if (tup->attrnum != cached->attrnum)
+			return false;
 
-	/* same attribute should have the same type info */
-	Assert(tup->typbyval == buffer->typbyval);
-	Assert(tup->typlen == buffer->typlen);
+		Assert(tup->typbyval == cached->typbyval);
+		Assert(tup->typlen == cached->typlen);
 
-	if (tup->category != buffer->category)
-		return false;
+		if (tup->category != cached->category)
+			return false;
 
-	/*
-	 * For NULL/empty keys, this means equality, for normal keys we need to
-	 * compare the actual key value.
-	 */
-	if (buffer->category != GIN_CAT_NORM_KEY)
-		return true;
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (cached->category != GIN_CAT_NORM_KEY)
+			return true;
+
+		attnum = cached->attrnum;
+		bufkey = _gin_parse_tuple_key(cached);
+	}
+	else
+	{
+		if (tup->attrnum != buffer->attnum)
+			return false;
+
+		/* same attribute should have the same type info */
+		Assert(tup->typbyval == buffer->typbyval);
+		Assert(tup->typlen == buffer->typlen);
+
+		if (tup->category != buffer->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (buffer->category != GIN_CAT_NORM_KEY)
+			return true;
+		attnum = buffer->attnum;
+		bufkey = buffer->key;
+	}
 
 	/*
 	 * For the tuple, get either the first sizeof(Datum) bytes for byval
 	 * types, or a pointer to the beginning of the data array.
 	 */
-	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+	tupkey = _gin_parse_tuple_key(tup);
+
+	prev = MemoryContextSwitchTo(buffer->context);
 
-	r = ApplySortComparator(buffer->key, false,
+	r = ApplySortComparator(bufkey, false,
 							tupkey, false,
-							&buffer->ssup[buffer->attnum - 1]);
+							&buffer->ssup[attnum - 1]);
+
+	MemoryContextSwitchTo(prev);
 
 	return (r == 0);
 }
@@ -1390,6 +1422,56 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 	return true;
 }
 
+static void
+GinBufferUnpackCached(GinBuffer *buffer, int reserve_space)
+{
+	Datum		key;
+	ItemPointer items;
+	GinTuple   *cached;
+	int			totitems;
+
+	cached = buffer->cached;
+	totitems = cached->nitems + reserve_space;
+	key = _gin_parse_tuple_key(cached);
+
+	buffer->category = cached->category;
+	buffer->keylen = cached->keylen;
+	buffer->attnum = cached->attrnum;
+
+	buffer->typlen = cached->typlen;
+	buffer->typbyval = cached->typbyval;
+
+	if (cached->category == GIN_CAT_NORM_KEY)
+		buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+	else
+		buffer->key = (Datum) 0;
+
+	items = _gin_parse_tuple_items(cached);
+
+	if (buffer->items == NULL)
+	{
+		buffer->items = palloc0(totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else if (buffer->maxitems < totitems)
+	{
+		buffer->items = repalloc(buffer->items,
+								 totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else
+	{
+		Assert(PointerIsValid(buffer->items) &&
+			   buffer->maxitems >= totitems);
+	}
+	memcpy(buffer->items, items, cached->nitems * sizeof(ItemPointerData));
+	buffer->nitems = cached->nitems;
+
+	buffer->cached = NULL;
+	pfree(cached);
+	pfree(items);
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1413,32 +1495,29 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * workers. But the workers merge the items as much as possible, so there
  * should not be too many.
  */
-static void
-GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+void
+GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	ItemPointerData *items;
-	Datum		key;
 
+	prev = MemoryContextSwitchTo(buffer->context);
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple_key(tup);
-	items = _gin_parse_tuple_items(tup);
-
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
 	{
-		buffer->category = tup->category;
-		buffer->keylen = tup->keylen;
-		buffer->attnum = tup->attrnum;
-
-		buffer->typlen = tup->typlen;
-		buffer->typbyval = tup->typbyval;
+		GinTuple   *tuple = palloc(tup->tuplen);
 
-		if (tup->category == GIN_CAT_NORM_KEY)
-			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
-		else
-			buffer->key = (Datum) 0;
+		memcpy(tuple, tup, tup->tuplen);
+		buffer->cached = tuple;
 	}
+	else if (buffer->cached != NULL)
+	{
+		GinBufferUnpackCached(buffer, tup->nitems);
+	}
+
+	items = _gin_parse_tuple_items(tup);
 
 	/*
 	 * Try freeze TIDs at the beginning of the list, i.e. exclude them from
@@ -1516,20 +1595,54 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	/* free the decompressed TID list */
 	pfree(items);
+
+	MemoryContextSwitchTo(prev);
+}
+
+GinTuple *
+GinBufferBuildTuple(GinBuffer *buffer)
+{
+	MemoryContext prev = MemoryContextSwitchTo(buffer->context);
+	GinTuple   *result;
+
+	if (buffer->cached)
+	{
+		result = buffer->cached;
+		buffer->cached = NULL;
+	}
+	else
+	{
+		result = _gin_build_tuple(buffer->attnum, buffer->category,
+								  buffer->key, buffer->typlen,
+								  buffer->typbyval, buffer->items,
+								  buffer->nitems);
+	}
+
+	GinBufferReset(buffer);
+
+	MemoryContextSwitchTo(prev);
+	return result;
 }
 
 /*
  * GinBufferReset
  *		Reset the buffer into a state as if it contains no data.
  */
-static void
+void
 GinBufferReset(GinBuffer *buffer)
 {
 	Assert(!GinBufferIsEmpty(buffer));
 
-	/* release byref values, do nothing for by-val ones */
-	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
-		pfree(DatumGetPointer(buffer->key));
+	/* release cached buffer tuple, if present */
+	if (buffer->cached)
+		pfree(buffer->cached);
+	else
+	{
+		/* release byref values, do nothing for by-val ones */
+		if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval
+			&& PointerIsValid(DatumGetPointer(buffer->key)))
+			pfree(DatumGetPointer(buffer->key));
+	}
 
 	/*
 	 * Not required, but makes it more likely to trigger NULL derefefence if
@@ -1545,6 +1658,7 @@ GinBufferReset(GinBuffer *buffer)
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
+	/* Note that we deliberately don't reset the memory context here */
 }
 
 /*
@@ -1568,7 +1682,7 @@ GinBufferTrim(GinBuffer *buffer)
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
  */
-static void
+void
 GinBufferFree(GinBuffer *buffer)
 {
 	if (buffer->items)
@@ -1579,6 +1693,7 @@ GinBufferFree(GinBuffer *buffer)
 		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
 		pfree(DatumGetPointer(buffer->key));
 
+	MemoryContextDelete(buffer->context);
 	pfree(buffer);
 }
 
@@ -1588,7 +1703,7 @@ GinBufferFree(GinBuffer *buffer)
  *
  * Returns true if the buffer is either empty or for the same index key.
  */
-static bool
+bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 {
 	/* empty buffer can accept data for any key */
@@ -1685,6 +1800,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1711,6 +1827,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1724,7 +1841,10 @@ _gin_parallel_merge(GinBuildState *state)
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
 
 		/* Report progress */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
@@ -1735,6 +1855,7 @@ _gin_parallel_merge(GinBuildState *state)
 	if (!GinBufferIsEmpty(buffer))
 	{
 		AssertCheckItemPointers(buffer);
+		Assert(!PointerIsValid(buffer->cached));
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1790,158 +1911,6 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 sortmem, true);
 }
 
-/*
- * _gin_process_worker_data
- *		First phase of the key merging, happening in the worker.
- *
- * Depending on the number of distinct keys, the TID lists produced by the
- * callback may be very short (due to frequent evictions in the callback).
- * But combining many tiny lists is expensive, so we try to do as much as
- * possible in the workers and only then pass the results to the leader.
- *
- * We read the tuples sorted by the key, and merge them into larger lists.
- * At the moment there's no memory limit, so this will just produce one
- * huge (sorted) list per key in each worker. Which means the leader will
- * do a very limited number of mergesorts, which is good.
- */
-static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
-						 bool progress)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-
-	GinBuffer  *buffer;
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The workers are limited to the same amount of memory as during the sort
-	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
-	 * during planning, just like there.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/* sort the raw per-worker data */
-	if (progress)
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
-									 PROGRESS_GIN_PHASE_PERFORMSORT_1);
-
-	tuplesort_performsort(state->bs_worker_sort);
-
-	/* reset the number of GIN tuples produced by this worker */
-	state->bs_numtuples = 0;
-
-	if (progress)
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
-									 PROGRESS_GIN_PHASE_MERGE_1);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
-	 * merge them into larger chunks for the leader to combine.
-	 */
-	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
-	{
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-			state->bs_numtuples++;
-
-			pfree(ntup);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferStoreTuple(buffer, tup);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		GinTuple   *ntup;
-		Size		ntuplen;
-
-		AssertCheckItemPointers(buffer);
-
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems, &ntuplen);
-
-		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-		state->bs_numtuples++;
-
-		pfree(ntup);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(worker_sort);
-}
-
 /*
  * Perform a worker's portion of a parallel GIN index build sort.
  *
@@ -2008,12 +1977,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													coordinate,
 													TUPLESORT_NONE);
 
-	/* Local per-worker sort of raw-data */
-	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
-													  state->work_mem,
-													  NULL,
-													  TUPLESORT_NONE);
-
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -2027,13 +1990,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* write remaining accumulated entries */
 	ginFlushBuildState(state, index);
 
-	/*
-	 * Do the first phase of in-worker processing - sort the data produced by
-	 * the callback, and combine them into much larger chunks and place that
-	 * into the shared tuplestore for leader to process.
-	 */
-	_gin_process_worker_data(state, state->bs_worker_sort, progress);
-
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
@@ -2188,8 +2144,7 @@ typedef struct
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
-				 ItemPointerData *items, uint32 nitems,
-				 Size *len)
+				 ItemPointerData *items, uint32 nitems)
 {
 	GinTuple   *tuple;
 	char	   *ptr;
@@ -2257,8 +2212,6 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	 */
 	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
-	*len = tuplen;
-
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2ef32d53a43..7f346325678 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -395,6 +395,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2244,6 +2245,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2369,6 +2372,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb8601e2257..a106cc79efd 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "access/gin.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -90,6 +91,7 @@ static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
 static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
 							   SortTuple *stup);
+static void flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape);
 static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -101,6 +103,7 @@ static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static void freestate_cluster(Tuplesortstate *state);
+static void freestate_index_gin(Tuplesortstate *state);
 
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the CLUSTER case.  Set by
@@ -135,6 +138,16 @@ typedef struct
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
 } TuplesortIndexBTreeArg;
 
+/*
+ * Data structure pointed by "TuplesortPublic.arg" for the index_gin subcase.
+ */
+typedef struct
+{
+	TuplesortIndexArg index;
+	GinBuffer  *buffer;
+} TuplesortIndexGinArg;
+
+
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the index_hash subcase.
  */
@@ -209,6 +222,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -285,6 +299,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -393,6 +408,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -472,6 +488,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -516,6 +533,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -571,6 +589,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -587,6 +606,7 @@ tuplesort_begin_index_gin(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   sortopt);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg;
 	MemoryContext oldcontext;
 	int			i;
 	TupleDesc	desc = RelationGetDescr(indexRel);
@@ -611,6 +631,10 @@ tuplesort_begin_index_gin(Relation heapRel,
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
 										   sizeof(SortSupportData));
+	arg = palloc0(sizeof(TuplesortIndexGinArg));
+	arg->index.indexRel = indexRel;
+	arg->index.heapRel = heapRel;
+	arg->buffer = GinBufferInit(indexRel);
 
 	for (i = 0; i < base->nKeys; i++)
 	{
@@ -640,9 +664,11 @@ tuplesort_begin_index_gin(Relation heapRel,
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
 	base->writetup = writetup_index_gin;
+	base->flushwrites = flushwrites_index_gin;
 	base->readtup = readtup_index_gin;
+	base->freestate = freestate_index_gin;
 	base->haveDatum1 = false;
-	base->arg = NULL;
+	base->arg = arg;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -683,6 +709,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -885,17 +912,17 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 }
 
 void
-tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple)
 {
 	SortTuple	stup;
 	GinTuple   *ctup;
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
-	Size		tuplen;
+	Size		tuplen = tuple->tuplen;
 
 	/* copy the GinTuple into the right memory context */
-	ctup = palloc(size);
-	memcpy(ctup, tuple, size);
+	ctup = palloc(tuplen);
+	memcpy(ctup, tuple, tuplen);
 
 	stup.tuple = ctup;
 	stup.datum1 = (Datum) 0;
@@ -903,7 +930,7 @@ tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
 
 	/* GetMemoryChunkSpace is not supported for bump contexts */
 	if (TupleSortUseBumpTupleCxt(base->sortopt))
-		tuplen = MAXALIGN(size);
+		tuplen = MAXALIGN(tuplen);
 	else
 		tuplen = GetMemoryChunkSpace(ctup);
 
@@ -1923,19 +1950,63 @@ comparetup_index_gin(const SortTuple *a, const SortTuple *b,
 }
 
 static void
-writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+_writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, GinTuple *tup)
 {
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
-	GinTuple   *tuple = (GinTuple *) stup->tuple;
-	unsigned int tuplen = tuple->tuplen;
+	unsigned int tuplen = tup->tuplen;
 
 	tuplen = tuplen + sizeof(tuplen);
+
 	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
-	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	LogicalTapeWrite(tape, tup, tup->tuplen);
+
 	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 }
 
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *otup;
+	GinTuple   *ntup = (GinTuple *) stup->tuple;
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(PointerIsValid(arg));
+
+	if (GinBufferCanAddKey(arg->buffer, ntup))
+	{
+		GinBufferMergeTuple(arg->buffer, ntup);
+		return;
+	}
+
+	otup = GinBufferBuildTuple(arg->buffer);
+
+	_writetup_index_gin(state, tape, otup);
+
+	pfree(otup);
+
+	Assert(GinBufferCanAddKey(arg->buffer, ntup));
+
+	GinBufferMergeTuple(arg->buffer, ntup);
+}
+
+static void
+flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	if (!GinBufferIsEmpty(arg->buffer))
+	{
+		GinTuple   *tuple = GinBufferBuildTuple(arg->buffer);
+
+		_writetup_index_gin(state, tape, tuple);
+		pfree(tuple);
+		Assert(GinBufferIsEmpty(arg->buffer));
+	}
+}
+
 static void
 readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 				  LogicalTape *tape, unsigned int len)
@@ -1961,6 +2032,17 @@ readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = (Datum) 0;
 }
 
+static void
+freestate_index_gin(Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(arg != NULL);
+	Assert(GinBufferIsEmpty(arg->buffer));
+	GinBufferFree(arg->buffer);
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 95d8805b66f..da4351c3d3d 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -478,6 +478,9 @@ extern int	ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len, TI
 
 extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *segment, int len,
 												   int *ndecoded_out);
+extern bool ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+												ItemPointer into, int capacity,
+												int *ndecoded_out);
 extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded_out);
 extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
 										ItemPointerData *b, uint32 nb,
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index ce555031335..4de7b5c32b5 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -39,6 +39,16 @@ GinTupleGetFirst(GinTuple *tup)
 	return &list->first;
 }
 
+typedef struct GinBuffer GinBuffer;
+
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
 
+extern GinBuffer *GinBufferInit(Relation index);
+extern bool GinBufferIsEmpty(GinBuffer *buffer);
+extern bool GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup);
+extern void GinBufferReset(GinBuffer *buffer);
+extern void GinBufferFree(GinBuffer *buffer);
+extern void GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup);
+extern GinTuple *GinBufferBuildTuple(GinBuffer *buffer);
+
 #endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..64176b23cbe 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -195,6 +195,14 @@ typedef struct
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
 
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient use of
+	 * the tape's resources, e.g. when deduplicating or merging values.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
+
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
 	 * the already-read length of the stored tuple.  The tuple is allocated
@@ -461,7 +469,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
-extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, struct GinTuple *tuple);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c34358d20bc..8cd09651cb3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3032,6 +3032,7 @@ TuplesortClusterArg
 TuplesortDatumArg
 TuplesortIndexArg
 TuplesortIndexBTreeArg
+TuplesortIndexGinArg
 TuplesortIndexHashArg
 TuplesortInstrumentation
 TuplesortMethod
-- 
2.48.1

Attachment: v20250225-0007-WIP-parallel-inserts-into-GIN-index.patch (text/x-patch; charset=UTF-8)
From 70f8f21a70f77d81b19f13cb073d7a2e75385137 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 25 Feb 2025 16:16:24 +0100
Subject: [PATCH v20250225 7/7] WIP: parallel inserts into GIN index
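
The diff implements a three-phase choreography around a Barrier: every
participant scans and sorts, the leader partitions the sorted stream
into per-participant BufFiles, and each participant then inserts its
own partition. A hedged sketch of the sequence a participant runs
(gin_build_participant is an illustrative name; the Barrier calls,
wait events and _gin_* functions are the ones from the diff):

    #include "postgres.h"
    #include "storage/barrier.h"

    static void
    gin_build_participant(GinBuildShared *shared, bool is_leader)
    {
        /* phase 1: parallel heap scan, feeding the shared tuplesort */
        /* ... _gin_parallel_scan_and_build() ... */

        /* wait until all participants have finished scanning */
        BarrierArriveAndWait(&shared->build_barrier,
                             WAIT_EVENT_GIN_BUILD_SCAN);

        /* phase 2: the leader splits the sorted data into BufFiles */
        if (is_leader)
        {
            /* ... _gin_partition_sorted_data() ... */
        }

        /* wait until the leader has partitioned the data */
        BarrierArriveAndWait(&shared->build_barrier,
                             WAIT_EVENT_GIN_BUILD_PARTITION);

        /* phase 3: insert this participant's partition into the index */
        /* ... _gin_parallel_insert() ... */
    }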

---
 src/backend/access/gin/gininsert.c            | 450 +++++++++++-------
 .../utils/activity/wait_event_names.txt       |   2 +
 2 files changed, 286 insertions(+), 166 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index ba03c81b729..1dec12bf7fa 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -26,7 +26,9 @@
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "pgstat.h"
+#include "storage/barrier.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "storage/predicate.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/datum.h"
@@ -43,6 +45,11 @@
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
 
+/* The phases for parallel builds, used by build_barrier. */
+#define GIN_BUILD_INIT					0
+#define GIN_BUILD_SCAN					1
+#define GIN_BUILD_PARTITION				2
+
 /*
  * Status for index builds performed in parallel.  This is allocated in a
  * dynamic shared memory segment.
@@ -89,6 +96,9 @@ typedef struct GinBuildShared
 	double		reltuples;
 	double		indtuples;
 
+	Barrier		build_barrier;
+	SharedFileSet fileset;		/* space for shared temporary files */
+
 	/*
 	 * ParallelTableScanDescData data follows. Can't directly embed here, as
 	 * implementations of the parallel table scan desc interface might need
@@ -174,7 +184,6 @@ static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relati
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
 static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
-static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
 											  Relation heap, Relation index);
 static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
@@ -190,6 +199,12 @@ static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems);
 
+static double _gin_partition_sorted_data(GinBuildState *state);
+static void _gin_parallel_insert(GinBuildState *state,
+								 GinBuildShared *ginshared,
+								 Relation heap, Relation index,
+								 bool progress);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -700,8 +715,12 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
 
-		/* scan the relation in parallel and merge per-worker results */
-		reltuples = _gin_parallel_merge(state);
+		/* partition the sorted data */
+		reltuples = _gin_partition_sorted_data(state);
+
+		/* do the insert for the leader's partition */
+		_gin_parallel_insert(state, state->bs_leader->ginshared,
+							 heap, index, true);
 
 		_gin_end_parallel(state->bs_leader, state);
 	}
@@ -990,6 +1009,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	ginshared->reltuples = 0.0;
 	ginshared->indtuples = 0.0;
 
+	/* used to wait for data to insert */
+	BarrierInit(&ginshared->build_barrier, scantuplesortstates);
+
+	/* Set up the space we'll use for shared temporary files. */
+	SharedFileSetInit(&ginshared->fileset, pcxt->seg);
+
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
 								  snapshot);
@@ -1057,6 +1082,11 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned, leader continues");
 }
 
 /*
@@ -1070,6 +1100,8 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	/* Shutdown worker processes */
 	WaitForParallelWorkersToFinish(ginleader->pcxt);
 
+	SharedFileSetDeleteAll(&ginleader->ginshared->fileset);
+
 	/*
 	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
 	 * or we might get incomplete data.)
@@ -1714,169 +1746,6 @@ GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 	return GinBufferKeyEquals(buffer, tup);
 }
 
-/*
- * Within leader, wait for end of heap scan and merge per-worker results.
- *
- * After waiting for all workers to finish, merge the per-worker results into
- * the complete index. The results from each worker are sorted by block number
- * (start of the page range). While combinig the per-worker results we merge
- * summaries for the same page range, and also fill-in empty summaries for
- * ranges without any tuples.
- *
- * Returns the total number of heap tuples scanned.
- */
-static double
-_gin_parallel_merge(GinBuildState *state)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-	double		reltuples = 0;
-	GinBuffer  *buffer;
-
-	/* GIN tuples from workers, merged by leader */
-	double		numtuples = 0;
-
-	/* wait for workers to scan table and produce partial results */
-	reltuples = _gin_parallel_heapscan(state);
-
-	/* Execute the sort */
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
-								 PROGRESS_GIN_PHASE_PERFORMSORT_2);
-
-	/* do the actual sort in the leader */
-	tuplesort_performsort(state->bs_sortstate);
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The leader is allowed to use the whole maintenance_work_mem buffer to
-	 * combine data. The parallel workers already completed.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/*
-	 * Set the progress target for the next phase.  Reset the block number
-	 * values set by table_index_build_scan
-	 */
-	{
-		const int	progress_index[] = {
-			PROGRESS_CREATEIDX_SUBPHASE,
-			PROGRESS_CREATEIDX_TUPLES_TOTAL,
-			PROGRESS_SCAN_BLOCKS_TOTAL,
-			PROGRESS_SCAN_BLOCKS_DONE
-		};
-		const int64 progress_vals[] = {
-			PROGRESS_GIN_PHASE_MERGE_2,
-			state->bs_numtuples,
-			0, 0
-		};
-
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
-	}
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by category and
-	 * key. That probably gives us order matching how data is organized in the
-	 * index.
-	 *
-	 * We don't insert the GIN tuples right away, but instead accumulate as
-	 * many TIDs for the same key as possible, and then insert that at once.
-	 * This way we don't need to decompress/recompress the posting lists, etc.
-	 */
-	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
-	{
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nitems, &state->buildStats);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nfrozen, &state->buildStats);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferMergeTuple(buffer, tup);
-
-		if (buffer->cached)
-			GinBufferUnpackCached(buffer, 0);
-
-		/* Report progress */
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-									 ++numtuples);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		AssertCheckItemPointers(buffer);
-		Assert(!PointerIsValid(buffer->cached));
-
-		ginEntryInsert(&state->ginstate,
-					   buffer->attnum, buffer->key, buffer->category,
-					   buffer->items, buffer->nitems, &state->buildStats);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-
-		/* Report progress */
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-									 ++numtuples);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(state->bs_sortstate);
-
-	return reltuples;
-}
-
 /*
  * Returns size of shared memory required to store state for a parallel
  * gin index build based on the snapshot its parallel scan will use.
@@ -2094,6 +1963,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Prepare to track buffer usage during parallel execution */
 	InstrStartParallelQuery();
 
+	/* attach to the fileset too */
+	SharedFileSetAttach(&ginshared->fileset, seg);
+
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
 	 * (when requested number of workers were not launched, this will be
@@ -2104,6 +1976,20 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
 
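+	/*
+	 * The barriers below synchronize the build phases. Note that
+	 * BarrierArriveAndWait returns true in exactly one participant, so
+	 * each of the elog(LOG) messages is emitted only once per phase.
+	 */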
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned by workers, leader continues");
+
+	/* leader sorts and partitions the data */
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned by leader, worker continues");
+
+	_gin_parallel_insert(&buildstate, ginshared, heapRel, indexRel, false);
+
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
@@ -2376,3 +2262,235 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	return ItemPointerCompare(GinTupleGetFirst(a),
 							  GinTupleGetFirst(b));
 }
+
+static double
+_gin_partition_sorted_data(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	GinBuildShared *shared = state->bs_leader->ginshared;
+	BufFile   **files;
+	int64		fileidx = 0;
+	double		reltuples;
+
+	/* how many tuples per worker */
+	int64		worker_tuples = (state->indtuples / shared->scantuplesortstates) + 1;
+	int64		remaining = Min(worker_tuples, 1000);
+	int64		ntmp = 0;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* Allocate BufFiles, one for each participant, in the shared fileset. */
+	files = palloc0_array(BufFile *, shared->scantuplesortstates);
+
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		char		fname[MAXPGPATH];
+
+		sprintf(fname, "worker-%d", i);
+
+		files[i] = BufFileCreateFileSet(&shared->fileset.fs, fname);
+	}
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us an order matching how the data is
+	 * organized in the index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert them at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
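+	/*
+	 * The sorted tuples are dealt out to the per-worker files round-robin,
+	 * in chunks of up to 1000 tuples, so that each participant receives
+	 * roughly the same share of the data.
+	 */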
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		ntmp++;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * FIXME Maybe move to next partition only when the index key changes?
+		 * Otherwise we might have issues with 'could not fit onto page' when
+		 * adding overlapping TID lists to the index. But maybe that can't
+		 * happen, thanks to the merging of data in the tuplesort?
+		 */
+
+		BufFileWrite(files[fileidx], &tuplen, sizeof(tuplen));
+		BufFileWrite(files[fileidx], tup, tuplen);
+
+		remaining--;
+
+		/* move to the next file */
+		if (remaining == 0)
+		{
+			remaining = Min(worker_tuples, 1000);
+			fileidx++;
+			fileidx = fileidx % shared->scantuplesortstates;
+		}
+	}
+
+	/* close the files */
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		BufFileClose(files[i]);
+	}
+
+	/* and also close the tuplesort */
+	tuplesort_end(state->bs_sortstate);
+
+	/* signal to the workers that the data has been partitioned */
+	if (BarrierArriveAndWait(&shared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned, leader continues");
+
+	return reltuples;
+}
+
+static void
+_gin_parallel_insert(GinBuildState *state, GinBuildShared *ginshared,
+					 Relation heap, Relation index, bool progress)
+{
+	GinBuffer  *buffer;
+	GinTuple   *tup;
+	Size		len;
+
+	BufFile    *file;
+	char		fname[MAXPGPATH];
+	char	   *buff;
+	int64		ntuples = 0;
+	Size		maxlen;
+
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * Each participant (the leader and the workers) reads its own partition
+	 * file, and combines the entries using its own buffer.
+	 */
+	buffer = GinBufferInit(state->ginstate.index);
+
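+	/*
+	 * Open the partition file assigned to this participant. The leader has
+	 * ParallelWorkerNumber == -1, so it reads "worker-0", while the actual
+	 * workers read "worker-1" and above.
+	 */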
+	sprintf(fname, "worker-%d", ParallelWorkerNumber + 1);
+	file = BufFileOpenFileSet(&ginshared->fileset.fs, fname, O_RDONLY, false);
+
+	/* 8kB seems like a reasonable starting point */
+	maxlen = 8192;
+	buff = palloc(maxlen);
+
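+	/*
+	 * Each record in the partition file is a length (Size) followed by a
+	 * serialized GinTuple of that length, as written out by
+	 * _gin_partition_sorted_data().
+	 */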
+	while (true)
+	{
+		size_t		ret;
+
+		ret = BufFileRead(file, &len, sizeof(len));
+
+		if (ret == 0)
+			break;
+		if (ret != sizeof(len))
+			elog(ERROR, "incorrect data %zu %zu", ret, sizeof(len));
+
+		/* maybe resize the buffer */
+		if (maxlen < len)
+		{
+			while (maxlen < len)
+				maxlen *= 2;
+
+			buff = repalloc(buff, maxlen);
+		}
+
+		tup = (GinTuple *) buff;
+
+		BufFileReadExact(file, tup, len);
+
+		ntuples++;
+
+		if (ntuples % 100000 == 0)
+			elog(LOG, "inserted " INT64_FORMAT " tuples", ntuples);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * We have enough frozen TIDs - write the frozen part of the TID
+			 * list into the index, and discard it from the buffer.
+			 */
+			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer);
+
+		Assert(!PointerIsValid(buffer->cached));
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	BufFileClose(file);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..afb9be848a0 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -116,6 +116,8 @@ CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from star
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
 CHECKPOINT_START	"Waiting for a checkpoint to start."
 EXECUTE_GATHER	"Waiting for activity from a child process while executing a <literal>Gather</literal> plan node."
+GIN_BUILD_PARTITION	"Waiting for the partitioning of data during a parallel GIN index build."
+GIN_BUILD_SCAN	"Waiting for the scan of data during a parallel GIN index build."
 HASH_BATCH_ALLOCATE	"Waiting for an elected Parallel Hash participant to allocate a hash table."
 HASH_BATCH_ELECT	"Waiting to elect a Parallel Hash participant to allocate a hash table."
 HASH_BATCH_LOAD	"Waiting for other Parallel Hash participants to finish loading a hash table."
-- 
2.48.1

#54Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#53)
1 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

While working on the progress reporting, I've been looking into the
performance results, particularly why the parallelism doesn't help much
for some indexes - e.g. the index on the headers JSONB column.

CREATE INDEX headers_jsonb_idx
ON messages USING gin (msg_headers);

In this case the parallelism helps only a little bit - the serial build
takes ~47 seconds, while a parallel build with 1 worker (so 2 processes,
including the leader) takes ~40 seconds. Not great.

There are two reasons for this. First, the "keys" (JSONB values) are
mostly unique, with only 1 or 2 TIDs per key, which means the workers
can't really do much merging. But shifting the merges to workers is the
main benefit of parallel builds - if the merge happens in the leader
anyway, this explains the lack of speedup.

The other reason is that with JSON keys the comparisons are rather
expensive, and we're comparing a lot of keys. It occurred to me we can
work around this by comparing hashes first, and comparing the full keys
only when the hashes match. And indeed, this helps a lot (there's a very
rough PoC patch attached) - I'm seeing ~20% speedup from this, so the
parallel build runs in ~30 seconds now. Still not a great speedup over
the serial build, but better than before.
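
To illustrate the idea, the comparator does a cheap hash check first and
only falls back to the expensive datum comparison when the hashes match.
This mirrors the shape of the PoC's ginCompareEntries change; the
standalone helper and its signature are made up for illustration:

    static int
    compare_keys_hashed(uint32 hasha, Datum keya,
                        uint32 hashb, Datum keyb,
                        FmgrInfo *cmpFn, Oid collation)
    {
        /* different hashes imply different keys, skip the expensive call */
        if (hasha != hashb)
            return (hasha < hashb) ? -1 : 1;

        /* equal hashes may still be a collision, compare the actual keys */
        return DatumGetInt32(FunctionCall2Coll(cmpFn, collation, keya, keyb));
    }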

But I think this optimization is mostly orthogonal to parallel builds,
i.e. we could do the same thing for serial builds (while accumulating
data in memory, we could do these comparisons). But we'd need to be
careful to still write the data out in the "natural" order, not in the
hash order. The hash randomizes the insertion pattern, making bulk
inserts much less efficient (it thrashes the buffers, etc.). The PoC
patch for parallel builds addresses this by ignoring the hash during the
final tuplesort; the serial builds would need to do something similar.
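
One way to express that is a comparator that consults the hash only in
the intermediate sorts (a hypothetical sketch - the use_hash flag below
is made up, it's not how the attached PoC is structured):

    static int
    _gin_compare_tuples_sketch(GinTuple *a, GinTuple *b, bool use_hash)
    {
        if (a->category != b->category)
            return (a->category < b->category) ? -1 : 1;

        /* cheap pre-check, used only where hash order is acceptable */
        if (use_hash && a->hash != b->hash)
            return (a->hash < b->hash) ? -1 : 1;

        /* ... fall back to the full key comparison, as today ... */
        return 0;
    }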

My conclusion is this can be left as a future improvement, independent
of the parallel builds.

regards

--
Tomas Vondra

Attachments:

poc-gin-build-hashing.patchtext/x-patch; charset=UTF-8; name=poc-gin-build-hashing.patchDownload
diff --git a/src/backend/access/gin/ginbulk.c b/src/backend/access/gin/ginbulk.c
index 302cb2092a9..74cc62839cb 100644
--- a/src/backend/access/gin/ginbulk.c
+++ b/src/backend/access/gin/ginbulk.c
@@ -76,8 +76,8 @@ cmpEntryAccumulator(const RBTNode *a, const RBTNode *b, void *arg)
 	BuildAccumulator *accum = (BuildAccumulator *) arg;
 
 	return ginCompareAttEntries(accum->ginstate,
-								ea->attnum, ea->key, ea->category,
-								eb->attnum, eb->key, eb->category);
+								ea->attnum, ea->key, ea->hash, ea->category,
+								eb->attnum, eb->key, eb->hash, eb->category);
 }
 
 /* Allocator function for rbtree.c */
@@ -147,7 +147,7 @@ getDatumCopy(BuildAccumulator *accum, OffsetNumber attnum, Datum value)
 static void
 ginInsertBAEntry(BuildAccumulator *accum,
 				 ItemPointer heapptr, OffsetNumber attnum,
-				 Datum key, GinNullCategory category)
+				 Datum key, uint32 hash, GinNullCategory category)
 {
 	GinEntryAccumulator eatmp;
 	GinEntryAccumulator *ea;
@@ -159,6 +159,7 @@ ginInsertBAEntry(BuildAccumulator *accum,
 	 */
 	eatmp.attnum = attnum;
 	eatmp.key = key;
+	eatmp.hash = hash;
 	eatmp.category = category;
 	/* temporarily set up single-entry itempointer list */
 	eatmp.list = heapptr;
@@ -209,7 +210,7 @@ ginInsertBAEntry(BuildAccumulator *accum,
 void
 ginInsertBAEntries(BuildAccumulator *accum,
 				   ItemPointer heapptr, OffsetNumber attnum,
-				   Datum *entries, GinNullCategory *categories,
+				   Datum *entries, uint32 *hashes, GinNullCategory *categories,
 				   int32 nentries)
 {
 	uint32		step = nentries;
@@ -236,7 +237,7 @@ ginInsertBAEntries(BuildAccumulator *accum,
 
 		for (i = step - 1; i < nentries && i >= 0; i += step << 1 /* *2 */ )
 			ginInsertBAEntry(accum, heapptr, attnum,
-							 entries[i], categories[i]);
+							 entries[i], hashes[i], categories[i]);
 
 		step >>= 1;				/* /2 */
 	}
diff --git a/src/backend/access/gin/ginentrypage.c b/src/backend/access/gin/ginentrypage.c
index ba6bbd562df..8b2125bb00c 100644
--- a/src/backend/access/gin/ginentrypage.c
+++ b/src/backend/access/gin/ginentrypage.c
@@ -255,8 +255,8 @@ entryIsMoveRight(GinBtree btree, Page page)
 	key = gintuple_get_key(btree->ginstate, itup, &category);
 
 	if (ginCompareAttEntries(btree->ginstate,
-							 btree->entryAttnum, btree->entryKey, btree->entryCategory,
-							 attnum, key, category) > 0)
+							 btree->entryAttnum, btree->entryKey, 0, btree->entryCategory,
+							 attnum, key, 0, category) > 0)
 		return true;
 
 	return false;
@@ -313,8 +313,9 @@ entryLocateEntry(GinBtree btree, GinBtreeStack *stack)
 			result = ginCompareAttEntries(btree->ginstate,
 										  btree->entryAttnum,
 										  btree->entryKey,
+										  0,
 										  btree->entryCategory,
-										  attnum, key, category);
+										  attnum, key, 0, category);
 		}
 
 		if (result == 0)
@@ -384,8 +385,9 @@ entryLocateLeafEntry(GinBtree btree, GinBtreeStack *stack)
 		result = ginCompareAttEntries(btree->ginstate,
 									  btree->entryAttnum,
 									  btree->entryKey,
+									  0,
 									  btree->entryCategory,
-									  attnum, key, category);
+									  attnum, key, 0, category);
 		if (result == 0)
 		{
 			stack->off = mid;
diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c
index a6d88572cc2..6010e8091bb 100644
--- a/src/backend/access/gin/ginfast.c
+++ b/src/backend/access/gin/ginfast.c
@@ -746,7 +746,7 @@ processPendingPage(BuildAccumulator *accum, KeyArray *ka,
 			 * and reset ka.
 			 */
 			ginInsertBAEntries(accum, &heapptr, attrnum,
-							   ka->keys, ka->categories, ka->nvalues);
+							   ka->keys, NULL, ka->categories, ka->nvalues);
 			ka->nvalues = 0;
 			heapptr = itup->t_tid;
 			attrnum = curattnum;
@@ -759,7 +759,7 @@ processPendingPage(BuildAccumulator *accum, KeyArray *ka,
 
 	/* Dump out all remaining keys */
 	ginInsertBAEntries(accum, &heapptr, attrnum,
-					   ka->keys, ka->categories, ka->nvalues);
+					   ka->keys, NULL, ka->categories, ka->nvalues);
 }
 
 /*
diff --git a/src/backend/access/gin/ginget.c b/src/backend/access/gin/ginget.c
index f3b19d280c3..1d38b6aa462 100644
--- a/src/backend/access/gin/ginget.c
+++ b/src/backend/access/gin/ginget.c
@@ -283,8 +283,8 @@ collectMatchBitmap(GinBtreeData *btree, GinBtreeStack *stack,
 												&newCategory);
 
 					if (ginCompareEntries(btree->ginstate, attnum,
-										  newDatum, newCategory,
-										  idatum, icategory) == 0)
+										  newDatum, 0, newCategory,
+										  idatum, 0, icategory) == 0)
 						break;	/* Found! */
 				}
 
@@ -1724,8 +1724,10 @@ collectMatchesForHeapRow(IndexScanDesc scan, pendingPosition *pos)
 						res = ginCompareEntries(&so->ginstate,
 												entry->attnum,
 												entry->queryKey,
+												0,
 												entry->queryCategory,
 												datum[StopMiddle - 1],
+												0,
 												category[StopMiddle - 1]);
 					}
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 7286432698e..f54e9e9b26c 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -193,7 +193,7 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 
 static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
-								  Datum key, int16 typlen, bool typbyval,
+								  Datum key, int16 typlen, bool typbyval, uint32 hash,
 								  ItemPointerData *items, uint32 nitems,
 								  Size *len);
 
@@ -412,8 +412,8 @@ ginEntryInsert(GinState *ginstate,
  * This function is used only during initial index creation.
  */
 static void
-ginHeapTupleBulkInsert(GinBuildState *buildstate, OffsetNumber attnum,
-					   Datum value, bool isNull,
+ginHeapTupleBulkInsert(GinBuildState *buildstate, Relation index,
+					   OffsetNumber attnum, Datum value, bool isNull,
 					   ItemPointer heapptr)
 {
 	Datum	   *entries;
@@ -421,14 +421,41 @@ ginHeapTupleBulkInsert(GinBuildState *buildstate, OffsetNumber attnum,
 	int32		nentries;
 	MemoryContext oldCtx;
 
+	uint32	   *hashes;
+	TypeCacheEntry *typentry;
+	TupleDesc	tdesc = RelationGetDescr(index);
+	Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+
 	oldCtx = MemoryContextSwitchTo(buildstate->funcCtx);
 	entries = ginExtractEntries(buildstate->accum.ginstate, attnum,
 								value, isNull,
 								&nentries, &categories);
+
+	hashes = palloc0(sizeof(uint32) * nentries);
+
 	MemoryContextSwitchTo(oldCtx);
 
+	typentry = lookup_type_cache(attr->atttypid, TYPECACHE_HASH_PROC);
+	if (OidIsValid(typentry->hash_proc))
+	{
+		/* FIXME is it correct to use attr->attcollation, not rd_indcollation? */
+		Oid	collation = attr->attcollation;
+		if (!OidIsValid(collation))
+			collation = DEFAULT_COLLATION_OID;
+
+		for (int i = 0; i < nentries; i++)
+		{
+			if (categories[i] != GIN_CAT_NORM_KEY)
+				continue;
+
+			hashes[i] = DatumGetUInt32(OidFunctionCall1Coll(typentry->hash_proc,
+															collation,
+															entries[i]));
+		}
+	}
+
 	ginInsertBAEntries(&buildstate->accum, heapptr, attnum,
-					   entries, categories, nentries);
+					   entries, hashes, categories, nentries);
 
 	buildstate->indtuples += nentries;
 
@@ -446,7 +473,7 @@ ginBuildCallback(Relation index, ItemPointer tid, Datum *values,
 	oldCtx = MemoryContextSwitchTo(buildstate->tmpCtx);
 
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
-		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+		ginHeapTupleBulkInsert(buildstate, index, (OffsetNumber) (i + 1),
 							   values[i], isnull[i], tid);
 
 	/* If we've maxed out our available memory, dump everything to the index */
@@ -495,6 +522,8 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 	{
 		/* information about the key */
 		Form_pg_attribute attr = TupleDescAttr(tdesc, (attnum - 1));
+		TypeCacheEntry *typentry;
+		uint32	hash = 0;
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
@@ -503,8 +532,22 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
+		/* calculate hash of the key */
+		typentry = lookup_type_cache(attr->atttypid, TYPECACHE_HASH_PROC);
+		if (OidIsValid(typentry->hash_proc) && (category == GIN_CAT_NORM_KEY))
+		{
+			/* FIXME is it correct to use attr->attcollation, not rd_indcollation? */
+			Oid	collation = attr->attcollation;
+			if (!OidIsValid(collation))
+				collation = DEFAULT_COLLATION_OID;
+
+			hash = DatumGetUInt32(OidFunctionCall1Coll(typentry->hash_proc,
+													   collation,
+													   key));
+		}
+
 		tup = _gin_build_tuple(attnum, category,
-							   key, attr->attlen, attr->attbyval,
+							   key, attr->attlen, attr->attbyval, hash,
 							   list, nlist, &tuplen);
 
 		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
@@ -562,7 +605,7 @@ ginBuildCallbackParallel(Relation index, ItemPointer tid, Datum *values,
 	buildstate->tid = *tid;
 
 	for (i = 0; i < buildstate->ginstate.origTupdesc->natts; i++)
-		ginHeapTupleBulkInsert(buildstate, (OffsetNumber) (i + 1),
+		ginHeapTupleBulkInsert(buildstate, index, (OffsetNumber) (i + 1),
 							   values[i], isnull[i], tid);
 
 	/*
@@ -1149,6 +1192,7 @@ typedef struct GinBuffer
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
 	Size		keylen;			/* number of bytes (not typlen) */
+	uint32		hash;
 
 	/* type info */
 	int16		typlen;
@@ -1315,6 +1359,9 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	if (tup->category != buffer->category)
 		return false;
 
+	if (tup->hash != buffer->hash)
+		return false;
+
 	/*
 	 * For NULL/empty keys, this means equality, for normal keys we need to
 	 * compare the actual key value.
@@ -1377,6 +1424,7 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		buffer->typlen = tup->typlen;
 		buffer->typbyval = tup->typbyval;
+		buffer->hash = tup->hash;
 
 		if (tup->category == GIN_CAT_NORM_KEY)
 			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
@@ -1695,6 +1743,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
+									0,
 									buffer->items, buffer->nitems, &ntuplen);
 
 			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
@@ -1723,6 +1772,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 
 		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
+								0,
 								buffer->items, buffer->nitems, &ntuplen);
 
 		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
@@ -1971,7 +2021,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
-				 Datum key, int16 typlen, bool typbyval,
+				 Datum key, int16 typlen, bool typbyval, uint32 hash,
 				 ItemPointerData *items, uint32 nitems,
 				 Size *len)
 {
@@ -2030,6 +2080,7 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	tuple->category = category;
 	tuple->keylen = keylen;
 	tuple->nitems = nitems;
+	tuple->hash = hash;
 
 	/* key type info */
 	tuple->typlen = typlen;
@@ -2138,6 +2189,12 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	if (a->category > b->category)
 		return 1;
 
+	if (a->hash < b->hash)
+		return -1;
+
+	if (a->hash > b->hash)
+		return 1;
+
 	if (a->category == GIN_CAT_NORM_KEY)
 	{
 		keya = _gin_parse_tuple(a, NULL);
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index 63ded6301e2..4bb15d6ccc3 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -82,8 +82,10 @@ ginFillScanEntry(GinScanOpaque so, OffsetNumber attnum,
 				prevEntry->attnum == attnum &&
 				ginCompareEntries(ginstate, attnum,
 								  prevEntry->queryKey,
+								  0,
 								  prevEntry->queryCategory,
 								  queryKey,
+								  0,
 								  queryCategory) == 0)
 			{
 				/* Successful match */
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index a61532538c0..f47417ee638 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -388,8 +388,8 @@ GinInitMetabuffer(Buffer b)
  */
 int
 ginCompareEntries(GinState *ginstate, OffsetNumber attnum,
-				  Datum a, GinNullCategory categorya,
-				  Datum b, GinNullCategory categoryb)
+				  Datum a, uint32 hasha, GinNullCategory categorya,
+				  Datum b, uint32 hashb, GinNullCategory categoryb)
 {
 	/* if not of same null category, sort by that first */
 	if (categorya != categoryb)
@@ -399,6 +399,9 @@ ginCompareEntries(GinState *ginstate, OffsetNumber attnum,
 	if (categorya != GIN_CAT_NORM_KEY)
 		return 0;
 
+	if (hasha != hashb)
+		return (hasha < hashb) ? -1 : 1;
+
 	/* both not null, so safe to call the compareFn */
 	return DatumGetInt32(FunctionCall2Coll(&ginstate->compareFn[attnum - 1],
 										   ginstate->supportCollation[attnum - 1],
@@ -410,14 +413,14 @@ ginCompareEntries(GinState *ginstate, OffsetNumber attnum,
  */
 int
 ginCompareAttEntries(GinState *ginstate,
-					 OffsetNumber attnuma, Datum a, GinNullCategory categorya,
-					 OffsetNumber attnumb, Datum b, GinNullCategory categoryb)
+					 OffsetNumber attnuma, Datum a, uint32 hasha, GinNullCategory categorya,
+					 OffsetNumber attnumb, Datum b, uint32 hashb, GinNullCategory categoryb)
 {
 	/* attribute number is the first sort key */
 	if (attnuma != attnumb)
 		return (attnuma < attnumb) ? -1 : 1;
 
-	return ginCompareEntries(ginstate, attnuma, a, categorya, b, categoryb);
+	return ginCompareEntries(ginstate, attnuma, a, hasha, categorya, b, hashb, categoryb);
 }
 
 
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 95d8805b66f..d213f22d6f3 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -97,11 +97,11 @@ extern void GinInitBuffer(Buffer b, uint32 f);
 extern void GinInitPage(Page page, uint32 f, Size pageSize);
 extern void GinInitMetabuffer(Buffer b);
 extern int	ginCompareEntries(GinState *ginstate, OffsetNumber attnum,
-							  Datum a, GinNullCategory categorya,
-							  Datum b, GinNullCategory categoryb);
+							  Datum a, uint32 hasha, GinNullCategory categorya,
+							  Datum b, uint32 hashb, GinNullCategory categoryb);
 extern int	ginCompareAttEntries(GinState *ginstate,
-								 OffsetNumber attnuma, Datum a, GinNullCategory categorya,
-								 OffsetNumber attnumb, Datum b, GinNullCategory categoryb);
+								 OffsetNumber attnuma, Datum a, uint32 hasha, GinNullCategory categorya,
+								 OffsetNumber attnumb, Datum b, uint32 hashb, GinNullCategory categoryb);
 extern Datum *ginExtractEntries(GinState *ginstate, OffsetNumber attnum,
 								Datum value, bool isNull,
 								int32 *nentries, GinNullCategory **categories);
@@ -423,6 +423,7 @@ typedef struct GinEntryAccumulator
 {
 	RBTNode		rbtnode;
 	Datum		key;
+	uint32		hash;
 	GinNullCategory category;
 	OffsetNumber attnum;
 	bool		shouldSort;
@@ -444,7 +445,8 @@ typedef struct
 extern void ginInitBA(BuildAccumulator *accum);
 extern void ginInsertBAEntries(BuildAccumulator *accum,
 							   ItemPointer heapptr, OffsetNumber attnum,
-							   Datum *entries, GinNullCategory *categories,
+							   Datum *entries, uint32 *hashes,
+							   GinNullCategory *categories,
 							   int32 nentries);
 extern void ginBeginBAScan(BuildAccumulator *accum);
 extern ItemPointerData *ginGetBAEntry(BuildAccumulator *accum,
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index ce555031335..1075aad917d 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -26,6 +26,7 @@ typedef struct GinTuple
 	bool		typbyval;		/* typbyval for key */
 	signed char category;		/* category: normal or NULL? */
 	int			nitems;			/* number of TIDs in the data */
+	uint32		hash;			/* hash of the value */
 	char		data[FLEXIBLE_ARRAY_MEMBER];
 } GinTuple;
 
#55Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#54)
4 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

I've pushed the first part of the series (0001 + the cleanup and
progress patch). That leaves the two smaller improvement parts
(compression + memory limit enforcement) - I intend to push those
sometime this week, if possible.

Here's a rebased version of the whole patch series, including the two
WIP parts that are unlikely to make it into PG18 at this point.

regards

--
Tomas Vondra

Attachments:

v20250303-0001-Compress-TID-lists-when-writing-GIN-tuples.patchtext/x-patch; charset=UTF-8; name=v20250303-0001-Compress-TID-lists-when-writing-GIN-tuples.patchDownload
From 0541012bd9a092d0d6e4c020608d4fdea98d7ab8 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:01:43 +0100
Subject: [PATCH v20250303 1/4] Compress TID lists when writing GIN tuples to
 disk

When serializing GIN tuples to tuplesorts during parallel index builds,
we can significantly reduce the amount of data by compressing the TID
lists. The GIN opclasses may produce a lot of data (depending on how
many keys are extracted from each row), and the TID compression is very
efficient and effective.

If the number of distinct keys is high, the first worker pass (reading
data from the table and writing them into a private tuplesort) may not
benefit from the compression very much. It is likely to spill data to
disk before the TID lists get long enough for the compression to help.
The second pass (writing the merged data into the shared tuplesort) is
more likely to benefit from compression.

The compression can be seen as a way to reduce the amount of disk space
needed by the parallel builds, because the data is written twice - first
into the per-worker tuplesorts, then into the shared tuplesort.

Author: Tomas Vondra
Reviewed-by: Matthias van de Meent
Discussion:
---
 src/backend/access/gin/gininsert.c | 116 +++++++++++++++++++++++------
 src/tools/pgindent/typedefs.list   |   1 +
 2 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e399d867e0f..27c14adbc3a 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -190,7 +190,9 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 Relation heap, Relation index,
 										 int sortmem, bool progress);
 
-static Datum _gin_parse_tuple(GinTuple *a, ItemPointerData **items);
+static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static Datum _gin_parse_tuple_key(GinTuple *a);
+
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems,
@@ -1365,7 +1367,8 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple(tup, &items);
+	key = _gin_parse_tuple_key(tup);
+	items = _gin_parse_tuple_items(tup);
 
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
@@ -1401,6 +1404,9 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 		AssertCheckItemPointers(buffer);
 	}
+
+	/* free the decompressed TID list */
+	pfree(items);
 }
 
 /*
@@ -1955,6 +1961,15 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	table_close(heapRel, heapLockmode);
 }
 
+/*
+ * Used to keep track of compressed TID lists when building a GIN tuple.
+ */
+typedef struct
+{
+	dlist_node	node;			/* linked list pointers */
+	GinPostingList *seg;
+} GinSegmentInfo;
+
 /*
  * _gin_build_tuple
  *		Serialize the state for an index key into a tuple for tuplesort.
@@ -1967,6 +1982,11 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
  * like endianess etc. We could make it a little bit smaller, but it's not
  * worth it - it's a tiny fraction of the data, and we need to MAXALIGN the
  * start of the TID list anyway. So we wouldn't save anything.
+ *
+ * The TID list is serialized as compressed - it's highly compressible, and
+ * we already have ginCompressPostingList for this purpose. The list may be
+ * pretty long, so we compress it into multiple segments and then copy all
+ * of that into the GIN tuple.
  */
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1980,6 +2000,11 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	Size		tuplen;
 	int			keylen;
 
+	dlist_mutable_iter iter;
+	dlist_head	segments;
+	int			ncompressed;
+	Size		compresslen;
+
 	/*
 	 * Calculate how long is the key value. Only keys with GIN_CAT_NORM_KEY
 	 * have actual non-empty key. We include varlena headers and \0 bytes for
@@ -2006,12 +2031,34 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	else
 		elog(ERROR, "unexpected typlen value (%d)", typlen);
 
+	/* compress the item pointers */
+	ncompressed = 0;
+	compresslen = 0;
+	dlist_init(&segments);
+
+	/* generate compressed segments of TID list chunks */
+	while (ncompressed < nitems)
+	{
+		int			cnt;
+		GinSegmentInfo *seginfo = palloc(sizeof(GinSegmentInfo));
+
+		seginfo->seg = ginCompressPostingList(&items[ncompressed],
+											  (nitems - ncompressed),
+											  UINT16_MAX,
+											  &cnt);
+
+		ncompressed += cnt;
+		compresslen += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_push_tail(&segments, &seginfo->node);
+	}
+
 	/*
 	 * Determine GIN tuple length with all the data included. Be careful about
-	 * alignment, to allow direct access to item pointers.
+	 * alignment, to allow direct access to compressed segments (those require
+	 * only SHORTALIGN).
 	 */
-	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) +
-		(sizeof(ItemPointerData) * nitems);
+	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
 	*len = tuplen;
 
@@ -2061,37 +2108,40 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	/* finally, copy the TIDs into the array */
 	ptr = (char *) tuple + SHORTALIGN(offsetof(GinTuple, data) + keylen);
 
-	memcpy(ptr, items, sizeof(ItemPointerData) * nitems);
+	/* copy in the compressed data, and free the segments */
+	dlist_foreach_modify(iter, &segments)
+	{
+		GinSegmentInfo *seginfo = dlist_container(GinSegmentInfo, node, iter.cur);
+
+		memcpy(ptr, seginfo->seg, SizeOfGinPostingList(seginfo->seg));
+
+		ptr += SizeOfGinPostingList(seginfo->seg);
+
+		dlist_delete(&seginfo->node);
+
+		pfree(seginfo->seg);
+		pfree(seginfo);
+	}
 
 	return tuple;
 }
 
 /*
- * _gin_parse_tuple
- *		Deserialize the tuple from the tuplestore representation.
+ * _gin_parse_tuple_key
+ *		Return a Datum representing the key stored in the tuple.
  *
- * Most of the fields are actually directly accessible, the only thing that
+ * Most of the tuple fields are directly accessible, the only thing that
  * needs more care is the key and the TID list.
  *
  * For the key, this returns a regular Datum representing it. It's either the
  * actual key value, or a pointer to the beginning of the data array (which is
  * where the data was copied by _gin_build_tuple).
- *
- * The pointer to the TID list is returned through 'items' (which is simply
- * a pointer to the data array).
  */
 static Datum
-_gin_parse_tuple(GinTuple *a, ItemPointerData **items)
+_gin_parse_tuple_key(GinTuple *a)
 {
 	Datum		key;
 
-	if (items)
-	{
-		char	   *ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
-
-		*items = (ItemPointerData *) ptr;
-	}
-
 	if (a->category != GIN_CAT_NORM_KEY)
 		return (Datum) 0;
 
@@ -2104,6 +2154,28 @@ _gin_parse_tuple(GinTuple *a, ItemPointerData **items)
 	return PointerGetDatum(a->data);
 }
 
+/*
+ * _gin_parse_tuple_items
+ *		Return a palloc'd array of TIDs decompressed from the tuple.
+ */
+static ItemPointer
+_gin_parse_tuple_items(GinTuple *a)
+{
+	int			len;
+	char	   *ptr;
+	int			ndecoded;
+	ItemPointer items;
+
+	len = a->tuplen - SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+
+	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+
+	Assert(ndecoded == a->nitems);
+
+	return (ItemPointer) items;
+}
+
 /*
  * _gin_compare_tuples
  *		Compare GIN tuples, used by tuplesort during parallel index build.
@@ -2139,8 +2211,8 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 
 	if (a->category == GIN_CAT_NORM_KEY)
 	{
-		keya = _gin_parse_tuple(a, NULL);
-		keyb = _gin_parse_tuple(b, NULL);
+		keya = _gin_parse_tuple_key(a);
+		keyb = _gin_parse_tuple_key(b);
 
 		r = ApplySortComparator(keya, false,
 								keyb, false,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 19ff271ba50..9840060997f 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1052,6 +1052,7 @@ GinScanEntry
 GinScanKey
 GinScanOpaque
 GinScanOpaqueData
+GinSegmentInfo
 GinState
 GinStatsData
 GinTernaryValue
-- 
2.48.1

v20250303-0002-Enforce-memory-limit-during-parallel-GIN-b.patchtext/x-patch; charset=UTF-8; name=v20250303-0002-Enforce-memory-limit-during-parallel-GIN-b.patchDownload
From 916d9c7d8223a5259b0df3a766a835d429788b1c Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Sat, 15 Feb 2025 21:02:45 +0100
Subject: [PATCH v20250303 2/4] Enforce memory limit during parallel GIN builds

Index builds are expected to respect maintenance_work_mem, just like
other maintenance operations. For serial builds this is done simply by
flushing the buffer in ginBuildCallback() into the index. But with
parallel builds it's more complicated, because there are multiple places
that can allocate memory.

ginBuildCallbackParallel() does the same thing as ginBuildCallback(),
except that the accumulated items are written into tuplesort. Then the
entries with the same key get merged - first in the worker, then in the
leader - and the TID lists may get (arbitrarily) long. It's unlikely it
would exceed the memory limit, but it's possible. We address this by
evicting some of the data if the list gets too long.

We can't simply dump the whole in-memory TID list. The GIN index bulk
insert code expects to see TIDs in monotonic order; it may fail if the
TIDs go backwards. If the TID lists overlap, evicting the whole current
TID list would break this (a later entry might add "old" TID values into
the already-written part).

In the workers this is not an issue, because the lists never overlap.
But the leader may see overlapping lists produced by the workers.

We can however derive a safe "horizon" TID - the entries (for a given
key) are sorted by (key, first TID), which means no future list can add
values before the last "first TID" we've seen. This patch tracks the
"frozen" part of the TID list, which we know can't change by merging
additional TID lists. If needed, we can evict this part of the list.

We don't want to do this too often - the smaller lists we evict, the
more expensive it'll be to merge them in the next step (especially in
the leader). Therefore we only trim the list if we have at least 1024
frozen items, and if the whole list is at least 64kB large.

These limits are somewhat arbitrary and fairly low. We might calculate
some limits from maintenance_work_mem, but judging by experiments that
does not really improve anything (time, compression ratio, ...). So we
stick to these conservative limits to release memory faster.

Author: Tomas Vondra
Reviewed-by: Matthias van de Meent
Discussion:
---
 src/backend/access/gin/gininsert.c | 212 +++++++++++++++++++++++++++--
 1 file changed, 204 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 27c14adbc3a..b2f89cad880 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1155,8 +1155,12 @@ typedef struct GinBuffer
 	int16		typlen;
 	bool		typbyval;
 
+	/* Number of TIDs to collect before attempting to write some out. */
+	int			maxitems;
+
 	/* array of TID values */
 	int			nitems;
+	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
 } GinBuffer;
@@ -1229,6 +1233,13 @@ GinBufferInit(Relation index)
 				nKeys;
 	TupleDesc	desc = RelationGetDescr(index);
 
+	/*
+	 * How many items can we fit into the memory limit? We don't want to end
+	 * up with too many TIDs, and 64kB seems more than enough. But maybe this
+	 * should be tied to maintenance_work_mem or something like that?
+	 */
+	buffer->maxitems = (64 * 1024L) / sizeof(ItemPointerData);
+
 	nKeys = IndexRelationGetNumberOfKeyAttributes(index);
 
 	buffer->ssup = palloc0(sizeof(SortSupportData) * nKeys);
@@ -1336,6 +1347,48 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 	return (r == 0);
 }
 
+/*
+ * GinBufferShouldTrim
+ *		Should we trim the list of item pointers?
+ *
+ * By trimming we mean writing out and removing the TIDs that
+ * we know can't change by future merges. We can deduce the TID up to which
+ * this is guaranteed from the "first" TID in each GIN tuple, which provides
+ * a "horizon" (for a given key) thanks to the sort.
+ *
+ * We don't want to do this too often - compressing longer TID lists is more
+ * efficient. But we also don't want to accumulate too many TIDs, for two
+ * reasons. First, it consumes memory and we might exceed maintenance_work_mem
+ * (or whatever limit applies), even if that's unlikely because TIDs are very
+ * small so we can fit a lot of them. Second, and more importantly, long TID
+ * lists are an issue if the scan wraps around, because a key may get a very
+ * wide list (with min/max TID for that key), forcing "full" mergesorts for
+ * every list merged into it (instead of the efficient append).
+ *
+ * So we look at two things when deciding whether to trim - if the resulting
+ * list (after adding TIDs from the new tuple) would be too long, and if there
+ * are enough TIDs to trim (with values less than the "first" TID from the new
+ * tuple), we do the trim. By enough we mean at least 1024 TIDs (mostly an
+ * arbitrary number, matching the check below).
+ */
+static bool
+GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
+{
+	/* not enough TIDs to trim (1024 is a somewhat arbitrary number) */
+	if (buffer->nfrozen < 1024)
+		return false;
+
+	/* no need to trim if we have not hit the memory limit yet */
+	if ((buffer->nitems + tup->nitems) < buffer->maxitems)
+		return false;
+
+	/*
+	 * OK, we have enough frozen TIDs to flush, and we have hit the memory
+	 * limit, so it's time to write it out.
+	 */
+	return true;
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1386,21 +1439,76 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 			buffer->key = (Datum) 0;
 	}
 
+	/*
+	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
+	 * the mergesort. We can do that with TIDs before the first TID in the new
+	 * tuple we're about to add into the buffer.
+	 *
+	 * We do this incrementally when adding data into the in-memory buffer,
+	 * and not later (e.g. when hitting a memory limit), because it allows us
+	 * to skip the frozen data during the mergesort, making it cheaper.
+	 */
+
+	/*
+	 * Check if the last TID in the current list is frozen. This is the case
+	 * when merging non-overlapping lists, e.g. in each parallel worker.
+	 */
+	if ((buffer->nitems > 0) &&
+		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
+							GinTupleGetFirst(tup)) == 0))
+		buffer->nfrozen = buffer->nitems;
+
+	/*
+	 * Now find the last TID we know to be frozen, i.e. the last TID right
+	 * before the new GIN tuple.
+	 *
+	 * Start with the first not-yet-frozen TID, and walk until we find the
+	 * first TID that's higher. If we already know the whole list is frozen
+	 * (i.e. nfrozen == nitems), this does nothing.
+	 *
+	 * XXX This might do a binary search for sufficiently long lists, but it
+	 * does not seem worth the complexity. Overlapping lists should be rare,
+	 * TID comparisons are cheap, and we should quickly freeze most of
+	 * the list.
+	 */
+	for (int i = buffer->nfrozen; i < buffer->nitems; i++)
+	{
+		/* Is the TID after the first TID of the new tuple? Can't freeze. */
+		if (ItemPointerCompare(&buffer->items[i],
+							   GinTupleGetFirst(tup)) > 0)
+			break;
+
+		buffer->nfrozen++;
+	}
+
 	/* add the new TIDs into the buffer, combine using merge-sort */
 	{
 		int			nnew;
 		ItemPointer new;
 
-		new = ginMergeItemPointers(buffer->items, buffer->nitems,
+		/*
+		 * Resize the array - we do this first, because we'll dereference the
+		 * first unfrozen TID, which would fail if the array is NULL. We'll
+		 * still pass 0 as number of elements in that array though.
+		 */
+		if (buffer->items == NULL)
+			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		else
+			buffer->items = repalloc(buffer->items,
+									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+
+		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfrozen */
+								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
 								   items, tup->nitems, &nnew);
 
-		Assert(nnew == buffer->nitems + tup->nitems);
+		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
 
-		if (buffer->items)
-			pfree(buffer->items);
+		memcpy(&buffer->items[buffer->nfrozen], new,
+			   nnew * sizeof(ItemPointerData));
 
-		buffer->items = new;
-		buffer->nitems = nnew;
+		pfree(new);
+
+		buffer->nitems += tup->nitems;
 
 		AssertCheckItemPointers(buffer);
 	}
@@ -1432,11 +1540,29 @@ GinBufferReset(GinBuffer *buffer)
 	buffer->category = 0;
 	buffer->keylen = 0;
 	buffer->nitems = 0;
+	buffer->nfrozen = 0;
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
 }
 
+/*
+ * GinBufferTrim
+ *		Discard the "frozen" part of the TID list (which should have been
+ *		written to disk/index before this call).
+ */
+static void
+GinBufferTrim(GinBuffer *buffer)
+{
+	Assert((buffer->nfrozen > 0) && (buffer->nfrozen <= buffer->nitems));
+
+	memmove(&buffer->items[0], &buffer->items[buffer->nfrozen],
+			sizeof(ItemPointerData) * (buffer->nitems - buffer->nfrozen));
+
+	buffer->nitems -= buffer->nfrozen;
+	buffer->nfrozen = 0;
+}
+
 /*
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
@@ -1504,7 +1630,12 @@ _gin_parallel_merge(GinBuildState *state)
 	/* do the actual sort in the leader */
 	tuplesort_performsort(state->bs_sortstate);
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers already completed.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/*
@@ -1562,6 +1693,32 @@ _gin_parallel_merge(GinBuildState *state)
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * We have enough frozen TIDs - write the frozen part of the TID
+			 * list into the index, and discard it from the buffer.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
@@ -1655,7 +1812,13 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 
 	GinBuffer  *buffer;
 
-	/* initialize buffer to combine entries for the same key */
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The workers are limited to the same amount of memory as during the sort
+	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
+	 * during planning, just like there.
+	 */
 	buffer = GinBufferInit(state->ginstate.index);
 
 	/* sort the raw per-worker data */
@@ -1711,6 +1874,39 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 			GinBufferReset(buffer);
 		}
 
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			GinTuple   *ntup;
+			Size		ntuplen;
+
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+
+			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
+									buffer->key, buffer->typlen, buffer->typbyval,
+									buffer->items, buffer->nfrozen, &ntuplen);
+
+			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+
+			pfree(ntup);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
 		/*
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
-- 
2.48.1

v20250303-0003-Use-a-single-GIN-tuplesort.patchtext/x-patch; charset=UTF-8; name=v20250303-0003-Use-a-single-GIN-tuplesort.patchDownload
From d1067eebd9846553b8318cea2972d1843fb46e05 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 25 Feb 2025 16:12:37 +0100
Subject: [PATCH v20250303 3/4] Use a single GIN tuplesort

The previous approach was to sort the data in a private per-worker sort,
then read it back, merge the GinTuples, and write them into the shared
sort, to be merged once more by the leader.

The new approach is to use a single sort, merging tuples as we write
them to disk.  This reduces temporary disk space.

An optimization was added to GinBuffer in which we don't deserialize
tuples unless we need access to the item pointers.

This modifies Tuplesort to have a new flushwrites callback. The sort's
writetup can now decide to buffer writes until the next flushwrites()
callback.
---
 src/backend/access/gin/gininsert.c         | 411 +++++++++------------
 src/backend/utils/sort/tuplesort.c         |   5 +
 src/backend/utils/sort/tuplesortvariants.c | 102 ++++-
 src/include/access/gin_private.h           |   3 +
 src/include/access/gin_tuple.h             |  10 +
 src/include/utils/tuplesort.h              |  10 +-
 src/tools/pgindent/typedefs.list           |   1 +
 7 files changed, 302 insertions(+), 240 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b2f89cad880..e873443784a 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -164,14 +164,6 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
-
-	/*
-	 * The sortstate used only within a single worker for the first merge pass
-	 * happenning there. In principle it doesn't need to be part of the build
-	 * state and we could pass it around directly, but it's more convenient
-	 * this way. And it's part of the build state, after all.
-	 */
-	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -195,8 +187,7 @@ static Datum _gin_parse_tuple_key(GinTuple *a);
 
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
-								  ItemPointerData *items, uint32 nitems,
-								  Size *len);
+								  ItemPointerData *items, uint32 nitems);
 
 /*
  * Adds array of item pointers to tuple's posting list, or
@@ -499,16 +490,15 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
-		Size		tuplen;
 
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
 		tup = _gin_build_tuple(attnum, category,
 							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+		tuplesort_putgintuple(buildstate->bs_sortstate, tup);
 
 		pfree(tup);
 	}
@@ -1144,8 +1134,14 @@ _gin_parallel_heapscan(GinBuildState *state)
  * during the initial table scan (and detecting when the scan wraps around),
  * and during merging (where we do mergesort).
  */
-typedef struct GinBuffer
+struct GinBuffer
 {
+	/*
+	 * The memory context holds the dynamic allocation of items, key, and any
+	 * produced GinTuples.
+	 */
+	MemoryContext context;
+	GinTuple   *cached;			/* copy of previous GIN tuple */
 	OffsetNumber attnum;
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
@@ -1163,7 +1159,7 @@ typedef struct GinBuffer
 	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
-} GinBuffer;
+};
 
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
@@ -1174,8 +1170,7 @@ AssertCheckItemPointers(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
+	Assert(buffer->nitems == 0 || buffer->items != NULL);
 
 	for (int i = 0; i < buffer->nitems; i++)
 	{
@@ -1201,7 +1196,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* if we have any items, the array must exist */
-	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+	Assert((buffer->nitems == 0) || (buffer->items != NULL));
 
 	/*
 	 * The buffer may be empty, in which case we must not call the check of
@@ -1225,7 +1220,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
  *
  * Initializes sort support procedures for all index attributes.
  */
-static GinBuffer *
+GinBuffer *
 GinBufferInit(Relation index)
 {
 	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
@@ -1288,15 +1283,18 @@ GinBufferInit(Relation index)
 
 		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
 	}
+	buffer->context = GenerationContextCreate(CurrentMemoryContext,
+											  "Gin Buffer",
+											  ALLOCSET_DEFAULT_SIZES);
 
 	return buffer;
 }
 
 /* Is the buffer empty, i.e. has no TID values in the array? */
-static bool
+bool
 GinBufferIsEmpty(GinBuffer *buffer)
 {
-	return (buffer->nitems == 0);
+	return (buffer->nitems == 0 && buffer->cached == NULL);
 }
 
 /*
@@ -1312,37 +1310,71 @@ GinBufferIsEmpty(GinBuffer *buffer)
 static bool
 GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	int			r;
+	AttrNumber	attnum;
 	Datum		tupkey;
+	Datum		bufkey;
 
 	AssertCheckGinBuffer(buffer);
+	if (buffer->cached)
+	{
+		GinTuple   *cached = buffer->cached;
 
-	if (tup->attrnum != buffer->attnum)
-		return false;
+		if (tup->attrnum != cached->attrnum)
+			return false;
 
-	/* same attribute should have the same type info */
-	Assert(tup->typbyval == buffer->typbyval);
-	Assert(tup->typlen == buffer->typlen);
+		Assert(tup->typbyval == cached->typbyval);
+		Assert(tup->typlen == cached->typlen);
 
-	if (tup->category != buffer->category)
-		return false;
+		if (tup->category != cached->category)
+			return false;
 
-	/*
-	 * For NULL/empty keys, this means equality, for normal keys we need to
-	 * compare the actual key value.
-	 */
-	if (buffer->category != GIN_CAT_NORM_KEY)
-		return true;
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (cached->category != GIN_CAT_NORM_KEY)
+			return true;
+
+		attnum = cached->attrnum;
+		bufkey = _gin_parse_tuple_key(cached);
+	}
+	else
+	{
+		if (tup->attrnum != buffer->attnum)
+			return false;
+
+		/* same attribute should have the same type info */
+		Assert(tup->typbyval == buffer->typbyval);
+		Assert(tup->typlen == buffer->typlen);
+
+		if (tup->category != buffer->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (buffer->category != GIN_CAT_NORM_KEY)
+			return true;
+		attnum = buffer->attnum;
+		bufkey = buffer->key;
+	}
 
 	/*
 	 * For the tuple, get either the first sizeof(Datum) bytes for byval
 	 * types, or a pointer to the beginning of the data array.
 	 */
-	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+	tupkey = _gin_parse_tuple_key(tup);
+
+	prev = MemoryContextSwitchTo(buffer->context);
 
-	r = ApplySortComparator(buffer->key, false,
+	r = ApplySortComparator(bufkey, false,
 							tupkey, false,
-							&buffer->ssup[buffer->attnum - 1]);
+							&buffer->ssup[attnum - 1]);
+
+	MemoryContextSwitchTo(prev);
 
 	return (r == 0);
 }
@@ -1389,6 +1421,56 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 	return true;
 }
 
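+/*
+ * GinBufferUnpackCached
+ *		Expand the cached GinTuple into the buffer fields (key, TID list).
+ *
+ * The items array is (re)sized to fit the cached TIDs plus reserve_space
+ * extra items, so the caller can merge another TID list into it without
+ * an immediate repalloc.
+ */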
+static void
+GinBufferUnpackCached(GinBuffer *buffer, int reserve_space)
+{
+	Datum		key;
+	ItemPointer items;
+	GinTuple   *cached;
+	int			totitems;
+
+	cached = buffer->cached;
+	totitems = cached->nitems + reserve_space;
+	key = _gin_parse_tuple_key(cached);
+
+	buffer->category = cached->category;
+	buffer->keylen = cached->keylen;
+	buffer->attnum = cached->attrnum;
+
+	buffer->typlen = cached->typlen;
+	buffer->typbyval = cached->typbyval;
+
+	if (cached->category == GIN_CAT_NORM_KEY)
+		buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+	else
+		buffer->key = (Datum) 0;
+
+	items = _gin_parse_tuple_items(cached);
+
+	if (buffer->items == NULL)
+	{
+		buffer->items = palloc0(totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else if (buffer->maxitems < totitems)
+	{
+		buffer->items = repalloc(buffer->items,
+								 totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else
+	{
+		Assert(PointerIsValid(buffer->items) &&
+			   buffer->maxitems >= totitems);
+	}
+	memcpy(buffer->items, items, buffer->nitems * sizeof(ItemPointerData));
+	buffer->nitems = cached->nitems;
+
+	buffer->cached = NULL;
+	pfree(cached);
+	pfree(items);
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1412,32 +1494,29 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * workers. But the workers merge the items as much as possible, so there
  * should not be too many.
  */
-static void
-GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+void
+GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	ItemPointerData *items;
-	Datum		key;
 
+	prev = MemoryContextSwitchTo(buffer->context);
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple_key(tup);
-	items = _gin_parse_tuple_items(tup);
-
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
 	{
-		buffer->category = tup->category;
-		buffer->keylen = tup->keylen;
-		buffer->attnum = tup->attrnum;
-
-		buffer->typlen = tup->typlen;
-		buffer->typbyval = tup->typbyval;
+		GinTuple   *tuple = palloc(tup->tuplen);
 
-		if (tup->category == GIN_CAT_NORM_KEY)
-			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
-		else
-			buffer->key = (Datum) 0;
+		memcpy(tuple, tup, tup->tuplen);
+		buffer->cached = tuple;
 	}
+	else if (buffer->cached != NULL)
+	{
+		GinBufferUnpackCached(buffer, tup->nitems);
+	}
+
+	items = _gin_parse_tuple_items(tup);
 
 	/*
	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
@@ -1515,20 +1594,54 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	/* free the decompressed TID list */
 	pfree(items);
+
+	MemoryContextSwitchTo(prev);
+}
+
+GinTuple *
+GinBufferBuildTuple(GinBuffer *buffer)
+{
+	MemoryContext prev = MemoryContextSwitchTo(buffer->context);
+	GinTuple   *result;
+
+	if (buffer->cached)
+	{
+		result = buffer->cached;
+		buffer->cached = NULL;
+	}
+	else
+	{
+		result = _gin_build_tuple(buffer->attnum, buffer->category,
+								  buffer->key, buffer->typlen,
+								  buffer->typbyval, buffer->items,
+								  buffer->nitems);
+	}
+
+	GinBufferReset(buffer);
+
+	MemoryContextSwitchTo(prev);
+	return result;
 }
 
 /*
  * GinBufferReset
  *		Reset the buffer into a state as if it contains no data.
  */
-static void
+void
 GinBufferReset(GinBuffer *buffer)
 {
 	Assert(!GinBufferIsEmpty(buffer));
 
-	/* release byref values, do nothing for by-val ones */
-	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
-		pfree(DatumGetPointer(buffer->key));
+	/* release cached buffer tuple, if present */
+	if (buffer->cached)
+		pfree(buffer->cached);
+	else
+	{
+		/* release byref values, do nothing for by-val ones */
+		if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval
+			&& PointerIsValid(DatumGetPointer(buffer->key)))
+			pfree(DatumGetPointer(buffer->key));
+	}
 
 	/*
	 * Not required, but makes it more likely to trigger NULL dereference if
@@ -1544,6 +1657,7 @@ GinBufferReset(GinBuffer *buffer)
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
+	/* Note that we don't reset the memory context; this is deliberate */
 }
 
 /*
@@ -1567,7 +1681,7 @@ GinBufferTrim(GinBuffer *buffer)
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
  */
-static void
+void
 GinBufferFree(GinBuffer *buffer)
 {
 	if (buffer->items)
@@ -1578,6 +1692,7 @@ GinBufferFree(GinBuffer *buffer)
 		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
 		pfree(DatumGetPointer(buffer->key));
 
+	MemoryContextDelete(buffer->context);
 	pfree(buffer);
 }
 
@@ -1587,7 +1702,7 @@ GinBufferFree(GinBuffer *buffer)
  *
  * Returns true if the buffer is either empty or for the same index key.
  */
-static bool
+bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 {
 	/* empty buffer can accept data for any key */
@@ -1684,6 +1799,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1710,6 +1826,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1723,7 +1840,10 @@ _gin_parallel_merge(GinBuildState *state)
 		 * Remember data for the current tuple (either remember the new key,
		 * or append it to the existing data).
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
 
 		/* Report progress */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
@@ -1734,6 +1854,7 @@ _gin_parallel_merge(GinBuildState *state)
 	if (!GinBufferIsEmpty(buffer))
 	{
 		AssertCheckItemPointers(buffer);
+		Assert(!PointerIsValid(buffer->cached));
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1789,158 +1910,6 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 sortmem, true);
 }
 
-/*
- * _gin_process_worker_data
- *		First phase of the key merging, happening in the worker.
- *
- * Depending on the number of distinct keys, the TID lists produced by the
- * callback may be very short (due to frequent evictions in the callback).
- * But combining many tiny lists is expensive, so we try to do as much as
- * possible in the workers and only then pass the results to the leader.
- *
- * We read the tuples sorted by the key, and merge them into larger lists.
- * At the moment there's no memory limit, so this will just produce one
- * huge (sorted) list per key in each worker. Which means the leader will
- * do a very limited number of mergesorts, which is good.
- */
-static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
-						 bool progress)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-
-	GinBuffer  *buffer;
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The workers are limited to the same amount of memory as during the sort
-	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
-	 * during planning, just like there.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/* sort the raw per-worker data */
-	if (progress)
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
-									 PROGRESS_GIN_PHASE_PERFORMSORT_1);
-
-	tuplesort_performsort(state->bs_worker_sort);
-
-	/* reset the number of GIN tuples produced by this worker */
-	state->bs_numtuples = 0;
-
-	if (progress)
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
-									 PROGRESS_GIN_PHASE_MERGE_1);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
-	 * merge them into larger chunks for the leader to combine.
-	 */
-	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
-	{
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-			state->bs_numtuples++;
-
-			pfree(ntup);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferStoreTuple(buffer, tup);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		GinTuple   *ntup;
-		Size		ntuplen;
-
-		AssertCheckItemPointers(buffer);
-
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems, &ntuplen);
-
-		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-		state->bs_numtuples++;
-
-		pfree(ntup);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(worker_sort);
-}
-
 /*
  * Perform a worker's portion of a parallel GIN index build sort.
  *
@@ -2007,12 +1976,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													coordinate,
 													TUPLESORT_NONE);
 
-	/* Local per-worker sort of raw-data */
-	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
-													  state->work_mem,
-													  NULL,
-													  TUPLESORT_NONE);
-
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -2026,13 +1989,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* write remaining accumulated entries */
 	ginFlushBuildState(state, index);
 
-	/*
-	 * Do the first phase of in-worker processing - sort the data produced by
-	 * the callback, and combine them into much larger chunks and place that
-	 * into the shared tuplestore for leader to process.
-	 */
-	_gin_process_worker_data(state, state->bs_worker_sort, progress);
-
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
@@ -2187,8 +2143,7 @@ typedef struct
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
-				 ItemPointerData *items, uint32 nitems,
-				 Size *len)
+				 ItemPointerData *items, uint32 nitems)
 {
 	GinTuple   *tuple;
 	char	   *ptr;
@@ -2256,8 +2211,6 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	 */
 	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
-	*len = tuplen;
-
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
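With the *len output parameter gone, callers rely on _gin_build_tuple()
filling in the tuplen field of the GinTuple itself. A minimal sketch of
the caller side (control flow simplified, names as in the patch):

	GinTuple   *tup;

	tup = _gin_build_tuple(attnum, category, key, attlen, attbyval,
						   list, nlist);

	/* the serialized length now travels with the tuple */
	tuplesort_putgintuple(buildstate->bs_sortstate, tup);

	pfree(tup);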
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2ef32d53a43..7f346325678 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -395,6 +395,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2244,6 +2245,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2369,6 +2372,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
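To make the contract of the new callback concrete, this is roughly how
the tuplesort core drives it at both call sites (a simplified sketch;
have_tuples stands in for the real loop conditions): writetup() may hold
tuples back, and FLUSHWRITES() is its last chance to emit them before
the run is closed.

	/* dumptuples() / mergeonerun(), simplified */
	while (have_tuples)
		WRITETUP(state, state->destTape, stup);

	/* let a buffering writetup() emit whatever it held back */
	FLUSHWRITES(state, state->destTape);

	/* only now is it safe to write the end-of-run marker */
	markrunend(state->destTape);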
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb8601e2257..a106cc79efd 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "access/gin.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -90,6 +91,7 @@ static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
 static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
 							   SortTuple *stup);
+static void flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape);
 static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -101,6 +103,7 @@ static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static void freestate_cluster(Tuplesortstate *state);
+static void freestate_index_gin(Tuplesortstate *state);
 
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the CLUSTER case.  Set by
@@ -135,6 +138,16 @@ typedef struct
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
 } TuplesortIndexBTreeArg;
 
+/*
+ * Data structure pointed by "TuplesortPublic.arg" for the index_gin subcase.
+ */
+typedef struct
+{
+	TuplesortIndexArg index;
+	GinBuffer  *buffer;
+} TuplesortIndexGinArg;
+
+
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the index_hash subcase.
  */
@@ -209,6 +222,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -285,6 +299,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -393,6 +408,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -472,6 +488,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -516,6 +533,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -571,6 +589,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -587,6 +606,7 @@ tuplesort_begin_index_gin(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   sortopt);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg;
 	MemoryContext oldcontext;
 	int			i;
 	TupleDesc	desc = RelationGetDescr(indexRel);
@@ -611,6 +631,10 @@ tuplesort_begin_index_gin(Relation heapRel,
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
 										   sizeof(SortSupportData));
+	arg = palloc0(sizeof(TuplesortIndexGinArg));
+	arg->index.indexRel = indexRel;
+	arg->index.heapRel = heapRel;
+	arg->buffer = GinBufferInit(indexRel);
 
 	for (i = 0; i < base->nKeys; i++)
 	{
@@ -640,9 +664,11 @@ tuplesort_begin_index_gin(Relation heapRel,
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
 	base->writetup = writetup_index_gin;
+	base->flushwrites = flushwrites_index_gin;
 	base->readtup = readtup_index_gin;
+	base->freestate = freestate_index_gin;
 	base->haveDatum1 = false;
-	base->arg = NULL;
+	base->arg = arg;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -683,6 +709,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -885,17 +912,17 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 }
 
 void
-tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple)
 {
 	SortTuple	stup;
 	GinTuple   *ctup;
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
-	Size		tuplen;
+	Size		tuplen = tuple->tuplen;
 
 	/* copy the GinTuple into the right memory context */
-	ctup = palloc(size);
-	memcpy(ctup, tuple, size);
+	ctup = palloc(tuplen);
+	memcpy(ctup, tuple, tuplen);
 
 	stup.tuple = ctup;
 	stup.datum1 = (Datum) 0;
@@ -903,7 +930,7 @@ tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
 
 	/* GetMemoryChunkSpace is not supported for bump contexts */
 	if (TupleSortUseBumpTupleCxt(base->sortopt))
-		tuplen = MAXALIGN(size);
+		tuplen = MAXALIGN(tuplen);
 	else
 		tuplen = GetMemoryChunkSpace(ctup);
 
@@ -1923,19 +1950,63 @@ comparetup_index_gin(const SortTuple *a, const SortTuple *b,
 }
 
 static void
-writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+_writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, GinTuple *tup)
 {
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
-	GinTuple   *tuple = (GinTuple *) stup->tuple;
-	unsigned int tuplen = tuple->tuplen;
+	unsigned int tuplen = tup->tuplen;
 
 	tuplen = tuplen + sizeof(tuplen);
+
 	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
-	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	LogicalTapeWrite(tape, tup, tup->tuplen);
+
 	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 }
 
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *otup;
+	GinTuple   *ntup = (GinTuple *) stup->tuple;
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(PointerIsValid(arg));
+
+	if (GinBufferCanAddKey(arg->buffer, ntup))
+	{
+		GinBufferMergeTuple(arg->buffer, ntup);
+		return;
+	}
+
+	otup = GinBufferBuildTuple(arg->buffer);
+
+	_writetup_index_gin(state, tape, otup);
+
+	pfree(otup);
+
+	Assert(GinBufferCanAddKey(arg->buffer, ntup));
+
+	GinBufferMergeTuple(arg->buffer, ntup);
+}
+
+static void
+flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	if (!GinBufferIsEmpty(arg->buffer))
+	{
+		GinTuple   *tuple = GinBufferBuildTuple(arg->buffer);
+
+		_writetup_index_gin(state, tape, tuple);
+		pfree(tuple);
+		Assert(GinBufferIsEmpty(arg->buffer));
+	}
+}
+
 static void
 readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 				  LogicalTape *tape, unsigned int len)
@@ -1961,6 +2032,17 @@ readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = (Datum) 0;
 }
 
+static void
+freestate_index_gin(Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(arg != NULL);
+	Assert(GinBufferIsEmpty(arg->buffer));
+	GinBufferFree(arg->buffer);
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 95d8805b66f..da4351c3d3d 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -478,6 +478,9 @@ extern int	ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len, TI
 
 extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *segment, int len,
 												   int *ndecoded_out);
+extern bool ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+												ItemPointer into, int capacity,
+												int *ndecoded_out);
 extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded_out);
 extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
 										ItemPointerData *b, uint32 nb,
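The new ...DecodeAllSegmentsInto() variant decodes into a caller-supplied
array instead of allocating a fresh one, presumably so the GIN buffer
code can reuse a preallocated TID array. A sketch of the intended call
pattern -- the bool return signalling whether the items fit into the
given capacity is my reading of the declaration, so treat that as an
assumption:

	ItemPointerData *items = palloc(maxitems * sizeof(ItemPointerData));
	int			ndecoded;

	if (!ginPostingListDecodeAllSegmentsInto(segments, len,
											 items, maxitems, &ndecoded))
		elog(ERROR, "posting list does not fit into the provided array");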
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index ce555031335..4de7b5c32b5 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -39,6 +39,16 @@ GinTupleGetFirst(GinTuple *tup)
 	return &list->first;
 }
 
+typedef struct GinBuffer GinBuffer;
+
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
 
+extern GinBuffer *GinBufferInit(Relation index);
+extern bool GinBufferIsEmpty(GinBuffer *buffer);
+extern bool GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup);
+extern void GinBufferReset(GinBuffer *buffer);
+extern void GinBufferFree(GinBuffer *buffer);
+extern void GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup);
+extern GinTuple *GinBufferBuildTuple(GinBuffer *buffer);
+
 #endif							/* GIN_TUPLE_H */
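The now-exported GinBuffer API is meant to be driven in an
accumulate-and-flush loop; here is a sketch of the canonical consumer
(mirroring writetup_index_gin/flushwrites_index_gin above; next_tuple
and emit_tuple are hypothetical stand-ins for the tuple source and sink):

	GinBuffer  *buffer = GinBufferInit(index);
	GinTuple   *tup;

	while ((tup = next_tuple()) != NULL)
	{
		if (!GinBufferCanAddKey(buffer, tup))
		{
			/* key changed: serialize the accumulated data first */
			GinTuple   *out = GinBufferBuildTuple(buffer);

			emit_tuple(out);
			pfree(out);
		}
		GinBufferMergeTuple(buffer, tup);
	}

	/* final flush; GinBufferBuildTuple() also resets the buffer */
	if (!GinBufferIsEmpty(buffer))
	{
		GinTuple   *out = GinBufferBuildTuple(buffer);

		emit_tuple(out);
		pfree(out);
	}
	GinBufferFree(buffer);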
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..64176b23cbe 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -195,6 +195,14 @@ typedef struct
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
 
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient use of
+	 * the tape's resources, e.g. when deduplicating or merging values.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
+
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
 	 * the already-read length of the stored tuple.  The tuple is allocated
@@ -461,7 +469,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
-extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, struct GinTuple *tuple);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9840060997f..522e98109ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3037,6 +3037,7 @@ TuplesortClusterArg
 TuplesortDatumArg
 TuplesortIndexArg
 TuplesortIndexBTreeArg
+TuplesortIndexGinArg
 TuplesortIndexHashArg
 TuplesortInstrumentation
 TuplesortMethod
-- 
2.48.1

v20250303-0004-WIP-parallel-inserts-into-GIN-index.patch (text/x-patch)
From e21754e663e5bebfe005dd95d8e61184d4e18b05 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 25 Feb 2025 16:16:24 +0100
Subject: [PATCH v20250303 4/4] WIP: parallel inserts into GIN index

---
 src/backend/access/gin/gininsert.c            | 450 +++++++++++-------
 .../utils/activity/wait_event_names.txt       |   2 +
 2 files changed, 286 insertions(+), 166 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e873443784a..750c0c3270d 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -26,7 +26,9 @@
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "pgstat.h"
+#include "storage/barrier.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "storage/predicate.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/datum.h"
@@ -42,6 +44,11 @@
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
 
+/* The phases for parallel builds, used by build_barrier. */
+#define GIN_BUILD_INIT					0
+#define GIN_BUILD_SCAN					1
+#define GIN_BUILD_PARTITION				2
+
 /*
  * Status for index builds performed in parallel.  This is allocated in a
  * dynamic shared memory segment.
@@ -88,6 +95,9 @@ typedef struct GinBuildShared
 	double		reltuples;
 	double		indtuples;
 
+	Barrier		build_barrier;
+	SharedFileSet fileset;		/* space for shared temporary files */
+
 	/*
 	 * ParallelTableScanDescData data follows. Can't directly embed here, as
 	 * implementations of the parallel table scan desc interface might need
@@ -173,7 +183,6 @@ static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relati
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
 static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
-static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
 											  Relation heap, Relation index);
 static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
@@ -189,6 +198,12 @@ static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems);
 
+static double _gin_partition_sorted_data(GinBuildState *state);
+static void _gin_parallel_insert(GinBuildState *state,
+								 GinBuildShared *ginshared,
+								 Relation heap, Relation index,
+								 bool progress);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -699,8 +714,12 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
 
-		/* scan the relation in parallel and merge per-worker results */
-		reltuples = _gin_parallel_merge(state);
+		/* partition the sorted data */
+		reltuples = _gin_partition_sorted_data(state);
+
+		/* do the insert for the leader's partition */
+		_gin_parallel_insert(state, state->bs_leader->ginshared,
+							 heap, index, true);
 
 		_gin_end_parallel(state->bs_leader, state);
 	}
@@ -989,6 +1008,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	ginshared->reltuples = 0.0;
 	ginshared->indtuples = 0.0;
 
+	/* used to wait for data to insert */
+	BarrierInit(&ginshared->build_barrier, scantuplesortstates);
+
+	/* Set up the space we'll use for shared temporary files. */
+	SharedFileSetInit(&ginshared->fileset, pcxt->seg);
+
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
 								  snapshot);
@@ -1056,6 +1081,11 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned, leader continues");
 }
 
 /*
@@ -1069,6 +1099,8 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	/* Shutdown worker processes */
 	WaitForParallelWorkersToFinish(ginleader->pcxt);
 
+	SharedFileSetDeleteAll(&ginleader->ginshared->fileset);
+
 	/*
 	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
 	 * or we might get incomplete data.)
@@ -1713,169 +1745,6 @@ GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 	return GinBufferKeyEquals(buffer, tup);
 }
 
-/*
- * Within leader, wait for end of heap scan and merge per-worker results.
- *
- * After waiting for all workers to finish, merge the per-worker results into
- * the complete index. The results from each worker are sorted by block number
- * (start of the page range). While combinig the per-worker results we merge
- * summaries for the same page range, and also fill-in empty summaries for
- * ranges without any tuples.
- *
- * Returns the total number of heap tuples scanned.
- */
-static double
-_gin_parallel_merge(GinBuildState *state)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-	double		reltuples = 0;
-	GinBuffer  *buffer;
-
-	/* GIN tuples from workers, merged by leader */
-	double		numtuples = 0;
-
-	/* wait for workers to scan table and produce partial results */
-	reltuples = _gin_parallel_heapscan(state);
-
-	/* Execute the sort */
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
-								 PROGRESS_GIN_PHASE_PERFORMSORT_2);
-
-	/* do the actual sort in the leader */
-	tuplesort_performsort(state->bs_sortstate);
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The leader is allowed to use the whole maintenance_work_mem buffer to
-	 * combine data. The parallel workers already completed.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/*
-	 * Set the progress target for the next phase.  Reset the block number
-	 * values set by table_index_build_scan
-	 */
-	{
-		const int	progress_index[] = {
-			PROGRESS_CREATEIDX_SUBPHASE,
-			PROGRESS_CREATEIDX_TUPLES_TOTAL,
-			PROGRESS_SCAN_BLOCKS_TOTAL,
-			PROGRESS_SCAN_BLOCKS_DONE
-		};
-		const int64 progress_vals[] = {
-			PROGRESS_GIN_PHASE_MERGE_2,
-			state->bs_numtuples,
-			0, 0
-		};
-
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
-	}
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by category and
-	 * key. That probably gives us order matching how data is organized in the
-	 * index.
-	 *
-	 * We don't insert the GIN tuples right away, but instead accumulate as
-	 * many TIDs for the same key as possible, and then insert that at once.
-	 * This way we don't need to decompress/recompress the posting lists, etc.
-	 */
-	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
-	{
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nitems, &state->buildStats);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nfrozen, &state->buildStats);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferMergeTuple(buffer, tup);
-
-		if (buffer->cached)
-			GinBufferUnpackCached(buffer, 0);
-
-		/* Report progress */
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-									 ++numtuples);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		AssertCheckItemPointers(buffer);
-		Assert(!PointerIsValid(buffer->cached));
-
-		ginEntryInsert(&state->ginstate,
-					   buffer->attnum, buffer->key, buffer->category,
-					   buffer->items, buffer->nitems, &state->buildStats);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-
-		/* Report progress */
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-									 ++numtuples);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(state->bs_sortstate);
-
-	return reltuples;
-}
-
 /*
  * Returns size of shared memory required to store state for a parallel
  * gin index build based on the snapshot its parallel scan will use.
@@ -2093,6 +1962,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Prepare to track buffer usage during parallel execution */
 	InstrStartParallelQuery();
 
+	/* attach to the fileset too */
+	SharedFileSetAttach(&ginshared->fileset, seg);
+
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
 	 * (when requested number of workers were not launched, this will be
@@ -2103,6 +1975,20 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
 
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned by workers, leader continues");
+
+	/* leader sorts and partitions the data */
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned by leader, worker continues");
+
+	_gin_parallel_insert(&buildstate, ginshared, heapRel, indexRel, false);
+
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
@@ -2375,3 +2261,235 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	return ItemPointerCompare(GinTupleGetFirst(a),
 							  GinTupleGetFirst(b));
 }
+
+static double
+_gin_partition_sorted_data(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	GinBuildShared *shared = state->bs_leader->ginshared;
+	BufFile   **files;
+	int64		fileidx = 0;
+	double		reltuples;
+
+	/* how many tuples per worker */
+	int64		worker_tuples = (state->indtuples / shared->scantuplesortstates) + 1;
+	int64		remaining = Min(worker_tuples, 1000);
+	int64		ntmp = 0;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* Allocate BufFiles, one for each participant. */
+	files = palloc0_array(BufFile *, shared->scantuplesortstates);
+
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		char		fname[MAXPGPATH];
+
+		sprintf(fname, "worker-%d", i);
+
+		files[i] = BufFileCreateFileSet(&shared->fileset.fs, fname);
+	}
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us order matching how data is organized in the
+	 * index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		ntmp++;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * FIXME Maybe move to next partition only when the index key changes?
+		 * Otherwise we might have issues with 'could not fit onto page' when
+		 * adding overlapping TID lists to the index. But maybe it can't with
+		 * the merging of data in the tuplesort?
+		 */
+
+		BufFileWrite(files[fileidx], &tuplen, sizeof(tuplen));
+		BufFileWrite(files[fileidx], tup, tuplen);
+
+		remaining--;
+
+		/* move to the next file */
+		if (remaining == 0)
+		{
+			remaining = Min(worker_tuples, 1000);
+			fileidx++;
+			fileidx = fileidx % shared->scantuplesortstates;
+		}
+	}
+
+	/* close the files */
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		BufFileClose(files[i]);
+	}
+
+	/* and also close the tuplesort */
+	tuplesort_end(state->bs_sortstate);
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&shared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned, leader continues");
+
+	return reltuples;
+}
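To make the chunking concrete (numbers hypothetical): with
indtuples = 10,000,000 and 4 participants, worker_tuples comes out to
2,500,001, so the Min() caps each contiguous run at 1,000 tuples and the
leader deals the sorted stream to the 4 files in 1,000-tuple slices,
round-robin. Each participant therefore receives a sample of the whole
key space rather than one contiguous key range.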
+
+static void
+_gin_parallel_insert(GinBuildState *state, GinBuildShared *ginshared,
+					 Relation heap, Relation index, bool progress)
+{
+	GinBuffer  *buffer;
+	GinTuple   *tup;
+	Size		len;
+
+	BufFile    *file;
+	char		fname[MAXPGPATH];
+	char	   *buff;
+	int64		ntuples = 0;
+	Size		maxlen;
+
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * The leader is allowed to use the whole maintenance_work_mem buffer to
+	 * combine data. The parallel workers already completed.
+	 */
+	buffer = GinBufferInit(state->ginstate.index);
+
+
+	sprintf(fname, "worker-%d", ParallelWorkerNumber + 1);
+	file = BufFileOpenFileSet(&ginshared->fileset.fs, fname, O_RDONLY, false);
+
+	/* 8kB seems like a reasonable starting point */
+	maxlen = 8192;
+	buff = palloc(maxlen);
+
+	while (true)
+	{
+		size_t		ret;
+
+		ret = BufFileRead(file, &len, sizeof(len));
+
+		if (ret == 0)
+			break;
+		if (ret != sizeof(len))
+			elog(ERROR, "incorrect data: read %zu bytes, expected %zu",
+				 ret, sizeof(len));
+
+		/* maybe resize the buffer */
+		if (maxlen < len)
+		{
+			while (maxlen < len)
+				maxlen *= 2;
+
+			buff = repalloc(buff, maxlen);
+		}
+
+		tup = (GinTuple *) buff;
+
+
+		BufFileReadExact(file, tup, len);
+
+		ntuples++;
+
+		if (ntuples % 100000 == 0)
+			elog(LOG, "inserted " INT64_FORMAT " tuples", ntuples);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the index, and start a new entry for the current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * The buffer is getting too large - flush the frozen part of
+			 * the TID list into the index, and keep accumulating the rest.
+			 */
+			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data).
+		 */
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer);
+
+		Assert(!PointerIsValid(buffer->cached));
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	BufFileClose(file);
+}
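Pulling the synchronization out of the diff, the build_barrier
choreography amounts to the following (a simplified sketch; every
participant, leader included, attaches to the barrier):

	/* all participants: finish the heap scan, then rendezvous */
	BarrierArriveAndWait(&shared->build_barrier, WAIT_EVENT_GIN_BUILD_SCAN);

	/*
	 * Leader only: sort the stream and split it into per-participant
	 * files; _gin_partition_sorted_data() arrives at the barrier again
	 * once the files are complete.
	 */

	/* all participants: wait until the partitions exist */
	BarrierArriveAndWait(&shared->build_barrier, WAIT_EVENT_GIN_BUILD_PARTITION);

	/* all participants: insert their own partition into the index */
	_gin_parallel_insert(state, shared, heap, index, progress);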
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..afb9be848a0 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -116,6 +116,8 @@ CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from star
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
 CHECKPOINT_START	"Waiting for a checkpoint to start."
 EXECUTE_GATHER	"Waiting for activity from a child process while executing a <literal>Gather</literal> plan node."
+GIN_BUILD_PARTITION	"Waiting for partitioning of data during a parallel GIN index build."
+GIN_BUILD_SCAN	"Waiting for scan of data during a parallel GIN index build."
 HASH_BATCH_ALLOCATE	"Waiting for an elected Parallel Hash participant to allocate a hash table."
 HASH_BATCH_ELECT	"Waiting to elect a Parallel Hash participant to allocate a hash table."
 HASH_BATCH_LOAD	"Waiting for other Parallel Hash participants to finish loading a hash table."
-- 
2.48.1

#56Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#55)
2 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

I pushed the two smaller parts today.

Here are the remaining two parts, to keep cfbot happy. I don't expect to
get these into PG18, though.

regards

--
Tomas Vondra

Attachments:

v20250304-0001-Use-a-single-GIN-tuplesort.patch (text/x-patch)
From bea52f76255830af45b7122b0fa5786997182cf5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 25 Feb 2025 16:12:37 +0100
Subject: [PATCH v20250304 1/2] Use a single GIN tuplesort

The previous approach was to sort the data in a private per-worker sort,
read it back, merge the GinTuples, and write the result into the shared
sort for the leader to consume.

The new approach is to use a single sort, merging tuples as we write
them to disk.  This reduces temporary disk space usage.

An optimization was added to GinBuffer in which we don't deserialize
tuples unless we need access to the item pointers (TIDs).

This modifies Tuplesort to have a new flushwrites callback. Sort's
writetup can now decide to buffer writes until the next flushwrites()
callback.
---
 src/backend/access/gin/gininsert.c         | 411 +++++++++------------
 src/backend/utils/sort/tuplesort.c         |   5 +
 src/backend/utils/sort/tuplesortvariants.c | 102 ++++-
 src/include/access/gin_private.h           |   3 +
 src/include/access/gin_tuple.h             |  10 +
 src/include/utils/tuplesort.h              |  10 +-
 src/tools/pgindent/typedefs.list           |   1 +
 7 files changed, 302 insertions(+), 240 deletions(-)
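
The GinBuffer optimization mentioned above boils down to keeping the
first tuple seen for a key in its serialized form; a condensed sketch of
the relevant branch in GinBufferMergeTuple() (surrounding merge logic
omitted):

	if (GinBufferIsEmpty(buffer))
	{
		/* first tuple for a key: remember the serialized form */
		buffer->cached = memcpy(palloc(tup->tuplen), tup, tup->tuplen);
	}
	else if (buffer->cached != NULL)
	{
		/*
		 * Another tuple for the same key: unpack the cached copy so
		 * the TID lists can be merged.
		 */
		GinBufferUnpackCached(buffer, tup->nitems);
	}

If a key is never seen again, GinBufferBuildTuple() hands the cached
tuple back verbatim, skipping the rebuild of the posting list.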

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b2f89cad880..e873443784a 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -164,14 +164,6 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
-
-	/*
-	 * The sortstate used only within a single worker for the first merge pass
-	 * happenning there. In principle it doesn't need to be part of the build
-	 * state and we could pass it around directly, but it's more convenient
-	 * this way. And it's part of the build state, after all.
-	 */
-	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -195,8 +187,7 @@ static Datum _gin_parse_tuple_key(GinTuple *a);
 
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
-								  ItemPointerData *items, uint32 nitems,
-								  Size *len);
+								  ItemPointerData *items, uint32 nitems);
 
 /*
  * Adds array of item pointers to tuple's posting list, or
@@ -499,16 +490,15 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
-		Size		tuplen;
 
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
 		tup = _gin_build_tuple(attnum, category,
 							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+		tuplesort_putgintuple(buildstate->bs_sortstate, tup);
 
 		pfree(tup);
 	}
@@ -1144,8 +1134,14 @@ _gin_parallel_heapscan(GinBuildState *state)
  * during the initial table scan (and detecting when the scan wraps around),
  * and during merging (where we do mergesort).
  */
-typedef struct GinBuffer
+struct GinBuffer
 {
+	/*
+	 * The memory context holds the dynamic allocation of items, key, and any
+	 * produced GinTuples.
+	 */
+	MemoryContext context;
+	GinTuple   *cached;			/* copy of previous GIN tuple */
 	OffsetNumber attnum;
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
@@ -1163,7 +1159,7 @@ typedef struct GinBuffer
 	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
-} GinBuffer;
+};
 
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
@@ -1174,8 +1170,7 @@ AssertCheckItemPointers(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
+	Assert(buffer->nitems == 0 || buffer->items != NULL);
 
 	for (int i = 0; i < buffer->nitems; i++)
 	{
@@ -1201,7 +1196,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* if we have any items, the array must exist */
-	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+	Assert((buffer->nitems == 0) || (buffer->items != NULL));
 
 	/*
 	 * The buffer may be empty, in which case we must not call the check of
@@ -1225,7 +1220,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
  *
  * Initializes sort support procedures for all index attributes.
  */
-static GinBuffer *
+GinBuffer *
 GinBufferInit(Relation index)
 {
 	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
@@ -1288,15 +1283,18 @@ GinBufferInit(Relation index)
 
 		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
 	}
+	buffer->context = GenerationContextCreate(CurrentMemoryContext,
+											  "Gin Buffer",
+											  ALLOCSET_DEFAULT_SIZES);
 
 	return buffer;
 }
 
 /* Is the buffer empty, i.e. has no TID values in the array? */
-static bool
+bool
 GinBufferIsEmpty(GinBuffer *buffer)
 {
-	return (buffer->nitems == 0);
+	return (buffer->nitems == 0 && buffer->cached == NULL);
 }
 
 /*
@@ -1312,37 +1310,71 @@ GinBufferIsEmpty(GinBuffer *buffer)
 static bool
 GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	int			r;
+	AttrNumber	attnum;
 	Datum		tupkey;
+	Datum		bufkey;
 
 	AssertCheckGinBuffer(buffer);
+	if (buffer->cached)
+	{
+		GinTuple   *cached = buffer->cached;
 
-	if (tup->attrnum != buffer->attnum)
-		return false;
+		if (tup->attrnum != cached->attrnum)
+			return false;
 
-	/* same attribute should have the same type info */
-	Assert(tup->typbyval == buffer->typbyval);
-	Assert(tup->typlen == buffer->typlen);
+		Assert(tup->typbyval == cached->typbyval);
+		Assert(tup->typlen == cached->typlen);
 
-	if (tup->category != buffer->category)
-		return false;
+		if (tup->category != cached->category)
+			return false;
 
-	/*
-	 * For NULL/empty keys, this means equality, for normal keys we need to
-	 * compare the actual key value.
-	 */
-	if (buffer->category != GIN_CAT_NORM_KEY)
-		return true;
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (cached->category != GIN_CAT_NORM_KEY)
+			return true;
+
+		attnum = cached->attrnum;
+		bufkey = _gin_parse_tuple_key(cached);
+	}
+	else
+	{
+		if (tup->attrnum != buffer->attnum)
+			return false;
+
+		/* same attribute should have the same type info */
+		Assert(tup->typbyval == buffer->typbyval);
+		Assert(tup->typlen == buffer->typlen);
+
+		if (tup->category != buffer->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (buffer->category != GIN_CAT_NORM_KEY)
+			return true;
+		attnum = buffer->attnum;
+		bufkey = buffer->key;
+	}
 
 	/*
 	 * For the tuple, get either the first sizeof(Datum) bytes for byval
 	 * types, or a pointer to the beginning of the data array.
 	 */
-	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+	tupkey = _gin_parse_tuple_key(tup);
+
+	prev = MemoryContextSwitchTo(buffer->context);
 
-	r = ApplySortComparator(buffer->key, false,
+	r = ApplySortComparator(bufkey, false,
 							tupkey, false,
-							&buffer->ssup[buffer->attnum - 1]);
+							&buffer->ssup[attnum - 1]);
+
+	MemoryContextSwitchTo(prev);
 
 	return (r == 0);
 }
@@ -1389,6 +1421,56 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 	return true;
 }
 
+static void
+GinBufferUnpackCached(GinBuffer *buffer, int reserve_space)
+{
+	Datum		key;
+	ItemPointer items;
+	GinTuple   *cached;
+	int			totitems;
+
+	cached = buffer->cached;
+	totitems = cached->nitems + reserve_space;
+	key = _gin_parse_tuple_key(cached);
+
+	buffer->category = cached->category;
+	buffer->keylen = cached->keylen;
+	buffer->attnum = cached->attrnum;
+
+	buffer->typlen = cached->typlen;
+	buffer->typbyval = cached->typbyval;
+
+	if (cached->category == GIN_CAT_NORM_KEY)
+		buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+	else
+		buffer->key = (Datum) 0;
+
+	items = _gin_parse_tuple_items(cached);
+
+	if (buffer->items == NULL)
+	{
+		buffer->items = palloc0(totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else if (buffer->maxitems < totitems)
+	{
+		buffer->items = repalloc(buffer->items,
+								 totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else
+	{
+		Assert(PointerIsValid(buffer->items) &&
+			   buffer->maxitems >= totitems);
+	}
+	memcpy(buffer->items, items, buffer->nitems * sizeof(ItemPointerData));
+	buffer->nitems = cached->nitems;
+
+	buffer->cached = NULL;
+	pfree(cached);
+	pfree(items);
+}
+
 /*
  * GinBufferStoreTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
@@ -1412,32 +1494,29 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * workers. But the workers merge the items as much as possible, so there
  * should not be too many.
  */
-static void
-GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+void
+GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	ItemPointerData *items;
-	Datum		key;
 
+	prev = MemoryContextSwitchTo(buffer->context);
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple_key(tup);
-	items = _gin_parse_tuple_items(tup);
-
 	/* if the buffer is empty, set the fields (and copy the key) */
 	if (GinBufferIsEmpty(buffer))
 	{
-		buffer->category = tup->category;
-		buffer->keylen = tup->keylen;
-		buffer->attnum = tup->attrnum;
-
-		buffer->typlen = tup->typlen;
-		buffer->typbyval = tup->typbyval;
+		GinTuple   *tuple = palloc(tup->tuplen);
 
-		if (tup->category == GIN_CAT_NORM_KEY)
-			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
-		else
-			buffer->key = (Datum) 0;
+		memcpy(tuple, tup, tup->tuplen);
+		buffer->cached = tuple;
 	}
+	else if (buffer->cached != NULL)
+	{
+		GinBufferUnpackCached(buffer, tup->nitems);
+	}
+
+	items = _gin_parse_tuple_items(tup);
 
 	/*
	 * Try to freeze TIDs at the beginning of the list, i.e. exclude them from
@@ -1515,20 +1594,54 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 
 	/* free the decompressed TID list */
 	pfree(items);
+
+	MemoryContextSwitchTo(prev);
+}
+
+GinTuple *
+GinBufferBuildTuple(GinBuffer *buffer)
+{
+	MemoryContext prev = MemoryContextSwitchTo(buffer->context);
+	GinTuple   *result;
+
+	if (buffer->cached)
+	{
+		result = buffer->cached;
+		buffer->cached = NULL;
+	}
+	else
+	{
+		result = _gin_build_tuple(buffer->attnum, buffer->category,
+								  buffer->key, buffer->typlen,
+								  buffer->typbyval, buffer->items,
+								  buffer->nitems);
+	}
+
+	GinBufferReset(buffer);
+
+	MemoryContextSwitchTo(prev);
+	return result;
 }
 
 /*
  * GinBufferReset
  *		Reset the buffer into a state as if it contains no data.
  */
-static void
+void
 GinBufferReset(GinBuffer *buffer)
 {
 	Assert(!GinBufferIsEmpty(buffer));
 
-	/* release byref values, do nothing for by-val ones */
-	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
-		pfree(DatumGetPointer(buffer->key));
+	/* release cached buffer tuple, if present */
+	if (buffer->cached)
+		pfree(buffer->cached);
+	else
+	{
+		/* release byref values, do nothing for by-val ones */
+		if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval
+			&& PointerIsValid(DatumGetPointer(buffer->key)))
+			pfree(DatumGetPointer(buffer->key));
+	}
 
 	/*
	 * Not required, but makes it more likely to trigger NULL dereference if
@@ -1544,6 +1657,7 @@ GinBufferReset(GinBuffer *buffer)
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
+	/* Note that we don't reset the memory context; this is deliberate */
 }
 
 /*
@@ -1567,7 +1681,7 @@ GinBufferTrim(GinBuffer *buffer)
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
  */
-static void
+void
 GinBufferFree(GinBuffer *buffer)
 {
 	if (buffer->items)
@@ -1578,6 +1692,7 @@ GinBufferFree(GinBuffer *buffer)
 		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
 		pfree(DatumGetPointer(buffer->key));
 
+	MemoryContextDelete(buffer->context);
 	pfree(buffer);
 }
 
@@ -1587,7 +1702,7 @@ GinBufferFree(GinBuffer *buffer)
  *
  * Returns true if the buffer is either empty or for the same index key.
  */
-static bool
+bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 {
 	/* empty buffer can accept data for any key */
@@ -1684,6 +1799,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1710,6 +1826,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1723,7 +1840,10 @@ _gin_parallel_merge(GinBuildState *state)
 		 * Remember data for the current tuple (either remember the new key,
 	 * or append it to the existing data.
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
 
 		/* Report progress */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
@@ -1734,6 +1854,7 @@ _gin_parallel_merge(GinBuildState *state)
 	if (!GinBufferIsEmpty(buffer))
 	{
 		AssertCheckItemPointers(buffer);
+		Assert(!PointerIsValid(buffer->cached));
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1789,158 +1910,6 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 sortmem, true);
 }
 
-/*
- * _gin_process_worker_data
- *		First phase of the key merging, happening in the worker.
- *
- * Depending on the number of distinct keys, the TID lists produced by the
- * callback may be very short (due to frequent evictions in the callback).
- * But combining many tiny lists is expensive, so we try to do as much as
- * possible in the workers and only then pass the results to the leader.
- *
- * We read the tuples sorted by the key, and merge them into larger lists.
- * At the moment there's no memory limit, so this will just produce one
- * huge (sorted) list per key in each worker. Which means the leader will
- * do a very limited number of mergesorts, which is good.
- */
-static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
-						 bool progress)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-
-	GinBuffer  *buffer;
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The workers are limited to the same amount of memory as during the sort
-	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
-	 * during planning, just like there.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/* sort the raw per-worker data */
-	if (progress)
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
-									 PROGRESS_GIN_PHASE_PERFORMSORT_1);
-
-	tuplesort_performsort(state->bs_worker_sort);
-
-	/* reset the number of GIN tuples produced by this worker */
-	state->bs_numtuples = 0;
-
-	if (progress)
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
-									 PROGRESS_GIN_PHASE_MERGE_1);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
-	 * merge them into larger chunks for the leader to combine.
-	 */
-	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
-	{
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-			state->bs_numtuples++;
-
-			pfree(ntup);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			GinTuple   *ntup;
-			Size		ntuplen;
-
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen, &ntuplen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-
-			pfree(ntup);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferStoreTuple(buffer, tup);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		GinTuple   *ntup;
-		Size		ntuplen;
-
-		AssertCheckItemPointers(buffer);
-
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems, &ntuplen);
-
-		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
-		state->bs_numtuples++;
-
-		pfree(ntup);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(worker_sort);
-}
-
 /*
  * Perform a worker's portion of a parallel GIN index build sort.
  *
@@ -2007,12 +1976,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													coordinate,
 													TUPLESORT_NONE);
 
-	/* Local per-worker sort of raw-data */
-	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
-													  state->work_mem,
-													  NULL,
-													  TUPLESORT_NONE);
-
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -2026,13 +1989,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* write remaining accumulated entries */
 	ginFlushBuildState(state, index);
 
-	/*
-	 * Do the first phase of in-worker processing - sort the data produced by
-	 * the callback, and combine them into much larger chunks and place that
-	 * into the shared tuplestore for leader to process.
-	 */
-	_gin_process_worker_data(state, state->bs_worker_sort, progress);
-
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
@@ -2187,8 +2143,7 @@ typedef struct
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
-				 ItemPointerData *items, uint32 nitems,
-				 Size *len)
+				 ItemPointerData *items, uint32 nitems)
 {
 	GinTuple   *tuple;
 	char	   *ptr;
@@ -2256,8 +2211,6 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	 */
 	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
-	*len = tuplen;
-
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2ef32d53a43..7f346325678 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -395,6 +395,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2244,6 +2245,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2369,6 +2372,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb8601e2257..a106cc79efd 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -32,6 +32,7 @@
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
 #include "utils/tuplesort.h"
+#include "access/gin.h"
 
 
 /* sort-type codes for sort__start probes */
@@ -90,6 +91,7 @@ static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
 static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
 							   SortTuple *stup);
+static void flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape);
 static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -101,6 +103,7 @@ static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static void freestate_cluster(Tuplesortstate *state);
+static void freestate_index_gin(Tuplesortstate *state);
 
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the CLUSTER case.  Set by
@@ -135,6 +138,16 @@ typedef struct
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
 } TuplesortIndexBTreeArg;
 
+/*
+ * Data structure pointed by "TuplesortPublic.arg" for the index_gin subcase.
+ */
+typedef struct
+{
+	TuplesortIndexArg index;
+	GinBuffer  *buffer;
+} TuplesortIndexGinArg;
+
+
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the index_hash subcase.
  */
@@ -209,6 +222,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -285,6 +299,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -393,6 +408,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -472,6 +488,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -516,6 +533,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -571,6 +589,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -587,6 +606,7 @@ tuplesort_begin_index_gin(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   sortopt);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg;
 	MemoryContext oldcontext;
 	int			i;
 	TupleDesc	desc = RelationGetDescr(indexRel);
@@ -611,6 +631,10 @@ tuplesort_begin_index_gin(Relation heapRel,
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
 										   sizeof(SortSupportData));
+	arg = palloc0(sizeof(TuplesortIndexGinArg));
+	arg->index.indexRel = indexRel;
+	arg->index.heapRel = heapRel;
+	arg->buffer = GinBufferInit(indexRel);
 
 	for (i = 0; i < base->nKeys; i++)
 	{
@@ -640,9 +664,11 @@ tuplesort_begin_index_gin(Relation heapRel,
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
 	base->writetup = writetup_index_gin;
+	base->flushwrites = flushwrites_index_gin;
 	base->readtup = readtup_index_gin;
+	base->freestate = freestate_index_gin;
 	base->haveDatum1 = false;
-	base->arg = NULL;
+	base->arg = arg;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -683,6 +709,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -885,17 +912,17 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 }
 
 void
-tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple)
 {
 	SortTuple	stup;
 	GinTuple   *ctup;
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
-	Size		tuplen;
+	Size		tuplen = tuple->tuplen;
 
 	/* copy the GinTuple into the right memory context */
-	ctup = palloc(size);
-	memcpy(ctup, tuple, size);
+	ctup = palloc(tuplen);
+	memcpy(ctup, tuple, tuplen);
 
 	stup.tuple = ctup;
 	stup.datum1 = (Datum) 0;
@@ -903,7 +930,7 @@ tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
 
 	/* GetMemoryChunkSpace is not supported for bump contexts */
 	if (TupleSortUseBumpTupleCxt(base->sortopt))
-		tuplen = MAXALIGN(size);
+		tuplen = MAXALIGN(tuplen);
 	else
 		tuplen = GetMemoryChunkSpace(ctup);
 
@@ -1923,19 +1950,63 @@ comparetup_index_gin(const SortTuple *a, const SortTuple *b,
 }
 
 static void
-writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+_writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, GinTuple *tup)
 {
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
-	GinTuple   *tuple = (GinTuple *) stup->tuple;
-	unsigned int tuplen = tuple->tuplen;
+	unsigned int tuplen = tup->tuplen;
 
 	tuplen = tuplen + sizeof(tuplen);
+
 	LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
-	LogicalTapeWrite(tape, tuple, tuple->tuplen);
+	LogicalTapeWrite(tape, tup, tup->tuplen);
+
 	if (base->sortopt & TUPLESORT_RANDOMACCESS) /* need trailing length word? */
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 }
 
+static void
+writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *otup;
+	GinTuple   *ntup = (GinTuple *) stup->tuple;
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(PointerIsValid(arg));
+
+	if (GinBufferCanAddKey(arg->buffer, ntup))
+	{
+		GinBufferMergeTuple(arg->buffer, ntup);
+		return;
+	}
+
+	otup = GinBufferBuildTuple(arg->buffer);
+
+	_writetup_index_gin(state, tape, otup);
+
+	pfree(otup);
+
+	Assert(GinBufferCanAddKey(arg->buffer, ntup));
+
+	GinBufferMergeTuple(arg->buffer, ntup);
+}
+
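+/*
+ * flushwrites_index_gin
+ *		Called once at the end of each output run - write out whatever is
+ *		still buffered, so nothing is held back past the end-of-run marker.
+ */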
+static void
+flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	if (!GinBufferIsEmpty(arg->buffer))
+	{
+		GinTuple   *tuple = GinBufferBuildTuple(arg->buffer);
+
+		_writetup_index_gin(state, tape, tuple);
+		pfree(tuple);
+		Assert(GinBufferIsEmpty(arg->buffer));
+	}
+}
+
 static void
 readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 				  LogicalTape *tape, unsigned int len)
@@ -1961,6 +2032,17 @@ readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = (Datum) 0;
 }
 
+static void
+freestate_index_gin(Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(arg != NULL);
+	Assert(GinBufferIsEmpty(arg->buffer));
+	GinBufferFree(arg->buffer);
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 95d8805b66f..da4351c3d3d 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -478,6 +478,9 @@ extern int	ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len, TI
 
 extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *segment, int len,
 												   int *ndecoded_out);
+extern bool ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+												ItemPointer into, int capacity,
+												int *ndecoded_out);
 extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded_out);
 extern ItemPointer ginMergeItemPointers(ItemPointerData *a, uint32 na,
 										ItemPointerData *b, uint32 nb,
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index ce555031335..4de7b5c32b5 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -39,6 +39,16 @@ GinTupleGetFirst(GinTuple *tup)
 	return &list->first;
 }
 
+typedef struct GinBuffer GinBuffer;
+
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
 
+extern GinBuffer *GinBufferInit(Relation index);
+extern bool GinBufferIsEmpty(GinBuffer *buffer);
+extern bool GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup);
+extern void GinBufferReset(GinBuffer *buffer);
+extern void GinBufferFree(GinBuffer *buffer);
+extern void GinBufferMergeTuple(GinBuffer *buffer, GinTuple *tup);
+extern GinTuple *GinBufferBuildTuple(GinBuffer *buffer);
+
 #endif							/* GIN_TUPLE_H */
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..64176b23cbe 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -195,6 +195,14 @@ typedef struct
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
 
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient use of
+	 * the tape's resources, e.g. when deduplicating or merging values.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
+
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
 	 * the already-read length of the stored tuple.  The tuple is allocated
@@ -461,7 +469,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
-extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, struct GinTuple *tuple);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9840060997f..522e98109ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3037,6 +3037,7 @@ TuplesortClusterArg
 TuplesortDatumArg
 TuplesortIndexArg
 TuplesortIndexBTreeArg
+TuplesortIndexGinArg
 TuplesortIndexHashArg
 TuplesortInstrumentation
 TuplesortMethod
-- 
2.48.1
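
To make the new contract concrete: writetup() may now hold tuples back,
and flushwrites() runs once per output run (before the end-of-run marker
is written), so anything still held back must be emitted there. Here is
a minimal sketch of a buffering variant - DedupArg, keys_equal,
merge_tuples, copy_tuple and write_one are invented names, not code from
the patch:

	typedef struct DedupArg
	{
		GinTuple   *pending;	/* last tuple seen, not yet written */
	} DedupArg;

	static void
	writetup_dedup(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
	{
		DedupArg   *arg = (DedupArg *) TuplesortstateGetPublic(state)->arg;
		GinTuple   *tup = (GinTuple *) stup->tuple;

		/* same key as the pending tuple? merge in memory, write nothing */
		if (arg->pending && keys_equal(arg->pending, tup))
		{
			arg->pending = merge_tuples(arg->pending, tup);
			return;
		}

		/* key changed: emit the pending tuple, start buffering the new one */
		if (arg->pending)
			write_one(state, tape, arg->pending);
		arg->pending = copy_tuple(tup);
	}

	static void
	flushwrites_dedup(Tuplesortstate *state, LogicalTape *tape)
	{
		DedupArg   *arg = (DedupArg *) TuplesortstateGetPublic(state)->arg;

		/* end of run: nothing may stay buffered past this point */
		if (arg->pending)
		{
			write_one(state, tape, arg->pending);
			arg->pending = NULL;
		}
	}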

v20250304-0002-WIP-parallel-inserts-into-GIN-index.patch (text/x-patch)
From fb3d0d4f319a8e548a792a4ebbd47142d94f96af Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Tue, 25 Feb 2025 16:16:24 +0100
Subject: [PATCH v20250304 2/2] WIP: parallel inserts into GIN index

---
 src/backend/access/gin/gininsert.c            | 450 +++++++++++-------
 .../utils/activity/wait_event_names.txt       |   2 +
 2 files changed, 286 insertions(+), 166 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e873443784a..750c0c3270d 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -26,7 +26,9 @@
 #include "miscadmin.h"
 #include "nodes/execnodes.h"
 #include "pgstat.h"
+#include "storage/barrier.h"
 #include "storage/bufmgr.h"
+#include "storage/buffile.h"
 #include "storage/predicate.h"
 #include "tcop/tcopprot.h"		/* pgrminclude ignore */
 #include "utils/datum.h"
@@ -42,6 +44,11 @@
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xB000000000000004)
 #define PARALLEL_KEY_BUFFER_USAGE		UINT64CONST(0xB000000000000005)
 
+/* The phases for parallel builds, used by build_barrier. */
+#define GIN_BUILD_INIT					0
+#define GIN_BUILD_SCAN					1
+#define GIN_BUILD_PARTITION				2
+
 /*
  * Status for index builds performed in parallel.  This is allocated in a
  * dynamic shared memory segment.
@@ -88,6 +95,9 @@ typedef struct GinBuildShared
 	double		reltuples;
 	double		indtuples;
 
+	Barrier		build_barrier;
+	SharedFileSet fileset;		/* space for shared temporary files */
+
 	/*
 	 * ParallelTableScanDescData data follows. Can't directly embed here, as
 	 * implementations of the parallel table scan desc interface might need
@@ -173,7 +183,6 @@ static void _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relati
 static void _gin_end_parallel(GinLeader *ginleader, GinBuildState *state);
 static Size _gin_parallel_estimate_shared(Relation heap, Snapshot snapshot);
 static double _gin_parallel_heapscan(GinBuildState *buildstate);
-static double _gin_parallel_merge(GinBuildState *buildstate);
 static void _gin_leader_participate_as_worker(GinBuildState *buildstate,
 											  Relation heap, Relation index);
 static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
@@ -189,6 +198,12 @@ static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
 								  ItemPointerData *items, uint32 nitems);
 
+static double _gin_partition_sorted_data(GinBuildState *state);
+static void _gin_parallel_insert(GinBuildState *state,
+								 GinBuildShared *gistshared,
+								 Relation heap, Relation index,
+								 bool progress);
+
 /*
  * Adds array of item pointers to tuple's posting list, or
  * creates posting tree and tuple pointing to tree in case
@@ -699,8 +714,12 @@ ginbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 									  maintenance_work_mem, coordinate,
 									  TUPLESORT_NONE);
 
-		/* scan the relation in parallel and merge per-worker results */
-		reltuples = _gin_parallel_merge(state);
+		/* partition the sorted data */
+		reltuples = _gin_partition_sorted_data(state);
+
+		/* do the insert for the leader's partition */
+		_gin_parallel_insert(state, state->bs_leader->ginshared,
+							 heap, index, true);
 
 		_gin_end_parallel(state->bs_leader, state);
 	}
@@ -989,6 +1008,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	ginshared->reltuples = 0.0;
 	ginshared->indtuples = 0.0;
 
+	/* used to wait for data to insert */
+	BarrierInit(&ginshared->build_barrier, scantuplesortstates);
+
+	/* Set up the space we'll use for shared temporary files. */
+	SharedFileSetInit(&ginshared->fileset, pcxt->seg);
+
 	table_parallelscan_initialize(heap,
 								  ParallelTableScanFromGinBuildShared(ginshared),
 								  snapshot);
@@ -1056,6 +1081,11 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * sure that the failure-to-start case will not hang forever.
 	 */
 	WaitForParallelWorkersToAttach(pcxt);
+
+	/* wait for workers to read the data and add them to tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned, leader continues");
 }
 
 /*
@@ -1069,6 +1099,8 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
 	/* Shutdown worker processes */
 	WaitForParallelWorkersToFinish(ginleader->pcxt);
 
+	SharedFileSetDeleteAll(&ginleader->ginshared->fileset);
+
 	/*
 	 * Next, accumulate WAL usage.  (This must wait for the workers to finish,
 	 * or we might get incomplete data.)
@@ -1713,169 +1745,6 @@ GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 	return GinBufferKeyEquals(buffer, tup);
 }
 
-/*
- * Within leader, wait for end of heap scan and merge per-worker results.
- *
- * After waiting for all workers to finish, merge the per-worker results into
- * the complete index. The results from each worker are sorted by block number
- * (start of the page range). While combinig the per-worker results we merge
- * summaries for the same page range, and also fill-in empty summaries for
- * ranges without any tuples.
- *
- * Returns the total number of heap tuples scanned.
- */
-static double
-_gin_parallel_merge(GinBuildState *state)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-	double		reltuples = 0;
-	GinBuffer  *buffer;
-
-	/* GIN tuples from workers, merged by leader */
-	double		numtuples = 0;
-
-	/* wait for workers to scan table and produce partial results */
-	reltuples = _gin_parallel_heapscan(state);
-
-	/* Execute the sort */
-	pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
-								 PROGRESS_GIN_PHASE_PERFORMSORT_2);
-
-	/* do the actual sort in the leader */
-	tuplesort_performsort(state->bs_sortstate);
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The leader is allowed to use the whole maintenance_work_mem buffer to
-	 * combine data. The parallel workers already completed.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/*
-	 * Set the progress target for the next phase.  Reset the block number
-	 * values set by table_index_build_scan
-	 */
-	{
-		const int	progress_index[] = {
-			PROGRESS_CREATEIDX_SUBPHASE,
-			PROGRESS_CREATEIDX_TUPLES_TOTAL,
-			PROGRESS_SCAN_BLOCKS_TOTAL,
-			PROGRESS_SCAN_BLOCKS_DONE
-		};
-		const int64 progress_vals[] = {
-			PROGRESS_GIN_PHASE_MERGE_2,
-			state->bs_numtuples,
-			0, 0
-		};
-
-		pgstat_progress_update_multi_param(4, progress_index, progress_vals);
-	}
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by category and
-	 * key. That probably gives us order matching how data is organized in the
-	 * index.
-	 *
-	 * We don't insert the GIN tuples right away, but instead accumulate as
-	 * many TIDs for the same key as possible, and then insert that at once.
-	 * This way we don't need to decompress/recompress the posting lists, etc.
-	 */
-	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
-	{
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nitems, &state->buildStats);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-			Assert(!PointerIsValid(buffer->cached));
-
-			ginEntryInsert(&state->ginstate,
-						   buffer->attnum, buffer->key, buffer->category,
-						   buffer->items, buffer->nfrozen, &state->buildStats);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferMergeTuple(buffer, tup);
-
-		if (buffer->cached)
-			GinBufferUnpackCached(buffer, 0);
-
-		/* Report progress */
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-									 ++numtuples);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		AssertCheckItemPointers(buffer);
-		Assert(!PointerIsValid(buffer->cached));
-
-		ginEntryInsert(&state->ginstate,
-					   buffer->attnum, buffer->key, buffer->category,
-					   buffer->items, buffer->nitems, &state->buildStats);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-
-		/* Report progress */
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-									 ++numtuples);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(state->bs_sortstate);
-
-	return reltuples;
-}
-
 /*
  * Returns size of shared memory required to store state for a parallel
  * gin index build based on the snapshot its parallel scan will use.
@@ -2093,6 +1962,9 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Prepare to track buffer usage during parallel execution */
 	InstrStartParallelQuery();
 
+	/* attach to the fileset too */
+	SharedFileSetAttach(&ginshared->fileset, seg);
+
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
 	 * (when requested number of workers were not launched, this will be
@@ -2103,6 +1975,20 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	_gin_parallel_scan_and_build(&buildstate, ginshared, sharedsort,
 								 heapRel, indexRel, sortmem, false);
 
+	/* wait for all workers to read the data and add it to the tuplesort */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_SCAN))
+		elog(LOG, "data scanned by workers, worker continues");
+
+	/* leader sorts and partitions the data */
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&ginshared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned by leader, worker continues");
+
+	_gin_parallel_insert(&buildstate, ginshared, heapRel, indexRel, false);
+
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
@@ -2375,3 +2261,235 @@ _gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup)
 	return ItemPointerCompare(GinTupleGetFirst(a),
 							  GinTupleGetFirst(b));
 }
+
+static double
+_gin_partition_sorted_data(GinBuildState *state)
+{
+	GinTuple   *tup;
+	Size		tuplen;
+	GinBuildShared *shared = state->bs_leader->ginshared;
+	BufFile   **files;
+	int64		fileidx = 0;
+	double		reltuples;
+
+	/* how many tuples per worker */
+	int64		worker_tuples = (state->indtuples / shared->scantuplesortstates) + 1;
+	int64		remaining = Min(worker_tuples, 1000);
+	int64		ntmp = 0;
+
+	/* wait for workers to scan table and produce partial results */
+	reltuples = _gin_parallel_heapscan(state);
+
+	/* do the actual sort in the leader */
+	tuplesort_performsort(state->bs_sortstate);
+
+	/* Allocate BufFiles, one per participant. */
+	files = palloc0_array(BufFile *, shared->scantuplesortstates);
+
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		char		fname[MAXPGPATH];
+
+		sprintf(fname, "worker-%d", i);
+
+		files[i] = BufFileCreateFileSet(&shared->fileset.fs, fname);
+	}
+
+	/*
+	 * Read the GIN tuples from the shared tuplesort, sorted by category and
+	 * key. That probably gives us order matching how data is organized in the
+	 * index.
+	 *
+	 * We don't insert the GIN tuples right away, but instead accumulate as
+	 * many TIDs for the same key as possible, and then insert that at once.
+	 * This way we don't need to decompress/recompress the posting lists, etc.
+	 *
+	 * XXX Maybe we should sort by key first, then by category? The idea is
+	 * that if this matches the order of the keys in the index, we'd insert
+	 * the entries in order better matching the index.
+	 */
+	while ((tup = tuplesort_getgintuple(state->bs_sortstate, &tuplen, true)) != NULL)
+	{
+		ntmp++;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * FIXME Maybe move to next partition only when the index key changes?
+		 * Otherwise we might have issues with 'could not fit onto page' when
+		 * adding overlapping TID lists to the index. But maybe that can't
+		 * happen, thanks to the merging of data in the tuplesort?
+		 */
+
+		BufFileWrite(files[fileidx], &tuplen, sizeof(tuplen));
+		BufFileWrite(files[fileidx], tup, tuplen);
+
+		remaining--;
+
+		/* move to the next file */
+		if (remaining == 0)
+		{
+			remaining = Min(worker_tuples, 1000);
+			fileidx++;
+			fileidx = fileidx % shared->scantuplesortstates;
+		}
+	}
+
+	/* close the files */
+	for (int i = 0; i < shared->scantuplesortstates; i++)
+	{
+		BufFileClose(files[i]);
+	}
+
+	/* and also close the tuplesort */
+	tuplesort_end(state->bs_sortstate);
+
+	/* wait for the leader to partition the data */
+	if (BarrierArriveAndWait(&shared->build_barrier,
+							 WAIT_EVENT_GIN_BUILD_PARTITION))
+		elog(LOG, "data partitioned, leader continues");
+
+	return reltuples;
+}
+
+static void
+_gin_parallel_insert(GinBuildState *state, GinBuildShared *ginshared,
+					 Relation heap, Relation index, bool progress)
+{
+	GinBuffer  *buffer;
+	GinTuple   *tup;
+	Size		len;
+
+	BufFile    *file;
+	char		fname[MAXPGPATH];
+	char	   *buff;
+	int64		ntuples = 0;
+	Size		maxlen;
+
+	/*
+	 * Initialize buffer to combine entries for the same key.
+	 *
+	 * This runs in every participant (leader and workers alike), once the
+	 * shared sort has completed and the data has been partitioned.
+	 */
+	buffer = GinBufferInit(state->ginstate.index);
+
+	sprintf(fname, "worker-%d", ParallelWorkerNumber + 1);
+	file = BufFileOpenFileSet(&ginshared->fileset.fs, fname, O_RDONLY, false);
+
+	/* 8kB seems like a reasonable starting point */
+	maxlen = 8192;
+	buff = palloc(maxlen);
+
+	while (true)
+	{
+		size_t		ret;
+
+		ret = BufFileRead(file, &len, sizeof(len));
+
+		if (ret == 0)
+			break;
+		if (ret != sizeof(len))
+			elog(ERROR, "could not read GIN tuple length: got %zu of %zu bytes", ret, sizeof(len));
+
+		/* maybe resize the buffer */
+		if (maxlen < len)
+		{
+			while (maxlen < len)
+				maxlen *= 2;
+
+			buff = repalloc(buff, maxlen);
+		}
+
+		tup = (GinTuple *) buff;
+
+		BufFileReadExact(file, tup, len);
+
+		ntuples++;
+
+		if (ntuples % 100000 == 0)
+			elog(LOG, "inserted " INT64_FORMAT " tuples", ntuples);
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * If the buffer can accept the new GIN tuple, just store it there and
+		 * we're done. If it's a different key (or maybe too much data) flush
+		 * the current contents into the index first.
+		 */
+		if (!GinBufferCanAddKey(buffer, tup))
+		{
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nitems, &state->buildStats);
+
+			/* discard the existing data */
+			GinBufferReset(buffer);
+		}
+
+		/*
+		 * We're about to add a GIN tuple to the buffer - check the memory
+		 * limit first, and maybe write out some of the data into the index
+		 * first, if needed (and possible). We only flush the part of the TID
+		 * list that we know won't change, and only if there's enough data for
+		 * compression to work well.
+		 */
+		if (GinBufferShouldTrim(buffer, tup))
+		{
+			Assert(buffer->nfrozen > 0);
+
+			/*
+			 * Buffer is not empty and it's storing a different key - flush
+			 * the data into the insert, and start a new entry for current
+			 * GinTuple.
+			 */
+			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
+
+			ginEntryInsert(&state->ginstate,
+						   buffer->attnum, buffer->key, buffer->category,
+						   buffer->items, buffer->nfrozen, &state->buildStats);
+
+			/* truncate the data we've just discarded */
+			GinBufferTrim(buffer);
+		}
+
+		/*
+		 * Remember data for the current tuple (either remember the new key,
+		 * or append it to the existing data.
+		 */
+		GinBufferMergeTuple(buffer, tup);
+
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
+	}
+
+	/* flush data remaining in the buffer (for the last key) */
+	if (!GinBufferIsEmpty(buffer))
+	{
+		AssertCheckItemPointers(buffer);
+
+		Assert(!PointerIsValid(buffer->cached));
+		ginEntryInsert(&state->ginstate,
+					   buffer->attnum, buffer->key, buffer->category,
+					   buffer->items, buffer->nitems, &state->buildStats);
+
+		/* discard the existing data */
+		GinBufferReset(buffer);
+	}
+
+	/* release all the memory */
+	GinBufferFree(buffer);
+
+	BufFileClose(file);
+}
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..afb9be848a0 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -116,6 +116,8 @@ CHECKPOINT_DELAY_START	"Waiting for a backend that blocks a checkpoint from star
 CHECKPOINT_DONE	"Waiting for a checkpoint to complete."
 CHECKPOINT_START	"Waiting for a checkpoint to start."
 EXECUTE_GATHER	"Waiting for activity from a child process while executing a <literal>Gather</literal> plan node."
+GIN_BUILD_PARTITION	"Waiting for partitioning of data during a parallel GIN index build."
+GIN_BUILD_SCAN	"Waiting for scan of data during a parallel GIN index build."
 HASH_BATCH_ALLOCATE	"Waiting for an elected Parallel Hash participant to allocate a hash table."
 HASH_BATCH_ELECT	"Waiting to elect a Parallel Hash participant to allocate a hash table."
 HASH_BATCH_LOAD	"Waiting for other Parallel Hash participants to finish loading a hash table."
-- 
2.48.1
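
For reference, the barrier choreography this WIP patch sets up looks
roughly like this (phase names from the GIN_BUILD_* defines; a summary,
not new code):

	/*
	 * leader:  scan heap .. barrier(SCAN) .. performsort, split the sorted
	 *          stream round-robin (chunks of up to 1000 tuples) into one
	 *          BufFile per participant .. barrier(PARTITION) .. insert the
	 *          contents of its own BufFile into the index
	 *
	 * worker:  scan heap .. barrier(SCAN) .. idle while the leader
	 *          partitions .. barrier(PARTITION) .. open "worker-%d" and
	 *          insert the contents of that BufFile into the index
	 */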

#57Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#56)
4 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

On Tue, 4 Mar 2025 at 20:50, Tomas Vondra <tomas@vondra.me> wrote:

> I pushed the two smaller parts today.
>
> Here's the remaining two parts, to keep cfbot happy. I don't expect to
> get these into PG18, though.

As promised on- and off-list, here's the 0001 patch, polished, split,
and further adapted for performance.

As seen before, it reduces temporary disk space requirements by up to
50%. I've not tested this against HEAD for performance.

It has been split into:

0001: Some API cleanup/changes that crept into the patch. This
removes manual length-passing from the gin tuplesort APIs, instead
relying on GinTuple's tuplen field. It's not critical for anything,
and could be ignored if so desired.
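
Concretely, call sites go from passing the length alongside the tuple
to relying on the tuple's own header field:

	tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);	/* before */
	tuplesort_putgintuple(state->bs_sortstate, ntup);		/* after */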

0002: Tuplesort changes to allow TupleSort users to buffer and merge
tuples during the sort operations.
The patch was pulled directly from [0] (which was derived from earlier
work in this thread), is fairly easy to understand, and has no other
moving parts.
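
The hook is optional: variants that don't buffer simply leave it NULL,
and the sort core invokes it through a NULL-safe macro once per output
run, in both dumptuples() and mergeonerun():

	#define FLUSHWRITES(state,tape) \
		((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)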

0003: Deduplication in tuplesort's flush-to-disk actions, utilizing
API introduced with 0002.
This improves temporary disk usage by deduplicating data even further,
for when there's a lot of duplicated data but the data has enough
distinct values to not fit in the available memory.

0004: Use a single tuplesort. This removes the worker-local tuplesort
in favor of only storing data in the global one.

This mainly reduces the code size and complexity of parallel GIN
builds; we already were using that global sort for various tasks.

Open questions and open items for this:
- I did not yet update the pg_stat_progress systems, nor docs.
- Maybe 0003 needs further splitting up, one for the optimizations in
GinBuffer, one for the tuplesort buffering.
- Maybe we need to trim the buffer in gin's tuplesort flush?
- Maybe we should grow the GinBuffer->items array superlinearly rather
than to the exact size requirement of the merge operation (see the
sketch below).
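
A minimal sketch of what that could look like, assuming a hypothetical
maxitems field tracking allocated capacity (the helper name and the
doubling policy are invented, not something the patches do):

	static void
	GinBufferEnsureCapacity(GinBuffer *buffer, int needed)
	{
		if (buffer->maxitems >= needed)
			return;

		/* grow geometrically so repeated merges amortize the copies */
		buffer->maxitems = Max(buffer->maxitems, 64);
		while (buffer->maxitems < needed)
			buffer->maxitems *= 2;

		/* assumes buffer->items was palloc'd when the buffer was created */
		buffer->items = repalloc(buffer->items,
								 buffer->maxitems * sizeof(ItemPointerData));
	}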

Apart from the complexities in 0003, I think the changes are fairly
straightforward.

I did not include the 0002 of the earlier patch, as it was WIP and its
feature explicitly conflicts with my 0004.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

[0]: /messages/by-id/CAEze2WhRFzd=nvh9YevwiLjrS1j1fP85vjNCXAab=iybZ2rNKw@mail.gmail.com

Attachments:

v20250307-0004-Make-Gin-parallel-builds-use-a-single-tupl.patch (application/octet-stream)
From e56feda19b2c908e85750e778646717c3d82292b Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 6 Mar 2025 02:41:58 +0100
Subject: [PATCH v20250307 4/4] Make Gin parallel builds use a single tuplesort

This reduces the size requirement of scratch space and reduces the
cycles we have to spend on passing data around.

As another benefit, it reduces the code size and complexity of GIN
builds.
---
 src/backend/access/gin/gininsert.c | 175 +----------------------------
 1 file changed, 1 insertion(+), 174 deletions(-)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 80cabae99b1..f3a7f375fb2 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -164,14 +164,6 @@ typedef struct
 	 * build callback etc.
 	 */
 	Tuplesortstate *bs_sortstate;
-
-	/*
-	 * The sortstate used only within a single worker for the first merge pass
-	 * happenning there. In principle it doesn't need to be part of the build
-	 * state and we could pass it around directly, but it's more convenient
-	 * this way. And it's part of the build state, after all.
-	 */
-	Tuplesortstate *bs_worker_sort;
 } GinBuildState;
 
 
@@ -508,7 +500,7 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 							   key, attr->attlen, attr->attbyval,
 							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup);
+		tuplesort_putgintuple(buildstate->bs_sortstate, tup);
 
 		pfree(tup);
 	}
@@ -2034,158 +2026,6 @@ _gin_leader_participate_as_worker(GinBuildState *buildstate, Relation heap, Rela
 								 sortmem, true);
 }
 
-/*
- * _gin_process_worker_data
- *		First phase of the key merging, happening in the worker.
- *
- * Depending on the number of distinct keys, the TID lists produced by the
- * callback may be very short (due to frequent evictions in the callback).
- * But combining many tiny lists is expensive, so we try to do as much as
- * possible in the workers and only then pass the results to the leader.
- *
- * We read the tuples sorted by the key, and merge them into larger lists.
- * At the moment there's no memory limit, so this will just produce one
- * huge (sorted) list per key in each worker. Which means the leader will
- * do a very limited number of mergesorts, which is good.
- */
-static void
-_gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
-						 bool progress)
-{
-	GinTuple   *tup;
-	Size		tuplen;
-
-	GinBuffer  *buffer;
-
-	/*
-	 * Initialize buffer to combine entries for the same key.
-	 *
-	 * The workers are limited to the same amount of memory as during the sort
-	 * in ginBuildCallbackParallel. But this probably should be the 32MB used
-	 * during planning, just like there.
-	 */
-	buffer = GinBufferInit(state->ginstate.index);
-
-	/* sort the raw per-worker data */
-	if (progress)
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
-									 PROGRESS_GIN_PHASE_PERFORMSORT_1);
-
-	tuplesort_performsort(state->bs_worker_sort);
-
-	/* reset the number of GIN tuples produced by this worker */
-	state->bs_numtuples = 0;
-
-	if (progress)
-		pgstat_progress_update_param(PROGRESS_CREATEIDX_SUBPHASE,
-									 PROGRESS_GIN_PHASE_MERGE_1);
-
-	/*
-	 * Read the GIN tuples from the shared tuplesort, sorted by the key, and
-	 * merge them into larger chunks for the leader to combine.
-	 */
-	while ((tup = tuplesort_getgintuple(worker_sort, &tuplen, true)) != NULL)
-	{
-
-		CHECK_FOR_INTERRUPTS();
-
-		/*
-		 * If the buffer can accept the new GIN tuple, just store it there and
-		 * we're done. If it's a different key (or maybe too much data) flush
-		 * the current contents into the index first.
-		 */
-		if (!GinBufferCanAddKey(buffer, tup))
-		{
-			GinTuple   *ntup;
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup);
-			state->bs_numtuples++;
-
-			pfree(ntup);
-
-			/* discard the existing data */
-			GinBufferReset(buffer);
-		}
-
-		if (buffer->cached)
-			GinBufferUnpackCached(buffer, tup->nitems);
-
-		/*
-		 * We're about to add a GIN tuple to the buffer - check the memory
-		 * limit first, and maybe write out some of the data into the index
-		 * first, if needed (and possible). We only flush the part of the TID
-		 * list that we know won't change, and only if there's enough data for
-		 * compression to work well.
-		 */
-		if (GinBufferShouldTrim(buffer, tup))
-		{
-			GinTuple   *ntup;
-
-			Assert(buffer->nfrozen > 0);
-
-			/*
-			 * Buffer is not empty and it's storing a different key - flush
-			 * the data into the insert, and start a new entry for current
-			 * GinTuple.
-			 */
-			AssertCheckItemPointers(buffer);
-
-			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen);
-
-			tuplesort_putgintuple(state->bs_sortstate, ntup);
-
-			pfree(ntup);
-
-			/* truncate the data we've just discarded */
-			GinBufferTrim(buffer);
-		}
-
-		/*
-		 * Remember data for the current tuple (either remember the new key,
-		 * or append if to the existing data).
-		 */
-		GinBufferStoreOrMergeTuple(buffer, tup);
-	}
-
-	/* flush data remaining in the buffer (for the last key) */
-	if (!GinBufferIsEmpty(buffer))
-	{
-		GinTuple   *ntup;
-
-		AssertCheckItemPointers(buffer);
-
-		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
-								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems);
-
-		tuplesort_putgintuple(state->bs_sortstate, ntup);
-		state->bs_numtuples++;
-
-		pfree(ntup);
-
-		/* discard the existing data */
-		GinBufferReset(buffer);
-	}
-
-	/* relase all the memory */
-	GinBufferFree(buffer);
-
-	tuplesort_end(worker_sort);
-}
-
 /*
  * Perform a worker's portion of a parallel GIN index build sort.
  *
@@ -2252,12 +2092,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 													coordinate,
 													TUPLESORT_NONE);
 
-	/* Local per-worker sort of raw-data */
-	state->bs_worker_sort = tuplesort_begin_index_gin(heap, index,
-													  state->work_mem,
-													  NULL,
-													  TUPLESORT_NONE);
-
 	/* Join parallel scan */
 	indexInfo = BuildIndexInfo(index);
 	indexInfo->ii_Concurrent = ginshared->isconcurrent;
@@ -2271,13 +2105,6 @@ _gin_parallel_scan_and_build(GinBuildState *state,
 	/* write remaining accumulated entries */
 	ginFlushBuildState(state, index);
 
-	/*
-	 * Do the first phase of in-worker processing - sort the data produced by
-	 * the callback, and combine them into much larger chunks and place that
-	 * into the shared tuplestore for leader to process.
-	 */
-	_gin_process_worker_data(state, state->bs_worker_sort, progress);
-
 	/* sort the GIN tuples built by this worker */
 	tuplesort_performsort(state->bs_sortstate);
 
-- 
2.45.2

v20250307-0002-Allow-tuplesort-implementations-to-buffer-.patch (application/octet-stream)
From 1a9797f82982aa6db9fe2001cd8ad4ff8ebd7c0c Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 28 Aug 2024 15:28:37 +0200
Subject: [PATCH v20250307 2/4] Allow tuplesort implementations to buffer
 writes

Before, all writes to the sort tapes would have to be completed during
the call to writetup().  That's sufficient when the user of tuplesort
isn't interested in merging sorted tuples, but btree (and in the future,
GIN) sorts tuples to later merge them during insertion into the index.
If it'd merge the tuples before writing them to disk, we can save
significant disk space and IO.

As such, we allow WRITETUP to do whatever it wants when we're filling a
tape with tuples, and call FLUSHWRITES() at the end to mark the end of
that tape so that the tuplesort can flush any remaining buffers to disk.

By design, this does _not_ allow deduplication while the dataset is still
in memory. Writing data to disk is inherently expensive, so we're likely
to win time by spending some additional cycles on buffering the data in
the hopes of not writing as much data. However, in memory the additional
cycles may cause too much of an overhead to be useful.

Note that any implementation of tuple merging using the buffering
strategy enabled by this commit must also make sure that a merged
output tuple is never larger than the combined size of the tuples it
replaces.
---
 src/include/utils/tuplesort.h              | 9 +++++++++
 src/backend/utils/sort/tuplesort.c         | 5 +++++
 src/backend/utils/sort/tuplesortvariants.c | 7 +++++++
 3 files changed, 21 insertions(+)

diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index be1bd8a1862..a89299296bb 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -195,6 +195,15 @@ typedef struct
 	void		(*writetup) (Tuplesortstate *state, LogicalTape *tape,
 							 SortTuple *stup);
 
+	/*
+	 * Flush any buffered writetup() writes.
+	 *
+	 * This is useful when writetup() buffers writes for more efficient
+	 * use of the tape's resources, e.g. when deduplicating or merging
+	 * sort tuples.
+	 */
+	void		(*flushwrites) (Tuplesortstate *state, LogicalTape *tape);
+
 	/*
 	 * Function to read a stored tuple from tape back into memory. 'len' is
 	 * the already-read length of the stored tuple.  The tuple is allocated
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2ef32d53a43..7f346325678 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -395,6 +395,7 @@ struct Sharedsort
 #define REMOVEABBREV(state,stup,count)	((*(state)->base.removeabbrev) (state, stup, count))
 #define COMPARETUP(state,a,b)	((*(state)->base.comparetup) (a, b, state))
 #define WRITETUP(state,tape,stup)	((*(state)->base.writetup) (state, tape, stup))
+#define FLUSHWRITES(state,tape)	((state)->base.flushwrites ? (*(state)->base.flushwrites) (state, tape) : (void) 0)
 #define READTUP(state,stup,tape,len) ((*(state)->base.readtup) (state, stup, tape, len))
 #define FREESTATE(state)	((state)->base.freestate ? (*(state)->base.freestate) (state) : (void) 0)
 #define LACKMEM(state)		((state)->availMem < 0 && !(state)->slabAllocatorUsed)
@@ -2244,6 +2245,8 @@ mergeonerun(Tuplesortstate *state)
 		}
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	/*
 	 * When the heap empties, we're done.  Write an end-of-run marker on the
 	 * output tape.
@@ -2369,6 +2372,8 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->destTape, stup);
 	}
 
+	FLUSHWRITES(state, state->destTape);
+
 	state->memtupcount = 0;
 
 	/*
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 0b83b8b25b3..79bd29aa90e 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -209,6 +209,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 	base->comparetup = comparetup_heap;
 	base->comparetup_tiebreak = comparetup_heap_tiebreak;
 	base->writetup = writetup_heap;
+	base->flushwrites = NULL;
 	base->readtup = readtup_heap;
 	base->haveDatum1 = true;
 	base->arg = tupDesc;		/* assume we need not copy tupDesc */
@@ -285,6 +286,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 	base->comparetup = comparetup_cluster;
 	base->comparetup_tiebreak = comparetup_cluster_tiebreak;
 	base->writetup = writetup_cluster;
+	base->flushwrites = NULL;
 	base->readtup = readtup_cluster;
 	base->freestate = freestate_cluster;
 	base->arg = arg;
@@ -393,6 +395,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -472,6 +475,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 	base->comparetup = comparetup_index_hash;
 	base->comparetup_tiebreak = comparetup_index_hash_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -516,6 +520,7 @@ tuplesort_begin_index_gist(Relation heapRel,
 	base->comparetup = comparetup_index_btree;
 	base->comparetup_tiebreak = comparetup_index_btree_tiebreak;
 	base->writetup = writetup_index;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index;
 	base->haveDatum1 = true;
 	base->arg = arg;
@@ -571,6 +576,7 @@ tuplesort_begin_index_brin(int workMem,
 	base->removeabbrev = removeabbrev_index_brin;
 	base->comparetup = comparetup_index_brin;
 	base->writetup = writetup_index_brin;
+	base->flushwrites = NULL;
 	base->readtup = readtup_index_brin;
 	base->haveDatum1 = true;
 	base->arg = NULL;
@@ -683,6 +689,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	base->comparetup = comparetup_datum;
 	base->comparetup_tiebreak = comparetup_datum_tiebreak;
 	base->writetup = writetup_datum;
+	base->flushwrites = NULL;
 	base->readtup = readtup_datum;
 	base->haveDatum1 = true;
 	base->arg = arg;
-- 
2.45.2

v20250307-0001-Remove-size-argument-from-GIN-tuplesort-in.patch (application/octet-stream)
From 99b023afc89108aa5a4b795014d92014a7689caf Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Wed, 5 Mar 2025 19:37:21 +0100
Subject: [PATCH v20250307 1/4] Remove size argument from GIN tuplesort infra

The length is already implied by GinTuple->tuplen, so there's
no need to pass it around separately.
---
 src/include/utils/tuplesort.h              |  2 +-
 src/backend/access/gin/gininsert.c         | 28 ++++++++--------------
 src/backend/utils/sort/tuplesortvariants.c |  3 ++-
 3 files changed, 13 insertions(+), 20 deletions(-)

diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index ef79f259f93..be1bd8a1862 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -461,7 +461,7 @@ extern void tuplesort_putindextuplevalues(Tuplesortstate *state,
 										  Relation rel, ItemPointer self,
 										  const Datum *values, const bool *isnull);
 extern void tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size);
-extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size);
+extern void tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple);
 extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 							   bool isNull);
 
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index b2f89cad880..f5782ea95df 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -195,8 +195,7 @@ static Datum _gin_parse_tuple_key(GinTuple *a);
 
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 								  Datum key, int16 typlen, bool typbyval,
-								  ItemPointerData *items, uint32 nitems,
-								  Size *len);
+								  ItemPointerData *items, uint32 nitems);
 
 /*
  * Adds array of item pointers to tuple's posting list, or
@@ -499,16 +498,15 @@ ginFlushBuildState(GinBuildState *buildstate, Relation index)
 
 		/* GIN tuple and tuple length */
 		GinTuple   *tup;
-		Size		tuplen;
 
 		/* there could be many entries, so be willing to abort here */
 		CHECK_FOR_INTERRUPTS();
 
 		tup = _gin_build_tuple(attnum, category,
 							   key, attr->attlen, attr->attbyval,
-							   list, nlist, &tuplen);
+							   list, nlist);
 
-		tuplesort_putgintuple(buildstate->bs_worker_sort, tup, tuplen);
+		tuplesort_putgintuple(buildstate->bs_worker_sort, tup);
 
 		pfree(tup);
 	}
@@ -1852,7 +1850,6 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 		if (!GinBufferCanAddKey(buffer, tup))
 		{
 			GinTuple   *ntup;
-			Size		ntuplen;
 
 			/*
 			 * Buffer is not empty and it's storing a different key - flush
@@ -1863,9 +1860,9 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nitems, &ntuplen);
+									buffer->items, buffer->nitems);
 
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+			tuplesort_putgintuple(state->bs_sortstate, ntup);
 			state->bs_numtuples++;
 
 			pfree(ntup);
@@ -1884,7 +1881,6 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 		if (GinBufferShouldTrim(buffer, tup))
 		{
 			GinTuple   *ntup;
-			Size		ntuplen;
 
 			Assert(buffer->nfrozen > 0);
 
@@ -1897,9 +1893,9 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 
 			ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 									buffer->key, buffer->typlen, buffer->typbyval,
-									buffer->items, buffer->nfrozen, &ntuplen);
+									buffer->items, buffer->nfrozen);
 
-			tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+			tuplesort_putgintuple(state->bs_sortstate, ntup);
 
 			pfree(ntup);
 
@@ -1918,15 +1914,14 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 	if (!GinBufferIsEmpty(buffer))
 	{
 		GinTuple   *ntup;
-		Size		ntuplen;
 
 		AssertCheckItemPointers(buffer);
 
 		ntup = _gin_build_tuple(buffer->attnum, buffer->category,
 								buffer->key, buffer->typlen, buffer->typbyval,
-								buffer->items, buffer->nitems, &ntuplen);
+								buffer->items, buffer->nitems);
 
-		tuplesort_putgintuple(state->bs_sortstate, ntup, ntuplen);
+		tuplesort_putgintuple(state->bs_sortstate, ntup);
 		state->bs_numtuples++;
 
 		pfree(ntup);
@@ -2187,8 +2182,7 @@ typedef struct
 static GinTuple *
 _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 				 Datum key, int16 typlen, bool typbyval,
-				 ItemPointerData *items, uint32 nitems,
-				 Size *len)
+				 ItemPointerData *items, uint32 nitems)
 {
 	GinTuple   *tuple;
 	char	   *ptr;
@@ -2256,8 +2250,6 @@ _gin_build_tuple(OffsetNumber attrnum, unsigned char category,
 	 */
 	tuplen = SHORTALIGN(offsetof(GinTuple, data) + keylen) + compresslen;
 
-	*len = tuplen;
-
 	/*
 	 * Allocate space for the whole GIN tuple.
 	 *
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index eb8601e2257..0b83b8b25b3 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -885,12 +885,13 @@ tuplesort_putbrintuple(Tuplesortstate *state, BrinTuple *tuple, Size size)
 }
 
 void
-tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple, Size size)
+tuplesort_putgintuple(Tuplesortstate *state, GinTuple *tuple)
 {
 	SortTuple	stup;
 	GinTuple   *ctup;
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
 	MemoryContext oldcontext = MemoryContextSwitchTo(base->tuplecontext);
+	Size		size = tuple->tuplen;
 	Size		tuplen;
 
 	/* copy the GinTuple into the right memory context */
-- 
2.45.2

Attachment: v20250307-0003-Merge-GinTuples-during-tuplesort-before-fl.patch
From 2c41e560658e85fb507fdb20e492b392b8515ea5 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 6 Mar 2025 02:25:18 +0100
Subject: [PATCH v20250307 3/4] Merge GinTuples during tuplesort before
 flushing to disk

This reduces the on-disk size of the data, and improves the behaviour
of our tuplesort when we have to flush to disk relatively frequently.

In passing we also improve GinBuffer's merging of GinTuples to
use direct-to-buffer decoding of posting lists. Previously this
decoding would always allocate a new temporary array, but with
these changes we don't have to re-allocate this data and move
it around.

When we have overlapping TID ranges we still allocate the arrays,
but that's much rarer than non-overlapping TID ranges, thus in
general improving performance.
---
 src/include/access/gin_private.h           |   2 +
 src/include/access/gin_tuple.h             |  10 +
 src/backend/access/gin/gininsert.c         | 411 +++++++++++++++++----
 src/backend/access/gin/ginpostinglist.c    |  67 ++++
 src/backend/utils/sort/tuplesortvariants.c |  93 ++++-
 src/tools/pgindent/typedefs.list           |   1 +
 6 files changed, 507 insertions(+), 77 deletions(-)

diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 95d8805b66f..681cc30fe71 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -476,6 +476,8 @@ extern GinPostingList *ginCompressPostingList(const ItemPointer ipd, int nipd,
 											  int maxsize, int *nwritten);
 extern int	ginPostingListDecodeAllSegmentsToTbm(GinPostingList *ptr, int len, TIDBitmap *tbm);
 
+extern void ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+												ItemPointer out, int nptrs);
 extern ItemPointer ginPostingListDecodeAllSegments(GinPostingList *segment, int len,
 												   int *ndecoded_out);
 extern ItemPointer ginPostingListDecode(GinPostingList *plist, int *ndecoded_out);
diff --git a/src/include/access/gin_tuple.h b/src/include/access/gin_tuple.h
index ce555031335..309427646f2 100644
--- a/src/include/access/gin_tuple.h
+++ b/src/include/access/gin_tuple.h
@@ -39,6 +39,16 @@ GinTupleGetFirst(GinTuple *tup)
 	return &list->first;
 }
 
+typedef struct GinBuffer GinBuffer;
+
 extern int	_gin_compare_tuples(GinTuple *a, GinTuple *b, SortSupport ssup);
 
+extern GinBuffer *GinBufferInit(Relation index);
+extern bool GinBufferIsEmpty(GinBuffer *buffer);
+extern bool GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup);
+extern void GinBufferReset(GinBuffer *buffer);
+extern void GinBufferFree(GinBuffer *buffer);
+extern void GinBufferStoreOrMergeTuple(GinBuffer *buffer, GinTuple *tup);
+extern GinTuple *GinBufferBuildTuple(GinBuffer *buffer);
+
 #endif							/* GIN_TUPLE_H */
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index f5782ea95df..80cabae99b1 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -191,6 +191,8 @@ static void _gin_parallel_scan_and_build(GinBuildState *buildstate,
 										 int sortmem, bool progress);
 
 static ItemPointer _gin_parse_tuple_items(GinTuple *a);
+static void _gin_parse_tuple_items_into(GinTuple *a, ItemPointer items,
+										int nspace);
 static Datum _gin_parse_tuple_key(GinTuple *a);
 
 static GinTuple *_gin_build_tuple(OffsetNumber attrnum, unsigned char category,
@@ -1141,9 +1143,23 @@ _gin_parallel_heapscan(GinBuildState *state)
  * When adding TIDs to the buffer, we make sure to keep them sorted, both
  * during the initial table scan (and detecting when the scan wraps around),
  * and during merging (where we do mergesort).
+ *
+ * Note: When nitems == 0, we may still have a cached GinTuple that holds
+ * many items.
+ * We don't always deserialize the cached GinTuple, as deserializing and
+ * then serializing this data while we're in tuplesort code can be quite
+ * expensive, especially while we don't know if the next tuple actually needs
+ * to be merged into this tuple: The cached tuple could just as well be written
+ * as-is to the tape without any further modifications.
  */
-typedef struct GinBuffer
+struct GinBuffer
 {
+	/*
+	 * The memory context holds the dynamic allocation of items, key, cached,
+	 * and any GinTuple returned from GinBufferBuildTuple.
+	 */
+	MemoryContext context;
+	GinTuple   *cached;			/* copy of previous GIN tuple, if any */
 	OffsetNumber attnum;
 	GinNullCategory category;
 	Datum		key;			/* 0 if no key (and keylen == 0) */
@@ -1161,7 +1177,7 @@ typedef struct GinBuffer
 	int			nfrozen;
 	SortSupport ssup;			/* for sorting/comparing keys */
 	ItemPointerData *items;
-} GinBuffer;
+};
 
 /*
  * Check that TID array contains valid values, and that it's sorted (if we
@@ -1172,8 +1188,7 @@ AssertCheckItemPointers(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* we should not have a buffer with no TIDs to sort */
-	Assert(buffer->items != NULL);
-	Assert(buffer->nitems > 0);
+	Assert(buffer->nitems == 0 || buffer->items != NULL);
 
 	for (int i = 0; i < buffer->nitems; i++)
 	{
@@ -1199,14 +1214,17 @@ AssertCheckGinBuffer(GinBuffer *buffer)
 {
 #ifdef USE_ASSERT_CHECKING
 	/* if we have any items, the array must exist */
-	Assert(!((buffer->nitems > 0) && (buffer->items == NULL)));
+	Assert((buffer->nitems == 0) || (buffer->items != NULL));
 
 	/*
 	 * The buffer may be empty, in which case we must not call the check of
 	 * item pointers, because that assumes non-emptiness.
 	 */
 	if (buffer->nitems == 0)
+	{
+		Assert(buffer->nfrozen == 0);
 		return;
+	}
 
 	/* Make sure the item pointers are valid and sorted. */
 	AssertCheckItemPointers(buffer);
@@ -1223,7 +1241,7 @@ AssertCheckGinBuffer(GinBuffer *buffer)
  *
  * Initializes sort support procedures for all index attributes.
  */
-static GinBuffer *
+GinBuffer *
 GinBufferInit(Relation index)
 {
 	GinBuffer  *buffer = palloc0(sizeof(GinBuffer));
@@ -1286,15 +1304,18 @@ GinBufferInit(Relation index)
 
 		PrepareSortSupportComparisonShim(cmpFunc, sortKey);
 	}
+	buffer->context = GenerationContextCreate(CurrentMemoryContext,
+											  "Gin Buffer",
+											  ALLOCSET_DEFAULT_SIZES);
 
 	return buffer;
 }
 
 /* Is the buffer empty, i.e. has no TID values in the array? */
-static bool
+bool
 GinBufferIsEmpty(GinBuffer *buffer)
 {
-	return (buffer->nitems == 0);
+	return (buffer->nitems == 0 && buffer->cached == NULL);
 }
 
 /*
@@ -1310,37 +1331,81 @@ GinBufferIsEmpty(GinBuffer *buffer)
 static bool
 GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 {
+	MemoryContext prev;
 	int			r;
+	AttrNumber	attnum;
 	Datum		tupkey;
+	Datum		bufkey;
 
 	AssertCheckGinBuffer(buffer);
 
-	if (tup->attrnum != buffer->attnum)
-		return false;
+	/*
+	 * If we have a cached GinTuple, compare against its stored info, as
+	 * we haven't yet populated the GinBuffer with its data.
+	 */
+	if (buffer->cached)
+	{
+		GinTuple   *cached = buffer->cached;
 
-	/* same attribute should have the same type info */
-	Assert(tup->typbyval == buffer->typbyval);
-	Assert(tup->typlen == buffer->typlen);
+		if (tup->attrnum != cached->attrnum)
+			return false;
 
-	if (tup->category != buffer->category)
-		return false;
+		Assert(tup->typbyval == cached->typbyval);
+		Assert(tup->typlen == cached->typlen);
 
-	/*
-	 * For NULL/empty keys, this means equality, for normal keys we need to
-	 * compare the actual key value.
-	 */
-	if (buffer->category != GIN_CAT_NORM_KEY)
-		return true;
+		if (tup->category != cached->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need
+		 * to compare the actual key value.
+		 */
+		if (cached->category != GIN_CAT_NORM_KEY)
+			return true;
+
+		attnum = cached->attrnum;
+		bufkey = _gin_parse_tuple_key(cached);
+	}
+	else
+	{
+		if (tup->attrnum != buffer->attnum)
+			return false;
+
+		/* same attribute should have the same type info */
+		Assert(tup->typbyval == buffer->typbyval);
+		Assert(tup->typlen == buffer->typlen);
+
+		if (tup->category != buffer->category)
+			return false;
+
+		/*
+		 * For NULL/empty keys, this means equality, for normal keys we need to
+		 * compare the actual key value.
+		 */
+		if (buffer->category != GIN_CAT_NORM_KEY)
+			return true;
+		attnum = buffer->attnum;
+		bufkey = buffer->key;
+	}
 
 	/*
 	 * For the tuple, get either the first sizeof(Datum) bytes for byval
 	 * types, or a pointer to the beginning of the data array.
 	 */
-	tupkey = (buffer->typbyval) ? *(Datum *) tup->data : PointerGetDatum(tup->data);
+	tupkey = _gin_parse_tuple_key(tup);
 
-	r = ApplySortComparator(buffer->key, false,
+	/*
+	 * We can be called from within TupleSort territories, which requires
+	 * us to not allocate in its memory context. To comply with that
+	 * requirement, use the buffer context instead.
+	 */
+	prev = MemoryContextSwitchTo(buffer->context);
+
+	r = ApplySortComparator(bufkey, false,
 							tupkey, false,
-							&buffer->ssup[buffer->attnum - 1]);
+							&buffer->ssup[attnum - 1]);
+
+	MemoryContextSwitchTo(prev);
 
 	return (r == 0);
 }
@@ -1372,6 +1437,8 @@ GinBufferKeyEquals(GinBuffer *buffer, GinTuple *tup)
 static bool
 GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 {
+	Assert(!buffer->cached);
+
 	/* not enough TIDs to trim (1024 is somewhat arbitrary number) */
 	if (buffer->nfrozen < 1024)
 		return false;
@@ -1388,7 +1455,72 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
 }
 
 /*
- * GinBufferStoreTuple
+ * Unpack the buffered tuple: we're about to use or merge
+ * the contained TIDs or data.
+ */
+static void
+GinBufferUnpackCached(GinBuffer *buffer, int reserve_space)
+{
+	Datum		key;
+	GinTuple   *cached;
+	int			totitems;
+
+	Assert(buffer->cached != NULL);
+	Assert(buffer->nitems == 0);
+
+	cached = buffer->cached;
+	totitems = cached->nitems + reserve_space;
+	key = _gin_parse_tuple_key(cached);
+
+	buffer->category = cached->category;
+	buffer->keylen = cached->keylen;
+	buffer->attnum = cached->attrnum;
+
+	buffer->typlen = cached->typlen;
+	buffer->typbyval = cached->typbyval;
+
+	if (cached->category == GIN_CAT_NORM_KEY)
+		buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
+	else
+		buffer->key = (Datum) 0;
+
+	/*
+	 * Ensure we can unpack all item pointers into the buffer's item array.
+	 */
+	if (buffer->items == NULL)
+	{
+		Size	maxitems = Max(buffer->maxitems, totitems);
+		buffer->items = palloc0(maxitems * sizeof(ItemPointerData));
+		buffer->maxitems = maxitems;
+	}
+	else if (buffer->maxitems < totitems)
+	{
+		buffer->items = repalloc(buffer->items,
+								 totitems * sizeof(ItemPointerData));
+		buffer->maxitems = totitems;
+	}
+	else
+	{
+		Assert(PointerIsValid(buffer->items) &&
+			   buffer->maxitems >= totitems);
+	}
+
+	/* Unpack the item pointers directly into the buffer's items array */
+	_gin_parse_tuple_items_into(cached, buffer->items,
+								buffer->maxitems);
+
+	buffer->nitems = cached->nitems;
+	buffer->nfrozen = 1;
+
+	buffer->cached = NULL;
+
+	AssertCheckItemPointers(buffer);
+
+	pfree(cached);
+}
+
+/*
+ * GinBufferStoreOrMergeTuple
  *		Add data (especially TID list) from a GIN tuple to the buffer.
  *
  * The buffer is expected to be empty (in which case it's initialized), or
@@ -1410,31 +1542,42 @@ GinBufferShouldTrim(GinBuffer *buffer, GinTuple *tup)
  * workers. But the workers merge the items as much as possible, so there
  * should not be too many.
  */
-static void
-GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
+void
+GinBufferStoreOrMergeTuple(GinBuffer *buffer, GinTuple *tup)
 {
-	ItemPointerData *items;
-	Datum		key;
+	MemoryContext prev;
 
+	prev = MemoryContextSwitchTo(buffer->context);
 	AssertCheckGinBuffer(buffer);
 
-	key = _gin_parse_tuple_key(tup);
-	items = _gin_parse_tuple_items(tup);
-
 	/* if the buffer is empty, set the fields (and copy the key) */
-	if (GinBufferIsEmpty(buffer))
+	if (buffer->nitems == 0 && buffer->cached == NULL)
 	{
-		buffer->category = tup->category;
-		buffer->keylen = tup->keylen;
-		buffer->attnum = tup->attrnum;
+		/*
+		 * Buffer is actually empty, so move the GinTuple into the right
+		 * context and then put it away for later use
+		 */
+		GinTuple   *tuple = palloc(tup->tuplen);
 
-		buffer->typlen = tup->typlen;
-		buffer->typbyval = tup->typbyval;
+		memcpy(tuple, tup, tup->tuplen);
+		buffer->cached = tuple;
+		MemoryContextSwitchTo(prev);
+		return;
+	}
 
-		if (tup->category == GIN_CAT_NORM_KEY)
-			buffer->key = datumCopy(key, buffer->typbyval, buffer->typlen);
-		else
-			buffer->key = (Datum) 0;
+	if (buffer->nitems == 0)
+	{
+		Assert(buffer->cached);
+		/*
+		 * We skipped decoding the previous GIN tuple, but now we definitely
+		 * need to merge the tuples, so we can't stave off deforming the
+		 * cached GIN tuple any longer.
+		 */
+		GinBufferUnpackCached(buffer, tup->nitems);
+		
+		Assert(buffer->nitems > 0);
+		Assert(buffer->nfrozen == 1);
+		Assert(buffer->cached == NULL);
 	}
 
 	/*
@@ -1453,7 +1596,7 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 	 */
 	if ((buffer->nitems > 0) &&
 		(ItemPointerCompare(&buffer->items[buffer->nitems - 1],
-							GinTupleGetFirst(tup)) == 0))
+							GinTupleGetFirst(tup)) <= 0))
 		buffer->nfrozen = buffer->nitems;
 
 	/*
@@ -1479,54 +1622,142 @@ GinBufferStoreTuple(GinBuffer *buffer, GinTuple *tup)
 		buffer->nfrozen++;
 	}
 
-	/* add the new TIDs into the buffer, combine using merge-sort */
+	/*
+	 * Grow the buffer if we need to.
+	 */
+	if (buffer->nitems + tup->nitems > buffer->maxitems)
+	{
+		Size	size = sizeof(ItemPointerData) * (buffer->nitems + tup->nitems);
+		if (buffer->items == NULL)
+			buffer->items = palloc(size);
+		else
+			buffer->items = repalloc(buffer->items, size);
+
+		buffer->maxitems = (buffer->nitems + tup->nitems);
+	}
+
+	Assert(buffer->maxitems >= buffer->nitems + tup->nitems);
+
+	/* Add the new TIDs into the buffer, combine using merge-sort if needed */
 	{
 		int			nnew;
 		ItemPointer new;
 
 		/*
-		 * Resize the array - we do this first, because we'll dereference the
-		 * first unfrozen TID, which would fail if the array is NULL. We'll
-		 * still pass 0 as number of elements in that array though.
+		 * If the array wasn't allocated yet, do so now.
+		 *
+		 * Note that by now we know that buffer->maxitems is large enough to
+		 * fit all tuples, so we only need to allocate the data.
 		 */
 		if (buffer->items == NULL)
-			buffer->items = palloc((buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		{
+			Assert(buffer->nitems == 0);
+			Assert(buffer->nfrozen == 0);
+
+			buffer->items = palloc(buffer->maxitems * sizeof(ItemPointerData));
+		}
+
+		/*
+		 * If the incoming data is completely after the current items, we can
+		 * just decode the TIDs directly into the buffer's items array, saving
+		 * allocations and memcpy's.
+		 */
+		if (likely(buffer->nfrozen == buffer->nitems))
+		{
+			_gin_parse_tuple_items_into(tup, &buffer->items[buffer->nitems],
+										buffer->maxitems - buffer->nitems);
+		}
 		else
-			buffer->items = repalloc(buffer->items,
-									 (buffer->nitems + tup->nitems) * sizeof(ItemPointerData));
+		{
+			ItemPointerData *items;
+			Assert(buffer->nfrozen < buffer->nitems);
+			items = _gin_parse_tuple_items(tup);
 
-		new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfronzen */
-								   (buffer->nitems - buffer->nfrozen),	/* num of unfrozen */
-								   items, tup->nitems, &nnew);
+			new = ginMergeItemPointers(&buffer->items[buffer->nfrozen], /* first unfronzen */
+									   (buffer->nitems - buffer->nfrozen),    /* num of unfrozen */
+									   items, tup->nitems, &nnew);
 
-		Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+			Assert(nnew == (tup->nitems + (buffer->nitems - buffer->nfrozen)));
+			Assert(buffer->maxitems >= buffer->nfrozen + nnew);
 
-		memcpy(&buffer->items[buffer->nfrozen], new,
-			   nnew * sizeof(ItemPointerData));
+			memcpy(&buffer->items[buffer->nfrozen], new,
+				   nnew * sizeof(ItemPointerData));
 
-		pfree(new);
+			pfree(new);
+			/* free the decompressed TID list */
+			pfree(items);
+		}
 
 		buffer->nitems += tup->nitems;
+		/*
+		 * The first TID of the incoming item is the lowest we'll see
+		 * in this run, so we can always mark that one as frozen.
+		 */
+		buffer->nfrozen++;
 
+		/* Check the data is still consistent */
 		AssertCheckItemPointers(buffer);
 	}
 
-	/* free the decompressed TID list */
-	pfree(items);
+	MemoryContextSwitchTo(prev);
+}
+
+/*
+ * Build a GinTuple from the buffer's contents.
+ *
+ * On exit, the buffer has been reset.
+ */
+GinTuple *
+GinBufferBuildTuple(GinBuffer *buffer)
+{
+	MemoryContext prev = MemoryContextSwitchTo(buffer->context);
+	GinTuple   *result;
+
+	if (buffer->cached)
+	{
+		Assert(buffer->nitems == 0);
+		result = buffer->cached;
+		buffer->cached = NULL;
+	}
+	else
+	{
+		result = _gin_build_tuple(buffer->attnum, buffer->category,
+								  buffer->key, buffer->typlen,
+								  buffer->typbyval, buffer->items,
+								  buffer->nitems);
+
+		GinBufferReset(buffer);
+	}
+
+	MemoryContextSwitchTo(prev);
+	return result;
 }
 
 /*
  * GinBufferReset
  *		Reset the buffer into a state as if it contains no data.
  */
-static void
+void
 GinBufferReset(GinBuffer *buffer)
 {
 	Assert(!GinBufferIsEmpty(buffer));
 
-	/* release byref values, do nothing for by-val ones */
-	if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
-		pfree(DatumGetPointer(buffer->key));
+	/* release cached buffer tuple, if present */
+	if (buffer->cached)
+	{
+		Assert(buffer->nitems == 0);
+		pfree(buffer->cached);
+		buffer->cached = NULL;
+	}
+	else
+	{
+		/* release byref values, do nothing for by-val ones */
+		if ((buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval
+			&& PointerIsValid(DatumGetPointer(buffer->key)))
+		{
+			pfree(DatumGetPointer(buffer->key));
+		}
+	}
 
 	/*
 	 * Not required, but makes it more likely to trigger NULL derefefence if
@@ -1542,6 +1773,14 @@ GinBufferReset(GinBuffer *buffer)
 
 	buffer->typlen = 0;
 	buffer->typbyval = 0;
+
+	/*
+	 * We don't reset the memory context, as that contains the items array,
+	 * which we don't want to have to re-allocate every time it gets huge.
+	 *
+	 * That's not all that likely, but still too expensive to do repeatedly
+	 * inside tuplesort code.
+	 */
 }
 
 /*
@@ -1565,7 +1804,7 @@ GinBufferTrim(GinBuffer *buffer)
  * GinBufferFree
  *		Release memory associated with the GinBuffer (including TID array).
  */
-static void
+void
 GinBufferFree(GinBuffer *buffer)
 {
 	if (buffer->items)
@@ -1576,6 +1815,7 @@ GinBufferFree(GinBuffer *buffer)
 		(buffer->category == GIN_CAT_NORM_KEY) && !buffer->typbyval)
 		pfree(DatumGetPointer(buffer->key));
 
+	MemoryContextDelete(buffer->context);
 	pfree(buffer);
 }
 
@@ -1585,7 +1825,7 @@ GinBufferFree(GinBuffer *buffer)
  *
  * Returns true if the buffer is either empty or for the same index key.
  */
-static bool
+bool
 GinBufferCanAddKey(GinBuffer *buffer, GinTuple *tup)
 {
 	/* empty buffer can accept data for any key */
@@ -1682,6 +1922,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1708,6 +1949,7 @@ _gin_parallel_merge(GinBuildState *state)
 			 * GinTuple.
 			 */
 			AssertCheckItemPointers(buffer);
+			Assert(!PointerIsValid(buffer->cached));
 
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
@@ -1721,7 +1963,11 @@ _gin_parallel_merge(GinBuildState *state)
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferStoreOrMergeTuple(buffer, tup);
+
+		/* Unpack the cached tuple, if it got cached */
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, 0);
 
 		/* Report progress */
 		pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
@@ -1732,6 +1978,7 @@ _gin_parallel_merge(GinBuildState *state)
 	if (!GinBufferIsEmpty(buffer))
 	{
 		AssertCheckItemPointers(buffer);
+		Assert(!PointerIsValid(buffer->cached));
 
 		ginEntryInsert(&state->ginstate,
 					   buffer->attnum, buffer->key, buffer->category,
@@ -1871,6 +2118,9 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 			GinBufferReset(buffer);
 		}
 
+		if (buffer->cached)
+			GinBufferUnpackCached(buffer, tup->nitems);
+
 		/*
 		 * We're about to add a GIN tuple to the buffer - check the memory
 		 * limit first, and maybe write out some of the data into the index
@@ -1907,7 +2157,7 @@ _gin_process_worker_data(GinBuildState *state, Tuplesortstate *worker_sort,
 		 * Remember data for the current tuple (either remember the new key,
 		 * or append if to the existing data).
 		 */
-		GinBufferStoreTuple(buffer, tup);
+		GinBufferStoreOrMergeTuple(buffer, tup);
 	}
 
 	/* flush data remaining in the buffer (for the last key) */
@@ -2351,17 +2601,36 @@ _gin_parse_tuple_items(GinTuple *a)
 {
 	int			len;
 	char	   *ptr;
-	int			ndecoded;
 	ItemPointer items;
 
 	len = a->tuplen - SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
 	ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
 
-	items = ginPostingListDecodeAllSegments((GinPostingList *) ptr, len, &ndecoded);
+	items = palloc(a->nitems * sizeof(ItemPointerData));
+
+	ginPostingListDecodeAllSegmentsInto((GinPostingList *) ptr, len,
+										items, a->nitems);
+
+	return items;
+}
+
+/*
+ * _gin_parse_tuple_items_into
+ *		Decompress GinTuple's TIDs into the given TID array.
+ */
+static void
+_gin_parse_tuple_items_into(GinTuple *a, ItemPointer items, int nspace)
+{
+	int			len;
+	char	   *ptr;
+
+	Assert(nspace >= a->nitems && PointerIsValid(items));
 
-	Assert(ndecoded == a->nitems);
+	len = a->tuplen - SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
+	ptr = (char *) a + SHORTALIGN(offsetof(GinTuple, data) + a->keylen);
 
-	return (ItemPointer) items;
+	ginPostingListDecodeAllSegmentsInto((GinPostingList *) ptr, len,
+										items, a->nitems);
 }
 
 /*
diff --git a/src/backend/access/gin/ginpostinglist.c b/src/backend/access/gin/ginpostinglist.c
index 48eadec87b0..671d1009a11 100644
--- a/src/backend/access/gin/ginpostinglist.c
+++ b/src/backend/access/gin/ginpostinglist.c
@@ -288,6 +288,73 @@ ginPostingListDecode(GinPostingList *plist, int *ndecoded_out)
 										   ndecoded_out);
 }
 
+/*
+ * Decode compressed posting lists into a pre-allocated array of item
+ * pointers.
+ * The posting lists must contain a total of exactly nptrs item pointers.
+ *
+ * See also ginPostingListDecodeAllSegments, which allocates a new
+ * item pointer array.
+ */
+void
+ginPostingListDecodeAllSegmentsInto(GinPostingList *segment, int len,
+									ItemPointer out, int nptrs)
+{
+	uint64		val;
+	char	   *endseg = ((char *) segment) + len;
+	int			ndecoded;
+	unsigned char *ptr;
+	unsigned char *endptr;
+
+	ndecoded = 0;
+
+	while ((char *) segment < endseg)
+	{
+		/* fail if we're decoding more items than expected */
+		if (ndecoded >= nptrs)
+		{
+			elog(ERROR,
+				 "Too many items to decode, expected %u, now at %u and counting",
+				 nptrs, ndecoded);
+		}
+
+		/* copy the first item */
+		Assert(OffsetNumberIsValid(ItemPointerGetOffsetNumber(&segment->first)));
+		Assert(ndecoded == 0 || ginCompareItemPointers(&segment->first, &out[ndecoded - 1]) > 0);
+		out[ndecoded] = segment->first;
+		ndecoded++;
+
+		val = itemptr_to_uint64(&segment->first);
+		ptr = segment->bytes;
+		endptr = segment->bytes + segment->nbytes;
+		while (ptr < endptr)
+		{
+			/* fail if we're decoding more items than expected */
+			if (ndecoded >= nptrs)
+			{
+				unsigned int minremaining = ((endseg - (char *) ptr) / sizeof(ItemPointerData));
+
+				elog(ERROR,
+					 "Too many items to decode, expected %u, now at %u and counting",
+					 (unsigned int) nptrs,
+					 (unsigned int) ndecoded + minremaining);
+			}
+
+			val += decode_varbyte(&ptr);
+
+			uint64_to_itemptr(val, &out[ndecoded]);
+			ndecoded++;
+		}
+		segment = GinNextPostingListSegment(segment);
+	}
+	
+	if (ndecoded != nptrs)
+	{
+		elog(ERROR, "Invalid decode count: Expected %d, got %d",
+			 nptrs, ndecoded);
+	}
+}
+
 /*
  * Decode multiple posting list segments into an array of item pointers.
  * The number of items is returned in *ndecoded_out. The segments are stored
diff --git a/src/backend/utils/sort/tuplesortvariants.c b/src/backend/utils/sort/tuplesortvariants.c
index 79bd29aa90e..ff4e4405796 100644
--- a/src/backend/utils/sort/tuplesortvariants.c
+++ b/src/backend/utils/sort/tuplesortvariants.c
@@ -88,8 +88,11 @@ static void writetup_index_brin(Tuplesortstate *state, LogicalTape *tape,
 								SortTuple *stup);
 static void readtup_index_brin(Tuplesortstate *state, SortTuple *stup,
 							   LogicalTape *tape, unsigned int len);
-static void writetup_index_gin(Tuplesortstate *state, LogicalTape *tape,
-							   SortTuple *stup);
+static void writetup_index_gin_to_tape(Tuplesortstate *state,
+									   LogicalTape *tape, GinTuple *tuple);
+static void writetup_index_gin_buffered(Tuplesortstate *state,
+										LogicalTape *tape, SortTuple *stup);
+static void flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape);
 static void readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 							  LogicalTape *tape, unsigned int len);
 static int	comparetup_datum(const SortTuple *a, const SortTuple *b,
@@ -101,6 +104,7 @@ static void writetup_datum(Tuplesortstate *state, LogicalTape *tape,
 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
 						  LogicalTape *tape, unsigned int len);
 static void freestate_cluster(Tuplesortstate *state);
+static void freestate_index_gin(Tuplesortstate *state);
 
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the CLUSTER case.  Set by
@@ -135,6 +139,16 @@ typedef struct
 	bool		uniqueNullsNotDistinct; /* unique constraint null treatment */
 } TuplesortIndexBTreeArg;
 
+/*
+ * Data structure pointed by "TuplesortPublic.arg" for the index_gin subcase.
+ */
+typedef struct
+{
+	TuplesortIndexArg index;
+	GinBuffer  *buffer;
+} TuplesortIndexGinArg;
+
+
 /*
  * Data structure pointed by "TuplesortPublic.arg" for the index_hash subcase.
  */
@@ -593,6 +607,7 @@ tuplesort_begin_index_gin(Relation heapRel,
 	Tuplesortstate *state = tuplesort_begin_common(workMem, coordinate,
 												   sortopt);
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg;
 	MemoryContext oldcontext;
 	int			i;
 	TupleDesc	desc = RelationGetDescr(indexRel);
@@ -617,6 +632,10 @@ tuplesort_begin_index_gin(Relation heapRel,
 	/* Prepare SortSupport data for each column */
 	base->sortKeys = (SortSupport) palloc0(base->nKeys *
 										   sizeof(SortSupportData));
+	arg = palloc0(sizeof(TuplesortIndexGinArg));
+	arg->index.indexRel = indexRel;
+	arg->index.heapRel = heapRel;
+	arg->buffer = GinBufferInit(indexRel);
 
 	for (i = 0; i < base->nKeys; i++)
 	{
@@ -645,10 +664,12 @@ tuplesort_begin_index_gin(Relation heapRel,
 
 	base->removeabbrev = removeabbrev_index_gin;
 	base->comparetup = comparetup_index_gin;
-	base->writetup = writetup_index_gin;
+	base->writetup = writetup_index_gin_buffered;
+	base->flushwrites = flushwrites_index_gin;
 	base->readtup = readtup_index_gin;
+	base->freestate = freestate_index_gin;
 	base->haveDatum1 = false;
-	base->arg = NULL;
+	base->arg = arg;
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -1930,11 +1951,13 @@ comparetup_index_gin(const SortTuple *a, const SortTuple *b,
 							   base->sortKeys);
 }
 
+/*
+ * Write the GinTuple to the tape.
+ */
 static void
-writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+writetup_index_gin_to_tape(Tuplesortstate *state, LogicalTape *tape, GinTuple *tuple)
 {
 	TuplesortPublic *base = TuplesortstateGetPublic(state);
-	GinTuple   *tuple = (GinTuple *) stup->tuple;
 	unsigned int tuplen = tuple->tuplen;
 
 	tuplen = tuplen + sizeof(tuplen);
@@ -1944,6 +1967,53 @@ writetup_index_gin(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
 		LogicalTapeWrite(tape, &tuplen, sizeof(tuplen));
 }
 
+/*
+ * Merge or write the tuple to the GinBuffer if possible, flushing any
+ * conflicting state to disk when and where required.
+ */
+static void
+writetup_index_gin_buffered(Tuplesortstate *state, LogicalTape *tape, SortTuple *stup)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	GinTuple   *otup;
+	GinTuple   *ntup = (GinTuple *) stup->tuple;
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(PointerIsValid(arg));
+
+	if (GinBufferCanAddKey(arg->buffer, ntup))
+	{
+		GinBufferStoreOrMergeTuple(arg->buffer, ntup);
+		return;
+	}
+
+	otup = GinBufferBuildTuple(arg->buffer);
+
+	writetup_index_gin_to_tape(state, tape, otup);
+
+	pfree(otup);
+
+	Assert(GinBufferCanAddKey(arg->buffer, ntup));
+
+	GinBufferStoreOrMergeTuple(arg->buffer, ntup);
+}
+
+static void
+flushwrites_index_gin(Tuplesortstate *state, LogicalTape *tape)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	if (!GinBufferIsEmpty(arg->buffer))
+	{
+		GinTuple   *tuple = GinBufferBuildTuple(arg->buffer);
+
+		writetup_index_gin_to_tape(state, tape, tuple);
+		pfree(tuple);
+		Assert(GinBufferIsEmpty(arg->buffer));
+	}
+}
+
 static void
 readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 				  LogicalTape *tape, unsigned int len)
@@ -1969,6 +2039,17 @@ readtup_index_gin(Tuplesortstate *state, SortTuple *stup,
 	stup->datum1 = (Datum) 0;
 }
 
+static void
+freestate_index_gin(Tuplesortstate *state)
+{
+	TuplesortPublic *base = TuplesortstateGetPublic(state);
+	TuplesortIndexGinArg *arg = (TuplesortIndexGinArg *) base->arg;
+
+	Assert(arg != NULL);
+	Assert(GinBufferIsEmpty(arg->buffer));
+	GinBufferFree(arg->buffer);
+}
+
 /*
  * Routines specialized for DatumTuple case
  */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9840060997f..522e98109ae 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3037,6 +3037,7 @@ TuplesortClusterArg
 TuplesortDatumArg
 TuplesortIndexArg
 TuplesortIndexBTreeArg
+TuplesortIndexGinArg
 TuplesortIndexHashArg
 TuplesortInstrumentation
 TuplesortMethod
-- 
2.45.2

#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#56)
Re: Parallel CREATE INDEX for GIN indexes

Tomas Vondra <tomas@vondra.me> writes:

I pushed the two smaller parts today.

Coverity is a little unhappy about this business in
_gin_begin_parallel:

bool leaderparticipates = true;
...
#ifdef DISABLE_LEADER_PARTICIPATION
leaderparticipates = false;
#endif
...
scantuplesortstates = leaderparticipates ? request + 1 : request;

It says

CID 1644203: Possible Control flow issues (DEADCODE)
Execution cannot reach the expression "request" inside this statement: "scantuplesortstates = (lead...".

924 scantuplesortstates = leaderparticipates ? request + 1 : request;

If this were just temporary code I'd let it pass, but I see nothing
replacing this logic in the follow-up patches, so I think we ought
to do something to shut it up.

It's not complaining about the later bits like

if (leaderparticipates)
ginleader->nparticipanttuplesorts++;

(perhaps because there's no dead code there?) So one idea is

scantuplesortstates = request;
if (leaderparticipates)
scantuplesortstates++;

which would look more like the other code anyway.

regards, tom lane

#59Tomas Vondra
tomas@vondra.me
In reply to: Tom Lane (#58)
Re: Parallel CREATE INDEX for GIN indexes

On 3/9/25 17:38, Tom Lane wrote:

Tomas Vondra <tomas@vondra.me> writes:

I pushed the two smaller parts today.

Coverity is a little unhappy about this business in
_gin_begin_parallel:

bool leaderparticipates = true;
...
#ifdef DISABLE_LEADER_PARTICIPATION
leaderparticipates = false;
#endif
...
scantuplesortstates = leaderparticipates ? request + 1 : request;

It says

CID 1644203: Possible Control flow issues (DEADCODE)
Execution cannot reach the expression "request" inside this statement: "scantuplesortstates = (lead...".

924 scantuplesortstates = leaderparticipates ? request + 1 : request;

If this were just temporary code I'd let it pass, but I see nothing
replacing this logic in the follow-up patches, so I think we ought
to do something to shut it up.

It's not complaining about the later bits like

if (leaderparticipates)
ginleader->nparticipanttuplesorts++;

(perhaps because there's no dead code there?) So one idea is

scantuplesortstates = request;
if (leaderparticipates)
scantuplesortstates++;

which would look more like the other code anyway.

I don't mind doing it differently, but this code is just a copy from
_bt_begin_parallel. So how come Coverity does not complain about that?
Or is that whitelisted?

thanks

--
Tomas Vondra

#60Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tomas Vondra (#59)
Re: Parallel CREATE INDEX for GIN indexes

Tomas Vondra <tomas@vondra.me> writes:

On 3/9/25 17:38, Tom Lane wrote:

Coverity is a little unhappy about this business in
_gin_begin_parallel:

I don't mind doing it differently, but this code is just a copy from
_bt_begin_parallel. So how come Coverity does not complain about that?
Or is that whitelisted?

Ah. Most likely somebody dismissed it years ago. Given that
precedent, I'm content to dismiss this one too.

regards, tom lane

#61Peter Geoghegan
pg@bowt.ie
In reply to: Tom Lane (#60)
Re: Parallel CREATE INDEX for GIN indexes

On Sun, Mar 9, 2025 at 6:23 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Ah. Most likely somebody dismissed it years ago. Given that
precedent, I'm content to dismiss this one too.

It is dead code, unless somebody decides to #define
DISABLE_LEADER_PARTICIPATION to debug a problem.

--
Peter Geoghegan

#62Alexander Korotkov
aekorotkov@gmail.com
In reply to: Matthias van de Meent (#57)
Re: Parallel CREATE INDEX for GIN indexes

Hi Matthias,

On Fri, Mar 7, 2025 at 4:08 AM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:

On Tue, 4 Mar 2025 at 20:50, Tomas Vondra <tomas@vondra.me> wrote:

I pushed the two smaller parts today.

Here's the remaining two parts, to keep cfbot happy. I don't expect to
get these into PG18, though.

As promised on- and off-list, here's the 0001 patch, polished, split,
and further adapted for performance.

As seen before, it reduces tempspace requirements by up to 50%. I've
not tested this against HEAD for performance.

It has been split into:

0001: Some API cleanup/changes that crept into the patch. This
removes manual length-passing from the gin tuplesort APIs, instead
relying on GinTuple's tuplen field. It's not critical for anything,
and could be ignored if so desired.

0002: Tuplesort changes to allow TupleSort users to buffer and merge
tuples during the sort operations.
The patch was pulled directly from [0] (which was derived from earlier
work in this thread), is fairly easy to understand, and has no other
moving parts.

0003: Deduplication in tuplesort's flush-to-disk actions, utilizing the
API introduced with 0002.
This reduces temporary disk usage by deduplicating data even further,
which helps when there's a lot of duplicated data but enough distinct
values that it doesn't fit in the available memory.

0004: Use a single tuplesort. This removes the worker-local tuplesort
in favor of only storing data in the global one.

This mainly reduces the code size and complexity of parallel GIN
builds; we already were using that global sort for various tasks.

Open questions and open items for this:
- I did not yet update the pg_stat_progress systems, nor docs.
- Maybe 0003 needs further splitting up, one for the optimizations in
GinBuffer, one for the tuplesort buffering.

Yes, please. That would simplify the detailed review.

- Maybe we need to trim the buffer in gin's tuplesort flush?

I didn't get it. Could you please elaborate more on this?

- Maybe we should grow the GinBuffer->items array superlinearly rather
than to the exact size requirement of the merge operation.

+1 for this
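
For illustration, geometric growth along these lines would amortize the
repalloc costs (a minimal sketch against the GinBuffer fields used in the
0003 patch above; the doubling factor is an arbitrary choice):

/* Sketch: grow the items array superlinearly instead of exactly */
if (buffer->nitems + tup->nitems > buffer->maxitems)
{
	int		newmax = Max(buffer->maxitems * 2,
						 buffer->nitems + tup->nitems);

	if (buffer->items == NULL)
		buffer->items = palloc(newmax * sizeof(ItemPointerData));
	else
		buffer->items = repalloc(buffer->items,
								 newmax * sizeof(ItemPointerData));
	buffer->maxitems = newmax;
}

Each TID then gets moved O(1) times on average, at the cost of up to 2x
overallocation for the array.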

Apart from the complexities in 0003, I think the changes are fairly
straightforward.

Yes, 0001, 0002 and 0004 are pretty straightforward.

Regarding 0003, having separate deformed tuple in GinBuffer and cached
tuples looks a bit cumbersome. Could we simplify this? I understand
that we need to decompress items array lazily. But could we leave
just items-related fields in GinBuffer, but have the rest always in
GinBuffer.cached? So, if GinBuffer.items != NULL then we have items
decompressed already, otherwise have to decompress them when needed.
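
Roughly along these lines (a sketch of the idea only, not code from any of
the patches; the key, category and type info would always be read from the
serialized tuple via _gin_parse_tuple_key and friends):

struct GinBuffer
{
	MemoryContext context;
	GinTuple   *cached;		/* serialized key + posting list, always set */

	/* lazily decompressed TIDs; NULL until a merge is actually needed */
	ItemPointerData *items;
	int			nitems;
	int			maxitems;
	int			nfrozen;

	SortSupport ssup;		/* for sorting/comparing keys */
};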

------
Regards,
Alexander Korotkov
Supabase

#63Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#56)
Re: Parallel CREATE INDEX for GIN indexes

Hi,

On 2025-03-04 20:50:43 +0100, Tomas Vondra wrote:

I pushed the two smaller parts today.

Here's the remaining two parts, to keep cfbot happy. I don't expect to
get these into PG18, though.

If that's the case, could we either close the CF entry, or move it to the next
fest?

Greetings,

Andres Freund

#64Tomas Vondra
tomas@vondra.me
In reply to: Andres Freund (#63)
Re: Parallel CREATE INDEX for GIN indexes

On 4/2/25 18:43, Andres Freund wrote:

Hi,

On 2025-03-04 20:50:43 +0100, Tomas Vondra wrote:

I pushed the two smaller parts today.

Here's the remaining two parts, to keep cfbot happy. I don't expect to
get these into PG18, though.

If that's the case, could we either close the CF entry, or move it to the next
fest?

Possibly. Let me check if I can get some of the patches posted by
Matthias after I wrote that. If not, I'll move it to the next CF.

thanks

--
Tomas Vondra

#65Kirill Reshke
reshkekirill@gmail.com
In reply to: Tomas Vondra (#64)
Re: Parallel CREATE INDEX for GIN indexes

On Thu, 3 Apr 2025 at 01:19, Tomas Vondra <tomas@vondra.me> wrote:

On 4/2/25 18:43, Andres Freund wrote:

Hi,

On 2025-03-04 20:50:43 +0100, Tomas Vondra wrote:

I pushed the two smaller parts today.

Here's the remaining two parts, to keep cfbot happy. I don't expect to
get these into PG18, though.

If that's the case, could we either close the CF entry, or move it to the next
fest?

Possibly. Let me check if I can get some of the patches posted by
Matthias after I wrote that. If not, I'll move it to the next CF.

Looks like no more patches will get into v18, so I moved the CF entry.

--
Best regards,
Kirill Reshke

#66Vinod Sridharan
vsridh90@gmail.com
In reply to: Tomas Vondra (#24)
1 attachment(s)
Re: Parallel CREATE INDEX for GIN indexes

Hello,
As part of testing this change I believe I found a scenario where the
parallel build seems to trigger OOMs for larger indexes. Specifically,
the calls for ginEntryInsert seem to leak memory into
TopTransactionContext and OOM/crash the outer process.
For serial build, the calls for ginEntryInsert tend to happen in a
temporary memory context that gets reset at the end of the
ginBuildCallback.
For inserts, the call has a custom memory context and gets reset at
the end of the insert.
For parallel build, during the merge phase, the MemoryContext isn't
swapped - and so this happens on the TopTransactionContext, and ends
up growing (especially for larger indexes).

I believe at the very least these should happen inside the tmpCtx
found in the GinBuildState and reset periodically.

In the attached patch, I've tried to do this, and I'm able to build
the index without OOMing, and only consuming maintenance_work_mem
through the merge process.

Would appreciate your thoughts on this (and whether there are other approaches
to resolve this too).

Thanks,
Vinod

On Thu, 17 Apr 2025 at 13:58, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 7/3/24 20:36, Matthias van de Meent wrote:

On Mon, 24 Jun 2024 at 02:58, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Here's a bit more cleaned up version, clarifying a lot of comments,
removing a bunch of obsolete comments, or comments speculating about
possible solutions, that sort of thing. I've also removed couple more
XXX comments, etc.

The main change however is that the sorting no longer relies on memcmp()
to compare the values. I did that because it was enough for the initial
WIP patches, and it worked till now - but the comments explained this
may not be a good idea if the data type allows the same value to have
multiple binary representations, or something like that.

I don't have a practical example to show an issue, but I guess if using
memcmp() was safe we'd be doing it in a bunch of places already, and
AFAIK we're not. And even if it happened to be OK, this is a probably
not the place where to start doing it.

I think one such example would be the values '5.00'::jsonb and
'5'::jsonb when indexed using GIN's jsonb_ops, though I'm not sure if
they're treated as having the same value inside the opclass' ordering.

Yeah, possibly. But doing the comparison the "proper" way seems to be
working pretty well, so I don't plan to investigate this further.

So I've switched this to use the regular data-type comparisons, with
SortSupport etc. There's a bit more cleanup remaining and testing
needed, but I'm not aware of any bugs.

A review of patch 0001:

---

src/backend/access/gin/gininsert.c | 1449 +++++++++++++++++++-

The nbtree code has `nbtsort.c` for its sort- and (parallel) build
state handling, which is exclusively used during index creation. As
the changes here seem to be largely related to bulk insertion, how
much effort would it be to split the bulk insertion code path into a
separate file?

Hmmm. I haven't tried doing that, but I guess it's doable. I assume we'd
want to do the move first, because it involves pre-existing code, and
then do the patch on top of that.

But what would be the benefit of doing that? I'm not sure doing it just
to make it look more like btree code is really worth it. Do you expect
the result to be clearer?

I noticed that new fields in GinBuildState do get to have a
bs_*-prefix, but none of the other added or previous fields of the
modified structs in gininsert.c have such prefixes. Could this be
unified?

Yeah, these are inconsistencies from copying the infrastructure code to
make the parallel builds work, etc.

+/* Magic numbers for parallel state sharing */
+#define PARALLEL_KEY_GIN_SHARED            UINT64CONST(0xB000000000000001)
...

These overlap with BRIN's keys; can we make them unique while we're at it?

We could, and I recall we had a similar discussion in the parallel BRIN
thread, right?. But I'm somewhat unsure why would we actually want/need
these keys to be unique. Surely, we don't need to mix those keys in the
single shm segment, right? So it seems more like an aesthetic thing. Or
is there some policy to have unique values for these keys?

+ * mutex protects all fields before heapdesc.

I can't find the field that this `heapdesc` might refer to.

Yeah, likely a leftover from copying a bunch of code and then removing
it without updating the comment. Will fix.

+_gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
...
+     if (!isconcurrent)
+        snapshot = SnapshotAny;
+    else
+        snapshot = RegisterSnapshot(GetTransactionSnapshot());

grumble: I know this is required from the index with the current APIs,
but I'm kind of annoyed that each index AM has to construct the table
scan and snapshot in their own code. I mean, this shouldn't be
meaningfully different across AMs, so every AM implementing this same
code makes me feel like we've got the wrong abstraction.

I'm not asking you to change this, but it's one more case where I'm
annoyed by the state of the system, but not quite enough yet to change
it.

Yeah, it's not great, but not something I intend to rework.

---

+++ b/src/backend/utils/sort/tuplesortvariants.c

I was thinking some more about merging tuples inside the tuplesort. I
realized that this could be implemented by allowing buffering of tuple
writes in writetup. This would require adding a flush operation at the
end of mergeonerun to store the final unflushed tuple on the tape, but
that shouldn't be too expensive. This buffering, when implemented
through e.g. a GinBuffer in TuplesortPublic->arg, could allow us to
merge the TID lists of same-valued GIN tuples while they're getting
stored and re-sorted, thus reducing the temporary space usage of the
tuplesort by some amount with limited overhead for other
non-deduplicating tuplesorts.

I've not yet spent the time to get this to work though, but I'm fairly
sure it'd use less temporary space than the current approach with the
2 tuplesorts, and could have lower overall CPU overhead as well
because the number of sortable items gets reduced much earlier in the
process.

Will respond to your later message about this.

---

+++ b/src/include/access/gin_tuple.h
+ typedef struct GinTuple

I think this needs some more care: currently, each GinTuple is at
least 36 bytes in size on 64-bit systems. By using int instead of Size
(no normal indexable tuple can be larger than MaxAllocSize), and
packing the fields better we can shave off 10 bytes; or 12 bytes if
GinTuple.keylen is further adjusted to (u)int16: a key needs to fit on
a page, so we can probably safely assume that the key size fits in
(u)int16.

Yeah, I guess using int64 is a bit excessive - you're right about that.
I'm not sure this is necessarily about "indexable tuples" (GinTuple is
not indexed, it's more an intermediate representation). But if we can
make it smaller, that probably can't hurt.

I don't have a great intuition on how beneficial this might be. For
cases with many TIDs per index key, it probably won't matter much. But
if there's many keys (so that GinTuples store only very few TIDs), it
might make a difference.

I'll try to measure the impact on the same "realistic" cases I used for
the earlier steps.
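
For reference, a packed layout along the suggested lines might look like
this (a sketch only, using the int/uint16 widths proposed above):

typedef struct GinTuple
{
	int			tuplen;		/* length of the whole tuple (int, not Size) */
	OffsetNumber attrnum;	/* index attribute number */
	uint16		keylen;		/* key length; a key must fit on a page */
	int16		typlen;		/* typlen of the key type */
	bool		typbyval;	/* typbyval of the key type */
	signed char category;	/* GinNullCategory */
	int			nitems;		/* number of TIDs in the posting list */
	char		data[FLEXIBLE_ARRAY_MEMBER];
} GinTuple;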

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

v1-0001-Fix-memory-leak-in-_gin_parallel_merge.patch
From 09614c5dc09e9f05e8aa157c029cf87de9cd4b72 Mon Sep 17 00:00:00 2001
From: Vinod Sridharan <vsridh90@gmail.com>
Date: Thu, 17 Apr 2025 13:51:41 -0700
Subject: [PATCH v1] Fix memory leak in _gin_parallel_merge

---
 src/backend/access/gin/gininsert.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index cfab93ec30c..05a1aea51c1 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -1616,6 +1616,7 @@ _gin_parallel_merge(GinBuildState *state)
 	Size		tuplen;
 	double		reltuples = 0;
 	GinBuffer  *buffer;
+	MemoryContext oldCtx;
 
 	/* GIN tuples from workers, merged by leader */
 	double		numtuples = 0;
@@ -1685,9 +1686,12 @@ _gin_parallel_merge(GinBuildState *state)
 			 */
 			AssertCheckItemPointers(buffer);
 
+			oldCtx = MemoryContextSwitchTo(state->tmpCtx);
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
 						   buffer->items, buffer->nitems, &state->buildStats);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextReset(state->tmpCtx);
 
 			/* discard the existing data */
 			GinBufferReset(buffer);
@@ -1711,9 +1715,12 @@ _gin_parallel_merge(GinBuildState *state)
 			 */
 			AssertCheckItemPointers(buffer);
 
+			oldCtx = MemoryContextSwitchTo(state->tmpCtx);
 			ginEntryInsert(&state->ginstate,
 						   buffer->attnum, buffer->key, buffer->category,
 						   buffer->items, buffer->nfrozen, &state->buildStats);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextReset(state->tmpCtx);
 
 			/* truncate the data we've just discarded */
 			GinBufferTrim(buffer);
-- 
2.25.1

#67Tomas Vondra
tomas@vondra.me
In reply to: Vinod Sridharan (#66)
Re: Parallel CREATE INDEX for GIN indexes

On 4/18/25 03:03, Vinod Sridharan wrote:

Hello,
As part of testing this change I believe I found a scenario where the
parallel build seems to trigger OOMs for larger indexes. Specifically,
the calls to ginEntryInsert seem to leak memory into
TopTransactionContext and OOM/crash the outer process.
For serial build, the calls to ginEntryInsert tend to happen in a
temporary memory context that gets reset at the end of the
ginBuildCallback.
For inserts, the call has a custom memory context and gets reset at
the end of the insert.
For parallel build, during the merge phase, the MemoryContext isn't
swapped - and so this happens on the TopTransactionContext, and ends
up growing (especially for larger indexes).

I believe at the very least these should happen inside the tmpCtx
found in the GinBuildState and reset periodically.

In the attached patch, I've tried to do this, and I'm able to build
the index without OOMing, and only consuming maintenance_work_mem
through the merge process.

Would appreciate your thoughts on this (and whether there are other approaches
to resolve this too).

Thanks for the report. I didn't have time to look at this in detail yet,
but the fix looks roughly correct. I've added this to the list of open
items for PG18.

regards

--
Tomas Vondra

#68Tomas Vondra
tomas@vondra.me
In reply to: Vinod Sridharan (#66)
Re: Parallel CREATE INDEX for GIN indexes

On 4/18/25 03:03, Vinod Sridharan wrote:

Hello,
As part of testing this change I believe I found a scenario where the
parallel build seems to trigger OOMs for larger indexes. Specifically,
the calls to ginEntryInsert seem to leak memory into
TopTransactionContext and OOM/crash the outer process.
For serial build, the calls to ginEntryInsert tend to happen in a
temporary memory context that gets reset at the end of the
ginBuildCallback.
For inserts, the call has a custom memory context and gets reset at
the end of the insert.
For parallel build, during the merge phase, the MemoryContext isn't
swapped - and so this happens on the TopTransactionContext, and ends
up growing (especially for larger indexes).

Yes, that's true. The ginBuildCallbackParallel() already releases memory
after flushing the in-memory state, but I missed that _gin_parallel_merge()
needs to be careful about memory usage too.

I haven't been able to trigger an OOM (or even particularly bad memory
usage), but I suppose it might be an issue with custom GIN opclasses with
much wider keys.

I believe at the very least these should happen inside the tmpCtx
found in the GinBuildState and reset periodically.

In the attached patch, I've tried to do this, and I'm able to build
the index without OOMing, and only consuming maintenance_work_mem
through the merge process.

Would appreciate your thoughts on this (and whether there are other approaches
to resolve this too).

The patch seems fine to me - I repeated the tests with mailing list
archives, with MemoryContextStats() in _gin_parallel_merge, and it
reliably minimizes the memory usage. So that's fine.

I was also worried if this might have performance impact, but it
actually seems to make it a little bit faster.

I'll get this pushed.

thanks

--
Tomas Vondra

#69Tomas Vondra
tomas@vondra.me
In reply to: Tomas Vondra (#68)
Re: Parallel CREATE INDEX for GIN indexes

On 4/30/25 14:39, Tomas Vondra wrote:

On 4/18/25 03:03, Vinod Sridharan wrote:

...

The patch seems fine to me - I repeated the tests with mailing list
archives, with MemoryContextStats() in _gin_parallel_merge, and it
reliably minimizes the memory usage. So that's fine.

I was also worried if this might have performance impact, but it
actually seems to make it a little bit faster.

I'll get this pushed.

And pushed, so it'll be in beta1.

Thanks!

--
Tomas Vondra

#70Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Tomas Vondra (#44)
Re: Parallel CREATE INDEX for GIN indexes

(Replying to old thread, because I happened to spot this while looking
at David Geier's proposal at:
https://www.postgresql.org/message-id/5d366878-2007-4d31-861e-19294b7a583b@gmail.com)

On 07/01/2025 13:59, Tomas Vondra wrote:

On 1/6/25 20:13, Matthias van de Meent wrote:

GinBufferInit

This seems to depend on the btree operator classes to get sortsupport
functions, bypassing the GIN compare support function (support
function 1) and adding a dependency on the btree opclasses for
indexable types. This can cause "bad" ordering, or failure to build
the index when the parallel path is chosen and no default btree
opclass is defined for the type. I think it'd be better if we allowed
users to specify which sortsupport function to use, or at least use
the correct compare function when it's defined on the attribute's
operator class.

Good point! I fixed this by copying the logic from initGinState.

I think tuplesort_begin_index_gin() has the same issue. It does this to
look up the comparison function:

/*
* Look for an ordering for the index key data type, and then the sort
* support function.
*/
typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);

That also looks up the key type's b-tree operator rather than the GIN
opclass's compare function.
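
Presumably the fix would mirror the GinBufferInit logic: prefer the
opclass's compare support function, and fall back to the btree ordering
operator only when the opclass doesn't provide one. Roughly (an untested
sketch, not a committed fix):

  RegProcedure cmpFunc = index_getprocid(indexRel, i + 1,
                                         GIN_COMPARE_PROC);

  if (OidIsValid(cmpFunc))
      PrepareSortSupportComparisonShim(cmpFunc, sortKey);
  else
  {
      typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
      PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);
  }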

- Heikki

#71Tomas Vondra
tomas@vondra.me
In reply to: Heikki Linnakangas (#70)
Re: Parallel CREATE INDEX for GIN indexes

On 1/11/26 21:31, Heikki Linnakangas wrote:

(Replying to old thread, because I happened to spot this while looking
at David Geier's proposal at:
https://www.postgresql.org/message-id/5d366878-2007-4d31-861e-19294b7a583b%40gmail.com)

On 07/01/2025 13:59, Tomas Vondra wrote:

On 1/6/25 20:13, Matthias van de Meent wrote:

GinBufferInit

This seems to depend on the btree operator classes to get sortsupport
functions, bypassing the GIN compare support function (support
function 1) and adding a dependency on the btree opclasses for
indexable types. This can cause "bad" ordering, or failure to build
the index when the parallel path is chosen and no default btree
opclass is defined for the type. I think it'd be better if we allowed
users to specify which sortsupport function to use, or at least use
the correct compare function when it's defined on the attribute's
operator class.

Good point! I fixed this by copying the logic from initGinState.

I think tuplesort_begin_index_gin() has the same issue. It does this to
look up the comparison function:

  /*
   * Look for an ordering for the index key data type, and then the sort
   * support function.
   */
  typentry = lookup_type_cache(att->atttypid, TYPECACHE_LT_OPR);
  PrepareSortSupportFromOrderingOp(typentry->lt_opr, sortKey);

That also looks up the key type's b-tree operator rather than the GIN
opclass's compare function.

Thanks for noticing this, I'll get this fixed next week.

Funny, you noticed that almost exactly one year after I fixed the other
incorrect place in the patch ;-)

regards

--
Tomas Vondra