GiST VACUUM

Started by Andrey Borodin almost 8 years ago; 82 messages
#1Andrey Borodin
x4mmm@yandex-team.ru
2 attachment(s)

Hi, hackers!

Here's new version of GiST VACUUM patch set aimed at CF 2018-09.
Original thread can be found at [0].

==Purpose==
Currently GiST never reuses pages, even if they are completely empty. This can lead to bloat if many index tuples are deleted from a part of the key space that does not receive newly inserted tuples.
First patch fixes that issue: empty pages are collected and reused for new page splits.

The second patch improves the scanning algorithm to read GiST blocks in physical order. This can dramatically increase the speed of scanning, especially on HDD.
This scan uses a relatively big chunk of memory to build a map of the whole GiST graph. If there is not enough maintenance memory, the patch falls back to the old GiST VACUUM (from the first patch).
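
To make the fallback concrete, the choice between the two scans boils down to the following (condensed from the second patch, not the exact code):

    /* Use the physical scan only if the graph map fits in maintenance memory */
    npages = RelationGetNumberOfBlocks(rel);
    if (npages * sizeof(GistPSItem) > maintenance_work_mem * 1024)
        rescan = gistbulkdeletelogicalscan(info, stats, callback, callback_state);
    else
        rescan = gistbulkdeletephysicalcan(info, stats, callback, callback_state, npages);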

==How logical scan works==
GiST VACUUM scans the graph depth-first to find removed tuples on leaf pages. It remembers the internal pages that reference completely empty leaf pages.
Then, in an additional step, these pages are rescanned to delete the references and mark the leaf pages as free.
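
In pseudocode, the two steps look roughly like this (a sketch only; the helper names are hypothetical, the real code is in the first patch):

    /* Step 1: DFS from the root, collecting parents of empty leaves */
    push(GIST_ROOT_BLKNO);
    while ((blkno = pop()) != InvalidBlockNumber)
    {
        page = read_and_lock(blkno);
        if (GistPageIsLeaf(page))
        {
            delete_dead_tuples(page, callback);
            if (page_is_now_empty(page))
                remember_parent_for_rescan(blkno);
        }
        else
            push_children(page);    /* also handles concurrent splits */
        unlock(page);
    }

    /* Step 2: revisit each remembered internal page under exclusive lock,
       re-check that its child leaves are still empty, drop the downlinks
       and mark those leaves as deleted. */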

==How physical scan works==
The scan builds an array of GistPSItem (physical scan item), with one item per GiST block. A GistPSItem contains a List of offsets pointing to potentially empty leaf pages, plus the information necessary to collect that list in one sequential read of the file.
When we encounter a leaf page, if it is empty, we mark it as empty and jump to its parent (if known) to update the parent's list.
When we encounter an internal page, we check the GistPSItem of every child block to see whether it is an empty leaf, and record the parent pointer there otherwise.
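
Condensed from the patch, the per-block bookkeeping is (page_is_now_empty stands in for the real emptiness check):

    if (GistPageIsLeaf(page) && page_is_now_empty)
    {
        if (graph[blkno].flags & GIST_PS_HAS_PARENT)
        {
            /* Parent was scanned earlier: append my offset to its list */
            BlockNumber parent = graph[blkno].parent;
            graph[parent].emptyLeafOffsets =
                lappend_int(graph[parent].emptyLeafOffsets,
                            graph[blkno].parentOffset);
        }
        else
            graph[blkno].flags |= GIST_PS_EMPTY_LEAF;  /* parent collects me later */
    }
    else if (!GistPageIsLeaf(page))
    {
        /* for each downlink i pointing to child block childblkno: */
        if (graph[childblkno].flags & GIST_PS_EMPTY_LEAF)
            graph[blkno].emptyLeafOffsets =
                lappend_int(graph[blkno].emptyLeafOffsets, i);
        else
        {
            graph[childblkno].parent = blkno;   /* child will find me later */
            graph[childblkno].parentOffset = i;
            graph[childblkno].flags |= GIST_PS_HAS_PARENT;
        }
    }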

==Limitations==
At least one reference on each internal page is left undeleted, to preserve the balance of the tree.
Pages that have the FOLLOW-RIGHT flag set are also not deleted, even if empty.
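
In the first patch these limitations show up as the leaf-deletability check (condensed from gistvacuum.c):

    if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber  /* leaf is empty */
        && !(GistFollowRight(leafPage) ||
             GistPageGetNSN(page) < GistPageGetNSN(leafPage))    /* no follow-right */
        && ntodelete < maxoff - 1)   /* keep at least one downlink on the parent */
    {
        /* safe to drop the downlink and mark the leaf as deleted */
    }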

Thank you for your attention, any thoughts are welcome.

Best regards, Andrey Borodin.

[0]: /messages/by-id/1147341441925550@web17j.yandex.ru

Attachments:

0001-Delete-pages-during-GiST-VACUUM-v3.patch (application/octet-stream)
From 525bcdc0676ee0040a912d736cf436729c3947c8 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Wed, 7 Mar 2018 11:28:47 +0500
Subject: [PATCH 1/2] Delete pages during GiST VACUUM v3

---
 src/backend/access/gist/gist.c       |   7 +++
 src/backend/access/gist/gistbuild.c  |   5 --
 src/backend/access/gist/gistutil.c   |   4 +-
 src/backend/access/gist/gistvacuum.c | 119 +++++++++++++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c   |  47 +++++++++++++-
 src/include/access/gist_private.h    |  23 ++++---
 src/include/access/gistxlog.h        |  16 ++++-
 src/test/regress/expected/gist.out   |   4 +-
 src/test/regress/sql/gist.sql        |   4 +-
 9 files changed, 203 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 51c32e4afe..f7973a2e15 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -694,6 +694,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 55cccd247a..52c1b6fa69 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -24,6 +24,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -801,13 +802,14 @@ gistNewBuffer(Relation r)
 		if (ConditionalLockBuffer(buffer))
 		{
 			Page		page = BufferGetPage(buffer);
+			PageHeader	p = (PageHeader) page;
 
 			if (PageIsNew(page))
 				return buffer;	/* OK to use, if never initialized */
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(p->pd_prune_xid, RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..213e01202b 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -20,6 +20,8 @@
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
+#include "utils/snapmgr.h"
+#include "access/xact.h"
 
 
 /*
@@ -125,7 +127,6 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 	}
 }
 
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -135,12 +136,14 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
 {
 	Relation	rel = info->index;
 	GistBDItem *stack,
 			   *ptr;
+	BlockNumber recentParent = InvalidBlockNumber;
+	List	   *rescanList = NULL;
+	ListCell   *cell;
 
 	/* first time through? */
 	if (stats == NULL)
@@ -233,9 +236,16 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 				END_CRIT_SECTION();
 			}
 
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
+				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			{
+				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
+				rescanList = lappend_int(rescanList, recentParent);
+			}
 		}
 		else
 		{
+			recentParent = stack->blkno;
 			/* check for split proceeded after look at parent */
 			pushStackIfSplited(page, stack);
 
@@ -270,5 +280,106 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		vacuum_delay_point();
 	}
 
+	/* rescan inner pages that had empty child pages */
+	foreach(cell,rescanList)
+	{
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber i,
+					maxoff;
+		IndexTuple	idxtuple;
+		ItemId		iid;
+		OffsetNumber todelete[MaxOffsetNumber];
+		Buffer		buftodelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+									RBM_NORMAL, info->strategy);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistcheckpage(rel, buffer);
+		page = (Page) BufferGetPage(buffer);
+
+		Assert(!GistPageIsLeaf(page));
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		{
+			Buffer		leafBuffer;
+			Page		leafPage;
+
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+								RBM_NORMAL, info->strategy);
+			LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafBuffer);
+			leafPage = (Page) BufferGetPage(leafBuffer);
+			Assert(GistPageIsLeaf(leafPage));
+
+			if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* Nothing left to split */
+				&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
+				&& ntodelete < maxoff-1) /* We must keep at least one downlink on each internal page */
+			{
+				buftodelete[ntodelete] = leafBuffer;
+				todelete[ntodelete++] = i;
+			}
+			else
+				UnlockReleaseBuffer(leafBuffer);
+		}
+
+
+		if (ntodelete)
+		{
+			/* Drop references from internal page */
+			TransactionId txid = GetCurrentTransactionId();
+			START_CRIT_SECTION();
+
+			MarkBufferDirty(buffer);
+				PageIndexMultiDelete(page, todelete, ntodelete);
+
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer, todelete, ntodelete, NULL, 0, InvalidBuffer);
+					PageSetLSN(page, recptr);
+			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
+
+			/* Mark pages as deleted */
+			for (i = 0; i < ntodelete; i++)
+			{
+				Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+				PageHeader	header = (PageHeader)leafPage;
+
+				header->pd_prune_xid = txid;
+
+				GistPageSetDeleted(leafPage);
+				MarkBufferDirty(buftodelete[i]);
+				stats->pages_deleted++;
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr recptr = gistXLogSetDeleted(rel->rd_node, buftodelete[i], header->pd_prune_xid);
+					PageSetLSN(leafPage, recptr);
+				}
+				else
+					PageSetLSN(leafPage, gistGetFakeLSN(rel));
+
+				UnlockReleaseBuffer(buftodelete[i]);
+			}
+			END_CRIT_SECTION();
+		}
+
+		UnlockReleaseBuffer(buffer);
+
+		vacuum_delay_point();
+	}
+
+	list_free(rescanList);
+
 	return stats;
-}
+}
\ No newline at end of file
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..7d515f3e7e 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,29 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+	PageHeader		header;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+		header = (PageHeader) page;
+
+		header->pd_prune_xid = xldata->id;
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+
+		UnlockReleaseBuffer(buffer);
+	}
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,8 +135,8 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
-			if (GistPageIsLeaf(page))
-				GistMarkTuplesDeleted(page);
+
+			GistMarkTuplesDeleted(page);
 		}
 
 		/* Add new tuples if any */
@@ -324,6 +347,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +468,23 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.id = xid;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	/* new tuples */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..fe4b87084e 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,16 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..8df7f4064d 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,18 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId id;
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+   BlockNumber blkno;
+   int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.14.3 (Apple Git-98)

0002-Physical-GiST-scan-during-VACUUM-v3.patch (application/octet-stream)
From ba8c73c2aff010703a97b10065ecc978f41cd77d Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Wed, 7 Mar 2018 11:30:11 +0500
Subject: [PATCH 2/2] Physical GiST scan during VACUUM v3

---
 src/backend/access/gist/README       |  35 ++++
 src/backend/access/gist/gistvacuum.c | 337 ++++++++++++++++++++++++++++++-----
 2 files changed, 332 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..9548872be8 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,41 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused for newly split pages. To find these pages, the
+function can choose between two strategies: a logical scan or a physical scan.
+
+The physical scan reads the entire index from the first page to the last. It
+maintains the graph structure in a palloc'd array to collect the block numbers
+of internal pages that need to be cleansed of references to empty leaves. The
+array also contains the offsets, on each internal page, of potentially free
+leaf pages. This scan method is chosen when maintenance work memory is
+sufficient to hold the necessary graph structure.
+
+The logical scan is chosen when there is not enough maintenance memory to
+execute the physical scan. The logical scan traverses the GiST index
+depth-first, looking into incomplete split branches. The logical scan can be
+slower on hard disk drives.
+
+The result of both scans is the same: a stack of block numbers of internal
+pages, each with a list of offsets potentially referencing empty leaf pages.
+After the scan, each internal page is taken under exclusive lock and each
+potentially free leaf page is examined. gistbulkdelete() never deletes the
+last reference on an internal page, to preserve the balanced tree properties.
+
+The physical scan can return empty leaf page offsets in arbitrary order. Thus,
+before executing PageIndexMultiDelete, the offsets (already locked and checked)
+are sorted. This step is not necessary for the logical scan.
+
+Both scans hold only one lock at a time. The physical scan grabs an exclusive
+lock immediately, while the logical scan takes a shared lock and then swaps it
+for an exclusive one. This is done because the physical scan does less work on
+each internal page, and the number of internal pages is low compared to the
+number of leaf pages.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 213e01202b..e2c37a55a6 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -102,8 +102,9 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 typedef struct GistBDItem
 {
-	GistNSN		parentlsn;
-	BlockNumber blkno;
+	GistNSN		 parentlsn;
+	BlockNumber  blkno;
+	OffsetNumber parentoffset;
 	struct GistBDItem *next;
 } GistBDItem;
 
@@ -128,30 +129,204 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 }
 
 /*
- * Bulk deletion of all index entries pointing to a set of heap tuples and
- * check invalid tuples left after upgrade.
- * The set of target tuples is specified via a callback routine that tells
- * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ * During the physical scan, for every parent-child pair we can find either
+ * the parent or the child first. Every time we open an internal page, we
+ * record the parent block number for every child and set
+ * GIST_PS_HAS_PARENT. When the scan gets to a child page that turns out to
+ * be empty, we go back via the parent link. If we find the child first
+ * (still without a parent link), we mark the page as GIST_PS_EMPTY_LEAF if
+ * it is ready to be deleted; when we scan its parent, we pick it up into the
+ * rescan list.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+#define GIST_PS_HAS_PARENT 1
+#define GIST_PS_EMPTY_LEAF 2
+
+
+/* Physical scan item */
+typedef struct GistPSItem
 {
-	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
-	BlockNumber recentParent = InvalidBlockNumber;
-	List	   *rescanList = NULL;
-	ListCell   *cell;
+	BlockNumber  parent;
+	List*        emptyLeafOffsets;
+	OffsetNumber parentOffset;
+	uint16_t     flags;
+} GistPSItem;
+
+/* Blocknumber of internal pages with offsets to rescan for deletion */
+typedef struct GistRescanItem
+{
+	BlockNumber       blkno;
+	List*             emptyLeafOffsets;
+	struct GistRescanItem* next;
+} GistRescanItem;
+
+/* Read all pages sequentially populating array of GistPSItem */
+static GistRescanItem*
+gistbulkdeletephysicalcan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state, BlockNumber npages)
+{
+	Relation	     rel = info->index;
+	GistRescanItem *result = NULL;
+	BlockNumber      blkno;
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
+	/* Here we will store whole graph of the index */
+	GistPSItem *graph = palloc0(npages * sizeof(GistPSItem));
+
+
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		Buffer		 buffer;
+		Page		 page;
+		OffsetNumber i,
+					 maxoff;
+		IndexTuple   idxtuple;
+		ItemId	     iid;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+									info->strategy);
+		/*
+		 * We are not going to stay here for long or call recursive algorithms,
+		 * especially on an internal page. So, aggressively grab an exclusive
+		 * lock.
+		 */
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		page = (Page) BufferGetPage(buffer);
+
+		if (PageIsNew(page) || GistPageIsDeleted(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			/* TODO: Should not we record free page here? */
+			continue;
+		}
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		if (GistPageIsLeaf(page))
+		{
+			OffsetNumber todelete[MaxOffsetNumber];
+			int			ntodelete = 0;
+
+			/*
+			 * Remove deletable tuples from page
+			 */
+
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = i;
+				else
+					stats->num_index_tuples += 1;
+			}
+
+			stats->tuples_removed += ntodelete;
+
+			/* We have dead tuples on the page */
+			if (ntodelete)
+			{
+				START_CRIT_SECTION();
+
+				MarkBufferDirty(buffer);
+
+				PageIndexMultiDelete(page, todelete, ntodelete);
+				GistMarkTuplesDeleted(page);
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr	recptr;
+
+					recptr = gistXLogUpdate(buffer,
+											todelete, ntodelete,
+											NULL, 0, InvalidBuffer);
+					PageSetLSN(page, recptr);
+				}
+				else
+					PageSetLSN(page, gistGetFakeLSN(rel));
 
+				END_CRIT_SECTION();
+			}
+
+			/* The page is completely empty */
+			if (ntodelete == maxoff)
+			{
+				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
+				if (graph[blkno].flags & GIST_PS_HAS_PARENT)
+				{
+					/* Go to parent and append myself */
+					BlockNumber parentblockno = graph[blkno].parent;
+					graph[parentblockno].emptyLeafOffsets = lappend_int(graph[parentblockno].emptyLeafOffsets, (int)graph[blkno].parentOffset);
+				}
+				else
+				{
+					/* Parent will collect me later */
+					graph[blkno].flags |= GIST_PS_EMPTY_LEAF;
+				}
+			}
+		}
+		else
+		{
+			/* For internal pages we remember the structure of the tree */
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				BlockNumber childblkno;
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+
+				if (graph[childblkno].flags & GIST_PS_EMPTY_LEAF)
+				{
+					/* Child has been scanned earlier and is ready to be picked up */
+					graph[blkno].emptyLeafOffsets = lappend_int(graph[blkno].emptyLeafOffsets, i);
+				}
+				else
+				{
+					/* Collect leaf when scan will come close */
+					graph[childblkno].parent = blkno;
+					graph[childblkno].parentOffset = i;
+					graph[childblkno].flags |= GIST_PS_HAS_PARENT;
+				}
+
+
+				if (GistTupleIsInvalid(idxtuple))
+					ereport(LOG,
+							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
+									RelationGetRelationName(rel)),
+							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
+							 errhint("Please REINDEX it.")));
+			}
+		}
+		UnlockReleaseBuffer(buffer);
+	}
+
+	/* Search for internal pages pointing to empty leafs */
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		if (graph[blkno].emptyLeafOffsets)
+		{
+			GistRescanItem *next = palloc(sizeof(GistRescanItem));
+			next->blkno = blkno;
+			next->emptyLeafOffsets = graph[blkno].emptyLeafOffsets;
+			next->next = result;
+			result = next;
+		}
+	}
+
+	pfree(graph);
+
+	return result;
+}
+
+/* Logical scan descends from root to leafs in DFS search */
+static GistRescanItem*
+gistbulkdeletelogicalscan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation        rel = info->index;
+	BlockNumber     recentParent = InvalidBlockNumber;
+	GistBDItem     *stack,
+				   *ptr;
+	GistRescanItem *result = NULL;
+
+	/* This stack is used to organize DFS */
 	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
 	stack->blkno = GIST_ROOT_BLKNO;
 
@@ -236,11 +411,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				END_CRIT_SECTION();
 			}
 
-			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
-				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber)
 			{
 				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
-				rescanList = lappend_int(rescanList, recentParent);
+				if (result == NULL || result->blkno != recentParent)
+				{
+					GistRescanItem *next = palloc(sizeof(GistRescanItem));
+					next->blkno = recentParent;
+					next->emptyLeafOffsets = NULL;
+					next->next = result;
+					result = next;
+				}
+				result->emptyLeafOffsets = lappend_int(result->emptyLeafOffsets, stack->parentoffset);
 			}
 		}
 		else
@@ -260,6 +442,7 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 				ptr->parentlsn = BufferGetLSNAtomic(buffer);
 				ptr->next = stack->next;
+				ptr->parentoffset = i;
 				stack->next = ptr;
 
 				if (GistTupleIsInvalid(idxtuple))
@@ -280,20 +463,82 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		vacuum_delay_point();
 	}
 
-	/* rescan inner pages that had empty child pages */
-	foreach(cell,rescanList)
+	return result;
+}
+
+/*
+ * This function is used to sort offsets for PageIndexMultiDelete.
+ * When employing the physical scan, rescan offsets are not ordered.
+ */
+static int
+compare_offsetnumber(const void *x, const void *y)
+{
+	OffsetNumber a = *((OffsetNumber *)x);
+	OffsetNumber b = *((OffsetNumber *)y);
+	return a - b;
+}
+
+/*
+ * Bulk deletion of all index entries pointing to a set of heap tuples and
+ * check invalid tuples left after upgrade.
+ * The set of target tuples is specified via a callback routine that tells
+ * whether any given heap tuple (identified by ItemPointer) is being deleted.
+ *
+ * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ */
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation		rel = info->index;
+	GistRescanItem *rescan;
+	BlockNumber		npages;
+	bool			needLock;
+
+	/* first time through? */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	/* we'll re-count the tuples each time */
+	stats->estimated_count = false;
+	stats->num_index_tuples = 0;
+
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/* try to find deleted pages */
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	npages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	/* If the map of the whole graph fits in maintenance memory, we can read the whole index sequentially; otherwise fall back to the logical scan */
+	if (npages * (sizeof(GistPSItem)) > maintenance_work_mem * 1024)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-		OffsetNumber todelete[MaxOffsetNumber];
-		Buffer		buftodelete[MaxOffsetNumber];
-		int			ntodelete = 0;
+		rescan = gistbulkdeletelogicalscan(info, stats, callback, callback_state);
+	}
+	else
+	{
+		rescan = gistbulkdeletephysicalcan(info, stats, callback, callback_state, npages);
+	}
 
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+	/* rescan inner pages that had empty child pages */
+	while (rescan)
+	{
+		Buffer			 buffer;
+		Page			 page;
+		OffsetNumber 	 i,
+						 maxoff;
+		IndexTuple		 idxtuple;
+		ItemId			 iid;
+		OffsetNumber 	 todelete[MaxOffsetNumber];
+		Buffer			 buftodelete[MaxOffsetNumber];
+		int				 ntodelete = 0;
+		ListCell  		*cell;
+		GistRescanItem	*oldRescan;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, rescan->blkno,
 									RBM_NORMAL, info->strategy);
 		LockBuffer(buffer, GIST_EXCLUSIVE);
 		gistcheckpage(rel, buffer);
@@ -303,11 +548,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		maxoff = PageGetMaxOffsetNumber(page);
 
-		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		/* Check that leafs are still empty and decide what to delete */
+		foreach(cell, rescan->emptyLeafOffsets)
 		{
 			Buffer		leafBuffer;
 			Page		leafPage;
 
+			i = (OffsetNumber)lfirst_int(cell);
+			if(i > maxoff)
+			{
+				continue;
+			}
+
 			iid = PageGetItemId(page, i);
 			idxtuple = (IndexTuple) PageGetItem(page, iid);
 
@@ -337,7 +589,9 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
-				PageIndexMultiDelete(page, todelete, ntodelete);
+			/* Sort possibly unordered offsets */
+			qsort(todelete, ntodelete, sizeof(OffsetNumber), compare_offsetnumber);
+			PageIndexMultiDelete(page, todelete, ntodelete);
 
 			if (RelationNeedsWAL(rel))
 			{
@@ -375,11 +629,14 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		}
 
 		UnlockReleaseBuffer(buffer);
+		oldRescan = rescan;
+		rescan = rescan->next;
+		list_free(oldRescan->emptyLeafOffsets);
+		pfree(oldRescan);
 
 		vacuum_delay_point();
 	}
 
-	list_free(rescanList);
 
 	return stats;
 }
\ No newline at end of file
-- 
2.14.3 (Apple Git-98)

#2Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andrey Borodin (#1)
Re: GiST VACUUM

On Wed, Mar 7, 2018 at 7:50 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Here's new version of GiST VACUUM patch set aimed at CF 2018-09.

Hi Andrey,

FYI Windows doesn't like this patch[1].

+ uint16_t flags;

I think that needs to be uint16?
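
For context, PostgreSQL supplies its own fixed-width typedefs in src/include/c.h rather than relying on C99 <stdint.h>, roughly:

    /* src/include/c.h (abridged) */
    typedef signed short int16;     /* == 16 bits */
    typedef unsigned short uint16;  /* == 16 bits */
    typedef signed int int32;       /* == 32 bits */
    typedef unsigned int uint32;    /* == 32 bits */

so uint16 is the portable spelling inside the tree.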

[1]: https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.16

--
Thomas Munro
http://www.enterprisedb.com

#3Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Thomas Munro (#2)
2 attachment(s)
Re: GiST VACUUM

Hi, Thomas!

On 17 May 2018, at 9:40, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Wed, Mar 7, 2018 at 7:50 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Here's new version of GiST VACUUM patch set aimed at CF 2018-09.

Hi Andrey,

FYI Windows doesn't like this patch[1].

+ uint16_t flags;

I think that needs to be uint16?

Thanks! Fixed.

Best regards, Andrey Borodin.

Attachments:

0001-Delete-pages-during-GiST-VACUUM-v4.patch (application/octet-stream)
From 5eedf95a8a6f26f8d7b6ce99225427b6338fb4e6 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 17 May 2018 10:19:23 +0500
Subject: [PATCH 1/2] Delete pages during GiST VACUUM v4

---
 src/backend/access/gist/gist.c       |   7 +++
 src/backend/access/gist/gistbuild.c  |   5 --
 src/backend/access/gist/gistutil.c   |   4 +-
 src/backend/access/gist/gistvacuum.c | 119 +++++++++++++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c   |  47 +++++++++++++-
 src/include/access/gist_private.h    |  23 ++++---
 src/include/access/gistxlog.h        |  16 ++++-
 src/test/regress/expected/gist.out   |   4 +-
 src/test/regress/sql/gist.sql        |   4 +-
 9 files changed, 203 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..ba74dda523 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 55cccd247a..52c1b6fa69 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -24,6 +24,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -801,13 +802,14 @@ gistNewBuffer(Relation r)
 		if (ConditionalLockBuffer(buffer))
 		{
 			Page		page = BufferGetPage(buffer);
+			PageHeader	p = (PageHeader) page;
 
 			if (PageIsNew(page))
 				return buffer;	/* OK to use, if never initialized */
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(p->pd_prune_xid, RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..213e01202b 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -20,6 +20,8 @@
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
+#include "utils/snapmgr.h"
+#include "access/xact.h"
 
 
 /*
@@ -125,7 +127,6 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 	}
 }
 
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -135,12 +136,14 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
 {
 	Relation	rel = info->index;
 	GistBDItem *stack,
 			   *ptr;
+	BlockNumber recentParent = InvalidBlockNumber;
+	List	   *rescanList = NULL;
+	ListCell   *cell;
 
 	/* first time through? */
 	if (stats == NULL)
@@ -233,9 +236,16 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 				END_CRIT_SECTION();
 			}
 
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
+				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			{
+				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
+				rescanList = lappend_int(rescanList, recentParent);
+			}
 		}
 		else
 		{
+			recentParent = stack->blkno;
 			/* check for split proceeded after look at parent */
 			pushStackIfSplited(page, stack);
 
@@ -270,5 +280,106 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		vacuum_delay_point();
 	}
 
+	/* rescan inner pages that had empty child pages */
+	foreach(cell,rescanList)
+	{
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber i,
+					maxoff;
+		IndexTuple	idxtuple;
+		ItemId		iid;
+		OffsetNumber todelete[MaxOffsetNumber];
+		Buffer		buftodelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+									RBM_NORMAL, info->strategy);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistcheckpage(rel, buffer);
+		page = (Page) BufferGetPage(buffer);
+
+		Assert(!GistPageIsLeaf(page));
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		{
+			Buffer		leafBuffer;
+			Page		leafPage;
+
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+								RBM_NORMAL, info->strategy);
+			LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafBuffer);
+			leafPage = (Page) BufferGetPage(leafBuffer);
+			Assert(GistPageIsLeaf(leafPage));
+
+			if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* Nothing left to split */
+				&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
+				&& ntodelete < maxoff-1) /* We must keep at least one downlink on each internal page */
+			{
+				buftodelete[ntodelete] = leafBuffer;
+				todelete[ntodelete++] = i;
+			}
+			else
+				UnlockReleaseBuffer(leafBuffer);
+		}
+
+
+		if (ntodelete)
+		{
+			/* Drop references from internal page */
+			TransactionId txid = GetCurrentTransactionId();
+			START_CRIT_SECTION();
+
+			MarkBufferDirty(buffer);
+				PageIndexMultiDelete(page, todelete, ntodelete);
+
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer, todelete, ntodelete, NULL, 0, InvalidBuffer);
+					PageSetLSN(page, recptr);
+			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
+
+			/* Mark pages as deleted */
+			for (i = 0; i < ntodelete; i++)
+			{
+				Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+				PageHeader	header = (PageHeader)leafPage;
+
+				header->pd_prune_xid = txid;
+
+				GistPageSetDeleted(leafPage);
+				MarkBufferDirty(buftodelete[i]);
+				stats->pages_deleted++;
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr recptr = gistXLogSetDeleted(rel->rd_node, buftodelete[i], header->pd_prune_xid);
+					PageSetLSN(leafPage, recptr);
+				}
+				else
+					PageSetLSN(leafPage, gistGetFakeLSN(rel));
+
+				UnlockReleaseBuffer(buftodelete[i]);
+			}
+			END_CRIT_SECTION();
+		}
+
+		UnlockReleaseBuffer(buffer);
+
+		vacuum_delay_point();
+	}
+
+	list_free(rescanList);
+
 	return stats;
-}
+}
\ No newline at end of file
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..7d515f3e7e 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,29 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+	PageHeader		header;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+		header = (PageHeader) page;
+
+		header->pd_prune_xid = xldata->id;
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+
+		UnlockReleaseBuffer(buffer);
+	}
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,8 +135,8 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
-			if (GistPageIsLeaf(page))
-				GistMarkTuplesDeleted(page);
+
+			GistMarkTuplesDeleted(page);
 		}
 
 		/* Add new tuples if any */
@@ -324,6 +347,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +468,23 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.id = xid;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	/* new tuples */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..fe4b87084e 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,16 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..8df7f4064d 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,18 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId id;
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+   BlockNumber blkno;
+   int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.15.1 (Apple Git-101)

0002-Physical-GiST-scan-during-VACUUM-v4.patch (application/octet-stream)
From 9b4d140b8379b059c48da483985bf60ba31a1aeb Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 17 May 2018 10:21:13 +0500
Subject: [PATCH 2/2] Physical GiST scan during VACUUM v4

---
 src/backend/access/gist/README       |  35 ++++
 src/backend/access/gist/gistvacuum.c | 337 ++++++++++++++++++++++++++++++-----
 2 files changed, 332 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..9548872be8 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,41 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused for newly split pages. To find these pages, the
+function can choose between two strategies: a logical scan or a physical scan.
+
+The physical scan reads the entire index from the first page to the last. It
+maintains the graph structure in a palloc'd array to collect the block numbers
+of internal pages that need to be cleansed of references to empty leaves. The
+array also contains the offsets, on each internal page, of potentially free
+leaf pages. This scan method is chosen when maintenance work memory is
+sufficient to hold the necessary graph structure.
+
+The logical scan is chosen when there is not enough maintenance memory to
+execute the physical scan. The logical scan traverses the GiST index
+depth-first, looking into incomplete split branches. The logical scan can be
+slower on hard disk drives.
+
+The result of both scans is the same: a stack of block numbers of internal
+pages, each with a list of offsets potentially referencing empty leaf pages.
+After the scan, each internal page is taken under exclusive lock and each
+potentially free leaf page is examined. gistbulkdelete() never deletes the
+last reference on an internal page, to preserve the balanced tree properties.
+
+The physical scan can return empty leaf page offsets in arbitrary order. Thus,
+before executing PageIndexMultiDelete, the offsets (already locked and checked)
+are sorted. This step is not necessary for the logical scan.
+
+Both scans hold only one lock at a time. The physical scan grabs an exclusive
+lock immediately, while the logical scan takes a shared lock and then swaps it
+for an exclusive one. This is done because the physical scan does less work on
+each internal page, and the number of internal pages is low compared to the
+number of leaf pages.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 213e01202b..21aa4ed484 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -102,8 +102,9 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 typedef struct GistBDItem
 {
-	GistNSN		parentlsn;
-	BlockNumber blkno;
+	GistNSN		 parentlsn;
+	BlockNumber  blkno;
+	OffsetNumber parentoffset;
 	struct GistBDItem *next;
 } GistBDItem;
 
@@ -128,30 +129,204 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 }
 
 /*
- * Bulk deletion of all index entries pointing to a set of heap tuples and
- * check invalid tuples left after upgrade.
- * The set of target tuples is specified via a callback routine that tells
- * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ * During the physical scan, for every parent-child pair we can find either
+ * the parent or the child first. Every time we open an internal page, we
+ * record the parent block number for every child and set
+ * GIST_PS_HAS_PARENT. When the scan gets to a child page that turns out to
+ * be empty, we go back via the parent link. If we find the child first
+ * (still without a parent link), we mark the page as GIST_PS_EMPTY_LEAF if
+ * it is ready to be deleted; when we scan its parent, we pick it up into the
+ * rescan list.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+#define GIST_PS_HAS_PARENT 1
+#define GIST_PS_EMPTY_LEAF 2
+
+
+/* Physical scan item */
+typedef struct GistPSItem
 {
-	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
-	BlockNumber recentParent = InvalidBlockNumber;
-	List	   *rescanList = NULL;
-	ListCell   *cell;
+	BlockNumber  parent;
+	List*        emptyLeafOffsets;
+	OffsetNumber parentOffset;
+	uint16       flags;
+} GistPSItem;
+
+/* Blocknumber of internal pages with offsets to rescan for deletion */
+typedef struct GistRescanItem
+{
+	BlockNumber       blkno;
+	List*             emptyLeafOffsets;
+	struct GistRescanItem* next;
+} GistRescanItem;
+
+/* Read all pages sequentially populating array of GistPSItem */
+static GistRescanItem*
+gistbulkdeletephysicalcan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state, BlockNumber npages)
+{
+	Relation	     rel = info->index;
+	GistRescanItem *result = NULL;
+	BlockNumber      blkno;
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
+	/* Here we will store whole graph of the index */
+	GistPSItem *graph = palloc0(npages * sizeof(GistPSItem));
+
+
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		Buffer		 buffer;
+		Page		 page;
+		OffsetNumber i,
+					 maxoff;
+		IndexTuple   idxtuple;
+		ItemId	     iid;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+									info->strategy);
+		/*
+		 * We are not going to stay here for long or call recursive algorithms,
+		 * especially on an internal page. So, aggressively grab an exclusive
+		 * lock.
+		 */
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		page = (Page) BufferGetPage(buffer);
+
+		if (PageIsNew(page) || GistPageIsDeleted(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			/* TODO: Should not we record free page here? */
+			continue;
+		}
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		if (GistPageIsLeaf(page))
+		{
+			OffsetNumber todelete[MaxOffsetNumber];
+			int			ntodelete = 0;
+
+			/*
+			 * Remove deletable tuples from page
+			 */
+
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = i;
+				else
+					stats->num_index_tuples += 1;
+			}
+
+			stats->tuples_removed += ntodelete;
+
+			/* We have dead tuples on the page */
+			if (ntodelete)
+			{
+				START_CRIT_SECTION();
+
+				MarkBufferDirty(buffer);
+
+				PageIndexMultiDelete(page, todelete, ntodelete);
+				GistMarkTuplesDeleted(page);
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr	recptr;
+
+					recptr = gistXLogUpdate(buffer,
+											todelete, ntodelete,
+											NULL, 0, InvalidBuffer);
+					PageSetLSN(page, recptr);
+				}
+				else
+					PageSetLSN(page, gistGetFakeLSN(rel));
 
+				END_CRIT_SECTION();
+			}
+
+			/* The page is completely empty */
+			if (ntodelete == maxoff)
+			{
+				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
+				if (graph[blkno].flags & GIST_PS_HAS_PARENT)
+				{
+					/* Go to parent and append myself */
+					BlockNumber parentblockno = graph[blkno].parent;
+					graph[parentblockno].emptyLeafOffsets = lappend_int(graph[parentblockno].emptyLeafOffsets, (int)graph[blkno].parentOffset);
+				}
+				else
+				{
+					/* Parent will collect me later */
+					graph[blkno].flags |= GIST_PS_EMPTY_LEAF;
+				}
+			}
+		}
+		else
+		{
+			/* For internal pages we remember the structure of the tree */
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				BlockNumber childblkno;
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+
+				if (graph[childblkno].flags & GIST_PS_EMPTY_LEAF)
+				{
+					/* Child has been scanned earlier and is ready to be picked up */
+					graph[blkno].emptyLeafOffsets = lappend_int(graph[blkno].emptyLeafOffsets, i);
+				}
+				else
+				{
+					/* Collect leaf when scan will come close */
+					graph[childblkno].parent = blkno;
+					graph[childblkno].parentOffset = i;
+					graph[childblkno].flags |= GIST_PS_HAS_PARENT;
+				}
+
+
+				if (GistTupleIsInvalid(idxtuple))
+					ereport(LOG,
+							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
+									RelationGetRelationName(rel)),
+							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
+							 errhint("Please REINDEX it.")));
+			}
+		}
+		UnlockReleaseBuffer(buffer);
+	}
+
+	/* Search for internal pages pointing to empty leaves */
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		if (graph[blkno].emptyLeafOffsets)
+		{
+			GistRescanItem *next = palloc(sizeof(GistRescanItem));
+			next->blkno = blkno;
+			next->emptyLeafOffsets = graph[blkno].emptyLeafOffsets;
+			next->next = result;
+			result = next;
+		}
+	}
+
+	pfree(graph);
+
+	return result;
+}
+
+/* Logical scan descends from the root to the leaves in DFS order */
+static GistRescanItem*
+gistbulkdeletelogicalscan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation        rel = info->index;
+	BlockNumber     recentParent = InvalidBlockNumber;
+	GistBDItem     *stack,
+				   *ptr;
+	GistRescanItem *result = NULL;
+
+	/* This stack is used to organize DFS */
 	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
 	stack->blkno = GIST_ROOT_BLKNO;
 
@@ -236,11 +411,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				END_CRIT_SECTION();
 			}
 
-			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
-				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber)
 			{
				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
-				rescanList = lappend_int(rescanList, recentParent);
+				if (result == NULL || result->blkno != recentParent)
+				{
+					GistRescanItem *next = palloc(sizeof(GistRescanItem));
+					next->blkno = recentParent;
+					next->emptyLeafOffsets = NULL;
+					next->next = result;
+					result = next;
+				}
+				result->emptyLeafOffsets = lappend_int(result->emptyLeafOffsets, stack->parentoffset);
 			}
 		}
 		else
@@ -260,6 +442,7 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 				ptr->parentlsn = BufferGetLSNAtomic(buffer);
 				ptr->next = stack->next;
+				ptr->parentoffset = i;
 				stack->next = ptr;
 
 				if (GistTupleIsInvalid(idxtuple))
@@ -280,20 +463,82 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		vacuum_delay_point();
 	}
 
-	/* rescan inner pages that had empty child pages */
-	foreach(cell,rescanList)
+	return result;
+}
+
+/*
+ * This function is used to sort offsets for PageIndexMultiDelete
+ * When employing the physical scan, the rescan offsets are not ordered.
+ */
+static int
+compare_offsetnumber(const void *x, const void *y)
+{
+	OffsetNumber a = *((OffsetNumber *)x);
+	OffsetNumber b = *((OffsetNumber *)y);
+	return a - b;
+}
+
+/*
+ * Bulk deletion of all index entries pointing to a set of heap tuples and
+ * check invalid tuples left after upgrade.
+ * The set of target tuples is specified via a callback routine that tells
+ * whether any given heap tuple (identified by ItemPointer) is being deleted.
+ *
+ * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ */
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation		rel = info->index;
+	GistRescanItem *rescan;
+	BlockNumber		npages;
+	bool			needLock;
+
+	/* first time through? */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	/* we'll re-count the tuples each time */
+	stats->estimated_count = false;
+	stats->num_index_tuples = 0;
+
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/* try to find deleted pages */
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	npages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	/* If the map of the whole graph fits in maintenance memory, we can read the whole index sequentially; otherwise fall back to the logical scan */
+	if (npages * (sizeof(GistPSItem)) > maintenance_work_mem * 1024)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-		OffsetNumber todelete[MaxOffsetNumber];
-		Buffer		buftodelete[MaxOffsetNumber];
-		int			ntodelete = 0;
+		rescan = gistbulkdeletelogicalscan(info, stats, callback, callback_state);
+	}
+	else
+	{
+		rescan = gistbulkdeletephysicalcan(info, stats, callback, callback_state, npages);
+	}
 
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+	/* rescan inner pages that had empty child pages */
+	while (rescan)
+	{
+		Buffer			 buffer;
+		Page			 page;
+		OffsetNumber 	 i,
+						 maxoff;
+		IndexTuple		 idxtuple;
+		ItemId			 iid;
+		OffsetNumber 	 todelete[MaxOffsetNumber];
+		Buffer			 buftodelete[MaxOffsetNumber];
+		int				 ntodelete = 0;
+		ListCell  		*cell;
+		GistRescanItem	*oldRescan;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, rescan->blkno,
 									RBM_NORMAL, info->strategy);
 		LockBuffer(buffer, GIST_EXCLUSIVE);
 		gistcheckpage(rel, buffer);
@@ -303,11 +548,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		maxoff = PageGetMaxOffsetNumber(page);
 
-		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		/* Check that leaves are still empty and decide what to delete */
+		foreach(cell, rescan->emptyLeafOffsets)
 		{
 			Buffer		leafBuffer;
 			Page		leafPage;
 
+			i = (OffsetNumber)lfirst_int(cell);
+			if(i > maxoff)
+			{
+				continue;
+			}
+
 			iid = PageGetItemId(page, i);
 			idxtuple = (IndexTuple) PageGetItem(page, iid);
 
@@ -337,7 +589,9 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
-				PageIndexMultiDelete(page, todelete, ntodelete);
+			/* Sort the possibly unordered offsets */
+			qsort(todelete, ntodelete, sizeof(OffsetNumber), compare_offsetnumber);
+			PageIndexMultiDelete(page, todelete, ntodelete);
 
 			if (RelationNeedsWAL(rel))
 			{
@@ -375,11 +629,14 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		}
 
 		UnlockReleaseBuffer(buffer);
+		oldRescan = rescan;
+		rescan = rescan->next;
+		list_free(oldRescan->emptyLeafOffsets);
+		pfree(oldRescan);
 
 		vacuum_delay_point();
 	}
 
-	list_free(rescanList);
 
 	return stats;
 }
\ No newline at end of file
-- 
2.15.1 (Apple Git-101)

#4Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#3)
Re: GiST VACUUM

I'm now looking at the first patch in this series, to allow completely
empty GiST pages to be recycled. I've got some questions:

--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
GISTInsertStack *item;
OffsetNumber downlinkoffnum;
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
iid = PageGetItemId(stack->page, downlinkoffnum);
idxtuple = (IndexTuple) PageGetItem(stack->page, iid);

This seems misplaced. This code deals with internal pages, and as far as
I can see, this patch never marks internal pages as deleted, only leaf
pages. However, we should have something like this in the leaf-page
branch, to deal with the case that an insertion lands on a page that was
concurrently deleted. Did you have any tests, where an insertion runs
concurrently with vacuum, that would exercise this?

The code in gistbulkdelete() seems pretty expensive. In the first phase,
it records the parent of every empty leaf page it encounters. In the
second phase, it scans every leaf page of that parent, not only those
leaves that were seen as empty.
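
The second patch in the series seems to address this by carrying the exact
downlink offsets along with the parent block in its GistRescanItem list;
schematically (copied in spirit from the 0002 patch):

typedef struct GistRescanItem
{
	BlockNumber blkno;				/* internal page to revisit */
	List	   *emptyLeafOffsets;	/* downlinks seen pointing to empty leaves */
	struct GistRescanItem *next;
} GistRescanItem;

With that, the second phase only re-checks the recorded offsets under an
exclusive lock, instead of every downlink on the parent.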

I'm a bit wary of using pd_prune_xid for the checks to determine if a
deleted page can be recycled yet. In heap pages, pd_prune_xid is just a
hint, but here it's used for a critical check. This seems to be the same
mechanism we use in B-trees, but in B-trees, we store the XID in
BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use
ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See
comments in _bt_unlink_halfdead_page() for explanation. This patch is
missing any comments to explain how this works in GiST.
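
For reference, the B-tree arrangement I'm referring to is roughly this
(simplified from _bt_unlink_halfdead_page() and _bt_page_recyclable();
not the exact code):

/* at unlink time: stamp the deleted page with the next XID, not our own */
opaque->btpo.xact = ReadNewTransactionId();

/* at allocation time: recycle only once no live scan can reach the page */
if (P_ISDELETED(opaque) &&
	TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
{
	/* OK to recycle the page */
}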

If you crash in the middle of gistbulkdelete(), after it has removed the
downlink from the parent, but before it has marked the leaf page as
deleted, the leaf page is "leaked". I think that's acceptable, but a
comment at least would be good.

- Heikki

#5Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#4)
Re: GiST VACUUM

Hi, Heikki!

Thanks for looking into the patch!

On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm now looking at the first patch in this series, to allow completely empty GiST pages to be recycled. I've got some questions:

--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
GISTInsertStack *item;
OffsetNumber downlinkoffnum;
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
iid = PageGetItemId(stack->page, downlinkoffnum);
idxtuple = (IndexTuple) PageGetItem(stack->page, iid);

This seems misplaced. This code deals with internal pages, and as far as I can see, this patch never marks internal pages as deleted, only leaf pages. However, we should have something like this in the leaf-page branch, to deal with the case that an insertion lands on a page that was concurrently deleted.

That's a bug. Will fix this.

Did you have any tests, where an insertion runs concurrently with vacuum, that would exercise this?

Yes, I've tried to test this, but, obviously, not enough. I'll think more about how to deal with it.

The code in gistbulkdelete() seems pretty expensive. In the first phase, it records the parent of every empty leaf page it encounters. In the second phase, it scans every leaf page of that parent, not only those leaves that were seen as empty.

Yes, in the first patch there is the simplest gistbulkdelete(); the second patch remembers the line pointers of downlinks to empty leaves.

I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but here it's used for a critical check. This seems to be the same mechanism we use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for explanation. This patch is missing any comments to explain how this works in GiST.

Will look into this. I remember it was OK half a year ago, but now it is clear to me that I should have documented that part back when I understood it...

If you crash in the middle of gistbulkdelete(), after it has removed the downlink from the parent, but before it has marked the leaf page as deleted, the leaf page is "leaked". I think that's acceptable, but a comment at least would be good.

I was considering doing reverse-split (page merge) concurrency like in Lanin and Shasha's paper, but it is just too complex for too little benefit. I will add comments on possible orphan pages.

Many thanks! I hope to post updated patch series this week.

Best regards, Andrey Borodin.

#6Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#4)
2 attachment(s)
Re: GiST VACUUM

Hi!

PFA v5 of the patch series.

On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This seems misplaced. This code deals with internal pages, and as far as I can see, this patch never marks internal pages as deleted, only leaf pages. However, we should have something like this in the leaf-page branch, to deal with the case that an insertion lands on a page that was concurrently deleted. Did you have any tests, where an insertion runs concurrently with vacuum, that would exercise this?

That bug could manifest only in case of a crash between removing downlinks and marking pages deleted. I do not know how to test this reliably.
Internal pages are locked before leaves, and the locks are coupled. No concurrent backend can see downlinks to pages being deleted, unless a crash happens.

I've moved the code covering this situation into the leaf code path and added a comment.

The code in gistbulkdelete() seems pretty expensive. In the first phase, it records the parent of every empty leaf page it encounters. In the second phase, it scans every leaf page of that parent, not only those leaves that were seen as empty.

It is fixed in the second patch of the series.

I'm a bit wary of using pd_prune_xid for the checks to determine if a deleted page can be recycled yet. In heap pages, pd_prune_xid is just a hint, but here it's used for a critical check. This seems to be the same mechanism we use in B-trees, but in B-trees, we store the XID in BTPageOpaqueData.xact, not pd_prune_xid. Also, in B-trees, we use ReadNewTransactionId() to set it, not GetCurrentTransactionId(). See comments in _bt_unlink_halfdead_page() for explanation. This patch is missing any comments to explain how this works in GiST.

I've replaced the usage of GetCurrentTransactionId() with ReadNewTransactionId() and added an explanation of what is going on. Also, I've added comments about the pd_prune_xid usage. GiST has no other use for this field, and there is no other room to place this xid on a page without a pg_upgrade.
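
Condensed, the v5 patch below now stamps the page at deletion time and
tests the stamp before reuse, roughly like this (variable names as in the
patch):

/* in gistbulkdelete(): an upper bound on xids that may hold downlinks */
TransactionId txid = ReadNewTransactionId();
((PageHeader) leafPage)->pd_prune_xid = txid;
GistPageSetDeleted(leafPage);

/* in gistNewBuffer(): reuse only when nobody can still follow a downlink */
if (GistPageIsDeleted(page) &&
	TransactionIdPrecedes(p->pd_prune_xid, RecentGlobalDataXmin))
	return buffer;	/* OK to use */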

If you crash in the middle of gistbulkdelete(), after it has removed the downlink from the parent, but before it has marked the leaf page as deleted, the leaf page is "leaked". I think that's acceptable, but a comment at least would be good.

Added an explanatory comment between WAL-logging the downlink removal and marking the pages deleted.

Thank you for reviewing the patch!

Best regards, Andrey Borodin.

Attachments:

0002-Physical-GiST-scan-during-VACUUM-v5.patchapplication/octet-stream; name=0002-Physical-GiST-scan-during-VACUUM-v5.patch; x-unix-mode=0644Download
From b31decde365a602c64bbd56648c77b7822dd158d Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 12 Jul 2018 19:56:53 +0400
Subject: [PATCH 2/2] Physical GiST scan during VACUUM v5

---
 src/backend/access/gist/README       |  35 ++++
 src/backend/access/gist/gistvacuum.c | 337 ++++++++++++++++++++++++++++++-----
 2 files changed, 332 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..9548872be8 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,41 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be used to allocate newly split pages. To find these pages,
+the function can choose between two strategies: a logical or a physical scan.
+
+The physical scan reads the entire index from the first page to the last. It
+maintains the graph structure in a palloc'ed array to collect block numbers of
+internal pages that need to be cleansed of references to empty leaves. The
+array also contains the offsets, on each internal page, of potentially free
+leaf pages. This scan method is chosen when maintenance work memory is
+sufficient to hold the necessary graph structure.
+
+The logical scan is chosen when there is not enough maintenance memory to
+execute the physical scan. The logical scan traverses the GiST index in DFS
+order, following incompletely split branches. The logical scan can be slower
+on hard disk drives.
+
+The result of both scans is the same: a stack of block numbers of internal
+pages, each with a list of offsets potentially referencing empty leaf pages.
+After the scan, each potentially free leaf page is examined, with its internal
+page held under an exclusive lock. gistbulkdelete() never deletes the last
+reference on an internal page, to keep the tree balanced.
+
+The physical scan can return empty leaf page offsets unordered. Thus, before
+executing PageIndexMultiDelete, the offsets (already locked and checked) are
+sorted. This step is not necessary for the logical scan.
+
+Both scans hold only one lock at a time. The physical scan grabs an exclusive
+lock immediately, while the logical scan takes a shared lock and then swaps it
+for an exclusive one. This is done because the physical scan does less work on
+an internal page, and the number of internal pages is relatively small compared
+to the number of leaf pages.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index f7e274701f..3f7471e45c 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -102,8 +102,9 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 typedef struct GistBDItem
 {
-	GistNSN		parentlsn;
-	BlockNumber blkno;
+	GistNSN		 parentlsn;
+	BlockNumber  blkno;
+	OffsetNumber parentoffset;
 	struct GistBDItem *next;
 } GistBDItem;
 
@@ -128,30 +129,204 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 }
 
 /*
- * Bulk deletion of all index entries pointing to a set of heap tuples and
- * check invalid tuples left after upgrade.
- * The set of target tuples is specified via a callback routine that tells
- * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ * During the physical scan, for every parent-child pair we can find either
+ * the parent first or the child first. Every time we open an internal page,
+ * we record the parent block number for every child and set
+ * GIST_PS_HAS_PARENT. When the scan gets to a child page that turns out to
+ * be empty, we follow the parent link back. If we find the child first
+ * (still without a parent link), we mark the page as GIST_PS_EMPTY_LEAF if
+ * it is ready to be deleted. When we scan its parent, we pick it up for the rescan list.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+#define GIST_PS_HAS_PARENT 1
+#define GIST_PS_EMPTY_LEAF 2
+
+
+/* Physical scan item */
+typedef struct GistPSItem
 {
-	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
-	BlockNumber recentParent = InvalidBlockNumber;
-	List	   *rescanList = NULL;
-	ListCell   *cell;
+	BlockNumber  parent;
+	List*        emptyLeafOffsets;
+	OffsetNumber parentOffset;
+	uint16       flags;
+} GistPSItem;
+
+/* Block numbers of internal pages, with offsets to rescan for deletion */
+typedef struct GistRescanItem
+{
+	BlockNumber       blkno;
+	List*             emptyLeafOffsets;
+	struct GistRescanItem* next;
+} GistRescanItem;
+
+/* Read all pages sequentially, populating an array of GistPSItem */
+static GistRescanItem*
+gistbulkdeletephysicalcan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state, BlockNumber npages)
+{
+	Relation	     rel = info->index;
+	GistRescanItem *result = NULL;
+	BlockNumber      blkno;
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
+	/* Here we will store the whole graph of the index */
+	GistPSItem *graph = palloc0(npages * sizeof(GistPSItem));
+
+
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		Buffer		 buffer;
+		Page		 page;
+		OffsetNumber i,
+					 maxoff;
+		IndexTuple   idxtuple;
+		ItemId	     iid;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+									info->strategy);
+		/*
+		 * We are not going to stay here for long or call recursive algorithms,
+		 * especially for an internal page. So, aggressively grab an exclusive lock.
+		 */
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		page = (Page) BufferGetPage(buffer);
+
+		if (PageIsNew(page) || GistPageIsDeleted(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			/* TODO: Shouldn't we record the free page here? */
+			continue;
+		}
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		if (GistPageIsLeaf(page))
+		{
+			OffsetNumber todelete[MaxOffsetNumber];
+			int			ntodelete = 0;
+
+			/*
+			 * Remove deletable tuples from page
+			 */
+
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = i;
+				else
+					stats->num_index_tuples += 1;
+			}
+
+			stats->tuples_removed += ntodelete;
+
+			/* We have dead tuples on the page */
+			if (ntodelete)
+			{
+				START_CRIT_SECTION();
+
+				MarkBufferDirty(buffer);
+
+				PageIndexMultiDelete(page, todelete, ntodelete);
+				GistMarkTuplesDeleted(page);
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr	recptr;
+
+					recptr = gistXLogUpdate(buffer,
+											todelete, ntodelete,
+											NULL, 0, InvalidBuffer);
+					PageSetLSN(page, recptr);
+				}
+				else
+					PageSetLSN(page, gistGetFakeLSN(rel));
 
+				END_CRIT_SECTION();
+			}
+
+			/* The page is completely empty */
+			if (ntodelete == maxoff)
+			{
+				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
+				if (graph[blkno].flags & GIST_PS_HAS_PARENT)
+				{
+					/* Go to parent and append myself */
+					BlockNumber parentblockno = graph[blkno].parent;
+					graph[parentblockno].emptyLeafOffsets = lappend_int(graph[parentblockno].emptyLeafOffsets, (int)graph[blkno].parentOffset);
+				}
+				else
+				{
+					/* Parent will collect me later */
+					graph[blkno].flags |= GIST_PS_EMPTY_LEAF;
+				}
+			}
+		}
+		else
+		{
+			/* For internal pages we remember the structure of the tree */
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				BlockNumber childblkno;
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+
+				if (graph[childblkno].flags & GIST_PS_EMPTY_LEAF)
+				{
+					/* Child has been scanned earlier and is ready to be picked up */
+					graph[blkno].emptyLeafOffsets = lappend_int(graph[blkno].emptyLeafOffsets, i);
+				}
+				else
+				{
+					/* The leaf will report back via this parent link when the scan reaches it */
+					graph[childblkno].parent = blkno;
+					graph[childblkno].parentOffset = i;
+					graph[childblkno].flags |= GIST_PS_HAS_PARENT;
+				}
+
+
+				if (GistTupleIsInvalid(idxtuple))
+					ereport(LOG,
+							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
+									RelationGetRelationName(rel)),
+							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
+							 errhint("Please REINDEX it.")));
+			}
+		}
+		UnlockReleaseBuffer(buffer);
+	}
+
+	/* Search for internal pages pointing to empty leaves */
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		if (graph[blkno].emptyLeafOffsets)
+		{
+			GistRescanItem *next = palloc(sizeof(GistRescanItem));
+			next->blkno = blkno;
+			next->emptyLeafOffsets = graph[blkno].emptyLeafOffsets;
+			next->next = result;
+			result = next;
+		}
+	}
+
+	pfree(graph);
+
+	return result;
+}
+
+/* Logical scan descends from the root to the leaves in DFS order */
+static GistRescanItem*
+gistbulkdeletelogicalscan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation        rel = info->index;
+	BlockNumber     recentParent = InvalidBlockNumber;
+	GistBDItem     *stack,
+				   *ptr;
+	GistRescanItem *result = NULL;
+
+	/* This stack is used to organize DFS */
 	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
 	stack->blkno = GIST_ROOT_BLKNO;
 
@@ -236,11 +411,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				END_CRIT_SECTION();
 			}
 
-			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
-				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber)
 			{
				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
-				rescanList = lappend_int(rescanList, recentParent);
+				if (result == NULL || result->blkno != recentParent)
+				{
+					GistRescanItem *next = palloc(sizeof(GistRescanItem));
+					next->blkno = recentParent;
+					next->emptyLeafOffsets = NULL;
+					next->next = result;
+					result = next;
+				}
+				result->emptyLeafOffsets = lappend_int(result->emptyLeafOffsets, stack->parentoffset);
 			}
 		}
 		else
@@ -260,6 +442,7 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 				ptr->parentlsn = BufferGetLSNAtomic(buffer);
 				ptr->next = stack->next;
+				ptr->parentoffset = i;
 				stack->next = ptr;
 
 				if (GistTupleIsInvalid(idxtuple))
@@ -280,20 +463,82 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		vacuum_delay_point();
 	}
 
-	/* rescan inner pages that had empty child pages */
-	foreach(cell,rescanList)
+	return result;
+}
+
+/*
+ * This function is used to sort offsets for PageIndexMultiDelete
+ * When employing the physical scan, the rescan offsets are not ordered.
+ */
+static int
+compare_offsetnumber(const void *x, const void *y)
+{
+	OffsetNumber a = *((OffsetNumber *)x);
+	OffsetNumber b = *((OffsetNumber *)y);
+	return a - b;
+}
+
+/*
+ * Bulk deletion of all index entries pointing to a set of heap tuples and
+ * check invalid tuples left after upgrade.
+ * The set of target tuples is specified via a callback routine that tells
+ * whether any given heap tuple (identified by ItemPointer) is being deleted.
+ *
+ * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ */
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation		rel = info->index;
+	GistRescanItem *rescan;
+	BlockNumber		npages;
+	bool			needLock;
+
+	/* first time through? */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	/* we'll re-count the tuples each time */
+	stats->estimated_count = false;
+	stats->num_index_tuples = 0;
+
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/* try to find deleted pages */
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	npages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	/* If the map of the whole graph fits in maintenance memory, we can read the whole index sequentially; otherwise fall back to the logical scan */
+	if (npages * (sizeof(GistPSItem)) > maintenance_work_mem * 1024)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-		OffsetNumber todelete[MaxOffsetNumber];
-		Buffer		buftodelete[MaxOffsetNumber];
-		int			ntodelete = 0;
+		rescan = gistbulkdeletelogicalscan(info, stats, callback, callback_state);
+	}
+	else
+	{
+		rescan = gistbulkdeletephysicalcan(info, stats, callback, callback_state, npages);
+	}
 
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+	/* rescan inner pages that had empty child pages */
+	while (rescan)
+	{
+		Buffer			 buffer;
+		Page			 page;
+		OffsetNumber 	 i,
+						 maxoff;
+		IndexTuple		 idxtuple;
+		ItemId			 iid;
+		OffsetNumber 	 todelete[MaxOffsetNumber];
+		Buffer			 buftodelete[MaxOffsetNumber];
+		int				 ntodelete = 0;
+		ListCell  		*cell;
+		GistRescanItem	*oldRescan;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, rescan->blkno,
 									RBM_NORMAL, info->strategy);
 		LockBuffer(buffer, GIST_EXCLUSIVE);
 		gistcheckpage(rel, buffer);
@@ -303,11 +548,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		maxoff = PageGetMaxOffsetNumber(page);
 
-		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		/* Check that leaves are still empty and decide what to delete */
+		foreach(cell, rescan->emptyLeafOffsets)
 		{
 			Buffer		leafBuffer;
 			Page		leafPage;
 
+			i = (OffsetNumber)lfirst_int(cell);
+			if(i > maxoff)
+			{
+				continue;
+			}
+
 			iid = PageGetItemId(page, i);
 			idxtuple = (IndexTuple) PageGetItem(page, iid);
 
@@ -343,7 +595,9 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 			START_CRIT_SECTION();
 
 			MarkBufferDirty(buffer);
-				PageIndexMultiDelete(page, todelete, ntodelete);
+			/* Sort the possibly unordered offsets */
+			qsort(todelete, ntodelete, sizeof(OffsetNumber), compare_offsetnumber);
+			PageIndexMultiDelete(page, todelete, ntodelete);
 
 			if (RelationNeedsWAL(rel))
 			{
@@ -387,11 +641,14 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		}
 
 		UnlockReleaseBuffer(buffer);
+		oldRescan = rescan;
+		rescan = rescan->next;
+		list_free(oldRescan->emptyLeafOffsets);
+		pfree(oldRescan);
 
 		vacuum_delay_point();
 	}
 
-	list_free(rescanList);
 
 	return stats;
 }
\ No newline at end of file
-- 
2.15.2 (Apple Git-101.1)

0001-Delete-pages-during-GiST-VACUUM-v5.patchapplication/octet-stream; name=0001-Delete-pages-during-GiST-VACUUM-v5.patch; x-unix-mode=0644Download
From ccfbf6184f66ccada3eeb761d6058ede31e078e4 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 12 Jul 2018 19:51:53 +0400
Subject: [PATCH 1/2] Delete pages during GiST VACUUM v5

---
 src/backend/access/gist/gist.c       |  17 +++++
 src/backend/access/gist/gistbuild.c  |   5 --
 src/backend/access/gist/gistutil.c   |   8 ++-
 src/backend/access/gist/gistvacuum.c | 131 +++++++++++++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c   |  47 ++++++++++++-
 src/include/access/gist_private.h    |  23 ++++--
 src/include/access/gistxlog.h        |  16 ++++-
 src/test/regress/expected/gist.out   |   4 +-
 src/test/regress/sql/gist.sql        |   4 +-
 9 files changed, 229 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..66466c1373 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/* 
+			 * Currently internal pages are not deleted during vacuum,
+			 * so we do not need to check whether the page is deleted
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -786,6 +791,18 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			 * it doesn't fit. gistinserthere() will take care of that.
 			 */
 
+			/* 
+			 * Leaf pages can be left deleted but still referenced in case of a
+			 * crash during VACUUM's gistbulkdelete()
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/*
 			 * Swap shared lock for an exclusive one. Be careful, the page may
 			 * change while we unlock/lock the page...
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 12804c321c..eaac881a1d 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -800,13 +801,18 @@ gistNewBuffer(Relation r)
 		if (ConditionalLockBuffer(buffer))
 		{
 			Page		page = BufferGetPage(buffer);
+			PageHeader	p = (PageHeader) page;
 
 			if (PageIsNew(page))
 				return buffer;	/* OK to use, if never initialized */
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			/*
+			 * We use pd_prune_xid to store an upper bound on the xids that
+			 * could still hold downlinks to this page
+			 */
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(p->pd_prune_xid, RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..f7e274701f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -20,6 +20,8 @@
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
+#include "utils/snapmgr.h"
+#include "access/xact.h"
 
 
 /*
@@ -125,7 +127,6 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 	}
 }
 
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -135,12 +136,14 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
 {
 	Relation	rel = info->index;
 	GistBDItem *stack,
 			   *ptr;
+	BlockNumber recentParent = InvalidBlockNumber;
+	List	   *rescanList = NULL;
+	ListCell   *cell;
 
 	/* first time through? */
 	if (stats == NULL)
@@ -233,9 +236,16 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 				END_CRIT_SECTION();
 			}
 
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
+				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			{
+				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
+				rescanList = lappend_int(rescanList, recentParent);
+			}
 		}
 		else
 		{
+			recentParent = stack->blkno;
 			/* check for split proceeded after look at parent */
 			pushStackIfSplited(page, stack);
 
@@ -270,5 +280,118 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		vacuum_delay_point();
 	}
 
+	/* rescan inner pages that had empty child pages */
+	foreach(cell,rescanList)
+	{
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber i,
+					maxoff;
+		IndexTuple	idxtuple;
+		ItemId		iid;
+		OffsetNumber todelete[MaxOffsetNumber];
+		Buffer		buftodelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+									RBM_NORMAL, info->strategy);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistcheckpage(rel, buffer);
+		page = (Page) BufferGetPage(buffer);
+
+		Assert(!GistPageIsLeaf(page));
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		{
+			Buffer		leafBuffer;
+			Page		leafPage;
+
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+								RBM_NORMAL, info->strategy);
+			LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafBuffer);
+			leafPage = (Page) BufferGetPage(leafBuffer);
+			Assert(GistPageIsLeaf(leafPage));
+
+			if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* Page is completely empty */
+				&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
+				&& ntodelete < maxoff-1) /* We must keep at least one downlink on each internal page */
+			{
+				buftodelete[ntodelete] = leafBuffer;
+				todelete[ntodelete++] = i;
+			}
+			else
+				UnlockReleaseBuffer(leafBuffer);
+		}
+
+
+		if (ntodelete)
+		{
+			/* 
+			 * Drop references from internal page
+			 * Like in _bt_unlink_halfdead_page, we need an upper bound on the
+			 * xids that could hold downlinks to this page. We use
+			 * ReadNewTransactionId() instead of GetCurrentTransactionId()
+			 * since we are in a VACUUM.
+			 */
+			TransactionId txid = ReadNewTransactionId();
+			START_CRIT_SECTION();
+
+			MarkBufferDirty(buffer);
+				PageIndexMultiDelete(page, todelete, ntodelete);
+
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer, todelete, ntodelete, NULL, 0, InvalidBuffer);
+					PageSetLSN(page, recptr);
+			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
+
+			/* In case of a crash here, we risk leaving unreferenced leaf pages */
+
+			/* Mark pages as deleted */
+			for (i = 0; i < ntodelete; i++)
+			{
+				Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+				PageHeader	header = (PageHeader)leafPage;
+
+				/*
+				 * We use pd_prune_xid to store an upper bound on the xids that
+				 * could still hold downlinks to this page
+				 */
+				header->pd_prune_xid = txid;
+
+				GistPageSetDeleted(leafPage);
+				MarkBufferDirty(buftodelete[i]);
+				stats->pages_deleted++;
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr recptr = gistXLogSetDeleted(rel->rd_node, buftodelete[i], header->pd_prune_xid);
+					PageSetLSN(leafPage, recptr);
+				}
+				else
+					PageSetLSN(leafPage, gistGetFakeLSN(rel));
+
+				UnlockReleaseBuffer(buftodelete[i]);
+			}
+			END_CRIT_SECTION();
+		}
+
+		UnlockReleaseBuffer(buffer);
+
+		vacuum_delay_point();
+	}
+
+	list_free(rescanList);
+
 	return stats;
-}
+}
\ No newline at end of file
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..7d515f3e7e 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,29 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+	PageHeader		header;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+		header = (PageHeader) page;
+
+		header->pd_prune_xid = xldata->id;
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+
+		UnlockReleaseBuffer(buffer);
+	}
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,8 +135,8 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
-			if (GistPageIsLeaf(page))
-				GistMarkTuplesDeleted(page);
+
+			GistMarkTuplesDeleted(page);
 		}
 
 		/* Add new tuples if any */
@@ -324,6 +347,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +468,23 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.id = xid;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	/* no tuple data to log for a page deletion */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..fe4b87084e 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,16 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..8df7f4064d 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,18 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId id;
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+   BlockNumber blkno;
+   int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.15.2 (Apple Git-101.1)

#7Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#6)
Re: GiST VACUUM

On 12/07/18 19:06, Andrey Borodin wrote:

On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

This seems misplaced. This code deals with internal pages, and as
far as I can see, this patch never marks internal pages as deleted,
only leaf pages. However, we should have something like this in the
leaf-page branch, to deal with the case that an insertion lands on
a page that was concurrently deleted. Did you have any tests, where
an insertion runs concurrently with vacuum, that would exercise
this?

That bug could manifest only in case of a crash between removing
downlinks and marking pages deleted.

Hmm. The downlink is removed first, so I don't think you can see that
situation after a crash. After a crash, you might have some empty,
orphaned, pages that have already been unlinked from the parent, but a
search/insert should never encounter them.

Actually, now that I think about it more, I'm not happy with leaving
orphaned pages like that behind. Let's WAL-log the removal of the
downlink, and marking the leaf pages as deleted, in one WAL record, to
avoid that.
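
Something like extending the patch's gistxlogPageDelete into a single
record that covers both pages (a hypothetical sketch, not what the
patches do today):

/* one WAL record covering both the parent and the deleted leaf */
typedef struct gistxlogPageDelete
{
	TransactionId deleteXid;		/* xid horizon for recycling the leaf */
	OffsetNumber  downlinkOffset;	/* downlink removed from the parent */
	/* backup block 0: the deleted leaf page; backup block 1: its parent */
} gistxlogPageDelete;

Then replay removes the downlink and marks the leaf deleted atomically, so
a crash cannot leave the leaf unlinked but not marked.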

But the situation in gistdoinsert(), where you encounter a deleted leaf
page, could happen during normal operation, if vacuum runs concurrently
with an insert. Insertion locks only one page at a time, as it descends
the tree, so after it has released the lock on the parent, but before it
has locked the child, vacuum might have deleted the page. In the latest
patch, you're checking for that just before swapping the shared lock for
an exclusive one, but I think that's wrong; you need to check for that
after swapping the lock, because otherwise vacuum might delete the page
while you're not holding the lock.
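
In other words, something like this ordering (a sketch only, error paths
omitted):

/* trade the shared lock for an exclusive one */
LockBuffer(stack->buffer, GIST_UNLOCK);
LockBuffer(stack->buffer, GIST_EXCLUSIVE);
stack->page = (Page) BufferGetPage(stack->buffer);

/* only now is the deleted-flag check meaningful */
if (GistPageIsDeleted(stack->page))
{
	UnlockReleaseBuffer(stack->buffer);
	state.stack = stack = stack->parent;
	continue;	/* retry from the parent */
}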

I do not know how to test this
reliably. Internal pages are locked before leaves, and the locks are
coupled. No concurrent backend can see downlinks to pages being
deleted, unless a crash happens.

Are you sure? At a quick glance, I don't think the locks are coupled.

We do need some way of testing this...

- Heikki

#8Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#7)
Re: GiST VACUUM

On 12 July 2018, at 20:40, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 12/07/18 19:06, Andrey Borodin wrote:

On 11 July 2018, at 0:07, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
This seems misplaced. This code deals with internal pages, and as
far as I can see, this patch never marks internal pages as deleted,
only leaf pages. However, we should have something like this in the
leaf-page branch, to deal with the case that an insertion lands on
a page that was concurrently deleted. Did you have any tests, where
an insertion runs concurrently with vacuum, that would exercise
this?

That bug could manifest only in case of a crash between removing
downlinks and marking pages deleted.

Hmm. The downlink is removed first, so I don't think you can see that situation after a crash. After a crash, you might have some empty, orphaned, pages that have already been unlinked from the parent, but a search/insert should never encounter them.

Actually, now that I think about it more, I'm not happy with leaving orphaned pages like that behind. Let's WAL-log the removal of the downlink, and marking the leaf pages as deleted, in one WAL record, to avoid that.

OK, will do this. But this will complicate WAL replay seriously, and I do not know a proper way to test that (BTW there is GiST amcheck in progress, but I decided to leave it for a while).

But the situation in gistdoinsert(), where you encounter a deleted leaf page, could happen during normal operation, if vacuum runs concurrently with an insert. Insertion locks only one page at a time, as it descends the tree, so after it has released the lock on the parent, but before it has locked the child, vacuum might have deleted the page. In the latest patch, you're checking for that just before swapping the shared lock for an exclusive one, but I think that's wrong; you need to check for that after swapping the lock, because otherwise vacuum might delete the page while you're not holding the lock.

Looks like a valid concern, I'll move that code again.

I do not know how to test this
reliably. Internal pages are locked before leaves, and the locks are
coupled. No concurrent backend can see downlinks to pages being
deleted, unless a crash happens.

Are you sure? At a quick glance, I don't think the locks are coupled.

Sorry for overquoting

+	/* rescan inner pages that had empty child pages */
+	foreach(cell,rescanList)
+	{
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber i,
+					maxoff;
+		IndexTuple	idxtuple;
+		ItemId		iid;
+		OffsetNumber todelete[MaxOffsetNumber];
+		Buffer		buftodelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+									RBM_NORMAL, info->strategy);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
Here's the first lock
+		gistcheckpage(rel, buffer);
+		page = (Page) BufferGetPage(buffer);
+
+		Assert(!GistPageIsLeaf(page));
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		{
+			Buffer		leafBuffer;
+			Page		leafPage;
+
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+								RBM_NORMAL, info->strategy);
+			LockBuffer(leafBuffer, GIST_EXCLUSIVE);
Now the locks are coupled in a top-down descent.

We do need some way of testing this..

Can we test replication of concurrent VACUUM and inserts in the existing test suite? I just do not know.
I can do these tests manually if that is enough.

Best regards, Andrey Borodin.

#9Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Andrey Borodin (#8)
2 attachment(s)
Re: GiST VACUUM

Attachments:

0002-Physical-GiST-scan-during-VACUUM-v6.patchapplication/octet-stream; name=0002-Physical-GiST-scan-during-VACUUM-v6.patch; x-unix-mode=0644Download
From 6a84dc325d3bf7f5e66e621e74b2b71faba03679 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 13 Jul 2018 17:29:56 +0400
Subject: [PATCH 2/2] Physical GiST scan during VACUUM v6

---
 src/backend/access/gist/README       |  35 ++++
 src/backend/access/gist/gistvacuum.c | 336 +++++++++++++++++++++++++++++++----
 2 files changed, 332 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..9548872be8 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,41 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be used to allocate newly split pages. To find these pages,
+the function can choose between two strategies: a logical or a physical scan.
+
+The physical scan reads the entire index from the first page to the last. It
+maintains the graph structure in a palloc'ed array to collect block numbers of
+internal pages that need to be cleansed of references to empty leaves. The
+array also contains the offsets, on each internal page, of potentially free
+leaf pages. This scan method is chosen when maintenance work memory is
+sufficient to hold the necessary graph structure.
+
+The logical scan is chosen when there is not enough maintenance memory to
+execute the physical scan. The logical scan traverses the GiST index in DFS
+order, following incompletely split branches. The logical scan can be slower
+on hard disk drives.
+
+The result of both scans is the same: a stack of block numbers of internal
+pages, each with a list of offsets potentially referencing empty leaf pages.
+After the scan, each potentially free leaf page is examined, with its internal
+page held under an exclusive lock. gistbulkdelete() never deletes the last
+reference on an internal page, to keep the tree balanced.
+
+The physical scan can return empty leaf page offsets unordered. Thus, before
+executing PageIndexMultiDelete, the offsets (already locked and checked) are
+sorted. This step is not necessary for the logical scan.
+
+Both scans hold only one lock at a time. The physical scan grabs an exclusive
+lock immediately, while the logical scan takes a shared lock and then swaps it
+for an exclusive one. This is done because the physical scan does less work on
+an internal page, and the number of internal pages is relatively small compared
+to the number of leaf pages.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 71b632a4ff..50207be0b7 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -103,8 +103,9 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 typedef struct GistBDItem
 {
-	GistNSN		parentlsn;
-	BlockNumber blkno;
+	GistNSN		 parentlsn;
+	BlockNumber  blkno;
+	OffsetNumber parentoffset;
 	struct GistBDItem *next;
 } GistBDItem;
 
@@ -129,30 +130,204 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 }
 
 /*
- * Bulk deletion of all index entries pointing to a set of heap tuples and
- * check invalid tuples left after upgrade.
- * The set of target tuples is specified via a callback routine that tells
- * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ * During the physical scan, for every parent-child pair we can find either
+ * the parent first or the child first. Every time we open an internal page,
+ * we record the parent block number for every child and set
+ * GIST_PS_HAS_PARENT. When the scan gets to a child page that turns out to
+ * be empty, we follow the parent link back. If we find the child first
+ * (still without a parent link), we mark the page as GIST_PS_EMPTY_LEAF if
+ * it is ready to be deleted. When we scan its parent, we pick it up for the rescan list.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+#define GIST_PS_HAS_PARENT 1
+#define GIST_PS_EMPTY_LEAF 2
+
+
+/* Physical scan item */
+typedef struct GistPSItem
 {
-	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
-	BlockNumber recentParent = InvalidBlockNumber;
-	List	   *rescanList = NULL;
-	ListCell   *cell;
+	BlockNumber  parent;
+	List*        emptyLeafOffsets;
+	OffsetNumber parentOffset;
+	uint16       flags;
+} GistPSItem;
+
+/* Block numbers of internal pages, with offsets to rescan for deletion */
+typedef struct GistRescanItem
+{
+	BlockNumber       blkno;
+	List*             emptyLeafOffsets;
+	struct GistRescanItem* next;
+} GistRescanItem;
+
+/* Read all pages sequentially, populating an array of GistPSItem */
+static GistRescanItem*
+gistbulkdeletephysicalscan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state, BlockNumber npages)
+{
+	Relation	     rel = info->index;
+	GistRescanItem *result = NULL;
+	BlockNumber      blkno;
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
+	/* Here we store the whole graph of the index */
+	GistPSItem *graph = palloc0(npages * sizeof(GistPSItem));
+
+
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		Buffer		 buffer;
+		Page		 page;
+		OffsetNumber i,
+					 maxoff;
+		IndexTuple   idxtuple;
+		ItemId	     iid;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+									info->strategy);
+		/*
+		 * We are not going to stay here for long or call recursive algorithms,
+		 * even for an internal page. So, aggressively grab an exclusive lock.
+		 */
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		page = (Page) BufferGetPage(buffer);
+
+		if (PageIsNew(page) || GistPageIsDeleted(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			/* TODO: Shouldn't we record the free page here? */
+			continue;
+		}
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		if (GistPageIsLeaf(page))
+		{
+			OffsetNumber todelete[MaxOffsetNumber];
+			int			ntodelete = 0;
+
+			/*
+			 * Remove deletable tuples from page
+			 */
+
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = i;
+				else
+					stats->num_index_tuples += 1;
+			}
+
+			stats->tuples_removed += ntodelete;
+
+			/* We have dead tuples on the page */
+			if (ntodelete)
+			{
+				START_CRIT_SECTION();
+
+				MarkBufferDirty(buffer);
+
+				PageIndexMultiDelete(page, todelete, ntodelete);
+				GistMarkTuplesDeleted(page);
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr	recptr;
+
+					recptr = gistXLogUpdate(buffer,
+											todelete, ntodelete,
+											NULL, 0, InvalidBuffer);
+					PageSetLSN(page, recptr);
+				}
+				else
+					PageSetLSN(page, gistGetFakeLSN(rel));
 
+				END_CRIT_SECTION();
+			}
+
+			/* The page is completely empty */
+			if (ntodelete == maxoff)
+			{
+				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
+				if (graph[blkno].flags & GIST_PS_HAS_PARENT)
+				{
+					/* Go to parent and append myself */
+					BlockNumber parentblockno = graph[blkno].parent;
+					graph[parentblockno].emptyLeafOffsets = lappend_int(graph[parentblockno].emptyLeafOffsets, (int)graph[blkno].parentOffset);
+				}
+				else
+				{
+					/* Parent will collect me later */
+					graph[blkno].flags |= GIST_PS_EMPTY_LEAF;
+				}
+			}
+		}
+		else
+		{
+			/* For internal pages, we remember the structure of the tree */
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				BlockNumber childblkno;
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+
+				if (graph[childblkno].flags & GIST_PS_EMPTY_LEAF)
+				{
+					/* Child has been scanned earlier and is ready to be picked up */
+					graph[blkno].emptyLeafOffsets = lappend_int(graph[blkno].emptyLeafOffsets, i);
+				}
+				else
+				{
+					/* Leaf not scanned yet; record the parent link so it can report itself later */
+					graph[childblkno].parent = blkno;
+					graph[childblkno].parentOffset = i;
+					graph[childblkno].flags |= GIST_PS_HAS_PARENT;
+				}
+
+
+				if (GistTupleIsInvalid(idxtuple))
+					ereport(LOG,
+							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
+									RelationGetRelationName(rel)),
+							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
+							 errhint("Please REINDEX it.")));
+			}
+		}
+		UnlockReleaseBuffer(buffer);
+	}
+
+	/* Search for internal pages pointing to empty leaves */
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		if (graph[blkno].emptyLeafOffsets)
+		{
+			GistRescanItem *next = palloc(sizeof(GistRescanItem));
+			next->blkno = blkno;
+			next->emptyLeafOffsets = graph[blkno].emptyLeafOffsets;
+			next->next = result;
+			result = next;
+		}
+	}
+
+	pfree(graph);
+
+	return result;
+}
+
+/* Logical scan descends from the root to the leaves in DFS order */
+static GistRescanItem*
+gistbulkdeletelogicalscan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation        rel = info->index;
+	BlockNumber     recentParent = InvalidBlockNumber;
+	GistBDItem     *stack,
+				   *ptr;
+	GistRescanItem *result = NULL;
+
+	/* This stack is used to organize DFS */
 	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
 	stack->blkno = GIST_ROOT_BLKNO;
 
@@ -237,11 +412,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				END_CRIT_SECTION();
 			}
 
-			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
-				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber)
 			{
 				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
-				rescanList = lappend_int(rescanList, recentParent);
+				if (result == NULL || result->blkno != recentParent)
+				{
+					GistRescanItem *next = palloc(sizeof(GistRescanItem));
+					next->blkno = recentParent;
+					next->emptyLeafOffsets = NULL;
+					next->next = result;
+					result = next;
+				}
+				result->emptyLeafOffsets = lappend_int(result->emptyLeafOffsets, stack->parentoffset);
 			}
 		}
 		else
@@ -261,6 +443,7 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 				ptr->parentlsn = BufferGetLSNAtomic(buffer);
 				ptr->next = stack->next;
+				ptr->parentoffset = i;
 				stack->next = ptr;
 
 				if (GistTupleIsInvalid(idxtuple))
@@ -281,20 +464,82 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		vacuum_delay_point();
 	}
 
-	/* rescan inner pages that had empty child pages */
-	foreach(cell,rescanList)
+	return result;
+}
+
+/*
+ * Comparator used to sort offsets.
+ * When employing the physical scan, rescan offsets are not ordered.
+ */
+static int
+compare_offsetnumber(const void *x, const void *y)
+{
+	OffsetNumber a = *((OffsetNumber *)x);
+	OffsetNumber b = *((OffsetNumber *)y);
+	return a - b;
+}
+
+/*
+ * Bulk deletion of all index entries pointing to a set of heap tuples and
+ * check invalid tuples left after upgrade.
+ * The set of target tuples is specified via a callback routine that tells
+ * whether any given heap tuple (identified by ItemPointer) is being deleted.
+ *
+ * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ */
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation		rel = info->index;
+	GistRescanItem *rescan;
+	BlockNumber		npages;
+	bool			needLock;
+
+	/* first time through? */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	/* we'll re-count the tuples each time */
+	stats->estimated_count = false;
+	stats->num_index_tuples = 0;
+
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/* try to find deleted pages */
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	npages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	/* If the map of the whole graph does not fit in maintenance_work_mem, fall back to the logical scan; otherwise read the whole index sequentially */
+	if (npages * (sizeof(GistPSItem)) > maintenance_work_mem * 1024)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-		OffsetNumber todelete[MaxOffsetNumber];
-		Buffer		buftodelete[MaxOffsetNumber];
-		int			ntodelete = 0;
+		rescan = gistbulkdeletelogicalscan(info, stats, callback, callback_state);
+	}
+	else
+	{
+		rescan = gistbulkdeletephysicalscan(info, stats, callback, callback_state, npages);
+	}
 
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+	/* rescan inner pages that had empty child pages */
+	while (rescan)
+	{
+		Buffer			 buffer;
+		Page			 page;
+		OffsetNumber 	 i,
+						 maxoff;
+		IndexTuple		 idxtuple;
+		ItemId			 iid;
+		OffsetNumber 	 todelete[MaxOffsetNumber];
+		Buffer			 buftodelete[MaxOffsetNumber];
+		int				 ntodelete = 0;
+		ListCell  		*cell;
+		GistRescanItem	*oldRescan;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, rescan->blkno,
 									RBM_NORMAL, info->strategy);
 		LockBuffer(buffer, GIST_EXCLUSIVE);
 		gistcheckpage(rel, buffer);
@@ -304,11 +549,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		maxoff = PageGetMaxOffsetNumber(page);
 
-		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		/* Check that the leaves are still empty and decide what to delete */
+		foreach(cell, rescan->emptyLeafOffsets)
 		{
 			Buffer		leafBuffer;
 			Page		leafPage;
 
+			i = (OffsetNumber)lfirst_int(cell);
+			if(i > maxoff)
+			{
+				continue;
+			}
+
 			iid = PageGetItemId(page, i);
 			idxtuple = (IndexTuple) PageGetItem(page, iid);
 
@@ -333,6 +585,9 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		if (ntodelete)
 		{
+			/* Sort the possibly unordered offsets */
+			qsort(todelete, ntodelete, sizeof(OffsetNumber), compare_offsetnumber);
+
 			/* 
 			 * Like in _bt_unlink_halfdead_page we need an upper bound on the
 			 * xids that could hold downlinks to this page. We use
@@ -383,11 +638,14 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		}
 
 		UnlockReleaseBuffer(buffer);
+		oldRescan = rescan;
+		rescan = rescan->next;
+		list_free(oldRescan->emptyLeafOffsets);
+		pfree(oldRescan);
 
 		vacuum_delay_point();
 	}
 
-	list_free(rescanList);
 
 	return stats;
 }
\ No newline at end of file
-- 
2.15.2 (Apple Git-101.1)
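
For a sense of scale of the maintenance_work_mem check in the patch above, here
is a minimal standalone sketch with assumed sizes (sizeof(GistPSItem) is taken
as roughly 24 bytes on a 64-bit build; the numbers are illustrative, not
measured):

#include <stdio.h>

int
main(void)
{
	size_t		item_size = 24;					/* assumed sizeof(GistPSItem) */
	size_t		budget = 64UL * 1024 * 1024;	/* maintenance_work_mem = 64MB */
	size_t		npages = budget / item_size;	/* pages the graph map can cover */

	/* ~2.7M pages, i.e. a ~21GiB index at 8kB pages, still fits the physical scan */
	printf("physical scan feasible up to %zu pages (~%zu GiB)\n",
		   npages, npages * 8192 / (1024UL * 1024 * 1024));
	return 0;
}

So with the default maintenance_work_mem, an index has to reach tens of
gigabytes before the logical-scan fallback is taken.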

0001-Delete-pages-during-GiST-VACUUM-v6.patch (application/octet-stream)
From 8b1d093415f9a11eae182682fb8e6d9c768cce45 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 13 Jul 2018 16:07:45 +0400
Subject: [PATCH 1/2] Delete pages during GiST VACUUM v6

---
 src/backend/access/gist/gist.c       |  17 +++++
 src/backend/access/gist/gistbuild.c  |   5 --
 src/backend/access/gist/gistutil.c   |   8 ++-
 src/backend/access/gist/gistvacuum.c | 127 +++++++++++++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c   |  65 +++++++++++++++++-
 src/include/access/gist_private.h    |  24 +++++--
 src/include/access/gistxlog.h        |  17 ++++-
 src/test/regress/expected/gist.out   |   4 +-
 src/test/regress/sql/gist.sql        |   4 +-
 9 files changed, 245 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..941899c89f 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/* 
+			 * Currently internal pages are not deleted during vacuum,
+			 * so we do not need to check whether the page is deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -834,6 +839,18 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced in case of
+			 * crash during VACUUM's gistbulkdelete()
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 12804c321c..eaac881a1d 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -800,13 +801,18 @@ gistNewBuffer(Relation r)
 		if (ConditionalLockBuffer(buffer))
 		{
 			Page		page = BufferGetPage(buffer);
+			PageHeader	p = (PageHeader) page;
 
 			if (PageIsNew(page))
 				return buffer;	/* OK to use, if never initialized */
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			/*
+			 * We use pd_prune_xid to store an upper bound on the xids that
+			 * could hold downlinks to this page
+			 */
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(p->pd_prune_xid, RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..71b632a4ff 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,10 +16,13 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
+#include "utils/snapmgr.h"
+#include "access/xact.h"
 
 
 /*
@@ -125,7 +128,6 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 	}
 }
 
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -135,12 +137,14 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
 {
 	Relation	rel = info->index;
 	GistBDItem *stack,
 			   *ptr;
+	BlockNumber recentParent = InvalidBlockNumber;
+	List	   *rescanList = NULL;
+	ListCell   *cell;
 
 	/* first time through? */
 	if (stats == NULL)
@@ -233,9 +237,16 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 				END_CRIT_SECTION();
 			}
 
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
+				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			{
+				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
+				rescanList = lappend_int(rescanList, recentParent);
+			}
 		}
 		else
 		{
+			recentParent = stack->blkno;
 			/* check for split proceeded after look at parent */
 			pushStackIfSplited(page, stack);
 
@@ -270,5 +281,113 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		vacuum_delay_point();
 	}
 
+	/* rescan inner pages that had empty child pages */
+	foreach(cell,rescanList)
+	{
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber i,
+					maxoff;
+		IndexTuple	idxtuple;
+		ItemId		iid;
+		OffsetNumber todelete[MaxOffsetNumber];
+		Buffer		buftodelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+									RBM_NORMAL, info->strategy);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistcheckpage(rel, buffer);
+		page = (Page) BufferGetPage(buffer);
+
+		Assert(!GistPageIsLeaf(page));
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		{
+			Buffer		leafBuffer;
+			Page		leafPage;
+
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+								RBM_NORMAL, info->strategy);
+			LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafBuffer);
+			leafPage = (Page) BufferGetPage(leafBuffer);
+			Assert(GistPageIsLeaf(leafPage));
+
+			if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* The leaf is completely empty */
+				&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No incomplete split */
+				&& ntodelete < maxoff-1) /* We must keep at least one leaf page per internal page */
+			{
+				buftodelete[ntodelete] = leafBuffer;
+				todelete[ntodelete++] = i;
+			}
+			else
+				UnlockReleaseBuffer(leafBuffer);
+		}
+
+
+		if (ntodelete)
+		{
+			/* 
+			 * Like in _bt_unlink_halfdead_page we need an upper bound on the
+			 * xids that could hold downlinks to this page. We use
+			 * ReadNewTransactionId() instead of GetCurrentTransactionId()
+			 * since we are in a VACUUM.
+			 */
+			TransactionId txid = ReadNewTransactionId();
+
+			START_CRIT_SECTION();
+
+			/* Mark pages as deleted dropping references from internal pages */
+			for (i = 0; i < ntodelete; i++)
+			{
+				Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+				PageHeader	header = (PageHeader)leafPage;
+
+				/*
+				 * We use pd_prune_xid to store an upper bound on the xids that
+				 * could hold downlinks to this page
+				 */
+				header->pd_prune_xid = txid;
+
+				GistPageSetDeleted(leafPage);
+				MarkBufferDirty(buftodelete[i]);
+				stats->pages_deleted++;
+
+				MarkBufferDirty(buffer);
+				/* Offsets shift as we delete tuples from the internal page */
+				PageIndexTupleDelete(page, todelete[i] - i);
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr recptr =
+						gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+						header->pd_prune_xid, buffer, todelete[i] - i);
+					PageSetLSN(page, recptr);
+					PageSetLSN(leafPage, recptr);
+				}
+				else
+				{
+					PageSetLSN(page, gistGetFakeLSN(rel));
+					PageSetLSN(leafPage, gistGetFakeLSN(rel));
+				}
+
+				UnlockReleaseBuffer(buftodelete[i]);
+			}
+			END_CRIT_SECTION();
+		}
+
+		UnlockReleaseBuffer(buffer);
+
+		vacuum_delay_point();
+	}
+
+	list_free(rescanList);
+
 	return stats;
-}
+}
\ No newline at end of file
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..a0c42b5fbb 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,41 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+	PageHeader		header;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+		header = (PageHeader) page;
+
+		header->pd_prune_xid = xldata->id;
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+
+		UnlockReleaseBuffer(buffer);
+	}
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,8 +147,8 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
-			if (GistPageIsLeaf(page))
-				GistMarkTuplesDeleted(page);
+
+			GistMarkTuplesDeleted(page);
 		}
 
 		/* Add new tuples if any */
@@ -324,6 +359,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +480,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * downlink from internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.id = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	/* no index tuples are registered with this record */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..1f82695b1d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..f93024ab25 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,19 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId id;
+   OffsetNumber downlinkOffset; /* Offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+   BlockNumber blkno;
+   int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.15.2 (Apple Git-101.1)
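
The page-reuse gate this patch adds to gistNewBuffer() can be read in isolation
as the sketch below. It restates the v6 hunk; gist_page_recyclable() is a name
invented for illustration and is not part of the patch:

#include "postgres.h"
#include "access/gist_private.h"
#include "access/transam.h"
#include "utils/snapmgr.h"

/*
 * A deleted page may be recycled only once no running transaction can still
 * hold a downlink to it, i.e. the xid stashed at deletion time (pd_prune_xid
 * in v6) precedes RecentGlobalDataXmin.
 */
static bool
gist_page_recyclable(Page page)
{
	PageHeader	p = (PageHeader) page;

	if (PageIsNew(page))
		return true;			/* never initialized: OK to use */

	return GistPageIsDeleted(page) &&
		TransactionIdPrecedes(p->pd_prune_xid, RecentGlobalDataXmin);
}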

#10Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#9)
Re: GiST VACUUM

On 13/07/18 16:41, Andrey Borodin wrote:

On 12 July 2018, at 21:07, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
On 12 July 2018, at 20:40, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Actually, now that I think about it more, I'm not happy with leaving orphaned
pages like that behind. Let's WAL-log the removal of the downlink, and
marking the leaf pages as deleted, in one WAL record, to avoid that.

OK, will do this. But this will complicate WAL replay seriously, and I do not
know a proper way to test that (BTW there is GiST amcheck in progress, but I
decided to leave it for a while).

Done. Now WAL record for deleted page also removes downlink from internal page.
I had to use PageIndexTupleDelete() instead of PageIndexMultiDelete(), but
I do not think it will have any impact on performance.

Yeah, I think that's fine; this isn't that performance-critical.

But the situation in gistdoinsert(), where you encounter a deleted leaf page,
could happen during normal operation, if vacuum runs concurrently with an
insert. Insertion locks only one page at a time, as it descends the tree, so
after it has released the lock on the parent, but before it has locked the
child, vacuum might have deleted the page. In the latest patch, you're
checking for that just before swapping the shared lock for an exclusive one,
but I think that's wrong; you need to check for that after swapping the lock,
because otherwise vacuum might delete the page while you're not holding the lock.

Looks like a valid concern, I'll move that code again.

Done.

Ok, the comment now says:

+			/*
+			 * Leaf pages can be left deleted but still referenced in case of
+			 * crash during VACUUM's gistbulkdelete()
+			 */

But that's not accurate, right? You should never see deleted pages after
a crash, because the parent is updated in the same WAL record as the
child page, right?

I'm still a bit scared about using pd_prune_xid to store the XID that
prevents recycling the page too early. Can we use some field in
GISTPageOpaqueData for that, similar to how the B-tree stores it in
BTPageOpaqueData?

- Heikki
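
For reference, the B-tree precedent mentioned here: nbtree stores the gating
XID in the deleted page's special space, reusing a field that has no meaning
once the page is dead. A sketch against the nbtree headers of that era,
mirroring _bt_page_recyclable(); it is shown only as the analogy being
suggested for GiST:

#include "postgres.h"
#include "access/nbtree.h"
#include "access/transam.h"
#include "utils/snapmgr.h"

static bool
bt_page_recyclable_sketch(Page page)
{
	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

	/* on a deleted page the btpo union holds an xid instead of a tree level */
	return P_ISDELETED(opaque) &&
		TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin);
}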

#11Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#9)
Re: GiST VACUUM

Looking at the second patch, to scan the GiST index in physical order,
that seems totally unsafe, if there are any concurrent page splits. In
the logical scan, pushStackIfSplited() deals with that, by comparing the
page's NSN with the parent's LSN. But I don't see anything like that in
the physical scan code.

I think we can do something similar in the physical scan: remember the
current LSN position at the beginning of the vacuum, and compare with
that. The B-tree code uses the "cycle ID" for similar purposes.

Do we still need the separate gistvacuumcleanup() pass, if we scan the
index in physical order in the bulkdelete pass already?

- Heikki
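
A sketch of the check proposed here; it is not part of the posted patches, only
an illustration of the idea: capture the WAL insert position once at the start
of the vacuum, then treat any page whose NSN is newer as concurrently split.

#include "postgres.h"
#include "access/gist_private.h"
#include "access/xlog.h"

static XLogRecPtr	vacuum_start_lsn;

/* called once, before the physical scan reads its first page */
static void
physical_scan_begin(void)
{
	vacuum_start_lsn = GetInsertRecPtr();
}

/*
 * A split stamps the NSN on the new right half, so an NSN past the start of
 * the scan means tuples may have moved to a sibling page we already passed.
 */
static bool
page_split_since_scan_start(Page page)
{
	return GistFollowRight(page) ||
		vacuum_start_lsn < GistPageGetNSN(page);
}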

#12Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#10)
2 attachment(s)
Re: GiST VACUUM

On 13 July 2018, at 18:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

But the situation in gistdoinsert(), where you encounter a deleted leaf page, could happen during normal operation, if vacuum runs concurrently with an insert. Insertion locks only one page at a time, as it descends the tree, so after it has released the lock on the parent, but before it has locked the child, vacuum might have deleted the page. In the latest patch, you're checking for that just before swapping the shared lock for an exclusive one, but I think that's wrong; you need to check for that after swapping the lock, because otherwise vacuum might delete the page while you're not holding the lock.

Looks like a valid concern, I'll move that code again.

Done.

Ok, the comment now says:

+			/*
+			 * Leaf pages can be left deleted but still referenced in case of
+			 * crash during VACUUM's gistbulkdelete()
+			 */

But that's not accurate, right? You should never see deleted pages after a crash, because the parent is updated in the same WAL record as the child page, right?

Fixed the comment.

I'm still a bit scared about using pd_prune_xid to store the XID that prevents recycling the page too early. Can we use some field in GISTPageOpaqueData for that, similar to how the B-tree stores it in BTPageOpaqueData?

There is no room in the opaque data, but technically the whole page is just a tombstone until it is reused, so we can pick an arbitrary place. PFA v7, where the xid taken after deletion is stored in the opaque data; we could also use other places, like the line pointer array or opaque-1.
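
The accessor macros for the new opaque field are cut off from the v7 gist.h
hunk as archived below; the following reconstruction from
GISTDeletedPageOpaqueData and the call sites is an assumption, not the verbatim
patch text:

/* assumed shape of the v7 accessors; the archived hunk truncates before them */
#define GistPageSetDeleteXid(page, xid) \
	(((GISTDeletedPageOpaque) PageGetSpecialPointer(page))->deleteXid = (xid))
#define GistPageGetDeleteXid(page) \
	(((GISTDeletedPageOpaque) PageGetSpecialPointer(page))->deleteXid)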

On 13 July 2018, at 18:25, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Looking at the second patch, to scan the GiST index in physical order, that seems totally unsafe, if there are any concurrent page splits. In the logical scan, pushStackIfSplited() deals with that, by comparing the page's NSN with the parent's LSN. But I don't see anything like that in the physical scan code.

A leaf page can be pointed to by an internal page and a rightlink simultaneously. The purpose of the NSN is to visit such a page exactly once, by following only one of the two links in a scan. This is achieved naturally if we read everything from the beginning to the end. (That is how I understand it; I may be wrong.)

I think we can do something similar in the physical scan: remember the current LSN position at the beginning of the vacuum, and compare with that. The B-tree code uses the "cycle ID" for similar purposes.

Do we still need the separate gistvacuumcleanup() pass, if we scan the index in physical order in the bulkdelete pass already?

We do not need to gather stats there, but we are doing RecordFreeIndexPage() and IndexFreeSpaceMapVacuum(). Is it correct to move these to the first scan?
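
Both calls named above are existing PostgreSQL free space map APIs. A minimal
sketch of what folding them into the bulkdelete pass could look like
(report_free_pages() is an invented helper name, not code from the patches):

#include "postgres.h"
#include "storage/indexfsm.h"
#include "utils/rel.h"

static void
report_free_pages(Relation rel, BlockNumber *freeable, int nfreeable)
{
	int			i;

	/* record each deleted page the scan encountered ... */
	for (i = 0; i < nfreeable; i++)
		RecordFreeIndexPage(rel, freeable[i]);

	/* ... and publish them, so gistNewBuffer() can find free pages */
	IndexFreeSpaceMapVacuum(rel);
}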

Best regards, Andrey Borodin.

Attachments:

0002-Physical-GiST-scan-during-VACUUM-v7.patch (application/octet-stream)
From 731d5314f209a68dbc35e64722a63d26c26b2391 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 13 Jul 2018 22:19:57 +0400
Subject: [PATCH 2/2] Physical GiST scan during VACUUM v7

---
 src/backend/access/gist/gistvacuum.c | 338 ++++++++++++++++++++++++++++++-----
 1 file changed, 298 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a96d91eb1d..8c27f123b4 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -103,8 +103,9 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 typedef struct GistBDItem
 {
-	GistNSN		parentlsn;
-	BlockNumber blkno;
+	GistNSN		 parentlsn;
+	BlockNumber  blkno;
+	OffsetNumber parentoffset;
 	struct GistBDItem *next;
 } GistBDItem;
 
@@ -129,30 +130,204 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 }
 
 /*
- * Bulk deletion of all index entries pointing to a set of heap tuples and
- * check invalid tuples left after upgrade.
- * The set of target tuples is specified via a callback routine that tells
- * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ * During the physical scan, for each parent-child pair we can find either the
+ * parent or the child first. Every time we open an internal page, we record
+ * the parent block number for each child and set GIST_PS_HAS_PARENT. When the
+ * scan later reaches a child page that turns out to be empty, we follow the
+ * parent link back. If we find the child first (still without a parent link),
+ * we mark the page GIST_PS_EMPTY_LEAF if it is ready to be deleted; when we
+ * scan its parent, we pick the page up into the rescan list.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+#define GIST_PS_HAS_PARENT 1
+#define GIST_PS_EMPTY_LEAF 2
+
+
+/* Physical scan item */
+typedef struct GistPSItem
 {
-	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
-	BlockNumber recentParent = InvalidBlockNumber;
-	List	   *rescanList = NULL;
-	ListCell   *cell;
+	BlockNumber  parent;
+	List*        emptyLeafOffsets;
+	OffsetNumber parentOffset;
+	uint16       flags;
+} GistPSItem;
+
+/* Block number of an internal page, with offsets to rescan for deletion */
+typedef struct GistRescanItem
+{
+	BlockNumber       blkno;
+	List*             emptyLeafOffsets;
+	struct GistRescanItem* next;
+} GistRescanItem;
+
+/* Read all pages sequentially, populating an array of GistPSItem */
+static GistRescanItem*
+gistbulkdeletephysicalscan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state, BlockNumber npages)
+{
+	Relation	     rel = info->index;
+	GistRescanItem *result = NULL;
+	BlockNumber      blkno;
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
+	/* Here we store the whole graph of the index */
+	GistPSItem *graph = palloc0(npages * sizeof(GistPSItem));
+
+
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		Buffer		 buffer;
+		Page		 page;
+		OffsetNumber i,
+					 maxoff;
+		IndexTuple   idxtuple;
+		ItemId	     iid;
+
+		vacuum_delay_point();
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+									info->strategy);
+		/*
+		 * We are not going to stay here for long or call recursive algorithms,
+		 * even for an internal page. So, aggressively grab an exclusive lock.
+		 */
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		page = (Page) BufferGetPage(buffer);
+
+		if (PageIsNew(page) || GistPageIsDeleted(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			/* TODO: Shouldn't we record the free page here? */
+			continue;
+		}
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		if (GistPageIsLeaf(page))
+		{
+			OffsetNumber todelete[MaxOffsetNumber];
+			int			ntodelete = 0;
+
+			/*
+			 * Remove deletable tuples from page
+			 */
+
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = i;
+				else
+					stats->num_index_tuples += 1;
+			}
+
+			stats->tuples_removed += ntodelete;
+
+			/* We have dead tuples on the page */
+			if (ntodelete)
+			{
+				START_CRIT_SECTION();
+
+				MarkBufferDirty(buffer);
+
+				PageIndexMultiDelete(page, todelete, ntodelete);
+				GistMarkTuplesDeleted(page);
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr	recptr;
+
+					recptr = gistXLogUpdate(buffer,
+											todelete, ntodelete,
+											NULL, 0, InvalidBuffer);
+					PageSetLSN(page, recptr);
+				}
+				else
+					PageSetLSN(page, gistGetFakeLSN(rel));
+
+				END_CRIT_SECTION();
+			}
+
+			/* The page is completely empty */
+			if (ntodelete == maxoff)
+			{
+				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
+				if (graph[blkno].flags & GIST_PS_HAS_PARENT)
+				{
+					/* Go to parent and append myself */
+					BlockNumber parentblockno = graph[blkno].parent;
+					graph[parentblockno].emptyLeafOffsets = lappend_int(graph[parentblockno].emptyLeafOffsets, (int)graph[blkno].parentOffset);
+				}
+				else
+				{
+					/* Parent will collect me later */
+					graph[blkno].flags |= GIST_PS_EMPTY_LEAF;
+				}
+			}
+		}
+		else
+		{
+			/* For internal pages, we remember the structure of the tree */
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				BlockNumber childblkno;
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+
+				if (graph[childblkno].flags & GIST_PS_EMPTY_LEAF)
+				{
+					/* Child has been scanned earlier and is ready to be picked up */
+					graph[blkno].emptyLeafOffsets = lappend_int(graph[blkno].emptyLeafOffsets, i);
+				}
+				else
+				{
+					/* Leaf not scanned yet; record the parent link so it can report itself later */
+					graph[childblkno].parent = blkno;
+					graph[childblkno].parentOffset = i;
+					graph[childblkno].flags |= GIST_PS_HAS_PARENT;
+				}
+
+
+				if (GistTupleIsInvalid(idxtuple))
+					ereport(LOG,
+							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
+									RelationGetRelationName(rel)),
+							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
+							 errhint("Please REINDEX it.")));
+			}
+		}
+		UnlockReleaseBuffer(buffer);
+	}
+
+	/* Search for internal pages pointing to empty leaves */
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		if (graph[blkno].emptyLeafOffsets)
+		{
+			GistRescanItem *next = palloc(sizeof(GistRescanItem));
+			next->blkno = blkno;
+			next->emptyLeafOffsets = graph[blkno].emptyLeafOffsets;
+			next->next = result;
+			result = next;
+		}
+	}
 
+	pfree(graph);
+
+	return result;
+}
+
+/* Logical scan descends from the root to the leaves in DFS order */
+static GistRescanItem*
+gistbulkdeletelogicalscan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation        rel = info->index;
+	BlockNumber     recentParent = InvalidBlockNumber;
+	GistBDItem     *stack,
+				   *ptr;
+	GistRescanItem *result = NULL;
+
+	/* This stack is used to organize DFS */
 	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
 	stack->blkno = GIST_ROOT_BLKNO;
 
@@ -237,11 +412,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				END_CRIT_SECTION();
 			}
 
-			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
-				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber)
 			{
 				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
-				rescanList = lappend_int(rescanList, recentParent);
+				if (result == NULL || result->blkno != recentParent)
+				{
+					GistRescanItem *next = palloc(sizeof(GistRescanItem));
+					next->blkno = recentParent;
+					next->emptyLeafOffsets = NULL;
+					next->next = result;
+					result = next;
+				}
+				result->emptyLeafOffsets = lappend_int(result->emptyLeafOffsets, stack->parentoffset);
 			}
 		}
 		else
@@ -261,6 +443,7 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 				ptr->parentlsn = BufferGetLSNAtomic(buffer);
 				ptr->next = stack->next;
+				ptr->parentoffset = i;
 				stack->next = ptr;
 
 				if (GistTupleIsInvalid(idxtuple))
@@ -281,20 +464,82 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		vacuum_delay_point();
 	}
 
-	/* rescan inner pages that had empty child pages */
-	foreach(cell,rescanList)
+	return result;
+}
+
+/*
+ * Comparator used to sort offsets.
+ * When employing the physical scan, rescan offsets are not ordered.
+ */
+static int
+compare_offsetnumber(const void *x, const void *y)
+{
+	OffsetNumber a = *((OffsetNumber *)x);
+	OffsetNumber b = *((OffsetNumber *)y);
+	return a - b;
+}
+
+/*
+ * Bulk deletion of all index entries pointing to a set of heap tuples and
+ * check invalid tuples left after upgrade.
+ * The set of target tuples is specified via a callback routine that tells
+ * whether any given heap tuple (identified by ItemPointer) is being deleted.
+ *
+ * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ */
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation		rel = info->index;
+	GistRescanItem *rescan;
+	BlockNumber		npages;
+	bool			needLock;
+
+	/* first time through? */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	/* we'll re-count the tuples each time */
+	stats->estimated_count = false;
+	stats->num_index_tuples = 0;
+
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/* try to find deleted pages */
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	npages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	/* If the map of the whole graph does not fit in maintenance_work_mem, fall back to the logical scan; otherwise read the whole index sequentially */
+	if (npages * (sizeof(GistPSItem)) > maintenance_work_mem * 1024)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-		OffsetNumber todelete[MaxOffsetNumber];
-		Buffer		buftodelete[MaxOffsetNumber];
-		int			ntodelete = 0;
+		rescan = gistbulkdeletelogicalscan(info, stats, callback, callback_state);
+	}
+	else
+	{
+		rescan = gistbulkdeletephysicalscan(info, stats, callback, callback_state, npages);
+	}
 
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+	/* rescan inner pages that had empty child pages */
+	while (rescan)
+	{
+		Buffer			 buffer;
+		Page			 page;
+		OffsetNumber 	 i,
+						 maxoff;
+		IndexTuple		 idxtuple;
+		ItemId			 iid;
+		OffsetNumber 	 todelete[MaxOffsetNumber];
+		Buffer			 buftodelete[MaxOffsetNumber];
+		int				 ntodelete = 0;
+		ListCell  		*cell;
+		GistRescanItem	*oldRescan;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, rescan->blkno,
 									RBM_NORMAL, info->strategy);
 		LockBuffer(buffer, GIST_EXCLUSIVE);
 		gistcheckpage(rel, buffer);
@@ -304,11 +549,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		maxoff = PageGetMaxOffsetNumber(page);
 
-		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		/* Check that the leaves are still empty and decide what to delete */
+		foreach(cell, rescan->emptyLeafOffsets)
 		{
 			Buffer		leafBuffer;
 			Page		leafPage;
 
+			i = (OffsetNumber)lfirst_int(cell);
+			if(i > maxoff)
+			{
+				continue;
+			}
+
 			iid = PageGetItemId(page, i);
 			idxtuple = (IndexTuple) PageGetItem(page, iid);
 
@@ -333,7 +585,10 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		if (ntodelete)
 		{
-			/*
+			/* Sort the possibly unordered offsets */
+			qsort(todelete, ntodelete, sizeof(OffsetNumber), compare_offsetnumber);
+
+			/* 
 			 * Like in _bt_unlink_halfdead_page we need an upper bound on the
 			 * xids that could hold downlinks to this page. We use
 			 * ReadNewTransactionId() instead of GetCurrentTransactionId()
@@ -378,11 +633,14 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		}
 
 		UnlockReleaseBuffer(buffer);
+		oldRescan = rescan;
+		rescan = rescan->next;
+		list_free(oldRescan->emptyLeafOffsets);
+		pfree(oldRescan);
 
 		vacuum_delay_point();
 	}
 
-	list_free(rescanList);
 
 	return stats;
 }
\ No newline at end of file
-- 
2.15.2 (Apple Git-101.1)
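
To make the parent/child handshake in gistbulkdeletephysicalscan() above
concrete, here is a standalone toy with simplified types; both orders in which
a parent-child pair can be met by the sequential scan are shown:

#include <stdio.h>

#define HAS_PARENT 1			/* stands in for GIST_PS_HAS_PARENT */
#define EMPTY_LEAF 2			/* stands in for GIST_PS_EMPTY_LEAF */

typedef struct
{
	int			parent;
	int			parentOffset;
	int			flags;
} Item;

int
main(void)
{
	Item		graph[3] = {{0}};

	/*
	 * Child first: leaf 2 is found empty before its parent has been read, so
	 * it can only raise EMPTY_LEAF and wait to be collected.
	 */
	graph[2].flags |= EMPTY_LEAF;

	/*
	 * When internal page 0 is read later and sees the flag on its child, it
	 * appends the downlink offset to its own rescan list.
	 */
	if (graph[2].flags & EMPTY_LEAF)
		printf("page 0 collects empty child 2\n");

	/*
	 * Parent first: internal page 0 leaves a parent link on leaf 1, and the
	 * leaf reports itself through graph[parent] when it turns out empty.
	 */
	graph[1].parent = 0;
	graph[1].parentOffset = 1;
	graph[1].flags |= HAS_PARENT;
	if (graph[1].flags & HAS_PARENT)
		printf("empty leaf 1 reports itself to parent %d at offset %d\n",
			   graph[1].parent, graph[1].parentOffset);
	return 0;
}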

0001-Delete-pages-during-GiST-VACUUM-v7.patch (application/octet-stream)
From 403ff258a881eda28122a4a3a42a58c9e9a4c9a0 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 13 Jul 2018 22:12:17 +0400
Subject: [PATCH 1/2] Delete pages during GiST VACUUM v7

---
 src/backend/access/gist/README       |  35 ++++++++++
 src/backend/access/gist/gist.c       |  18 ++++++
 src/backend/access/gist/gistbuild.c  |   5 --
 src/backend/access/gist/gistutil.c   |   3 +-
 src/backend/access/gist/gistvacuum.c | 122 +++++++++++++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c   |  63 +++++++++++++++++-
 src/include/access/gist.h            |  11 ++++
 src/include/access/gist_private.h    |  24 +++++--
 src/include/access/gistxlog.h        |  17 ++++-
 src/test/regress/expected/gist.out   |   4 +-
 src/test/regress/sql/gist.sql        |   4 +-
 11 files changed, 280 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..9548872be8 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,41 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free,
+so that they can be reused for newly split pages. To find these pages, the
+function can choose between two strategies: a logical scan or a physical scan.
+
+The physical scan reads the entire index from the first page to the last. This
+scan maintains the graph structure in a palloc'ed array, collecting the block
+numbers of internal pages that must be cleansed of references to empty leaves.
+The array also contains the offsets, within each internal page, of downlinks to
+potentially free leaf pages. This scan method is chosen when maintenance work
+memory is sufficient to hold the necessary graph structure.
+
+The logical scan is chosen when there is not enough maintenance memory to
+execute the physical scan. The logical scan traverses the GiST index
+depth-first, also descending into incomplete split branches. The logical scan
+can be slower on hard disk drives.
+
+The result of both scans is the same: a stack of block numbers of internal
+pages, each with a list of offsets potentially referencing empty leaf pages.
+After the scan, each internal page is taken under exclusive lock, and each
+potentially free leaf page is examined. gistbulkdelete() never deletes the last
+remaining reference on an internal page, to preserve the balanced tree properties.
+
+The physical scan can return empty leaf page offsets out of order. Thus, before
+executing PageIndexMultiDelete, the offsets (already locked and checked) are
+sorted. This step is not necessary for the logical scan.
+
+Both scans hold only one lock at a time. The physical scan grabs an exclusive
+lock immediately, while the logical scan takes a shared lock and then upgrades
+it to exclusive. This is acceptable because the physical scan does little work
+per internal page, and the number of internal pages is low compared to the
+number of leaf pages.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..5297e1691d 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum,
+			 * so we do not need to check whether the page is deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -834,6 +839,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until their
+			 * space is reused. The downlink to such a page may already have been
+			 * removed from the internal page, but this scan can still reach it.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 12804c321c..41978bb5e5 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -806,7 +807,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..a96d91eb1d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,10 +16,13 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
+#include "utils/snapmgr.h"
+#include "access/xact.h"
 
 
 /*
@@ -125,7 +128,6 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 	}
 }
 
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -135,12 +137,14 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
 {
 	Relation	rel = info->index;
 	GistBDItem *stack,
 			   *ptr;
+	BlockNumber recentParent = InvalidBlockNumber;
+	List	   *rescanList = NULL;
+	ListCell   *cell;
 
 	/* first time through? */
 	if (stats == NULL)
@@ -233,9 +237,16 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 				END_CRIT_SECTION();
 			}
 
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
+				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			{
+				/* This page is a candidate to be deleted. Remember its parent to rescan it later with xlock */
+				rescanList = lappend_int(rescanList, recentParent);
+			}
 		}
 		else
 		{
+			recentParent = stack->blkno;
 			/* check for split proceeded after look at parent */
 			pushStackIfSplited(page, stack);
 
@@ -270,5 +281,108 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		vacuum_delay_point();
 	}
 
+	/* rescan inner pages that had empty child pages */
+	foreach(cell,rescanList)
+	{
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber i,
+					maxoff;
+		IndexTuple	idxtuple;
+		ItemId		iid;
+		OffsetNumber todelete[MaxOffsetNumber];
+		Buffer		buftodelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+									RBM_NORMAL, info->strategy);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistcheckpage(rel, buffer);
+		page = (Page) BufferGetPage(buffer);
+
+		Assert(!GistPageIsLeaf(page));
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		{
+			Buffer		leafBuffer;
+			Page		leafPage;
+
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+								RBM_NORMAL, info->strategy);
+			LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafBuffer);
+			leafPage = (Page) BufferGetPage(leafBuffer);
+			Assert(GistPageIsLeaf(leafPage));
+
+			if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* The leaf is completely empty */
+				&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No incomplete split */
+				&& ntodelete < maxoff-1) /* We must keep at least one leaf page per internal page */
+			{
+				buftodelete[ntodelete] = leafBuffer;
+				todelete[ntodelete++] = i;
+			}
+			else
+				UnlockReleaseBuffer(leafBuffer);
+		}
+
+
+		if (ntodelete)
+		{
+			/*
+			 * Like in _bt_unlink_halfdead_page we need an upper bound on the
+			 * xids that could hold downlinks to this page. We use
+			 * ReadNewTransactionId() instead of GetCurrentTransactionId()
+			 * since we are in a VACUUM.
+			 */
+			TransactionId txid = ReadNewTransactionId();
+
+			START_CRIT_SECTION();
+
+			/* Mark pages as deleted dropping references from internal pages */
+			for (i = 0; i < ntodelete; i++)
+			{
+				Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+
+				GistPageSetDeleteXid(leafPage,txid);
+
+				GistPageSetDeleted(leafPage);
+				MarkBufferDirty(buftodelete[i]);
+				stats->pages_deleted++;
+
+				MarkBufferDirty(buffer);
+				/* Offsets shift as we delete tuples from the internal page */
+				PageIndexTupleDelete(page, todelete[i] - i);
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr recptr =
+						gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+						GistPageGetDeleteXid(leafPage), buffer, todelete[i] - i);
+					PageSetLSN(page, recptr);
+					PageSetLSN(leafPage, recptr);
+				}
+				else
+				{
+					PageSetLSN(page, gistGetFakeLSN(rel));
+					PageSetLSN(leafPage, gistGetFakeLSN(rel));
+				}
+
+				UnlockReleaseBuffer(buftodelete[i]);
+			}
+			END_CRIT_SECTION();
+		}
+
+		UnlockReleaseBuffer(buffer);
+
+		vacuum_delay_point();
+	}
+
+	list_free(rescanList);
+
 	return stats;
-}
+}
\ No newline at end of file
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..a48a8341c0 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->id);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+
+		UnlockReleaseBuffer(buffer);
+	}
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,8 +145,8 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
-			if (GistPageIsLeaf(page))
-				GistMarkTuplesDeleted(page);
+
+			GistMarkTuplesDeleted(page);
 		}
 
 		/* Add new tuples if any */
@@ -324,6 +357,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +478,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * downlink from internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.id = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	/* emit the record */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 827566dc6e..d2b1edfc81 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -63,7 +63,14 @@ typedef struct GISTPageOpaqueData
 	uint16		gist_page_id;	/* for identification of GiST indexes */
 } GISTPageOpaqueData;
 
+typedef struct GISTDeletedPageOpaqueData
+{
+	PageGistNSN nsn;			/* this value must change on page split */
+	TransactionId deleteXid;	/* xid recorded when VACUUM deleted the page */
+} GISTDeletedPageOpaqueData;
+
 typedef GISTPageOpaqueData *GISTPageOpaque;
+typedef GISTDeletedPageOpaqueData *GISTDeletedPageOpaque;
 
 /*
  * The page ID is for the convenience of pg_filedump and similar utilities,
@@ -151,6 +158,10 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+#define GistPageGetDeleteOpaque(page) ( (GISTDeletedPageOpaque) PageGetSpecialPointer(page) )
+#define GistPageGetDeleteXid(page) ( GistPageGetDeleteOpaque(page)->deleteXid)
+#define GistPageSetDeleteXid(page, val) ( GistPageGetDeleteOpaque(page)->deleteXid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..1f82695b1d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..f93024ab25 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,19 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId id;
+   OffsetNumber downlinkOffset; /* Offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+   BlockNumber blkno;
+   int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.15.2 (Apple Git-101.1)

#13Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#12)
Re: GiST VACUUM

On 13/07/18 21:28, Andrey Borodin wrote:

On 13 July 2018, at 18:25, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

Looking at the second patch, to scan the GiST index in physical
order, that seems totally unsafe, if there are any concurrent page
splits. In the logical scan, pushStackIfSplited() deals with that,
by comparing the page's NSN with the parent's LSN. But I don't see
anything like that in the physical scan code.

A leaf page can be pointed to by an internal page and a rightlink
simultaneously. The purpose of the NSN is to visit this page exactly once
by following only one of the two links during a scan. This is achieved
naturally if we read everything from the beginning to the end. (That
is how I understand it; I can be wrong.)

The scenario where this fails goes like this:

1. Vacuum scans physical pages 1-10
2. A concurrent insertion splits page 15. The new left half stays on
page 15, but the new right half goes to page 5
3. Vacuum scans pages 11-20

Now, if there were any dead tuples on the right half of the split, moved
to page 5, the vacuum would miss them.

The way this is handled in B-tree is that when a page is split, the page
is stamped with current "vacuum cycle id". When the vacuum scan
encounters a page with the current cycle id, whose right-link points to
a lower-numbered page, it immediately follows the right link, and
re-scans it. I.e. in the above example, if it was a B-tree, in step 3
when vacuum scans page 15, it would see that it was concurrently split.
It would immediately vacuum page 5 again, before continuing the scan in
physical order.

We'll need to do something similar in GiST.
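
To make the scenario concrete, here is a rough sketch in C-like pseudocode of a physical-order scan with such a revisit step. All helper names here are illustrative, not from any patch in this thread:

/*
 * Sketch only: scan blocks in physical order; if a page was split after
 * the scan started and its right sibling lives at a lower block number
 * (one we have already passed), vacuum the sibling again before moving
 * on.  was_split_after() stands in for the AM's split detection: B-tree
 * compares the vacuum cycle id, GiST could compare the page NSN against
 * an LSN taken at the start of the scan.
 */
for (blkno = 0; blkno < npages; blkno++)
{
	Page		page = read_and_lock(blkno);
	BlockNumber rlink = right_link(page);

	vacuum_leaf(page);

	while (was_split_after(page, scan_start) &&
		   rlink != InvalidBlockNumber && rlink < blkno)
	{
		/* the right half went to an already-scanned block: revisit it */
		page = read_and_lock(rlink);
		vacuum_leaf(page);
		rlink = right_link(page);
	}
}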

- Heikki

#14Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#13)
Re: GiST VACUUM

On 14 July 2018, at 0:28, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 13/07/18 21:28, Andrey Borodin wrote:

On 13 July 2018, at 18:25, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:
Looking at the second patch, to scan the GiST index in physical
order, that seems totally unsafe, if there are any concurrent page
splits. In the logical scan, pushStackIfSplited() deals with that,
by comparing the page's NSN with the parent's LSN. But I don't see
anything like that in the physical scan code.

A leaf page can be pointed to by an internal page and a rightlink
simultaneously. The purpose of the NSN is to visit this page exactly once
by following only one of the two links during a scan. This is achieved
naturally if we read everything from the beginning to the end. (That
is how I understand it; I can be wrong.)

The scenario where this fails goes like this:

1. Vacuum scans physical pages 1-10
2. A concurrent insertion splits page 15. The new left half stays on page 15, but the new right half goes to page 5
3. Vacuum scans pages 11-20

Now, if there were any dead tuples on the right half of the split, moved to page 5, the vacuum would miss them.

The way this is handled in B-tree is that when a page is split, the page is stamped with current "vacuum cycle id". When the vacuum scan encounters a page with the current cycle id, whose right-link points to a lower-numbered page, it immediately follows the right link, and re-scans it. I.e. in the above example, if it was a B-tree, in step 3 when vacuum scans page 15, it would see that it was concurrently split. It would immediately vacuum page 5 again, before continuing the scan in physical order.

We'll need to do something similar in GiST.

OK, I will do this.

This is a tradeoff between a complex concurrency feature and the possibility of a few dead tuples being left behind after VACUUM. I want to understand: is there anything dangerous about these dead tuples?
There is one more serious race condition: the result of the first scan is just a hint about where to look for downlinks to empty pages. If an internal page splits between the scan and the cleanup, the offsets of the downlinks will have changed; the cleanup will lock the pages, see non-empty pages, and will not delete them (in that case there are no dead tuples, just empty leaf pages left undeleted).
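
To be clear about why stale hints are merely a missed optimization and not a correctness problem: the cleanup pass treats the scan result purely as a hint and re-checks everything under an exclusive lock before deleting anything, along these lines (condensed sketch, not verbatim from the patch; read_child() abbreviates the ReadBufferExtended() + LockBuffer() sequence):

foreach(cell, emptyLeafOffsets)
{
	OffsetNumber i = (OffsetNumber) lfirst_int(cell);

	if (i > PageGetMaxOffsetNumber(page))
		continue;			/* internal page changed, hint is stale */

	leafBuffer = read_child(page, i);	/* follow the downlink */
	leafPage = BufferGetPage(leafBuffer);

	if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber)
		todelete[ntodelete++] = i;	/* still empty: safe to delete */
	else
		UnlockReleaseBuffer(leafBuffer);	/* in use again, skip it */
}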

Best regards, Andrey Borodin.

#15Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#14)
Re: GiST VACUUM

On 14/07/18 10:26, Andrey Borodin wrote:

This is a tradeoff between a complex concurrency feature and the
possibility of a few dead tuples being left behind after VACUUM. I want
to understand: is there anything dangerous about these dead tuples?

Yeah, that's bad. When a new heap tuple is inserted in the same
location, the old index tuple will point to the new, unrelated tuple,
and you will get incorrect query results.

There is one more serious race condition: the result of the first scan
is just a hint about where to look for downlinks to empty pages. If an
internal page splits between the scan and the cleanup, the offsets of
the downlinks will have changed; the cleanup will lock the pages, see
non-empty pages, and will not delete them (in that case there are no
dead tuples, just empty leaf pages left undeleted).

That's fine. Leaving behind a few empty pages is harmless; the next
vacuum will pick them up.

- Heikki

#16Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#15)
2 attachment(s)
Re: GiST VACUUM

On 14 July 2018, at 14:39, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 14/07/18 10:26, Andrey Borodin wrote:

This is a tradeoff between a complex concurrency feature and the
possibility of a few dead tuples being left behind after VACUUM. I want
to understand: is there anything dangerous about these dead tuples?

Yeah, that's bad. When a new heap tuple is inserted in the same location, the old index tuple will point to the new, unrelated tuple, and you will get incorrect query results.

PFA v8: at the beginning of the physical scan I grab GetInsertRecPtr(), and if a leaf page has a rightlink pointing to an earlier block in the file and an NSN higher than the LSN captured at the start of the scan, we go back and scan that page one more time. This is recursive.
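
In the attached patch this revisit condition looks like the following excerpt from gistbulkdeletephysicalcanpage() in patch 0002 (startNSN is the GetInsertRecPtr() value captured at the start of the scan):

if ((GistFollowRight(page) || startNSN < GistPageGetNSN(page)) &&
	(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < blkno))
{
	gistbulkdeletephysicalcanpage(info, stats, callback, callback_state,
								  opaque->rightlink, startNSN, graph);
}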

I'm still not very comfortable with storing the deletion xid in the opaque data: we reuse the rightlink field under very specific conditions instead of using the totally unused pd_prune_xid. We have a remark in bufpage.h:
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
Or maybe we should use some other place on those empty 8Kb pages.

Deleted pages do not have a rightlink now, but in B-trees the rightlink on deleted pages is actively used.
We also cannot reuse the NSN: the rightlink is useless without the NSN anyway. We cannot reuse the flags: the page is marked deleted in the flags, after all. And uint16 gist_page_id is just not big enough.

Best regards, Andrey Borodin.

Attachments:

0001-Delete-pages-during-GiST-VACUUM-v8.patchapplication/octet-stream; name=0001-Delete-pages-during-GiST-VACUUM-v8.patch; x-unix-mode=0644Download
From a4c39cbd97127bdf89dca625af222f88764e5d6b Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 13 Jul 2018 22:12:17 +0400
Subject: [PATCH 1/2] Delete pages during GiST VACUUM v8

---
 src/backend/access/gist/README       |  35 ++++++++++
 src/backend/access/gist/gist.c       |  18 ++++++
 src/backend/access/gist/gistbuild.c  |   5 --
 src/backend/access/gist/gistutil.c   |   3 +-
 src/backend/access/gist/gistvacuum.c | 122 +++++++++++++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c   |  63 +++++++++++++++++-
 src/include/access/gist.h            |   9 ++-
 src/include/access/gist_private.h    |  24 +++++--
 src/include/access/gistxlog.h        |  17 ++++-
 src/test/regress/expected/gist.out   |   4 +-
 src/test/regress/sql/gist.sql        |   4 +-
 11 files changed, 277 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..9548872be8 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,41 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused to allocate newly split pages. To find these pages
+the function can choose between two strategies: logical scan or physical scan.
+
+The physical scan reads the entire index from the first page to the last. This
+scan maintains the graph structure in a palloc'ed array to collect the block
+numbers of internal pages that need to be cleansed of references to empty leaf
+pages. The array also contains the offsets, on each internal page, of downlinks
+to potentially free leaf pages. This scan method is chosen when maintenance
+work memory is sufficient to hold the necessary graph structure.
+
+The logical scan is chosen when there is not enough maintenance memory to
+execute the physical scan. The logical scan traverses the GiST index in DFS
+order, following incomplete split branches. The logical scan can be slower on
+hard disk drives.
+
+The result of both scans is the same: a stack of block numbers of internal
+pages, each with a list of offsets potentially referencing empty leaf pages.
+After the scan, each internal page is taken under exclusive lock and each
+potentially free leaf page is examined. gistbulkdelete() never deletes the last
+reference on an internal page, to preserve the balanced tree properties.
+
+The physical scan can return empty leaf page offsets in arbitrary order. Thus,
+before executing PageIndexMultiDelete, the offsets (already locked and checked)
+are sorted. This step is not necessary for the logical scan.
+
+Both scans hold only one lock at a time. The physical scan grabs an exclusive
+lock immediately, while the logical scan takes a shared lock and then swaps it
+for an exclusive one. This is done because the physical scan does less work per
+internal page, and the number of internal pages is low compared to the number
+of leaf pages.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..5297e1691d 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum, so we
+			 * do not need to check here whether the page is deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -834,6 +839,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until their
+			 * space is reused. The downlink to this page may already have been
+			 * removed from the internal page, but this scan can still possess it.
+			 */
+			if (GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 12804c321c..41978bb5e5 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -806,7 +807,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..a96d91eb1d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,10 +16,13 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
+#include "utils/snapmgr.h"
+#include "access/xact.h"
 
 
 /*
@@ -125,7 +128,6 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 	}
 }
 
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -135,12 +137,14 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
 {
 	Relation	rel = info->index;
 	GistBDItem *stack,
 			   *ptr;
+	BlockNumber recentParent = InvalidBlockNumber;
+	List	   *rescanList = NULL;
+	ListCell   *cell;
 
 	/* first time through? */
 	if (stats == NULL)
@@ -233,9 +237,16 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 				END_CRIT_SECTION();
 			}
 
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
+				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			{
+				/* This page is a candidate to be deleted. Remember its parent to rescan it later with an xlock. */
+				rescanList = lappend_int(rescanList, recentParent);
+			}
 		}
 		else
 		{
+			recentParent = stack->blkno;
 			/* check for split proceeded after look at parent */
 			pushStackIfSplited(page, stack);
 
@@ -270,5 +281,108 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		vacuum_delay_point();
 	}
 
+	/* rescan inner pages that had empty child pages */
+	foreach(cell,rescanList)
+	{
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber i,
+					maxoff;
+		IndexTuple	idxtuple;
+		ItemId		iid;
+		OffsetNumber todelete[MaxOffsetNumber];
+		Buffer		buftodelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+									RBM_NORMAL, info->strategy);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistcheckpage(rel, buffer);
+		page = (Page) BufferGetPage(buffer);
+
+		Assert(!GistPageIsLeaf(page));
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		{
+			Buffer		leafBuffer;
+			Page		leafPage;
+
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+								RBM_NORMAL, info->strategy);
+			LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafBuffer);
+			leafPage = (Page) BufferGetPage(leafBuffer);
+			Assert(GistPageIsLeaf(leafPage));
+
+			if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* Page is empty */
+				&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
+				&& ntodelete < maxoff-1) /* Keep at least one downlink per internal page */
+			{
+				buftodelete[ntodelete] = leafBuffer;
+				todelete[ntodelete++] = i;
+			}
+			else
+				UnlockReleaseBuffer(leafBuffer);
+		}
+
+
+		if (ntodelete)
+		{
+			/*
+			 * Like in _bt_unlink_halfdead_page we need an upper bound on the
+			 * xid of transactions that could still hold downlinks to this page.
+			 * We use ReadNewTransactionId() instead of GetCurrentTransactionId()
+			 * since we are in a VACUUM.
+			 */
+			TransactionId txid = ReadNewTransactionId();
+
+			START_CRIT_SECTION();
+
+			/* Mark pages as deleted dropping references from internal pages */
+			for (i = 0; i < ntodelete; i++)
+			{
+				Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+
+				GistPageSetDeleteXid(leafPage,txid);
+
+				GistPageSetDeleted(leafPage);
+				MarkBufferDirty(buftodelete[i]);
+				stats->pages_deleted++;
+
+				MarkBufferDirty(buffer);
+				/* Offsets shift as we delete tuples from the internal page */
+				PageIndexTupleDelete(page, todelete[i] - i);
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr recptr =
+						gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+						GistPageGetDeleteXid(leafPage), buffer, todelete[i] - i);
+					PageSetLSN(page, recptr);
+					PageSetLSN(leafPage, recptr);
+				}
+				else
+				{
+					PageSetLSN(page, gistGetFakeLSN(rel));
+					PageSetLSN(leafPage, gistGetFakeLSN(rel));
+				}
+
+				UnlockReleaseBuffer(buftodelete[i]);
+			}
+			END_CRIT_SECTION();
+		}
+
+		UnlockReleaseBuffer(buffer);
+
+		vacuum_delay_point();
+	}
+
+	list_free(rescanList);
+
 	return stats;
-}
+}
\ No newline at end of file
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..a48a8341c0 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->id);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+
+		UnlockReleaseBuffer(buffer);
+	}
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,8 +145,8 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
-			if (GistPageIsLeaf(page))
-				GistMarkTuplesDeleted(page);
+
+			GistMarkTuplesDeleted(page);
 		}
 
 		/* Add new tuples if any */
@@ -324,6 +357,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +478,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * downlink from internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.id = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	/* emit the record */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 827566dc6e..8ca78a95ab 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -58,7 +58,11 @@ typedef PageXLogRecPtr PageGistNSN;
 typedef struct GISTPageOpaqueData
 {
 	PageGistNSN nsn;			/* this value must change on page split */
-	BlockNumber rightlink;		/* next page if any */
+	union
+	{
+		BlockNumber rightlink;		/* next page if any */
+		TransactionId deleteXid;	/* xid recorded when VACUUM deleted the page */
+	};
 	uint16		flags;			/* see bit definitions above */
 	uint16		gist_page_id;	/* for identification of GiST indexes */
 } GISTPageOpaqueData;
@@ -151,6 +155,9 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+#define GistPageGetDeleteXid(page) ( GistPageGetOpaque(page)->deleteXid)
+#define GistPageSetDeleteXid(page, val) ( GistPageGetOpaque(page)->deleteXid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..1f82695b1d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..f93024ab25 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,19 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId id;
+   OffsetNumber downlinkOffset; /* Offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+   BlockNumber blkno;
+   int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.15.2 (Apple Git-101.1)

0002-Physical-GiST-scan-during-VACUUM-v8.patchapplication/octet-stream; name=0002-Physical-GiST-scan-during-VACUUM-v8.patch; x-unix-mode=0644Download
From 170725dde85232d0e8beb85c661526ecb71b4608 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 13 Jul 2018 22:19:57 +0400
Subject: [PATCH 2/2] Physical GiST scan during VACUUM v8

---
 src/backend/access/gist/gistvacuum.c | 366 +++++++++++++++++++++++++++++++----
 1 file changed, 326 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a96d91eb1d..bb3fc7d00e 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -103,8 +103,9 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 typedef struct GistBDItem
 {
-	GistNSN		parentlsn;
-	BlockNumber blkno;
+	GistNSN		 parentlsn;
+	BlockNumber  blkno;
+	OffsetNumber parentoffset;
 	struct GistBDItem *next;
 } GistBDItem;
 
@@ -129,30 +130,232 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 }
 
 /*
- * Bulk deletion of all index entries pointing to a set of heap tuples and
- * check invalid tuples left after upgrade.
- * The set of target tuples is specified via a callback routine that tells
- * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ * During the physical scan, for every parent-child pair we can find either
+ * the parent or the child first. Every time we open an internal page, we
+ * record the parent block number for every child and set GIST_PS_HAS_PARENT.
+ * When the scan gets to a child page that turns out to be empty, we go back
+ * via the parent link. If we find the child first (still without a parent
+ * link), we mark the page GIST_PS_EMPTY_LEAF if it is ready to be deleted;
+ * when we later scan its parent, we pick the page up for the rescan list.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+#define GIST_PS_HAS_PARENT 1
+#define GIST_PS_EMPTY_LEAF 2
+
+
+/* Physical scan item */
+typedef struct GistPSItem
 {
-	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
-	BlockNumber recentParent = InvalidBlockNumber;
-	List	   *rescanList = NULL;
-	ListCell   *cell;
+	BlockNumber  parent;
+	List*        emptyLeafOffsets;
+	OffsetNumber parentOffset;
+	uint16       flags;
+} GistPSItem;
+
+/* Block number of an internal page with offsets to rescan for deletion */
+typedef struct GistRescanItem
+{
+	BlockNumber       blkno;
+	List*             emptyLeafOffsets;
+	struct GistRescanItem* next;
+} GistRescanItem;
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
+static void
+gistbulkdeletephysicalcanpage(IndexVacuumInfo * info, IndexBulkDeleteResult * stats,
+								IndexBulkDeleteCallback callback, void* callback_state,
+								BlockNumber blkno, GistNSN startNSN, GistPSItem *graph)
+{
+	Relation	 rel = info->index;
+	Buffer		 buffer;
+	Page		 page;
+	OffsetNumber i,
+					maxoff;
+	IndexTuple   idxtuple;
+	ItemId	     iid;
+
+	/*
+	 * This is a recursive call; it should almost never be deeper than
+	 * GIST_MAX_SPLIT_PAGES, but check anyway.
+	 */
+	check_stack_depth();
+
+	vacuum_delay_point();
+
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
+	/*
+	 * We are not going to stay here long, even though we may recurse,
+	 * especially for an internal page. So, aggressively grab an exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
+
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		/* TODO: Shouldn't we record the free page here? */
+		return;
+	}
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+
+		/*
+		 * If this page was split after the start of the VACUUM, we have to
+		 * revisit the rightlink if it points to a block we already scanned.
+		 * This revisit is recursive and should not be deep, but we check
+		 * for the possibility of stack overflow anyway.
+		 */
+		if ((GistFollowRight(page) || startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < blkno))
+			{
+				gistbulkdeletephysicalcanpage(info, stats, callback, callback_state, opaque->rightlink, startNSN, graph);
+			}
+
+		/*
+		 * Remove deletable tuples from page
+		 */
+
+		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+		{
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			if (callback(&(idxtuple->t_tid), callback_state))
+				todelete[ntodelete++] = i;
+			else
+				stats->num_index_tuples += 1;
+		}
+
+		stats->tuples_removed += ntodelete;
+
+		/* We have dead tuples on the page */
+		if (ntodelete)
+		{
+			START_CRIT_SECTION();
+
+			MarkBufferDirty(buffer);
+
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
+
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
+			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
+
+			END_CRIT_SECTION();
+		}
+
+		/* The page is completely empty */
+		if (ntodelete == maxoff)
+		{
+			/* This page is a candidate to be deleted. Remember its parent to rescan it later with an xlock. */
+			if (graph[blkno].flags & GIST_PS_HAS_PARENT)
+			{
+				/* Go to parent and append myself */
+				BlockNumber parentblockno = graph[blkno].parent;
+				graph[parentblockno].emptyLeafOffsets = lappend_int(graph[parentblockno].emptyLeafOffsets, (int)graph[blkno].parentOffset);
+			}
+			else
+			{
+				/* Parent will collect me later */
+				graph[blkno].flags |= GIST_PS_EMPTY_LEAF;
+			}
+		}
+	}
+	else
+	{
+		/* For internal pages we remember the structure of the tree */
+		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+		{
+			BlockNumber childblkno;
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+			childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+
+			if (graph[childblkno].flags & GIST_PS_EMPTY_LEAF)
+			{
+				/* Child has been scanned earlier and is ready to be picked up */
+				graph[blkno].emptyLeafOffsets = lappend_int(graph[blkno].emptyLeafOffsets, i);
+			}
+			else
+			{
+				/* The leaf will be collected when the scan reaches it */
+				graph[childblkno].parent = blkno;
+				graph[childblkno].parentOffset = i;
+				graph[childblkno].flags |= GIST_PS_HAS_PARENT;
+			}
 
+
+			if (GistTupleIsInvalid(idxtuple))
+				ereport(LOG,
+						(errmsg("index \"%s\" contains an inner tuple marked as invalid",
+								RelationGetRelationName(rel)),
+							errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
+							errhint("Please REINDEX it.")));
+		}
+	}
+	UnlockReleaseBuffer(buffer);
+}
+
+/* Read all pages sequentially, populating the array of GistPSItem */
+static GistRescanItem*
+gistbulkdeletephysicalcan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state, BlockNumber npages)
+{
+	GistRescanItem *result = NULL;
+	BlockNumber      blkno;
+	GistNSN			 startNSN = GetInsertRecPtr();
+
+	/* Here we will store whole graph of the index */
+	GistPSItem *graph = palloc0(npages * sizeof(GistPSItem));
+
+
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		gistbulkdeletephysicalcanpage(info, stats, callback, callback_state, blkno, startNSN, graph);
+	}
+
+	/* Search for internal pages pointing to empty leafs */
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		if (graph[blkno].emptyLeafOffsets)
+		{
+			GistRescanItem *next = palloc(sizeof(GistRescanItem));
+			next->blkno = blkno;
+			next->emptyLeafOffsets = graph[blkno].emptyLeafOffsets;
+			next->next = result;
+			result = next;
+		}
+	}
+
+	pfree(graph);
+
+	return result;
+}
+
+/* Logical scan descends from the root to the leaves in a DFS */
+static GistRescanItem*
+gistbulkdeletelogicalscan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation        rel = info->index;
+	BlockNumber     recentParent = InvalidBlockNumber;
+	GistBDItem     *stack,
+				   *ptr;
+	GistRescanItem *result = NULL;
+
+	/* This stack is used to organize DFS */
 	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
 	stack->blkno = GIST_ROOT_BLKNO;
 
@@ -237,11 +440,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				END_CRIT_SECTION();
 			}
 
-			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
-				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber)
 			{
 				/* This page is a candidate to be deleted. Remember its parent to rescan it later with an xlock. */
-				rescanList = lappend_int(rescanList, recentParent);
+				if (result == NULL || result->blkno != recentParent)
+				{
+					GistRescanItem *next = palloc(sizeof(GistRescanItem));
+					next->blkno = recentParent;
+					next->emptyLeafOffsets = NULL;
+					next->next = result;
+					result = next;
+				}
+				result->emptyLeafOffsets = lappend_int(result->emptyLeafOffsets, stack->parentoffset);
 			}
 		}
 		else
@@ -261,6 +471,7 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 				ptr->parentlsn = BufferGetLSNAtomic(buffer);
 				ptr->next = stack->next;
+				ptr->parentoffset = i;
 				stack->next = ptr;
 
 				if (GistTupleIsInvalid(idxtuple))
@@ -281,20 +492,82 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		vacuum_delay_point();
 	}
 
-	/* rescan inner pages that had empty child pages */
-	foreach(cell,rescanList)
+	return result;
+}
+
+/*
+ * This function is used to sort offsets.
+ * When employing the physical scan, rescan offsets are not ordered.
+ */
+static int
+compare_offsetnumber(const void *x, const void *y)
+{
+	OffsetNumber a = *((OffsetNumber *)x);
+	OffsetNumber b = *((OffsetNumber *)y);
+	return a - b;
+}
+
+/*
+ * Bulk deletion of all index entries pointing to a set of heap tuples and
+ * check invalid tuples left after upgrade.
+ * The set of target tuples is specified via a callback routine that tells
+ * whether any given heap tuple (identified by ItemPointer) is being deleted.
+ *
+ * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ */
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation		rel = info->index;
+	GistRescanItem *rescan;
+	BlockNumber		npages;
+	bool			needLock;
+
+	/* first time through? */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	/* we'll re-count the tuples each time */
+	stats->estimated_count = false;
+	stats->num_index_tuples = 0;
+
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/* try to find deleted pages */
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	npages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	/* If the map of the whole graph does not fit into maintenance_work_mem, fall back to the logical scan; otherwise read the whole index sequentially */
+	if (npages * (sizeof(GistPSItem)) > maintenance_work_mem * 1024)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-		OffsetNumber todelete[MaxOffsetNumber];
-		Buffer		buftodelete[MaxOffsetNumber];
-		int			ntodelete = 0;
+		rescan = gistbulkdeletelogicalscan(info, stats, callback, callback_state);
+	}
+	else
+	{
+		rescan = gistbulkdeletephysicalcan(info, stats, callback, callback_state, npages);
+	}
 
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+	/* rescan inner pages that had empty child pages */
+	while (rescan)
+	{
+		Buffer			 buffer;
+		Page			 page;
+		OffsetNumber 	 i,
+						 maxoff;
+		IndexTuple		 idxtuple;
+		ItemId			 iid;
+		OffsetNumber 	 todelete[MaxOffsetNumber];
+		Buffer			 buftodelete[MaxOffsetNumber];
+		int				 ntodelete = 0;
+		ListCell  		*cell;
+		GistRescanItem	*oldRescan;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, rescan->blkno,
 									RBM_NORMAL, info->strategy);
 		LockBuffer(buffer, GIST_EXCLUSIVE);
 		gistcheckpage(rel, buffer);
@@ -304,11 +577,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		maxoff = PageGetMaxOffsetNumber(page);
 
-		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		/* Check that the leaf pages are still empty and decide what to delete */
+		foreach(cell, rescan->emptyLeafOffsets)
 		{
 			Buffer		leafBuffer;
 			Page		leafPage;
 
+			i = (OffsetNumber)lfirst_int(cell);
+			if (i > maxoff)
+			{
+				continue;
+			}
+
 			iid = PageGetItemId(page, i);
 			idxtuple = (IndexTuple) PageGetItem(page, iid);
 
@@ -333,7 +613,10 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		if (ntodelete)
 		{
-			/*
+			/* Sort the possibly unordered offsets */
+			qsort(todelete, ntodelete, sizeof(OffsetNumber), compare_offsetnumber);
+
+			/*
 			 * Like in _bt_unlink_halfdead_page we need an upper bound on the
 			 * xid of transactions that could still hold downlinks to this page.
 			 * We use ReadNewTransactionId() instead of GetCurrentTransactionId()
@@ -378,11 +661,14 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		}
 
 		UnlockReleaseBuffer(buffer);
+		oldRescan = rescan;
+		rescan = rescan->next;
+		list_free(oldRescan->emptyLeafOffsets);
+		pfree(oldRescan);
 
 		vacuum_delay_point();
 	}
 
-	list_free(rescanList);
 
 	return stats;
 }
\ No newline at end of file
-- 
2.15.2 (Apple Git-101.1)

#17Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: GiST VACUUM

On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm still a bit scared about using pd_prune_xid to store the XID that
prevents recycling the page too early. Can we use some field in
GISTPageOpaqueData for that, similar to how the B-tree stores it in
BTPageOpaqueData?

What's your reason for being scared? It seems to me that the
alternatives being proposed (altering the size of the special space,
or sometimes repurposing a field for some other purpose) have their
own associated scariness.

If I had my druthers, I'd nuke pd_prune_xid from orbit -- it's a piece
of seemingly heap-specific data that is kept in the generic page
header rather than in the heap's special space. Other AMs like btree
or zheap may have different needs; one size does not fit all. That
said, since getting rid of pd_prune_xid seems impractical, maybe it
makes more sense to reuse it than to insist on leaving it idle and
consuming space elsewhere in the page.
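
For context, pd_prune_xid sits in the generic page header that every AM shares (abridged from bufpage.h):

typedef struct PageHeaderData
{
	PageXLogRecPtr pd_lsn;		/* LSN: next byte after last byte of xlog
								 * record for last change to this page */
	uint16		pd_checksum;	/* checksum */
	uint16		pd_flags;		/* flag bits */
	LocationIndex pd_lower;		/* offset to start of free space */
	LocationIndex pd_upper;		/* offset to end of free space */
	LocationIndex pd_special;	/* offset to start of special space */
	uint16		pd_pagesize_version;
	TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
	ItemIdData	pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;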

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#18Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Robert Haas (#17)
2 attachment(s)
Re: GiST VACUUM

On 16 July 2018, at 18:58, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Jul 13, 2018 at 10:10 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I'm still a bit scared about using pd_prune_xid to store the XID that
prevents recycling the page too early. Can we use some field in
GISTPageOpaqueData for that, similar to how the B-tree stores it in
BTPageOpaqueData?

What's your reason for being scared? It seems to me that the
alternatives being proposed (altering the size of the special space,
or sometimes repurposing a field for some other purpose) have their
own associated scariness.

Thanks, that's exactly what I was thinking about regarding where to store this xid.

Here's v9 of the patch. It uses pd_prune_xid, but access is abstracted behind GistPageGetDeleteXid() / GistPageSetDeleteXid() so that we can easily change the way we store it.
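
With pd_prune_xid the accessors can be as simple as the following sketch (the exact v9 definitions live in the 0001 patch and may differ in detail; pd_prune_xid is already a TransactionId in PageHeaderData):

/* Sketch only: keep the deletion XID in the page header's pd_prune_xid,
 * which is unused in index pages, behind the same accessor names. */
#define GistPageGetDeleteXid(page) \
	(((PageHeader) (page))->pd_prune_xid)
#define GistPageSetDeleteXid(page, val) \
	(((PageHeader) (page))->pd_prune_xid = (val))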

Best regards, Andrey Borodin.

Attachments:

0002-Physical-GiST-scan-during-VACUUM-v9.patchapplication/octet-stream; name=0002-Physical-GiST-scan-during-VACUUM-v9.patch; x-unix-mode=0644Download
From fb560260cc0c2ad02fd64842b18ff45f1f192e19 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 13 Jul 2018 22:19:57 +0400
Subject: [PATCH 2/2] Physical GiST scan during VACUUM v9

---
 src/backend/access/gist/gistvacuum.c | 366 +++++++++++++++++++++++++++++++----
 1 file changed, 326 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index a96d91eb1d..bb3fc7d00e 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -103,8 +103,9 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 typedef struct GistBDItem
 {
-	GistNSN		parentlsn;
-	BlockNumber blkno;
+	GistNSN		 parentlsn;
+	BlockNumber  blkno;
+	OffsetNumber parentoffset;
 	struct GistBDItem *next;
 } GistBDItem;
 
@@ -129,30 +130,232 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 }
 
 /*
- * Bulk deletion of all index entries pointing to a set of heap tuples and
- * check invalid tuples left after upgrade.
- * The set of target tuples is specified via a callback routine that tells
- * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ * During the physical scan, for every parent-child pair we can find either
+ * the parent or the child first. Every time we open an internal page, we
+ * record the parent block number for every child and set GIST_PS_HAS_PARENT.
+ * When the scan gets to a child page that turns out to be empty, we go back
+ * via the parent link. If we find the child first (still without a parent
+ * link), we mark the page GIST_PS_EMPTY_LEAF if it is ready to be deleted;
+ * when we later scan its parent, we pick the page up for the rescan list.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+#define GIST_PS_HAS_PARENT 1
+#define GIST_PS_EMPTY_LEAF 2
+
+
+/* Physical scan item */
+typedef struct GistPSItem
 {
-	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
-	BlockNumber recentParent = InvalidBlockNumber;
-	List	   *rescanList = NULL;
-	ListCell   *cell;
+	BlockNumber  parent;
+	List*        emptyLeafOffsets;
+	OffsetNumber parentOffset;
+	uint16       flags;
+} GistPSItem;
+
+/* Block number of an internal page with offsets to rescan for deletion */
+typedef struct GistRescanItem
+{
+	BlockNumber       blkno;
+	List*             emptyLeafOffsets;
+	struct GistRescanItem* next;
+} GistRescanItem;
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
+static void
+gistbulkdeletephysicalcanpage(IndexVacuumInfo * info, IndexBulkDeleteResult * stats,
+								IndexBulkDeleteCallback callback, void* callback_state,
+								BlockNumber blkno, GistNSN startNSN, GistPSItem *graph)
+{
+	Relation	 rel = info->index;
+	Buffer		 buffer;
+	Page		 page;
+	OffsetNumber i,
+					maxoff;
+	IndexTuple   idxtuple;
+	ItemId	     iid;
+
+	/*
+	 * This is a recursive call; it should almost never be deeper than
+	 * GIST_MAX_SPLIT_PAGES, but check anyway.
+	 */
+	check_stack_depth();
+
+	vacuum_delay_point();
+
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
+	/*
+	 * We are not going to stay here long, even though we may recurse,
+	 * especially for an internal page. So, aggressively grab an exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
+
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		/* TODO: Shouldn't we record the free page here? */
+		return;
+	}
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+
+		/*
+		 * If this page was split after the start of the VACUUM, we have to
+		 * revisit the rightlink if it points to a block we already scanned.
+		 * This revisit is recursive and should not be deep, but we check
+		 * for the possibility of stack overflow anyway.
+		 */
+		if ((GistFollowRight(page) || startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < blkno))
+			{
+				gistbulkdeletephysicalcanpage(info, stats, callback, callback_state, opaque->rightlink, startNSN, graph);
+			}
+
+		/*
+		 * Remove deletable tuples from page
+		 */
+
+		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+		{
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			if (callback(&(idxtuple->t_tid), callback_state))
+				todelete[ntodelete++] = i;
+			else
+				stats->num_index_tuples += 1;
+		}
+
+		stats->tuples_removed += ntodelete;
+
+		/* We have dead tuples on the page */
+		if (ntodelete)
+		{
+			START_CRIT_SECTION();
+
+			MarkBufferDirty(buffer);
+
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
+
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
+			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
+
+			END_CRIT_SECTION();
+		}
+
+		/* The page is completely empty */
+		if (ntodelete == maxoff)
+		{
+			/* This page is a candidate for deletion. Remember its parent to rescan it later with an xlock */
+			if (graph[blkno].flags & GIST_PS_HAS_PARENT)
+			{
+				/* Go to parent and append myself */
+				BlockNumber parentblockno = graph[blkno].parent;
+				graph[parentblockno].emptyLeafOffsets = lappend_int(graph[parentblockno].emptyLeafOffsets, (int)graph[blkno].parentOffset);
+			}
+			else
+			{
+				/* Parent will collect me later */
+				graph[blkno].flags |= GIST_PS_EMPTY_LEAF;
+			}
+		}
+	}
+	else
+	{
+		/* For internal pages we remember the structure of the tree */
+		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+		{
+			BlockNumber childblkno;
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+			childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+
+			if (graph[childblkno].flags & GIST_PS_EMPTY_LEAF)
+			{
+				/* Child has been scanned earlier and is ready to be picked up */
+				graph[blkno].emptyLeafOffsets = lappend_int(graph[blkno].emptyLeafOffsets, i);
+			}
+			else
+			{
+				/* The leaf will be collected when the scan reaches it */
+				graph[childblkno].parent = blkno;
+				graph[childblkno].parentOffset = i;
+				graph[childblkno].flags |= GIST_PS_HAS_PARENT;
+			}
 
+
+			if (GistTupleIsInvalid(idxtuple))
+				ereport(LOG,
+						(errmsg("index \"%s\" contains an inner tuple marked as invalid",
+								RelationGetRelationName(rel)),
+							errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
+							errhint("Please REINDEX it.")));
+		}
+	}
+	UnlockReleaseBuffer(buffer);
+}
+
+/* Read all pages sequentially populating array of GistPSItem */
+static GistRescanItem*
+gistbulkdeletephysicalcan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state, BlockNumber npages)
+{
+	GistRescanItem *result = NULL;
+	BlockNumber      blkno;
+	GistNSN			 startNSN = GetInsertRecPtr();
+
+	/* Here we will store the whole graph of the index */
+	GistPSItem *graph = palloc0(npages * sizeof(GistPSItem));
+
+
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		gistbulkdeletephysicalcanpage(info, stats, callback, callback_state, blkno, startNSN, graph);
+	}
+
+	/* Search for internal pages pointing to empty leaves */
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		if (graph[blkno].emptyLeafOffsets)
+		{
+			GistRescanItem *next = palloc(sizeof(GistRescanItem));
+			next->blkno = blkno;
+			next->emptyLeafOffsets = graph[blkno].emptyLeafOffsets;
+			next->next = result;
+			result = next;
+		}
+	}
+
+	pfree(graph);
+
+	return result;
+}
+
+/* The logical scan descends from the root to the leaves in DFS order */
+static GistRescanItem*
+gistbulkdeletelogicalscan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation        rel = info->index;
+	BlockNumber     recentParent = InvalidBlockNumber;
+	GistBDItem     *stack,
+				   *ptr;
+	GistRescanItem *result = NULL;
+
+	/* This stack is used to organize DFS */
 	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
 	stack->blkno = GIST_ROOT_BLKNO;
 
@@ -237,11 +440,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				END_CRIT_SECTION();
 			}
 
-			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
-				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber)
 			{
 				/* This page is a candidate for deletion. Remember its parent to rescan it later with an xlock */
-				rescanList = lappend_int(rescanList, recentParent);
+				if (result == NULL || result->blkno != recentParent)
+				{
+					GistRescanItem *next = palloc(sizeof(GistRescanItem));
+					next->blkno = recentParent;
+					next->emptyLeafOffsets = NULL;
+					next->next = result;
+					result = next;
+				}
+				result->emptyLeafOffsets = lappend_int(result->emptyLeafOffsets, stack->parentoffset);
 			}
 		}
 		else
@@ -261,6 +471,7 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 				ptr->parentlsn = BufferGetLSNAtomic(buffer);
 				ptr->next = stack->next;
+				ptr->parentoffset = i;
 				stack->next = ptr;
 
 				if (GistTupleIsInvalid(idxtuple))
@@ -281,20 +492,82 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		vacuum_delay_point();
 	}
 
-	/* rescan inner pages that had empty child pages */
-	foreach(cell,rescanList)
+	return result;
+}
+
+/*
+ * This function is used to sort offsets:
+ * when the physical scan is employed, the rescan offsets are not ordered.
+ */
+static int
+compare_offsetnumber(const void *x, const void *y)
+{
+	OffsetNumber a = *((OffsetNumber *)x);
+	OffsetNumber b = *((OffsetNumber *)y);
+	return a - b;
+}
+
+/*
+ * Bulk deletion of all index entries pointing to a set of heap tuples and
+ * a check for invalid tuples left after upgrade.
+ * The set of target tuples is specified via a callback routine that tells
+ * whether any given heap tuple (identified by ItemPointer) is being deleted.
+ *
+ * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ */
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation		rel = info->index;
+	GistRescanItem *rescan;
+	BlockNumber		npages;
+	bool			needLock;
+
+	/* first time through? */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	/* we'll re-count the tuples each time */
+	stats->estimated_count = false;
+	stats->num_index_tuples = 0;
+
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/* try to find deleted pages */
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	npages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	/* If the map of the whole graph does not fit into maintenance_work_mem, fall back to the logical scan; otherwise read the whole index sequentially */
+	if (npages * (sizeof(GistPSItem)) > maintenance_work_mem * 1024)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-		OffsetNumber todelete[MaxOffsetNumber];
-		Buffer		buftodelete[MaxOffsetNumber];
-		int			ntodelete = 0;
+		rescan = gistbulkdeletelogicalscan(info, stats, callback, callback_state);
+	}
+	else
+	{
+		rescan = gistbulkdeletephysicalcan(info, stats, callback, callback_state, npages);
+	}
 
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+	/* rescan inner pages that had empty child pages */
+	while (rescan)
+	{
+		Buffer			 buffer;
+		Page			 page;
+		OffsetNumber 	 i,
+						 maxoff;
+		IndexTuple		 idxtuple;
+		ItemId			 iid;
+		OffsetNumber 	 todelete[MaxOffsetNumber];
+		Buffer			 buftodelete[MaxOffsetNumber];
+		int				 ntodelete = 0;
+		ListCell  		*cell;
+		GistRescanItem	*oldRescan;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, rescan->blkno,
 									RBM_NORMAL, info->strategy);
 		LockBuffer(buffer, GIST_EXCLUSIVE);
 		gistcheckpage(rel, buffer);
@@ -304,11 +577,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		maxoff = PageGetMaxOffsetNumber(page);
 
-		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		/* Check that the leaves are still empty and decide what to delete */
+		foreach(cell, rescan->emptyLeafOffsets)
 		{
 			Buffer		leafBuffer;
 			Page		leafPage;
 
+			i = (OffsetNumber)lfirst_int(cell);
+			if(i > maxoff)
+			{
+				continue;
+			}
+
 			iid = PageGetItemId(page, i);
 			idxtuple = (IndexTuple) PageGetItem(page, iid);
 
@@ -333,7 +613,10 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		if (ntodelete)
 		{
-			/*
+			/* Sort the possibly unordered offsets */
+			qsort(todelete, ntodelete, sizeof(OffsetNumber), compare_offsetnumber);
+
+			/* 
 			 * Like in _bt_unlink_halfdead_page we need an upper bound on the
 			 * xid that could hold downlinks to this page. We use
 			 * ReadNewTransactionId() instead of GetCurrentTransactionId()
@@ -378,11 +661,14 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		}
 
 		UnlockReleaseBuffer(buffer);
+		oldRescan = rescan;
+		rescan = rescan->next;
+		list_free(oldRescan->emptyLeafOffsets);
+		pfree(oldRescan);
 
 		vacuum_delay_point();
 	}
 
-	list_free(rescanList);
 
 	return stats;
 }
\ No newline at end of file
-- 
2.15.2 (Apple Git-101.1)
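
The scan-strategy check near the top of gistbulkdelete() above compares the
size of the page map against maintenance_work_mem, which is measured in
kilobytes. A minimal standalone sketch of that arithmetic (not PostgreSQL
code; the 24-byte sizeof(GistPSItem) and all numbers are assumptions for
illustration only):

#include <stdio.h>

int
main(void)
{
	unsigned long npages = 1000000;             /* ~8 GB index at 8 kB pages */
	unsigned long item_size = 24;               /* assumed sizeof(GistPSItem) */
	unsigned long maintenance_work_mem = 65536; /* in kB, i.e. 64 MB */

	/* mirrors: npages * sizeof(GistPSItem) > maintenance_work_mem * 1024 */
	if (npages * item_size > maintenance_work_mem * 1024UL)
		printf("map too big, fall back to the logical (DFS) scan\n");
	else
		printf("physical scan, map needs %lu MB\n",
			   npages * item_size / (1024UL * 1024UL));
	return 0;
}

With these numbers the map needs about 22 MB, so the physical scan is chosen;
with, say, 16 MB of maintenance_work_mem the same index would take the logical
path.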

0001-Delete-pages-during-GiST-VACUUM-v9.patchapplication/octet-stream; name=0001-Delete-pages-during-GiST-VACUUM-v9.patch; x-unix-mode=0644Download
From 4ce6b43685db913c30f863a87126410c690c4e75 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Fri, 13 Jul 2018 22:12:17 +0400
Subject: [PATCH 1/2] Delete pages during GiST VACUUM v9

---
 src/backend/access/gist/README       |  35 ++++++++++
 src/backend/access/gist/gist.c       |  18 ++++++
 src/backend/access/gist/gistbuild.c  |   5 --
 src/backend/access/gist/gistutil.c   |   3 +-
 src/backend/access/gist/gistvacuum.c | 122 +++++++++++++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c   |  63 +++++++++++++++++-
 src/include/access/gist.h            |   3 +
 src/include/access/gist_private.h    |  24 +++++--
 src/include/access/gistxlog.h        |  17 ++++-
 src/test/regress/expected/gist.out   |   4 +-
 src/test/regress/sql/gist.sql        |   4 +-
 11 files changed, 272 insertions(+), 26 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..9548872be8 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,41 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused to allocate newly split pages. To find these pages,
+the function can choose between two strategies: logical scan or physical scan.
+
+The physical scan reads the entire index from the first page to the last. This
+scan maintains the graph structure in a palloc'd array to collect the block
+numbers of internal pages that need to be cleansed of references to empty
+leaves. The array also contains, for each internal page, the offsets of
+downlinks to potentially free leaf pages. This scan method is chosen when
+maintenance work memory is sufficient to hold the necessary graph structure.
+
+The logical scan is chosen when there is not enough maintenance memory to
+execute the physical scan. The logical scan traverses the GiST index in DFS
+order, also following branches of incomplete splits. The logical scan can be
+slower on hard disk drives.
+
+The result of both scans is the same: a stack of block numbers of internal
+pages, each with a list of offsets potentially referencing empty leaf pages.
+After the scan, each such internal page is taken under exclusive lock and each
+potentially free leaf page under it is examined. gistbulkdelete() never deletes
+the last reference from an internal page, so that the tree stays balanced.
+
+The physical scan can return empty leaf page offsets out of order. Thus,
+before executing PageIndexMultiDelete, the offsets (already locked and
+checked) are sorted. This step is not necessary for the logical scan.
+
+Both scans hold only one lock at a time. The physical scan grabs an exclusive
+lock immediately, while the logical scan takes a shared lock and then trades it
+for an exclusive one. This is because the physical scan does less work per
+internal page, and the number of internal pages is small compared to the
+number of leaf pages.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..5297e1691d 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum,
+			 * so we do not need to check whether this page is deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -834,6 +839,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until
+			 * their space is reused. The downlink to such a page may already
+			 * be removed from the internal page, but this scan can still hold it.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 12804c321c..41978bb5e5 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -806,7 +807,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..a96d91eb1d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,10 +16,13 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
+#include "utils/snapmgr.h"
+#include "access/xact.h"
 
 
 /*
@@ -125,7 +128,6 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 	}
 }
 
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -135,12 +137,14 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
 {
 	Relation	rel = info->index;
 	GistBDItem *stack,
 			   *ptr;
+	BlockNumber recentParent = InvalidBlockNumber;
+	List	   *rescanList = NULL;
+	ListCell   *cell;
 
 	/* first time through? */
 	if (stats == NULL)
@@ -233,9 +237,16 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 				END_CRIT_SECTION();
 			}
 
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
+				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			{
+				/* This page is a candidate for deletion. Remember its parent to rescan it later with an xlock */
+				rescanList = lappend_int(rescanList, recentParent);
+			}
 		}
 		else
 		{
+			recentParent = stack->blkno;
 			/* check for split proceeded after look at parent */
 			pushStackIfSplited(page, stack);
 
@@ -270,5 +281,108 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		vacuum_delay_point();
 	}
 
+	/* rescan inner pages that had empty child pages */
+	foreach(cell,rescanList)
+	{
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber i,
+					maxoff;
+		IndexTuple	idxtuple;
+		ItemId		iid;
+		OffsetNumber todelete[MaxOffsetNumber];
+		Buffer		buftodelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+									RBM_NORMAL, info->strategy);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistcheckpage(rel, buffer);
+		page = (Page) BufferGetPage(buffer);
+
+		Assert(!GistPageIsLeaf(page));
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		{
+			Buffer		leafBuffer;
+			Page		leafPage;
+
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+								RBM_NORMAL, info->strategy);
+			LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafBuffer);
+			leafPage = (Page) BufferGetPage(leafBuffer);
+			Assert(GistPageIsLeaf(leafPage));
+
+			if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* Nothing left to split */
+				&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
+				&& ntodelete < maxoff-1) /* We must keep at least one downlink on each internal page */
+			{
+				buftodelete[ntodelete] = leafBuffer;
+				todelete[ntodelete++] = i;
+			}
+			else
+				UnlockReleaseBuffer(leafBuffer);
+		}
+
+
+		if (ntodelete)
+		{
+			/*
+			 * Like in _bt_unlink_halfdead_page we need an upper bound on the
+			 * xid that could hold downlinks to this page. We use
+			 * ReadNewTransactionId() instead of GetCurrentTransactionId()
+			 * since we are in a VACUUM.
+			 */
+			TransactionId txid = ReadNewTransactionId();
+
+			START_CRIT_SECTION();
+
+			/* Mark pages as deleted dropping references from internal pages */
+			for (i = 0; i < ntodelete; i++)
+			{
+				Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+
+				GistPageSetDeleteXid(leafPage,txid);
+
+				GistPageSetDeleted(leafPage);
+				MarkBufferDirty(buftodelete[i]);
+				stats->pages_deleted++;
+
+				MarkBufferDirty(buffer);
+				/* Offsets are changed as long as we delete tuples from internal page */
+				PageIndexTupleDelete(page, todelete[i] - i);
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr recptr =
+						gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+						GistPageGetDeleteXid(leafPage), buffer, todelete[i] - i);
+					PageSetLSN(page, recptr);
+					PageSetLSN(leafPage, recptr);
+				}
+				else
+				{
+					PageSetLSN(page, gistGetFakeLSN(rel));
+					PageSetLSN(leafPage, gistGetFakeLSN(rel));
+				}
+
+				UnlockReleaseBuffer(buftodelete[i]);
+			}
+			END_CRIT_SECTION();
+		}
+
+		UnlockReleaseBuffer(buffer);
+
+		vacuum_delay_point();
+	}
+
+	list_free(rescanList);
+
 	return stats;
-}
+}
\ No newline at end of file
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..a48a8341c0 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->id);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+
+		UnlockReleaseBuffer(buffer);
+	}
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,8 +145,8 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
-			if (GistPageIsLeaf(page))
-				GistMarkTuplesDeleted(page);
+
+			GistMarkTuplesDeleted(page);
 		}
 
 		/* Add new tuples if any */
@@ -324,6 +357,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +478,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * downlink from internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset)
+{
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.id = xid;
+	xlrec.downlinkOffset = internalPageBuffer;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 827566dc6e..0dd2bf47c8 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,9 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..1f82695b1d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..f93024ab25 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,19 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId id;
+   OffsetNumber downlinkOffset; /* Offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+   BlockNumber blkno;
+   int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.15.2 (Apple Git-101.1)
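
One subtle point in the rescan loop above is PageIndexTupleDelete(page,
todelete[i] - i): each single-tuple deletion shifts all higher offsets down by
one, so the i-th deletion has to be compensated by i. That only works if
todelete[] is sorted ascending, which is exactly why the companion
physical-scan patch sorts the offsets before this loop. A toy standalone
sketch of the effect (0-based here for brevity; page offsets are really
1-based):

#include <stdio.h>

/* simulates PageIndexTupleDelete: remove one slot, shifting the rest down */
static void
delete_at(int *items, int *n, int off)
{
	for (int j = off; j < *n - 1; j++)
		items[j] = items[j + 1];
	(*n)--;
}

int
main(void)
{
	int			items[] = {10, 20, 30, 40, 50};
	int			n = 5;
	int			todelete[] = {1, 3};		/* must be sorted ascending */

	for (int i = 0; i < 2; i++)
		delete_at(items, &n, todelete[i] - i);	/* compensate for the shift */

	for (int i = 0; i < n; i++)
		printf("%d ", items[i]);				/* prints: 10 30 50 */
	printf("\n");
	return 0;
}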

#19Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Andrey Borodin (#18)
2 attachment(s)
Re: GiST VACUUM

Hi!

On 16 July 2018, at 21:24, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I was checking WAL replay of the new scheme for logging page deletes and found a bug there (an incorrect value for the deleted downlink in the WAL record). Here's the fixed patch, v10.

I've also added WAL identification support for the new record, and made some improvements to comments and naming in the data structures.

Thanks!

Best regards, Andrey Borodin.
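
As a side note on the deleteXid field: it drives the recycling rule that the
0001 patch adds to gistNewBuffer(). A deleted page may be handed out again
only once the xid stamped at deletion precedes every transaction that could
still hold a downlink to it (RecentGlobalDataXmin in the patch). A toy
illustration of the rule; the plain "<" comparison here ignores the xid
wraparound handling that the real TransactionIdPrecedes() performs:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int TransactionId;

/*
 * Stand-in for TransactionIdPrecedes(GistPageGetDeleteXid(page),
 * RecentGlobalDataXmin); the real code handles xid wraparound.
 */
static bool
page_is_recyclable(TransactionId deleteXid, TransactionId oldestXmin)
{
	return deleteXid < oldestXmin;
}

int
main(void)
{
	/* a scan started before the deletion may still hold the downlink */
	printf("deleteXid=100, oldest xmin=90:  %s\n",
		   page_is_recyclable(100, 90) ? "reuse" : "keep");
	/* all such scans are gone, the page can be handed out again */
	printf("deleteXid=100, oldest xmin=150: %s\n",
		   page_is_recyclable(100, 150) ? "reuse" : "keep");
	return 0;
}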

Attachments:

0002-Physical-GiST-scan-during-VACUUM-v10.patchapplication/octet-stream; name=0002-Physical-GiST-scan-during-VACUUM-v10.patch; x-unix-mode=0644Download
From 21e5d4b629cca1ad3416efe6a3e978cca244b368 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Tue, 17 Jul 2018 22:34:58 +0400
Subject: [PATCH 2/2] Physical GiST scan during VACUUM v10

---
 src/backend/access/gist/gistvacuum.c | 366 +++++++++++++++++++++++++++++++----
 1 file changed, 326 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 8d97c44..778c806 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -103,8 +103,9 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 typedef struct GistBDItem
 {
-	GistNSN		parentlsn;
-	BlockNumber blkno;
+	GistNSN		 parentlsn;
+	BlockNumber  blkno;
+	OffsetNumber parentoffset;
 	struct GistBDItem *next;
 } GistBDItem;
 
@@ -129,30 +130,232 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 }
 
 /*
- * Bulk deletion of all index entries pointing to a set of heap tuples and
- * check invalid tuples left after upgrade.
- * The set of target tuples is specified via a callback routine that tells
- * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ * During the physical scan, for every parent-child pair we can find either
+ * the parent first or the child first. Every time we open an internal page,
+ * we record the parent block number for every child and set GIST_PS_HAS_PARENT.
+ * When the scan gets to a child page that turns out to be empty, we report
+ * back through the parent link. If we find the child first (still without a
+ * parent link), we mark the page GIST_PS_EMPTY_LEAF if it is ready to be
+ * deleted; when we later scan its parent, we pick it up into the rescan list.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+#define GIST_PS_HAS_PARENT 1
+#define GIST_PS_EMPTY_LEAF 2
+
+
+/* Physical scan item */
+typedef struct GistPSItem
 {
-	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
-	BlockNumber recentParent = InvalidBlockNumber;
-	List	   *rescanList = NULL;
-	ListCell   *cell;
+	BlockNumber  parent;
+	List*        emptyLeafOffsets;
+	OffsetNumber parentOffset;
+	uint16       flags;
+} GistPSItem;
+
+/* Blocknumber of internal pages with offsets to rescan for deletion */
+typedef struct GistRescanItem
+{
+	BlockNumber       blkno;
+	List*             emptyLeafOffsets;
+	struct GistRescanItem* next;
+} GistRescanItem;
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
+static void
+gistbulkdeletephysicalcanpage(IndexVacuumInfo * info, IndexBulkDeleteResult * stats,
+								IndexBulkDeleteCallback callback, void* callback_state,
+								BlockNumber blkno, GistNSN startNSN, GistPSItem *graph)
+{
+	Relation	 rel = info->index;
+	Buffer		 buffer;
+	Page		 page;
+	OffsetNumber i,
+					maxoff;
+	IndexTuple   idxtuple;
+	ItemId	     iid;
+
+	/*
+	 * This is a recursive call; it should almost never go deeper than
+	 * GIST_MAX_SPLIT_PAGES, but check anyway.
+	 */
+	check_stack_depth();
+
+	vacuum_delay_point();
+
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
+	/*
+	 * We are not going to stay here for a long time calling recursive
+	 * algorithms, especially for an internal page. So, aggressively grab an
+	 * exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
+
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		/* TODO: shouldn't we record the free page here? */
+		return;
+	}
+
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+
+		/*
+		 * If this page was split after the start of the VACUUM, we have to
+		 * revisit the rightlink if it points to a block we already scanned.
+		 * This recursive revisit should not go deep, but we check for the
+		 * possibility of stack overflow anyway.
+		 */
+		if ((GistFollowRight(page) || startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < blkno))
+			{
+				gistbulkdeletephysicalcanpage(info, stats, callback, callback_state, opaque->rightlink, startNSN, graph);
+			}
+
+		/*
+		 * Remove deletable tuples from page
+		 */
+
+		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+		{
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			if (callback(&(idxtuple->t_tid), callback_state))
+				todelete[ntodelete++] = i;
+			else
+				stats->num_index_tuples += 1;
+		}
+
+		stats->tuples_removed += ntodelete;
+
+		/* We have dead tuples on the page */
+		if (ntodelete)
+		{
+			START_CRIT_SECTION();
+
+			MarkBufferDirty(buffer);
+
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
+
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
+			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
+
+			END_CRIT_SECTION();
+		}
+
+		/* The page is completely empty */
+		if (ntodelete == maxoff)
+		{
+			/* This page is a candidate for deletion. Remember its parent to rescan it later with an xlock */
+			if (graph[blkno].flags & GIST_PS_HAS_PARENT)
+			{
+				/* Go to parent and append myself */
+				BlockNumber parentblockno = graph[blkno].parent;
+				graph[parentblockno].emptyLeafOffsets = lappend_int(graph[parentblockno].emptyLeafOffsets, (int)graph[blkno].parentOffset);
+			}
+			else
+			{
+				/* Parent will collect me later */
+				graph[blkno].flags |= GIST_PS_EMPTY_LEAF;
+			}
+		}
+	}
+	else
+	{
+		/* For internal pages we remember the structure of the tree */
+		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+		{
+			BlockNumber childblkno;
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+			childblkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+
+			if (graph[childblkno].flags & GIST_PS_EMPTY_LEAF)
+			{
+				/* Child has been scanned earlier and is ready to be picked up */
+				graph[blkno].emptyLeafOffsets = lappend_int(graph[blkno].emptyLeafOffsets, i);
+			}
+			else
+			{
+				/* The leaf will be collected when the scan reaches it */
+				graph[childblkno].parent = blkno;
+				graph[childblkno].parentOffset = i;
+				graph[childblkno].flags |= GIST_PS_HAS_PARENT;
+			}
 
+
+			if (GistTupleIsInvalid(idxtuple))
+				ereport(LOG,
+						(errmsg("index \"%s\" contains an inner tuple marked as invalid",
+								RelationGetRelationName(rel)),
+							errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
+							errhint("Please REINDEX it.")));
+		}
+	}
+	UnlockReleaseBuffer(buffer);
+}
+
+/* Read all pages sequentially populating array of GistPSItem */
+static GistRescanItem*
+gistbulkdeletephysicalcan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state, BlockNumber npages)
+{
+	GistRescanItem *result = NULL;
+	BlockNumber      blkno;
+	GistNSN			 startNSN = GetInsertRecPtr();
+
+	/* Here we will store the whole graph of the index */
+	GistPSItem *graph = palloc0(npages * sizeof(GistPSItem));
+
+
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		gistbulkdeletephysicalcanpage(info, stats, callback, callback_state, blkno, startNSN, graph);
+	}
+
+	/* Search for internal pages pointing to empty leaves */
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		if (graph[blkno].emptyLeafOffsets)
+		{
+			GistRescanItem *next = palloc(sizeof(GistRescanItem));
+			next->blkno = blkno;
+			next->emptyLeafOffsets = graph[blkno].emptyLeafOffsets;
+			next->next = result;
+			result = next;
+		}
+	}
+
+	pfree(graph);
+
+	return result;
+}
+
+/* The logical scan descends from the root to the leaves in DFS order */
+static GistRescanItem*
+gistbulkdeletelogicalscan(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation        rel = info->index;
+	BlockNumber     recentParent = InvalidBlockNumber;
+	GistBDItem     *stack,
+				   *ptr;
+	GistRescanItem *result = NULL;
+
+	/* This stack is used to organize DFS */
 	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
 	stack->blkno = GIST_ROOT_BLKNO;
 
@@ -237,11 +440,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				END_CRIT_SECTION();
 			}
 
-			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
-				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber)
 			{
 				/* This page is a candidate for deletion. Remember its parent to rescan it later with an xlock */
-				rescanList = lappend_int(rescanList, recentParent);
+				if (result == NULL || result->blkno != recentParent)
+				{
+					GistRescanItem *next = palloc(sizeof(GistRescanItem));
+					next->blkno = recentParent;
+					next->emptyLeafOffsets = NULL;
+					next->next = result;
+					result = next;
+				}
+				result->emptyLeafOffsets = lappend_int(result->emptyLeafOffsets, stack->parentoffset);
 			}
 		}
 		else
@@ -261,6 +471,7 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 				ptr->parentlsn = BufferGetLSNAtomic(buffer);
 				ptr->next = stack->next;
+				ptr->parentoffset = i;
 				stack->next = ptr;
 
 				if (GistTupleIsInvalid(idxtuple))
@@ -281,20 +492,82 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		vacuum_delay_point();
 	}
 
-	/* rescan inner pages that had empty child pages */
-	foreach(cell,rescanList)
+	return result;
+}
+
+/*
+ * This function is used to sort offsets:
+ * when the physical scan is employed, the rescan offsets are not ordered.
+ */
+static int
+compare_offsetnumber(const void *x, const void *y)
+{
+	OffsetNumber a = *((OffsetNumber *)x);
+	OffsetNumber b = *((OffsetNumber *)y);
+	return a - b;
+}
+
+/*
+ * Bulk deletion of all index entries pointing to a set of heap tuples and
+ * a check for invalid tuples left after upgrade.
+ * The set of target tuples is specified via a callback routine that tells
+ * whether any given heap tuple (identified by ItemPointer) is being deleted.
+ *
+ * Result: a palloc'd struct containing statistical info for VACUUM displays.
+ */
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
+{
+	Relation		rel = info->index;
+	GistRescanItem *rescan;
+	BlockNumber		npages;
+	bool			needLock;
+
+	/* first time through? */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	/* we'll re-count the tuples each time */
+	stats->estimated_count = false;
+	stats->num_index_tuples = 0;
+
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/* try to find deleted pages */
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	npages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	/* If the map of the whole graph does not fit into maintenance_work_mem, fall back to the logical scan; otherwise read the whole index sequentially */
+	if (npages * (sizeof(GistPSItem)) > maintenance_work_mem * 1024)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-		OffsetNumber todelete[MaxOffsetNumber];
-		Buffer		buftodelete[MaxOffsetNumber];
-		int			ntodelete = 0;
+		rescan = gistbulkdeletelogicalscan(info, stats, callback, callback_state);
+	}
+	else
+	{
+		rescan = gistbulkdeletephysicalcan(info, stats, callback, callback_state, npages);
+	}
 
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+	/* rescan inner pages that had empty child pages */
+	while (rescan)
+	{
+		Buffer			 buffer;
+		Page			 page;
+		OffsetNumber 	 i,
+						 maxoff;
+		IndexTuple		 idxtuple;
+		ItemId			 iid;
+		OffsetNumber 	 todelete[MaxOffsetNumber];
+		Buffer			 buftodelete[MaxOffsetNumber];
+		int				 ntodelete = 0;
+		ListCell  		*cell;
+		GistRescanItem	*oldRescan;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, rescan->blkno,
 									RBM_NORMAL, info->strategy);
 		LockBuffer(buffer, GIST_EXCLUSIVE);
 		gistcheckpage(rel, buffer);
@@ -304,11 +577,18 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		maxoff = PageGetMaxOffsetNumber(page);
 
-		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		/* Check that the leaves are still empty and decide what to delete */
+		foreach(cell, rescan->emptyLeafOffsets)
 		{
 			Buffer		leafBuffer;
 			Page		leafPage;
 
+			i = (OffsetNumber)lfirst_int(cell);
+			if(i > maxoff)
+			{
+				continue;
+			}
+
 			iid = PageGetItemId(page, i);
 			idxtuple = (IndexTuple) PageGetItem(page, iid);
 
@@ -333,7 +613,10 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 
 		if (ntodelete)
 		{
-			/*
+			/* Sort the possibly unordered offsets */
+			qsort(todelete, ntodelete, sizeof(OffsetNumber), compare_offsetnumber);
+
+			/* 
 			 * Like in _bt_unlink_halfdead_page we need an upper bound on the
 			 * xid that could hold downlinks to this page. We use
 			 * ReadNewTransactionId() instead of GetCurrentTransactionId()
@@ -378,11 +661,14 @@ gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkD
 		}
 
 		UnlockReleaseBuffer(buffer);
+		oldRescan = rescan;
+		rescan = rescan->next;
+		list_free(oldRescan->emptyLeafOffsets);
+		pfree(oldRescan);
 
 		vacuum_delay_point();
 	}
 
-	list_free(rescanList);
 
 	return stats;
 }
\ No newline at end of file
-- 
2.15.2 (Apple Git-101.1)
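
The comment block at the top of this patch describes the two discovery orders
the physical scan has to handle. A toy model of that bookkeeping (the flag
names come from the patch, but the three-page tree is made up and a simple
counter stands in for the emptyLeafOffsets list):

#include <stdio.h>

#define GIST_PS_HAS_PARENT 1
#define GIST_PS_EMPTY_LEAF 2

typedef struct
{
	int			parent;
	int			parentOffset;
	int			flags;
	int			nempty;			/* stands in for emptyLeafOffsets */
} Item;

static Item graph[8];

/* the scan visits an internal page: inspect every child slot */
static void
saw_internal(int blkno, int *children, int nchildren)
{
	for (int i = 0; i < nchildren; i++)
	{
		int			child = children[i];

		if (graph[child].flags & GIST_PS_EMPTY_LEAF)
			graph[blkno].nempty++;			/* child scanned first: pick it up */
		else
		{
			graph[child].parent = blkno;	/* let the leaf report back later */
			graph[child].parentOffset = i;
			graph[child].flags |= GIST_PS_HAS_PARENT;
		}
	}
}

/* the scan visits a leaf page and finds it completely empty */
static void
saw_empty_leaf(int blkno)
{
	if (graph[blkno].flags & GIST_PS_HAS_PARENT)
		graph[graph[blkno].parent].nempty++;	/* parent was scanned first */
	else
		graph[blkno].flags |= GIST_PS_EMPTY_LEAF;	/* parent not seen yet */
}

int
main(void)
{
	int			children[] = {2, 5};	/* internal page 3 has leaves 2 and 5 */

	saw_empty_leaf(2);				/* block order: child 2 before parent 3 */
	saw_internal(3, children, 2);
	saw_empty_leaf(5);				/* child 5 after parent 3 */

	printf("page 3 has %d empty children to rescan\n", graph[3].nempty);
	return 0;
}

Either order ends with the empty leaf recorded against its parent, which is
what the second pass then turns into the GistRescanItem list.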

0001-Delete-pages-during-GiST-VACUUM-v10.patchapplication/octet-stream; name=0001-Delete-pages-during-GiST-VACUUM-v10.patch; x-unix-mode=0644Download
From d00b3d79881135bd68986afc4927137cfc28f410 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Tue, 17 Jul 2018 22:02:02 +0400
Subject: [PATCH 1/2] Delete pages during GiST VACUUM v10

---
 src/backend/access/gist/README         |  35 ++++++++++
 src/backend/access/gist/gist.c         |  18 +++++
 src/backend/access/gist/gistbuild.c    |   5 --
 src/backend/access/gist/gistutil.c     |   3 +-
 src/backend/access/gist/gistvacuum.c   | 122 +++++++++++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c     |  60 ++++++++++++++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/include/access/gist.h              |   3 +
 src/include/access/gist_private.h      |  24 +++++--
 src/include/access/gistxlog.h          |  17 ++++-
 src/test/regress/expected/gist.out     |   4 +-
 src/test/regress/sql/gist.sql          |   4 +-
 12 files changed, 274 insertions(+), 24 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 0222866..9548872 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,41 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused to allocate newly split pages. To find these pages,
+the function can choose between two strategies: logical scan or physical scan.
+
+The physical scan reads the entire index from the first page to the last. This
+scan maintains the graph structure in a palloc'd array to collect the block
+numbers of internal pages that need to be cleansed of references to empty
+leaves. The array also contains, for each internal page, the offsets of
+downlinks to potentially free leaf pages. This scan method is chosen when
+maintenance work memory is sufficient to hold the necessary graph structure.
+
+The logical scan is chosen when there is not enough maintenance memory to
+execute the physical scan. The logical scan traverses the GiST index in DFS
+order, also following branches of incomplete splits. The logical scan can be
+slower on hard disk drives.
+
+The result of both scans is the same: a stack of block numbers of internal
+pages, each with a list of offsets potentially referencing empty leaf pages.
+After the scan, each such internal page is taken under exclusive lock and each
+potentially free leaf page under it is examined. gistbulkdelete() never deletes
+the last reference from an internal page, so that the tree stays balanced.
+
+The physical scan can return empty leaf page offsets out of order. Thus,
+before executing PageIndexMultiDelete, the offsets (already locked and
+checked) are sorted. This step is not necessary for the logical scan.
+
+Both scans hold only one lock at a time. The physical scan grabs an exclusive
+lock immediately, while the logical scan takes a shared lock and then trades it
+for an exclusive one. This is because the physical scan does less work per
+internal page, and the number of internal pages is small compared to the
+number of leaf pages.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42eff..3a6b5c7 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum,
+			 * so we do not need to check whether this page is deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -834,6 +839,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until
+			 * their space is reused. The downlink to such a page may already
+			 * be removed from the internal page, but this scan can still hold it.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f..f26f139 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 12804c3..41978bb 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -806,7 +807,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218..8d97c44 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,10 +16,13 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
+#include "utils/snapmgr.h"
+#include "access/xact.h"
 
 
 /*
@@ -125,7 +128,6 @@ pushStackIfSplited(Page page, GistBDItem *stack)
 	}
 }
 
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -135,12 +137,14 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+gistbulkdelete(IndexVacuumInfo * info, IndexBulkDeleteResult * stats, IndexBulkDeleteCallback callback, void* callback_state)
 {
 	Relation	rel = info->index;
 	GistBDItem *stack,
 			   *ptr;
+	BlockNumber recentParent = InvalidBlockNumber;
+	List	   *rescanList = NULL;
+	ListCell   *cell;
 
 	/* first time through? */
 	if (stats == NULL)
@@ -233,9 +237,16 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 				END_CRIT_SECTION();
 			}
 
+			if (ntodelete == maxoff && recentParent!=InvalidBlockNumber &&
+				(rescanList == NULL || (BlockNumber)llast_int(rescanList) != recentParent))
+			{
+				/* This page is a candidate for deletion. Remember its parent to rescan it later with an xlock */
+				rescanList = lappend_int(rescanList, recentParent);
+			}
 		}
 		else
 		{
+			recentParent = stack->blkno;
 			/* check for split proceeded after look at parent */
 			pushStackIfSplited(page, stack);
 
@@ -270,5 +281,108 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		vacuum_delay_point();
 	}
 
+	/* rescan inner pages that had empty child pages */
+	foreach(cell,rescanList)
+	{
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber i,
+					maxoff;
+		IndexTuple	idxtuple;
+		ItemId		iid;
+		OffsetNumber todelete[MaxOffsetNumber];
+		Buffer		buftodelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, (BlockNumber)lfirst_int(cell),
+									RBM_NORMAL, info->strategy);
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		gistcheckpage(rel, buffer);
+		page = (Page) BufferGetPage(buffer);
+
+		Assert(!GistPageIsLeaf(page));
+
+		maxoff = PageGetMaxOffsetNumber(page);
+
+		for (i = OffsetNumberNext(FirstOffsetNumber); i <= maxoff; i = OffsetNumberNext(i))
+		{
+			Buffer		leafBuffer;
+			Page		leafPage;
+
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+								RBM_NORMAL, info->strategy);
+			LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafBuffer);
+			leafPage = (Page) BufferGetPage(leafBuffer);
+			Assert(GistPageIsLeaf(leafPage));
+
+			if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* Nothing left to split */
+				&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
+				&& ntodelete < maxoff-1) /* We must keep at least one downlink on each internal page */
+			{
+				buftodelete[ntodelete] = leafBuffer;
+				todelete[ntodelete++] = i;
+			}
+			else
+				UnlockReleaseBuffer(leafBuffer);
+		}
+
+
+		if (ntodelete)
+		{
+			/*
+			 * Like in _bt_unlink_halfdead_page we need an upper bound on the
+			 * xid that could hold downlinks to this page. We use
+			 * ReadNewTransactionId() instead of GetCurrentTransactionId()
+			 * since we are in a VACUUM.
+			 */
+			TransactionId txid = ReadNewTransactionId();
+
+			START_CRIT_SECTION();
+
+			/* Mark pages as deleted dropping references from internal pages */
+			for (i = 0; i < ntodelete; i++)
+			{
+				Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+
+				GistPageSetDeleteXid(leafPage,txid);
+
+				GistPageSetDeleted(leafPage);
+				MarkBufferDirty(buftodelete[i]);
+				stats->pages_deleted++;
+
+				MarkBufferDirty(buffer);
+				/* Offsets are changed as long as we delete tuples from internal page */
+				PageIndexTupleDelete(page, todelete[i] - i);
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr recptr 	=
+						gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+											txid, buffer, todelete[i] - i);
+					PageSetLSN(page, recptr);
+					PageSetLSN(leafPage, recptr);
+				}
+				else
+				{
+					PageSetLSN(page, gistGetFakeLSN(rel));
+					PageSetLSN(leafPage, gistGetFakeLSN(rel));
+				}
+
+				UnlockReleaseBuffer(buftodelete[i]);
+			}
+			END_CRIT_SECTION();
+		}
+
+		UnlockReleaseBuffer(buffer);
+
+		vacuum_delay_point();
+	}
+
+	list_free(rescanList);
+
 	return stats;
-}
+}
\ No newline at end of file
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126..80108f6 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,6 +145,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
+
 			if (GistPageIsLeaf(page))
 				GistMarkTuplesDeleted(page);
 		}
@@ -324,6 +358,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +479,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * downlink from internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset)
+{
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e5e925e..f494db6 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -65,6 +65,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 827566d..0dd2bf4 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,9 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed724..1f82695 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b949..ad0b742 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,19 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId deleteXid; /* last Xid which could see page in scan */
+   OffsetNumber downlinkOffset; /* Offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+   BlockNumber blkno;
+   int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993..5b92f08 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722f..e66396e 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.15.2 (Apple Git-101.1)

#20Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#19)
Re: GiST VACUUM

On 17/07/18 21:41, Andrey Borodin wrote:

I was checking WAL replay of the new scheme to log page deletes and found
a bug there (an incorrect value of the deleted downlink in the WAL record).
Here's the fixed patch, v10.

I've also added WAL identification support for the new record, and made
some improvements to the comments and naming in the data structures.

Thanks!

+		/*
+		 * If this page was split after the start of the VACUUM, we have to
+		 * revisit the rightlink, if it points to a block we already scanned.
+		 * This is a recursive revisit; it should not be deep, but we check
+		 * for the possibility of stack overflow anyway.
+		 */
+		if ((GistFollowRight(page) || startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < blkno))
+			{
+				gistbulkdeletephysicalcanpage(info, stats, callback, callback_state, opaque->rightlink, startNSN, graph);
+			}

In the corresponding B-tree code, we don't do actual recursion, but
a hand-optimized "tail recursion", to avoid stack overflow if there are
a lot of splits. I think we need to do something like that here, too. I
don't think it's safe to assume that we have enough stack space for the
recursion. You have a check_stack_depth() in the function, so you'll get
an ERROR, but it sucks if VACUUM errors out because of that.
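
To illustrate, here is the shape I have in mind - only a sketch, where
vacuum_one_page() is a stand-in for "vacuum the tuples on this block and
return the right sibling that must be revisited, or InvalidBlockNumber";
it is not a real function:

	#include "postgres.h"
	#include "storage/block.h"

	/* Stand-in helper, for illustration only. */
	static BlockNumber vacuum_one_page(BlockNumber blkno,
									   BlockNumber orig_blkno);

	static void
	gistvacuumpage_sketch(BlockNumber orig_blkno)
	{
		BlockNumber blkno = orig_blkno;

	restart:
		/* Read and lock blkno, delete its dead tuples, and find out
		 * whether a concurrent split moved some of its tuples to a
		 * lower-numbered right sibling that we have already passed. */
		blkno = vacuum_one_page(blkno, orig_blkno);

		/*
		 * Instead of calling ourselves for the right sibling, loop back:
		 * the stack depth stays constant no matter how long the chain of
		 * splits is, so no check_stack_depth() is needed.
		 */
		if (blkno != InvalidBlockNumber)
			goto restart;
	}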

It's not cool to use up to 'maintenance_work_mem' of memory for holding
the in-memory graph. VACUUM already uses up to that much memory to hold
the list of dead TIDs, so we would use up to 2x maintenance_work_mem in
total.

The only reason we still need the logical scan is to support page
deletion, when there is not enough memory available. Can we get rid of
that altogether? I think I'd prefer some other scheme to deal with that
situation. For example, we could make a note, in memory, of the empty
pages during the physical scan, and perform a second physical scan of
the index to find their downlinks. Two full-index scans are not nice, but
it's actually not that bad if it's done in physical order. And you could
have some heuristics, e.g. only do it if more than 10% of the pages were
deletable.
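
Roughly like this - just a sketch of the idea, where leaf_became_empty()
and drop_downlinks_to_empty_children() are hypothetical helpers, not
existing functions:

	/* Pass 1: the normal physical scan; note which leaves became empty. */
	uint8	   *empty_leaves = palloc0(npages / 8 + 1); /* one bit per page */
	BlockNumber nempty = 0;
	BlockNumber blkno;

	for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
	{
		if (leaf_became_empty(blkno))	/* vacuumed, and no tuples remain */
		{
			empty_leaves[blkno / 8] |= 1 << (blkno % 8);
			nempty++;
		}
	}

	/* Pass 2: only bother if enough pages are reclaimable, say > 10%. */
	if (nempty > npages / 10)
	{
		for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
			drop_downlinks_to_empty_children(blkno, empty_leaves);
	}

	pfree(empty_leaves);

The bitmap costs one bit per page, so it stays negligible next to the
dead-TID array, even on a huge index.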

Sorry to ask you to rewrite this again, but I think it would be better
to split this into two patches as follows:

1st patch: Scan the index in physical rather than logical order. No
attempt at deleting empty pages yet.

2nd patch: Add support for deleting empty pages.

I would be more comfortable reviewing and committing that first patch,
which just switches to doing physical-order scan, first.

- Heikki

#21Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#20)
1 attachment(s)
Re: GiST VACUUM

Hi!

On 18 July 2018, at 16:02, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

In the corresponding B-tree code, we don't do actual recursion, but a hand-optimized "tail recursion", to avoid stack overflow if there are a lot of splits. I think we need to do something like that here, too. I don't think it's safe to assume that we have enough stack space for the recursion. You have a check_stack_depth() in the function, so you'll get an ERROR, but it sucks if VACUUM errors out because of that.

Ok, will do that.

It's not cool to use up to 'maintenance_work_mem' of memory for holding the in-memory graph. VACUUM already uses up to that much memory to hold the list of dead TIDs, so we would use up to 2x maintenance_work_mem in total.

The only reason we still need the logical scan is to support page deletion, when there is not enough memory available. Can we get rid of that altogether? I think I'd prefer some other scheme to deal with that situation. For example, we could make a note, in memory, of the empty pages during the physical scan, and perform a second physical scan of the index to find their downlinks. Two full-index scans are not nice, but it's actually not that bad if it's done in physical order.

I think this is a good idea. I will implement this.

And you could have some heuristics, e.g. only do it if more than 10% of the pages were deletable.

Sorry to ask you to rewrite this again

Piece of cake :)

, but I think it would be better to split this into two patches as follows:

1st patch: Scan the index in physical rather than logical order. No attempt at deleting empty pages yet.

2nd patch: Add support for deleting empty pages.

I would be more comfortable reviewing and committing that first patch, which just switches to doing physical-order scan, first.

This seems like a very disproportionate division of complexity. The first patch (PFA) is very simple. All the work is done in a single loop, without memorizing anything. Actually, you do not even need to rescan rightlinks: there can be no splits to the left when no pages are deleted.
If you think this is the proper way to go - OK, I'll prepare a better version of the attached diff (eliminating the recursion and adding more comments).

Best regards, Andrey Borodin.

Attachments:

gist_phisical_vacuum_v1.diffapplication/octet-stream; name=gist_phisical_vacuum_v1.diff; x-unix-mode=0644Download
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..e47f50c04c 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -100,32 +100,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 typedef struct GistBDItem
 {
-	GistNSN		parentlsn;
-	BlockNumber blkno;
+	Buffer	buffer;
 	struct GistBDItem *next;
 } GistBDItem;
 
-static void
-pushStackIfSplited(Page page, GistBDItem *stack)
-{
-	GISTPageOpaque opaque = GistPageGetOpaque(page);
-
-	if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
-		(GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page)) &&
-		opaque->rightlink != InvalidBlockNumber /* sanity check */ )
-	{
-		/* split page detected, install right link to the stack */
-
-		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-
-		ptr->blkno = opaque->rightlink;
-		ptr->parentlsn = stack->parentlsn;
-		ptr->next = stack->next;
-		stack->next = ptr;
-	}
-}
-
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -138,9 +116,12 @@ IndexBulkDeleteResult *
 gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
-	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
+	Relation		 rel = info->index;
+	BlockNumber		 npages;
+	bool			 needLock;
+	BlockNumber      blkno;
+	GistNSN			 startNSN = GetInsertRecPtr();
+	GistBDItem 		*bufferStack = NULL;
 
 	/* first time through? */
 	if (stats == NULL)
@@ -149,125 +130,132 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
 
-	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
-	stack->blkno = GIST_ROOT_BLKNO;
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
 
-	while (stack)
+	/* try to find deleted pages */
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	npages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
+		BlockNumber nextBlock = blkno;
+		bool		needScan = true;
 
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
-									RBM_NORMAL, info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		gistcheckpage(rel, buffer);
-		page = (Page) BufferGetPage(buffer);
+		vacuum_delay_point();
 
-		if (GistPageIsLeaf(page))
+		while (needScan)
 		{
-			OffsetNumber todelete[MaxOffsetNumber];
-			int			ntodelete = 0;
+			Buffer		 buffer;
+			Page		 page;
+			OffsetNumber i,
+						 maxoff;
+			IndexTuple   idxtuple;
+			ItemId	     iid;
 
-			LockBuffer(buffer, GIST_UNLOCK);
-			LockBuffer(buffer, GIST_EXCLUSIVE);
+			needScan = false;
 
+			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, nextBlock, RBM_NORMAL,
+										info->strategy);
+			/*
+			 * We are not going to stay here for long, even on an internal
+			 * page, so aggressively grab an exclusive lock right away.
+			 */
+			LockBuffer(buffer, GIST_EXCLUSIVE);
 			page = (Page) BufferGetPage(buffer);
-			if (stack->blkno == GIST_ROOT_BLKNO && !GistPageIsLeaf(page))
+
+			if (PageIsNew(page) || GistPageIsDeleted(page))
 			{
-				/* only the root can become non-leaf during relock */
 				UnlockReleaseBuffer(buffer);
-				/* one more check */
+				/* TODO: Shouldn't we record the free page here? */
 				continue;
 			}
 
-			/*
-			 * check for split proceeded after look at parent, we should check
-			 * it after relock
-			 */
-			pushStackIfSplited(page, stack);
-
-			/*
-			 * Remove deletable tuples from page
-			 */
-
 			maxoff = PageGetMaxOffsetNumber(page);
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			if (GistPageIsLeaf(page))
 			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				if (callback(&(idxtuple->t_tid), callback_state))
-					todelete[ntodelete++] = i;
-				else
-					stats->num_index_tuples += 1;
-			}
-
-			stats->tuples_removed += ntodelete;
-
-			if (ntodelete)
-			{
-				START_CRIT_SECTION();
+				OffsetNumber todelete[MaxOffsetNumber];
+				int			ntodelete = 0;
+				GISTPageOpaque opaque = GistPageGetOpaque(page);
+
+				/*
+				 * If this page was split after the start of the VACUUM, we
+				 * have to revisit the rightlink, if it points to a block we
+				 * already scanned. The revisit is done by looping back
+				 * rather than by recursion, so stack depth is not a concern.
+				 */
+				if ((GistFollowRight(page) || startNSN < GistPageGetNSN(page)) &&
+					(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < blkno))
+					{
+						nextBlock = opaque->rightlink;
+						needScan = true;
+					}
+
+				/*
+				 * Remove deletable tuples from page
+				 */
+
+				for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+				{
+					iid = PageGetItemId(page, i);
+					idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-				MarkBufferDirty(buffer);
+					if (callback(&(idxtuple->t_tid), callback_state))
+						todelete[ntodelete++] = i;
+					else
+						stats->num_index_tuples += 1;
+				}
 
-				PageIndexMultiDelete(page, todelete, ntodelete);
-				GistMarkTuplesDeleted(page);
+				stats->tuples_removed += ntodelete;
 
-				if (RelationNeedsWAL(rel))
+				/* We have dead tuples on the page */
+				if (ntodelete)
 				{
-					XLogRecPtr	recptr;
+					START_CRIT_SECTION();
 
-					recptr = gistXLogUpdate(buffer,
-											todelete, ntodelete,
-											NULL, 0, InvalidBuffer);
-					PageSetLSN(page, recptr);
-				}
-				else
-					PageSetLSN(page, gistGetFakeLSN(rel));
+					MarkBufferDirty(buffer);
 
-				END_CRIT_SECTION();
-			}
+					PageIndexMultiDelete(page, todelete, ntodelete);
+					GistMarkTuplesDeleted(page);
 
-		}
-		else
-		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
 
-			maxoff = PageGetMaxOffsetNumber(page);
+						recptr = gistXLogUpdate(buffer,
+												todelete, ntodelete,
+												NULL, 0, InvalidBuffer);
+						PageSetLSN(page, recptr);
+					}
+					else
+						PageSetLSN(page, gistGetFakeLSN(rel));
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+					END_CRIT_SECTION();
+				}
+			}
+			if (needScan)
+			{
+				GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
+				ptr->buffer = buffer;
+				ptr->next = bufferStack;
+				bufferStack = ptr;
+			}
+			else
 			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				ptr->parentlsn = BufferGetLSNAtomic(buffer);
-				ptr->next = stack->next;
-				stack->next = ptr;
-
-				if (GistTupleIsInvalid(idxtuple))
-					ereport(LOG,
-							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-									RelationGetRelationName(rel)),
-							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-							 errhint("Please REINDEX it.")));
+				UnlockReleaseBuffer(buffer);
 			}
 		}
-
-		UnlockReleaseBuffer(buffer);
-
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
-
-		vacuum_delay_point();
+		while (bufferStack)
+		{
+			UnlockReleaseBuffer(bufferStack->buffer);
+			bufferStack = bufferStack->next;
+		}
 	}
 
 	return stats;
#22Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#21)
Re: GiST VACUUM

On 18/07/18 21:27, Andrey Borodin wrote:

Hi!

On 18 July 2018, at 16:02, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

, but I think it would be better to split this into two patches as
follows:

1st patch: Scan the index in physical rather than logical order. No
attempt at deleting empty pages yet.

2nd patch: Add support for deleting empty pages.

I would be more comfortable reviewing and committing that first
patch, which just switches to doing physical-order scan, first.

This seems like a very disproportionate division of complexity. The
first patch (PFA) is very simple. All the work is done in a single
loop, without memorizing anything. Actually, you do not even need to
rescan rightlinks: there can be no splits to the left when no pages
are deleted.

Heh, good point.

I googled around and bumped into an older patch to do this:
/messages/by-id/1135121410099068@web30j.yandex.ru.
Unfortunately, Костя never got around to updating the patch, and it was
forgotten. But the idea seemed sound even back then.

As noted in that thread, there might be deleted pages in the index in
some rare circumstances, even though we don't recycle empty pages: if
the index was upgraded from a very old version (VACUUM FULL used to
recycle empty pages), or if you crash just when extending the index and
end up with newly-initialized but unused pages that way. So we do need
to handle the concurrent split scenario, even without empty page recycling.

If you think this is the proper way to go - OK, I'll prepare
a better version of the attached diff (eliminating the recursion and
adding more comments).

Yeah, please, I think this is the way to go.

- Heikki

#23Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#22)
2 attachment(s)
Re: GiST VACUUM

Hi!

On 19 July 2018, at 1:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Yeah, please, I think this is the way to go.

Here's v11 divided into proposed steps.

In the second step I still use palloc'd memory, but only to store two bitmaps: a bitmap of internal pages and a bitmap of empty leaves. The second physical scan only reads internal pages. I could omit that bitmap if I scanned everything.
Also, I could replace the emptyLeafs bitmap with an array or list, but I do not really expect it to be big.

Anyway, I propose focusing on the first step.

Best regards, Andrey Borodin.

Attachments:

0002-Delete-pages-during-GiST-VACUUM-v11.patchapplication/octet-stream; name=0002-Delete-pages-during-GiST-VACUUM-v11.patch; x-unix-mode=0644Download
From 4a2ca3ec30b1743a669b21da80fbf589f6a23be8 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 19 Jul 2018 14:28:25 +0400
Subject: [PATCH 2/2] Delete pages during GiST VACUUM v11

---
 src/backend/access/gist/README         |  35 ++++++++
 src/backend/access/gist/gist.c         |  18 ++++
 src/backend/access/gist/gistbuild.c    |   5 --
 src/backend/access/gist/gistutil.c     |   3 +-
 src/backend/access/gist/gistvacuum.c   | 154 +++++++++++++++++++++++++++++++++
 src/backend/access/gist/gistxlog.c     |  60 +++++++++++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/include/access/gist.h              |   3 +
 src/include/access/gist_private.h      |  24 +++--
 src/include/access/gistxlog.h          |  17 +++-
 src/test/regress/expected/gist.out     |   4 +-
 src/test/regress/sql/gist.sql          |   4 +-
 12 files changed, 310 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..9548872be8 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,41 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be used to allocate newly split pages. To find these pages,
+the function can choose between two strategies: a logical scan or a physical
+scan.
+
+The physical scan reads the entire index from the first page to the last. It
+maintains the graph structure in a palloc'ed array, collecting block numbers
+of internal pages that need to be cleansed of references to empty leaves, and
+the offsets of the downlinks to those potentially free leaf pages. This scan
+method is chosen when maintenance work memory can hold the graph structure.
+
+The logical scan is chosen when there is not enough maintenance memory to
+execute the physical scan. The logical scan traverses the GiST index in DFS
+order, following incomplete split branches. The logical scan can be slower on
+hard disk drives.
+
+The result of both scans is the same: a stack of block numbers of internal
+pages, each with a list of offsets potentially referencing empty leaf pages.
+After the scan, each internal page is taken under exclusive lock, and each
+potentially free leaf page is examined. gistbulkdelete() never deletes the
+last reference on an internal page, to preserve the balanced tree properties.
+
+The physical scan can return empty leaf page offsets out of order. Thus,
+before executing PageIndexMultiDelete, the offsets (already locked and
+checked) are sorted. This step is not necessary for the logical scan.
+
+Both scans hold only one lock at a time. The physical scan grabs the
+exclusive lock immediately, while the logical scan takes a shared lock and
+then upgrades it to exclusive. This is done because the physical scan does
+less work on each internal page, and the number of internal pages is small
+compared to the number of leaf pages.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..3a6b5c7ed3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during VACUUM,
+			 * so we do not need to check whether this page is deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -834,6 +839,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 				}
 			}
 
+			/*
+			 * A leaf page can be left deleted but still referenced until
+			 * its space is reused. The downlink to this page may already have
+			 * been removed from the internal page, but this scan can still reach it.
+			 */
+			if (GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 12804c321c..41978bb5e5 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -806,7 +807,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index ff86f4491f..26ac4dd30e 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
@@ -104,6 +105,27 @@ typedef struct GistBDItem
 	struct GistBDItem *next;
 } GistBDItem;
 
+static void gistmapset(char *map, BlockNumber blkno)
+{
+	map[blkno / 8] |= 1 << (blkno % 8);
+}
+static bool gistmapget(char *map, BlockNumber blkno)
+{
+	return (map[blkno / 8] & 1 << (blkno % 8)) != 0;
+}
+
+/*
+ * qsort comparator for offsets. With the physical scan, the offsets
+ * collected for the rescan are not ordered.
+ */
+static int
+compare_offsetnumber(const void *x, const void *y)
+{
+	OffsetNumber a = *((OffsetNumber *)x);
+	OffsetNumber b = *((OffsetNumber *)y);
+	return a - b;
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -121,6 +143,8 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	bool			 needLock;
 	BlockNumber      blkno;
 	GistNSN			 startNSN = GetInsertRecPtr();
+	void			*internals;
+	void			*emptyLeafs;
 
 	/* first time through? */
 	if (stats == NULL)
@@ -140,6 +164,9 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	npages = RelationGetNumberOfBlocks(rel);
 	if (needLock)
 		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	internals = palloc0(npages / 8 + 1);
+	emptyLeafs = palloc0(npages / 8 + 1);
 
 	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
 	{
@@ -248,7 +275,14 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 					END_CRIT_SECTION();
 				}
+				/* The page is completely empty */
+				if (ntodelete == maxoff)
+				{
+					gistmapset(emptyLeafs, BufferGetBlockNumber(buffer));
+				}
 			}
+			else
+				gistmapset(internals, BufferGetBlockNumber(buffer));
 
 			/* We should not unlock buffer if we are going to jump left */
 			if (needScan)
@@ -269,5 +303,125 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		}
 	}
 
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
+	{
+		Buffer		 buffer;
+		Page		 page;
+		OffsetNumber i,
+					 maxoff;
+		IndexTuple   idxtuple;
+		ItemId	     iid;
+		OffsetNumber 	 todelete[MaxOffsetNumber];
+		Buffer			 buftodelete[MaxOffsetNumber];
+		int				 ntodelete = 0;
+
+		if (!gistmapget(internals, blkno))
+			continue; /* second scan is for internal pages */
+
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+									info->strategy);
+
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		page = (Page) BufferGetPage(buffer);
+		/* Currently block of internal page cannot become leaf */
+		Assert(!GistPageIsLeaf(page));
+
+		if (PageIsNew(page) || GistPageIsDeleted(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			/* TODO: Should not we record free page here? */
+			continue;
+		}
+
+		maxoff = PageGetMaxOffsetNumber(page);
+		/* Check that leafs are still empty and decide what to delete */
+		for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+		{
+			Buffer		leafBuffer;
+			Page		leafPage;
+
+			iid = PageGetItemId(page, i);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+			/* if the child page was not empty in the previous scan, skip it */
+			if (!gistmapget(emptyLeafs,
+							ItemPointerGetBlockNumber(&(idxtuple->t_tid))))
+				continue;
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+								RBM_NORMAL, info->strategy);
+			LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafBuffer);
+			leafPage = (Page) BufferGetPage(leafBuffer);
+			Assert(GistPageIsLeaf(leafPage));
+
+			if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* the page is empty */
+				&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* no incomplete split */
+				&& ntodelete < maxoff-1) /* keep at least one downlink on each internal page */
+			{
+				buftodelete[ntodelete] = leafBuffer;
+				todelete[ntodelete++] = i;
+			}
+			else
+				UnlockReleaseBuffer(leafBuffer);
+		}
+
+
+		if (ntodelete)
+		{
+			/* Sort the possibly unordered offsets */
+			qsort(todelete, ntodelete, sizeof(OffsetNumber), compare_offsetnumber);
+
+			/* 
+			 * As in _bt_unlink_halfdead_page(), we need an upper bound on the
+			 * xid that could still see a downlink to this page. We use
+			 * ReadNewTransactionId() instead of GetCurrentTransactionId()
+			 * since we are in a VACUUM.
+			 */
+			TransactionId txid = ReadNewTransactionId();
+
+			START_CRIT_SECTION();
+
+			/* Mark pages as deleted dropping references from internal pages */
+			for (i = 0; i < ntodelete; i++)
+			{
+				Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+
+				GistPageSetDeleteXid(leafPage,txid);
+
+				GistPageSetDeleted(leafPage);
+				MarkBufferDirty(buftodelete[i]);
+				stats->pages_deleted++;
+
+				MarkBufferDirty(buffer);
+				/* Offsets are changed as long as we delete tuples from internal page */
+				PageIndexTupleDelete(page, todelete[i] - i);
+
+				if (RelationNeedsWAL(rel))
+				{
+					XLogRecPtr recptr 	=
+						gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+											txid, buffer, todelete[i] - i);
+					PageSetLSN(page, recptr);
+					PageSetLSN(leafPage, recptr);
+				}
+				else
+				{
+					PageSetLSN(page, gistGetFakeLSN(rel));
+					PageSetLSN(leafPage, gistGetFakeLSN(rel));
+				}
+
+				UnlockReleaseBuffer(buftodelete[i]);
+			}
+			END_CRIT_SECTION();
+		}
+
+		UnlockReleaseBuffer(buffer);
+	}
+
+	pfree(internals);
+	pfree(emptyLeafs);
+
 	return stats;
 }
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..80108f6bfb 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,6 +145,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
+
 			if (GistPageIsLeaf(page))
 				GistMarkTuplesDeleted(page);
 		}
@@ -324,6 +358,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +479,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * the downlink from the internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	/* no tuple data is attached to this record */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e5e925e0c5..f494db63f6 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -65,6 +65,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 827566dc6e..0dd2bf47c8 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,9 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..1f82695b1d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..ad0b742dbb 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,19 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId deleteXid; /* last Xid which could see page in scan */
+   OffsetNumber downlinkOffset; /* Offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+   BlockNumber blkno;
+   int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.15.2 (Apple Git-101.1)

0001-Physical-GiST-scan-in-VACUUM-v11.patchapplication/octet-stream; name=0001-Physical-GiST-scan-in-VACUUM-v11.patch; x-unix-mode=0644Download
From fbb728a265464bd1c3d5e823f6bb23d98c8b3df0 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Thu, 19 Jul 2018 14:09:38 +0400
Subject: [PATCH 1/2] Physical GiST scan in VACUUM v11

---
 src/backend/access/gist/gistvacuum.c | 231 +++++++++++++++++------------------
 1 file changed, 115 insertions(+), 116 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..ff86f4491f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -100,32 +100,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 
 typedef struct GistBDItem
 {
-	GistNSN		parentlsn;
-	BlockNumber blkno;
+	Buffer	buffer;
 	struct GistBDItem *next;
 } GistBDItem;
 
-static void
-pushStackIfSplited(Page page, GistBDItem *stack)
-{
-	GISTPageOpaque opaque = GistPageGetOpaque(page);
-
-	if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
-		(GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page)) &&
-		opaque->rightlink != InvalidBlockNumber /* sanity check */ )
-	{
-		/* split page detected, install right link to the stack */
-
-		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-
-		ptr->blkno = opaque->rightlink;
-		ptr->parentlsn = stack->parentlsn;
-		ptr->next = stack->next;
-		stack->next = ptr;
-	}
-}
-
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -138,9 +116,11 @@ IndexBulkDeleteResult *
 gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
-	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
+	Relation		 rel = info->index;
+	BlockNumber		 npages;
+	bool			 needLock;
+	BlockNumber      blkno;
+	GistNSN			 startNSN = GetInsertRecPtr();
 
 	/* first time through? */
 	if (stats == NULL)
@@ -149,125 +129,144 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
 
-	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
-	stack->blkno = GIST_ROOT_BLKNO;
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
 
-	while (stack)
+	/* try to find deleted pages */
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	npages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	for (blkno = GIST_ROOT_BLKNO; blkno < npages; blkno++)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
+		/*
+		 * In case of concurrent splits we may need to jump back
+		 * and vacuum pages again, to avoid leaving dead tuples behind.
+		 * These fields implement the "tail recursion" used for
+		 * those jumps.
+		 */
+		BlockNumber nextBlock = blkno;
+		bool		needScan = true;
+		GistBDItem *bufferStack = NULL;
 
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
-									RBM_NORMAL, info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		gistcheckpage(rel, buffer);
-		page = (Page) BufferGetPage(buffer);
+		vacuum_delay_point();
 
-		if (GistPageIsLeaf(page))
+		while (needScan)
 		{
-			OffsetNumber todelete[MaxOffsetNumber];
-			int			ntodelete = 0;
+			Buffer		 buffer;
+			Page		 page;
+			OffsetNumber i,
+						 maxoff;
+			IndexTuple   idxtuple;
+			ItemId	     iid;
 
-			LockBuffer(buffer, GIST_UNLOCK);
-			LockBuffer(buffer, GIST_EXCLUSIVE);
+			needScan = false;
 
+			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, nextBlock, RBM_NORMAL,
+										info->strategy);
+			/*
+			 * We are not going to stay here for long, even on an internal
+			 * page, so aggressively grab an exclusive lock right away.
+			 */
+			LockBuffer(buffer, GIST_EXCLUSIVE);
 			page = (Page) BufferGetPage(buffer);
-			if (stack->blkno == GIST_ROOT_BLKNO && !GistPageIsLeaf(page))
+
+			if (PageIsNew(page) || GistPageIsDeleted(page))
 			{
-				/* only the root can become non-leaf during relock */
 				UnlockReleaseBuffer(buffer);
-				/* one more check */
+				/* TODO: Shouldn't we record the free page here? */
 				continue;
 			}
 
-			/*
-			 * check for split proceeded after look at parent, we should check
-			 * it after relock
-			 */
-			pushStackIfSplited(page, stack);
-
-			/*
-			 * Remove deletable tuples from page
-			 */
-
 			maxoff = PageGetMaxOffsetNumber(page);
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			if (GistPageIsLeaf(page))
 			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				OffsetNumber todelete[MaxOffsetNumber];
+				int			ntodelete = 0;
+				GISTPageOpaque opaque = GistPageGetOpaque(page);
+
+				/*
+				 * If this page was split after the start of the VACUUM, we
+				 * have to revisit the rightlink, if it points to a block we
+				 * already scanned. The revisit is done by looping back
+				 * rather than by recursion, so stack depth is not a concern.
+				 */
+				if ((GistFollowRight(page) || startNSN < GistPageGetNSN(page)) &&
+					(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < nextBlock))
+					{
+						/*
+						 * we will aquire locks by following rightlinks
+						 * this lock coupling is allowed in GiST
+						 */
+						nextBlock = opaque->rightlink;
+						needScan = true;
+					}
+
+				/*
+				 * Remove deletable tuples from page
+				 */
+
+				for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+				{
+					iid = PageGetItemId(page, i);
+					idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-				if (callback(&(idxtuple->t_tid), callback_state))
-					todelete[ntodelete++] = i;
-				else
-					stats->num_index_tuples += 1;
-			}
+					if (callback(&(idxtuple->t_tid), callback_state))
+						todelete[ntodelete++] = i;
+					else
+						stats->num_index_tuples += 1;
+				}
 
-			stats->tuples_removed += ntodelete;
+				stats->tuples_removed += ntodelete;
 
-			if (ntodelete)
-			{
-				START_CRIT_SECTION();
+				/* We have dead tuples on the page */
+				if (ntodelete)
+				{
+					START_CRIT_SECTION();
 
-				MarkBufferDirty(buffer);
+					MarkBufferDirty(buffer);
 
-				PageIndexMultiDelete(page, todelete, ntodelete);
-				GistMarkTuplesDeleted(page);
+					PageIndexMultiDelete(page, todelete, ntodelete);
+					GistMarkTuplesDeleted(page);
 
-				if (RelationNeedsWAL(rel))
-				{
-					XLogRecPtr	recptr;
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr;
 
-					recptr = gistXLogUpdate(buffer,
-											todelete, ntodelete,
-											NULL, 0, InvalidBuffer);
-					PageSetLSN(page, recptr);
-				}
-				else
-					PageSetLSN(page, gistGetFakeLSN(rel));
+						recptr = gistXLogUpdate(buffer,
+												todelete, ntodelete,
+												NULL, 0, InvalidBuffer);
+						PageSetLSN(page, recptr);
+					}
+					else
+						PageSetLSN(page, gistGetFakeLSN(rel));
 
-				END_CRIT_SECTION();
+					END_CRIT_SECTION();
+				}
 			}
 
-		}
-		else
-		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
-
-			maxoff = PageGetMaxOffsetNumber(page);
-
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			/* We should not unlock buffer if we are going to jump left */
+			if (needScan)
 			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				ptr->parentlsn = BufferGetLSNAtomic(buffer);
-				ptr->next = stack->next;
-				stack->next = ptr;
-
-				if (GistTupleIsInvalid(idxtuple))
-					ereport(LOG,
-							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-									RelationGetRelationName(rel)),
-							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-							 errhint("Please REINDEX it.")));
+				GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
+				ptr->buffer = buffer;
+				ptr->next = bufferStack;
+				bufferStack = ptr;
 			}
+			else
+				UnlockReleaseBuffer(buffer);
+		}
+		/* unlock stacked buffers */
+		while (bufferStack)
+		{
+			UnlockReleaseBuffer(bufferStack->buffer);
+			bufferStack = bufferStack->next;
 		}
-
-		UnlockReleaseBuffer(buffer);
-
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
-
-		vacuum_delay_point();
 	}
 
 	return stats;
-- 
2.15.2 (Apple Git-101.1)

#24Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#23)
Re: GiST VACUUM

On 19/07/18 13:52, Andrey Borodin wrote:

Hi!

On 19 July 2018, at 1:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Yeah, please, I think this is the way to go.

Here's v11 divided into proposed steps.

Thanks, one quick question:

	/* We should not unlock buffer if we are going to jump left */
	if (needScan)
	{
		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
		ptr->buffer = buffer;
		ptr->next = bufferStack;
		bufferStack = ptr;
	}
	else
		UnlockReleaseBuffer(buffer);

Why? I don't see any need to keep the page locked, when we "jump left".

- Heikki

#25Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#24)
Re: GiST VACUUM

19.07.2018, 15:20, "Heikki Linnakangas" <hlinnaka@iki.fi>:

On 19/07/18 13:52, Andrey Borodin wrote:

Hi!

On 19 July 2018, at 1:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Yeah, please, I think this is the way to go.

Here's v11 divided into proposed steps.

Thanks, one quick question:

	/* We should not unlock buffer if we are going to jump left */
	if (needScan)
	{
		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
		ptr->buffer = buffer;
		ptr->next = bufferStack;
		bufferStack = ptr;
	}
	else
		UnlockReleaseBuffer(buffer);

Why? I don't see any need to keep the page locked, when we "jump left".

Because it can split to the left again, given that we release the lock.

Best regards, Andrey Borodin.

#26Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#25)
1 attachment(s)
Re: GiST VACUUM

On 19/07/18 14:42, Andrey Borodin wrote:

19.07.2018, 15:20, "Heikki Linnakangas" <hlinnaka@iki.fi>:

On 19/07/18 13:52, Andrey Borodin wrote:

Hi!

19 июля 2018 г., в 1:12, Heikki Linnakangas <hlinnaka@iki.fi
<mailto:hlinnaka@iki.fi>>
написал(а):

Yeah, please, I think this is the way to go.

Here's v11 divided into proposed steps.

Thanks, one quick question:

/* We should not unlock buffer if we are going to
jump left */
if (needScan)
{
GistBDItem *ptr = (GistBDItem *)
palloc(sizeof(GistBDItem));
ptr->buffer = buffer;
ptr->next = bufferStack;
bufferStack = ptr;
}
else
UnlockReleaseBuffer(buffer);

Why? I don't see any need to keep the page locked, when we "jump left".

Because it can split to the left again, given that we release lock.

Hmm. So, while we are scanning the right sibling, which was moved to a
lower-numbered block because of a concurrent split, the original page is
split again? That's OK: we've already scanned all the tuples on the
original page before we recurse to deal with the right sibling. (The
corresponding B-tree code also releases the lock on the original page
when recursing.)

I did some refactoring to bring this closer to the B-tree code, for the
sake of consistency. See the attached patch. This also eliminates the
second pass by gistvacuumcleanup(), when that work was already done in
the bulkdelete phase.

There was one crucial thing missing: in the outer loop, we must ensure
that we scan all pages, even those that were added after the vacuum
started. There's a comment explaining that in btvacuumscan(). This
version fixes that.
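
The pattern, condensed (see btvacuumscan() for the original; the GiST
version in the attached patch follows the same shape):

	blkno = GIST_ROOT_BLKNO;
	for (;;)
	{
		/* Get the current relation length */
		if (needLock)
			LockRelationForExtension(rel, ExclusiveLock);
		num_pages = RelationGetNumberOfBlocks(rel);
		if (needLock)
			UnlockRelationForExtension(rel, ExclusiveLock);

		/* Quit if we've scanned the whole relation */
		if (blkno >= num_pages)
			break;

		/* Iterate over pages, then loop back to recheck the length */
		for (; blkno < num_pages; blkno++)
			gistvacuumpage(&vstate, blkno, blkno);
	}

Re-checking the length until it stops growing picks up the pages that
were added after the scan started; a concurrent split that lands on a
lower-numbered (recycled or newly-initialized) page is what the
rightlink-revisit logic inside gistvacuumpage() handles.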

I haven't done any testing on this. Do you have any test scripts you
could share? I think we need some repeatable tests for the concurrent
split cases. Even if it involves gdb or some other hacks that we can't
include in the regression test suite, we need something now, while we're
hacking on this.

One subtle point, that I think is OK, but gave me a pause, and probably
deserves comment somewhere: A concurrent root split can turn a leaf page
into one internal (root) page, and two new leaf pages. The new root page
is placed in the same block as the old page, while both new leaf pages
go to freshly allocated blocks. If that happens while vacuum is running,
might we miss the new leaf pages? As the code stands, we don't do the
"follow-right" dance on internal pages, so we would not recurse into the
new leaf pages. At first, I thought that's a problem, but I think we can
get away with it. The only scenario where a root split happens on a leaf
page, is when the index has exactly one page, a single leaf page. Any
subsequent root splits will split an internal page rather than a leaf
page, and we're not bothered by those. In the case that a root split
happens on a single-page index, we're OK, because we will always scan
that page either before, or after the split. If we scan the single page
before the split, we see all the leaf tuples on that page. If we scan
the single page after the split, it means that we start the scan after
the split, and we will see both leaf pages as we continue the scan.

- Heikki

Attachments:

0001-Physical-GiST-scan-in-VACUUM-v12.patchtext/x-patch; name=0001-Physical-GiST-scan-in-VACUUM-v12.patchDownload
From 9978fd22dd7b52b1b3f509f53fbafa505f68b573 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 19 Jul 2018 15:25:58 +0300
Subject: [PATCH 1/1] Physical GiST scan in VACUUM v12

---
 src/backend/access/gist/gistvacuum.c | 431 ++++++++++++++++++++---------------
 1 file changed, 244 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..180cc6c63a 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,38 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   GistNSN startNSN);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	GistNSN		startNSN;
+
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	startNSN = GetInsertRecPtr();
+	gistvacuumscan(info, stats, callback, callback_state, startNSN);
+
+	return stats;
+}
 
 /*
  * VACUUM cleanup: update FSM
@@ -28,104 +60,36 @@
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	Relation	rel = info->index;
-	BlockNumber npages,
-				blkno;
-	BlockNumber totFreePages;
-	double		tuplesCount;
-	bool		needLock;
-
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
 
-	/* Set up all-zero stats if gistbulkdelete wasn't called */
+	/*
+	 * If gistbulkdelete was called, we need not do anything, just return the
+	 * stats from the latest gistbulkdelete call.  If it wasn't called, we still
+	 * need to do a pass over the index, to obtain index statistics.
+	 */
 	if (stats == NULL)
+	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+		gistvacuumscan(info, stats, NULL, NULL, 0);
+	}
 
 	/*
-	 * Need lock unless it's local to this backend.
+	 * It's quite possible for us to be fooled by concurrent page splits into
+	 * double-counting some index tuples, so disbelieve any total that exceeds
+	 * the underlying heap's count ... if we know that accurately.  Otherwise
+	 * this might just make matters worse.
 	 */
-	needLock = !RELATION_IS_LOCAL(rel);
-
-	/* try to find deleted pages */
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	npages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-
-	totFreePages = 0;
-	tuplesCount = 0;
-	for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
+	if (!info->estimated_count)
 	{
-		Buffer		buffer;
-		Page		page;
-
-		vacuum_delay_point();
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		page = (Page) BufferGetPage(buffer);
-
-		if (PageIsNew(page) || GistPageIsDeleted(page))
-		{
-			totFreePages++;
-			RecordFreeIndexPage(rel, blkno);
-		}
-		else if (GistPageIsLeaf(page))
-		{
-			/* count tuples in index (considering only leaf tuples) */
-			tuplesCount += PageGetMaxOffsetNumber(page);
-		}
-		UnlockReleaseBuffer(buffer);
+		if (stats->num_index_tuples > info->num_heap_tuples)
+			stats->num_index_tuples = info->num_heap_tuples;
 	}
 
-	/* Finally, vacuum the FSM */
-	IndexFreeSpaceMapVacuum(info->index);
-
-	/* return statistics */
-	stats->pages_free = totFreePages;
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	stats->num_pages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-	stats->num_index_tuples = tuplesCount;
-	stats->estimated_count = false;
-
 	return stats;
 }
 
-typedef struct GistBDItem
-{
-	GistNSN		parentlsn;
-	BlockNumber blkno;
-	struct GistBDItem *next;
-} GistBDItem;
-
-static void
-pushStackIfSplited(Page page, GistBDItem *stack)
-{
-	GISTPageOpaque opaque = GistPageGetOpaque(page);
-
-	if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
-		(GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page)) &&
-		opaque->rightlink != InvalidBlockNumber /* sanity check */ )
-	{
-		/* split page detected, install right link to the stack */
-
-		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-
-		ptr->blkno = opaque->rightlink;
-		ptr->parentlsn = stack->parentlsn;
-		ptr->next = stack->next;
-		stack->next = ptr;
-	}
-}
-
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -134,141 +98,234 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+static void
+gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   GistNSN startNSN)
 {
 	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
+	GistVacState vstate;
+	BlockNumber num_pages;
+	bool		needLock;
+	BlockNumber	blkno;
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
+	/*
+	 * Reset counts that will be incremented during the scan; needed in case
+	 * of multiple scans during a single VACUUM command
+	 */
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
+	stats->pages_deleted = 0;
 
-	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
-	stack->blkno = GIST_ROOT_BLKNO;
+	/* Set up info to pass down to btvacuumpage */
+	vstate.info = info;
+	vstate.stats = stats;
+	vstate.callback = callback;
+	vstate.callback_state = callback_state;
+	vstate.startNSN = startNSN;
+	vstate.totFreePages = 0;
 
-	while (stack)
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/*
+	 * FIXME: copied from btvacuumscan. Check that all this also holds for
+	 * GiST!
+	 *
+	 * The outer loop iterates over all index pages, in
+	 * physical order (we hope the kernel will cooperate in providing
+	 * read-ahead for speed).  It is critical that we visit all leaf pages,
+	 * including ones added after we start the scan, else we might fail to
+	 * delete some deletable tuples.  Hence, we must repeatedly check the
+	 * relation length.  We must acquire the relation-extension lock while
+	 * doing so to avoid a race condition: if someone else is extending the
+	 * relation, there is a window where bufmgr/smgr have created a new
+	 * all-zero page but it hasn't yet been write-locked by _bt_getbuf(). If
+	 * we manage to scan such a page here, we'll improperly assume it can be
+	 * recycled.  Taking the lock synchronizes things enough to prevent a
+	 * problem: either num_pages won't include the new page, or _bt_getbuf
+	 * already has write lock on the buffer and it will be fully initialized
+	 * before we can examine it.  (See also vacuumlazy.c, which has the same
+	 * issue.)	Also, we need not worry if a page is added immediately after
+	 * we look; the page splitting code already has write-lock on the left
+	 * page before it adds a right page, so we must already have processed any
+	 * tuples due to be moved into such a page.
+	 *
+	 * We can skip locking for new or temp relations, however, since no one
+	 * else could be accessing them.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	blkno = GIST_ROOT_BLKNO;
+	for (;;)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
-									RBM_NORMAL, info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		gistcheckpage(rel, buffer);
-		page = (Page) BufferGetPage(buffer);
-
-		if (GistPageIsLeaf(page))
+		/* Get the current relation length */
+		if (needLock)
+			LockRelationForExtension(rel, ExclusiveLock);
+		num_pages = RelationGetNumberOfBlocks(rel);
+		if (needLock)
+			UnlockRelationForExtension(rel, ExclusiveLock);
+
+		/* Quit if we've scanned the whole relation */
+		if (blkno >= num_pages)
+			break;
+		/* Iterate over pages, then loop back to recheck length */
+		for (; blkno < num_pages; blkno++)
 		{
-			OffsetNumber todelete[MaxOffsetNumber];
-			int			ntodelete = 0;
+			gistvacuumpage(&vstate, blkno, blkno);
+		}
+	}
 
-			LockBuffer(buffer, GIST_UNLOCK);
-			LockBuffer(buffer, GIST_EXCLUSIVE);
+	/*
+	 * If we found any recyclable pages (and recorded them in the FSM), then
+	 * forcibly update the upper-level FSM pages to ensure that searchers can
+	 * find them.  It's possible that the pages were also found during
+	 * previous scans and so this is a waste of time, but it's cheap enough
+	 * relative to scanning the index that it shouldn't matter much, and
+	 * making sure that free pages are available sooner not later seems
+	 * worthwhile.
+	 *
+	 * Note that if no recyclable pages exist, we don't bother vacuuming the
+	 * FSM at all.
+	 */
+	if (vstate.totFreePages > 0)
+		IndexFreeSpaceMapVacuum(rel);
 
-			page = (Page) BufferGetPage(buffer);
-			if (stack->blkno == GIST_ROOT_BLKNO && !GistPageIsLeaf(page))
-			{
-				/* only the root can become non-leaf during relock */
-				UnlockReleaseBuffer(buffer);
-				/* one more check */
-				continue;
-			}
+	/* update statistics */
+	stats->num_pages = num_pages;
+	stats->pages_free = vstate.totFreePages;
+}
 
-			/*
-			 * check for split proceeded after look at parent, we should check
-			 * it after relock
-			 */
-			pushStackIfSplited(page, stack);
+/*
+ * gistvacuumpage --- VACUUM one page
+ *
+ * This processes a single page for gistbulkdelete().  In some cases we
+ * must go back and re-examine previously-scanned pages; this routine
+ * recurses when necessary to handle that case.
+ *
+ * blkno is the page to process.  orig_blkno is the highest block number
+ * reached by the outer btvacuumscan loop (the same as blkno, unless we
+ * are recursing to re-examine a previous page).
+ */
+static void
+gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+{
+	IndexVacuumInfo *info = vstate->info;
+	IndexBulkDeleteResult *stats = vstate->stats;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+	Relation	rel = info->index;
+	Buffer		buffer;
+	Page		page;
+	BlockNumber recurse_to;
 
-			/*
-			 * Remove deletable tuples from page
-			 */
+restart:
+	recurse_to = InvalidBlockNumber;
 
-			maxoff = PageGetMaxOffsetNumber(page);
+	/* call vacuum_delay_point while not holding any buffer lock */
+	vacuum_delay_point();
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
+	/*
+	 * We are not going to stay here for long, even when calling recursive
+	 * algorithms, especially for an internal page. So, aggressively grab an
+	 * exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
 
-				if (callback(&(idxtuple->t_tid), callback_state))
-					todelete[ntodelete++] = i;
-				else
-					stats->num_index_tuples += 1;
-			}
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		vstate->totFreePages++;
+		RecordFreeIndexPage(rel, blkno);
+		return;
+	}
 
-			stats->tuples_removed += ntodelete;
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If this page was split after the start of the VACUUM, we have to
+		 * revisit the rightlink if it points to a block we already scanned.
+		 * The revisit is recursive and should not be deep, but we guard
+		 * against stack overflow anyway.
+		 */
+		if ((GistFollowRight(page) || vstate->startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < orig_blkno))
+		{
+			recurse_to = opaque->rightlink;
+		}
+
+		/*
+		 * Remove deletable tuples from page
+		 */
+		if (callback)
+		{
+			OffsetNumber off;
 
-			if (ntodelete)
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
 			{
-				START_CRIT_SECTION();
+				ItemId		iid = PageGetItemId(page, off);
+				IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-				MarkBufferDirty(buffer);
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = off;
+			}
+		}
 
-				PageIndexMultiDelete(page, todelete, ntodelete);
-				GistMarkTuplesDeleted(page);
+		/* We have dead tuples on the page */
+		if (ntodelete)
+		{
+			START_CRIT_SECTION();
+
+			MarkBufferDirty(buffer);
 
-				if (RelationNeedsWAL(rel))
-				{
-					XLogRecPtr	recptr;
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
 
-					recptr = gistXLogUpdate(buffer,
-											todelete, ntodelete,
-											NULL, 0, InvalidBuffer);
-					PageSetLSN(page, recptr);
-				}
-				else
-					PageSetLSN(page, gistGetFakeLSN(rel));
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
 
-				END_CRIT_SECTION();
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
 			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
-		}
-		else
-		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
+			END_CRIT_SECTION();
 
+			stats->tuples_removed += ntodelete;
+			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
-
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				ptr->parentlsn = BufferGetLSNAtomic(buffer);
-				ptr->next = stack->next;
-				stack->next = ptr;
-
-				if (GistTupleIsInvalid(idxtuple))
-					ereport(LOG,
-							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-									RelationGetRelationName(rel)),
-							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-							 errhint("Please REINDEX it.")));
-			}
 		}
 
-		UnlockReleaseBuffer(buffer);
-
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
+		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
-		vacuum_delay_point();
 	}
 
-	return stats;
+	UnlockReleaseBuffer(buffer);
+
+	/*
+	 * This is really tail recursion, but if the compiler is too stupid to
+	 * optimize it as such, we'd eat an uncomfortably large amount of stack
+	 * space per recursion level (due to the deletable[] array). A failure is
+	 * improbable since the number of levels isn't likely to be large ... but
+	 * just in case, let's hand-optimize into a loop.
+	 */
+	if (recurse_to != InvalidBlockNumber)
+	{
+		blkno = recurse_to;
+		goto restart;
+	}
 }
-- 
2.11.0

#27Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#26)
Re: GiST VACUUM

On 19 July 2018, at 16:28, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Hmm. So, while we are scanning the right sibling, which was moved to a lower-numbered block because of a concurrent split, the original page is split again? That's OK, we've already scanned all the tuples on the original page, before we recurse to deal with the right sibling. (The corresponding B-tree code also releases the lock on the original page when recursing)

Seems right.

I did some refactoring, to bring this closer to the B-tree code, for the sake of consistency. See attached patch. This also eliminates the 2nd pass by gistvacuumcleanup(), in case we did that in the bulkdelete-phase already.

Thanks!

There was one crucial thing missing: in the outer loop, we must ensure that we scan all pages, even those that were added after the vacuum started.

Correct. There is quite neat logic behind the order of acquiring npages, comparing, and vacuuming pages. The notes in the FIXME look correct, except for the function names.
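
(For reference, the ordering in question, condensed from the v12 patch above
with annotations added:

	blkno = GIST_ROOT_BLKNO;
	for (;;)
	{
		/* 1. acquire num_pages under the relation-extension lock */
		if (needLock)
			LockRelationForExtension(rel, ExclusiveLock);
		num_pages = RelationGetNumberOfBlocks(rel);
		if (needLock)
			UnlockRelationForExtension(rel, ExclusiveLock);

		/* 2. compare: stop only if no pages were appended meanwhile */
		if (blkno >= num_pages)
			break;

		/* 3. vacuum everything up to the length just read */
		for (; blkno < num_pages; blkno++)
			gistvacuumpage(&vstate, blkno, blkno);
	}

Any page appended during step 3 is caught by the re-read in the next round.)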

There's a comment explaining that in btvacuumscan(). This version fixes that.

I haven't done any testing on this. Do you have any test scripts you could share?

I use just a simple test that sets up replication and does random inserts and vacuums. Not rocket science, just a mutated script:
for i in $(seq 1 12); do
size=$((100 * 2**$i))
./psql postgres -c "create table x as select cube(random()) c from generate_series(1,$size) y; create index on x using gist(c);"
./psql postgres -c "delete from x;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "drop table x;"
./psql postgres -c "create table x as select cube(random()) c from generate_series(1,$size) y; create index on x using gist(c);"
./psql postgres -c "delete from x where (c~>1)>0.1;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "insert into x select cube(random()) c from generate_series(1,$size) y;"
./psql postgres -c "VACUUM x;"
./psql postgres -c "delete from x where (c~>1)>0.1;"
./psql postgres -c "select pg_size_pretty(pg_relation_size('x_c_idx'));"
./psql postgres -c "VACUUM FULL x;"
./psql postgres -c "select pg_size_pretty(pg_relation_size('x_c_idx'));"
./psql postgres -c "drop table x;"
done

I think we need some repeatable tests for the concurrent split cases.

It is hard to trigger left splits until we delete pages. I'll try to hack gistNewBuffer() to cause something similar.

Even if it involves gdb or some other hacks that we can't include in the regression test suite, we need something now, while we're hacking on this.

One subtle point, that I think is OK, but gave me a pause, and probably deserves comment somewhere: A concurrent root split can turn a leaf page into one internal (root) page, and two new leaf pages. The new root page is placed in the same block as the old page, while both new leaf pages go to freshly allocated blocks. If that happens while vacuum is running, might we miss the new leaf pages? As the code stands, we don't do the "follow-right" dance on internal pages, so we would not recurse into the new leaf pages. At first, I thought that's a problem, but I think we can get away with it. The only scenario where a root split happens on a leaf page, is when the index has exactly one page, a single leaf page. Any subsequent root splits will split an internal page rather than a leaf page, and we're not bothered by those. In the case that a root split happens on a single-page index, we're OK, because we will always scan that page either before, or after the split. If we scan the single page before the split, we see all the leaf tuples on that page. If we scan the single page after the split, it means that we start the scan after the split, and we will see both leaf pages as we continue the scan.

Yes, only page 0 may change its type, and page 0 cannot split to the left.

I'm working on triggering a left split during vacuum. Will get back when done. Thanks!

Best regards, Andrey Borodin.

#28Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Andrey Borodin (#27)
1 attachment(s)
Re: GiST VACUUM

Hi!

On 19 July 2018, at 23:26, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

I'm working on triggering a left split during vacuum. Will get back when done. Thanks!

Here's a patch including some messy hacks to trigger NSN and FollowRight jumps during VACUUM.

To trigger FollowRight, GiST sometimes "forgets" to clear the follow-right marker, simulating a crash during an insert. This fills the logs with "fixing incomplete split" messages. Search for "REMOVE THIS" to disable these misbehavior triggers.
To trigger an NSN jump, GiST allocates an empty page after every real allocation.

gistvacuumcleanup() was constantly generating left jumps because 0 was used instead of the real start NSN, so I moved the NSN acquisition into gistvacuumscan(). I also fixed some comments.
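
(The rescan condition in gistvacuumpage() that this fixes is, per the v13
patch below:

	if ((GistFollowRight(page) || vstate->startNSN < GistPageGetNSN(page)) &&
		(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < orig_blkno))
		recurse_to = opaque->rightlink;

With startNSN = 0, the NSN test fires for every page that was ever split,
since any valid NSN compares greater than zero.)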

gistvacuumcleanup() will now have the same effect as gistbulkdelete(); is that OK?

To reproduce left jumps, run ./rescantest.sh.
The script contains variables for my local paths.

Best regards, Andrey Borodin.

Attachments:

0001-Physical-GiST-scan-in-VACUUM-v13.patchapplication/octet-stream; name=0001-Physical-GiST-scan-in-VACUUM-v13.patch; x-unix-mode=0644Download
From 6d8dd2b62d84c67e74bf63fe7fd6c80d4fe52e08 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 21 Jul 2018 15:21:13 +0500
Subject: [PATCH] Physical GiST scan in VACUUM v13

---
 insert.sql                           |   1 +
 rescantest.sh                        |  23 ++
 src/backend/access/gist/gist.c       |   2 +-
 src/backend/access/gist/gistutil.c   |   2 +
 src/backend/access/gist/gistvacuum.c | 427 ++++++++++++++++++++---------------
 vacuum.sql                           |   2 +
 6 files changed, 270 insertions(+), 187 deletions(-)
 create mode 100644 insert.sql
 create mode 100755 rescantest.sh
 create mode 100644 vacuum.sql

diff --git a/insert.sql b/insert.sql
new file mode 100644
index 0000000000..3028aa336c
--- /dev/null
+++ b/insert.sql
@@ -0,0 +1 @@
+insert into x select cube(random()) c from generate_series(1,10000) y;
diff --git a/rescantest.sh b/rescantest.sh
new file mode 100755
index 0000000000..2478599c70
--- /dev/null
+++ b/rescantest.sh
@@ -0,0 +1,23 @@
+#!/usr/bin/env bash
+
+set -e
+pkill -9 postgres || true
+make -j 16 && make install
+
+DB=~/DemoDb
+BINDIR=~/project/bin
+
+rm -rf $DB
+cp *.sql $BINDIR
+cd $BINDIR
+./initdb $DB
+./pg_ctl -D $DB start
+./psql postgres -c "create extension cube;"
+
+
+./psql postgres -c "create table x as select cube(random()) c from generate_series(1,10000) y; create index on x using gist(c);"
+./psql postgres -c "delete from x where (c~>1)>0.1;"
+./pgbench -f insert.sql postgres -T 30 &
+./pgbench -f vacuum.sql postgres -T 30
+
+./pg_ctl -D $DB stop
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..b9ba6e1241 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -583,7 +583,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 * the child pages first, we wouldn't know the recptr of the WAL record
 	 * we're about to write.
 	 */
-	if (BufferIsValid(leftchildbuf))
+	if (BufferIsValid(leftchildbuf) && ((random()%20) != 0)) // REMOVE THIS randoms
 	{
 		Page		leftpg = BufferGetPage(leftchildbuf);
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 12804c321c..59fd554e80 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -828,6 +828,8 @@ gistNewBuffer(Relation r)
 	if (needLock)
 		UnlockRelationForExtension(r, ExclusiveLock);
 
+	ReleaseBuffer(ReadBuffer(r, P_NEW));// REMOVE THIS LINE
+	
 	return buffer;
 }
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..9006bf535d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,34 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	gistvacuumscan(info, stats, callback, callback_state);
+
+	return stats;
+}
 
 /*
  * VACUUM cleanup: update FSM
@@ -28,104 +56,36 @@
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	Relation	rel = info->index;
-	BlockNumber npages,
-				blkno;
-	BlockNumber totFreePages;
-	double		tuplesCount;
-	bool		needLock;
-
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
 
-	/* Set up all-zero stats if gistbulkdelete wasn't called */
+	/*
+	 * If gistbulkdelete was called, we need not do anything, just return the
+	 * stats from the latest gistbulkdelete call.  If it wasn't called, we still
+	 * need to do a pass over the index, to obtain index statistics.
+	 */
 	if (stats == NULL)
+	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+		gistvacuumscan(info, stats, NULL, NULL);
+	}
 
 	/*
-	 * Need lock unless it's local to this backend.
+	 * It's quite possible for us to be fooled by concurrent page splits into
+	 * double-counting some index tuples, so disbelieve any total that exceeds
+	 * the underlying heap's count ... if we know that accurately.  Otherwise
+	 * this might just make matters worse.
 	 */
-	needLock = !RELATION_IS_LOCAL(rel);
-
-	/* try to find deleted pages */
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	npages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-
-	totFreePages = 0;
-	tuplesCount = 0;
-	for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
+	if (!info->estimated_count)
 	{
-		Buffer		buffer;
-		Page		page;
-
-		vacuum_delay_point();
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		page = (Page) BufferGetPage(buffer);
-
-		if (PageIsNew(page) || GistPageIsDeleted(page))
-		{
-			totFreePages++;
-			RecordFreeIndexPage(rel, blkno);
-		}
-		else if (GistPageIsLeaf(page))
-		{
-			/* count tuples in index (considering only leaf tuples) */
-			tuplesCount += PageGetMaxOffsetNumber(page);
-		}
-		UnlockReleaseBuffer(buffer);
+		if (stats->num_index_tuples > info->num_heap_tuples)
+			stats->num_index_tuples = info->num_heap_tuples;
 	}
 
-	/* Finally, vacuum the FSM */
-	IndexFreeSpaceMapVacuum(info->index);
-
-	/* return statistics */
-	stats->pages_free = totFreePages;
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	stats->num_pages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-	stats->num_index_tuples = tuplesCount;
-	stats->estimated_count = false;
-
 	return stats;
 }
 
-typedef struct GistBDItem
-{
-	GistNSN		parentlsn;
-	BlockNumber blkno;
-	struct GistBDItem *next;
-} GistBDItem;
-
-static void
-pushStackIfSplited(Page page, GistBDItem *stack)
-{
-	GISTPageOpaque opaque = GistPageGetOpaque(page);
-
-	if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
-		(GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page)) &&
-		opaque->rightlink != InvalidBlockNumber /* sanity check */ )
-	{
-		/* split page detected, install right link to the stack */
-
-		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-
-		ptr->blkno = opaque->rightlink;
-		ptr->parentlsn = stack->parentlsn;
-		ptr->next = stack->next;
-		stack->next = ptr;
-	}
-}
-
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -134,141 +94,236 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void
+gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
+	GistVacState vstate;
+	BlockNumber num_pages;
+	bool		needLock;
+	BlockNumber	blkno;
+	GistNSN startNSN = GetInsertRecPtr();
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
+	/*
+	 * Reset counts that will be incremented during the scan; needed in case
+	 * of multiple scans during a single VACUUM command
+	 */
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
+	stats->pages_deleted = 0;
 
-	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
-	stack->blkno = GIST_ROOT_BLKNO;
+	/* Set up info to pass down to gistvacuumpage */
+	vstate.info = info;
+	vstate.stats = stats;
+	vstate.callback = callback;
+	vstate.callback_state = callback_state;
+	vstate.startNSN = startNSN;
+	vstate.totFreePages = 0;
 
-	while (stack)
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/*
+	 * FIXME: copied from btvacuumscan. Check that all this also holds for
+	 * GiST! 
+	 * AB: Yes, gistNewBuffer() takes LockRelationForExtension()
+	 *
+	 * The outer loop iterates over all index pages, in
+	 * physical order (we hope the kernel will cooperate in providing
+	 * read-ahead for speed).  It is critical that we visit all leaf pages,
+	 * including ones added after we start the scan, else we might fail to
+	 * delete some deletable tuples.  Hence, we must repeatedly check the
+	 * relation length.  We must acquire the relation-extension lock while
+	 * doing so to avoid a race condition: if someone else is extending the
+	 * relation, there is a window where bufmgr/smgr have created a new
+	 * all-zero page but it hasn't yet been write-locked by gistNewBuffer(). If
+	 * we manage to scan such a page here, we'll improperly assume it can be
+	 * recycled.  Taking the lock synchronizes things enough to prevent a
+	 * problem: either num_pages won't include the new page, or gistNewBuffer
+	 * already has write lock on the buffer and it will be fully initialized
+	 * before we can examine it.  (See also vacuumlazy.c, which has the same
+	 * issue.)	Also, we need not worry if a page is added immediately after
+	 * we look; the page splitting code already has write-lock on the left
+	 * page before it adds a right page, so we must already have processed any
+	 * tuples due to be moved into such a page.
+	 *
+	 * We can skip locking for new or temp relations, however, since no one
+	 * else could be accessing them.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	blkno = GIST_ROOT_BLKNO;
+	for (;;)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
-									RBM_NORMAL, info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		gistcheckpage(rel, buffer);
-		page = (Page) BufferGetPage(buffer);
-
-		if (GistPageIsLeaf(page))
+		/* Get the current relation length */
+		if (needLock)
+			LockRelationForExtension(rel, ExclusiveLock);
+		num_pages = RelationGetNumberOfBlocks(rel);
+		if (needLock)
+			UnlockRelationForExtension(rel, ExclusiveLock);
+
+		/* Quit if we've scanned the whole relation */
+		if (blkno >= num_pages)
+			break;
+		/* Iterate over pages, then loop back to recheck length */
+		for (; blkno < num_pages; blkno++)
 		{
-			OffsetNumber todelete[MaxOffsetNumber];
-			int			ntodelete = 0;
+			gistvacuumpage(&vstate, blkno, blkno);
+		}
+	}
 
-			LockBuffer(buffer, GIST_UNLOCK);
-			LockBuffer(buffer, GIST_EXCLUSIVE);
+	/*
+	 * If we found any recyclable pages (and recorded them in the FSM), then
+	 * forcibly update the upper-level FSM pages to ensure that searchers can
+	 * find them.  It's possible that the pages were also found during
+	 * previous scans and so this is a waste of time, but it's cheap enough
+	 * relative to scanning the index that it shouldn't matter much, and
+	 * making sure that free pages are available sooner not later seems
+	 * worthwhile.
+	 *
+	 * Note that if no recyclable pages exist, we don't bother vacuuming the
+	 * FSM at all.
+	 */
+	if (vstate.totFreePages > 0)
+		IndexFreeSpaceMapVacuum(rel);
 
-			page = (Page) BufferGetPage(buffer);
-			if (stack->blkno == GIST_ROOT_BLKNO && !GistPageIsLeaf(page))
-			{
-				/* only the root can become non-leaf during relock */
-				UnlockReleaseBuffer(buffer);
-				/* one more check */
-				continue;
-			}
+	/* update statistics */
+	stats->num_pages = num_pages;
+	stats->pages_free = vstate.totFreePages;
+}
 
-			/*
-			 * check for split proceeded after look at parent, we should check
-			 * it after relock
-			 */
-			pushStackIfSplited(page, stack);
+/*
+ * gistvacuumpage --- VACUUM one page
+ *
+ * This processes a single page for gistbulkdelete().  In some cases we
+ * must go back and re-examine previously-scanned pages; this routine
+ * recurses when necessary to handle that case.
+ *
+ * blkno is the page to process.  orig_blkno is the highest block number
+ * reached by the outer gistvacuumscan loop (the same as blkno, unless we
+ * are recursing to re-examine a previous page).
+ */
+static void
+gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+{
+	IndexVacuumInfo *info = vstate->info;
+	IndexBulkDeleteResult *stats = vstate->stats;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+	Relation	rel = info->index;
+	Buffer		buffer;
+	Page		page;
+	BlockNumber recurse_to;
 
-			/*
-			 * Remove deletable tuples from page
-			 */
+restart:
+	recurse_to = InvalidBlockNumber;
 
-			maxoff = PageGetMaxOffsetNumber(page);
+	/* call vacuum_delay_point while not holding any buffer lock */
+	vacuum_delay_point();
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
+	/*
+	 * We are not going to stay here for long, so aggressively grab an
+	 * exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
 
-				if (callback(&(idxtuple->t_tid), callback_state))
-					todelete[ntodelete++] = i;
-				else
-					stats->num_index_tuples += 1;
-			}
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		vstate->totFreePages++;
+		RecordFreeIndexPage(rel, blkno);
+		return;
+	}
 
-			stats->tuples_removed += ntodelete;
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If this page was split after the start of the VACUUM, we have to
+		 * revisit the rightlink if it points to a block we already scanned.
+		 */
+		if ((GistFollowRight(page) || vstate->startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < orig_blkno))
+		{
+			if (GistFollowRight(page)) // REMOVE THIS LINE
+				elog(WARNING,"RESCAN TRIGGERED BY FollowRight"); // REMOVE THIS LINE
+			if (vstate->startNSN < GistPageGetNSN(page)) // REMOVE THIS LINE
+				elog(WARNING,"RESCAN TRIGGERED BY NSN"); // REMOVE THIS LINE
+			recurse_to = opaque->rightlink;
+		}
+
+		/*
+		 * Remove deletable tuples from page
+		 */
+		if (callback)
+		{
+			OffsetNumber off;
 
-			if (ntodelete)
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
 			{
-				START_CRIT_SECTION();
+				ItemId		iid = PageGetItemId(page, off);
+				IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-				MarkBufferDirty(buffer);
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = off;
+			}
+		}
 
-				PageIndexMultiDelete(page, todelete, ntodelete);
-				GistMarkTuplesDeleted(page);
+		/* We have dead tuples on the page */
+		if (ntodelete)
+		{
+			START_CRIT_SECTION();
 
-				if (RelationNeedsWAL(rel))
-				{
-					XLogRecPtr	recptr;
+			MarkBufferDirty(buffer);
 
-					recptr = gistXLogUpdate(buffer,
-											todelete, ntodelete,
-											NULL, 0, InvalidBuffer);
-					PageSetLSN(page, recptr);
-				}
-				else
-					PageSetLSN(page, gistGetFakeLSN(rel));
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
 
-				END_CRIT_SECTION();
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
 			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
-		}
-		else
-		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
+			END_CRIT_SECTION();
 
+			stats->tuples_removed += ntodelete;
+			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
-
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				ptr->parentlsn = BufferGetLSNAtomic(buffer);
-				ptr->next = stack->next;
-				stack->next = ptr;
-
-				if (GistTupleIsInvalid(idxtuple))
-					ereport(LOG,
-							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-									RelationGetRelationName(rel)),
-							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-							 errhint("Please REINDEX it.")));
-			}
 		}
 
-		UnlockReleaseBuffer(buffer);
-
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
+		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
-		vacuum_delay_point();
 	}
 
-	return stats;
+	UnlockReleaseBuffer(buffer);
+
+	/*
+	 * This is really tail recursion, but if the compiler is too stupid to
+	 * optimize it as such, we'd eat an uncomfortably large amount of stack
+	 * space per recursion level (due to the deletable[] array). A failure is
+	 * improbable since the number of levels isn't likely to be large ... but
+	 * just in case, let's hand-optimize into a loop.
+	 */
+	if (recurse_to != InvalidBlockNumber)
+	{
+		blkno = recurse_to;
+		goto restart;
+	}
 }
diff --git a/vacuum.sql b/vacuum.sql
new file mode 100644
index 0000000000..f30150bf01
--- /dev/null
+++ b/vacuum.sql
@@ -0,0 +1,2 @@
+delete from x where (c~>1)>0.1;
+vacuum x;
-- 
2.15.2 (Apple Git-101.1)

#29Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Andrey Borodin (#28)
2 attachment(s)
Re: GiST VACUUM

Hi!

On 21 July 2018, at 17:11, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

<0001-Physical-GiST-scan-in-VACUUM-v13.patch>

Just in case, here's the second part of the patch series, with actual page deletion.

I was considering further decreasing the memory footprint by using Bloom filters instead of bitmaps, but that would create considerably more CPU work to compute hashes.
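
For scale, a back-of-envelope sketch (standalone C; the 100 GB index size is
just an illustrative assumption): with one bit per block, the maps stay small
even for large indexes.

#include <stdio.h>

int main(void)
{
	long long index_bytes = 100LL * 1024 * 1024 * 1024;	/* 100 GB index */
	long long blocks = index_bytes / 8192;			/* 8 kB pages */
	long long bitmap_bytes = (blocks + 7) / 8;		/* one bit per block */

	printf("%lld blocks -> %lld bytes per bitmap (~%.1f MB)\n",
		   blocks, bitmap_bytes, bitmap_bytes / (1024.0 * 1024.0));
	return 0;
}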

Best regards, Andrey Borodin.

Attachments:

0002-Delete-pages-during-GiST-VACUUM-v13.patchapplication/octet-stream; name=0002-Delete-pages-during-GiST-VACUUM-v13.patch; x-unix-mode=0644Download
From 927528954fed4870c9f7fcf7930c0c48bf8d2f8a Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Mon, 23 Jul 2018 22:57:52 +0500
Subject: [PATCH 2/2] Delete pages during GiST VACUUM v13

---
 src/backend/access/gist/README         |  35 ++++++
 src/backend/access/gist/gist.c         |  18 +++
 src/backend/access/gist/gistbuild.c    |   5 -
 src/backend/access/gist/gistutil.c     |   3 +-
 src/backend/access/gist/gistvacuum.c   | 209 +++++++++++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c     |  60 ++++++++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/include/access/gist.h              |   3 +
 src/include/access/gist_private.h      |  24 ++--
 src/include/access/gistxlog.h          |  17 ++-
 src/test/regress/expected/gist.out     |   4 +-
 src/test/regress/sql/gist.sql          |   4 +-
 12 files changed, 354 insertions(+), 31 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..9548872be8 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,41 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be used to allocate newly split pages. To find these pages,
+the function can choose between two strategies: a logical or a physical scan.
+
+The physical scan reads the entire index from the first page to the last. It
+maintains the graph structure in a palloc'ed array to collect block numbers of
+internal pages that need cleaning of references to empty leaves. The array
+also contains, for each internal page, the offsets of downlinks that point to
+potentially free leaf pages. This scan method is chosen when maintenance work
+memory is sufficient to hold the necessary graph structure.
+
+The logical scan is chosen when there is not enough maintenance memory to
+execute the physical scan. It traverses the GiST index depth-first, following
+incomplete split branches. The logical scan can be slower on hard disk
+drives.
+
+The result of both scans is the same: a stack of block numbers of internal
+pages, each with a list of offsets potentially referencing empty leaf pages.
+After the scan, each internal page is taken under exclusive lock and every
+potentially free leaf page it references is examined. gistbulkdelete() never
+deletes the last reference on an internal page, to keep the tree balanced.
+
+The physical scan can return empty leaf page offsets out of order. Thus,
+before executing PageIndexMultiDelete, the (already locked and checked)
+offsets are sorted. This step is not necessary for the logical scan.
+
+Both scans hold only one lock at a time. The physical scan grabs an exclusive
+lock immediately, while the logical scan takes a shared lock and then swaps it
+for an exclusive one. This is because the physical scan does less work per
+internal page, and the number of internal pages is low relative to the number
+of leaf pages.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b9ba6e1241..31f5abc0ca 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum,
+			 * so we do not need to check whether the page is deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -834,6 +839,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until
+			 * their space is reused. The downlink to such a page may already
+			 * be removed from the internal page, but this scan can still reach it.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 59fd554e80..d8f8a4a29c 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -806,7 +807,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 9006bf535d..d0fbf3e56c 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
@@ -30,10 +31,55 @@ typedef struct
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
+
+	char	   *internalPagesMap;
+	char	   *emptyLeafPagesMap;
+	BlockNumber mapNumPages;
 } GistVacState;
 
+static void setinternal(GistVacState *state, BlockNumber blkno)
+{
+	if (blkno >= state->mapNumPages)
+		return;
+	state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);
+}
+
+static void setempty(GistVacState *state, BlockNumber blkno)
+{
+	if (blkno >= state->mapNumPages)
+		return;
+	state->emptyLeafPagesMap[blkno / 8] |= 1 << (blkno % 8);
+}
+
+static bool isinternal(GistVacState *state, BlockNumber blkno)
+{
+	if (blkno >= state->mapNumPages)
+		return false;
+	return (state->internalPagesMap[blkno / 8] & 1 << (blkno % 8)) != 0;
+}
+
+static bool isemptyleaf(GistVacState *state, BlockNumber blkno)
+{
+	if (blkno >= state->mapNumPages)
+		return false;
+	return (state->emptyLeafPagesMap[blkno / 8] & 1 << (blkno % 8)) != 0;
+}
+
+/*
+ * This function is used to sort offsets
+ * When employing physical scan rescan offsets are not ordered.
+ */
+static int
+compare_offsetnumber(const void *x, const void *y)
+{
+	OffsetNumber a = *((OffsetNumber *)x);
+	OffsetNumber b = *((OffsetNumber *)y);
+	return a - b;
+}
+
 static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state);
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   bool deletePages);
 static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
 			   BlockNumber orig_blkno);
 
@@ -45,7 +91,7 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
 
-	gistvacuumscan(info, stats, callback, callback_state);
+	gistvacuumscan(info, stats, callback, callback_state, true);
 
 	return stats;
 }
@@ -68,7 +114,7 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	if (stats == NULL)
 	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-		gistvacuumscan(info, stats, NULL, NULL);
+		gistvacuumscan(info, stats, NULL, NULL, false);
 	}
 
 	/*
@@ -91,12 +137,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * check invalid tuples left after upgrade.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 static void
 gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   bool deletePages)
 {
 	Relation	rel = info->index;
 	GistVacState vstate;
@@ -155,9 +200,27 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	needLock = !RELATION_IS_LOCAL(rel);
 
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	num_pages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	if (deletePages)
+	{
+		vstate.mapNumPages = num_pages;
+		vstate.internalPagesMap = palloc0(num_pages / 8 + 1);
+		vstate.emptyLeafPagesMap = palloc0(num_pages / 8 + 1);
+	}
+
 	blkno = GIST_ROOT_BLKNO;
 	for (;;)
 	{
+		/* Iterate over pages, then loop back to recheck length */
+		for (; blkno < num_pages; blkno++)
+		{
+			gistvacuumpage(&vstate, blkno, blkno);
+		}
 		/* Get the current relation length */
 		if (needLock)
 			LockRelationForExtension(rel, ExclusiveLock);
@@ -168,11 +231,6 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (blkno >= num_pages)
 			break;
-		/* Iterate over pages, then loop back to recheck length */
-		for (; blkno < num_pages; blkno++)
-		{
-			gistvacuumpage(&vstate, blkno, blkno);
-		}
 	}
 
 	/*
@@ -193,6 +251,129 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* update statistics */
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
+
+	if (deletePages)
+	{
+		for (blkno = GIST_ROOT_BLKNO; blkno < vstate.mapNumPages; blkno++)
+		{
+			Buffer		 buffer;
+			Page		 page;
+			OffsetNumber i,
+						 maxoff;
+			IndexTuple   idxtuple;
+			ItemId	     iid;
+			OffsetNumber todelete[MaxOffsetNumber];
+			Buffer		 buftodelete[MaxOffsetNumber];
+			int			 ntodelete = 0;
+
+			if (!isinternal(&vstate, blkno))
+				continue; /* second scan is for internal pages */
+
+
+			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+										info->strategy);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			page = (Page) BufferGetPage(buffer);
+			/* Currently block of internal page cannot become leaf */
+			Assert(!GistPageIsLeaf(page));
+
+			if (PageIsNew(page) || GistPageIsDeleted(page))
+			{
+				UnlockReleaseBuffer(buffer);
+				/* TODO: shouldn't we record the free page here? */
+				continue;
+			}
+
+			maxoff = PageGetMaxOffsetNumber(page);
+			/* Check that leafs are still empty and decide what to delete */
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				Buffer		leafBuffer;
+				Page		leafPage;
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+				/* skip this downlink if its child was not empty in the scan */
+				if (!isemptyleaf(&vstate, ItemPointerGetBlockNumber(&(idxtuple->t_tid))))
+				{
+					continue;
+				}
+
+				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+									RBM_NORMAL, info->strategy);
+				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+				gistcheckpage(rel, leafBuffer);
+				leafPage = (Page) BufferGetPage(leafBuffer);
+				Assert(GistPageIsLeaf(leafPage));
+
+				if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* leaf is empty */
+					&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* no incomplete split */
+					&& ntodelete < maxoff-1) /* keep at least one downlink on each internal page */
+				{
+					buftodelete[ntodelete] = leafBuffer;
+					todelete[ntodelete++] = i;
+				}
+				else
+					UnlockReleaseBuffer(leafBuffer);
+			}
+
+
+			if (ntodelete)
+			{
+				TransactionId txid;
+
+				/* Sort the possibly unordered offsets */
+				qsort(todelete, ntodelete, sizeof(OffsetNumber), compare_offsetnumber);
+				/*
+				 * Like in _bt_unlink_halfdead_page we need an upper bound on the xid
+				 * that could hold downlinks to this page. Use ReadNewTransactionId()
+				 * instead of GetCurrentTransactionId() since we are in a VACUUM.
+				 */
+				txid = ReadNewTransactionId();
+
+				START_CRIT_SECTION();
+
+				/* Mark pages as deleted dropping references from internal pages */
+				for (i = 0; i < ntodelete; i++)
+				{
+					Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+
+					GistPageSetDeleteXid(leafPage,txid);
+
+					GistPageSetDeleted(leafPage);
+					MarkBufferDirty(buftodelete[i]);
+					stats->pages_deleted++;
+
+					MarkBufferDirty(buffer);
+					/* Offsets are changed as long as we delete tuples from internal page */
+					PageIndexTupleDelete(page, todelete[i] - i);
+
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr recptr 	=
+							gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+												txid, buffer, todelete[i] - i);
+						PageSetLSN(page, recptr);
+						PageSetLSN(leafPage, recptr);
+					}
+					else
+					{
+						PageSetLSN(page, gistGetFakeLSN(rel));
+						PageSetLSN(leafPage, gistGetFakeLSN(rel));
+					}
+
+					UnlockReleaseBuffer(buftodelete[i]);
+				}
+				END_CRIT_SECTION();
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+
+		pfree(vstate.emptyLeafPagesMap);
+		pfree(vstate.internalPagesMap);
+	}
 }
 
 /*
@@ -310,6 +491,12 @@ restart:
 
 		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
+		if (maxoff - FirstOffsetNumber + 1 == 0)
+			setempty(vstate, blkno);
+	}
+	else
+	{
+		setinternal(vstate, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..80108f6bfb 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,6 +145,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
+
 			if (GistPageIsLeaf(page))
 				GistMarkTuplesDeleted(page);
 		}
@@ -324,6 +358,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +479,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * downlink from internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	/* this record carries no tuple data */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e5e925e0c5..f494db63f6 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -65,6 +65,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 827566dc6e..0dd2bf47c8 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,9 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..1f82695b1d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..ad0b742dbb 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,19 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+	TransactionId deleteXid;	/* last xid that could see this page in a scan */
+	OffsetNumber downlinkOffset;	/* offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+	BlockNumber blkno;
+	int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.15.2 (Apple Git-101.1)

0001-Physical-GiST-scan-in-VACUUM-v13.patch (application/octet-stream)
From 6d8dd2b62d84c67e74bf63fe7fd6c80d4fe52e08 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sat, 21 Jul 2018 15:21:13 +0500
Subject: [PATCH 1/2] Physical GiST scan in VACUUM v13

---
 insert.sql                           |   1 +
 rescantest.sh                        |  23 ++
 src/backend/access/gist/gist.c       |   2 +-
 src/backend/access/gist/gistutil.c   |   2 +
 src/backend/access/gist/gistvacuum.c | 427 ++++++++++++++++++++---------------
 vacuum.sql                           |   2 +
 6 files changed, 270 insertions(+), 187 deletions(-)
 create mode 100644 insert.sql
 create mode 100755 rescantest.sh
 create mode 100644 vacuum.sql

diff --git a/insert.sql b/insert.sql
new file mode 100644
index 0000000000..3028aa336c
--- /dev/null
+++ b/insert.sql
@@ -0,0 +1 @@
+insert into x select cube(random()) c from generate_series(1,10000) y;
diff --git a/rescantest.sh b/rescantest.sh
new file mode 100755
index 0000000000..2478599c70
--- /dev/null
+++ b/rescantest.sh
@@ -0,0 +1,23 @@
+#!/usr/bin/env bash
+
+set -e
+pkill -9 postgres || true
+make -j 16 && make install
+
+DB=~/DemoDb
+BINDIR=~/project/bin
+
+rm -rf $DB
+cp *.sql $BINDIR
+cd $BINDIR
+./initdb $DB
+./pg_ctl -D $DB start
+./psql postgres -c "create extension cube;"
+
+
+./psql postgres -c "create table x as select cube(random()) c from generate_series(1,10000) y; create index on x using gist(c);"
+./psql postgres -c "delete from x where (c~>1)>0.1;"
+./pgbench -f insert.sql postgres -T 30 &
+./pgbench -f vacuum.sql postgres -T 30
+
+./pg_ctl -D $DB stop
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..b9ba6e1241 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -583,7 +583,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 * the child pages first, we wouldn't know the recptr of the WAL record
 	 * we're about to write.
 	 */
-	if (BufferIsValid(leftchildbuf))
+	if (BufferIsValid(leftchildbuf) && ((random()%20) != 0)) // REMOVE THIS randoms
 	{
 		Page		leftpg = BufferGetPage(leftchildbuf);
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 12804c321c..59fd554e80 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -828,6 +828,8 @@ gistNewBuffer(Relation r)
 	if (needLock)
 		UnlockRelationForExtension(r, ExclusiveLock);
 
+	ReleaseBuffer(ReadBuffer(r, P_NEW));// REMOVE THIS LINE
+	
 	return buffer;
 }
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..9006bf535d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,34 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	gistvacuumscan(info, stats, callback, callback_state);
+
+	return stats;
+}
 
 /*
  * VACUUM cleanup: update FSM
@@ -28,104 +56,36 @@
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	Relation	rel = info->index;
-	BlockNumber npages,
-				blkno;
-	BlockNumber totFreePages;
-	double		tuplesCount;
-	bool		needLock;
-
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
 
-	/* Set up all-zero stats if gistbulkdelete wasn't called */
+	/*
+	 * If gistbulkdelete was called, we need not do anything, just return the
+	 * stats from the latest gistbulkdelete call.  If it wasn't called, we still
+	 * need to do a pass over the index, to obtain index statistics.
+	 */
 	if (stats == NULL)
+	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+		gistvacuumscan(info, stats, NULL, NULL);
+	}
 
 	/*
-	 * Need lock unless it's local to this backend.
+	 * It's quite possible for us to be fooled by concurrent page splits into
+	 * double-counting some index tuples, so disbelieve any total that exceeds
+	 * the underlying heap's count ... if we know that accurately.  Otherwise
+	 * this might just make matters worse.
 	 */
-	needLock = !RELATION_IS_LOCAL(rel);
-
-	/* try to find deleted pages */
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	npages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-
-	totFreePages = 0;
-	tuplesCount = 0;
-	for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
+	if (!info->estimated_count)
 	{
-		Buffer		buffer;
-		Page		page;
-
-		vacuum_delay_point();
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		page = (Page) BufferGetPage(buffer);
-
-		if (PageIsNew(page) || GistPageIsDeleted(page))
-		{
-			totFreePages++;
-			RecordFreeIndexPage(rel, blkno);
-		}
-		else if (GistPageIsLeaf(page))
-		{
-			/* count tuples in index (considering only leaf tuples) */
-			tuplesCount += PageGetMaxOffsetNumber(page);
-		}
-		UnlockReleaseBuffer(buffer);
+		if (stats->num_index_tuples > info->num_heap_tuples)
+			stats->num_index_tuples = info->num_heap_tuples;
 	}
 
-	/* Finally, vacuum the FSM */
-	IndexFreeSpaceMapVacuum(info->index);
-
-	/* return statistics */
-	stats->pages_free = totFreePages;
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	stats->num_pages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-	stats->num_index_tuples = tuplesCount;
-	stats->estimated_count = false;
-
 	return stats;
 }
 
-typedef struct GistBDItem
-{
-	GistNSN		parentlsn;
-	BlockNumber blkno;
-	struct GistBDItem *next;
-} GistBDItem;
-
-static void
-pushStackIfSplited(Page page, GistBDItem *stack)
-{
-	GISTPageOpaque opaque = GistPageGetOpaque(page);
-
-	if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
-		(GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page)) &&
-		opaque->rightlink != InvalidBlockNumber /* sanity check */ )
-	{
-		/* split page detected, install right link to the stack */
-
-		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-
-		ptr->blkno = opaque->rightlink;
-		ptr->parentlsn = stack->parentlsn;
-		ptr->next = stack->next;
-		stack->next = ptr;
-	}
-}
-
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -134,141 +94,236 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void
+gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
+	GistVacState vstate;
+	BlockNumber num_pages;
+	bool		needLock;
+	BlockNumber	blkno;
+	GistNSN startNSN = GetInsertRecPtr();
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
+	/*
+	 * Reset counts that will be incremented during the scan; needed in case
+	 * of multiple scans during a single VACUUM command
+	 */
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
+	stats->pages_deleted = 0;
 
-	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
-	stack->blkno = GIST_ROOT_BLKNO;
+	/* Set up info to pass down to gistvacuumpage */
+	vstate.info = info;
+	vstate.stats = stats;
+	vstate.callback = callback;
+	vstate.callback_state = callback_state;
+	vstate.startNSN = startNSN;
+	vstate.totFreePages = 0;
 
-	while (stack)
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/*
+	 * FIXME: copied from btvacuumscan. Check that all this also holds for
+	 * GiST! 
+	 * AB: Yes, gistNewBuffer() takes LockRelationForExtension()
+	 *
+	 * The outer loop iterates over all index pages, in
+	 * physical order (we hope the kernel will cooperate in providing
+	 * read-ahead for speed).  It is critical that we visit all leaf pages,
+	 * including ones added after we start the scan, else we might fail to
+	 * delete some deletable tuples.  Hence, we must repeatedly check the
+	 * relation length.  We must acquire the relation-extension lock while
+	 * doing so to avoid a race condition: if someone else is extending the
+	 * relation, there is a window where bufmgr/smgr have created a new
+	 * all-zero page but it hasn't yet been write-locked by gistNewBuffer(). If
+	 * we manage to scan such a page here, we'll improperly assume it can be
+	 * recycled.  Taking the lock synchronizes things enough to prevent a
+	 * problem: either num_pages won't include the new page, or gistNewBuffer
+	 * already has write lock on the buffer and it will be fully initialized
+	 * before we can examine it.  (See also vacuumlazy.c, which has the same
+	 * issue.)	Also, we need not worry if a page is added immediately after
+	 * we look; the page splitting code already has write-lock on the left
+	 * page before it adds a right page, so we must already have processed any
+	 * tuples due to be moved into such a page.
+	 *
+	 * We can skip locking for new or temp relations, however, since no one
+	 * else could be accessing them.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	blkno = GIST_ROOT_BLKNO;
+	for (;;)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
-									RBM_NORMAL, info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		gistcheckpage(rel, buffer);
-		page = (Page) BufferGetPage(buffer);
-
-		if (GistPageIsLeaf(page))
+		/* Get the current relation length */
+		if (needLock)
+			LockRelationForExtension(rel, ExclusiveLock);
+		num_pages = RelationGetNumberOfBlocks(rel);
+		if (needLock)
+			UnlockRelationForExtension(rel, ExclusiveLock);
+
+		/* Quit if we've scanned the whole relation */
+		if (blkno >= num_pages)
+			break;
+		/* Iterate over pages, then loop back to recheck length */
+		for (; blkno < num_pages; blkno++)
 		{
-			OffsetNumber todelete[MaxOffsetNumber];
-			int			ntodelete = 0;
+			gistvacuumpage(&vstate, blkno, blkno);
+		}
+	}
 
-			LockBuffer(buffer, GIST_UNLOCK);
-			LockBuffer(buffer, GIST_EXCLUSIVE);
+	/*
+	 * If we found any recyclable pages (and recorded them in the FSM), then
+	 * forcibly update the upper-level FSM pages to ensure that searchers can
+	 * find them.  It's possible that the pages were also found during
+	 * previous scans and so this is a waste of time, but it's cheap enough
+	 * relative to scanning the index that it shouldn't matter much, and
+	 * making sure that free pages are available sooner not later seems
+	 * worthwhile.
+	 *
+	 * Note that if no recyclable pages exist, we don't bother vacuuming the
+	 * FSM at all.
+	 */
+	if (vstate.totFreePages > 0)
+		IndexFreeSpaceMapVacuum(rel);
 
-			page = (Page) BufferGetPage(buffer);
-			if (stack->blkno == GIST_ROOT_BLKNO && !GistPageIsLeaf(page))
-			{
-				/* only the root can become non-leaf during relock */
-				UnlockReleaseBuffer(buffer);
-				/* one more check */
-				continue;
-			}
+	/* update statistics */
+	stats->num_pages = num_pages;
+	stats->pages_free = vstate.totFreePages;
+}
 
-			/*
-			 * check for split proceeded after look at parent, we should check
-			 * it after relock
-			 */
-			pushStackIfSplited(page, stack);
+/*
+ * gistvacuumpage --- VACUUM one page
+ *
+ * This processes a single page for gistbulkdelete().  In some cases we
+ * must go back and re-examine previously-scanned pages; this routine
+ * recurses when necessary to handle that case.
+ *
+ * blkno is the page to process.  orig_blkno is the highest block number
+ * reached by the outer gistvacuumscan loop (the same as blkno, unless we
+ * are recursing to re-examine a previous page).
+ */
+static void
+gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+{
+	IndexVacuumInfo *info = vstate->info;
+	IndexBulkDeleteResult *stats = vstate->stats;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+	Relation	rel = info->index;
+	Buffer		buffer;
+	Page		page;
+	BlockNumber recurse_to;
 
-			/*
-			 * Remove deletable tuples from page
-			 */
+restart:
+	recurse_to = InvalidBlockNumber;
 
-			maxoff = PageGetMaxOffsetNumber(page);
+	/* call vacuum_delay_point while not holding any buffer lock */
+	vacuum_delay_point();
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
+	/*
+	 * We are not going to stay here for long, so aggressively grab an
+	 * exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
 
-				if (callback(&(idxtuple->t_tid), callback_state))
-					todelete[ntodelete++] = i;
-				else
-					stats->num_index_tuples += 1;
-			}
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		vstate->totFreePages++;
+		RecordFreeIndexPage(rel, blkno);
+		return;
+	}
 
-			stats->tuples_removed += ntodelete;
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If this page was split after the start of the VACUUM, we have to
+		 * revisit the rightlink if it points to a block we already scanned.
+		 */
+		if ((GistFollowRight(page) || vstate->startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < orig_blkno))
+		{
+			if (GistFollowRight(page)) // REMOVE THIS LINE
+				elog(WARNING,"RESCAN TRIGGERED BY FollowRight"); // REMOVE THIS LINE
+			if (vstate->startNSN < GistPageGetNSN(page)) // REMOVE THIS LINE
+				elog(WARNING,"RESCAN TRIGGERED BY NSN"); // REMOVE THIS LINE
+			recurse_to = opaque->rightlink;
+		}
+
+		/*
+		 * Remove deletable tuples from page
+		 */
+		if (callback)
+		{
+			OffsetNumber off;
 
-			if (ntodelete)
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
 			{
-				START_CRIT_SECTION();
+				ItemId		iid = PageGetItemId(page, off);
+				IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-				MarkBufferDirty(buffer);
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = off;
+			}
+		}
 
-				PageIndexMultiDelete(page, todelete, ntodelete);
-				GistMarkTuplesDeleted(page);
+		/* We have dead tuples on the page */
+		if (ntodelete)
+		{
+			START_CRIT_SECTION();
 
-				if (RelationNeedsWAL(rel))
-				{
-					XLogRecPtr	recptr;
+			MarkBufferDirty(buffer);
 
-					recptr = gistXLogUpdate(buffer,
-											todelete, ntodelete,
-											NULL, 0, InvalidBuffer);
-					PageSetLSN(page, recptr);
-				}
-				else
-					PageSetLSN(page, gistGetFakeLSN(rel));
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
 
-				END_CRIT_SECTION();
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
 			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
-		}
-		else
-		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
+			END_CRIT_SECTION();
 
+			stats->tuples_removed += ntodelete;
+			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
-
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				ptr->parentlsn = BufferGetLSNAtomic(buffer);
-				ptr->next = stack->next;
-				stack->next = ptr;
-
-				if (GistTupleIsInvalid(idxtuple))
-					ereport(LOG,
-							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-									RelationGetRelationName(rel)),
-							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-							 errhint("Please REINDEX it.")));
-			}
 		}
 
-		UnlockReleaseBuffer(buffer);
-
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
+		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
-		vacuum_delay_point();
 	}
 
-	return stats;
+	UnlockReleaseBuffer(buffer);
+
+	/*
+	 * This is really tail recursion, but if the compiler is too stupid to
+	 * optimize it as such, we'd eat an uncomfortably large amount of stack
+	 * space per recursion level (due to the todelete[] array). A failure is
+	 * improbable since the number of levels isn't likely to be large ... but
+	 * just in case, let's hand-optimize into a loop.
+	 */
+	if (recurse_to != InvalidBlockNumber)
+	{
+		blkno = recurse_to;
+		goto restart;
+	}
 }
diff --git a/vacuum.sql b/vacuum.sql
new file mode 100644
index 0000000000..f30150bf01
--- /dev/null
+++ b/vacuum.sql
@@ -0,0 +1,2 @@
+delete from x where (c~>1)>0.1;
+vacuum x;
-- 
2.15.2 (Apple Git-101.1)

#30Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Andrey Borodin (#29)
Re: GiST VACUUM

On Tue, Jul 24, 2018 at 6:04 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 21 July 2018, at 17:11, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
<0001-Physical-GiST-scan-in-VACUUM-v13.patch>

Just in case, here's second part of patch series with actual page deletion.

Hi Andrey,

Cfbot says:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.7146

That's because you declared a new variable after some other
statements. You can't do that in old school C89.
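
For illustration, a minimal sketch (the names here are made up, not from the
patch): C89 requires all declarations at the start of a block.

    void f(void)
    {
        int a = 1;      /* OK: declaration before any statement */
        a++;            /* first statement */
        int b = 2;      /* fine in C99, but a C89 compiler rejects this */
    }

Moving the declaration of b above the first statement fixes the build.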

https://travis-ci.org/postgresql-cfbot/postgresql/builds/409401951

That segfaulted here:

#0 0x00000000004ab620 in setinternal (state=<optimized out>,
state=<optimized out>, blkno=368) at gistvacuum.c:44
44 state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);

internalPagesMap was NULL, or pointed to memory that was too small and
happened to be near an unmapped region (unlikely?), or was some other
corrupted address?
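
In other words, the failing write boils down to one of these two patterns
(a hypothetical sketch with made-up names, just to show the failure modes):

    char *map = NULL;                      /* map never allocated */
    map[blkno / 8] |= 1 << (blkno % 8);    /* dereferences NULL: SIGSEGV */

    char *map2 = palloc0(n / 8 + 1);       /* sized for n pages only */
    map2[blkno / 8] |= 1 << (blkno % 8);   /* blkno >= n: out-of-bounds write */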

--
Thomas Munro
http://www.enterprisedb.com

#31Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Thomas Munro (#30)
2 attachment(s)
Re: GiST VACUUM

Hi!

Thank you!

On 29 July 2018, at 14:04, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

On Tue, Jul 24, 2018 at 6:04 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 21 July 2018, at 17:11, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
<0001-Physical-GiST-scan-in-VACUUM-v13.patch>

Just in case, here's second part of patch series with actual page deletion.

Hi Andrey,

Cfbot says:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.7146

That's because you declared a new variable after some other
statements. You can't do that in old school C89.

Yep, I mismerged the patch steps and created a misplaced declaration.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/409401951

That segfaulted here:

#0 0x00000000004ab620 in setinternal (state=<optimized out>,
state=<optimized out>, blkno=368) at gistvacuum.c:44
44 state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);

internalPagesMap was NULL, or pointed to memory that was too small and
happened to be near an unmapped region (unlikely?), or was some other
corrupted address?

Yes, there was a conditionally uninitialized variable, mapNumPages. I do not know why it didn't trigger a failure last time, but now it reproduces reliably on my machine.
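
In v14 mapNumPages is initialized unconditionally in gistvacuumscan() before
the bitmaps can be touched, and every bitmap helper checks it first, so a
block number outside the mapped range is simply ignored. Roughly, as it
appears in the attached patch:

    vstate.mapNumPages = 0;    /* always set, even when maps are not allocated */

    static void setinternal(GistVacState *state, BlockNumber blkno)
    {
        if (state->mapNumPages <= blkno)
            return;            /* outside the map: nothing to record */
        state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);
    }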

Fixed both problems. PFA v14.

Best regards, Andrey Borodin.

Attachments:

0002-Delete-pages-during-GiST-VACUUM-v14.patch (application/octet-stream)
From c03764de41a7afd6dc3f7a095f8bc2a008732136 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 29 Jul 2018 16:43:02 +0500
Subject: [PATCH 2/2] Delete pages during GiST VACUUM v14

---
 src/backend/access/gist/README         |  35 ++++++
 src/backend/access/gist/gist.c         |  20 ++-
 src/backend/access/gist/gistbuild.c    |   5 -
 src/backend/access/gist/gistutil.c     |   5 +-
 src/backend/access/gist/gistvacuum.c   | 216 ++++++++++++++++++++++++++++++---
 src/backend/access/gist/gistxlog.c     |  60 +++++++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/include/access/gist.h              |   3 +
 src/include/access/gist_private.h      |  24 ++--
 src/include/access/gistxlog.h          |  17 ++-
 src/test/regress/expected/gist.out     |   4 +-
 src/test/regress/sql/gist.sql          |   4 +-
 12 files changed, 357 insertions(+), 39 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..9548872be8 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,41 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused for newly split pages. To find these pages, the
+function can choose between two strategies: a logical scan or a physical scan.
+
+The physical scan reads the entire index from the first page to the last. The
+scan maintains the graph structure in a palloc'ed array to collect the block
+numbers of internal pages that need to be cleansed of references to empty
+leaf pages. The array also holds, for each such internal page, the offsets of
+the downlinks to potentially free leaf pages. This scan method is chosen when
+maintenance work memory is sufficient to hold the necessary graph structure.
+
+The logical scan is chosen when there is not enough maintenance memory to
+execute the physical scan. The logical scan traverses the GiST index in DFS
+order, also looking into incomplete split branches. The logical scan can be
+slower on hard disk drives.
+
+The result of both scans is the same: a stack of block numbers of internal
+pages, each with a list of offsets potentially referencing empty leaf pages.
+After the scan, each of these internal pages is taken under exclusive lock and
+each potentially free leaf page is examined. gistbulkdelete() never deletes
+the last reference from an internal page, to preserve the balanced tree shape.
+
+The physical scan can return the offsets of empty leaf pages out of order.
+Thus, before executing PageIndexMultiDelete, the offsets (already locked and
+checked) are sorted. This step is not necessary for the logical scan.
+
+Both scans hold only one lock at a time. The physical scan grabs the exclusive
+lock immediately, while the logical scan takes a shared lock and then swaps it
+for an exclusive one. This is done because the physical scan does less work on
+each internal page, and the number of internal pages is small compared to the
+number of leaf pages.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b9ba6e1241..3a6b5c7ed3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -583,7 +583,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 * the child pages first, we wouldn't know the recptr of the WAL record
 	 * we're about to write.
 	 */
-	if (BufferIsValid(leftchildbuf) && ((random()%20) != 0)) // REMOVE THIS randoms
+	if (BufferIsValid(leftchildbuf))
 	{
 		Page		leftpg = BufferGetPage(leftchildbuf);
 
@@ -700,6 +700,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum, so we
+			 * do not need to check whether this page is deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -834,6 +839,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until their
+			 * space is reused. The downlink to such a page may already be
+			 * removed from the internal page, but this scan may still reach it.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 2c08e0dab0..db67b3c5ad 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/float.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -806,7 +807,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
@@ -828,8 +829,6 @@ gistNewBuffer(Relation r)
 	if (needLock)
 		UnlockRelationForExtension(r, ExclusiveLock);
 
-	ReleaseBuffer(ReadBuffer(r, P_NEW));// REMOVE THIS LINE
-	
 	return buffer;
 }
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 9006bf535d..489a3f3604 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
@@ -30,10 +31,55 @@ typedef struct
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
+
+	char	   *internalPagesMap;
+	char	   *emptyLeafPagesMap;
+	BlockNumber mapNumPages;
 } GistVacState;
 
+static void setinternal(GistVacState *state, BlockNumber blkno)
+{
+	if (state->mapNumPages <= blkno)
+		return;
+	state->internalPagesMap[blkno / 8] |= 1 << (blkno % 8);
+}
+
+static void setempty(GistVacState *state, BlockNumber blkno)
+{
+	if (state->mapNumPages <= blkno)
+		return;
+	state->emptyLeafPagesMap[blkno / 8] |= 1 << (blkno % 8);
+}
+
+static bool isinternal(GistVacState *state, BlockNumber blkno)
+{
+	if (state->mapNumPages <= blkno)
+		return false;
+	return (state->internalPagesMap[blkno / 8] & 1 << (blkno % 8)) != 0;
+}
+
+static bool isemptyleaf(GistVacState *state, BlockNumber blkno)
+{
+	if (state->mapNumPages <= blkno)
+		return false;
+	return (state->emptyLeafPagesMap[blkno / 8] & 1 << (blkno % 8)) != 0;
+}
+
+/*
+ * qsort comparator for offsets. With the physical scan, the collected
+ * offsets are not ordered, so they must be sorted before use.
+ */
+static int
+compare_offsetnumber(const void *x, const void *y)
+{
+	OffsetNumber a = *((OffsetNumber *)x);
+	OffsetNumber b = *((OffsetNumber *)y);
+	return a - b;
+}
+
 static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state);
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   bool deletePages);
 static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
 			   BlockNumber orig_blkno);
 
@@ -45,7 +91,7 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
 
-	gistvacuumscan(info, stats, callback, callback_state);
+	gistvacuumscan(info, stats, callback, callback_state, true);
 
 	return stats;
 }
@@ -68,7 +114,7 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	if (stats == NULL)
 	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-		gistvacuumscan(info, stats, NULL, NULL);
+		gistvacuumscan(info, stats, NULL, NULL, false);
 	}
 
 	/*
@@ -91,12 +137,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * check invalid tuples left after upgrade.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 static void
 gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   bool deletePages)
 {
 	Relation	rel = info->index;
 	GistVacState vstate;
@@ -120,6 +165,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	vstate.callback_state = callback_state;
 	vstate.startNSN = startNSN;
 	vstate.totFreePages = 0;
+	vstate.mapNumPages = 0;
 
 	/*
 	 * Need lock unless it's local to this backend.
@@ -128,7 +174,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 	/*
 	 * FIXME: copied from btvacuumscan. Check that all this also holds for
-	 * GiST! 
+	 * GiST!
 	 * AB: Yes, gistNewBuffer() takes LockRelationForExtension()
 	 *
 	 * The outer loop iterates over all index pages, in
@@ -155,9 +201,27 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	needLock = !RELATION_IS_LOCAL(rel);
 
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	num_pages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	if (deletePages)
+	{
+		vstate.mapNumPages = num_pages;
+		vstate.internalPagesMap = palloc0(num_pages / 8 + 1);
+		vstate.emptyLeafPagesMap = palloc0(num_pages / 8 + 1);
+	}
+
 	blkno = GIST_ROOT_BLKNO;
 	for (;;)
 	{
+		/* Iterate over pages, then loop back to recheck length */
+		for (; blkno < num_pages; blkno++)
+		{
+			gistvacuumpage(&vstate, blkno, blkno);
+		}
 		/* Get the current relation length */
 		if (needLock)
 			LockRelationForExtension(rel, ExclusiveLock);
@@ -168,11 +232,6 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (blkno >= num_pages)
 			break;
-		/* Iterate over pages, then loop back to recheck length */
-		for (; blkno < num_pages; blkno++)
-		{
-			gistvacuumpage(&vstate, blkno, blkno);
-		}
 	}
 
 	/*
@@ -193,6 +252,129 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* update statistics */
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
+
+	if (deletePages)
+	{
+		for (blkno = GIST_ROOT_BLKNO; blkno < vstate.mapNumPages; blkno++)
+		{
+			Buffer		 buffer;
+			Page		 page;
+			OffsetNumber i,
+						 maxoff;
+			IndexTuple   idxtuple;
+			ItemId	     iid;
+			OffsetNumber todelete[MaxOffsetNumber];
+			Buffer		 buftodelete[MaxOffsetNumber];
+			int			 ntodelete = 0;
+
+			if (!isinternal(&vstate, blkno))
+				continue; /* second scan is for internal pages */
+
+
+			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+										info->strategy);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			page = (Page) BufferGetPage(buffer);
+			/* Currently a block used for an internal page cannot become a leaf */
+			Assert(!GistPageIsLeaf(page));
+
+			if (PageIsNew(page) || GistPageIsDeleted(page))
+			{
+				UnlockReleaseBuffer(buffer);
+				continue;
+			}
+
+			maxoff = PageGetMaxOffsetNumber(page);
+			/* Check that the leaf pages are still empty and decide what to delete */
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				Buffer		leafBuffer;
+				Page		leafPage;
+
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+
+				/* skip children that were not empty leaves in the first scan */
+				if (!isemptyleaf(&vstate,
+								 ItemPointerGetBlockNumber(&(idxtuple->t_tid))))
+					continue;
+
+				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(&(idxtuple->t_tid)),
+									RBM_NORMAL, info->strategy);
+				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+				gistcheckpage(rel, leafBuffer);
+				leafPage = (Page) BufferGetPage(leafBuffer);
+				Assert(GistPageIsLeaf(leafPage));
+
+				if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* page is completely empty */
+					&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* no follow-right */
+					&& ntodelete < maxoff-1) /* keep at least one downlink on each internal page */
+				{
+					buftodelete[ntodelete] = leafBuffer;
+					todelete[ntodelete++] = i;
+				}
+				else
+					UnlockReleaseBuffer(leafBuffer);
+			}
+
+
+			if (ntodelete)
+			{
+				TransactionId txid;
+				/* Sort the possibly unordered offsets */
+				qsort(todelete, ntodelete, sizeof(OffsetNumber), compare_offsetnumber);
+
+				/*
+				 * As in _bt_unlink_halfdead_page, we need an upper bound on the
+				 * xid of any transaction that could still hold a downlink to
+				 * this page. We use ReadNewTransactionId() instead of
+				 * GetCurrentTransactionId() since we are in a VACUUM.
+				 */
+				txid = ReadNewTransactionId();
+
+				START_CRIT_SECTION();
+
+				/* Mark pages as deleted, dropping their downlinks from the internal page */
+				for (i = 0; i < ntodelete; i++)
+				{
+					Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+
+					GistPageSetDeleteXid(leafPage,txid);
+
+					GistPageSetDeleted(leafPage);
+					MarkBufferDirty(buftodelete[i]);
+					stats->pages_deleted++;
+
+					MarkBufferDirty(buffer);
+					/* Offsets shift as we delete tuples from the internal page */
+					PageIndexTupleDelete(page, todelete[i] - i);
+
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr	recptr =
+							gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+												txid, buffer, todelete[i] - i);
+						PageSetLSN(page, recptr);
+						PageSetLSN(leafPage, recptr);
+					}
+					else
+					{
+						PageSetLSN(page, gistGetFakeLSN(rel));
+						PageSetLSN(leafPage, gistGetFakeLSN(rel));
+					}
+
+					UnlockReleaseBuffer(buftodelete[i]);
+				}
+				END_CRIT_SECTION();
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+
+		pfree(vstate.emptyLeafPagesMap);
+		pfree(vstate.internalPagesMap);
+	}
 }
 
 /*
@@ -255,10 +437,6 @@ restart:
 		if ((GistFollowRight(page) || vstate->startNSN < GistPageGetNSN(page)) &&
 			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < orig_blkno))
 		{
-			if (GistFollowRight(page)) // REMOVE THIS LINE
-				elog(WARNING,"RESCAN TRIGGERED BY FollowRight"); // REMOVE THIS LINE
-			if (vstate->startNSN < GistPageGetNSN(page)) // REMOVE THIS LINE
-				elog(WARNING,"RESCAN TRIGGERED BY NSN"); // REMOVE THIS LINE
 			recurse_to = opaque->rightlink;
 		}
 
@@ -310,6 +488,12 @@ restart:
 
 		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
+		if (maxoff - FirstOffsetNumber + 1 == 0)
+			setempty(vstate, blkno);
+	}
+	else
+	{
+		setinternal(vstate, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..80108f6bfb 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,6 +145,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
+
 			if (GistPageIsLeaf(page))
 				GistMarkTuplesDeleted(page);
 		}
@@ -324,6 +358,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +479,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page deletion. This also includes the
+ * removal of the downlink from the internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset)
+{
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e5e925e0c5..f494db63f6 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -65,6 +65,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 827566dc6e..0dd2bf47c8 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,9 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..1f82695b1d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..ad0b742dbb 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,19 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+	TransactionId deleteXid;	/* last xid that could see this page in a scan */
+	OffsetNumber downlinkOffset;	/* offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+	BlockNumber blkno;
+	int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.15.2 (Apple Git-101.1)

0001-Physical-GiST-scan-in-VACUUM-v14.patch (application/octet-stream)
From bbc5c64124c69cce4d226bd06f9d9e475851f0fa Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 29 Jul 2018 16:09:07 +0500
Subject: [PATCH 1/2] Physical GiST scan in VACUUM v14

---
 src/backend/access/gist/gist.c       |   2 +-
 src/backend/access/gist/gistutil.c   |   2 +
 src/backend/access/gist/gistvacuum.c | 427 ++++++++++++++++++++---------------
 3 files changed, 244 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..b9ba6e1241 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -583,7 +583,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 * the child pages first, we wouldn't know the recptr of the WAL record
 	 * we're about to write.
 	 */
-	if (BufferIsValid(leftchildbuf))
+	if (BufferIsValid(leftchildbuf) && ((random()%20) != 0)) // REMOVE THIS randoms
 	{
 		Page		leftpg = BufferGetPage(leftchildbuf);
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index dddfe0ae2c..2c08e0dab0 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -828,6 +828,8 @@ gistNewBuffer(Relation r)
 	if (needLock)
 		UnlockRelationForExtension(r, ExclusiveLock);
 
+	ReleaseBuffer(ReadBuffer(r, P_NEW));// REMOVE THIS LINE
+	
 	return buffer;
 }
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..9006bf535d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,34 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	gistvacuumscan(info, stats, callback, callback_state);
+
+	return stats;
+}
 
 /*
  * VACUUM cleanup: update FSM
@@ -28,104 +56,36 @@
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	Relation	rel = info->index;
-	BlockNumber npages,
-				blkno;
-	BlockNumber totFreePages;
-	double		tuplesCount;
-	bool		needLock;
-
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
 
-	/* Set up all-zero stats if gistbulkdelete wasn't called */
+	/*
+	 * If gistbulkdelete was called, we need not do anything, just return the
+	 * stats from the latest gistbulkdelete call.  If it wasn't called, we still
+	 * need to do a pass over the index, to obtain index statistics.
+	 */
 	if (stats == NULL)
+	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+		gistvacuumscan(info, stats, NULL, NULL);
+	}
 
 	/*
-	 * Need lock unless it's local to this backend.
+	 * It's quite possible for us to be fooled by concurrent page splits into
+	 * double-counting some index tuples, so disbelieve any total that exceeds
+	 * the underlying heap's count ... if we know that accurately.  Otherwise
+	 * this might just make matters worse.
 	 */
-	needLock = !RELATION_IS_LOCAL(rel);
-
-	/* try to find deleted pages */
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	npages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-
-	totFreePages = 0;
-	tuplesCount = 0;
-	for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
+	if (!info->estimated_count)
 	{
-		Buffer		buffer;
-		Page		page;
-
-		vacuum_delay_point();
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		page = (Page) BufferGetPage(buffer);
-
-		if (PageIsNew(page) || GistPageIsDeleted(page))
-		{
-			totFreePages++;
-			RecordFreeIndexPage(rel, blkno);
-		}
-		else if (GistPageIsLeaf(page))
-		{
-			/* count tuples in index (considering only leaf tuples) */
-			tuplesCount += PageGetMaxOffsetNumber(page);
-		}
-		UnlockReleaseBuffer(buffer);
+		if (stats->num_index_tuples > info->num_heap_tuples)
+			stats->num_index_tuples = info->num_heap_tuples;
 	}
 
-	/* Finally, vacuum the FSM */
-	IndexFreeSpaceMapVacuum(info->index);
-
-	/* return statistics */
-	stats->pages_free = totFreePages;
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	stats->num_pages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-	stats->num_index_tuples = tuplesCount;
-	stats->estimated_count = false;
-
 	return stats;
 }
 
-typedef struct GistBDItem
-{
-	GistNSN		parentlsn;
-	BlockNumber blkno;
-	struct GistBDItem *next;
-} GistBDItem;
-
-static void
-pushStackIfSplited(Page page, GistBDItem *stack)
-{
-	GISTPageOpaque opaque = GistPageGetOpaque(page);
-
-	if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
-		(GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page)) &&
-		opaque->rightlink != InvalidBlockNumber /* sanity check */ )
-	{
-		/* split page detected, install right link to the stack */
-
-		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-
-		ptr->blkno = opaque->rightlink;
-		ptr->parentlsn = stack->parentlsn;
-		ptr->next = stack->next;
-		stack->next = ptr;
-	}
-}
-
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -134,141 +94,236 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void
+gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
+	GistVacState vstate;
+	BlockNumber num_pages;
+	bool		needLock;
+	BlockNumber	blkno;
+	GistNSN startNSN = GetInsertRecPtr();
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
+	/*
+	 * Reset counts that will be incremented during the scan; needed in case
+	 * of multiple scans during a single VACUUM command
+	 */
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
+	stats->pages_deleted = 0;
 
-	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
-	stack->blkno = GIST_ROOT_BLKNO;
+	/* Set up info to pass down to gistvacuumpage */
+	vstate.info = info;
+	vstate.stats = stats;
+	vstate.callback = callback;
+	vstate.callback_state = callback_state;
+	vstate.startNSN = startNSN;
+	vstate.totFreePages = 0;
 
-	while (stack)
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/*
+	 * FIXME: copied from btvacuumscan. Check that all this also holds for
+	 * GiST! 
+	 * AB: Yes, gistNewBuffer() takes LockRelationForExtension()
+	 *
+	 * The outer loop iterates over all index pages, in
+	 * physical order (we hope the kernel will cooperate in providing
+	 * read-ahead for speed).  It is critical that we visit all leaf pages,
+	 * including ones added after we start the scan, else we might fail to
+	 * delete some deletable tuples.  Hence, we must repeatedly check the
+	 * relation length.  We must acquire the relation-extension lock while
+	 * doing so to avoid a race condition: if someone else is extending the
+	 * relation, there is a window where bufmgr/smgr have created a new
+	 * all-zero page but it hasn't yet been write-locked by gistNewBuffer(). If
+	 * we manage to scan such a page here, we'll improperly assume it can be
+	 * recycled.  Taking the lock synchronizes things enough to prevent a
+	 * problem: either num_pages won't include the new page, or gistNewBuffer
+	 * already has write lock on the buffer and it will be fully initialized
+	 * before we can examine it.  (See also vacuumlazy.c, which has the same
+	 * issue.)	Also, we need not worry if a page is added immediately after
+	 * we look; the page splitting code already has write-lock on the left
+	 * page before it adds a right page, so we must already have processed any
+	 * tuples due to be moved into such a page.
+	 *
+	 * We can skip locking for new or temp relations, however, since no one
+	 * else could be accessing them.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	blkno = GIST_ROOT_BLKNO;
+	for (;;)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
-									RBM_NORMAL, info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		gistcheckpage(rel, buffer);
-		page = (Page) BufferGetPage(buffer);
-
-		if (GistPageIsLeaf(page))
+		/* Get the current relation length */
+		if (needLock)
+			LockRelationForExtension(rel, ExclusiveLock);
+		num_pages = RelationGetNumberOfBlocks(rel);
+		if (needLock)
+			UnlockRelationForExtension(rel, ExclusiveLock);
+
+		/* Quit if we've scanned the whole relation */
+		if (blkno >= num_pages)
+			break;
+		/* Iterate over pages, then loop back to recheck length */
+		for (; blkno < num_pages; blkno++)
 		{
-			OffsetNumber todelete[MaxOffsetNumber];
-			int			ntodelete = 0;
+			gistvacuumpage(&vstate, blkno, blkno);
+		}
+	}
 
-			LockBuffer(buffer, GIST_UNLOCK);
-			LockBuffer(buffer, GIST_EXCLUSIVE);
+	/*
+	 * If we found any recyclable pages (and recorded them in the FSM), then
+	 * forcibly update the upper-level FSM pages to ensure that searchers can
+	 * find them.  It's possible that the pages were also found during
+	 * previous scans and so this is a waste of time, but it's cheap enough
+	 * relative to scanning the index that it shouldn't matter much, and
+	 * making sure that free pages are available sooner not later seems
+	 * worthwhile.
+	 *
+	 * Note that if no recyclable pages exist, we don't bother vacuuming the
+	 * FSM at all.
+	 */
+	if (vstate.totFreePages > 0)
+		IndexFreeSpaceMapVacuum(rel);
 
-			page = (Page) BufferGetPage(buffer);
-			if (stack->blkno == GIST_ROOT_BLKNO && !GistPageIsLeaf(page))
-			{
-				/* only the root can become non-leaf during relock */
-				UnlockReleaseBuffer(buffer);
-				/* one more check */
-				continue;
-			}
+	/* update statistics */
+	stats->num_pages = num_pages;
+	stats->pages_free = vstate.totFreePages;
+}
 
-			/*
-			 * check for split proceeded after look at parent, we should check
-			 * it after relock
-			 */
-			pushStackIfSplited(page, stack);
+/*
+ * gistvacuumpage --- VACUUM one page
+ *
+ * This processes a single page for gistbulkdelete().  In some cases we
+ * must go back and re-examine previously-scanned pages; this routine
+ * recurses when necessary to handle that case.
+ *
+ * blkno is the page to process.  orig_blkno is the highest block number
+ * reached by the outer gistvacuumscan loop (the same as blkno, unless we
+ * are recursing to re-examine a previous page).
+ */
+static void
+gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+{
+	IndexVacuumInfo *info = vstate->info;
+	IndexBulkDeleteResult *stats = vstate->stats;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+	Relation	rel = info->index;
+	Buffer		buffer;
+	Page		page;
+	BlockNumber recurse_to;
 
-			/*
-			 * Remove deletable tuples from page
-			 */
+restart:
+	recurse_to = InvalidBlockNumber;
 
-			maxoff = PageGetMaxOffsetNumber(page);
+	/* call vacuum_delay_point while not holding any buffer lock */
+	vacuum_delay_point();
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
+	/*
+	 * We are not going to stay here for a long time, so aggressively grab
+	 * an exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
 
-				if (callback(&(idxtuple->t_tid), callback_state))
-					todelete[ntodelete++] = i;
-				else
-					stats->num_index_tuples += 1;
-			}
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		vstate->totFreePages++;
+		RecordFreeIndexPage(rel, blkno);
+		return;
+	}
 
-			stats->tuples_removed += ntodelete;
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If this page was split after the start of the VACUUM, we have to
+		 * revisit the rightlink if it points to a block we already scanned.
+		 */
+		if ((GistFollowRight(page) || vstate->startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < orig_blkno))
+		{
+			if (GistFollowRight(page)) // REMOVE THIS LINE
+				elog(WARNING,"RESCAN TRIGGERED BY FollowRight"); // REMOVE THIS LINE
+			if (vstate->startNSN < GistPageGetNSN(page)) // REMOVE THIS LINE
+				elog(WARNING,"RESCAN TRIGGERED BY NSN"); // REMOVE THIS LINE
+			recurse_to = opaque->rightlink;
+		}
+
+		/*
+		 * Remove deletable tuples from page
+		 */
+		if (callback)
+		{
+			OffsetNumber off;
 
-			if (ntodelete)
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
 			{
-				START_CRIT_SECTION();
+				ItemId		iid = PageGetItemId(page, off);
+				IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-				MarkBufferDirty(buffer);
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = off;
+			}
+		}
 
-				PageIndexMultiDelete(page, todelete, ntodelete);
-				GistMarkTuplesDeleted(page);
+		/* We have dead tuples on the page */
+		if (ntodelete)
+		{
+			START_CRIT_SECTION();
 
-				if (RelationNeedsWAL(rel))
-				{
-					XLogRecPtr	recptr;
+			MarkBufferDirty(buffer);
 
-					recptr = gistXLogUpdate(buffer,
-											todelete, ntodelete,
-											NULL, 0, InvalidBuffer);
-					PageSetLSN(page, recptr);
-				}
-				else
-					PageSetLSN(page, gistGetFakeLSN(rel));
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
 
-				END_CRIT_SECTION();
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
 			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
-		}
-		else
-		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
+			END_CRIT_SECTION();
 
+			stats->tuples_removed += ntodelete;
+			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
-
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				ptr->parentlsn = BufferGetLSNAtomic(buffer);
-				ptr->next = stack->next;
-				stack->next = ptr;
-
-				if (GistTupleIsInvalid(idxtuple))
-					ereport(LOG,
-							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-									RelationGetRelationName(rel)),
-							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-							 errhint("Please REINDEX it.")));
-			}
 		}
 
-		UnlockReleaseBuffer(buffer);
-
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
+		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
-		vacuum_delay_point();
 	}
 
-	return stats;
+	UnlockReleaseBuffer(buffer);
+
+	/*
+	 * This is really tail recursion, but if the compiler is too stupid to
+	 * optimize it as such, we'd eat an uncomfortably large amount of stack
+	 * space per recursion level (due to the todelete[] array). A failure is
+	 * improbable since the number of levels isn't likely to be large ... but
+	 * just in case, let's hand-optimize into a loop.
+	 */
+	if (recurse_to != InvalidBlockNumber)
+	{
+		blkno = recurse_to;
+		goto restart;
+	}
 }
-- 
2.15.2 (Apple Git-101.1)

#32Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#31)
Re: GiST VACUUM

On 29/07/18 14:47, Andrey Borodin wrote:

Fixed both problems. PFA v14.

Thanks, took a really quick look at this.

The text being added to README is outdated for these latest changes.

In the second step I still use palloc'ed memory, but only to store two
bitmaps: a bitmap of internal pages and a bitmap of empty leaves. The
second physical scan reads only internal pages. I can omit that bitmap
if I scan everything. Also, I can replace the emptyLeafs bitmap with an
array/list, but I do not really think it will be big.

On a typical GiST index, what's the ratio of leaf vs. internal pages?
Perhaps an array would indeed be better. If you have a really large
index, the bitmaps can take a fair amount of memory, on top of the
memory used for tracking the dead TIDs. I.e. that memory will be in
addition to maintenance_work_mem. That's not nice, but I think it's OK
in practice, and not worth spending too much effort to eliminate. For a
1 TB index with default 8k block size, the two bitmaps will take 32 MB
of memory in total. If you're dealing with a database of that size, you
ought to have some memory to spare. But if an array would use less
memory, that'd be better.
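
(Spelling that out: 1 TB / 8 kB = 2^40 / 2^13 = 2^27 blocks; at one bit
per block each bitmap is 2^27 bits = 16 MB, hence 32 MB for the pair.)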

If you go with bitmaps, please use the existing Bitmapset instead of
rolling your own. Saves some code, and it provides more optimized
routines for iterating through all the set bits, too
(bms_next_member()). Another possibility would be to use Tidbitmap, in
the "lossy" mode, i.e. add the pages with tbm_add_page(). That might
save some memory, compared to Bitmapset, if the bitmap is very sparse.
Not sure how it compares with a plain array.
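
To illustrate the suggested pattern, a minimal sketch (not patch code:
leaf_page_is_empty() and recycle_if_still_empty() are hypothetical
placeholders, and block numbers are assumed to fit in int, since
Bitmapset members are ints):

#include "postgres.h"
#include "nodes/bitmapset.h"
#include "storage/block.h"

extern bool leaf_page_is_empty(BlockNumber blkno);		/* placeholder */
extern void recycle_if_still_empty(BlockNumber blkno);	/* placeholder */

static void
track_empty_leaves_sketch(BlockNumber num_pages)
{
	Bitmapset  *emptyLeafPages = NULL;
	BlockNumber blkno;
	int			member;

	/* first pass: remember every leaf page that turned out empty */
	for (blkno = 0; blkno < num_pages; blkno++)
	{
		if (leaf_page_is_empty(blkno))
			emptyLeafPages = bms_add_member(emptyLeafPages, (int) blkno);
	}

	/* second pass: visit the set bits in block-number order */
	member = -1;
	while ((member = bms_next_member(emptyLeafPages, member)) >= 0)
		recycle_if_still_empty((BlockNumber) member);

	bms_free(emptyLeafPages);
}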

A straightforward little optimization would be to skip scanning the
internal pages, when the first scan didn't find any empty pages. And
stop the scan of the internal pages as soon as all the empty pages have
been recycled.
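
In outline (a sketch; emptyPagesRemaining and internalPages are
illustrative names, not taken from the patch):

	ListCell   *cell;

	/* don't even start the second pass if the first one found nothing */
	if (emptyPagesRemaining > 0)
	{
		foreach(cell, internalPages)
		{
			if (emptyPagesRemaining == 0)
				break;			/* all empty leaves already recycled */

			/* ... rescan this internal page, decrementing the counter ... */
		}
	}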

- Heikki

#33Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#32)
2 attachment(s)
Re: GiST VACUUM

Hi! Thanks for looking into the patch!

On 30 July 2018, at 18:39, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 29/07/18 14:47, Andrey Borodin wrote:

Fixed both problems. PFA v14.

Thanks, took a really quick look at this.

The text being added to README is outdated for these latest changes.

Fixed.

In the second step I still use palloc'ed memory, but only to store two
bitmaps: a bitmap of internal pages and a bitmap of empty leaves. The
second physical scan reads only internal pages. I can omit that bitmap
if I scan everything. Also, I can replace the emptyLeafs bitmap with an
array/list, but I do not really think it will be big.

On a typical GiST index, what's the ratio of leaf vs. internal pages? Perhaps an array would indeed be better.

A typical GiST index has around 200 tuples per internal page. I've switched to a List since it's more efficient than a bitmap.

If you have a really large index, the bitmaps can take a fair amount of memory, on top of the memory used for tracking the dead TIDs. I.e. that memory will be in addition to maintenance_work_mem. That's not nice, but I think it's OK in practice, and not worth spending too much effort to eliminate. For a 1 TB index with default 8k block size, the two bitmaps will take 32 MB of memory in total. If you're dealing with a database of that size, you ought to have some memory to spare. But if an array would use less memory, that'd be better.

If you go with bitmaps, please use the existing Bitmapset instead of rolling your own. Saves some code, and it provides more optimized routines for iterating through all the set bits, too (bms_next_member()). Another possibility would be to use Tidbitmap, in the "lossy" mode, i.e. add the pages with tbm_add_page(). That might save some memory, compared to Bitmapset, if the bitmap is very sparse. Not sure how it compares with a plain array.

Yeah, I've stopped reinventing that bicycle. But I have to note that the default growth strategy of Bitmapset is not good: we will be repallocing the set one word at a time as it grows.
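
To illustrate the concern with a sketch (not patch code;
bms_add_member() enlarges the set only as far as the newest member
requires):

	Bitmapset  *bms = NULL;
	int			blkno;

	for (blkno = 0; blkno < 1000000; blkno++)
	{
		/*
		 * Whenever blkno crosses into a new bitmapword, the set is
		 * repalloc'ed just large enough to hold it: about 15,000
		 * repallocs for a million pages with 64-bit bitmapwords.
		 */
		bms = bms_add_member(bms, blkno);
	}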

A straightforward little optimization would be to skip scanning the internal pages, when the first scan didn't find any empty pages. And stop the scan of the internal pages as soon as all the empty pages have been recycled.

Done.

PFA v15.

Best regards, Andrey Borodin.

Attachments:

0002-Delete-pages-during-GiST-VACUUM-v15.patch (application/octet-stream)
From 664c69536d3f362748f46525952566acd1ebabab Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Tue, 31 Jul 2018 22:47:16 +0300
Subject: [PATCH 2/2] Delete pages during GiST VACUUM v15

---
 src/backend/access/gist/README         |  14 +++
 src/backend/access/gist/gist.c         |  20 +++-
 src/backend/access/gist/gistbuild.c    |   5 -
 src/backend/access/gist/gistutil.c     |   5 +-
 src/backend/access/gist/gistvacuum.c   | 161 +++++++++++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c     |  60 ++++++++++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/include/access/gist.h              |   3 +
 src/include/access/gist_private.h      |  24 +++--
 src/include/access/gistxlog.h          |  17 +++-
 src/test/regress/expected/gist.out     |   4 +-
 src/test/regress/sql/gist.sql          |   4 +-
 12 files changed, 287 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..c84359de31 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,20 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused for newly split pages. To find such pages, it
+scans the index in physical order.
+
+The physical scan reads the entire index from the first page to the last,
+collecting the block numbers of internal pages that need cleansing and the
+block numbers of empty leaf pages.
+
+After the scan, each internal page is examined under an exclusive lock, along
+with each of its potentially free leaf pages. gistbulkdelete() never deletes
+the last downlink on an internal page, to preserve the balanced tree.
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b9ba6e1241..3a6b5c7ed3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -583,7 +583,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 * the child pages first, we wouldn't know the recptr of the WAL record
 	 * we're about to write.
 	 */
-	if (BufferIsValid(leftchildbuf) && ((random()%20) != 0)) // REMOVE THIS randoms
+	if (BufferIsValid(leftchildbuf))
 	{
 		Page		leftpg = BufferGetPage(leftchildbuf);
 
@@ -700,6 +700,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum, so we
+			 * do not need to check whether this page has been deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -834,6 +839,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until their
+			 * space is reused. The downlink to such a page may already be
+			 * removed from the internal page, but this scan can still hold it.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 2c08e0dab0..db67b3c5ad 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/float.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -806,7 +807,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
@@ -828,8 +829,6 @@ gistNewBuffer(Relation r)
 	if (needLock)
 		UnlockRelationForExtension(r, ExclusiveLock);
 
-	ReleaseBuffer(ReadBuffer(r, P_NEW));// REMOVE THIS LINE
-	
 	return buffer;
 }
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0b78a5398a..57c58dc32c 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,6 +16,7 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
@@ -30,10 +31,15 @@ typedef struct
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
+	BlockNumber emptyPages;
+
+	List	   *internalPages;
+	Bitmapset  *emptyLeafPagesMap;
 } GistVacState;
 
 static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state);
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   bool deletePages);
 static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
 			   BlockNumber orig_blkno);
 
@@ -45,7 +51,7 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
 
-	gistvacuumscan(info, stats, callback, callback_state);
+	gistvacuumscan(info, stats, callback, callback_state, true);
 
 	return stats;
 }
@@ -68,7 +74,7 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	if (stats == NULL)
 	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-		gistvacuumscan(info, stats, NULL, NULL);
+		gistvacuumscan(info, stats, NULL, NULL, false);
 	}
 
 	/*
@@ -91,12 +97,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * check invalid tuples left after upgrade.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 static void
 gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   bool deletePages)
 {
 	Relation	rel = info->index;
 	GistVacState vstate;
@@ -120,6 +125,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	vstate.callback_state = callback_state;
 	vstate.startNSN = startNSN;
 	vstate.totFreePages = 0;
+	vstate.internalPages = NULL;
+	vstate.emptyLeafPagesMap = NULL;
+	vstate.emptyPages = 0;
 
 	/*
 	 * Need lock unless it's local to this backend.
@@ -168,6 +176,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (blkno >= num_pages)
 			break;
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; blkno < num_pages; blkno++)
 		{
@@ -193,6 +202,133 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* update statistics */
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
+
+	if (deletePages)
+	{
+		ListCell   *cell;
+		/* rescan inner pages that had empty child pages */
+		foreach(cell, vstate.internalPages)
+		{
+			Buffer		 buffer;
+			Page		 page;
+			OffsetNumber i,
+						 maxoff;
+			IndexTuple   idxtuple;
+			ItemId	     iid;
+			OffsetNumber todelete[MaxOffsetNumber];
+			Buffer		 buftodelete[MaxOffsetNumber];
+			int			 ntodelete = 0;
+
+			if (vstate.emptyPages == 0)
+				break;
+
+			blkno = (BlockNumber)lfirst_int(cell);
+
+
+			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+										info->strategy);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			page = (Page) BufferGetPage(buffer);
+			if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+			{
+				UnlockReleaseBuffer(buffer);
+				continue;
+			}
+
+			maxoff = PageGetMaxOffsetNumber(page);
+			/* Check that leaf pages are still empty and decide what to delete */
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				Buffer		leafBuffer;
+				Page		leafPage;
+				BlockNumber leafBlockNo;
+
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				/* if this page was not empty in the previous scan, we do not consider it */
+				leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+				if(!bms_is_member(leafBlockNo, vstate.emptyLeafPagesMap))
+				{
+					continue;
+				}
+
+				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
+									RBM_NORMAL, info->strategy);
+				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+				gistcheckpage(rel, leafBuffer);
+				leafPage = (Page) BufferGetPage(leafBuffer);
+				if (!GistPageIsLeaf(leafPage))
+				{
+					UnlockReleaseBuffer(leafBuffer);
+					continue;
+				}
+
+				if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* page is still empty */
+					&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
+					&& ntodelete < maxoff-1) /* keep at least one downlink on this page */
+				{
+					buftodelete[ntodelete] = leafBuffer;
+					todelete[ntodelete++] = i;
+				}
+				else
+					UnlockReleaseBuffer(leafBuffer);
+			}
+
+
+			if (ntodelete)
+			{
+				/*
+				 * Like in _bt_unlink_halfdead_page, we need an upper bound on
+				 * the xid that could hold downlinks to this page. We use
+				 * ReadNewTransactionId() instead of GetCurrentTransactionId()
+				 * since we are in a VACUUM.
+				 */
+				TransactionId txid = ReadNewTransactionId();
+
+				START_CRIT_SECTION();
+
+				/* Mark pages as deleted dropping references from internal pages */
+				for (i = 0; i < ntodelete; i++)
+				{
+					Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+
+					GistPageSetDeleteXid(leafPage,txid);
+
+					GistPageSetDeleted(leafPage);
+					MarkBufferDirty(buftodelete[i]);
+					stats->pages_deleted++;
+					vstate.emptyPages--;
+
+					MarkBufferDirty(buffer);
+					/* Offsets are changed as long as we delete tuples from internal page */
+					PageIndexTupleDelete(page, todelete[i] - i);
+
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr recptr 	=
+							gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+												txid, buffer, todelete[i] - i);
+						PageSetLSN(page, recptr);
+						PageSetLSN(leafPage, recptr);
+					}
+					else
+					{
+						PageSetLSN(page, gistGetFakeLSN(rel));
+						PageSetLSN(leafPage, gistGetFakeLSN(rel));
+					}
+
+					UnlockReleaseBuffer(buftodelete[i]);
+				}
+				END_CRIT_SECTION();
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+
+		bms_free(vstate.emptyLeafPagesMap);
+		list_free(vstate.internalPages);
+	}
 }
 
 /*
@@ -255,10 +391,6 @@ restart:
 		if ((GistFollowRight(page) || vstate->startNSN < GistPageGetNSN(page)) &&
 			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < orig_blkno))
 		{
-			if (GistFollowRight(page)) // REMOVE THIS LINE
-				elog(WARNING,"RESCAN TRIGGERED BY FollowRight"); // REMOVE THIS LINE
-			if (vstate->startNSN < GistPageGetNSN(page)) // REMOVE THIS LINE
-				elog(WARNING,"RESCAN TRIGGERED BY NSN"); // REMOVE THIS LINE
 			recurse_to = opaque->rightlink;
 		}
 
@@ -310,6 +442,15 @@ restart:
 
 		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
+		if (maxoff - FirstOffsetNumber + 1 == 0)
+		{
+			vstate->emptyLeafPagesMap = bms_add_member(vstate->emptyLeafPagesMap, blkno);
+			vstate->emptyPages++;
+		}
+	}
+	else
+	{
+		vstate->internalPages = lappend_int(vstate->internalPages, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..80108f6bfb 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,6 +145,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
+
 			if (GistPageIsLeaf(page))
 				GistMarkTuplesDeleted(page);
 		}
@@ -324,6 +358,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +479,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * the downlink from the internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	/* new tuples */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e5e925e0c5..f494db63f6 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -65,6 +65,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 827566dc6e..0dd2bf47c8 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,9 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..1f82695b1d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..ad0b742dbb 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,19 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId deleteXid; /* last Xid which could see page in scan */
+   OffsetNumber downlinkOffset; /* Offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+   BlockNumber blkno;
+   int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.15.2 (Apple Git-101.1)

0001-Physical-GiST-scan-in-VACUUM-v15.patch (application/octet-stream)
From af0c353557aa77b82b666a6f69f3635dff131665 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Tue, 31 Jul 2018 21:58:46 +0300
Subject: [PATCH 1/2] Physical GiST scan in VACUUM v15

---
 src/backend/access/gist/gist.c       |   2 +-
 src/backend/access/gist/gistutil.c   |   2 +
 src/backend/access/gist/gistvacuum.c | 427 ++++++++++++++++++++---------------
 3 files changed, 244 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..b9ba6e1241 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -583,7 +583,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 * the child pages first, we wouldn't know the recptr of the WAL record
 	 * we're about to write.
 	 */
-	if (BufferIsValid(leftchildbuf))
+	if (BufferIsValid(leftchildbuf) && ((random()%20) != 0)) // REMOVE THIS randoms
 	{
 		Page		leftpg = BufferGetPage(leftchildbuf);
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index dddfe0ae2c..2c08e0dab0 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -828,6 +828,8 @@ gistNewBuffer(Relation r)
 	if (needLock)
 		UnlockRelationForExtension(r, ExclusiveLock);
 
+	ReleaseBuffer(ReadBuffer(r, P_NEW));// REMOVE THIS LINE
+	
 	return buffer;
 }
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..0b78a5398a 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,34 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	gistvacuumscan(info, stats, callback, callback_state);
+
+	return stats;
+}
 
 /*
  * VACUUM cleanup: update FSM
@@ -28,104 +56,36 @@
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	Relation	rel = info->index;
-	BlockNumber npages,
-				blkno;
-	BlockNumber totFreePages;
-	double		tuplesCount;
-	bool		needLock;
-
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
 
-	/* Set up all-zero stats if gistbulkdelete wasn't called */
+	/*
+	 * If gistbulkdelete was called, we need not do anything, just return the
+	 * stats from the latest gistbulkdelete call.  If it wasn't called, we still
+	 * need to do a pass over the index, to obtain index statistics.
+	 */
 	if (stats == NULL)
+	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+		gistvacuumscan(info, stats, NULL, NULL);
+	}
 
 	/*
-	 * Need lock unless it's local to this backend.
+	 * It's quite possible for us to be fooled by concurrent page splits into
+	 * double-counting some index tuples, so disbelieve any total that exceeds
+	 * the underlying heap's count ... if we know that accurately.  Otherwise
+	 * this might just make matters worse.
 	 */
-	needLock = !RELATION_IS_LOCAL(rel);
-
-	/* try to find deleted pages */
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	npages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-
-	totFreePages = 0;
-	tuplesCount = 0;
-	for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
+	if (!info->estimated_count)
 	{
-		Buffer		buffer;
-		Page		page;
-
-		vacuum_delay_point();
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		page = (Page) BufferGetPage(buffer);
-
-		if (PageIsNew(page) || GistPageIsDeleted(page))
-		{
-			totFreePages++;
-			RecordFreeIndexPage(rel, blkno);
-		}
-		else if (GistPageIsLeaf(page))
-		{
-			/* count tuples in index (considering only leaf tuples) */
-			tuplesCount += PageGetMaxOffsetNumber(page);
-		}
-		UnlockReleaseBuffer(buffer);
+		if (stats->num_index_tuples > info->num_heap_tuples)
+			stats->num_index_tuples = info->num_heap_tuples;
 	}
 
-	/* Finally, vacuum the FSM */
-	IndexFreeSpaceMapVacuum(info->index);
-
-	/* return statistics */
-	stats->pages_free = totFreePages;
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	stats->num_pages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-	stats->num_index_tuples = tuplesCount;
-	stats->estimated_count = false;
-
 	return stats;
 }
 
-typedef struct GistBDItem
-{
-	GistNSN		parentlsn;
-	BlockNumber blkno;
-	struct GistBDItem *next;
-} GistBDItem;
-
-static void
-pushStackIfSplited(Page page, GistBDItem *stack)
-{
-	GISTPageOpaque opaque = GistPageGetOpaque(page);
-
-	if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
-		(GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page)) &&
-		opaque->rightlink != InvalidBlockNumber /* sanity check */ )
-	{
-		/* split page detected, install right link to the stack */
-
-		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-
-		ptr->blkno = opaque->rightlink;
-		ptr->parentlsn = stack->parentlsn;
-		ptr->next = stack->next;
-		stack->next = ptr;
-	}
-}
-
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -134,141 +94,236 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void
+gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
+	GistVacState vstate;
+	BlockNumber num_pages;
+	bool		needLock;
+	BlockNumber	blkno;
+	GistNSN startNSN = GetInsertRecPtr();
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
+	/*
+	 * Reset counts that will be incremented during the scan; needed in case
+	 * of multiple scans during a single VACUUM command
+	 */
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
+	stats->pages_deleted = 0;
 
-	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
-	stack->blkno = GIST_ROOT_BLKNO;
+	/* Set up info to pass down to gistvacuumpage */
+	vstate.info = info;
+	vstate.stats = stats;
+	vstate.callback = callback;
+	vstate.callback_state = callback_state;
+	vstate.startNSN = startNSN;
+	vstate.totFreePages = 0;
 
-	while (stack)
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/*
+	 * FIXME: copied from btvacuumscan. Check that all this also holds for
+	 * GiST!
+	 * AB: Yes, gistNewBuffer() takes LockRelationForExtension()
+	 *
+	 * The outer loop iterates over all index pages, in
+	 * physical order (we hope the kernel will cooperate in providing
+	 * read-ahead for speed).  It is critical that we visit all leaf pages,
+	 * including ones added after we start the scan, else we might fail to
+	 * delete some deletable tuples.  Hence, we must repeatedly check the
+	 * relation length.  We must acquire the relation-extension lock while
+	 * doing so to avoid a race condition: if someone else is extending the
+	 * relation, there is a window where bufmgr/smgr have created a new
+	 * all-zero page but it hasn't yet been write-locked by gistNewBuffer(). If
+	 * we manage to scan such a page here, we'll improperly assume it can be
+	 * recycled.  Taking the lock synchronizes things enough to prevent a
+	 * problem: either num_pages won't include the new page, or gistNewBuffer
+	 * already has write lock on the buffer and it will be fully initialized
+	 * before we can examine it.  (See also vacuumlazy.c, which has the same
+	 * issue.)	Also, we need not worry if a page is added immediately after
+	 * we look; the page splitting code already has write-lock on the left
+	 * page before it adds a right page, so we must already have processed any
+	 * tuples due to be moved into such a page.
+	 *
+	 * We can skip locking for new or temp relations, however, since no one
+	 * else could be accessing them.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	blkno = GIST_ROOT_BLKNO;
+	for (;;)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
-									RBM_NORMAL, info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		gistcheckpage(rel, buffer);
-		page = (Page) BufferGetPage(buffer);
-
-		if (GistPageIsLeaf(page))
+		/* Get the current relation length */
+		if (needLock)
+			LockRelationForExtension(rel, ExclusiveLock);
+		num_pages = RelationGetNumberOfBlocks(rel);
+		if (needLock)
+			UnlockRelationForExtension(rel, ExclusiveLock);
+
+		/* Quit if we've scanned the whole relation */
+		if (blkno >= num_pages)
+			break;
+		/* Iterate over pages, then loop back to recheck length */
+		for (; blkno < num_pages; blkno++)
 		{
-			OffsetNumber todelete[MaxOffsetNumber];
-			int			ntodelete = 0;
+			gistvacuumpage(&vstate, blkno, blkno);
+		}
+	}
 
-			LockBuffer(buffer, GIST_UNLOCK);
-			LockBuffer(buffer, GIST_EXCLUSIVE);
+	/*
+	 * If we found any recyclable pages (and recorded them in the FSM), then
+	 * forcibly update the upper-level FSM pages to ensure that searchers can
+	 * find them.  It's possible that the pages were also found during
+	 * previous scans and so this is a waste of time, but it's cheap enough
+	 * relative to scanning the index that it shouldn't matter much, and
+	 * making sure that free pages are available sooner not later seems
+	 * worthwhile.
+	 *
+	 * Note that if no recyclable pages exist, we don't bother vacuuming the
+	 * FSM at all.
+	 */
+	if (vstate.totFreePages > 0)
+		IndexFreeSpaceMapVacuum(rel);
 
-			page = (Page) BufferGetPage(buffer);
-			if (stack->blkno == GIST_ROOT_BLKNO && !GistPageIsLeaf(page))
-			{
-				/* only the root can become non-leaf during relock */
-				UnlockReleaseBuffer(buffer);
-				/* one more check */
-				continue;
-			}
+	/* update statistics */
+	stats->num_pages = num_pages;
+	stats->pages_free = vstate.totFreePages;
+}
 
-			/*
-			 * check for split proceeded after look at parent, we should check
-			 * it after relock
-			 */
-			pushStackIfSplited(page, stack);
+/*
+ * gistvacuumpage --- VACUUM one page
+ *
+ * This processes a single page for gistbulkdelete().  In some cases we
+ * must go back and re-examine previously-scanned pages; this routine
+ * recurses when necessary to handle that case.
+ *
+ * blkno is the page to process.  orig_blkno is the highest block number
+ * reached by the outer gistvacuumscan loop (the same as blkno, unless we
+ * are recursing to re-examine a previous page).
+ */
+static void
+gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+{
+	IndexVacuumInfo *info = vstate->info;
+	IndexBulkDeleteResult *stats = vstate->stats;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+	Relation	rel = info->index;
+	Buffer		buffer;
+	Page		page;
+	BlockNumber recurse_to;
 
-			/*
-			 * Remove deletable tuples from page
-			 */
+restart:
+	recurse_to = InvalidBlockNumber;
 
-			maxoff = PageGetMaxOffsetNumber(page);
+	/* call vacuum_delay_point while not holding any buffer lock */
+	vacuum_delay_point();
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
+	/*
+	 * We are not going to stay here for a long time, so aggressively grab
+	 * an exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
 
-				if (callback(&(idxtuple->t_tid), callback_state))
-					todelete[ntodelete++] = i;
-				else
-					stats->num_index_tuples += 1;
-			}
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		vstate->totFreePages++;
+		RecordFreeIndexPage(rel, blkno);
+		return;
+	}
 
-			stats->tuples_removed += ntodelete;
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If this page was split after the start of the VACUUM, we have to
+		 * revisit the rightlink if it points to a block we already scanned.
+		 */
+		if ((GistFollowRight(page) || vstate->startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < orig_blkno))
+		{
+			if (GistFollowRight(page)) // REMOVE THIS LINE
+				elog(WARNING,"RESCAN TRIGGERED BY FollowRight"); // REMOVE THIS LINE
+			if (vstate->startNSN < GistPageGetNSN(page)) // REMOVE THIS LINE
+				elog(WARNING,"RESCAN TRIGGERED BY NSN"); // REMOVE THIS LINE
+			recurse_to = opaque->rightlink;
+		}
+
+		/*
+		 * Remove deletable tuples from page
+		 */
+		if (callback)
+		{
+			OffsetNumber off;
 
-			if (ntodelete)
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
 			{
-				START_CRIT_SECTION();
+				ItemId		iid = PageGetItemId(page, off);
+				IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-				MarkBufferDirty(buffer);
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = off;
+			}
+		}
 
-				PageIndexMultiDelete(page, todelete, ntodelete);
-				GistMarkTuplesDeleted(page);
+		/* We have dead tuples on the page */
+		if (ntodelete)
+		{
+			START_CRIT_SECTION();
 
-				if (RelationNeedsWAL(rel))
-				{
-					XLogRecPtr	recptr;
+			MarkBufferDirty(buffer);
 
-					recptr = gistXLogUpdate(buffer,
-											todelete, ntodelete,
-											NULL, 0, InvalidBuffer);
-					PageSetLSN(page, recptr);
-				}
-				else
-					PageSetLSN(page, gistGetFakeLSN(rel));
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
 
-				END_CRIT_SECTION();
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
 			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
-		}
-		else
-		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
+			END_CRIT_SECTION();
 
+			stats->tuples_removed += ntodelete;
+			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
-
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				ptr->parentlsn = BufferGetLSNAtomic(buffer);
-				ptr->next = stack->next;
-				stack->next = ptr;
-
-				if (GistTupleIsInvalid(idxtuple))
-					ereport(LOG,
-							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-									RelationGetRelationName(rel)),
-							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-							 errhint("Please REINDEX it.")));
-			}
 		}
 
-		UnlockReleaseBuffer(buffer);
-
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
+		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
-		vacuum_delay_point();
 	}
 
-	return stats;
+	UnlockReleaseBuffer(buffer);
+
+	/*
+	 * This is really tail recursion, but if the compiler is too stupid to
+	 * optimize it as such, we'd eat an uncomfortably large amount of stack
+	 * space per recursion level (due to the todelete[] array). A failure is
+	 * improbable since the number of levels isn't likely to be large ... but
+	 * just in case, let's hand-optimize into a loop.
+	 */
+	if (recurse_to != InvalidBlockNumber)
+	{
+		blkno = recurse_to;
+		goto restart;
+	}
 }
-- 
2.15.2 (Apple Git-101.1)

#34Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#33)
Re: GiST VACUUM

On 31/07/18 23:06, Andrey Borodin wrote:

On a typical GiST index, what's the ratio of leaf vs. internal
pages? Perhaps an array would indeed be better.

A typical GiST index has around 200 tuples per internal page. I've
switched to a List since it's more efficient than a bitmap.

Hmm. A ListCell is 16 bytes, plus the AllocChunk header, another 16
bytes: 32 bytes per internal page in total, while a bitmap consumes one
bit per page, leaf or internal. If I'm doing my math right, assuming
the ratio of internal pages to leaf pages is 1:200, a List actually
consumes more memory than a bitmap: 256 bits per internal page, vs.
about 200 bits. An array, with 4 bytes per block number, would be the
winner here.
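
(Per internal page with its ~200 leaf children, that is: List, 32 bytes
= 256 bits; bitmap, roughly 201 bits, one for each of the 200 leaves
plus the internal page itself; array, 4 bytes = 32 bits.)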

But I have to note that default growth strategy of Bitmap is not
good: we will be repallocing byte by byte.

True, that repallocing seems bad. You could force it to be pre-allocated
by setting the last bit. Or add a function to explicitly enlarge the bitmap.
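
For example (a sketch; num_pages stands for the relation length in
blocks, and bms_del_member() currently clears the bit without shrinking
the allocation):

	bms = bms_add_member(bms, (int) num_pages - 1);	/* one full-size palloc */
	bms = bms_del_member(bms, (int) num_pages - 1);	/* clear the dummy bit */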

- Heikki

#35Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#34)
Re: GiST VACUUM

Hi!

On 5 August 2018, at 16:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 31/07/18 23:06, Andrey Borodin wrote:

On a typical GiST index, what's the ratio of leaf vs. internal
pages? Perhaps an array would indeed be better.

A typical GiST index has around 200 tuples per internal page. I've
switched to a List since it's more efficient than a bitmap.

Hmm. A ListCell is 16 bytes, plus the AllocChunk header, another 16
bytes: 32 bytes per internal page in total, while a bitmap consumes one
bit per page, leaf or internal. If I'm doing my math right, assuming
the ratio of internal pages to leaf pages is 1:200, a List actually
consumes more memory than a bitmap: 256 bits per internal page, vs.
about 200 bits. An array, with 4 bytes per block number, would be the
winner here.

We do not know the size of this array beforehand. I can implement a normal ArrayList (with a repalloc'd array) or a linked list of chunks, or anything else from the data-structure zoo.
Or just stick with the bitmap (my preferred way).

But I have to note that the default growth strategy of Bitmapset is not good: we will be repallocing the set one word at a time as it grows.

True, that repallocing seems bad. You could force it to be pre-allocated
by setting the last bit. Or add a function to explicitly enlarge the bitmap.

OK, I'll think of a proper resize function (ensure capacity, to be precise).

Best regards, Andrey Borodin.

#36Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Andrey Borodin (#35)
2 attachment(s)
Re: GiST VACUUM

Hi!

PFA v16.

On 5 August 2018, at 21:45, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

On 5 August 2018, at 16:18, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Hmm. A ListCell is 16 bytes, plus the AllocChunk header, another 16
bytes: 32 bytes per internal page in total, while a bitmap consumes one
bit per page, leaf or internal. If I'm doing my math right, assuming
the ratio of internal pages to leaf pages is 1:200, a List actually
consumes more memory than a bitmap: 256 bits per internal page, vs.
about 200 bits. An array, with 4 bytes per block number, would be the
winner here.

We do not know the size of this array beforehand. I can implement a normal ArrayList (with a repalloc'd array) or a linked list of chunks, or anything else from the data-structure zoo.
Or just stick with the bitmap (my preferred way).

Done.

But I have to note that the default growth strategy of Bitmapset is not good: we will be repallocing the set one word at a time as it grows.

True, that repallocing seems bad. You could force it to be pre-allocated
by setting the last bit. Or add a function to explicitly enlarge the bitmap.

OK, I'll think of a proper resize function (ensure capacity, to be precise).

Done. Added a function bms_make_empty(int size).
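
For reference, the rough shape of such a function (the authoritative
version is in the attached patch and may differ; WORDNUM and
BITMAPSET_SIZE are the macros bitmapset.c already uses internally, and
size is assumed to be positive):

	Bitmapset *
	bms_make_empty(int size)
	{
		Bitmapset  *result;
		int			nwords = WORDNUM(size - 1) + 1;

		/* allocate zeroed storage big enough for members 0 .. size-1 */
		result = (Bitmapset *) palloc0(BITMAPSET_SIZE(nwords));
		result->nwords = nwords;
		return result;
	}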

Best regards, Andrey Borodin.

Attachments:

0002-Delete-pages-during-GiST-VACUUM-v16.patch (application/octet-stream)
From 94dbdc5eaeec50dacf9e1ed995571abbce08457d Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Tue, 31 Jul 2018 22:47:16 +0300
Subject: [PATCH 2/2] Delete pages during GiST VACUUM v16

---
 src/backend/access/gist/README         |  14 +++
 src/backend/access/gist/gist.c         |  20 +++-
 src/backend/access/gist/gistbuild.c    |   5 -
 src/backend/access/gist/gistutil.c     |   5 +-
 src/backend/access/gist/gistvacuum.c   | 185 ++++++++++++++++++++++++++++++---
 src/backend/access/gist/gistxlog.c     |  60 +++++++++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/backend/nodes/bitmapset.c          |  16 +++
 src/include/access/gist.h              |   3 +
 src/include/access/gist_private.h      |  24 +++--
 src/include/access/gistxlog.h          |  17 ++-
 src/include/nodes/bitmapset.h          |   1 +
 src/test/regress/expected/gist.out     |   4 +-
 src/test/regress/sql/gist.sql          |   4 +-
 14 files changed, 321 insertions(+), 40 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..c84359de31 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,20 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused to allocate newly split pages. To find these pages,
+the function scans the index in physical order.
+
+The physical scan reads the entire index from the first page to the last. The
+scan maintains the information necessary to collect the block numbers of
+internal pages that need cleansing and the block numbers of empty leaf pages.
+
+After the scan, each internal page that had empty children is taken under an
+exclusive lock and each potentially free leaf page it references is examined.
+gistbulkdelete() never deletes the last downlink, to keep the tree balanced.
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b9ba6e1241..3a6b5c7ed3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -583,7 +583,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 * the child pages first, we wouldn't know the recptr of the WAL record
 	 * we're about to write.
 	 */
-	if (BufferIsValid(leftchildbuf) && ((random()%20) != 0)) // REMOVE THIS randoms
+	if (BufferIsValid(leftchildbuf))
 	{
 		Page		leftpg = BufferGetPage(leftchildbuf);
 
@@ -700,6 +700,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum,
+			 * so we do not need to check whether the page is deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -834,6 +839,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until
+			 * their space is reused. The downlink to such a page may already
+			 * be removed from the internal page, but this scan can still reach it.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 2c08e0dab0..db67b3c5ad 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/float.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 
 
 /*
@@ -806,7 +807,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
@@ -828,8 +829,6 @@ gistNewBuffer(Relation r)
 	if (needLock)
 		UnlockRelationForExtension(r, ExclusiveLock);
 
-	ReleaseBuffer(ReadBuffer(r, P_NEW));// REMOVE THIS LINE
-	
 	return buffer;
 }
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0b78a5398a..896fd58760 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,8 +16,10 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
+#include "nodes/bitmapset.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
@@ -30,10 +32,15 @@ typedef struct
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
+	BlockNumber emptyPages;
+
+	Bitmapset  *internalPagesMap;
+	Bitmapset  *emptyLeafPagesMap;
 } GistVacState;
 
 static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state);
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   bool deletePages);
 static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
 			   BlockNumber orig_blkno);
 
@@ -45,7 +52,7 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
 
-	gistvacuumscan(info, stats, callback, callback_state);
+	gistvacuumscan(info, stats, callback, callback_state, true);
 
 	return stats;
 }
@@ -68,7 +75,7 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	if (stats == NULL)
 	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-		gistvacuumscan(info, stats, NULL, NULL);
+		gistvacuumscan(info, stats, NULL, NULL, false);
 	}
 
 	/*
@@ -91,12 +98,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * check invalid tuples left after upgrade.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 static void
 gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   bool deletePages)
 {
 	Relation	rel = info->index;
 	GistVacState vstate;
@@ -120,6 +126,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	vstate.callback_state = callback_state;
 	vstate.startNSN = startNSN;
 	vstate.totFreePages = 0;
+	vstate.emptyPages = 0;
 
 	/*
 	 * Need lock unless it's local to this backend.
@@ -155,24 +162,35 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	needLock = !RELATION_IS_LOCAL(rel);
 
+	/* Get the current relation length */
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	num_pages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	vstate.internalPagesMap = bms_make_empty(num_pages);
+	vstate.emptyLeafPagesMap = bms_make_empty(num_pages);
+
 	blkno = GIST_ROOT_BLKNO;
 	for (;;)
 	{
-		/* Get the current relation length */
-		if (needLock)
-			LockRelationForExtension(rel, ExclusiveLock);
-		num_pages = RelationGetNumberOfBlocks(rel);
-		if (needLock)
-			UnlockRelationForExtension(rel, ExclusiveLock);
-
 		/* Quit if we've scanned the whole relation */
 		if (blkno >= num_pages)
 			break;
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; blkno < num_pages; blkno++)
 		{
 			gistvacuumpage(&vstate, blkno, blkno);
 		}
+
+		/* Update the current relation length */
+		if (needLock)
+			LockRelationForExtension(rel, ExclusiveLock);
+		num_pages = RelationGetNumberOfBlocks(rel);
+		if (needLock)
+			UnlockRelationForExtension(rel, ExclusiveLock);
 	}
 
 	/*
@@ -193,6 +211,134 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* update statistics */
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
+
+	if (deletePages)
+	{
+		/* rescan inner pages that had empty child pages */
+		for (blkno = GIST_ROOT_BLKNO; blkno < num_pages; blkno++)
+		{
+			Buffer		 buffer;
+			Page		 page;
+			OffsetNumber i,
+						 maxoff;
+			IndexTuple   idxtuple;
+			ItemId	     iid;
+			OffsetNumber todelete[MaxOffsetNumber];
+			Buffer		 buftodelete[MaxOffsetNumber];
+			int			 ntodelete = 0;
+
+			if (vstate.emptyPages == 0)
+				break;
+
+			if(!bms_is_member(blkno, vstate.internalPagesMap))
+			{
+				continue;
+			}
+
+			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+										info->strategy);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			page = (Page) BufferGetPage(buffer);
+			if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+			{
+				UnlockReleaseBuffer(buffer);
+				continue;
+			}
+
+			maxoff = PageGetMaxOffsetNumber(page);
+			/* Check that leafs are still empty and decide what to delete */
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				Buffer		leafBuffer;
+				Page		leafPage;
+				BlockNumber leafBlockNo;
+
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				/* if this page was not empty in the previous scan, do not consider it */
+				leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+				if(!bms_is_member(leafBlockNo, vstate.emptyLeafPagesMap))
+				{
+					continue;
+				}
+
+				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
+									RBM_NORMAL, info->strategy);
+				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+				gistcheckpage(rel, leafBuffer);
+				leafPage = (Page) BufferGetPage(leafBuffer);
+				if (!GistPageIsLeaf(leafPage))
+				{
+					UnlockReleaseBuffer(leafBuffer);
+					continue;
+				}
+
+				if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* Nothing left to split */
+					&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
+					&& ntodelete < maxoff-1) /* Keep at least one downlink on each internal page */
+				{
+					buftodelete[ntodelete] = leafBuffer;
+					todelete[ntodelete++] = i;
+				}
+				else
+					UnlockReleaseBuffer(leafBuffer);
+			}
+
+
+			if (ntodelete)
+			{
+				/*
+				 * As in _bt_unlink_halfdead_page, we need an upper bound on the
+				 * xid of any transaction that could still follow a downlink to
+				 * this page. We use ReadNewTransactionId() instead of
+				 * GetCurrentTransactionId() since we are in a VACUUM.
+				 */
+				TransactionId txid = ReadNewTransactionId();
+
+				START_CRIT_SECTION();
+
+				/* Mark pages as deleted dropping references from internal pages */
+				for (i = 0; i < ntodelete; i++)
+				{
+					Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+
+					GistPageSetDeleteXid(leafPage,txid);
+
+					GistPageSetDeleted(leafPage);
+					MarkBufferDirty(buftodelete[i]);
+					stats->pages_deleted++;
+					vstate.emptyPages--;
+
+					MarkBufferDirty(buffer);
+					/* Offsets shift as we delete tuples from the internal page */
+					PageIndexTupleDelete(page, todelete[i] - i);
+
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr recptr 	=
+							gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+												txid, buffer, todelete[i] - i);
+						PageSetLSN(page, recptr);
+						PageSetLSN(leafPage, recptr);
+					}
+					else
+					{
+						PageSetLSN(page, gistGetFakeLSN(rel));
+						PageSetLSN(leafPage, gistGetFakeLSN(rel));
+					}
+
+					UnlockReleaseBuffer(buftodelete[i]);
+				}
+				END_CRIT_SECTION();
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+
+		bms_free(vstate.emptyLeafPagesMap);
+		bms_free(vstate.internalPagesMap);
+	}
 }
 
 /*
@@ -255,10 +401,6 @@ restart:
 		if ((GistFollowRight(page) || vstate->startNSN < GistPageGetNSN(page)) &&
 			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < orig_blkno))
 		{
-			if (GistFollowRight(page)) // REMOVE THIS LINE
-				elog(WARNING,"RESCAN TRIGGERED BY FollowRight"); // REMOVE THIS LINE
-			if (vstate->startNSN < GistPageGetNSN(page)) // REMOVE THIS LINE
-				elog(WARNING,"RESCAN TRIGGERED BY NSN"); // REMOVE THIS LINE
 			recurse_to = opaque->rightlink;
 		}
 
@@ -310,6 +452,15 @@ restart:
 
 		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
+		if (maxoff - FirstOffsetNumber + 1 == 0)
+		{
+			vstate->emptyLeafPagesMap = bms_add_member(vstate->emptyLeafPagesMap, blkno);
+			vstate->emptyPages++;
+		}
+	}
+	else
+	{
+		vstate->internalPagesMap = bms_add_member(vstate->internalPagesMap, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..80108f6bfb 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,6 +145,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
+
 			if (GistPageIsLeaf(page))
 				GistMarkTuplesDeleted(page);
 		}
@@ -324,6 +358,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +479,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * downlink from internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	/* new tuples */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e5e925e0c5..f494db63f6 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -65,6 +65,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 8ce253c88d..29cfcd7898 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -258,6 +258,22 @@ bms_make_singleton(int x)
 	return result;
 }
 
+/*
+ * bms_make_empty - preallocate an empty bitmapset
+ */
+Bitmapset *
+bms_make_empty(int size)
+{
+	Bitmapset  *result;
+	int			wordnum;
+
+	if (size < 0)
+		elog(ERROR, "negative bitmapset member not allowed");
+	wordnum = WORDNUM(size - 1);
+	result = (Bitmapset *) palloc0(BITMAPSET_SIZE(wordnum + 1));
+	return result;
+}
+
 /*
  * bms_free - free a bitmapset
  *
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 827566dc6e..0dd2bf47c8 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,9 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..1f82695b1d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..ad0b742dbb 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,19 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId deleteXid; /* last Xid which could see page in scan */
+   OffsetNumber downlinkOffset; /* Offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+   BlockNumber blkno;
+   int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index b6f1a9e6e5..92a3d9200f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -67,6 +67,7 @@ extern Bitmapset *bms_copy(const Bitmapset *a);
 extern bool bms_equal(const Bitmapset *a, const Bitmapset *b);
 extern int	bms_compare(const Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_make_singleton(int x);
+extern Bitmapset *bms_make_empty(int size);
 extern void bms_free(Bitmapset *a);
 
 extern Bitmapset *bms_union(const Bitmapset *a, const Bitmapset *b);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.15.2 (Apple Git-101.1)

0001-Physical-GiST-scan-in-VACUUM-v16.patch (application/octet-stream)
From 0e323a85c06463653034d4d0b66bb0275b7c84e6 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Tue, 31 Jul 2018 21:58:46 +0300
Subject: [PATCH 1/2] Physical GiST scan in VACUUM v16

---
 src/backend/access/gist/gist.c       |   2 +-
 src/backend/access/gist/gistutil.c   |   2 +
 src/backend/access/gist/gistvacuum.c | 427 ++++++++++++++++++++---------------
 3 files changed, 244 insertions(+), 187 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..b9ba6e1241 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -583,7 +583,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 * the child pages first, we wouldn't know the recptr of the WAL record
 	 * we're about to write.
 	 */
-	if (BufferIsValid(leftchildbuf))
+	if (BufferIsValid(leftchildbuf) && ((random()%20) != 0)) // REMOVE THIS randoms
 	{
 		Page		leftpg = BufferGetPage(leftchildbuf);
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index dddfe0ae2c..2c08e0dab0 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -828,6 +828,8 @@ gistNewBuffer(Relation r)
 	if (needLock)
 		UnlockRelationForExtension(r, ExclusiveLock);
 
+	ReleaseBuffer(ReadBuffer(r, P_NEW));// REMOVE THIS LINE
+	
 	return buffer;
 }
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..0b78a5398a 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,34 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	gistvacuumscan(info, stats, callback, callback_state);
+
+	return stats;
+}
 
 /*
  * VACUUM cleanup: update FSM
@@ -28,104 +56,36 @@
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	Relation	rel = info->index;
-	BlockNumber npages,
-				blkno;
-	BlockNumber totFreePages;
-	double		tuplesCount;
-	bool		needLock;
-
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
 
-	/* Set up all-zero stats if gistbulkdelete wasn't called */
+	/*
+	 * If gistbulkdelete was called, we need not do anything, just return the
+	 * stats from the latest gistbulkdelete call.  If it wasn't called, we still
+	 * need to do a pass over the index, to obtain index statistics.
+	 */
 	if (stats == NULL)
+	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+		gistvacuumscan(info, stats, NULL, NULL);
+	}
 
 	/*
-	 * Need lock unless it's local to this backend.
+	 * It's quite possible for us to be fooled by concurrent page splits into
+	 * double-counting some index tuples, so disbelieve any total that exceeds
+	 * the underlying heap's count ... if we know that accurately.  Otherwise
+	 * this might just make matters worse.
 	 */
-	needLock = !RELATION_IS_LOCAL(rel);
-
-	/* try to find deleted pages */
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	npages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-
-	totFreePages = 0;
-	tuplesCount = 0;
-	for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
+	if (!info->estimated_count)
 	{
-		Buffer		buffer;
-		Page		page;
-
-		vacuum_delay_point();
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		page = (Page) BufferGetPage(buffer);
-
-		if (PageIsNew(page) || GistPageIsDeleted(page))
-		{
-			totFreePages++;
-			RecordFreeIndexPage(rel, blkno);
-		}
-		else if (GistPageIsLeaf(page))
-		{
-			/* count tuples in index (considering only leaf tuples) */
-			tuplesCount += PageGetMaxOffsetNumber(page);
-		}
-		UnlockReleaseBuffer(buffer);
+		if (stats->num_index_tuples > info->num_heap_tuples)
+			stats->num_index_tuples = info->num_heap_tuples;
 	}
 
-	/* Finally, vacuum the FSM */
-	IndexFreeSpaceMapVacuum(info->index);
-
-	/* return statistics */
-	stats->pages_free = totFreePages;
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	stats->num_pages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-	stats->num_index_tuples = tuplesCount;
-	stats->estimated_count = false;
-
 	return stats;
 }
 
-typedef struct GistBDItem
-{
-	GistNSN		parentlsn;
-	BlockNumber blkno;
-	struct GistBDItem *next;
-} GistBDItem;
-
-static void
-pushStackIfSplited(Page page, GistBDItem *stack)
-{
-	GISTPageOpaque opaque = GistPageGetOpaque(page);
-
-	if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
-		(GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page)) &&
-		opaque->rightlink != InvalidBlockNumber /* sanity check */ )
-	{
-		/* split page detected, install right link to the stack */
-
-		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-
-		ptr->blkno = opaque->rightlink;
-		ptr->parentlsn = stack->parentlsn;
-		ptr->next = stack->next;
-		stack->next = ptr;
-	}
-}
-
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -134,141 +94,236 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void
+gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
+	GistVacState vstate;
+	BlockNumber num_pages;
+	bool		needLock;
+	BlockNumber	blkno;
+	GistNSN startNSN = GetInsertRecPtr();
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
+	/*
+	 * Reset counts that will be incremented during the scan; needed in case
+	 * of multiple scans during a single VACUUM command
+	 */
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
+	stats->pages_deleted = 0;
 
-	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
-	stack->blkno = GIST_ROOT_BLKNO;
+	/* Set up info to pass down to gistvacuumpage */
+	vstate.info = info;
+	vstate.stats = stats;
+	vstate.callback = callback;
+	vstate.callback_state = callback_state;
+	vstate.startNSN = startNSN;
+	vstate.totFreePages = 0;
 
-	while (stack)
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/*
+	 * FIXME: copied from btvacuumscan. Check that all this also holds for
+	 * GiST!
+	 * AB: Yes, gistNewBuffer() takes LockRelationForExtension()
+	 *
+	 * The outer loop iterates over all index pages, in
+	 * physical order (we hope the kernel will cooperate in providing
+	 * read-ahead for speed).  It is critical that we visit all leaf pages,
+	 * including ones added after we start the scan, else we might fail to
+	 * delete some deletable tuples.  Hence, we must repeatedly check the
+	 * relation length.  We must acquire the relation-extension lock while
+	 * doing so to avoid a race condition: if someone else is extending the
+	 * relation, there is a window where bufmgr/smgr have created a new
+	 * all-zero page but it hasn't yet been write-locked by gistNewBuffer(). If
+	 * we manage to scan such a page here, we'll improperly assume it can be
+	 * recycled.  Taking the lock synchronizes things enough to prevent a
+	 * problem: either num_pages won't include the new page, or gistNewBuffer
+	 * already has write lock on the buffer and it will be fully initialized
+	 * before we can examine it.  (See also vacuumlazy.c, which has the same
+	 * issue.)	Also, we need not worry if a page is added immediately after
+	 * we look; the page splitting code already has write-lock on the left
+	 * page before it adds a right page, so we must already have processed any
+	 * tuples due to be moved into such a page.
+	 *
+	 * We can skip locking for new or temp relations, however, since no one
+	 * else could be accessing them.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	blkno = GIST_ROOT_BLKNO;
+	for (;;)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
-									RBM_NORMAL, info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		gistcheckpage(rel, buffer);
-		page = (Page) BufferGetPage(buffer);
-
-		if (GistPageIsLeaf(page))
+		/* Get the current relation length */
+		if (needLock)
+			LockRelationForExtension(rel, ExclusiveLock);
+		num_pages = RelationGetNumberOfBlocks(rel);
+		if (needLock)
+			UnlockRelationForExtension(rel, ExclusiveLock);
+
+		/* Quit if we've scanned the whole relation */
+		if (blkno >= num_pages)
+			break;
+		/* Iterate over pages, then loop back to recheck length */
+		for (; blkno < num_pages; blkno++)
 		{
-			OffsetNumber todelete[MaxOffsetNumber];
-			int			ntodelete = 0;
+			gistvacuumpage(&vstate, blkno, blkno);
+		}
+	}
 
-			LockBuffer(buffer, GIST_UNLOCK);
-			LockBuffer(buffer, GIST_EXCLUSIVE);
+	/*
+	 * If we found any recyclable pages (and recorded them in the FSM), then
+	 * forcibly update the upper-level FSM pages to ensure that searchers can
+	 * find them.  It's possible that the pages were also found during
+	 * previous scans and so this is a waste of time, but it's cheap enough
+	 * relative to scanning the index that it shouldn't matter much, and
+	 * making sure that free pages are available sooner not later seems
+	 * worthwhile.
+	 *
+	 * Note that if no recyclable pages exist, we don't bother vacuuming the
+	 * FSM at all.
+	 */
+	if (vstate.totFreePages > 0)
+		IndexFreeSpaceMapVacuum(rel);
 
-			page = (Page) BufferGetPage(buffer);
-			if (stack->blkno == GIST_ROOT_BLKNO && !GistPageIsLeaf(page))
-			{
-				/* only the root can become non-leaf during relock */
-				UnlockReleaseBuffer(buffer);
-				/* one more check */
-				continue;
-			}
+	/* update statistics */
+	stats->num_pages = num_pages;
+	stats->pages_free = vstate.totFreePages;
+}
 
-			/*
-			 * check for split proceeded after look at parent, we should check
-			 * it after relock
-			 */
-			pushStackIfSplited(page, stack);
+/*
+ * gistvacuumpage --- VACUUM one page
+ *
+ * This processes a single page for gistbulkdelete().  In some cases we
+ * must go back and re-examine previously-scanned pages; this routine
+ * recurses when necessary to handle that case.
+ *
+ * blkno is the page to process.  orig_blkno is the highest block number
+ * reached by the outer gistvacuumscan loop (the same as blkno, unless we
+ * are recursing to re-examine a previous page).
+ */
+static void
+gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+{
+	IndexVacuumInfo *info = vstate->info;
+	IndexBulkDeleteResult *stats = vstate->stats;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+	Relation	rel = info->index;
+	Buffer		buffer;
+	Page		page;
+	BlockNumber recurse_to;
 
-			/*
-			 * Remove deletable tuples from page
-			 */
+restart:
+	recurse_to = InvalidBlockNumber;
 
-			maxoff = PageGetMaxOffsetNumber(page);
+	/* call vacuum_delay_point while not holding any buffer lock */
+	vacuum_delay_point();
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
+	/*
+	 * We are not going to stay here for long, so aggressively grab an
+	 * exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
 
-				if (callback(&(idxtuple->t_tid), callback_state))
-					todelete[ntodelete++] = i;
-				else
-					stats->num_index_tuples += 1;
-			}
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		vstate->totFreePages++;
+		RecordFreeIndexPage(rel, blkno);
+		return;
+	}
 
-			stats->tuples_removed += ntodelete;
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If this page was split after the start of the VACUUM, we have to
+		 * revisit the rightlink if it points to a block we already scanned.
+		 */
+		if ((GistFollowRight(page) || vstate->startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < orig_blkno))
+		{
+			if (GistFollowRight(page)) // REMOVE THIS LINE
+				elog(WARNING,"RESCAN TRIGGERED BY FollowRight"); // REMOVE THIS LINE
+			if (vstate->startNSN < GistPageGetNSN(page)) // REMOVE THIS LINE
+				elog(WARNING,"RESCAN TRIGGERED BY NSN"); // REMOVE THIS LINE
+			recurse_to = opaque->rightlink;
+		}
+
+		/*
+		 * Remove deletable tuples from page
+		 */
+		if (callback)
+		{
+			OffsetNumber off;
 
-			if (ntodelete)
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
 			{
-				START_CRIT_SECTION();
+				ItemId		iid = PageGetItemId(page, off);
+				IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-				MarkBufferDirty(buffer);
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = off;
+			}
+		}
 
-				PageIndexMultiDelete(page, todelete, ntodelete);
-				GistMarkTuplesDeleted(page);
+		/* We have dead tuples on the page */
+		if (ntodelete)
+		{
+			START_CRIT_SECTION();
 
-				if (RelationNeedsWAL(rel))
-				{
-					XLogRecPtr	recptr;
+			MarkBufferDirty(buffer);
 
-					recptr = gistXLogUpdate(buffer,
-											todelete, ntodelete,
-											NULL, 0, InvalidBuffer);
-					PageSetLSN(page, recptr);
-				}
-				else
-					PageSetLSN(page, gistGetFakeLSN(rel));
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
 
-				END_CRIT_SECTION();
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
 			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
-		}
-		else
-		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
+			END_CRIT_SECTION();
 
+			stats->tuples_removed += ntodelete;
+			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
-
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				ptr->parentlsn = BufferGetLSNAtomic(buffer);
-				ptr->next = stack->next;
-				stack->next = ptr;
-
-				if (GistTupleIsInvalid(idxtuple))
-					ereport(LOG,
-							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-									RelationGetRelationName(rel)),
-							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-							 errhint("Please REINDEX it.")));
-			}
 		}
 
-		UnlockReleaseBuffer(buffer);
-
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
+		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
-		vacuum_delay_point();
 	}
 
-	return stats;
+	UnlockReleaseBuffer(buffer);
+
+	/*
+	 * This is really tail recursion, but if the compiler is too stupid to
+	 * optimize it as such, we'd eat an uncomfortably large amount of stack
+	 * space per recursion level (due to the deletable[] array). A failure is
+	 * improbable since the number of levels isn't likely to be large ... but
+	 * just in case, let's hand-optimize into a loop.
+	 */
+	if (recurse_to != InvalidBlockNumber)
+	{
+		blkno = recurse_to;
+		goto restart;
+	}
 }
-- 
2.15.2 (Apple Git-101.1)

#37Michael Paquier
michael@paquier.xyz
In reply to: Andrey Borodin (#36)
Re: GiST VACUUM

On Mon, Aug 06, 2018 at 11:12:00PM +0500, Andrey Borodin wrote:

Done. Added function bms_make_empty(int size)

Andrey, your latest patch does not apply. I am moving this to the next
CF, waiting for your input.
--
Michael

#38Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Michael Paquier (#37)
2 attachment(s)
Re: GiST VACUUM

Hi everyone!

On 2 Oct 2018, at 6:14, Michael Paquier <michael@paquier.xyz> wrote:
Andrey, your latest patch does not apply. I am moving this to the next
CF, waiting for your input.

I'm doing preps for the CF.
Here's a rebased version.

Best regards, Andrey Borodin.

Attachments:

0002-Delete-pages-during-GiST-VACUUM-v17.patch (application/octet-stream)
From 1eb33288a9b75a2956d14ab7968082a9f3a0a708 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 28 Oct 2018 22:30:35 +0500
Subject: [PATCH 2/2] Delete pages during GiST VACUUM v17

---
 src/backend/access/gist/README         |  14 ++
 src/backend/access/gist/gist.c         |  18 +++
 src/backend/access/gist/gistbuild.c    |   5 -
 src/backend/access/gist/gistutil.c     |   3 +-
 src/backend/access/gist/gistvacuum.c   | 181 +++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c     |  60 ++++++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/backend/nodes/bitmapset.c          |  16 +++
 src/include/access/gist.h              |   3 +
 src/include/access/gist_private.h      |  24 +++-
 src/include/access/gistxlog.h          |  17 ++-
 src/include/nodes/bitmapset.h          |   1 +
 src/test/regress/expected/gist.out     |   4 +-
 src/test/regress/sql/gist.sql          |   4 +-
 14 files changed, 320 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..c84359de31 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,20 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused to allocate newly split pages. To find these pages,
+the function scans the index in physical order.
+
+The physical scan reads the entire index from the first page to the last. The
+scan maintains the information necessary to collect the block numbers of
+internal pages that need cleansing and the block numbers of empty leaf pages.
+
+After the scan, each internal page that had empty children is taken under an
+exclusive lock and each potentially free leaf page it references is examined.
+gistbulkdelete() never deletes the last downlink, to keep the tree balanced.
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42effdf7..3a6b5c7ed3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -700,6 +700,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum,
+			 * so we do not need to check whether the page is deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -834,6 +839,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until
+			 * their space is reused. The downlink to such a page may already
+			 * be removed from the internal page, but this scan can still reach it.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index 434f15f014..f26f139a9e 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -1126,11 +1126,6 @@ gistGetMaxLevel(Relation index)
  * but will be added there the first time we visit them.
  */
 
-typedef struct
-{
-	BlockNumber childblkno;		/* hash key */
-	BlockNumber parentblkno;
-} ParentMapEntry;
 
 static void
 gistInitParentMap(GISTBuildState *buildstate)
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 70627e5df6..adb316c6af 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/float.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 #include "utils/lsyscache.h"
 
 
@@ -807,7 +808,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index b6b9ee0ae3..896fd58760 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,8 +16,10 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
+#include "nodes/bitmapset.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
@@ -30,10 +32,15 @@ typedef struct
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
+	BlockNumber emptyPages;
+
+	Bitmapset  *internalPagesMap;
+	Bitmapset  *emptyLeafPagesMap;
 } GistVacState;
 
 static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state);
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   bool deletePages);
 static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
 			   BlockNumber orig_blkno);
 
@@ -45,7 +52,7 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (stats == NULL)
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
 
-	gistvacuumscan(info, stats, callback, callback_state);
+	gistvacuumscan(info, stats, callback, callback_state, true);
 
 	return stats;
 }
@@ -68,7 +75,7 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	if (stats == NULL)
 	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-		gistvacuumscan(info, stats, NULL, NULL);
+		gistvacuumscan(info, stats, NULL, NULL, false);
 	}
 
 	/*
@@ -91,12 +98,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * check invalid tuples left after upgrade.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 static void
 gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
-			   IndexBulkDeleteCallback callback, void *callback_state)
+			   IndexBulkDeleteCallback callback, void *callback_state,
+			   bool deletePages)
 {
 	Relation	rel = info->index;
 	GistVacState vstate;
@@ -120,6 +126,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	vstate.callback_state = callback_state;
 	vstate.startNSN = startNSN;
 	vstate.totFreePages = 0;
+	vstate.emptyPages = 0;
 
 	/*
 	 * Need lock unless it's local to this backend.
@@ -155,24 +162,35 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 */
 	needLock = !RELATION_IS_LOCAL(rel);
 
+	/* Get the current relation length */
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	num_pages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+
+	vstate.internalPagesMap = bms_make_empty(num_pages);
+	vstate.emptyLeafPagesMap = bms_make_empty(num_pages);
+
 	blkno = GIST_ROOT_BLKNO;
 	for (;;)
 	{
-		/* Get the current relation length */
-		if (needLock)
-			LockRelationForExtension(rel, ExclusiveLock);
-		num_pages = RelationGetNumberOfBlocks(rel);
-		if (needLock)
-			UnlockRelationForExtension(rel, ExclusiveLock);
-
 		/* Quit if we've scanned the whole relation */
 		if (blkno >= num_pages)
 			break;
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; blkno < num_pages; blkno++)
 		{
 			gistvacuumpage(&vstate, blkno, blkno);
 		}
+
+		/* Update the current relation length */
+		if (needLock)
+			LockRelationForExtension(rel, ExclusiveLock);
+		num_pages = RelationGetNumberOfBlocks(rel);
+		if (needLock)
+			UnlockRelationForExtension(rel, ExclusiveLock);
 	}
 
 	/*
@@ -193,6 +211,134 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* update statistics */
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
+
+	if (deletePages)
+	{
+		/* rescan inner pages that had empty child pages */
+		for (blkno = GIST_ROOT_BLKNO; blkno < num_pages; blkno++)
+		{
+			Buffer		 buffer;
+			Page		 page;
+			OffsetNumber i,
+						 maxoff;
+			IndexTuple   idxtuple;
+			ItemId	     iid;
+			OffsetNumber todelete[MaxOffsetNumber];
+			Buffer		 buftodelete[MaxOffsetNumber];
+			int			 ntodelete = 0;
+
+			if (vstate.emptyPages == 0)
+				break;
+
+			if(!bms_is_member(blkno, vstate.internalPagesMap))
+			{
+				continue;
+			}
+
+			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+										info->strategy);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			page = (Page) BufferGetPage(buffer);
+			if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+			{
+				UnlockReleaseBuffer(buffer);
+				continue;
+			}
+
+			maxoff = PageGetMaxOffsetNumber(page);
+			/* Check that leafs are still empty and decide what to delete */
+			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			{
+				Buffer		leafBuffer;
+				Page		leafPage;
+				BlockNumber leafBlockNo;
+
+				iid = PageGetItemId(page, i);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				/* if this page was not empty in the previous scan, do not consider it */
+				leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+				if(!bms_is_member(leafBlockNo, vstate.emptyLeafPagesMap))
+				{
+					continue;
+				}
+
+				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
+									RBM_NORMAL, info->strategy);
+				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+				gistcheckpage(rel, leafBuffer);
+				leafPage = (Page) BufferGetPage(leafBuffer);
+				if (!GistPageIsLeaf(leafPage))
+				{
+					UnlockReleaseBuffer(leafBuffer);
+					continue;
+				}
+
+				if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* Nothing left to split */
+					&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
+					&& ntodelete < maxoff-1) /* Keep at least one downlink on each internal page */
+				{
+					buftodelete[ntodelete] = leafBuffer;
+					todelete[ntodelete++] = i;
+				}
+				else
+					UnlockReleaseBuffer(leafBuffer);
+			}
+
+
+			if (ntodelete)
+			{
+				/*
+				 * As in _bt_unlink_halfdead_page, we need an upper bound on the
+				 * xid of any transaction that could still follow a downlink to
+				 * this page. We use ReadNewTransactionId() instead of
+				 * GetCurrentTransactionId() since we are in a VACUUM.
+				 */
+				TransactionId txid = ReadNewTransactionId();
+
+				START_CRIT_SECTION();
+
+				/* Mark pages as deleted dropping references from internal pages */
+				for (i = 0; i < ntodelete; i++)
+				{
+					Page		leafPage = (Page)BufferGetPage(buftodelete[i]);
+
+					GistPageSetDeleteXid(leafPage,txid);
+
+					GistPageSetDeleted(leafPage);
+					MarkBufferDirty(buftodelete[i]);
+					stats->pages_deleted++;
+					vstate.emptyPages--;
+
+					MarkBufferDirty(buffer);
+					/* Offsets shift as we delete tuples from the internal page */
+					PageIndexTupleDelete(page, todelete[i] - i);
+
+					if (RelationNeedsWAL(rel))
+					{
+						XLogRecPtr recptr 	=
+							gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+												txid, buffer, todelete[i] - i);
+						PageSetLSN(page, recptr);
+						PageSetLSN(leafPage, recptr);
+					}
+					else
+					{
+						PageSetLSN(page, gistGetFakeLSN(rel));
+						PageSetLSN(leafPage, gistGetFakeLSN(rel));
+					}
+
+					UnlockReleaseBuffer(buftodelete[i]);
+				}
+				END_CRIT_SECTION();
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+
+		bms_free(vstate.emptyLeafPagesMap);
+		bms_free(vstate.internalPagesMap);
+	}
 }
 
 /*
@@ -306,6 +452,15 @@ restart:
 
 		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
+		if (maxoff - FirstOffsetNumber + 1 == 0)
+		{
+			vstate->emptyLeafPagesMap = bms_add_member(vstate->emptyLeafPagesMap, blkno);
+			vstate->emptyPages++;
+		}
+	}
+	else
+	{
+		vstate->internalPagesMap = bms_add_member(vstate->internalPagesMap, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 1e09126978..80108f6bfb 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -60,6 +60,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
 /*
  * redo any page update (except page split)
  */
@@ -112,6 +145,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
+
 			if (GistPageIsLeaf(page))
 				GistMarkTuplesDeleted(page);
 		}
@@ -324,6 +358,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -442,6 +479,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * downlink from internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	/* new tuples */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e5e925e0c5..f494db63f6 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -65,6 +65,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 8ce253c88d..29cfcd7898 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -258,6 +258,22 @@ bms_make_singleton(int x)
 	return result;
 }
 
+/*
+ * bms_make_empty - preallocate an empty bitmapset
+ */
+Bitmapset *
+bms_make_empty(int size)
+{
+	Bitmapset  *result;
+	int			wordnum;
+
+	if (size < 0)
+		elog(ERROR, "negative bitmapset member not allowed");
+	wordnum = WORDNUM(size - 1);
+	result = (Bitmapset *) palloc0(BITMAPSET_SIZE(wordnum + 1));
+	return result;
+}
+
 /*
  * bms_free - free a bitmapset
  *
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 827566dc6e..0dd2bf47c8 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,9 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 36ed7244ba..1f82695b1d 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -16,6 +16,7 @@
 
 #include "access/amapi.h"
 #include "access/gist.h"
+#include "access/gistxlog.h"
 #include "access/itup.h"
 #include "fmgr.h"
 #include "lib/pairingheap.h"
@@ -51,6 +52,11 @@ typedef struct
 	char		tupledata[FLEXIBLE_ARRAY_MEMBER];
 } GISTNodeBufferPage;
 
+typedef struct
+{
+	BlockNumber childblkno;		/* hash key */
+	BlockNumber parentblkno;
+} ParentMapEntry;
 #define BUFFER_PAGE_DATA_OFFSET MAXALIGN(offsetof(GISTNodeBufferPage, tupledata))
 /* Returns free space in node buffer page */
 #define PAGE_FREE_SPACE(nbp) (nbp->freespace)
@@ -176,13 +182,6 @@ typedef struct GISTScanOpaqueData
 
 typedef GISTScanOpaqueData *GISTScanOpaque;
 
-/* despite the name, gistxlogPage is not part of any xlog record */
-typedef struct gistxlogPage
-{
-	BlockNumber blkno;
-	int			num;			/* number of index tuples following */
-} gistxlogPage;
-
 /* SplitedPageLayout - gistSplit function result */
 typedef struct SplitedPageLayout
 {
@@ -409,6 +408,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern void gist_redo(XLogReaderState *record);
+extern void gist_desc(StringInfo buf, XLogReaderState *record);
+extern const char *gist_identify(uint8 info);
+extern void gist_xlog_startup(void);
+extern void gist_xlog_cleanup(void);
+
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 1a2b9496d0..ad0b742dbb 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,12 +17,14 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		 0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -59,6 +61,19 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+	TransactionId deleteXid;	/* last XID that could see the page in a scan */
+	OffsetNumber downlinkOffset;	/* offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
+/* despite the name, gistxlogPage is not part of any xlog record */
+typedef struct gistxlogPage
+{
+	BlockNumber blkno;
+	int			num;			/* number of index tuples following */
+} gistxlogPage;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index b6f1a9e6e5..92a3d9200f 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -67,6 +67,7 @@ extern Bitmapset *bms_copy(const Bitmapset *a);
 extern bool bms_equal(const Bitmapset *a, const Bitmapset *b);
 extern int	bms_compare(const Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_make_singleton(int x);
+extern Bitmapset *bms_make_empty(int size);
 extern void bms_free(Bitmapset *a);
 
 extern Bitmapset *bms_union(const Bitmapset *a, const Bitmapset *b);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.17.1 (Apple Git-112)

0001-Physical-GiST-scan-in-VACUUM-v17.patchapplication/octet-stream; name=0001-Physical-GiST-scan-in-VACUUM-v17.patch; x-unix-mode=0644Download
From 8eb69789dc1d64ea238c8ed4529a20b884ca3d90 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 28 Oct 2018 22:20:58 +0500
Subject: [PATCH 1/2] Physical GiST scan in VACUUM v17

---
 src/backend/access/gist/gistvacuum.c | 423 +++++++++++++++------------
 1 file changed, 237 insertions(+), 186 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c77..b6b9ee0ae3 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,34 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	gistvacuumscan(info, stats, callback, callback_state);
+
+	return stats;
+}
 
 /*
  * VACUUM cleanup: update FSM
@@ -28,104 +56,36 @@
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	Relation	rel = info->index;
-	BlockNumber npages,
-				blkno;
-	BlockNumber totFreePages;
-	double		tuplesCount;
-	bool		needLock;
-
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
 
-	/* Set up all-zero stats if gistbulkdelete wasn't called */
+	/*
+	 * If gistbulkdelete was called, we need not do anything, just return the
+	 * stats from the latest gistbulkdelete call.  If it wasn't called, we still
+	 * need to do a pass over the index, to obtain index statistics.
+	 */
 	if (stats == NULL)
+	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+		gistvacuumscan(info, stats, NULL, NULL);
+	}
 
 	/*
-	 * Need lock unless it's local to this backend.
+	 * It's quite possible for us to be fooled by concurrent page splits into
+	 * double-counting some index tuples, so disbelieve any total that exceeds
+	 * the underlying heap's count ... if we know that accurately.  Otherwise
+	 * this might just make matters worse.
 	 */
-	needLock = !RELATION_IS_LOCAL(rel);
-
-	/* try to find deleted pages */
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	npages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-
-	totFreePages = 0;
-	tuplesCount = 0;
-	for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
+	if (!info->estimated_count)
 	{
-		Buffer		buffer;
-		Page		page;
-
-		vacuum_delay_point();
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		page = (Page) BufferGetPage(buffer);
-
-		if (PageIsNew(page) || GistPageIsDeleted(page))
-		{
-			totFreePages++;
-			RecordFreeIndexPage(rel, blkno);
-		}
-		else if (GistPageIsLeaf(page))
-		{
-			/* count tuples in index (considering only leaf tuples) */
-			tuplesCount += PageGetMaxOffsetNumber(page);
-		}
-		UnlockReleaseBuffer(buffer);
+		if (stats->num_index_tuples > info->num_heap_tuples)
+			stats->num_index_tuples = info->num_heap_tuples;
 	}
 
-	/* Finally, vacuum the FSM */
-	IndexFreeSpaceMapVacuum(info->index);
-
-	/* return statistics */
-	stats->pages_free = totFreePages;
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	stats->num_pages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-	stats->num_index_tuples = tuplesCount;
-	stats->estimated_count = false;
-
 	return stats;
 }
 
-typedef struct GistBDItem
-{
-	GistNSN		parentlsn;
-	BlockNumber blkno;
-	struct GistBDItem *next;
-} GistBDItem;
-
-static void
-pushStackIfSplited(Page page, GistBDItem *stack)
-{
-	GISTPageOpaque opaque = GistPageGetOpaque(page);
-
-	if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
-		(GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page)) &&
-		opaque->rightlink != InvalidBlockNumber /* sanity check */ )
-	{
-		/* split page detected, install right link to the stack */
-
-		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-
-		ptr->blkno = opaque->rightlink;
-		ptr->parentlsn = stack->parentlsn;
-		ptr->next = stack->next;
-		stack->next = ptr;
-	}
-}
-
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -134,141 +94,232 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void
+gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
+	GistVacState vstate;
+	BlockNumber num_pages;
+	bool		needLock;
+	BlockNumber	blkno;
+	GistNSN startNSN = GetInsertRecPtr();
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
+	/*
+	 * Reset counts that will be incremented during the scan; needed in case
+	 * of multiple scans during a single VACUUM command
+	 */
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
+	stats->pages_deleted = 0;
 
-	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
-	stack->blkno = GIST_ROOT_BLKNO;
+	/* Set up info to pass down to gistvacuumpage */
+	vstate.info = info;
+	vstate.stats = stats;
+	vstate.callback = callback;
+	vstate.callback_state = callback_state;
+	vstate.startNSN = startNSN;
+	vstate.totFreePages = 0;
 
-	while (stack)
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	/*
+	 * FIXME: copied from btvacuumscan. Check that all this also holds for
+	 * GiST!
+	 * AB: Yes, gistNewBuffer() takes LockRelationForExtension()
+	 *
+	 * The outer loop iterates over all index pages, in
+	 * physical order (we hope the kernel will cooperate in providing
+	 * read-ahead for speed).  It is critical that we visit all leaf pages,
+	 * including ones added after we start the scan, else we might fail to
+	 * delete some deletable tuples.  Hence, we must repeatedly check the
+	 * relation length.  We must acquire the relation-extension lock while
+	 * doing so to avoid a race condition: if someone else is extending the
+	 * relation, there is a window where bufmgr/smgr have created a new
+	 * all-zero page but it hasn't yet been write-locked by gistNewBuffer(). If
+	 * we manage to scan such a page here, we'll improperly assume it can be
+	 * recycled.  Taking the lock synchronizes things enough to prevent a
+	 * problem: either num_pages won't include the new page, or gistNewBuffer
+	 * already has write lock on the buffer and it will be fully initialized
+	 * before we can examine it.  (See also vacuumlazy.c, which has the same
+	 * issue.)	Also, we need not worry if a page is added immediately after
+	 * we look; the page splitting code already has write-lock on the left
+	 * page before it adds a right page, so we must already have processed any
+	 * tuples due to be moved into such a page.
+	 *
+	 * We can skip locking for new or temp relations, however, since no one
+	 * else could be accessing them.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
+
+	blkno = GIST_ROOT_BLKNO;
+	for (;;)
 	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
-									RBM_NORMAL, info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		gistcheckpage(rel, buffer);
-		page = (Page) BufferGetPage(buffer);
-
-		if (GistPageIsLeaf(page))
+		/* Get the current relation length */
+		if (needLock)
+			LockRelationForExtension(rel, ExclusiveLock);
+		num_pages = RelationGetNumberOfBlocks(rel);
+		if (needLock)
+			UnlockRelationForExtension(rel, ExclusiveLock);
+
+		/* Quit if we've scanned the whole relation */
+		if (blkno >= num_pages)
+			break;
+		/* Iterate over pages, then loop back to recheck length */
+		for (; blkno < num_pages; blkno++)
 		{
-			OffsetNumber todelete[MaxOffsetNumber];
-			int			ntodelete = 0;
+			gistvacuumpage(&vstate, blkno, blkno);
+		}
+	}
 
-			LockBuffer(buffer, GIST_UNLOCK);
-			LockBuffer(buffer, GIST_EXCLUSIVE);
+	/*
+	 * If we found any recyclable pages (and recorded them in the FSM), then
+	 * forcibly update the upper-level FSM pages to ensure that searchers can
+	 * find them.  It's possible that the pages were also found during
+	 * previous scans and so this is a waste of time, but it's cheap enough
+	 * relative to scanning the index that it shouldn't matter much, and
+	 * making sure that free pages are available sooner not later seems
+	 * worthwhile.
+	 *
+	 * Note that if no recyclable pages exist, we don't bother vacuuming the
+	 * FSM at all.
+	 */
+	if (vstate.totFreePages > 0)
+		IndexFreeSpaceMapVacuum(rel);
 
-			page = (Page) BufferGetPage(buffer);
-			if (stack->blkno == GIST_ROOT_BLKNO && !GistPageIsLeaf(page))
-			{
-				/* only the root can become non-leaf during relock */
-				UnlockReleaseBuffer(buffer);
-				/* one more check */
-				continue;
-			}
+	/* update statistics */
+	stats->num_pages = num_pages;
+	stats->pages_free = vstate.totFreePages;
+}
 
-			/*
-			 * check for split proceeded after look at parent, we should check
-			 * it after relock
-			 */
-			pushStackIfSplited(page, stack);
+/*
+ * gistvacuumpage --- VACUUM one page
+ *
+ * This processes a single page for gistbulkdelete().  In some cases we
+ * must go back and re-examine previously-scanned pages; this routine
+ * recurses when necessary to handle that case.
+ *
+ * blkno is the page to process.  orig_blkno is the highest block number
+ * reached by the outer gistvacuumscan loop (the same as blkno, unless we
+ * are recursing to re-examine a previous page).
+ */
+static void
+gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+{
+	IndexVacuumInfo *info = vstate->info;
+	IndexBulkDeleteResult *stats = vstate->stats;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+	Relation	rel = info->index;
+	Buffer		buffer;
+	Page		page;
+	BlockNumber recurse_to;
 
-			/*
-			 * Remove deletable tuples from page
-			 */
+restart:
+	recurse_to = InvalidBlockNumber;
 
-			maxoff = PageGetMaxOffsetNumber(page);
+	/* call vacuum_delay_point while not holding any buffer lock */
+	vacuum_delay_point();
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
+	/*
+	 * We are not going to stay here for long, so aggressively grab an
+	 * exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
 
-				if (callback(&(idxtuple->t_tid), callback_state))
-					todelete[ntodelete++] = i;
-				else
-					stats->num_index_tuples += 1;
-			}
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		vstate->totFreePages++;
+		RecordFreeIndexPage(rel, blkno);
+		return;
+	}
 
-			stats->tuples_removed += ntodelete;
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * If this page was split after the start of the VACUUM, we have to
+		 * revisit the rightlink if it points to a block we already scanned.
+		 */
+		if ((GistFollowRight(page) || vstate->startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) && (opaque->rightlink < orig_blkno))
+		{
+			recurse_to = opaque->rightlink;
+		}
+
+		/*
+		 * Remove deletable tuples from page
+		 */
+		if (callback)
+		{
+			OffsetNumber off;
 
-			if (ntodelete)
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
 			{
-				START_CRIT_SECTION();
+				ItemId		iid = PageGetItemId(page, off);
+				IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-				MarkBufferDirty(buffer);
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = off;
+			}
+		}
 
-				PageIndexMultiDelete(page, todelete, ntodelete);
-				GistMarkTuplesDeleted(page);
+		/* We have dead tuples on the page */
+		if (ntodelete)
+		{
+			START_CRIT_SECTION();
 
-				if (RelationNeedsWAL(rel))
-				{
-					XLogRecPtr	recptr;
+			MarkBufferDirty(buffer);
 
-					recptr = gistXLogUpdate(buffer,
-											todelete, ntodelete,
-											NULL, 0, InvalidBuffer);
-					PageSetLSN(page, recptr);
-				}
-				else
-					PageSetLSN(page, gistGetFakeLSN(rel));
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
 
-				END_CRIT_SECTION();
+			if (RelationNeedsWAL(rel))
+			{
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
 			}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
-		}
-		else
-		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
+			END_CRIT_SECTION();
 
+			stats->tuples_removed += ntodelete;
+			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
-
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				ptr->parentlsn = BufferGetLSNAtomic(buffer);
-				ptr->next = stack->next;
-				stack->next = ptr;
-
-				if (GistTupleIsInvalid(idxtuple))
-					ereport(LOG,
-							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-									RelationGetRelationName(rel)),
-							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-							 errhint("Please REINDEX it.")));
-			}
 		}
 
-		UnlockReleaseBuffer(buffer);
-
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
+		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
-		vacuum_delay_point();
 	}
 
-	return stats;
+	UnlockReleaseBuffer(buffer);
+
+	/*
+	 * This is really tail recursion, but if the compiler is too stupid to
+	 * optimize it as such, we'd eat an uncomfortably large amount of stack
+	 * space per recursion level (due to the todelete[] array). A failure is
+	 * improbable since the number of levels isn't likely to be large ... but
+	 * just in case, let's hand-optimize into a loop.
+	 */
+	if (recurse_to != InvalidBlockNumber)
+	{
+		blkno = recurse_to;
+		goto restart;
+	}
 }
-- 
2.17.1 (Apple Git-112)

#39Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Andrey Borodin (#38)
Re: GiST VACUUM

On Sun, Oct 28, 2018 at 6:32 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:

Hi everyone!

On 2 Oct 2018, at 6:14, Michael Paquier <michael@paquier.xyz> wrote:
Andrey, your latest patch does not apply. I am moving this to the next
CF, waiting for your input.

I'm doing preps for CF.
Here's rebased version.

Looks like this patch has been waiting for input since August. Could any of the
reviewers (Heikki?) please take a look at the latest version? In the meantime
I'm moving it to the next CF.

#40Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#38)
Re: GiST VACUUM

On 28/10/2018 19:32, Andrey Borodin wrote:

Hi everyone!

On 2 Oct 2018, at 6:14, Michael Paquier <michael@paquier.xyz> wrote:
Andrey, your latest patch does not apply. I am moving this to the next
CF, waiting for your input.

I'm doing preps for CF.
Here's rebased version.

Thanks, I had another look at these.

In patch #1, to do the vacuum scan in physical order:

* The starting NSN was not acquired correctly for unlogged and temp
relations. They don't use WAL, so their NSN values are based on the
'unloggedLSN' counter, rather than the current WAL insert pointer. So we must
use gistGetFakeLSN() rather than GetInsertRecPtr() for them. Fixed that.
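
Concretely, the starting NSN selection in the attached v18 now reads:

    if (RelationNeedsWAL(rel))
        vstate.startNSN = GetInsertRecPtr();
    else
        vstate.startNSN = gistGetFakeLSN(rel);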

* Adjusted comments a little bit, mostly by copy-pasting the
better-worded comments from the corresponding nbtree code, and ran pgindent.

I think this is ready to be committed, except that I didn't do any
testing. We discussed the need for testing earlier. Did you write some
test scripts for this, or how have you been testing?

Patch #2:

* Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned.
So this will fail with an index larger than 2^31 blocks. That's perhaps
academic; I don't think anyone will try to create a 16 TB GiST index
any time soon. But it feels wrong to introduce an arbitrary limitation
like that.
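
To illustrate the concern (a hypothetical snippet, not from the patch):

    BlockNumber blkno = ((BlockNumber) 1) << 31;	/* block number 2^31 */
    Bitmapset  *map = NULL;

    /* the implicit conversion to int yields a negative value here,
     * and bms_add_member() then errors out */
    map = bms_add_member(map, (int) blkno);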

* I was surprised that the bms_make_empty() function doesn't set the
'nwords' field to match the allocated size. Perhaps that was
intentional, so that you don't need to scan the empty region at the end,
when you scan through all matching bits? Still, seems surprising, and
needs a comment at least.
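
Here's roughly what I would have expected, as a sketch (assuming the point
of preallocating is to avoid repeated enlargement in bms_add_member later):

    Bitmapset *
    bms_make_empty(int size)
    {
        Bitmapset  *result;
        int         wordnum = WORDNUM(size - 1);

        result = (Bitmapset *) palloc0(BITMAPSET_SIZE(wordnum + 1));
        result->nwords = wordnum + 1;	/* reflect the allocated length */
        return result;
    }

As the patch stands, palloc0() leaves nwords at zero, so the first
bms_add_member() call repallocs anyway.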

* We're now scanning all internal pages in the 2nd phase, not only those
internal pages that contained downlinks to empty leaf pages. That's
probably OK, but the comments need updating on that.
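
(For reference, the second phase in the attached patch loops like this:

    x = -1;
    while (vstate.emptyPages > 0 &&
           (x = bms_next_member(vstate.internalPagesMap, x)) >= 0)

and gistvacuumpage() adds every non-leaf page it visits to
internalPagesMap, so internal pages that never had an empty child are
visited too, until emptyPages reaches zero.)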

* In general, comments need to be fixed and added in several places. For
example, there's no clear explanation of what the "delete XID", stored
in pd_prune_xid, means. (I know what it is, because I'm familiar with
the same mechanism in B-tree, but it's not clear from the code itself.)
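
(For the archives: a deleted page may only be recycled once no scan that
could still follow a downlink to it can be running. The check in the
attached patch, in gistNewBuffer(), is:

    if (GistPageIsDeleted(page) &&
        TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalDataXmin))
        return buffer;	/* OK to use */

so the "delete XID" is the cutoff recorded at deletion time for exactly
that test.)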

These can be fixed; they're not show-stoppers, but patch #2 isn't quite
ready yet.

- Heikki

#41Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#40)
2 attachment(s)
Re: GiST VACUUM

On 02/01/2019 17:35, Heikki Linnakangas wrote:

On 28/10/2018 19:32, Andrey Borodin wrote:

Hi everyone!

On 2 Oct 2018, at 6:14, Michael Paquier <michael@paquier.xyz> wrote:
Andrey, your latest patch does not apply. I am moving this to the next
CF, waiting for your input.

I'm doing preps for CF.
Here's rebased version.

Thanks, I had another look at these.

Here are new patch versions, with the fixes I mentioned. Forgot to
attach these earlier.

- Heikki

Attachments:

0001-Physical-GiST-scan-in-VACUUM-v18-heikki.patchtext/x-patch; name=0001-Physical-GiST-scan-in-VACUUM-v18-heikki.patchDownload
From 46456dbf1d07aaf3e6963035a02aaa060decace3 Mon Sep 17 00:00:00 2001
From: Andrey Borodin <amborodin@acm.org>
Date: Sun, 28 Oct 2018 22:20:58 +0500
Subject: [PATCH 1/2] Physical GiST scan in VACUUM v18-heikki

---
 src/backend/access/gist/gist.c       |   8 +-
 src/backend/access/gist/gistvacuum.c | 430 +++++++++++++++------------
 2 files changed, 247 insertions(+), 191 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index a2cb84800e8..d42e810c6b3 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -38,7 +38,7 @@ static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 				 bool unlockbuf, bool unlockleftchild);
 static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
 				GISTSTATE *giststate, List *splitinfo, bool releasebuf);
-static void gistvacuumpage(Relation rel, Page page, Buffer buffer,
+static void gistprunepage(Relation rel, Page page, Buffer buffer,
 			   Relation heapRel);
 
 
@@ -261,7 +261,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 */
 	if (is_split && GistPageIsLeaf(page) && GistPageHasGarbage(page))
 	{
-		gistvacuumpage(rel, page, buffer, heapRel);
+		gistprunepage(rel, page, buffer, heapRel);
 		is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
 	}
 
@@ -1544,11 +1544,11 @@ freeGISTstate(GISTSTATE *giststate)
 }
 
 /*
- * gistvacuumpage() -- try to remove LP_DEAD items from the given page.
+ * gistprunepage() -- try to remove LP_DEAD items from the given page.
  * Function assumes that buffer is exclusively locked.
  */
 static void
-gistvacuumpage(Relation rel, Page page, Buffer buffer, Relation heapRel)
+gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 {
 	OffsetNumber deletable[MaxIndexTuplesPerPage];
 	int			ndeletable = 0;
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 5948218c779..4fb32bf76bf 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,34 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	gistvacuumscan(info, stats, callback, callback_state);
+
+	return stats;
+}
 
 /*
  * VACUUM cleanup: update FSM
@@ -28,104 +56,36 @@
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	Relation	rel = info->index;
-	BlockNumber npages,
-				blkno;
-	BlockNumber totFreePages;
-	double		tuplesCount;
-	bool		needLock;
-
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
 
-	/* Set up all-zero stats if gistbulkdelete wasn't called */
+	/*
+	 * If gistbulkdelete was called, we need not do anything, just return the
+	 * stats from the latest gistbulkdelete call.  If it wasn't called, we
+	 * still need to do a pass over the index, to obtain index statistics.
+	 */
 	if (stats == NULL)
+	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+		gistvacuumscan(info, stats, NULL, NULL);
+	}
 
 	/*
-	 * Need lock unless it's local to this backend.
+	 * It's quite possible for us to be fooled by concurrent page splits into
+	 * double-counting some index tuples, so disbelieve any total that exceeds
+	 * the underlying heap's count ... if we know that accurately.  Otherwise
+	 * this might just make matters worse.
 	 */
-	needLock = !RELATION_IS_LOCAL(rel);
-
-	/* try to find deleted pages */
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	npages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-
-	totFreePages = 0;
-	tuplesCount = 0;
-	for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
+	if (!info->estimated_count)
 	{
-		Buffer		buffer;
-		Page		page;
-
-		vacuum_delay_point();
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		page = (Page) BufferGetPage(buffer);
-
-		if (PageIsNew(page) || GistPageIsDeleted(page))
-		{
-			totFreePages++;
-			RecordFreeIndexPage(rel, blkno);
-		}
-		else if (GistPageIsLeaf(page))
-		{
-			/* count tuples in index (considering only leaf tuples) */
-			tuplesCount += PageGetMaxOffsetNumber(page);
-		}
-		UnlockReleaseBuffer(buffer);
+		if (stats->num_index_tuples > info->num_heap_tuples)
+			stats->num_index_tuples = info->num_heap_tuples;
 	}
 
-	/* Finally, vacuum the FSM */
-	IndexFreeSpaceMapVacuum(info->index);
-
-	/* return statistics */
-	stats->pages_free = totFreePages;
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	stats->num_pages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-	stats->num_index_tuples = tuplesCount;
-	stats->estimated_count = false;
-
 	return stats;
 }
 
-typedef struct GistBDItem
-{
-	GistNSN		parentlsn;
-	BlockNumber blkno;
-	struct GistBDItem *next;
-} GistBDItem;
-
-static void
-pushStackIfSplited(Page page, GistBDItem *stack)
-{
-	GISTPageOpaque opaque = GistPageGetOpaque(page);
-
-	if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
-		(GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page)) &&
-		opaque->rightlink != InvalidBlockNumber /* sanity check */ )
-	{
-		/* split page detected, install right link to the stack */
-
-		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-
-		ptr->blkno = opaque->rightlink;
-		ptr->parentlsn = stack->parentlsn;
-		ptr->next = stack->next;
-		stack->next = ptr;
-	}
-}
-
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -134,141 +94,237 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void
+gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
+	GistVacState vstate;
+	BlockNumber num_pages;
+	bool		needLock;
+	BlockNumber blkno;
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
+	/*
+	 * Reset counts that will be incremented during the scan; needed in case
+	 * of multiple scans during a single VACUUM command.
+	 */
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
+	stats->pages_deleted = 0;
+
+	/* Set up info to pass down to gistvacuumpage */
+	vstate.info = info;
+	vstate.stats = stats;
+	vstate.callback = callback;
+	vstate.callback_state = callback_state;
+	if (RelationNeedsWAL(rel))
+		vstate.startNSN = GetInsertRecPtr();
+	else
+		vstate.startNSN = gistGetFakeLSN(rel);
+	vstate.totFreePages = 0;
 
-	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
-	stack->blkno = GIST_ROOT_BLKNO;
-
-	while (stack)
-	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
-									RBM_NORMAL, info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		gistcheckpage(rel, buffer);
-		page = (Page) BufferGetPage(buffer);
-
-		if (GistPageIsLeaf(page))
-		{
-			OffsetNumber todelete[MaxOffsetNumber];
-			int			ntodelete = 0;
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
 
-			LockBuffer(buffer, GIST_UNLOCK);
-			LockBuffer(buffer, GIST_EXCLUSIVE);
+	/*
+	 * The outer loop iterates over all index pages, in physical order (we
+	 * hope the kernel will cooperate in providing read-ahead for speed).  It
+	 * is critical that we visit all leaf pages, including ones added after we
+	 * start the scan, else we might fail to delete some deletable tuples.
+	 * Hence, we must repeatedly check the relation length.  We must acquire
+	 * the relation-extension lock while doing so to avoid a race condition:
+	 * if someone else is extending the relation, there is a window where
+	 * bufmgr/smgr have created a new all-zero page but it hasn't yet been
+	 * write-locked by gistNewBuffer().  If we manage to scan such a page
+	 * here, we'll improperly assume it can be recycled.  Taking the lock
+	 * synchronizes things enough to prevent a problem: either num_pages won't
+	 * include the new page, or gistNewBuffer already has write lock on the
+	 * buffer and it will be fully initialized before we can examine it.  (See
+	 * also vacuumlazy.c, which has the same issue.)	Also, we need not worry
+	 * if a page is added immediately after we look; the page splitting code
+	 * already has write-lock on the left page before it adds a right page, so
+	 * we must already have processed any tuples due to be moved into such a
+	 * page.
+	 *
+	 * We can skip locking for new or temp relations, however, since no one
+	 * else could be accessing them.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
 
-			page = (Page) BufferGetPage(buffer);
-			if (stack->blkno == GIST_ROOT_BLKNO && !GistPageIsLeaf(page))
-			{
-				/* only the root can become non-leaf during relock */
-				UnlockReleaseBuffer(buffer);
-				/* one more check */
-				continue;
-			}
+	blkno = GIST_ROOT_BLKNO;
+	for (;;)
+	{
+		/* Get the current relation length */
+		if (needLock)
+			LockRelationForExtension(rel, ExclusiveLock);
+		num_pages = RelationGetNumberOfBlocks(rel);
+		if (needLock)
+			UnlockRelationForExtension(rel, ExclusiveLock);
+
+		/* Quit if we've scanned the whole relation */
+		if (blkno >= num_pages)
+			break;
+		/* Iterate over pages, then loop back to recheck length */
+		for (; blkno < num_pages; blkno++)
+			gistvacuumpage(&vstate, blkno, blkno);
+	}
 
-			/*
-			 * check for split proceeded after look at parent, we should check
-			 * it after relock
-			 */
-			pushStackIfSplited(page, stack);
+	/*
+	 * If we found any recyclable pages (and recorded them in the FSM), then
+	 * forcibly update the upper-level FSM pages to ensure that searchers can
+	 * find them.  It's possible that the pages were also found during
+	 * previous scans and so this is a waste of time, but it's cheap enough
+	 * relative to scanning the index that it shouldn't matter much, and
+	 * making sure that free pages are available sooner not later seems
+	 * worthwhile.
+	 *
+	 * Note that if no recyclable pages exist, we don't bother vacuuming the
+	 * FSM at all.
+	 */
+	if (vstate.totFreePages > 0)
+		IndexFreeSpaceMapVacuum(rel);
 
-			/*
-			 * Remove deletable tuples from page
-			 */
+	/* update statistics */
+	stats->num_pages = num_pages;
+	stats->pages_free = vstate.totFreePages;
+}
 
-			maxoff = PageGetMaxOffsetNumber(page);
+/*
+ * gistvacuumpage --- VACUUM one page
+ *
+ * This processes a single page for gistbulkdelete().  In some cases we
+ * must go back and re-examine previously-scanned pages; this routine
+ * recurses when necessary to handle that case.
+ *
+ * blkno is the page to process.  orig_blkno is the highest block number
+ * reached by the outer gistvacuumscan loop (the same as blkno, unless we
+ * are recursing to re-examine a previous page).
+ */
+static void
+gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+{
+	IndexVacuumInfo *info = vstate->info;
+	IndexBulkDeleteResult *stats = vstate->stats;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+	Relation	rel = info->index;
+	Buffer		buffer;
+	Page		page;
+	BlockNumber recurse_to;
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
+restart:
+	recurse_to = InvalidBlockNumber;
 
-				if (callback(&(idxtuple->t_tid), callback_state))
-					todelete[ntodelete++] = i;
-				else
-					stats->num_index_tuples += 1;
-			}
+	/* call vacuum_delay_point while not holding any buffer lock */
+	vacuum_delay_point();
 
-			stats->tuples_removed += ntodelete;
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
 
-			if (ntodelete)
-			{
-				START_CRIT_SECTION();
+	/*
+	 * We are not going to stay here for long, so aggressively grab an
+	 * exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
 
-				MarkBufferDirty(buffer);
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		vstate->totFreePages++;
+		RecordFreeIndexPage(rel, blkno);
+		return;
+	}
 
-				PageIndexMultiDelete(page, todelete, ntodelete);
-				GistMarkTuplesDeleted(page);
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * Check whether we need to recurse back to earlier pages.  What we
+		 * are concerned about is a page split that happened since we started
+		 * the vacuum scan.  If the split moved some tuples to a lower page
+		 * then we might have missed 'em.  If so, set up for tail recursion.
+		 */
+		if ((GistFollowRight(page) ||
+			 vstate->startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) &&
+			(opaque->rightlink < orig_blkno))
+		{
+			recurse_to = opaque->rightlink;
+		}
 
-				if (RelationNeedsWAL(rel))
-				{
-					XLogRecPtr	recptr;
+		/*
+		 * Scan over all items to see which ones need to be deleted
+		 * according to the callback function.
+		 */
+		if (callback)
+		{
+			OffsetNumber off;
 
-					recptr = gistXLogUpdate(buffer,
-											todelete, ntodelete,
-											NULL, 0, InvalidBuffer);
-					PageSetLSN(page, recptr);
-				}
-				else
-					PageSetLSN(page, gistGetFakeLSN(rel));
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
+			{
+				ItemId		iid = PageGetItemId(page, off);
+				IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-				END_CRIT_SECTION();
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = off;
 			}
-
 		}
-		else
+
+		/*
+		 * Apply any needed deletes.  We issue just one WAL record per page,
+		 * so as to minimize WAL traffic.
+		 */
+		if (ntodelete)
 		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
+			START_CRIT_SECTION();
 
-			maxoff = PageGetMaxOffsetNumber(page);
+			MarkBufferDirty(buffer);
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
+
+			if (RelationNeedsWAL(rel))
 			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				ptr->parentlsn = BufferGetLSNAtomic(buffer);
-				ptr->next = stack->next;
-				stack->next = ptr;
-
-				if (GistTupleIsInvalid(idxtuple))
-					ereport(LOG,
-							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-									RelationGetRelationName(rel)),
-							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-							 errhint("Please REINDEX it.")));
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
 			}
-		}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
-		UnlockReleaseBuffer(buffer);
+			END_CRIT_SECTION();
+
+			stats->tuples_removed += ntodelete;
+			/* must recompute maxoff */
+			maxoff = PageGetMaxOffsetNumber(page);
+		}
 
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
+		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
-		vacuum_delay_point();
 	}
 
-	return stats;
+	UnlockReleaseBuffer(buffer);
+
+	/*
+	 * This is really tail recursion, but if the compiler is too stupid to
+	 * optimize it as such, we'd eat an uncomfortably large amount of stack
+	 * space per recursion level (due to the todelete[] array). A failure is
+	 * improbable since the number of levels isn't likely to be large ... but
+	 * just in case, let's hand-optimize into a loop.
+	 */
+	if (recurse_to != InvalidBlockNumber)
+	{
+		blkno = recurse_to;
+		goto restart;
+	}
 }
-- 
2.19.2

0002-Delete-pages-during-GiST-VACUUM-v18-heikki.patchtext/x-patch; name=0002-Delete-pages-during-GiST-VACUUM-v18-heikki.patchDownload
From 62fbe0ce5506e006b92dbfb07aee7414040d982f Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 2 Jan 2019 16:00:16 +0200
Subject: [PATCH 2/2] Delete pages during GiST VACUUM v18-heikki

---
 src/backend/access/gist/README         |  14 +++
 src/backend/access/gist/gist.c         |  18 +++
 src/backend/access/gist/gistutil.c     |   3 +-
 src/backend/access/gist/gistvacuum.c   | 152 ++++++++++++++++++++++++-
 src/backend/access/gist/gistxlog.c     |  60 ++++++++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/backend/nodes/bitmapset.c          |  16 +++
 src/include/access/gist.h              |   3 +
 src/include/access/gist_private.h      |   7 +-
 src/include/access/gistxlog.h          |  10 +-
 src/include/nodes/bitmapset.h          |   1 +
 src/test/regress/expected/gist.out     |   4 +-
 src/test/regress/sql/gist.sql          |   4 +-
 13 files changed, 282 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b81..c84359de310 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,20 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused for newly split pages. To find these pages, the
+function scans the index in physical order.
+
+The physical scan reads the entire index from the first page to the last. It
+collects the block numbers of internal pages that need cleanup, and the
+block numbers of empty leaf pages.
+
+After the scan, each of those internal pages is taken under exclusive lock,
+and every potentially empty leaf page under it is examined. gistbulkdelete()
+never deletes the last downlink on an internal page, to keep the tree balanced.
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index d42e810c6b3..bbfd5a92b88 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -704,6 +704,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum,
+			 * so we do not need to check whether this page is deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -838,6 +843,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until
+			 * their space is reused. The downlink may already be gone from
+			 * the internal page, but this scan can still reach the leaf.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 70627e5df66..adb316c6afa 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/float.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 #include "utils/lsyscache.h"
 
 
@@ -807,7 +808,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalDataXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 4fb32bf76bf..bac6b8c77af 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,8 +16,10 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
+#include "nodes/bitmapset.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
@@ -30,6 +32,10 @@ typedef struct
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
+	BlockNumber emptyPages;
+
+	Bitmapset  *internalPagesMap;
+	Bitmapset  *emptyLeafPagesMap;
 } GistVacState;
 
 static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -91,8 +97,6 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * check invalid tuples left after upgrade.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 static void
 gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -122,6 +126,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	else
 		vstate.startNSN = gistGetFakeLSN(rel);
 	vstate.totFreePages = 0;
+	vstate.emptyPages = 0;
+	vstate.internalPagesMap = NULL;
+	vstate.emptyLeafPagesMap = NULL;
 
 	/*
 	 * Need lock unless it's local to this backend.
@@ -166,6 +173,12 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (blkno >= num_pages)
 			break;
+
+		if (!vstate.internalPagesMap)
+			vstate.internalPagesMap = bms_make_empty(num_pages);
+		if (!vstate.emptyLeafPagesMap)
+			vstate.emptyLeafPagesMap = bms_make_empty(num_pages);
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; blkno < num_pages; blkno++)
 			gistvacuumpage(&vstate, blkno, blkno);
@@ -189,6 +202,126 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* update statistics */
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
+
+	/* rescan inner pages that had empty child pages */
+	if (vstate.emptyPages > 0)
+	{
+		int			x;
+
+		x = -1;
+		while (vstate.emptyPages > 0 &&
+			   (x = bms_next_member(vstate.internalPagesMap, x)) >= 0)
+		{
+			Buffer		buffer;
+			Page		page;
+			OffsetNumber off,
+				maxoff;
+			IndexTuple  idxtuple;
+			ItemId	    iid;
+			OffsetNumber todelete[MaxOffsetNumber];
+			Buffer		buftodelete[MaxOffsetNumber];
+			int			ntodelete = 0;
+
+			/* FIXME: 'x' is signed, so this will not work with indexes larger than 2^31 blocks */
+			blkno = (BlockNumber) x;
+
+			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+										info->strategy);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			page = (Page) BufferGetPage(buffer);
+			if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+			{
+				UnlockReleaseBuffer(buffer);
+				continue;
+			}
+
+			maxoff = PageGetMaxOffsetNumber(page);
+			/* Check that leafs are still empty and decide what to delete */
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
+			{
+				Buffer		leafBuffer;
+				Page		leafPage;
+				BlockNumber leafBlockNo;
+
+				iid = PageGetItemId(page, off);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				/* if this page was not empty in the previous scan, skip it */
+				leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+				if (!bms_is_member(leafBlockNo, vstate.emptyLeafPagesMap))
+					continue;
+
+				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
+												RBM_NORMAL, info->strategy);
+				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+				gistcheckpage(rel, leafBuffer);
+				leafPage = (Page) BufferGetPage(leafBuffer);
+				if (!GistPageIsLeaf(leafPage))
+				{
+					UnlockReleaseBuffer(leafBuffer);
+					continue;
+				}
+
+				if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* the leaf is empty */
+					&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* no follow-right */
+					&& ntodelete < maxoff - 1) /* keep at least one downlink on the internal page */
+				{
+					buftodelete[ntodelete] = leafBuffer;
+					todelete[ntodelete++] = off;
+				}
+				else
+					UnlockReleaseBuffer(leafBuffer);
+			}
+
+			if (ntodelete)
+			{
+				/*
+				 * Like in _bt_unlink_halfdead_page we need a upper bound on xid
+				 * that could hold downlinks to this page. We use
+				 * ReadNewTransactionId() to instead of GetCurrentTransactionId
+				 * since we are in a VACUUM.
+				 */
+				TransactionId txid = ReadNewTransactionId();
+				int			i;
+
+				START_CRIT_SECTION();
+
+				/* Mark pages as deleted, dropping their downlinks from the internal page */
+				for (i = 0; i < ntodelete; i++)
+				{
+					Page		leafPage = (Page) BufferGetPage(buftodelete[i]);
+					XLogRecPtr	recptr;
+
+					GistPageSetDeleteXid(leafPage, txid);
+
+					GistPageSetDeleted(leafPage);
+					MarkBufferDirty(buftodelete[i]);
+					stats->pages_deleted++;
+					vstate.emptyPages--;
+
+					MarkBufferDirty(buffer);
+					/* Offsets shift as we delete earlier tuples from the internal page */
+					PageIndexTupleDelete(page, todelete[i] - i);
+
+					if (RelationNeedsWAL(rel))
+						recptr 	= gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+													 txid, buffer, todelete[i] - i);
+					else
+						recptr = gistGetFakeLSN(rel);
+					PageSetLSN(page, recptr);
+					PageSetLSN(leafPage, recptr);
+
+					UnlockReleaseBuffer(buftodelete[i]);
+				}
+				END_CRIT_SECTION();
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+	}
+
+	bms_free(vstate.emptyLeafPagesMap);
+	bms_free(vstate.internalPagesMap);
 }
 
 /*
@@ -242,6 +375,7 @@ restart:
 	{
 		OffsetNumber todelete[MaxOffsetNumber];
 		int			ntodelete = 0;
+		int			nremain;
 		GISTPageOpaque opaque = GistPageGetOpaque(page);
 		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
 
@@ -309,8 +443,18 @@ restart:
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
 
-		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
-
+		nremain = maxoff - FirstOffsetNumber + 1;
+		if (nremain == 0)
+		{
+			vstate->emptyLeafPagesMap = bms_add_member(vstate->emptyLeafPagesMap, blkno);
+			vstate->emptyPages++;
+		}
+		else
+			stats->num_index_tuples += nremain;
+	}
+	else
+	{
+		vstate->internalPagesMap = bms_add_member(vstate->internalPagesMap, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 01e025d5fdb..bb0fa473f5e 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -64,6 +64,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
 /*
  * redo any page update (except page split)
  */
@@ -116,6 +149,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
+
 			if (GistPageIsLeaf(page))
 				GistMarkTuplesDeleted(page);
 		}
@@ -535,6 +569,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -653,6 +690,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write an XLOG record describing a page delete. This also includes removal
+ * of the downlink from the internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset)
+{
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index b79ed1dfdc8..f65335ba23a 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -76,6 +76,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 8ce253c88df..29cfcd78984 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -258,6 +258,22 @@ bms_make_singleton(int x)
 	return result;
 }
 
+/*
+ * bms_make_empty - preallocate an empty bitmapset big enough for "size" members
+ */
+Bitmapset *
+bms_make_empty(int size)
+{
+	Bitmapset  *result;
+	int			wordnum;
+
+	if (size < 0)
+		elog(ERROR, "negative bitmapset size not allowed");
+	wordnum = WORDNUM(size - 1);
+	result = (Bitmapset *) palloc0(BITMAPSET_SIZE(wordnum + 1));
+	return result;
+}
+
 /*
  * bms_free - free a bitmapset
  *
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 827566dc6e7..0dd2bf47c8c 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,9 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index a73716d6eaa..5d02800dac6 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -412,12 +412,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
 			   Buffer leftchild);
 
-XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
+extern XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
 			   int ntodelete, RelFileNode hnode);
 
 extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index b67c7100500..3c71d0261a1 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,13 +17,15 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
 #define XLOG_GIST_DELETE			0x10 /* delete leaf index tuples for a page */
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -76,6 +78,12 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+	TransactionId deleteXid;	/* last xid that could see this page in a scan */
+	OffsetNumber downlinkOffset;	/* offset of the downlink to this page */
+} gistxlogPageDelete;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 433df8a46d0..55435f9ae64 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -79,6 +79,7 @@ extern Bitmapset *bms_copy(const Bitmapset *a);
 extern bool bms_equal(const Bitmapset *a, const Bitmapset *b);
 extern int	bms_compare(const Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_make_singleton(int x);
+extern Bitmapset *bms_make_empty(int size);
 extern void bms_free(Bitmapset *a);
 
 extern Bitmapset *bms_union(const Bitmapset *a, const Bitmapset *b);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf2..5b92f08c747 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13c..e66396e851b 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.19.2
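
A note on the deleteXid stored above via GistPageSetDeleteXid(): it is meant
to gate page reuse. As a minimal sketch (gist_page_recyclable is a
hypothetical helper; in later revisions of the patch the equivalent check
sits inline in gistNewBuffer), the recycling test looks like this:

/*
 * A deleted page may still be visited by a scan that started before the
 * deletion and still holds a downlink to it.  The page becomes safe to
 * hand out for reuse only after every transaction that could hold such a
 * reference has ended, i.e. once the xid remembered at deletion precedes
 * the oldest xmin any backend can still see.
 */
static bool
gist_page_recyclable(Page page)
{
	return PageIsNew(page) ||
		(GistPageIsDeleted(page) &&
		 TransactionIdPrecedes(GistPageGetDeleteXid(page),
							   RecentGlobalXmin));
}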

#42Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#40)
4 attachment(s)
Re: GiST VACUUM

Cool, thanks!

On 2 Jan 2019, at 20:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

In patch #1, to do the vacuum scan in physical order:
...
I think this is ready to be committed, except that I didn't do any testing. We discussed the need for testing earlier. Did you write some test scripts for this, or how have you been testing?

Please see the tests I used to check left jumps in v18:
0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
0002-Test-left-jumps-v18.patch

Attachments:

0002-Test-left-jumps-v18.patchapplication/octet-stream; name=0002-Test-left-jumps-v18.patch; x-unix-mode=0644Download
From ee2b02e09bb4de148c8967bced924730750de744 Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 3 Jan 2019 22:29:10 +0500
Subject: [PATCH 2/2] Test left jumps v18

---
 insert.sql                           |  1 +
 rescantest.sh                        | 23 +++++++++++++++++++++++
 src/backend/access/gist/gist.c       |  2 +-
 src/backend/access/gist/gistutil.c   |  2 ++
 src/backend/access/gist/gistvacuum.c |  4 ++++
 vacuum.sql                           |  2 ++
 6 files changed, 33 insertions(+), 1 deletion(-)
 create mode 100644 insert.sql
 create mode 100755 rescantest.sh
 create mode 100644 vacuum.sql

diff --git a/insert.sql b/insert.sql
new file mode 100644
index 0000000000..3028aa336c
--- /dev/null
+++ b/insert.sql
@@ -0,0 +1 @@
+insert into x select cube(random()) c from generate_series(1,10000) y;
diff --git a/rescantest.sh b/rescantest.sh
new file mode 100755
index 0000000000..2478599c70
--- /dev/null
+++ b/rescantest.sh
@@ -0,0 +1,23 @@
+#!/usr/bin/env bash
+
+set -e
+pkill -9 postgres || true
+make -j 16 && make install
+
+DB=~/DemoDb
+BINDIR=~/project/bin
+
+rm -rf $DB
+cp *.sql $BINDIR
+cd $BINDIR
+./initdb $DB
+./pg_ctl -D $DB start
+./psql postgres -c "create extension cube;"
+
+
+./psql postgres -c "create table x as select cube(random()) c from generate_series(1,10000) y; create index on x using gist(c);"
+./psql postgres -c "delete from x where (c~>1)>0.1;"
+./pgbench -f insert.sql postgres -T 30 &
+./pgbench -f vacuum.sql postgres -T 30
+
+./pg_ctl -D $DB stop
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 3f52b8f4dc..e034906e42 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -585,7 +585,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 * the child pages first, we wouldn't know the recptr of the WAL record
 	 * we're about to write.
 	 */
-	if (BufferIsValid(leftchildbuf))
+	if (BufferIsValid(leftchildbuf) && ((random() % 20) != 0)) // REMOVE THIS LINE
 	{
 		Page		leftpg = BufferGetPage(leftchildbuf);
 
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 8d3dfad27b..f34dd03434 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -829,6 +829,8 @@ gistNewBuffer(Relation r)
 	if (needLock)
 		UnlockRelationForExtension(r, ExclusiveLock);
 
+	ReleaseBuffer(ReadBuffer(r, P_NEW)); // REMOVE THIS LINE
+
 	return buffer;
 }
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index c4ed1b5402..7ce1099b62 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -256,6 +256,10 @@ restart:
 			(opaque->rightlink != InvalidBlockNumber) &&
 			(opaque->rightlink < orig_blkno))
 		{
+			if (GistFollowRight(page)) // REMOVE THIS LINE
+				elog(WARNING,"RESCAN TRIGGERED BY FollowRight"); // REMOVE THIS LINE
+			if (vstate->startNSN < GistPageGetNSN(page)) // REMOVE THIS LINE
+				elog(WARNING,"RESCAN TRIGGERED BY NSN"); // REMOVE THIS LINE
 			recurse_to = opaque->rightlink;
 		}
 
diff --git a/vacuum.sql b/vacuum.sql
new file mode 100644
index 0000000000..f30150bf01
--- /dev/null
+++ b/vacuum.sql
@@ -0,0 +1,2 @@
+delete from x where (c~>1)>0.1;
+vacuum x;
-- 
2.19.2
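
With these fault-injection hunks applied, the rescan paths become observable:
while vacuum.sql runs under pgbench concurrently with insert.sql, the server
log should occasionally show (expected output, assuming the injected faults
actually fire during the run):

WARNING:  RESCAN TRIGGERED BY FollowRight
WARNING:  RESCAN TRIGGERED BY NSN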

0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patchapplication/octet-stream; name=0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch; x-unix-mode=0644Download
From 4de964ced47518e3cca12c5a4841a6756ca1f79f Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 3 Jan 2019 22:21:15 +0500
Subject: [PATCH 1/2] Physical GiST scan in VACUUM v18 with test modifications

---
 src/backend/access/gist/gist.c       |   8 +-
 src/backend/access/gist/gistvacuum.c | 430 +++++++++++++++------------
 2 files changed, 247 insertions(+), 191 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b75b3a8dac..3f52b8f4dc 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -38,7 +38,7 @@ static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 				 bool unlockbuf, bool unlockleftchild);
 static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
 				GISTSTATE *giststate, List *splitinfo, bool releasebuf);
-static void gistvacuumpage(Relation rel, Page page, Buffer buffer,
+static void gistprunepage(Relation rel, Page page, Buffer buffer,
 			   Relation heapRel);
 
 
@@ -261,7 +261,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 */
 	if (is_split && GistPageIsLeaf(page) && GistPageHasGarbage(page))
 	{
-		gistvacuumpage(rel, page, buffer, heapRel);
+		gistprunepage(rel, page, buffer, heapRel);
 		is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
 	}
 
@@ -1544,11 +1544,11 @@ freeGISTstate(GISTSTATE *giststate)
 }
 
 /*
- * gistvacuumpage() -- try to remove LP_DEAD items from the given page.
+ * gistprunepage() -- try to remove LP_DEAD items from the given page.
  * Function assumes that buffer is exclusively locked.
  */
 static void
-gistvacuumpage(Relation rel, Page page, Buffer buffer, Relation heapRel)
+gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 {
 	OffsetNumber deletable[MaxIndexTuplesPerPage];
 	int			ndeletable = 0;
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index ccb147406c..c4ed1b5402 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,34 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	gistvacuumscan(info, stats, callback, callback_state);
+
+	return stats;
+}
 
 /*
  * VACUUM cleanup: update FSM
@@ -28,104 +56,36 @@
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	Relation	rel = info->index;
-	BlockNumber npages,
-				blkno;
-	BlockNumber totFreePages;
-	double		tuplesCount;
-	bool		needLock;
-
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
 
-	/* Set up all-zero stats if gistbulkdelete wasn't called */
+	/*
+	 * If gistbulkdelete was called, we need not do anything, just return the
+	 * stats from the latest gistbulkdelete call.  If it wasn't called, we
+	 * still need to do a pass over the index, to obtain index statistics.
+	 */
 	if (stats == NULL)
+	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+		gistvacuumscan(info, stats, NULL, NULL);
+	}
 
 	/*
-	 * Need lock unless it's local to this backend.
+	 * It's quite possible for us to be fooled by concurrent page splits into
+	 * double-counting some index tuples, so disbelieve any total that exceeds
+	 * the underlying heap's count ... if we know that accurately.  Otherwise
+	 * this might just make matters worse.
 	 */
-	needLock = !RELATION_IS_LOCAL(rel);
-
-	/* try to find deleted pages */
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	npages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-
-	totFreePages = 0;
-	tuplesCount = 0;
-	for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
+	if (!info->estimated_count)
 	{
-		Buffer		buffer;
-		Page		page;
-
-		vacuum_delay_point();
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		page = (Page) BufferGetPage(buffer);
-
-		if (PageIsNew(page) || GistPageIsDeleted(page))
-		{
-			totFreePages++;
-			RecordFreeIndexPage(rel, blkno);
-		}
-		else if (GistPageIsLeaf(page))
-		{
-			/* count tuples in index (considering only leaf tuples) */
-			tuplesCount += PageGetMaxOffsetNumber(page);
-		}
-		UnlockReleaseBuffer(buffer);
+		if (stats->num_index_tuples > info->num_heap_tuples)
+			stats->num_index_tuples = info->num_heap_tuples;
 	}
 
-	/* Finally, vacuum the FSM */
-	IndexFreeSpaceMapVacuum(info->index);
-
-	/* return statistics */
-	stats->pages_free = totFreePages;
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	stats->num_pages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-	stats->num_index_tuples = tuplesCount;
-	stats->estimated_count = false;
-
 	return stats;
 }
 
-typedef struct GistBDItem
-{
-	GistNSN		parentlsn;
-	BlockNumber blkno;
-	struct GistBDItem *next;
-} GistBDItem;
-
-static void
-pushStackIfSplited(Page page, GistBDItem *stack)
-{
-	GISTPageOpaque opaque = GistPageGetOpaque(page);
-
-	if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
-		(GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page)) &&
-		opaque->rightlink != InvalidBlockNumber /* sanity check */ )
-	{
-		/* split page detected, install right link to the stack */
-
-		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-
-		ptr->blkno = opaque->rightlink;
-		ptr->parentlsn = stack->parentlsn;
-		ptr->next = stack->next;
-		stack->next = ptr;
-	}
-}
-
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -134,141 +94,237 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void
+gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
+	GistVacState vstate;
+	BlockNumber num_pages;
+	bool		needLock;
+	BlockNumber blkno;
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
+	/*
+	 * Reset counts that will be incremented during the scan; needed in case
+	 * of multiple scans during a single VACUUM command.
+	 */
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
+	stats->pages_deleted = 0;
+
+	/* Set up info to pass down to gistvacuumpage */
+	vstate.info = info;
+	vstate.stats = stats;
+	vstate.callback = callback;
+	vstate.callback_state = callback_state;
+	if (RelationNeedsWAL(rel))
+		vstate.startNSN = GetInsertRecPtr();
+	else
+		vstate.startNSN = gistGetFakeLSN(rel);
+	vstate.totFreePages = 0;
 
-	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
-	stack->blkno = GIST_ROOT_BLKNO;
-
-	while (stack)
-	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
-									RBM_NORMAL, info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		gistcheckpage(rel, buffer);
-		page = (Page) BufferGetPage(buffer);
-
-		if (GistPageIsLeaf(page))
-		{
-			OffsetNumber todelete[MaxOffsetNumber];
-			int			ntodelete = 0;
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
 
-			LockBuffer(buffer, GIST_UNLOCK);
-			LockBuffer(buffer, GIST_EXCLUSIVE);
+	/*
+	 * The outer loop iterates over all index pages, in physical order (we
+	 * hope the kernel will cooperate in providing read-ahead for speed).  It
+	 * is critical that we visit all leaf pages, including ones added after we
+	 * start the scan, else we might fail to delete some deletable tuples.
+	 * Hence, we must repeatedly check the relation length.  We must acquire
+	 * the relation-extension lock while doing so to avoid a race condition:
+	 * if someone else is extending the relation, there is a window where
+	 * bufmgr/smgr have created a new all-zero page but it hasn't yet been
+	 * write-locked by gistNewBuffer().  If we manage to scan such a page
+	 * here, we'll improperly assume it can be recycled.  Taking the lock
+	 * synchronizes things enough to prevent a problem: either num_pages won't
+	 * include the new page, or gistNewBuffer already has write lock on the
+	 * buffer and it will be fully initialized before we can examine it.  (See
+	 * also vacuumlazy.c, which has the same issue.)	Also, we need not worry
+	 * if a page is added immediately after we look; the page splitting code
+	 * already has write-lock on the left page before it adds a right page, so
+	 * we must already have processed any tuples due to be moved into such a
+	 * page.
+	 *
+	 * We can skip locking for new or temp relations, however, since no one
+	 * else could be accessing them; the needLock flag computed above
+	 * already covers that case.
+	 */
 
-			page = (Page) BufferGetPage(buffer);
-			if (stack->blkno == GIST_ROOT_BLKNO && !GistPageIsLeaf(page))
-			{
-				/* only the root can become non-leaf during relock */
-				UnlockReleaseBuffer(buffer);
-				/* one more check */
-				continue;
-			}
+	blkno = GIST_ROOT_BLKNO;
+	for (;;)
+	{
+		/* Get the current relation length */
+		if (needLock)
+			LockRelationForExtension(rel, ExclusiveLock);
+		num_pages = RelationGetNumberOfBlocks(rel);
+		if (needLock)
+			UnlockRelationForExtension(rel, ExclusiveLock);
+
+		/* Quit if we've scanned the whole relation */
+		if (blkno >= num_pages)
+			break;
+		/* Iterate over pages, then loop back to recheck length */
+		for (; blkno < num_pages; blkno++)
+			gistvacuumpage(&vstate, blkno, blkno);
+	}
 
-			/*
-			 * check for split proceeded after look at parent, we should check
-			 * it after relock
-			 */
-			pushStackIfSplited(page, stack);
+	/*
+	 * If we found any recyclable pages (and recorded them in the FSM), then
+	 * forcibly update the upper-level FSM pages to ensure that searchers can
+	 * find them.  It's possible that the pages were also found during
+	 * previous scans and so this is a waste of time, but it's cheap enough
+	 * relative to scanning the index that it shouldn't matter much, and
+	 * making sure that free pages are available sooner not later seems
+	 * worthwhile.
+	 *
+	 * Note that if no recyclable pages exist, we don't bother vacuuming the
+	 * FSM at all.
+	 */
+	if (vstate.totFreePages > 0)
+		IndexFreeSpaceMapVacuum(rel);
 
-			/*
-			 * Remove deletable tuples from page
-			 */
+	/* update statistics */
+	stats->num_pages = num_pages;
+	stats->pages_free = vstate.totFreePages;
+}
 
-			maxoff = PageGetMaxOffsetNumber(page);
+/*
+ * gistvacuumpage --- VACUUM one page
+ *
+ * This processes a single page for gistbulkdelete().  In some cases we
+ * must go back and re-examine previously-scanned pages; this routine
+ * recurses when necessary to handle that case.
+ *
+ * blkno is the page to process.  orig_blkno is the highest block number
+ * reached by the outer gistvacuumscan loop (the same as blkno, unless we
+ * are recursing to re-examine a previous page).
+ */
+static void
+gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+{
+	IndexVacuumInfo *info = vstate->info;
+	IndexBulkDeleteResult *stats = vstate->stats;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+	Relation	rel = info->index;
+	Buffer		buffer;
+	Page		page;
+	BlockNumber recurse_to;
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
+restart:
+	recurse_to = InvalidBlockNumber;
 
-				if (callback(&(idxtuple->t_tid), callback_state))
-					todelete[ntodelete++] = i;
-				else
-					stats->num_index_tuples += 1;
-			}
+	/* call vacuum_delay_point while not holding any buffer lock */
+	vacuum_delay_point();
 
-			stats->tuples_removed += ntodelete;
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
 
-			if (ntodelete)
-			{
-				START_CRIT_SECTION();
+	/*
+	 * We are not going to stay here for long, so aggressively grab an
+	 * exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
 
-				MarkBufferDirty(buffer);
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		vstate->totFreePages++;
+		RecordFreeIndexPage(rel, blkno);
+		return;
+	}
 
-				PageIndexMultiDelete(page, todelete, ntodelete);
-				GistMarkTuplesDeleted(page);
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * Check whether we need to recurse back to earlier pages.  What we
+		 * are concerned about is a page split that happened since we started
+		 * the vacuum scan.  If the split moved some tuples to a lower page
+		 * then we might have missed 'em.  If so, set up for tail recursion.
+		 */
+		if ((GistFollowRight(page) ||
+			 vstate->startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) &&
+			(opaque->rightlink < orig_blkno))
+		{
+			recurse_to = opaque->rightlink;
+		}
 
-				if (RelationNeedsWAL(rel))
-				{
-					XLogRecPtr	recptr;
+		/*
+		 * Scan over all items to see which ones need to be deleted according
+		 * to the callback function.
+		 */
+		if (callback)
+		{
+			OffsetNumber off;
 
-					recptr = gistXLogUpdate(buffer,
-											todelete, ntodelete,
-											NULL, 0, InvalidBuffer);
-					PageSetLSN(page, recptr);
-				}
-				else
-					PageSetLSN(page, gistGetFakeLSN(rel));
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
+			{
+				ItemId		iid = PageGetItemId(page, off);
+				IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-				END_CRIT_SECTION();
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = off;
 			}
-
 		}
-		else
+
+		/*
+		 * Apply any needed deletes.  We issue just one WAL record per page,
+		 * so as to minimize WAL traffic.
+		 */
+		if (ntodelete)
 		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
+			START_CRIT_SECTION();
 
-			maxoff = PageGetMaxOffsetNumber(page);
+			MarkBufferDirty(buffer);
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
+
+			if (RelationNeedsWAL(rel))
 			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				ptr->parentlsn = BufferGetLSNAtomic(buffer);
-				ptr->next = stack->next;
-				stack->next = ptr;
-
-				if (GistTupleIsInvalid(idxtuple))
-					ereport(LOG,
-							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-									RelationGetRelationName(rel)),
-							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-							 errhint("Please REINDEX it.")));
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
 			}
-		}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
-		UnlockReleaseBuffer(buffer);
+			END_CRIT_SECTION();
+
+			stats->tuples_removed += ntodelete;
+			/* must recompute maxoff */
+			maxoff = PageGetMaxOffsetNumber(page);
+		}
 
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
+		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
-		vacuum_delay_point();
 	}
 
-	return stats;
+	UnlockReleaseBuffer(buffer);
+
+	/*
+	 * This is really tail recursion, but if the compiler is too stupid to
+	 * optimize it as such, we'd eat an uncomfortably large amount of stack
+	 * space per recursion level (due to the deletable[] array). A failure is
+	 * improbable since the number of levels isn't likely to be large ... but
+	 * just in case, let's hand-optimize into a loop.
+	 */
+	if (recurse_to != InvalidBlockNumber)
+	{
+		blkno = recurse_to;
+		goto restart;
+	}
 }
-- 
2.19.2
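
As a schematic sketch of how the refactored entry points divide the work
(not code from the patch; info, have_dead_tuples, callback and
callback_state stand in for what the surrounding VACUUM machinery supplies):

/*
 * When VACUUM found dead tuples it calls gistbulkdelete(), which runs
 * gistvacuumscan() with the callback and fills in the stats; the later
 * gistvacuumcleanup() call then simply reuses those stats.  When there
 * were no dead tuples, gistbulkdelete() is never called, and
 * gistvacuumcleanup() runs gistvacuumscan() itself with a NULL callback,
 * purely to collect statistics.
 */
IndexBulkDeleteResult *stats = NULL;

if (have_dead_tuples)
	stats = gistbulkdelete(info, stats, callback, callback_state);

stats = gistvacuumcleanup(info, stats);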

0002-Delete-pages-during-GiST-VACUUM-v19.patchapplication/octet-stream; name=0002-Delete-pages-during-GiST-VACUUM-v19.patch; x-unix-mode=0644Download
From 92a459570cbb36965ead0036a1930c300637fc8b Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 3 Jan 2019 22:39:13 +0500
Subject: [PATCH 2/2] Delete pages during GiST VACUUM v19

---
 src/backend/access/gist/README         |  14 +++
 src/backend/access/gist/gist.c         |  18 +++
 src/backend/access/gist/gistutil.c     |   3 +-
 src/backend/access/gist/gistvacuum.c   | 153 ++++++++++++++++++++++++-
 src/backend/access/gist/gistxlog.c     |  60 ++++++++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/backend/nodes/bitmapset.c          |  19 +++
 src/include/access/gist.h              |   4 +
 src/include/access/gist_private.h      |   7 +-
 src/include/access/gistxlog.h          |  10 +-
 src/include/nodes/bitmapset.h          |   1 +
 src/test/regress/expected/gist.out     |   4 +-
 src/test/regress/sql/gist.sql          |   4 +-
 13 files changed, 287 insertions(+), 13 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..c84359de31 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,20 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+gistbulkdelete() is responsible for marking completely empty leaf pages as
+free, so that they can be reused by future page splits. To find such pages,
+it scans the index in physical order.
+
+The physical scan reads the entire index from the first page to the last.
+Along the way it collects the block numbers of internal pages that need
+cleaning and of leaf pages that have become empty.
+
+After the scan, each collected internal page is re-examined under an
+exclusive lock; gistbulkdelete() deletes its still-empty leaf children, but
+never the last remaining downlink, so the tree stays balanced.
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 3f52b8f4dc..96aa00f783 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -704,6 +704,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum, so we
+			 * do not need to check here whether the page has been deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -838,6 +843,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until
+			 * their space is reused.  The downlink may already be gone from
+			 * the internal page, but an in-flight scan can still reach them.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 8d3dfad27b..61012cded7 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/float.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 #include "utils/lsyscache.h"
 
 
@@ -807,7 +808,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index c4ed1b5402..c458bfb565 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,8 +16,10 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
+#include "nodes/bitmapset.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
@@ -30,6 +32,10 @@ typedef struct
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
+	BlockNumber emptyPages;
+
+	Bitmapset  *internalPagesMap;
+	Bitmapset  *emptyLeafPagesMap;
 } GistVacState;
 
 static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -91,8 +97,6 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * check invalid tuples left after upgrade.
  * The set of target tuples is specified via a callback routine that tells
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
- *
- * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
 static void
 gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -122,6 +126,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	else
 		vstate.startNSN = gistGetFakeLSN(rel);
 	vstate.totFreePages = 0;
+	vstate.emptyPages = 0;
+	vstate.internalPagesMap = NULL;
+	vstate.emptyLeafPagesMap = NULL;
 
 	/*
 	 * Need lock unless it's local to this backend.
@@ -166,6 +173,12 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (blkno >= num_pages)
 			break;
+
+		if (!vstate.internalPagesMap)
+			vstate.internalPagesMap = bms_make_empty(num_pages);
+		if (!vstate.emptyLeafPagesMap)
+			vstate.emptyLeafPagesMap = bms_make_empty(num_pages);
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; blkno < num_pages; blkno++)
 			gistvacuumpage(&vstate, blkno, blkno);
@@ -189,6 +202,127 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* update statistics */
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
+
+	/* rescan all internal pages to find those that have empty child pages */
+	if (vstate.emptyPages > 0)
+	{
+		int			x;
+
+		x = -1;
+		while (vstate.emptyPages > 0 &&
+			   (x = bms_next_member(vstate.internalPagesMap, x)) >= 0)
+		{
+			Buffer		buffer;
+			Page		page;
+			OffsetNumber off,
+						maxoff;
+			IndexTuple	idxtuple;
+			ItemId		iid;
+			OffsetNumber todelete[MaxOffsetNumber];
+			Buffer		buftodelete[MaxOffsetNumber];
+			int			ntodelete = 0;
+
+			/* FIXME: 'x' is signed, so this will not work with indexes larger than 2^31 blocks */
+			blkno = (BlockNumber) x;
+
+			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+										info->strategy);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			page = (Page) BufferGetPage(buffer);
+			if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+			{
+				UnlockReleaseBuffer(buffer);
+				continue;
+			}
+
+			maxoff = PageGetMaxOffsetNumber(page);
+			/* Check that the leaf pages are still empty and decide what to delete */
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
+			{
+				Buffer		leafBuffer;
+				Page		leafPage;
+				BlockNumber leafBlockNo;
+
+				iid = PageGetItemId(page, off);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				/* skip child pages that were not empty in the previous scan */
+				leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+				if (!bms_is_member(leafBlockNo, vstate.emptyLeafPagesMap))
+					continue;
+
+				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
+												RBM_NORMAL, info->strategy);
+				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+				gistcheckpage(rel, leafBuffer);
+				leafPage = (Page) BufferGetPage(leafBuffer);
+				if (!GistPageIsLeaf(leafPage))
+				{
+					UnlockReleaseBuffer(leafBuffer);
+					continue;
+				}
+
+				if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* page is still empty */
+					&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* not in the middle of a split */
+					&& ntodelete < maxoff-1) /* keep at least one downlink on this internal page */
+				{
+					buftodelete[ntodelete] = leafBuffer;
+					todelete[ntodelete++] = off;
+				}
+				else
+					UnlockReleaseBuffer(leafBuffer);
+			}
+
+			if (ntodelete)
+			{
+				/*
+				 * As in _bt_unlink_halfdead_page, we need an upper bound on
+				 * the xid of any transaction that could still see a downlink
+				 * to these pages.  We use ReadNewTransactionId() instead of
+				 * GetCurrentTransactionId() since we are in a VACUUM.
+				 */
+				TransactionId txid = ReadNewTransactionId();
+				int			i;
+
+				START_CRIT_SECTION();
+
+				/* Mark pages as deleted, dropping their downlinks from the internal page */
+				for (i = 0; i < ntodelete; i++)
+				{
+					Page		leafPage = (Page) BufferGetPage(buftodelete[i]);
+					XLogRecPtr	recptr;
+
+					/* Remember xid of last transaction that could see this page */
+					GistPageSetDeleteXid(leafPage, txid);
+
+					GistPageSetDeleted(leafPage);
+					MarkBufferDirty(buftodelete[i]);
+					stats->pages_deleted++;
+					vstate.emptyPages--;
+
+					MarkBufferDirty(buffer);
+					/* Offsets shift as we delete tuples from the internal page */
+					PageIndexTupleDelete(page, todelete[i] - i);
+
+					if (RelationNeedsWAL(rel))
+						recptr = gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+													 txid, buffer, todelete[i] - i);
+					else
+						recptr = gistGetFakeLSN(rel);
+					PageSetLSN(page, recptr);
+					PageSetLSN(leafPage, recptr);
+
+					UnlockReleaseBuffer(buftodelete[i]);
+				}
+				END_CRIT_SECTION();
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+	}
+
+	bms_free(vstate.emptyLeafPagesMap);
+	bms_free(vstate.internalPagesMap);
 }
 
 /*
@@ -242,6 +376,7 @@ restart:
 	{
 		OffsetNumber todelete[MaxOffsetNumber];
 		int			ntodelete = 0;
+		int			nremain;
 		GISTPageOpaque opaque = GistPageGetOpaque(page);
 		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
 
@@ -309,8 +444,18 @@ restart:
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
 
-		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
-
+		nremain = maxoff - FirstOffsetNumber + 1;
+		if (nremain == 0)
+		{
+			vstate->emptyLeafPagesMap = bms_add_member(vstate->emptyLeafPagesMap, blkno);
+			vstate->emptyPages++;
+		}
+		else
+			stats->num_index_tuples += nremain;
+	}
+	else
+	{
+		vstate->internalPagesMap = bms_add_member(vstate->internalPagesMap, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 408bd5390a..3213ea98ea 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -64,6 +64,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
 /*
  * redo any page update (except page split)
  */
@@ -116,6 +149,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
+
 			if (GistPageIsLeaf(page))
 				GistMarkTuplesDeleted(page);
 		}
@@ -535,6 +569,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -653,6 +690,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * downlink from internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e468c9e15a..0861f82992 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -76,6 +76,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 62cd00903c..1d854ff61b 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -258,6 +258,25 @@ bms_make_singleton(int x)
 	return result;
 }
 
+/*
+ * bms_make_empty - preallocate an empty bitmapset
+ */
+Bitmapset *
+bms_make_empty(int size)
+{
+	Bitmapset  *result;
+	int			wordnum;
+
+	if (size < 0)
+		elog(ERROR, "negative bitmapset size not allowed");
+	wordnum = WORDNUM(size - 1);
+	result = (Bitmapset *) palloc(BITMAPSET_SIZE(wordnum + 1));
+	/* nwords stays zero (the set is empty), but the words array is */
+	/* preallocated, so growing the set via bms_add_member() is cheap */
+	result->nwords = 0;
+	return result;
+}
+
 /*
  * bms_free - free a bitmapset
  *
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 3234f24156..ce8bfd83ea 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,10 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+/* For deleted pages we store last xid which could see the page in scan */
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 3698942f9d..117cc83ba5 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -412,12 +412,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
 			   Buffer leftchild);
 
-XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
+extern XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
 			   int ntodelete, RelFileNode hnode);
 
 extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 5117aabf1a..127cff5cb7 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,13 +17,15 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
 #define XLOG_GIST_DELETE			0x10 /* delete leaf index tuples for a page */
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -76,6 +78,12 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+	TransactionId deleteXid;	/* last xid that could see this page in a scan */
+	OffsetNumber downlinkOffset;	/* offset of the downlink to this page */
+} gistxlogPageDelete;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index 892410635b..79e6d5b719 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -79,6 +79,7 @@ extern Bitmapset *bms_copy(const Bitmapset *a);
 extern bool bms_equal(const Bitmapset *a, const Bitmapset *b);
 extern int	bms_compare(const Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_make_singleton(int x);
+extern Bitmapset *bms_make_empty(int size);
 extern void bms_free(Bitmapset *a);
 
 extern Bitmapset *bms_union(const Bitmapset *a, const Bitmapset *b);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.19.2
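
The two passes described in the README hunk above condense to the following
sketch (page_is_internal, page_is_empty_leaf and delete_empty_children are
hypothetical helpers; num_pages is the relation length; the real logic lives
in gistvacuumscan() and the rescan loop added to it):

Bitmapset  *internalPagesMap;
Bitmapset  *emptyLeafPagesMap;
BlockNumber blkno;
int			x;

/* Pass 1, in physical order: classify every block of the index. */
internalPagesMap = bms_make_empty(num_pages);
emptyLeafPagesMap = bms_make_empty(num_pages);
for (blkno = GIST_ROOT_BLKNO; blkno < num_pages; blkno++)
{
	if (page_is_internal(blkno))
		internalPagesMap = bms_add_member(internalPagesMap, blkno);
	else if (page_is_empty_leaf(blkno))
		emptyLeafPagesMap = bms_add_member(emptyLeafPagesMap, blkno);
}

/*
 * Pass 2: revisit internal pages only.  For each downlink whose target is
 * in emptyLeafPagesMap, recheck the leaf under an exclusive lock and, if
 * it is still empty, mark it deleted and remove the downlink.
 */
x = -1;
while ((x = bms_next_member(internalPagesMap, x)) >= 0)
	delete_empty_children((BlockNumber) x, emptyLeafPagesMap);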

0001-Physical-GiST-scan-in-VACUUM-v19.patchapplication/octet-stream; name=0001-Physical-GiST-scan-in-VACUUM-v19.patch; x-unix-mode=0644Download
From 4134edab7763d11dfa0ce71fd8bb7f565691223d Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Thu, 3 Jan 2019 22:38:24 +0500
Subject: [PATCH 1/2] Physical GiST scan in VACUUM v19

---
 src/backend/access/gist/gist.c       |   8 +-
 src/backend/access/gist/gistvacuum.c | 430 +++++++++++++++------------
 2 files changed, 247 insertions(+), 191 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index b75b3a8dac..3f52b8f4dc 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -38,7 +38,7 @@ static bool gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 				 bool unlockbuf, bool unlockleftchild);
 static void gistfinishsplit(GISTInsertState *state, GISTInsertStack *stack,
 				GISTSTATE *giststate, List *splitinfo, bool releasebuf);
-static void gistvacuumpage(Relation rel, Page page, Buffer buffer,
+static void gistprunepage(Relation rel, Page page, Buffer buffer,
 			   Relation heapRel);
 
 
@@ -261,7 +261,7 @@ gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 	 */
 	if (is_split && GistPageIsLeaf(page) && GistPageHasGarbage(page))
 	{
-		gistvacuumpage(rel, page, buffer, heapRel);
+		gistprunepage(rel, page, buffer, heapRel);
 		is_split = gistnospace(page, itup, ntup, oldoffnum, freespace);
 	}
 
@@ -1544,11 +1544,11 @@ freeGISTstate(GISTSTATE *giststate)
 }
 
 /*
- * gistvacuumpage() -- try to remove LP_DEAD items from the given page.
+ * gistprunepage() -- try to remove LP_DEAD items from the given page.
  * Function assumes that buffer is exclusively locked.
  */
 static void
-gistvacuumpage(Relation rel, Page page, Buffer buffer, Relation heapRel)
+gistprunepage(Relation rel, Page page, Buffer buffer, Relation heapRel)
 {
 	OffsetNumber deletable[MaxIndexTuplesPerPage];
 	int			ndeletable = 0;
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index ccb147406c..c4ed1b5402 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -21,6 +21,34 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	GistNSN		startNSN;
+	BlockNumber totFreePages;	/* true total # of free pages */
+} GistVacState;
+
+static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state);
+static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
+			   BlockNumber orig_blkno);
+
+IndexBulkDeleteResult *
+gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+			   IndexBulkDeleteCallback callback, void *callback_state)
+{
+	/* allocate stats if first time through, else re-use existing struct */
+	if (stats == NULL)
+		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+
+	gistvacuumscan(info, stats, callback, callback_state);
+
+	return stats;
+}
 
 /*
  * VACUUM cleanup: update FSM
@@ -28,104 +56,36 @@
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
-	Relation	rel = info->index;
-	BlockNumber npages,
-				blkno;
-	BlockNumber totFreePages;
-	double		tuplesCount;
-	bool		needLock;
-
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
 
-	/* Set up all-zero stats if gistbulkdelete wasn't called */
+	/*
+	 * If gistbulkdelete was called, we need not do anything, just return the
+	 * stats from the latest gistbulkdelete call.  If it wasn't called, we
+	 * still need to do a pass over the index, to obtain index statistics.
+	 */
 	if (stats == NULL)
+	{
 		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+		gistvacuumscan(info, stats, NULL, NULL);
+	}
 
 	/*
-	 * Need lock unless it's local to this backend.
+	 * It's quite possible for us to be fooled by concurrent page splits into
+	 * double-counting some index tuples, so disbelieve any total that exceeds
+	 * the underlying heap's count ... if we know that accurately.  Otherwise
+	 * this might just make matters worse.
 	 */
-	needLock = !RELATION_IS_LOCAL(rel);
-
-	/* try to find deleted pages */
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	npages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-
-	totFreePages = 0;
-	tuplesCount = 0;
-	for (blkno = GIST_ROOT_BLKNO + 1; blkno < npages; blkno++)
+	if (!info->estimated_count)
 	{
-		Buffer		buffer;
-		Page		page;
-
-		vacuum_delay_point();
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-									info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		page = (Page) BufferGetPage(buffer);
-
-		if (PageIsNew(page) || GistPageIsDeleted(page))
-		{
-			totFreePages++;
-			RecordFreeIndexPage(rel, blkno);
-		}
-		else if (GistPageIsLeaf(page))
-		{
-			/* count tuples in index (considering only leaf tuples) */
-			tuplesCount += PageGetMaxOffsetNumber(page);
-		}
-		UnlockReleaseBuffer(buffer);
+		if (stats->num_index_tuples > info->num_heap_tuples)
+			stats->num_index_tuples = info->num_heap_tuples;
 	}
 
-	/* Finally, vacuum the FSM */
-	IndexFreeSpaceMapVacuum(info->index);
-
-	/* return statistics */
-	stats->pages_free = totFreePages;
-	if (needLock)
-		LockRelationForExtension(rel, ExclusiveLock);
-	stats->num_pages = RelationGetNumberOfBlocks(rel);
-	if (needLock)
-		UnlockRelationForExtension(rel, ExclusiveLock);
-	stats->num_index_tuples = tuplesCount;
-	stats->estimated_count = false;
-
 	return stats;
 }
 
-typedef struct GistBDItem
-{
-	GistNSN		parentlsn;
-	BlockNumber blkno;
-	struct GistBDItem *next;
-} GistBDItem;
-
-static void
-pushStackIfSplited(Page page, GistBDItem *stack)
-{
-	GISTPageOpaque opaque = GistPageGetOpaque(page);
-
-	if (stack->blkno != GIST_ROOT_BLKNO && !XLogRecPtrIsInvalid(stack->parentlsn) &&
-		(GistFollowRight(page) || stack->parentlsn < GistPageGetNSN(page)) &&
-		opaque->rightlink != InvalidBlockNumber /* sanity check */ )
-	{
-		/* split page detected, install right link to the stack */
-
-		GistBDItem *ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-
-		ptr->blkno = opaque->rightlink;
-		ptr->parentlsn = stack->parentlsn;
-		ptr->next = stack->next;
-		stack->next = ptr;
-	}
-}
-
-
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
@@ -134,141 +94,237 @@ pushStackIfSplited(Page page, GistBDItem *stack)
  *
  * Result: a palloc'd struct containing statistical info for VACUUM displays.
  */
-IndexBulkDeleteResult *
-gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void
+gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
-	GistBDItem *stack,
-			   *ptr;
+	GistVacState vstate;
+	BlockNumber num_pages;
+	bool		needLock;
+	BlockNumber blkno;
 
-	/* first time through? */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-	/* we'll re-count the tuples each time */
+	/*
+	 * Reset counts that will be incremented during the scan; needed in case
+	 * of multiple scans during a single VACUUM command.
+	 */
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
+	stats->pages_deleted = 0;
+
+	/* Set up info to pass down to gistvacuumpage */
+	vstate.info = info;
+	vstate.stats = stats;
+	vstate.callback = callback;
+	vstate.callback_state = callback_state;
+	if (RelationNeedsWAL(rel))
+		vstate.startNSN = GetInsertRecPtr();
+	else
+		vstate.startNSN = gistGetFakeLSN(rel);
+	vstate.totFreePages = 0;
 
-	stack = (GistBDItem *) palloc0(sizeof(GistBDItem));
-	stack->blkno = GIST_ROOT_BLKNO;
-
-	while (stack)
-	{
-		Buffer		buffer;
-		Page		page;
-		OffsetNumber i,
-					maxoff;
-		IndexTuple	idxtuple;
-		ItemId		iid;
-
-		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, stack->blkno,
-									RBM_NORMAL, info->strategy);
-		LockBuffer(buffer, GIST_SHARE);
-		gistcheckpage(rel, buffer);
-		page = (Page) BufferGetPage(buffer);
-
-		if (GistPageIsLeaf(page))
-		{
-			OffsetNumber todelete[MaxOffsetNumber];
-			int			ntodelete = 0;
+	/*
+	 * Need lock unless it's local to this backend.
+	 */
+	needLock = !RELATION_IS_LOCAL(rel);
 
-			LockBuffer(buffer, GIST_UNLOCK);
-			LockBuffer(buffer, GIST_EXCLUSIVE);
+	/*
+	 * The outer loop iterates over all index pages, in physical order (we
+	 * hope the kernel will cooperate in providing read-ahead for speed).  It
+	 * is critical that we visit all leaf pages, including ones added after we
+	 * start the scan, else we might fail to delete some deletable tuples.
+	 * Hence, we must repeatedly check the relation length.  We must acquire
+	 * the relation-extension lock while doing so to avoid a race condition:
+	 * if someone else is extending the relation, there is a window where
+	 * bufmgr/smgr have created a new all-zero page but it hasn't yet been
+	 * write-locked by gistNewBuffer().  If we manage to scan such a page
+	 * here, we'll improperly assume it can be recycled.  Taking the lock
+	 * synchronizes things enough to prevent a problem: either num_pages won't
+	 * include the new page, or gistNewBuffer already has write lock on the
+	 * buffer and it will be fully initialized before we can examine it.  (See
+	 * also vacuumlazy.c, which has the same issue.)	Also, we need not worry
+	 * if a page is added immediately after we look; the page splitting code
+	 * already has write-lock on the left page before it adds a right page, so
+	 * we must already have processed any tuples due to be moved into such a
+	 * page.
+	 *
+	 * We can skip locking for new or temp relations, however, since no one
+	 * else could be accessing them; the needLock flag computed above
+	 * already covers that case.
+	 */
 
-			page = (Page) BufferGetPage(buffer);
-			if (stack->blkno == GIST_ROOT_BLKNO && !GistPageIsLeaf(page))
-			{
-				/* only the root can become non-leaf during relock */
-				UnlockReleaseBuffer(buffer);
-				/* one more check */
-				continue;
-			}
+	blkno = GIST_ROOT_BLKNO;
+	for (;;)
+	{
+		/* Get the current relation length */
+		if (needLock)
+			LockRelationForExtension(rel, ExclusiveLock);
+		num_pages = RelationGetNumberOfBlocks(rel);
+		if (needLock)
+			UnlockRelationForExtension(rel, ExclusiveLock);
+
+		/* Quit if we've scanned the whole relation */
+		if (blkno >= num_pages)
+			break;
+		/* Iterate over pages, then loop back to recheck length */
+		for (; blkno < num_pages; blkno++)
+			gistvacuumpage(&vstate, blkno, blkno);
+	}
 
-			/*
-			 * check for split proceeded after look at parent, we should check
-			 * it after relock
-			 */
-			pushStackIfSplited(page, stack);
+	/*
+	 * If we found any recyclable pages (and recorded them in the FSM), then
+	 * forcibly update the upper-level FSM pages to ensure that searchers can
+	 * find them.  It's possible that the pages were also found during
+	 * previous scans and so this is a waste of time, but it's cheap enough
+	 * relative to scanning the index that it shouldn't matter much, and
+	 * making sure that free pages are available sooner not later seems
+	 * worthwhile.
+	 *
+	 * Note that if no recyclable pages exist, we don't bother vacuuming the
+	 * FSM at all.
+	 */
+	if (vstate.totFreePages > 0)
+		IndexFreeSpaceMapVacuum(rel);
 
-			/*
-			 * Remove deletable tuples from page
-			 */
+	/* update statistics */
+	stats->num_pages = num_pages;
+	stats->pages_free = vstate.totFreePages;
+}
 
-			maxoff = PageGetMaxOffsetNumber(page);
+/*
+ * gistvacuumpage --- VACUUM one page
+ *
+ * This processes a single page for gistbulkdelete().  In some cases we
+ * must go back and re-examine previously-scanned pages; this routine
+ * recurses when necessary to handle that case.
+ *
+ * blkno is the page to process.  orig_blkno is the highest block number
+ * reached by the outer gistvacuumscan loop (the same as blkno, unless we
+ * are recursing to re-examine a previous page).
+ */
+static void
+gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
+{
+	IndexVacuumInfo *info = vstate->info;
+	IndexBulkDeleteResult *stats = vstate->stats;
+	IndexBulkDeleteCallback callback = vstate->callback;
+	void	   *callback_state = vstate->callback_state;
+	Relation	rel = info->index;
+	Buffer		buffer;
+	Page		page;
+	BlockNumber recurse_to;
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
-			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
+restart:
+	recurse_to = InvalidBlockNumber;
 
-				if (callback(&(idxtuple->t_tid), callback_state))
-					todelete[ntodelete++] = i;
-				else
-					stats->num_index_tuples += 1;
-			}
+	/* call vacuum_delay_point while not holding any buffer lock */
+	vacuum_delay_point();
 
-			stats->tuples_removed += ntodelete;
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+								info->strategy);
 
-			if (ntodelete)
-			{
-				START_CRIT_SECTION();
+	/*
+	 * We are not going to stay here for long; aggressively grab an
+	 * exclusive lock.
+	 */
+	LockBuffer(buffer, GIST_EXCLUSIVE);
+	page = (Page) BufferGetPage(buffer);
 
-				MarkBufferDirty(buffer);
+	if (PageIsNew(page) || GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		vstate->totFreePages++;
+		RecordFreeIndexPage(rel, blkno);
+		return;
+	}
 
-				PageIndexMultiDelete(page, todelete, ntodelete);
-				GistMarkTuplesDeleted(page);
+	if (GistPageIsLeaf(page))
+	{
+		OffsetNumber todelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		GISTPageOpaque opaque = GistPageGetOpaque(page);
+		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
+
+		/*
+		 * Check whether we need to recurse back to earlier pages.  What we
+		 * are concerned about is a page split that happened since we started
+		 * the vacuum scan.  If the split moved some tuples to a lower page
+		 * then we might have missed 'em.  If so, set up for tail recursion.
+		 */
+		if ((GistFollowRight(page) ||
+			 vstate->startNSN < GistPageGetNSN(page)) &&
+			(opaque->rightlink != InvalidBlockNumber) &&
+			(opaque->rightlink < orig_blkno))
+		{
+			recurse_to = opaque->rightlink;
+		}
 
-				if (RelationNeedsWAL(rel))
-				{
-					XLogRecPtr	recptr;
+		/*
+		 * Scan over all items to see which ones need to be deleted according
+		 * to the callback function.
+		 */
+		if (callback)
+		{
+			OffsetNumber off;
 
-					recptr = gistXLogUpdate(buffer,
-											todelete, ntodelete,
-											NULL, 0, InvalidBuffer);
-					PageSetLSN(page, recptr);
-				}
-				else
-					PageSetLSN(page, gistGetFakeLSN(rel));
+			for (off = FirstOffsetNumber; off <= maxoff; off = OffsetNumberNext(off))
+			{
+				ItemId		iid = PageGetItemId(page, off);
+				IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
 
-				END_CRIT_SECTION();
+				if (callback(&(idxtuple->t_tid), callback_state))
+					todelete[ntodelete++] = off;
 			}
-
 		}
-		else
+
+		/*
+		 * Apply any needed deletes.  We issue just one WAL record per page,
+		 * so as to minimize WAL traffic.
+		 */
+		if (ntodelete)
 		{
-			/* check for split proceeded after look at parent */
-			pushStackIfSplited(page, stack);
+			START_CRIT_SECTION();
 
-			maxoff = PageGetMaxOffsetNumber(page);
+			MarkBufferDirty(buffer);
 
-			for (i = FirstOffsetNumber; i <= maxoff; i = OffsetNumberNext(i))
+			PageIndexMultiDelete(page, todelete, ntodelete);
+			GistMarkTuplesDeleted(page);
+
+			if (RelationNeedsWAL(rel))
 			{
-				iid = PageGetItemId(page, i);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-
-				ptr = (GistBDItem *) palloc(sizeof(GistBDItem));
-				ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				ptr->parentlsn = BufferGetLSNAtomic(buffer);
-				ptr->next = stack->next;
-				stack->next = ptr;
-
-				if (GistTupleIsInvalid(idxtuple))
-					ereport(LOG,
-							(errmsg("index \"%s\" contains an inner tuple marked as invalid",
-									RelationGetRelationName(rel)),
-							 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
-							 errhint("Please REINDEX it.")));
+				XLogRecPtr	recptr;
+
+				recptr = gistXLogUpdate(buffer,
+										todelete, ntodelete,
+										NULL, 0, InvalidBuffer);
+				PageSetLSN(page, recptr);
 			}
-		}
+			else
+				PageSetLSN(page, gistGetFakeLSN(rel));
 
-		UnlockReleaseBuffer(buffer);
+			END_CRIT_SECTION();
+
+			stats->tuples_removed += ntodelete;
+			/* must recompute maxoff */
+			maxoff = PageGetMaxOffsetNumber(page);
+		}
 
-		ptr = stack->next;
-		pfree(stack);
-		stack = ptr;
+		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
 
-		vacuum_delay_point();
 	}
 
-	return stats;
+	UnlockReleaseBuffer(buffer);
+
+	/*
+	 * This is really tail recursion, but if the compiler is too stupid to
+	 * optimize it as such, we'd eat an uncomfortably large amount of stack
+	 * space per recursion level (due to the todelete[] array). A failure is
+	 * improbable since the number of levels isn't likely to be large ... but
+	 * just in case, let's hand-optimize into a loop.
+	 */
+	if (recurse_to != InvalidBlockNumber)
+	{
+		blkno = recurse_to;
+		goto restart;
+	}
 }
-- 
2.19.2

#43Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Andrey Borodin (#42)
1 attachment(s)
Re: GiST VACUUM

On 3 January 2019, at 23:47, Andrey Borodin <x4mmm@yandex-team.ru> wrote:

* Bitmapset stores 32 bit signed integers, but BlockNumber is unsigned. So this will fail with an index larger than 2^31 blocks. That's perhaps academical, I don't think anyone will try to create a 16 TB GiST index any time soon. But it feels wrong to introduce an arbitrary limitation like that.

Looks like Bitmapset is unsuitable for collecting block numbers due to its interface. Let's just create a custom container for this?
Or there is one other option: for each block number, throw away the sign bit and consider the page potentially internal and potentially an empty leaf if (blkno & 0x7FFFFFFF) is in the bitmapset.
And then we will have to create at least one 17 TB GiST to check that it actually works.
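
To make option 2 concrete, a minimal sketch of what I mean (untested; blockset_mark/blockset_probe are made-up names for illustration, and the code assumes nodes/bitmapset.h). A collision between blkno and (blkno | 0x80000000) only makes membership conservative: a page may be considered potentially empty when it is not, which is safe because every candidate is re-checked under lock anyway.

static Bitmapset *
blockset_mark(Bitmapset *bs, BlockNumber blkno)
{
	/* drop the sign bit so the value fits Bitmapset's signed int members */
	return bms_add_member(bs, (int) (blkno & 0x7FFFFFFF));
}

static bool
blockset_probe(Bitmapset *bs, BlockNumber blkno)
{
	return bms_is_member((int) (blkno & 0x7FFFFFFF), bs);
}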

Heikki, what do you think: is implementing our own radix tree a viable solution for this?
I've written a working implementation with a 4-level statically typed tree. If we follow this route, there should probably be tests for this data structure.
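
To make the intended interface concrete, the usage pattern the vacuum code needs is just "add member" during the physical scan plus ordered iteration afterwards; a sketch mirroring the API in the attached diff:

	BlockSet	internalPagesMap = NULL;
	BlockNumber	blkno;

	/* first pass: remember each internal page as the physical scan sees it */
	internalPagesMap = blockset_set(internalPagesMap, blkno);

	/* second pass: walk the remembered pages in increasing block order */
	blkno = InvalidBlockNumber;
	while ((blkno = blockset_next(internalPagesMap, blkno)) != InvalidBlockNumber)
	{
		/* ... re-examine internal page blkno ... */
	}
	blockset_free(internalPagesMap);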

Best regards, Andrey Borodin.

Attachments:

radix_tree.diffapplication/octet-stream; name=radix_tree.diff; x-unix-mode=0644Download
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index c458bfb565..06d94d2545 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -23,6 +23,190 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+/* Lowest level of radix tree is represented by bitmap */
+typedef struct
+{
+	char data[256/8];
+} BlockSetLevel4Data;
+
+typedef BlockSetLevel4Data *BlockSetLevel4;
+
+/* statically typed inner level chunks points to ground level */
+typedef struct
+{
+	/* null references denote empty subtree */
+	BlockSetLevel4 next[256];
+} BlockSetLevel3Data;
+
+typedef BlockSetLevel3Data *BlockSetLevel3;
+
+/* inner level points to another inner level */
+typedef struct
+{
+	BlockSetLevel3 next[256];
+} BlockSetLevel2Data;
+
+typedef BlockSetLevel2Data *BlockSetLevel2;
+
+/* Radix tree root */
+typedef struct
+{
+	BlockSetLevel2 next[256];
+} BlockSetData;
+
+typedef BlockSetData *BlockSet;
+
+/* multiplex block number into indexes of radix tree */
+#define BLOCKSET_SPLIT_BLKNO				\
+	int i1, i2, i3, i4, byte_no, byte_mask;	\
+	i4 = blkno % 256;						\
+	blkno /= 256;							\
+	i3 = blkno % 256;						\
+	blkno /= 256;							\
+	i2 = blkno % 256;						\
+	blkno /= 256;							\
+	i1 = blkno;								\
+	byte_no = i4 / 8;						\
+	byte_mask = 1 << (i4 % 8)				\
+
+/* indicate presence of block number in set */
+static
+BlockSet blockset_set(BlockSet bs, BlockNumber blkno)
+{
+	BLOCKSET_SPLIT_BLKNO;
+	if (bs == NULL)
+	{
+		bs = palloc0(sizeof(BlockSetData));
+	}
+	BlockSetLevel2 bs2 = bs->next[i1];
+	if (bs2 == NULL)
+	{
+		bs2 = palloc0(sizeof(BlockSetLevel2Data));
+		bs->next[i1] = bs2;
+	}
+	BlockSetLevel3 bs3 = bs2->next[i2];
+	if (bs3 == NULL)
+	{
+		bs3 = palloc0(sizeof(BlockSetLevel3Data));
+		bs2->next[i2] = bs3;
+	}
+	BlockSetLevel4 bs4 = bs3->next[i3];
+	if (bs4 == NULL)
+	{
+		bs4 = palloc0(sizeof(BlockSetLevel4Data));
+		bs3->next[i3] = bs4;
+	}
+	bs4->data[byte_no] = byte_mask | bs4->data[byte_no];
+	return bs;
+}
+
+/* Test presence of block in set */
+static
+bool blockset_get(BlockNumber blkno, BlockSet bs)
+{
+	BLOCKSET_SPLIT_BLKNO;
+	if (bs == NULL)
+		return false;
+	BlockSetLevel2 bs2 = bs->next[i1];
+	if (bs2 == NULL)
+		return false;
+	BlockSetLevel3 bs3 = bs2->next[i2];
+	if (bs3 == NULL)
+		return false;
+	BlockSetLevel4 bs4 = bs3->next[i3];
+	if (bs4 == NULL)
+		return false;
+	return (bs4->data[byte_no] & byte_mask);
+}
+
+/* 
+ * Find the nearest block number in the set strictly greater than blkno.
+ * Returns InvalidBlockNumber if there is none.
+ * If given InvalidBlockNumber, returns the smallest element in the set.
+ */
+static
+BlockNumber blockset_next(BlockSet bs, BlockNumber blkno)
+{
+	if (blkno == InvalidBlockNumber)
+		blkno = 0; /* equivalent to ++; spelled out for clarity */
+	else
+		blkno++;
+
+	BLOCKSET_SPLIT_BLKNO;
+
+	if (bs == NULL)
+		return InvalidBlockNumber;
+	for (; i1 < 256; i1++)
+	{
+		BlockSetLevel2 bs2 = bs->next[i1];
+		if (!bs2)
+			continue;
+		for (; i2 < 256; i2++)
+		{
+			BlockSetLevel3 bs3 = bs2->next[i2];
+			if (!bs3)
+				continue;
+			for (; i3 < 256; i3++)
+			{
+				BlockSetLevel4 bs4 = bs3->next[i3];
+				if (!bs4)
+					continue;
+				for (; byte_no < 256 / 8; byte_no++)
+				{
+					if (!bs4->data[byte_no])
+						continue;
+					while (byte_mask < 256)
+					{
+						if ((byte_mask & bs4->data[byte_no]) == byte_mask)
+						{
+							i4 = byte_no * 8;
+							while (byte_mask >>= 1) i4++;
+							return i4 + 256 * (i3 + 256 * (i2 + 256 * i1));
+						}
+						byte_mask <<= 1;
+					}
+					byte_mask = 1;
+				}
+				byte_no = 0;
+			}
+			i3 = 0;
+		}
+		i2 = 0;
+	}
+	return InvalidBlockNumber;
+}
+
+/* free anything palloced */
+static
+void blockset_free(BlockSet bs)
+{
+	BlockNumber blkno = 0;
+	BLOCKSET_SPLIT_BLKNO;
+	if (bs == NULL)
+		return;
+	for (; i1 < 256; i1++)
+	{
+		BlockSetLevel2 bs2 = bs->next[i1];
+		if (!bs2)
+			continue;
+		for (; i2 < 256; i2++)
+		{
+			BlockSetLevel3 bs3 = bs2->next[i2];
+			if (!bs3)
+				continue;
+			for (; i3 < 256; i3++)
+			{
+				BlockSetLevel4 bs4 = bs3->next[i3];
+				if (bs4)
+					pfree(bs4);
+			}
+			pfree(bs3);
+		}
+		pfree(bs2);
+	}
+	pfree(bs);
+}
+
 /* Working state needed by gistbulkdelete */
 typedef struct
 {
@@ -34,8 +218,8 @@ typedef struct
 	BlockNumber totFreePages;	/* true total # of free pages */
 	BlockNumber emptyPages;
 
-	Bitmapset  *internalPagesMap;
-	Bitmapset  *emptyLeafPagesMap;
+	BlockSet	internalPagesMap;
+	BlockSet	emptyLeafPagesMap;
 } GistVacState;
 
 static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -53,6 +237,7 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 	gistvacuumscan(info, stats, callback, callback_state);
 
+
 	return stats;
 }
 
@@ -174,11 +359,6 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		if (blkno >= num_pages)
 			break;
 
-		if (!vstate.internalPagesMap)
-			vstate.internalPagesMap = bms_make_empty(num_pages);
-		if (!vstate.emptyLeafPagesMap)
-			vstate.emptyLeafPagesMap = bms_make_empty(num_pages);
-
 		/* Iterate over pages, then loop back to recheck length */
 		for (; blkno < num_pages; blkno++)
 			gistvacuumpage(&vstate, blkno, blkno);
@@ -206,11 +386,11 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* rescan all inner pages to find those that has empty child pages */
 	if (vstate.emptyPages > 0)
 	{
-		int			x;
+		BlockNumber			x;
 
-		x = -1;
+		x = InvalidBlockNumber;
 		while (vstate.emptyPages > 0 &&
-			   (x = bms_next_member(vstate.internalPagesMap, x)) >= 0)
+			   (x = blockset_next(vstate.internalPagesMap, x)) != InvalidBlockNumber)
 		{
 			Buffer		buffer;
 			Page		page;
@@ -248,7 +428,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 				idxtuple = (IndexTuple) PageGetItem(page, iid);
 				/* if this page was not empty in previous scan - we do not consider it */
 				leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				if (!bms_is_member(leafBlockNo, vstate.emptyLeafPagesMap))
+				if (!blockset_get(leafBlockNo, vstate.emptyLeafPagesMap))
 					continue;
 
 				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
@@ -321,8 +501,8 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		}
 	}
 
-	bms_free(vstate.emptyLeafPagesMap);
-	bms_free(vstate.internalPagesMap);
+	blockset_free(vstate.emptyLeafPagesMap);
+	blockset_free(vstate.internalPagesMap);
 }
 
 /*
@@ -447,7 +627,7 @@ restart:
 		nremain = maxoff - FirstOffsetNumber + 1;
 		if (nremain == 0)
 		{
-			vstate->emptyLeafPagesMap = bms_add_member(vstate->emptyLeafPagesMap, blkno);
+			vstate->emptyLeafPagesMap = blockset_set(vstate->emptyLeafPagesMap, blkno);
 			vstate->emptyPages++;
 		}
 		else
@@ -455,7 +635,7 @@ restart:
 	}
 	else
 	{
-		vstate->internalPagesMap = bms_add_member(vstate->internalPagesMap, blkno);
+		vstate->internalPagesMap = blockset_set(vstate->internalPagesMap, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
#44Michael Paquier
michael@paquier.xyz
In reply to: Andrey Borodin (#43)
Re: GiST VACUUM

On Fri, Jan 04, 2019 at 06:26:18PM +0500, Andrey Borodin wrote:

Heikki, how do you think, is implementing our own radix tree for
this is viable solution?
I've written working implementation with 4-level statically typed
tree. If we follow this route, probably, there must be tests for
this data structure.

(Note that the latest patch does not apply.)
Heikki, are you planning to answer to the questions and/or review the
patch? I have moved it to next CF for now.
--
Michael

#45Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#43)
Re: GiST VACUUM

On 04/01/2019 21:26, Andrey Borodin wrote:

3 янв. 2019 г., в 23:47, Andrey Borodin <x4mmm@yandex-team.ru>
написал(а):

* Bitmapset stores 32 bit signed integers, but BlockNumber is
unsigned. So this will fail with an index larger than 2^31
blocks. That's perhaps academical, I don't think anyone will try
to create a 16 TB GiST index any time soon. But it feels wrong to
introduce an arbitrary limitation like that.

Looks like Bitmapset is unsuitable for collecting block numbers due
to its interface. Let's just create a custom container for this? Or
there is one other option: for each block number, throw away the
sign bit and consider the page potentially internal and potentially
an empty leaf if (blkno & 0x7FFFFFFF) is in the bitmapset. And then
we will have to create at least one 17 TB GiST to check that it
actually works.

Heikki, what do you think: is implementing our own radix tree a
viable solution for this? I've written a working implementation with
a 4-level statically typed tree. If we follow this route, there
should probably be tests for this data structure.

Yeah, seems reasonable.

- Heikki

#46Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#42)
2 attachment(s)
Re: GiST VACUUM

On 04/01/2019 02:47, Andrey Borodin wrote:

On 2 January 2019, at 20:35, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

In patch #1, to do the vacuum scan in physical order:
...
I think this is ready to be committed, except that I didn't do any testing. We discussed the need for testing earlier. Did you write some test scripts for this, or how have you been testing?

Please see test I used to check left jumps for v18:
0001-Physical-GiST-scan-in-VACUUM-v18-with-test-modificat.patch
0002-Test-left-jumps-v18.patch

To trigger FollowRight GiST sometimes forget to clear follow-right marker simulating crash of an insert. This fills logs with "fixing incomplete split" messages. Search for "REMOVE THIS" to disable these ill-behavior triggers.
To trigger NSN jump GiST allocate empty page after every real allocation.

In log output I see
2019-01-03 22:27:30.028 +05 [54596] WARNING: RESCAN TRIGGERED BY NSN
WARNING: RESCAN TRIGGERED BY NSN
2019-01-03 22:27:30.104 +05 [54596] WARNING: RESCAN TRIGGERED BY FollowRight
This means that code paths were really executed (for v18).

Thanks! As I noted at
/messages/by-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e@iki.fi,
the test patch left the index corrupt. I fixed it so that it leaves
behind incompletely split pages, without the corruption, see attached.
It's similar to yours, but let me recap what it does:

* Hacks gistbuild() to create 100 empty pages immediately after the root
page. They are leaked, so they won't be reused until a VACUUM sees
them and puts them in the FSM

* Hacks gistinserttuples() to leave the split incomplete with 50%
probability

* In vacuum, print a line to the log whenever it needs to "jump left"

I used this patch, with the attached test script that's similar to
yours, but it also tries to verify that the index returns correct
results. It prints a result set like this:

sum
---------
-364450
364450
(2 rows)

If the two numbers add up to 0, the index seems valid. And you should
see "RESCAN" lines in the log, to show that jumping left happened.
Because the behavior is random and racy, you may need to run the script
many times to see both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED
BY FollowRight" cases. Especially the "FollowRight" case happens less
frequently than the NSN case, you might need to run the script > 10
times to see it.
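
(The script itself is attached rather than quoted. In essence it computes the same aggregate twice, once through a sequential scan and once through the GiST index, with one result negated; the exact queries in the script may differ from this sketch:

	set enable_seqscan = on;  set enable_indexscan = off; set enable_bitmapscan = off;
	select -sum(id) from gist_point_tbl where p <@ box '((0,0),(20000000,20000000))';
	set enable_seqscan = off; set enable_indexscan = on;  set enable_bitmapscan = on;
	select  sum(id) from gist_point_tbl where p <@ box '((0,0),(20000000,20000000))';

If the index has lost or duplicated entries, the two sums no longer cancel out.)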

I also tried your amcheck tool with this. It did not report any errors.

Attached is also latest version of the patch itself. It is the same as
your latest patch v19, except for some tiny comment kibitzing. I'll mark
this as Ready for Committer in the commitfest app, and will try to
commit it in the next couple of days.

- Heikki

Attachments:

gist-vacuum-test.shapplication/x-shellscript; name=gist-vacuum-test.shDownload
Test-left-jumps-2.patchtext/x-patch; name=Test-left-jumps-2.patchDownload
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 3f52b8f4dc..cad4a2a46e 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -1187,6 +1187,8 @@ gistinserttuple(GISTInsertState *state, GISTInsertStack *stack,
 							InvalidBuffer, InvalidBuffer, false, false);
 }
 
+static bool HACK = false;
+
 /* ----------------
  * An extended workhorse version of gistinserttuple(). This version allows
  * inserting multiple tuples, or replacing a single tuple with multiple tuples.
@@ -1230,6 +1232,14 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 	CheckForSerializableConflictIn(state->r, NULL, stack->buffer);
 
 	/* Insert the tuple(s) to the page, splitting the page if necessary */
+	if (BufferIsValid(leftchild) && HACK)
+	{
+		/* skip actually inserting */
+		splitinfo = NULL;
+		is_split = false;
+	}
+	else
+	{
 	is_split = gistplacetopage(state->r, state->freespace, giststate,
 							   stack->buffer,
 							   tuples, ntup,
@@ -1238,6 +1248,7 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 							   &splitinfo,
 							   true,
 							   state->heapRel);
+	}
 
 	/*
 	 * Before recursing up in case the page was split, release locks on the
@@ -1256,7 +1267,12 @@ gistinserttuples(GISTInsertState *state, GISTInsertStack *stack,
 	 * didn't have to split, release it ourselves.
 	 */
 	if (splitinfo)
+	{
+		if (random() % 2 == 0)
+			HACK = true;
 		gistfinishsplit(state, stack, giststate, splitinfo, unlockbuf);
+		HACK = false;
+	}
 	else if (unlockbuf)
 		LockBuffer(stack->buffer, GIST_UNLOCK);
 
diff --git a/src/backend/access/gist/gistbuild.c b/src/backend/access/gist/gistbuild.c
index bd142a3560..fdfc54b009 100644
--- a/src/backend/access/gist/gistbuild.c
+++ b/src/backend/access/gist/gistbuild.c
@@ -201,6 +201,9 @@ gistbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	buildstate.indtuples = 0;
 	buildstate.indtuplesSize = 0;
 
+	for (int i = 0; i < 100; i++)
+		ReleaseBuffer(ReadBuffer(index, P_NEW));
+
 	/*
 	 * Do the heap scan.
 	 */
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 1cbcb038f7..c306621e1e 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -254,6 +254,11 @@ restart:
 			(opaque->rightlink != InvalidBlockNumber) &&
 			(opaque->rightlink < orig_blkno))
 		{
+			if (GistFollowRight(page))
+				elog(LOG, "RESCAN %u->%u TRIGGERED BY FollowRight", blkno, opaque->rightlink);
+			if (vstate->startNSN < GistPageGetNSN(page))
+				elog(LOG,"RESCAN %u->%u TRIGGERED BY NSN", blkno, opaque->rightlink);
+
 			recurse_to = opaque->rightlink;
 		}
 
#47Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#46)
Re: GiST VACUUM

Hi!

Thanks for fixing gist amcheck! The idea of using these two patches together seems so obvious now, but it never actually crossed my mind before.

On 4 March 2019, at 18:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Thanks! As I noted at /messages/by-id/2ff57b1f-01b4-eacf-36a2-485a12017f6e@iki.fi, the test patch left the index corrupt. I fixed it so that it leaves behind incompletely split pages, without the corruption, see attached. It's similar to yours, but let me recap what it does:

* Hacks gistbuild() to create 100 empty pages immediately after the root page. They are leaked, so they won't be reused until a VACUUM sees them and puts them in the FSM

* Hacks gistinserttuples() to leave the split incomplete with 50% probability

* In vacuum, print a line to the log whenever it needs to "jump left"

I used this patch, with the attached test script that's similar to yours, but it also tries to verify that the index returns correct results. It prints a result set like this:

sum
---------
-364450
364450
(2 rows)

If the two numbers add up to 0, the index seems valid. And you should see "RESCAN" lines in the log, to show that jumping left happened. Because the behavior is random and racy, you may need to run the script many times to see both "RESCAN TRIGGERED BY NSN" and "RESCAN TRIGGERED BY FollowRight" cases. In particular, the "FollowRight" case happens less frequently than the NSN case; you might need to run the script more than 10 times to see it.

Great! I've repeated your tests on my machine and observe similar frequencies for the causes of rescan left jumps.

I also tried your amcheck tool with this. It did not report any errors.

Attached is also latest version of the patch itself. It is the same as your latest patch v19, except for some tiny comment kibitzing. I'll mark this as Ready for Committer in the commitfest app, and will try to commit it in the next couple of days.

That's cool! I'll work on the 2nd step of this patch set to make the blockset data structure prettier and less hacky.

Best regards, Andrey Borodin.

#48Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#47)
Re: GiST VACUUM

On 05/03/2019 02:26, Andrey Borodin wrote:

I also tried your amcheck tool with this. It did not report any
errors.

Attached is also latest version of the patch itself. It is the
same as your latest patch v19, except for some tiny comment
kibitzing. I'll mark this as Ready for Committer in the commitfest
app, and will try to commit it in the next couple of days.

That's cool! I'll work on 2nd step of these patchset to make
blockset data structure prettier and less hacky.

Committed the first patch. Thanks for the patch!

I did some last-minute copy-editing on the comments, and fixed one
little thing that we missed earlier: the check for "invalid tuples" that
might be left over in pg_upgraded pre-9.1 indexes was lost at some
point. I added that check back. (It would be nice if GiST indexes had a
metadata page with a version number, so we could avoid that work in the
99% of cases where the check is not needed, but that's a different story.)

I'll change the status of this patch to "Waiting on Author", to reflect
the state of the 2nd patch, since you're working on the radix tree
blockset stuff.

- Heikki

#49Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#48)
2 attachment(s)
Re: GiST VACUUM

On 5 March 2019, at 18:21, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 05/03/2019 02:26, Andrey Borodin wrote:

That's cool! I'll work on 2nd step of these patchset to make
blockset data structure prettier and less hacky.

Committed the first patch. Thanks for the patch!

That's cool! Thanks!

I'll change the status of this patch to "Waiting on Author", to reflect the state of the 2nd patch, since you're working on the radix tree blockset stuff.

Here's a new version of the patch for actual page deletion.
Changes:
1. Fixed a possible concurrency issue:
We were locking the child page while holding a lock on the internal page, while the notes in the GiST README recommend locking the child before the parent.
So now we unlock the internal page before locking the children for deletion, then lock it again, check that everything is still in place, and only then delete.
One thing still bothers me. Suppose we have an internal page with 2 deletable leaves, and we lock these leaves in the order of their items on the internal page. Is it possible that the 2nd page has a follow-right link to the 1st, and someone locks the 2nd page, tries to lock the 1st, and deadlocks with VACUUM? (See the sketch after item 2.)
2. Added radix-tree-based block set to lib, with tests.
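
To make the schedule from item 1 concrete (a hypothetical interleaving; leaf1 and leaf2 are the two empty leaves, named in the order their downlinks appear on the internal page):

	VACUUM:           lock(leaf1); ...;        lock(leaf2)  <- waits for scan
	concurrent scan:  lock(leaf2); rightlink-> lock(leaf1)  <- waits for VACUUM: deadlock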

Best regards, Andrey Borodin.

Attachments:

0002-Delete-pages-during-GiST-VACUUM.patchapplication/octet-stream; name=0002-Delete-pages-during-GiST-VACUUM.patch; x-unix-mode=0644Download
From b8764ef0dd02bad609969520485f408a726052ae Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Fri, 8 Mar 2019 21:24:37 +0500
Subject: [PATCH 2/2] Delete pages during GiST VACUUM

This commit teaches GiST to actually delete pages during VACUUM.

To do this we scan the index twice. On the first pass we make note
of empty leaf pages and of internal pages. On the second pass we scan
through the internal pages looking for downlinks to the empty leaf pages.
---
 src/backend/access/gist/README         |  14 ++
 src/backend/access/gist/gist.c         |  18 ++
 src/backend/access/gist/gistutil.c     |   3 +-
 src/backend/access/gist/gistvacuum.c   | 218 ++++++++++++++++++++++++-
 src/backend/access/gist/gistxlog.c     |  60 +++++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/include/access/gist.h              |   4 +
 src/include/access/gist_private.h      |   7 +-
 src/include/access/gistxlog.h          |  10 +-
 src/test/regress/expected/gist.out     |   4 +-
 src/test/regress/sql/gist.sql          |   4 +-
 11 files changed, 335 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..c84359de31 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,20 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused for newly split pages. To find these pages, the
+function scans the index in physical order.
+
+The physical scan reads the entire index from the first page to the last. The
+scan collects the block numbers of internal pages that need cleaning and the
+block numbers of empty leaf pages.
+
+After the scan, each internal page is re-examined under an exclusive lock and
+its downlinks to empty leaf pages are removed. gistbulkdelete() never deletes
+the last remaining downlink on a page, to preserve the balanced tree property.
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 5ea774661a..dd36cfdf63 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -704,6 +704,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum,
+			 * so we do not need to check whether this page has been deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -838,6 +843,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until
+			 * their space is reused. The downlink to such a page may already
+			 * have been removed from the internal page, but this scan can
+			 * still hold a pointer to it.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 8d3dfad27b..61012cded7 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/float.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 #include "utils/lsyscache.h"
 
 
@@ -807,7 +808,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 3c1d75691e..fc606ac823 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,11 +16,15 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
+#include "lib/blockset.h"
 #include "miscadmin.h"
+#include "nodes/bitmapset.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+
 /* Working state needed by gistbulkdelete */
 typedef struct
 {
@@ -30,6 +34,10 @@ typedef struct
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
+	BlockNumber emptyPages;
+
+	BlockSet	internalPagesMap;
+	BlockSet	emptyLeafPagesMap;
 } GistVacState;
 
 static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -50,6 +58,7 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 	gistvacuumscan(info, stats, callback, callback_state);
 
+
 	return stats;
 }
 
@@ -103,6 +112,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * while the index is being expanded, leaving an all-zeros page behind.
  *
  * The caller is responsible for initially allocating/zeroing a stats struct.
+ *
+ * Bulk deletion of all index entries pointing to a set of heap tuples, plus
+ * a check for invalid tuples left over after a pg_upgrade.
+ * The set of target tuples is specified via a callback routine that tells
+ * whether any given heap tuple (identified by ItemPointer) is being deleted.
  */
 static void
 gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -132,6 +146,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	else
 		vstate.startNSN = gistGetFakeLSN(rel);
 	vstate.totFreePages = 0;
+	vstate.emptyPages = 0;
+	vstate.internalPagesMap = NULL;
+	vstate.emptyLeafPagesMap = NULL;
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -171,6 +188,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (blkno >= num_pages)
 			break;
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; blkno < num_pages; blkno++)
 			gistvacuumpage(&vstate, blkno, blkno);
@@ -194,6 +212,194 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* update statistics */
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
+
+	/* rescan all inner pages to find those that have empty child pages */
+	if (vstate.emptyPages > 0)
+	{
+		BlockNumber			x;
+
+		x = InvalidBlockNumber;
+		while (vstate.emptyPages > 0 &&
+			   (x = blockset_next(vstate.internalPagesMap, x)) != InvalidBlockNumber)
+		{
+			Buffer		buffer;
+			Page		page;
+			OffsetNumber off,
+				maxoff;
+			IndexTuple  idxtuple;
+			ItemId	    iid;
+			OffsetNumber todelete[MaxOffsetNumber];
+			Buffer		buftodelete[MaxOffsetNumber];
+			int			ntodelete = 0;
+
+			blkno = (BlockNumber) x;
+
+			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+										info->strategy);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			page = (Page) BufferGetPage(buffer);
+			if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+			{
+				UnlockReleaseBuffer(buffer);
+				continue;
+			}
+
+			maxoff = PageGetMaxOffsetNumber(page);
+			/* Check that the leaves are still empty and decide what to delete */
+			for (off = FirstOffsetNumber; off <= maxoff && ntodelete < maxoff-1; off = OffsetNumberNext(off))
+			{
+				Buffer		leafBuffer;
+				Page		leafPage;
+				BlockNumber leafBlockNo;
+
+				iid = PageGetItemId(page, off);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				/* if this page was not empty in previous scan - we do not consider it */
+				leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+				if (!blockset_get(leafBlockNo, vstate.emptyLeafPagesMap))
+					continue;
+
+				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
+												RBM_NORMAL, info->strategy);
+
+				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+				gistcheckpage(rel, leafBuffer);
+				leafPage = (Page) BufferGetPage(leafBuffer);
+				if (!GistPageIsLeaf(leafPage))
+				{
+					UnlockReleaseBuffer(leafBuffer);
+					continue;
+				}
+
+				if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* page is empty */
+					&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* not split, no follow-right */
+					&& ntodelete < maxoff-1) /* we must keep at least one downlink per internal page */
+				{
+					buftodelete[ntodelete] = leafBuffer;
+					todelete[ntodelete++] = off;
+				}
+				else
+					UnlockReleaseBuffer(leafBuffer);
+			}
+
+			/*
+			 * We will have to relock the internal page in case of deletes:
+			 * we cannot lock a child while holding the parent lock without
+			 * risking a deadlock.
+			 */
+			LockBuffer(buffer, GIST_UNLOCK);
+
+			if (ntodelete)
+			{
+				TransactionId txid;
+				int			i;
+
+				for (i = 0; i < ntodelete; i++)
+				{
+					Buffer	leafBuffer = buftodelete[i];
+					Page	leafPage;
+					LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+					gistcheckpage(rel, leafBuffer);
+					leafPage = (Page) BufferGetPage(leafBuffer);
+					if (!GistPageIsLeaf(leafPage) /* not a leaf anymore */
+						|| PageGetMaxOffsetNumber(leafPage) != InvalidOffsetNumber /* page is not empty */
+						|| (GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* split or follow-right */
+						)
+					{
+						UnlockReleaseBuffer(leafBuffer);
+						buftodelete[i] = InvalidBuffer;
+						todelete[i] = InvalidOffsetNumber;
+					}
+				}
+
+				LockBuffer(buffer, GIST_EXCLUSIVE);
+				page = (Page) BufferGetPage(buffer);
+
+				for (i = 0; i < ntodelete; i++)
+				{
+					Buffer	leafBuffer = buftodelete[i];
+					bool	inconsistent = false;
+					if (todelete[i] == InvalidOffsetNumber)
+						continue;
+
+					if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page)
+						|| PageGetMaxOffsetNumber(page) < todelete[i])
+						inconsistent = true;
+
+					if (!inconsistent)
+					{
+						iid = PageGetItemId(page, todelete[i]);
+						idxtuple = (IndexTuple) PageGetItem(page, iid);
+						if (todelete[i] != ItemPointerGetBlockNumber(&(idxtuple->t_tid)))
+							inconsistent = true;
+					}
+
+					if (inconsistent)
+					{
+						UnlockReleaseBuffer(leafBuffer);
+						buftodelete[i] = InvalidBuffer;
+						todelete[i] = InvalidOffsetNumber;
+					}
+				}
+
+				/*
+				 * Like in _bt_unlink_halfdead_page, we need an upper bound on
+				 * the xid of any transaction that could still see a downlink
+				 * to this page. We use ReadNewTransactionId() instead of
+				 * GetCurrentTransactionId() since we are in a VACUUM.
+				 */
+				txid = ReadNewTransactionId();
+
+				START_CRIT_SECTION();
+
+				/* Mark pages as deleted dropping references from internal pages */
+				for (i = 0; i < ntodelete; i++)
+				{
+					Page		leafPage;
+					XLogRecPtr	recptr;
+
+					if (todelete[i] == InvalidOffsetNumber)
+						continue;
+
+					leafPage = (Page) BufferGetPage(buftodelete[i]);
+
+					/* Remember xid of last transaction that could see this page */
+					GistPageSetDeleteXid(leafPage,txid);
+
+					GistPageSetDeleted(leafPage);
+					MarkBufferDirty(buftodelete[i]);
+					stats->pages_deleted++;
+					vstate.emptyPages--;
+
+					MarkBufferDirty(buffer);
+					/* offsets shift as we delete tuples from the internal page */
+					PageIndexTupleDelete(page, todelete[i] - i);
+
+					if (RelationNeedsWAL(rel))
+						recptr 	= gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+													 txid, buffer, todelete[i] - i);
+					else
+						recptr = gistGetFakeLSN(rel);
+					PageSetLSN(page, recptr);
+					PageSetLSN(leafPage, recptr);
+
+					UnlockReleaseBuffer(buftodelete[i]);
+				}
+				END_CRIT_SECTION();
+
+				LockBuffer(buffer, GIST_UNLOCK);
+			}
+
+			ReleaseBuffer(buffer);
+		}
+	}
+
+	blockset_free(vstate.emptyLeafPagesMap);
+	blockset_free(vstate.internalPagesMap);
 }
 
 /*
@@ -246,6 +452,7 @@ restart:
 	{
 		OffsetNumber todelete[MaxOffsetNumber];
 		int			ntodelete = 0;
+		int			nremain;
 		GISTPageOpaque opaque = GistPageGetOpaque(page);
 		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
 
@@ -319,10 +526,19 @@ restart:
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
 
-		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
+		nremain = maxoff - FirstOffsetNumber + 1;
+		if (nremain == 0)
+		{
+			vstate->emptyLeafPagesMap = blockset_set(vstate->emptyLeafPagesMap, blkno);
+			vstate->emptyPages++;
+		}
+		else
+			stats->num_index_tuples += nremain;
 	}
 	else
 	{
+		vstate->internalPagesMap = blockset_set(vstate->internalPagesMap, blkno);
+
 		/*
 		 * On an internal page, check for "invalid tuples", left behind by an
 		 * incomplete page split on PostgreSQL 9.0 or below.  These are not
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 408bd5390a..3213ea98ea 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -64,6 +64,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
 /*
  * redo any page update (except page split)
  */
@@ -116,6 +149,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
+
 			if (GistPageIsLeaf(page))
 				GistMarkTuplesDeleted(page);
 		}
@@ -535,6 +569,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -653,6 +690,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * downlink from internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset)
+{
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e468c9e15a..0861f82992 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -76,6 +76,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 3234f24156..ce8bfd83ea 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,10 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+/* For deleted pages we store last xid which could see the page in scan */
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 3698942f9d..117cc83ba5 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -412,12 +412,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
 			   Buffer leftchild);
 
-XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
+extern XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
 			   int ntodelete, RelFileNode hnode);
 
 extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 5117aabf1a..127cff5cb7 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,13 +17,15 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
 #define XLOG_GIST_DELETE			0x10 /* delete leaf index tuples for a page */
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -76,6 +78,12 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+	TransactionId deleteXid;	/* last xid that could see the page in a scan */
+	OffsetNumber downlinkOffset;	/* offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..5b92f08c74 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..e66396e851 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.20.1

0001-Add-block-set-data-structure.patchapplication/octet-stream; name=0001-Add-block-set-data-structure.patch; x-unix-mode=0644Download
From e0509eab42b46ae13cec21bd3a86e2fe0eee4698 Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Fri, 8 Mar 2019 21:19:32 +0500
Subject: [PATCH 1/2] Add block set data structure

Currently we have Bitmapset, which works only with 32-bit signed
ints. Furthermore, it is not very space-efficient in the case of a
sparse bitmap.

In this commit we invent the block set: a statically typed radix
tree for keeping BlockNumbers. This can still be improved in many
ways applicable to bitmaps; most notably, in-pointer lists could be
used. But for the sake of code simplicity it is plain in its
structure for now.

The block set is introduced to support efficient page deletion in
GiST's VACUUM.
---
 src/backend/lib/Makefile                      |   4 +-
 src/backend/lib/blockset.c                    | 201 ++++++++++++++++++
 src/include/lib/blockset.h                    |  24 +++
 src/test/modules/test_blockset/.gitignore     |   4 +
 src/test/modules/test_blockset/Makefile       |  21 ++
 src/test/modules/test_blockset/README         |   8 +
 .../test_blockset/expected/test_blockset.out  |   7 +
 .../test_blockset/sql/test_blockset.sql       |   3 +
 .../test_blockset/test_blockset--1.0.sql      |   8 +
 .../modules/test_blockset/test_blockset.c     | 182 ++++++++++++++++
 .../test_blockset/test_blockset.control       |   4 +
 11 files changed, 464 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/lib/blockset.c
 create mode 100644 src/include/lib/blockset.h
 create mode 100644 src/test/modules/test_blockset/.gitignore
 create mode 100644 src/test/modules/test_blockset/Makefile
 create mode 100644 src/test/modules/test_blockset/README
 create mode 100644 src/test/modules/test_blockset/expected/test_blockset.out
 create mode 100644 src/test/modules/test_blockset/sql/test_blockset.sql
 create mode 100644 src/test/modules/test_blockset/test_blockset--1.0.sql
 create mode 100644 src/test/modules/test_blockset/test_blockset.c
 create mode 100644 src/test/modules/test_blockset/test_blockset.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca2..9601894f07 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-       ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o blockset.o bloomfilter.o dshash.o \
+       hyperloglog.o ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/blockset.c b/src/backend/lib/blockset.c
new file mode 100644
index 0000000000..c07b2974b1
--- /dev/null
+++ b/src/backend/lib/blockset.c
@@ -0,0 +1,201 @@
+/*-------------------------------------------------------------------------
+ *
+ * blockset.c
+ *		Data structure for operations on set of block numbers
+ *
+ * This data structure resembles bitmap set in idea and operations, but
+ * has two main differences:
+ * 1. It handles an unsigned BlockNumber as the position in the set instead
+ * of int32_t. This allows working with relation forks bigger than 2^31 blocks.
+ * 2. It is more space-efficient for sparse bitmaps: regions are allocated
+ * in chunks of 256 items at once.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/blockset.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/blockset.h"
+
+/* Lowest level of radix tree is represented by bitmap */
+typedef struct
+{
+	char data[256/8];
+} BlockSetLevel4Data;
+
+typedef BlockSetLevel4Data *BlockSetLevel4;
+
+/* statically typed inner level chunks points to ground level */
+typedef struct
+{
+	/* null references denote empty subtree */
+	BlockSetLevel4 next[256];
+} BlockSetLevel3Data;
+
+typedef BlockSetLevel3Data *BlockSetLevel3;
+
+/* inner level points to another inner level */
+typedef struct
+{
+	BlockSetLevel3 next[256];
+} BlockSetLevel2Data;
+
+typedef BlockSetLevel2Data *BlockSetLevel2;
+
+/* Radix tree root */
+typedef struct BlockSetData
+{
+	BlockSetLevel2 next[256];
+} BlockSetData;
+
+/* multiplex block number into indexes of radix tree */
+#define BLOCKSET_SPLIT_BLKNO				\
+	uint32_t i1, i2, i3, i4, byte_no, byte_mask;	\
+	i4 = blkno % 256;						\
+	blkno /= 256;							\
+	i3 = blkno % 256;						\
+	blkno /= 256;							\
+	i2 = blkno % 256;						\
+	blkno /= 256;							\
+	i1 = blkno;								\
+	byte_no = i4 / 8;						\
+	byte_mask = 1 << (i4 % 8)				\
+
+/* indicate presence of block number in set */
+BlockSet blockset_set(BlockSet bs, BlockNumber blkno)
+{
+	BLOCKSET_SPLIT_BLKNO;
+	if (bs == NULL)
+	{
+		bs = palloc0(sizeof(BlockSetData));
+	}
+	BlockSetLevel2 bs2 = bs->next[i1];
+	if (bs2 == NULL)
+	{
+		bs2 = palloc0(sizeof(BlockSetLevel2Data));
+		bs->next[i1] = bs2;
+	}
+	BlockSetLevel3 bs3 = bs2->next[i2];
+	if (bs3 == NULL)
+	{
+		bs3 = palloc0(sizeof(BlockSetLevel3Data));
+		bs2->next[i2] = bs3;
+	}
+	BlockSetLevel4 bs4 = bs3->next[i3];
+	if (bs4 == NULL)
+	{
+		bs4 = palloc0(sizeof(BlockSetLevel4Data));
+		bs3->next[i3] = bs4;
+	}
+	bs4->data[byte_no] = byte_mask | bs4->data[byte_no];
+	return bs;
+}
+
+/* Test presence of block in set */
+bool blockset_get(BlockNumber blkno, BlockSet bs)
+{
+	BLOCKSET_SPLIT_BLKNO;
+	if (bs == NULL)
+		return false;
+	BlockSetLevel2 bs2 = bs->next[i1];
+	if (bs2 == NULL)
+		return false;
+	BlockSetLevel3 bs3 = bs2->next[i2];
+	if (bs3 == NULL)
+		return false;
+	BlockSetLevel4 bs4 = bs3->next[i3];
+	if (bs4 == NULL)
+		return false;
+	return (bs4->data[byte_no] & byte_mask);
+}
+
+/* 
+ * Find the nearest block number in the set strictly greater than blkno.
+ * Returns InvalidBlockNumber if there is none.
+ * If given InvalidBlockNumber, returns the smallest element in the set.
+ */
+BlockNumber blockset_next(BlockSet bs, BlockNumber blkno)
+{
+	if (blkno == InvalidBlockNumber)
+		blkno = 0; /* equivalent to ++; spelled out for clarity */
+	else
+		blkno++;
+
+	BLOCKSET_SPLIT_BLKNO;
+
+	if (bs == NULL)
+		return InvalidBlockNumber;
+	for (; i1 < 256; i1++)
+	{
+		BlockSetLevel2 bs2 = bs->next[i1];
+		if (!bs2)
+			continue;
+		for (; i2 < 256; i2++)
+		{
+			BlockSetLevel3 bs3 = bs2->next[i2];
+			if (!bs3)
+				continue;
+			for (; i3 < 256; i3++)
+			{
+				BlockSetLevel4 bs4 = bs3->next[i3];
+				if (!bs4)
+					continue;
+				for (; byte_no < 256 / 8; byte_no++)
+				{
+					if (!bs4->data[byte_no])
+						continue;
+					while (byte_mask < 256)
+					{
+						if ((byte_mask & bs4->data[byte_no]) == byte_mask)
+						{
+							i4 = byte_no * 8;
+							while (byte_mask >>= 1) i4++;
+							return i4 + 256 * (i3 + 256 * (i2 + 256 * i1));
+						}
+						byte_mask <<= 1;
+					}
+					byte_mask = 1;
+				}
+				byte_no = 0;
+			}
+			i3 = 0;
+		}
+		i2 = 0;
+	}
+	return InvalidBlockNumber;
+}
+
+/* free anything palloced */
+void blockset_free(BlockSet bs)
+{
+	BlockNumber blkno = 0;
+	BLOCKSET_SPLIT_BLKNO;
+	if (bs == NULL)
+		return;
+	for (; i1 < 256; i1++)
+	{
+		BlockSetLevel2 bs2 = bs->next[i1];
+		if (!bs2)
+			continue;
+		for (; i2 < 256; i2++)
+		{
+			BlockSetLevel3 bs3 = bs2->next[i2];
+			if (!bs3)
+				continue;
+			for (; i3 < 256; i3++)
+			{
+				BlockSetLevel4 bs4 = bs3->next[i3];
+				if (bs4)
+					pfree(bs4);
+			}
+			pfree(bs3);
+		}
+		pfree(bs2);
+	}
+	pfree(bs);
+}
\ No newline at end of file
diff --git a/src/include/lib/blockset.h b/src/include/lib/blockset.h
new file mode 100644
index 0000000000..1966d17a8d
--- /dev/null
+++ b/src/include/lib/blockset.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * blockset.h
+ *		Data structure for operations on set of block numbers
+ *
+ * IDENTIFICATION
+ *    src/include/lib/blockset.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef BLOCKSET_H
+#define BLOCKSET_H
+
+#include "storage/block.h"
+
+typedef struct BlockSetData *BlockSet;
+
+extern BlockSet blockset_set(BlockSet bs, BlockNumber blkno);
+extern bool blockset_get(BlockNumber blkno, BlockSet bs);
+extern BlockNumber blockset_next(BlockSet bs, BlockNumber blkno);
+extern void blockset_free(BlockSet bs);
+
+#endif							/* BLOCKSET_H */
diff --git a/src/test/modules/test_blockset/.gitignore b/src/test/modules/test_blockset/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_blockset/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_blockset/Makefile b/src/test/modules/test_blockset/Makefile
new file mode 100644
index 0000000000..091cf8c095
--- /dev/null
+++ b/src/test/modules/test_blockset/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_blockset/Makefile
+
+MODULE_big = test_blockset
+OBJS = test_blockset.o $(WIN32RES)
+PGFILEDESC = "test_blockset - test code for block set library"
+
+EXTENSION = test_blockset
+DATA = test_blockset--1.0.sql
+
+REGRESS = test_blockset
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_blockset
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_blockset/README b/src/test/modules/test_blockset/README
new file mode 100644
index 0000000000..730951ff03
--- /dev/null
+++ b/src/test/modules/test_blockset/README
@@ -0,0 +1,8 @@
+test_blockset overview
+=========================
+
+test_blockset is a test harness module for testing the block set data structure.
+There are two main tests:
+1. Test of compliance with Bitmapset on numbers available to Bitmapset
+2. Test of numbers that can exceed INT32_MAX but are shifted left by one bit
+from the numbers kept in Bitmapset
\ No newline at end of file
diff --git a/src/test/modules/test_blockset/expected/test_blockset.out b/src/test/modules/test_blockset/expected/test_blockset.out
new file mode 100644
index 0000000000..02c29d5fc0
--- /dev/null
+++ b/src/test/modules/test_blockset/expected/test_blockset.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_blockset;
+SELECT test_blockset();
+ test_blockset 
+---------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_blockset/sql/test_blockset.sql b/src/test/modules/test_blockset/sql/test_blockset.sql
new file mode 100644
index 0000000000..85d2886676
--- /dev/null
+++ b/src/test/modules/test_blockset/sql/test_blockset.sql
@@ -0,0 +1,3 @@
+CREATE EXTENSION test_blockset;
+
+SELECT test_blockset();
\ No newline at end of file
diff --git a/src/test/modules/test_blockset/test_blockset--1.0.sql b/src/test/modules/test_blockset/test_blockset--1.0.sql
new file mode 100644
index 0000000000..04eea8a614
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_blockset/test_blockset--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_blockset" to load this file. \quit
+
+CREATE FUNCTION test_blockset()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_blockset/test_blockset.c b/src/test/modules/test_blockset/test_blockset.c
new file mode 100644
index 0000000000..7a1d1c86c8
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset.c
@@ -0,0 +1,182 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_blockset.c
+ *		Test block set data structure.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_blockset/test_blockset.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/blockset.h"
+#include "nodes/bitmapset.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_blockset);
+
+static void test_blockset_bms_compliance(int32_t test_limit);
+static void test_blockset_big_block_numbers(int32_t test_limit);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ */
+Datum
+test_blockset(PG_FUNCTION_ARGS)
+{
+	test_blockset_bms_compliance(0);
+	test_blockset_bms_compliance(1);
+	test_blockset_bms_compliance(2);
+	test_blockset_bms_compliance(1337);
+	test_blockset_bms_compliance(100000);
+	test_blockset_big_block_numbers(1337);
+	test_blockset_big_block_numbers(31337);
+	PG_RETURN_VOID();
+}
+
+/*
+ * This test creates a random bitmap with test_limit members
+ * and checks that the block set behavior matches Bitmapset
+ */
+static void test_blockset_bms_compliance(int32_t test_limit)
+{
+	BlockSet bs = NULL;
+	Bitmapset *bms = NULL;
+	BlockNumber blockno;
+	int index;
+	int32_t point_index = 0;
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		/* bms does not support block numbers above INT32_MAX */
+		bs = blockset_set(bs, blockno);
+		bms = bms_add_member(bms, (int)blockno);
+	}
+
+	index = -1;
+	blockno = InvalidBlockNumber;
+
+	while (true)
+	{
+		point_index++;
+		BlockNumber next_bn = blockset_next(bs, blockno);
+		int next_index = bms_next_member(bms, index);
+
+
+		if (next_bn == InvalidBlockNumber && next_index == -2)
+			break; /* found everything; fall through to the cross-checks below */
+
+		if (((BlockNumber)next_index) != next_bn)
+		{
+			elog(ERROR,
+				 "Bitmapset returned value %X different from block set %X,"
+				 " test_limit %d, point index %d",
+				 next_index, next_bn, test_limit, point_index);
+		}
+
+		if (!blockset_get(next_bn, bs))
+		{
+			elog(ERROR,
+				 "Block set did not find present item %X"
+				 " test_limit %d, point index %d",
+				 next_bn, test_limit, point_index);
+		}
+
+		index = next_index;
+		blockno = next_bn;
+	}
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		if (blockset_get(blockno, bs) != bms_is_member((int)blockno, bms))
+		{
+			elog(ERROR,
+				 "Block set did not agree with bitmapset on item %X"
+				 " test_limit %d, point index %d",
+				 blockno, test_limit, point_index);
+		}
+	}
+
+	blockset_free(bs);
+	bms_free(bms);
+}
+
+/*
+ * This test is similar to test_blockset_bms_compliance(),
+ * except that it shifts BlockNumbers left by one bit to ensure that blockset
+ * operates correctly on values higher than INT32_MAX.
+ * This function is copy-pasted from the previous one, with the exception of
+ * the bit shifts applied to BlockNumbers. I've tried various refactorings,
+ * but they all looked ugly.
+ */
+static void test_blockset_big_block_numbers(int32_t test_limit)
+{
+	BlockSet bs = NULL;
+	Bitmapset *bms = NULL;
+	BlockNumber blockno;
+	int index;
+	int32_t point_index = 0;
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		/* bms does not support block numbers above INT32_MAX */
+		bs = blockset_set(bs, blockno << 1);
+		bms = bms_add_member(bms, (int)blockno);
+	}
+
+	index = -1;
+	blockno = InvalidBlockNumber;
+
+	while (true)
+	{
+		point_index++;
+		BlockNumber next_bn = blockset_next(bs, blockno);
+		int next_index = bms_next_member(bms, index);
+
+
+		if (next_bn == InvalidBlockNumber && next_index == -2)
+			break; /* found everything; fall through to the cross-checks below */
+
+		if (((BlockNumber)next_index) != (next_bn >> 1))
+		{
+			elog(ERROR,
+				 "Bitmapset returned value %X different from block set %X,"
+				 " test_limit %d, point index %d",
+				 next_index, next_bn, test_limit, point_index);
+		}
+
+		if (!blockset_get(next_bn, bs))
+		{
+			elog(ERROR,
+				 "Block set did not find present item %X"
+				 " test_limit %d, point index %d",
+				 next_bn, test_limit, point_index);
+		}
+
+		index = next_index;
+		blockno = next_bn;
+	}
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		if (blockset_get(blockno << 1, bs) != bms_is_member((int)blockno, bms))
+		{
+			elog(ERROR,
+				 "Block set did not agree with bitmapset on item %X"
+				 " test_limit %d, point index %d",
+				 blockno, test_limit, point_index);
+		}
+	}
+
+	blockset_free(bs);
+	bms_free(bms);
+}
diff --git a/src/test/modules/test_blockset/test_blockset.control b/src/test/modules/test_blockset/test_blockset.control
new file mode 100644
index 0000000000..fdb7598c5a
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset.control
@@ -0,0 +1,4 @@
+comment = 'Test code for block set library'
+default_version = '1.0'
+module_pathname = '$libdir/test_blockset'
+relocatable = true
-- 
2.20.1

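As a side note for readers skimming the patch: here is a minimal usage sketch
of the block set API, built only on the functions declared in blockset.h above
(blockset_demo itself is a hypothetical illustration, not part of the patch):

#include "postgres.h"
#include "lib/blockset.h"

/* hypothetical demo function, not part of the patch */
static void
blockset_demo(void)
{
	BlockSet	bs = NULL;
	BlockNumber blkno;

	/* blockset_set returns the set root, allocating it on first use */
	bs = blockset_set(bs, 7);
	bs = blockset_set(bs, 42);
	bs = blockset_set(bs, 4000000000U);	/* above INT32_MAX is fine */

	/* starting from InvalidBlockNumber yields the smallest member */
	blkno = InvalidBlockNumber;
	while ((blkno = blockset_next(bs, blkno)) != InvalidBlockNumber)
		elog(NOTICE, "block %u is in the set", blkno);

	blockset_free(bs);
}
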
#50Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#49)
6 attachment(s)
Re: GiST VACUUM

On 10/03/2019 18:40, Andrey Borodin wrote:

Here's a new version of the patch for actual page deletion.
Changes:

1. Fixed possible concurrency issue:

We were locking the child page while holding a lock on the internal page.
The notes in the GiST README recommend locking the child before the parent.
Thus now we unlock the internal page before locking children for deletion,
then lock it again, check that everything is still in its place, and only
then delete.

Good catch. The implementation is a bit broken, though. This segfaults:

create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);

insert into gist_point_tbl (id, p)
select g, point((1+g) % 100, (1+g) % 100) from generate_series(1, 10000) g;
insert into gist_point_tbl (id, p)
select -g, point(1000, 1000) from generate_series(1, 10000) g;

delete from gist_point_tbl where id >= 0;
vacuum verbose gist_point_tbl;

BTW, we don't seem to have test coverage for this in the regression
tests. The tests that delete some rows, where you updated the comment,
don't actually seem to produce any empty pages that could be deleted.

One thing still bothers me. Let's assume that we have an internal page
with 2 deletable leaves. We lock these leaves in the order of the items
on the internal page. Is it possible that the 2nd page has a follow-right
link to the 1st, and someone will lock the 2nd page, try to lock the 1st,
and deadlock with VACUUM?

Hmm. If the follow-right flag is set on a page, it means that its right
sibling doesn't have a downlink in the parent yet. Nevertheless, I think
I'd sleep better if we acquired the locks in left-to-right order, to be
safe.

An easy cop-out would be to use LWLockConditionalAcquire, and bail out
if we can't get the lock. But if it's not too complicated, it'd be nice
to acquire the locks in the correct order to begin with.
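
For illustration, a minimal sketch of that cop-out, using the existing
ConditionalLockBuffer() buffer-manager wrapper (which takes the exclusive
content lock only if it is immediately available) rather than raw
LWLockConditionalAcquire; just a sketch, not what the patches currently do:

	/* try to lock the leaf without risking a deadlock; skip it otherwise */
	if (ConditionalLockBuffer(leafBuffer))
	{
		/* got the exclusive content lock: recheck emptiness, then delete */
	}
	else
	{
		/* lock not available; leave this leaf for a later VACUUM */
		ReleaseBuffer(leafBuffer);
	}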

I did a round of cleanup and moving things around, before I bumped into
the above issue. I'm including them here as separate patches, for easier
review, but they should of course be squashed into yours. For
convenience, I'm including your patches here as well, so that all the
patches you need are in the same place, but they are identical to what
you posted.

- Heikki

Attachments:

0001-Add-block-set-data-structure.patchtext/x-patch; name=0001-Add-block-set-data-structure.patchDownload
From a09f14de32f3040b19238d239b7f025f345e940b Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Fri, 8 Mar 2019 21:19:32 +0500
Subject: [PATCH 1/6] Add block set data structure

Currently we have Bitmapset, which works only with 32-bit ints.
Furthermore, it is not very space-efficient in the case of a sparse
bitmap.

In this commit we invent the block set: a statically typed radix tree
for keeping BlockNumbers. This can still be improved in many of the ways
applicable to bitmaps; most notably, in-pointer lists could be used.
But for the sake of code simplicity it is for now plain in its
structure.

The block set is introduced to support efficient page deletion in
GiST's VACUUM.
---
 src/backend/lib/Makefile                      |   4 +-
 src/backend/lib/blockset.c                    | 201 ++++++++++++++++++
 src/include/lib/blockset.h                    |  24 +++
 src/test/modules/test_blockset/.gitignore     |   4 +
 src/test/modules/test_blockset/Makefile       |  21 ++
 src/test/modules/test_blockset/README         |   8 +
 .../test_blockset/expected/test_blockset.out  |   7 +
 .../test_blockset/sql/test_blockset.sql       |   3 +
 .../test_blockset/test_blockset--1.0.sql      |   8 +
 .../modules/test_blockset/test_blockset.c     | 182 ++++++++++++++++
 .../test_blockset/test_blockset.control       |   4 +
 11 files changed, 464 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/lib/blockset.c
 create mode 100644 src/include/lib/blockset.h
 create mode 100644 src/test/modules/test_blockset/.gitignore
 create mode 100644 src/test/modules/test_blockset/Makefile
 create mode 100644 src/test/modules/test_blockset/README
 create mode 100644 src/test/modules/test_blockset/expected/test_blockset.out
 create mode 100644 src/test/modules/test_blockset/sql/test_blockset.sql
 create mode 100644 src/test/modules/test_blockset/test_blockset--1.0.sql
 create mode 100644 src/test/modules/test_blockset/test_blockset.c
 create mode 100644 src/test/modules/test_blockset/test_blockset.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca26..9601894f07c 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-       ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o blockset.o bloomfilter.o dshash.o \
+       hyperloglog.o ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/blockset.c b/src/backend/lib/blockset.c
new file mode 100644
index 00000000000..c07b2974b1e
--- /dev/null
+++ b/src/backend/lib/blockset.c
@@ -0,0 +1,201 @@
+/*-------------------------------------------------------------------------
+ *
+ * blockset.c
+ *		Data structure for operations on set of block numbers
+ *
+ * This data structure resembles Bitmapset in idea and operations, but
+ * has two main differences:
+ * 1. It handles unsigned BlockNumber as the position in the set instead of
+ * int32_t. This allows working with relation forks bigger than 2B blocks.
+ * 2. It is more space-efficient for sparse bitmaps: regions are allocated
+ * in chunks of 256 items at once.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/blockset.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/blockset.h"
+
+/* Lowest level of radix tree is represented by bitmap */
+typedef struct
+{
+	char data[256/8];
+} BlockSetLevel4Data;
+
+typedef BlockSetLevel4Data *BlockSetLevel4;
+
+/* statically typed inner level chunks points to ground level */
+typedef struct
+{
+	/* null references denote empty subtree */
+	BlockSetLevel4 next[256];
+} BlockSetLevel3Data;
+
+typedef BlockSetLevel3Data *BlockSetLevel3;
+
+/* inner level points to another inner level */
+typedef struct
+{
+	BlockSetLevel3 next[256];
+} BlockSetLevel2Data;
+
+typedef BlockSetLevel2Data *BlockSetLevel2;
+
+/* Radix tree root */
+typedef struct BlockSetData
+{
+	BlockSetLevel2 next[256];
+} BlockSetData;
+
+/* multiplex block number into indexes of radix tree */
+#define BLOCKSET_SPLIT_BLKNO				\
+	uint32_t i1, i2, i3, i4, byte_no, byte_mask;	\
+	i4 = blkno % 256;						\
+	blkno /= 256;							\
+	i3 = blkno % 256;						\
+	blkno /= 256;							\
+	i2 = blkno % 256;						\
+	blkno /= 256;							\
+	i1 = blkno;								\
+	byte_no = i4 / 8;						\
+	byte_mask = 1 << (i4 % 8)				\
+
+/* indicate presence of block number in set */
+BlockSet blockset_set(BlockSet bs, BlockNumber blkno)
+{
+	BLOCKSET_SPLIT_BLKNO;
+	if (bs == NULL)
+	{
+		bs = palloc0(sizeof(BlockSetData));
+	}
+	BlockSetLevel2 bs2 = bs->next[i1];
+	if (bs2 == NULL)
+	{
+		bs2 = palloc0(sizeof(BlockSetLevel2Data));
+		bs->next[i1] = bs2;
+	}
+	BlockSetLevel3 bs3 = bs2->next[i2];
+	if (bs3 == NULL)
+	{
+		bs3 = palloc0(sizeof(BlockSetLevel3Data));
+		bs2->next[i2] = bs3;
+	}
+	BlockSetLevel4 bs4 = bs3->next[i3];
+	if (bs4 == NULL)
+	{
+		bs4 = palloc0(sizeof(BlockSetLevel4Data));
+		bs3->next[i3] = bs4;
+	}
+	bs4->data[byte_no] = byte_mask | bs4->data[byte_no];
+	return bs;
+}
+
+/* Test presence of block in set */
+bool blockset_get(BlockNumber blkno, BlockSet bs)
+{
+	BLOCKSET_SPLIT_BLKNO;
+	if (bs == NULL)
+		return false;
+	BlockSetLevel2 bs2 = bs->next[i1];
+	if (bs2 == NULL)
+		return false;
+	BlockSetLevel3 bs3 = bs2->next[i2];
+	if (bs3 == NULL)
+		return false;
+	BlockSetLevel4 bs4 = bs3->next[i3];
+	if (bs4 == NULL)
+		return false;
+	return (bs4->data[byte_no] & byte_mask);
+}
+
+/*
+ * Find the smallest block number in the set that is greater than blkno.
+ * Returns InvalidBlockNumber if there is none.
+ * If given InvalidBlockNumber, returns the smallest element in the set.
+ */
+BlockNumber blockset_next(BlockSet bs, BlockNumber blkno)
+{
+	if (blkno == InvalidBlockNumber)
+		blkno = 0; /* InvalidBlockNumber + 1 would wrap to 0; spelled out for clarity */
+	else
+		blkno++;
+
+	BLOCKSET_SPLIT_BLKNO;
+
+	if (bs == NULL)
+		return InvalidBlockNumber;
+	for (; i1 < 256; i1++)
+	{
+		BlockSetLevel2 bs2 = bs->next[i1];
+		if (!bs2)
+			continue;
+		for (; i2 < 256; i2++)
+		{
+			BlockSetLevel3 bs3 = bs2->next[i2];
+			if (!bs3)
+				continue;
+			for (; i3 < 256; i3++)
+			{
+				BlockSetLevel4 bs4 = bs3->next[i3];
+				if (!bs4)
+					continue;
+				for (; byte_no < 256 / 8; byte_no++)
+				{
+					if (!bs4->data[byte_no])
+						continue;
+					while (byte_mask < 256)
+					{
+						if ((byte_mask & bs4->data[byte_no]) == byte_mask)
+						{
+							i4 = byte_no * 8;
+							while (byte_mask >>= 1) i4++;
+							return i4 + 256 * (i3 + 256 * (i2 + 256 * i1));
+						}
+						byte_mask <<= 1;
+					}
+					byte_mask = 1;
+				}
+				byte_no = 0;
+			}
+			i3 = 0;
+		}
+		i2 = 0;
+	}
+	return InvalidBlockNumber;
+}
+
+/* free anything palloced */
+void blockset_free(BlockSet bs)
+{
+	BlockNumber blkno = 0;
+	BLOCKSET_SPLIT_BLKNO;
+	if (bs == NULL)
+		return;
+	for (; i1 < 256; i1++)
+	{
+		BlockSetLevel2 bs2 = bs->next[i1];
+		if (!bs2)
+			continue;
+		for (; i2 < 256; i2++)
+		{
+			BlockSetLevel3 bs3 = bs2->next[i2];
+			if (!bs3)
+				continue;
+			for (; i3 < 256; i3++)
+			{
+				BlockSetLevel4 bs4 = bs3->next[i3];
+				if (bs4)
+					pfree(bs4);
+			}
+			pfree(bs3);
+		}
+		pfree(bs2);
+	}
+	pfree(bs);
+}
\ No newline at end of file
diff --git a/src/include/lib/blockset.h b/src/include/lib/blockset.h
new file mode 100644
index 00000000000..1966d17a8d4
--- /dev/null
+++ b/src/include/lib/blockset.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * blockset.h
+ *		Data structure for operations on set of block numbers
+ *
+ * IDENTIFICATION
+ *    src/include/lib/blockset.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef BLOCKSET_H
+#define BLOCKSET_H
+
+#include "storage/block.h"
+
+typedef struct BlockSetData *BlockSet;
+
+extern BlockSet blockset_set(BlockSet bs, BlockNumber blkno);
+extern bool blockset_get(BlockNumber blkno, BlockSet bs);
+extern BlockNumber blockset_next(BlockSet bs, BlockNumber blkno);
+extern void blockset_free(BlockSet bs);
+
+#endif							/* BLOCKSET_H */
diff --git a/src/test/modules/test_blockset/.gitignore b/src/test/modules/test_blockset/.gitignore
new file mode 100644
index 00000000000..5dcb3ff9723
--- /dev/null
+++ b/src/test/modules/test_blockset/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_blockset/Makefile b/src/test/modules/test_blockset/Makefile
new file mode 100644
index 00000000000..091cf8c0958
--- /dev/null
+++ b/src/test/modules/test_blockset/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_blockset/Makefile
+
+MODULE_big = test_blockset
+OBJS = test_blockset.o $(WIN32RES)
+PGFILEDESC = "test_blockset - test code for block set library"
+
+EXTENSION = test_blockset
+DATA = test_blockset--1.0.sql
+
+REGRESS = test_blockset
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_blockset
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_blockset/README b/src/test/modules/test_blockset/README
new file mode 100644
index 00000000000..730951ff03c
--- /dev/null
+++ b/src/test/modules/test_blockset/README
@@ -0,0 +1,8 @@
+test_blockset overview
+=========================
+
+test_blockset is a test harness module for testing the block set data structure.
+There are two main tests:
+1. Test of compliance with Bitmapset on numbers available to Bitmapset
+2. Test of numbers that can exceed INT32_MAX but are shifted left by one bit
+from the numbers kept in Bitmapset
\ No newline at end of file
diff --git a/src/test/modules/test_blockset/expected/test_blockset.out b/src/test/modules/test_blockset/expected/test_blockset.out
new file mode 100644
index 00000000000..02c29d5fc0c
--- /dev/null
+++ b/src/test/modules/test_blockset/expected/test_blockset.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_blockset;
+SELECT test_blockset();
+ test_blockset 
+---------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_blockset/sql/test_blockset.sql b/src/test/modules/test_blockset/sql/test_blockset.sql
new file mode 100644
index 00000000000..85d2886676e
--- /dev/null
+++ b/src/test/modules/test_blockset/sql/test_blockset.sql
@@ -0,0 +1,3 @@
+CREATE EXTENSION test_blockset;
+
+SELECT test_blockset();
\ No newline at end of file
diff --git a/src/test/modules/test_blockset/test_blockset--1.0.sql b/src/test/modules/test_blockset/test_blockset--1.0.sql
new file mode 100644
index 00000000000..04eea8a6146
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_blockset/test_blockset--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_blockset" to load this file. \quit
+
+CREATE FUNCTION test_blockset()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_blockset/test_blockset.c b/src/test/modules/test_blockset/test_blockset.c
new file mode 100644
index 00000000000..7a1d1c86c86
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset.c
@@ -0,0 +1,182 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_blockset.c
+ *		Test block set data structure.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_blockset/test_blockset.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/blockset.h"
+#include "nodes/bitmapset.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_blockset);
+
+static void test_blockset_bms_compliance(int32_t test_limit);
+static void test_blockset_big_block_numbers(int32_t test_limit);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ */
+Datum
+test_blockset(PG_FUNCTION_ARGS)
+{
+	test_blockset_bms_compliance(0);
+	test_blockset_bms_compliance(1);
+	test_blockset_bms_compliance(2);
+	test_blockset_bms_compliance(1337);
+	test_blockset_bms_compliance(100000);
+	test_blockset_big_block_numbers(1337);
+	test_blockset_big_block_numbers(31337);
+	PG_RETURN_VOID();
+}
+
+/*
+ * This test creates a random bitmap with test_limit members
+ * and checks that the block set behavior matches Bitmapset
+ */
+static void test_blockset_bms_compliance(int32_t test_limit)
+{
+	BlockSet bs = NULL;
+	Bitmapset *bms = NULL;
+	BlockNumber blockno;
+	int index;
+	int32_t point_index = 0;
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		/* bms does not support block numbers above INT32_MAX */
+		bs = blockset_set(bs, blockno);
+		bms = bms_add_member(bms, (int)blockno);
+	}
+
+	index = -1;
+	blockno = InvalidBlockNumber;
+
+	while (true)
+	{
+		point_index++;
+		BlockNumber next_bn = blockset_next(bs, blockno);
+		int next_index = bms_next_member(bms, index);
+
+
+		if (next_bn == InvalidBlockNumber && next_index == -2)
+			break; /* found everything; fall through to the cross-checks below */
+
+		if (((BlockNumber)next_index) != next_bn)
+		{
+			elog(ERROR,
+				 "Bitmapset returned value %X different from block set %X,"
+				 " test_limit %d, point index %d",
+				 next_index, next_bn, test_limit, point_index);
+		}
+
+		if (!blockset_get(next_bn, bs))
+		{
+			elog(ERROR,
+				 "Block set did not find present item %X"
+				 " test_limit %d, point index %d",
+				 next_bn, test_limit, point_index);
+		}
+
+		index = next_index;
+		blockno = next_bn;
+	}
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		if (blockset_get(blockno, bs) != bms_is_member((int)blockno, bms))
+		{
+			elog(ERROR,
+				 "Block set did not agree with bitmapset on item %X"
+				 " test_limit %d, point index %d",
+				 blockno, test_limit, point_index);
+		}
+	}
+
+	blockset_free(bs);
+	bms_free(bms);
+}
+
+/*
+ * This test is similar to test_blockset_bms_compliance(),
+ * except that it shifts BlockNumbers left by one bit to ensure that blockset
+ * operates correctly on values higher than INT32_MAX.
+ * This function is copy-pasted from the previous one, with the exception of
+ * the bit shifts applied to BlockNumbers. I've tried various refactorings,
+ * but they all looked ugly.
+ */
+static void test_blockset_big_block_numbers(int32_t test_limit)
+{
+	BlockSet bs = NULL;
+	Bitmapset *bms = NULL;
+	BlockNumber blockno;
+	int index;
+	int32_t point_index = 0;
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		/* bms does not support block numbers above INT32_MAX */
+		bs = blockset_set(bs, blockno << 1);
+		bms = bms_add_member(bms, (int)blockno);
+	}
+
+	index = -1;
+	blockno = InvalidBlockNumber;
+
+	while (true)
+	{
+		point_index++;
+		BlockNumber next_bn = blockset_next(bs, blockno);
+		int next_index = bms_next_member(bms, index);
+
+
+		if (next_bn == InvalidBlockNumber && next_index == -2)
+			break; /* found everything; fall through to the cross-checks below */
+
+		if (((BlockNumber)next_index) != (next_bn >> 1))
+		{
+			elog(ERROR,
+				 "Bitmapset returned value %X different from block set %X,"
+				 " test_limit %d, point index %d",
+				 next_index, next_bn, test_limit, point_index);
+		}
+
+		if (!blockset_get(next_bn, bs))
+		{
+			elog(ERROR,
+				 "Block set did not find present item %X"
+				 " test_limit %d, point index %d",
+				 next_bn, test_limit, point_index);
+		}
+
+		index = next_index;
+		blockno = next_bn;
+	}
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		if (blockset_get(blockno << 1, bs) != bms_is_member((int)blockno, bms))
+		{
+			elog(ERROR,
+				 "Block set did not agree with bitmapset on item %X"
+				 " test_limit %d, point index %d",
+				 blockno, test_limit, point_index);
+		}
+	}
+
+	blockset_free(bs);
+	bms_free(bms);
+}
diff --git a/src/test/modules/test_blockset/test_blockset.control b/src/test/modules/test_blockset/test_blockset.control
new file mode 100644
index 00000000000..fdb7598c5a7
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset.control
@@ -0,0 +1,4 @@
+comment = 'Test code for block set library'
+default_version = '1.0'
+module_pathname = '$libdir/test_blockset'
+relocatable = true
-- 
2.20.1

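As a concrete illustration of the BLOCKSET_SPLIT_BLKNO macro in the patch
above, block number 0x12345678 decomposes as follows (a worked example, not
part of the patch):

	i4        = blkno % 256            = 0x78   (bit position in level-4 bitmap)
	i3        = (blkno / 256) % 256    = 0x56
	i2        = (blkno / 65536) % 256  = 0x34
	i1        = blkno / 16777216       = 0x12
	byte_no   = i4 / 8                 = 15
	byte_mask = 1 << (i4 % 8)          = 0x01

so the membership bit for this block lives at
bs->next[0x12]->next[0x34]->next[0x56]->data[15], under mask 0x01.
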
0002-Delete-pages-during-GiST-VACUUM.patchtext/x-patch; name=0002-Delete-pages-during-GiST-VACUUM.patchDownload
From 1e477f083cd639117944c7910db8aff0c763b4f6 Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Fri, 8 Mar 2019 21:24:37 +0500
Subject: [PATCH 2/6] Delete pages during GiST VACUUM

This commit teaches GiST to actually delete pages during VACUUM.

To do this we scan the GiST index twice. In the first pass we make
note of empty pages and internal pages. In the second pass we scan
through the internal pages looking for references to empty leaf pages.
---
 src/backend/access/gist/README         |  14 ++
 src/backend/access/gist/gist.c         |  18 ++
 src/backend/access/gist/gistutil.c     |   3 +-
 src/backend/access/gist/gistvacuum.c   | 218 ++++++++++++++++++++++++-
 src/backend/access/gist/gistxlog.c     |  60 +++++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/include/access/gist.h              |   4 +
 src/include/access/gist_private.h      |   7 +-
 src/include/access/gistxlog.h          |  10 +-
 src/test/regress/expected/gist.out     |   4 +-
 src/test/regress/sql/gist.sql          |   4 +-
 11 files changed, 335 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b81..c84359de310 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,20 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused to allocate newly split pages. To find these pages,
+the function scans the index in physical order.
+
+The physical scan reads the entire index from the first page to the last. The
+scan maintains the information necessary to collect the block numbers of
+internal pages that need cleansing and of empty leaf pages.
+
+After the scan, each internal page is taken under an exclusive lock, and each
+potentially free leaf page referenced from it is examined. gistbulkdelete()
+never deletes the last reference on an internal page, to keep the tree balanced.
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2ce5425ef98..6f87b4b5044 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -704,6 +704,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum,
+			 * so we do not need to check whether the page is deleted
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -838,6 +843,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced
+			 * until their space is reused. The downlink to this page may already
+			 * be removed from the internal page, but this scan can still hold it.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index f32e16eed58..4fa44bf2f6c 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/float.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 #include "utils/lsyscache.h"
 
 
@@ -834,7 +835,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 3c1d75691e8..fc606ac823c 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,11 +16,15 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
+#include "lib/blockset.h"
 #include "miscadmin.h"
+#include "nodes/bitmapset.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+
 /* Working state needed by gistbulkdelete */
 typedef struct
 {
@@ -30,6 +34,10 @@ typedef struct
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
+	BlockNumber emptyPages;
+
+	BlockSet	internalPagesMap;
+	BlockSet	emptyLeafPagesMap;
 } GistVacState;
 
 static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -50,6 +58,7 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 	gistvacuumscan(info, stats, callback, callback_state);
 
+
 	return stats;
 }
 
@@ -103,6 +112,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * while the index is being expanded, leaving an all-zeros page behind.
  *
  * The caller is responsible for initially allocating/zeroing a stats struct.
+ * 
+ * Bulk deletion of all index entries pointing to a set of heap tuples and
+ * check invalid tuples left after upgrade.
+ * The set of target tuples is specified via a callback routine that tells
+ * whether any given heap tuple (identified by ItemPointer) is being deleted.
  */
 static void
 gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -132,6 +146,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	else
 		vstate.startNSN = gistGetFakeLSN(rel);
 	vstate.totFreePages = 0;
+	vstate.emptyPages = 0;
+	vstate.internalPagesMap = NULL;
+	vstate.emptyLeafPagesMap = NULL;
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -171,6 +188,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (blkno >= num_pages)
 			break;
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; blkno < num_pages; blkno++)
 			gistvacuumpage(&vstate, blkno, blkno);
@@ -194,6 +212,194 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* update statistics */
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
+
+	/* rescan all inner pages to find those that has empty child pages */
+	if (vstate.emptyPages > 0)
+	{
+		BlockNumber			x;
+
+		x = InvalidBlockNumber;
+		while (vstate.emptyPages > 0 &&
+			   (x = blockset_next(vstate.internalPagesMap, x)) != InvalidBlockNumber)
+		{
+			Buffer		buffer;
+			Page		page;
+			OffsetNumber off,
+				maxoff;
+			IndexTuple  idxtuple;
+			ItemId	    iid;
+			OffsetNumber todelete[MaxOffsetNumber];
+			Buffer		buftodelete[MaxOffsetNumber];
+			int			ntodelete = 0;
+
+			blkno = (BlockNumber) x;
+
+			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+										info->strategy);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			page = (Page) BufferGetPage(buffer);
+			if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+			{
+				UnlockReleaseBuffer(buffer);
+				continue;
+			}
+
+			maxoff = PageGetMaxOffsetNumber(page);
+			/* Check that leafs are still empty and decide what to delete */
+			for (off = FirstOffsetNumber; off <= maxoff && ntodelete < maxoff-1; off = OffsetNumberNext(off))
+			{
+				Buffer		leafBuffer;
+				Page		leafPage;
+				BlockNumber leafBlockNo;
+
+				iid = PageGetItemId(page, off);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				/* if this page was not empty in the previous scan, we do not consider it */
+				leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+				if (!blockset_get(leafBlockNo, vstate.emptyLeafPagesMap))
+					continue;
+
+				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
+												RBM_NORMAL, info->strategy);
+
+				buftodelete[ntodelete] = leafBuffer;
+				todelete[ntodelete++] = off;
+
+				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+				gistcheckpage(rel, leafBuffer);
+				leafPage = (Page) BufferGetPage(leafBuffer);
+				if (!GistPageIsLeaf(leafPage))
+				{
+					UnlockReleaseBuffer(leafBuffer);
+					continue;
+				}
+
+				if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* page is empty */
+					&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* no follow-right */
+					&& ntodelete < maxoff-1) /* keep at least one downlink on each internal page */
+				{
+					buftodelete[ntodelete] = leafBuffer;
+					todelete[ntodelete++] = off;
+				}
+				else
+					UnlockReleaseBuffer(leafBuffer);
+			}
+
+			/*
+			 * We will have to relock internal page in case of deletes:
+			 * we cannot lock child while holding parent lock without risk
+			 * of a deadlock
+			 */
+			LockBuffer(buffer, GIST_UNLOCK);
+
+			if (ntodelete)
+			{
+				TransactionId txid;
+				int			i;
+
+				for (i = 0; i < ntodelete; i++)
+				{
+					Buffer	leafBuffer = buftodelete[i];
+					Page	leafPage;
+					LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+					gistcheckpage(rel, leafBuffer);
+					leafPage = (Page) BufferGetPage(leafBuffer);
+					if (!GistPageIsLeaf(leafPage) /* not a leaf anymore */ 
+						|| PageGetMaxOffsetNumber(leafPage) != InvalidOffsetNumber /* Page is not empty */
+						|| (GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* follow-right is set */
+						)
+					{
+						UnlockReleaseBuffer(leafBuffer);
+						buftodelete[i] = InvalidBuffer;
+						todelete[i] = InvalidOffsetNumber;
+					}
+				}
+
+				LockBuffer(buffer, GIST_EXCLUSIVE);
+				page = (Page) BufferGetPage(buffer);
+
+				for (i = 0; i < ntodelete; i++)
+				{
+					Buffer	leafBuffer = buftodelete[i];
+					bool	inconsistent = false;
+					if (todelete[i] == InvalidOffsetNumber)
+						continue;
+
+					if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page)
+						|| PageGetMaxOffsetNumber(page) < todelete[i])
+						inconsistent = true;
+
+					if (!inconsistent)
+					{
+						iid = PageGetItemId(page, todelete[i]);
+						idxtuple = (IndexTuple) PageGetItem(page, iid);
+						if (todelete[i] != ItemPointerGetBlockNumber(&(idxtuple->t_tid)))
+							inconsistent = true;
+					}
+
+					if (inconsistent)
+					{
+						UnlockReleaseBuffer(leafBuffer);
+						buftodelete[i] = InvalidBuffer;
+						todelete[i] = InvalidOffsetNumber;
+					}
+				}
+
+				/*
+				 * Like in _bt_unlink_halfdead_page we need an upper bound on xid
+				 * that could hold downlinks to this page. We use
+				 * ReadNewTransactionId() instead of GetCurrentTransactionId()
+				 * since we are in a VACUUM.
+				 */
+				txid = ReadNewTransactionId();
+
+				START_CRIT_SECTION();
+
+				/* Mark pages as deleted dropping references from internal pages */
+				for (i = 0; i < ntodelete; i++)
+				{
+					Page		leafPage;
+					XLogRecPtr	recptr;
+
+					if (todelete[i] == InvalidOffsetNumber)
+						continue;
+
+					leafPage = (Page) BufferGetPage(buftodelete[i]);
+
+					/* Remember xid of last transaction that could see this page */
+					GistPageSetDeleteXid(leafPage, txid);
+
+					GistPageSetDeleted(leafPage);
+					MarkBufferDirty(buftodelete[i]);
+					stats->pages_deleted++;
+					vstate.emptyPages--;
+
+					MarkBufferDirty(buffer);
+					/* offsets of later items shift as we delete tuples from the internal page */
+					PageIndexTupleDelete(page, todelete[i] - i);
+
+					if (RelationNeedsWAL(rel))
+						recptr 	= gistXLogSetDeleted(rel->rd_node, buftodelete[i],
+													 txid, buffer, todelete[i] - i);
+					else
+						recptr = gistGetFakeLSN(rel);
+					PageSetLSN(page, recptr);
+					PageSetLSN(leafPage, recptr);
+
+					UnlockReleaseBuffer(buftodelete[i]);
+				}
+				END_CRIT_SECTION();
+
+				LockBuffer(buffer, GIST_UNLOCK);
+			}
+
+			ReleaseBuffer(buffer);
+		}
+	}
+
+	blockset_free(vstate.emptyLeafPagesMap);
+	blockset_free(vstate.internalPagesMap);
 }
 
 /*
@@ -246,6 +452,7 @@ restart:
 	{
 		OffsetNumber todelete[MaxOffsetNumber];
 		int			ntodelete = 0;
+		int			nremain;
 		GISTPageOpaque opaque = GistPageGetOpaque(page);
 		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
 
@@ -319,10 +526,19 @@ restart:
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
 
-		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
+		nremain = maxoff - FirstOffsetNumber + 1;
+		if (nremain == 0)
+		{
+			vstate->emptyLeafPagesMap = blockset_set(vstate->emptyLeafPagesMap, blkno);
+			vstate->emptyPages++;
+		}
+		else
+			stats->num_index_tuples += nremain;
 	}
 	else
 	{
+		vstate->internalPagesMap = blockset_set(vstate->internalPagesMap, blkno);
+
 		/*
 		 * On an internal page, check for "invalid tuples", left behind by an
 		 * incomplete page split on PostgreSQL 9.0 or below.  These are not
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 408bd5390af..3213ea98ea3 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -64,6 +64,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
 /*
  * redo any page update (except page split)
  */
@@ -116,6 +149,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
+
 			if (GistPageIsLeaf(page))
 				GistMarkTuplesDeleted(page);
 		}
@@ -535,6 +569,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -653,6 +690,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * downlink from internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	/* new tuples */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e468c9e15aa..0861f829922 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -76,6 +76,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 3234f241560..ce8bfd83ea4 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,10 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+/* For deleted pages we store last xid which could see the page in scan */
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 463d2bfc7b9..943163ccce7 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -414,12 +414,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
 			   Buffer leftchild);
 
-XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
+extern XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
 			   int ntodelete, RelFileNode hnode);
 
 extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 5117aabf1af..127cff5cb78 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,13 +17,15 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
 #define XLOG_GIST_DELETE			0x10 /* delete leaf index tuples for a page */
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -76,6 +78,12 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId deleteXid; /* last Xid which could see page in scan */
+   OffsetNumber downlinkOffset; /* Offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf2..5b92f08c747 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,9 +27,7 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13c..e66396e851b 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,9 +28,7 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
+-- And also delete some concentration of values.
 delete from gist_point_tbl where id < 10000;
 
 vacuum analyze gist_point_tbl;
-- 
2.20.1

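The key invariant behind page recycling in the patch above fits in one check:
a deleted page may be reused only once no transaction that could still hold a
downlink to it remains. The check added to gistNewBuffer(), reflowed here for
readability:

	if (GistPageIsDeleted(page) &&
		TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin))
		return buffer;	/* OK to use */

GistPageGetDeleteXid() reads the xid stamped on the page at deletion time
(taken from ReadNewTransactionId() in the vacuum code above), and
RecentGlobalXmin is the backend's usual oldest-xmin horizon.
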
0003-Minor-cleanup.patchtext/x-patch; name=0003-Minor-cleanup.patchDownload
From f3a33df4906e0074821493311cd8e0d25f4f63c6 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 11 Mar 2019 15:15:43 +0200
Subject: [PATCH 3/6] Minor cleanup

I renamed gistXLogSetDeleted to gistXLogPageDelete. I think that's better:
the WAL record also includes information about removing the downlink from
the parent, not just setting the flag on the child.
---
 src/backend/access/gist/gist.c       |  7 +--
 src/backend/access/gist/gistvacuum.c |  9 +--
 src/backend/access/gist/gistxlog.c   | 88 +++++++++++++++-------------
 src/include/access/gist_private.h    |  6 +-
 src/include/access/gistxlog.h        | 10 ++--
 5 files changed, 62 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 6f87b4b5044..e260c4df4e7 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -844,11 +844,10 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			}
 
 			/*
-			 * Leaf pages can be left deleted but still referenced
-			 * until their space is reused. The downlink to this page may already
-			 * be removed from the internal page, but this scan can still hold it.
+			 * The page might have been deleted after we scanned the parent
+			 * and saw the downlink.
 			 */
-			if(GistPageIsDeleted(stack->page))
+			if (GistPageIsDeleted(stack->page))
 			{
 				UnlockReleaseBuffer(stack->buffer);
 				xlocked = false;
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index fc606ac823c..eb90b2077d3 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -20,11 +20,9 @@
 #include "commands/vacuum.h"
 #include "lib/blockset.h"
 #include "miscadmin.h"
-#include "nodes/bitmapset.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
-
 /* Working state needed by gistbulkdelete */
 typedef struct
 {
@@ -58,7 +56,6 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 	gistvacuumscan(info, stats, callback, callback_state);
 
-
 	return stats;
 }
 
@@ -112,7 +109,7 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * while the index is being expanded, leaving an all-zeros page behind.
  *
  * The caller is responsible for initially allocating/zeroing a stats struct.
- * 
+ *
 * Bulk deletion of all index entries pointing to a set of heap tuples, and a
 * check for invalid tuples left behind by an upgrade.
  * The set of target tuples is specified via a callback routine that tells
@@ -213,7 +210,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
 
-	/* rescan all inner pages to find those that has empty child pages */
+	/* rescan all inner pages to find those that have empty child pages */
 	if (vstate.emptyPages > 0)
 	{
 		BlockNumber			x;
@@ -305,7 +302,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 					LockBuffer(leafBuffer, GIST_EXCLUSIVE);
 					gistcheckpage(rel, leafBuffer);
 					leafPage = (Page) BufferGetPage(leafBuffer);
-					if (!GistPageIsLeaf(leafPage) /* not a leaf anymore */ 
+					if (!GistPageIsLeaf(leafPage) /* not a leaf anymore */
 						|| PageGetMaxOffsetNumber(leafPage) != InvalidOffsetNumber /* Page is not empty */
						|| (GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* follow-right is set */
 						)
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 3213ea98ea3..7110c70451e 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -64,39 +64,6 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
-static void
-gistRedoPageSetDeleted(XLogReaderState *record)
-{
-	XLogRecPtr	lsn = record->EndRecPtr;
-	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
-	Buffer		buffer;
-	Page		page;
-
-	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
-	{
-		page = (Page) BufferGetPage(buffer);
-
-		GistPageSetDeleteXid(page, xldata->deleteXid);
-		GistPageSetDeleted(page);
-
-		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-	}
-	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
-
-	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
-	{
-		page = (Page) BufferGetPage(buffer);
-
-		PageIndexTupleDelete(page, xldata->downlinkOffset);
-
-		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-	}
-	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
-}
 /*
  * redo any page update (except page split)
  */
@@ -542,6 +509,43 @@ gistRedoCreateIndex(XLogReaderState *record)
 	UnlockReleaseBuffer(buffer);
 }
 
+/* redo page deletion */
+static void
+gistRedoPageDelete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	/* FIXME: Are we locking the pages in correct order, for hot standby? */
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
 void
 gist_redo(XLogReaderState *record)
 {
@@ -570,7 +574,7 @@ gist_redo(XLogReaderState *record)
 			gistRedoCreateIndex(record);
 			break;
 		case XLOG_GIST_PAGE_DELETE:
-			gistRedoPageSetDeleted(record);
+			gistRedoPageDelete(record);
 			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
@@ -691,25 +695,27 @@ gistXLogSplit(bool page_is_leaf,
 }
 
 /*
- * Write XLOG record describing a page delete. This also includes removal of
- * downlink from internal page.
+ * Write XLOG record describing a page deletion. This also includes removal of
+ * downlink from the parent page.
  */
 XLogRecPtr
-gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
-					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+gistXLogPageDelete(Buffer buffer, TransactionId xid,
+				   Buffer parentBuffer, OffsetNumber downlinkOffset)
+{
 	gistxlogPageDelete xlrec;
 	XLogRecPtr	recptr;
 
 	xlrec.deleteXid = xid;
-	xlrec.downlinkOffset = internalPageOffset;
+	xlrec.downlinkOffset = downlinkOffset;
 
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
 
 	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
-	/* new tuples */
+	XLogRegisterBuffer(1, parentBuffer, REGBUF_STANDARD);
+
 	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+
 	return recptr;
 }
 
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 943163ccce7..c77d0b4dd81 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -415,9 +415,9 @@ extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
 /* gistxlog.c */
-extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
-					TransactionId xid, Buffer internalPageBuffer,
-					OffsetNumber internalPageOffset);
+extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
+				   TransactionId xid, Buffer parentBuffer,
+				   OffsetNumber downlinkOffset);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 127cff5cb78..939a1ea7559 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,8 +17,6 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
-/* XLog stuff */
-
 #define XLOG_GIST_PAGE_UPDATE		0x00
 #define XLOG_GIST_DELETE			0x10 /* delete leaf index tuples for a page */
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
@@ -78,10 +76,14 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+/*
+ * Backup Blk 0: page that was deleted.
+ * Backup Blk 1: parent page, containing the downlink to the deleted page.
+ */
 typedef struct gistxlogPageDelete
 {
-   TransactionId deleteXid; /* last Xid which could see page in scan */
-   OffsetNumber downlinkOffset; /* Offset of the downlink referencing this page */
+	TransactionId deleteXid;	/* last Xid which could see page in scan */
+	OffsetNumber downlinkOffset; /* Offset of downlink referencing this page */
 } gistxlogPageDelete;
 
 extern void gist_redo(XLogReaderState *record);
-- 
2.20.1

0004-Move-the-page-deletion-logic-to-separate-function.patchtext/x-patch; name=0004-Move-the-page-deletion-logic-to-separate-function.patchDownload
From 3b6c4f6901f4b861e37e2aaba755dc66a5012607 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 11 Mar 2019 15:34:01 +0200
Subject: [PATCH 4/6] Move the page deletion logic to separate function.

If a VACUUM does multiple index passes, I think we only want to do the
empty page deletion after the final pass. That saves effort, since we
only need to scan the internal pages once. But even if we wanted to do
it on every pass, I think having a separate function makes it more
readable.
---
 src/backend/access/gist/gistvacuum.c | 464 ++++++++++++++-------------
 1 file changed, 240 insertions(+), 224 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index eb90b2077d3..b95e755406e 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -23,25 +23,31 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
-/* Working state needed by gistbulkdelete */
 typedef struct
 {
+	IndexBulkDeleteResult stats;
+
 	IndexVacuumInfo *info;
-	IndexBulkDeleteResult *stats;
+	BlockNumber numEmptyPages;
+	BlockSet	internalPagesMap;
+	BlockSet	emptyLeafPagesMap;
+} GistBulkDeleteResult;
+
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	GistBulkDeleteResult *stats;
 	IndexBulkDeleteCallback callback;
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
-	BlockNumber emptyPages;
-
-	BlockSet	internalPagesMap;
-	BlockSet	emptyLeafPagesMap;
 } GistVacState;
 
-static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void gistvacuumscan(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state);
 static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
 			   BlockNumber orig_blkno);
+static void gistvacuum_recycle_pages(GistBulkDeleteResult *stats);
 
 /*
  * VACUUM bulkdelete stage: remove index entries.
@@ -50,13 +56,15 @@ IndexBulkDeleteResult *
 gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
+	GistBulkDeleteResult *gist_stats = (GistBulkDeleteResult *) stats;
+
 	/* allocate stats if first time through, else re-use existing struct */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	if (gist_stats == NULL)
+		gist_stats = (GistBulkDeleteResult *) palloc0(sizeof(GistBulkDeleteResult));
 
-	gistvacuumscan(info, stats, callback, callback_state);
+	gistvacuumscan(info, gist_stats, callback, callback_state);
 
-	return stats;
+	return (IndexBulkDeleteResult *) gist_stats;
 }
 
 /*
@@ -65,6 +73,8 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
+	GistBulkDeleteResult *gist_stats = (GistBulkDeleteResult *) stats;
+
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
@@ -74,12 +84,15 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 * stats from the latest gistbulkdelete call.  If it wasn't called, we
 	 * still need to do a pass over the index, to obtain index statistics.
 	 */
-	if (stats == NULL)
+	if (gist_stats == NULL)
 	{
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-		gistvacuumscan(info, stats, NULL, NULL);
+		gist_stats = (GistBulkDeleteResult *) palloc0(sizeof(GistBulkDeleteResult));
+		gistvacuumscan(info, gist_stats, NULL, NULL);
 	}
 
+	/* Recycle empty pages */
+	gistvacuum_recycle_pages(gist_stats);
+
 	/*
 	 * It's quite possible for us to be fooled by concurrent page splits into
 	 * double-counting some index tuples, so disbelieve any total that exceeds
@@ -88,11 +101,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 */
 	if (!info->estimated_count)
 	{
-		if (stats->num_index_tuples > info->num_heap_tuples)
-			stats->num_index_tuples = info->num_heap_tuples;
+		if (gist_stats->stats.num_index_tuples > info->num_heap_tuples)
+			gist_stats->stats.num_index_tuples = info->num_heap_tuples;
 	}
 
-	return stats;
+	return (IndexBulkDeleteResult *) gist_stats;
 }
 
 /*
@@ -116,7 +129,7 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  */
 static void
-gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+gistvacuumscan(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
@@ -129,12 +142,12 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * Reset counts that will be incremented during the scan; needed in case
 	 * of multiple scans during a single VACUUM command.
 	 */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
-	stats->pages_deleted = 0;
+	stats->stats.estimated_count = false;
+	stats->stats.num_index_tuples = 0;
+	stats->stats.pages_deleted = 0;
 
 	/* Set up info to pass down to gistvacuumpage */
-	vstate.info = info;
+	stats->info = info;
 	vstate.stats = stats;
 	vstate.callback = callback;
 	vstate.callback_state = callback_state;
@@ -143,9 +156,6 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	else
 		vstate.startNSN = gistGetFakeLSN(rel);
 	vstate.totFreePages = 0;
-	vstate.emptyPages = 0;
-	vstate.internalPagesMap = NULL;
-	vstate.emptyLeafPagesMap = NULL;
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -207,196 +217,8 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		IndexFreeSpaceMapVacuum(rel);
 
 	/* update statistics */
-	stats->num_pages = num_pages;
-	stats->pages_free = vstate.totFreePages;
-
-	/* rescan all inner pages to find those that have empty child pages */
-	if (vstate.emptyPages > 0)
-	{
-		BlockNumber			x;
-
-		x = InvalidBlockNumber;
-		while (vstate.emptyPages > 0 &&
-			   (x = blockset_next(vstate.internalPagesMap, x)) != InvalidBlockNumber)
-		{
-			Buffer		buffer;
-			Page		page;
-			OffsetNumber off,
-				maxoff;
-			IndexTuple  idxtuple;
-			ItemId	    iid;
-			OffsetNumber todelete[MaxOffsetNumber];
-			Buffer		buftodelete[MaxOffsetNumber];
-			int			ntodelete = 0;
-
-			blkno = (BlockNumber) x;
-
-			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-										info->strategy);
-
-			LockBuffer(buffer, GIST_EXCLUSIVE);
-			page = (Page) BufferGetPage(buffer);
-			if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
-			{
-				UnlockReleaseBuffer(buffer);
-				continue;
-			}
-
-			maxoff = PageGetMaxOffsetNumber(page);
-			/* Check that leafs are still empty and decide what to delete */
-			for (off = FirstOffsetNumber; off <= maxoff && ntodelete < maxoff-1; off = OffsetNumberNext(off))
-			{
-				Buffer		leafBuffer;
-				Page		leafPage;
-				BlockNumber leafBlockNo;
-
-				iid = PageGetItemId(page, off);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-				/* if this page was not empty in previous scan - we do not consider it */
-				leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				if (!blockset_get(leafBlockNo, vstate.emptyLeafPagesMap))
-					continue;
-
-				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
-												RBM_NORMAL, info->strategy);
-
-				buftodelete[ntodelete] = leafBuffer;
-				todelete[ntodelete++] = off;
-
-				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
-				gistcheckpage(rel, leafBuffer);
-				leafPage = (Page) BufferGetPage(leafBuffer);
-				if (!GistPageIsLeaf(leafPage))
-				{
-					UnlockReleaseBuffer(leafBuffer);
-					continue;
-				}
-
-				if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* Nothing left to split */
-					&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
-					&& ntodelete < maxoff-1) /* We must keep at least one leaf page per each */
-				{
-					buftodelete[ntodelete] = leafBuffer;
-					todelete[ntodelete++] = off;
-				}
-				else
-					UnlockReleaseBuffer(leafBuffer);
-			}
-
-			/*
-			 * We will have to relock internal page in case of deletes:
-			 * we cannot lock child while holding parent lock without risk
-			 * of a deadlock
-			 */
-			LockBuffer(buffer, GIST_UNLOCK);
-
-			if (ntodelete)
-			{
-				TransactionId txid;
-				int			i;
-
-				for (i = 0; i < ntodelete; i++)
-				{
-					Buffer	leafBuffer = buftodelete[i];
-					Page	leafPage;
-					LockBuffer(leafBuffer, GIST_EXCLUSIVE);
-					gistcheckpage(rel, leafBuffer);
-					leafPage = (Page) BufferGetPage(leafBuffer);
-					if (!GistPageIsLeaf(leafPage) /* not a leaf anymore */
-						|| PageGetMaxOffsetNumber(leafPage) != InvalidOffsetNumber /* Page is not empry */
-						|| (GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
-						)
-					{
-						UnlockReleaseBuffer(leafBuffer);
-						buftodelete[i] = InvalidBuffer;
-						todelete[i] = InvalidOffsetNumber;
-					}
-				}
-
-				LockBuffer(buffer, GIST_EXCLUSIVE);
-				page = (Page) BufferGetPage(buffer);
-
-				for (i = 0; i < ntodelete; i++)
-				{
-					Buffer	leafBuffer = buftodelete[i];
-					bool	inconsistent = false;
-					if (todelete[i] == InvalidOffsetNumber)
-						continue;
-
-					if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page)
-						|| PageGetMaxOffsetNumber(page) < todelete[i])
-						inconsistent = true;
-
-					if (!inconsistent)
-					{
-						iid = PageGetItemId(page, todelete[i]);
-						idxtuple = (IndexTuple) PageGetItem(page, iid);
-						if (todelete[i] != ItemPointerGetBlockNumber(&(idxtuple->t_tid)))
-							inconsistent = true;
-					}
-
-					if (inconsistent)
-					{
-						UnlockReleaseBuffer(leafBuffer);
-						buftodelete[i] = InvalidBuffer;
-						todelete[i] = InvalidOffsetNumber;
-					}
-				}
-
-				/*
-				 * Like in _bt_unlink_halfdead_page we need an upper bound on xid
-				 * that could hold downlinks to this page. We use
-				 * ReadNewTransactionId() to instead of GetCurrentTransactionId
-				 * since we are in a VACUUM.
-				 */
-				txid = ReadNewTransactionId();
-
-				START_CRIT_SECTION();
-
-				/* Mark pages as deleted dropping references from internal pages */
-				for (i = 0; i < ntodelete; i++)
-				{
-					Page		leafPage;
-					XLogRecPtr	recptr;
-
-					if (todelete[i] == InvalidOffsetNumber)
-						continue;
-
-					leafPage = (Page) BufferGetPage(buftodelete[i]);
-
-					/* Remember xid of last transaction that could see this page */
-					GistPageSetDeleteXid(leafPage,txid);
-
-					GistPageSetDeleted(leafPage);
-					MarkBufferDirty(buftodelete[i]);
-					stats->pages_deleted++;
-					vstate.emptyPages--;
-
-					MarkBufferDirty(buffer);
-					/* Offsets are changed as long as we delete tuples from internal page */
-					PageIndexTupleDelete(page, todelete[i] - i);
-
-					if (RelationNeedsWAL(rel))
-						recptr 	= gistXLogSetDeleted(rel->rd_node, buftodelete[i],
-													 txid, buffer, todelete[i] - i);
-					else
-						recptr = gistGetFakeLSN(rel);
-					PageSetLSN(page, recptr);
-					PageSetLSN(leafPage, recptr);
-
-					UnlockReleaseBuffer(buftodelete[i]);
-				}
-				END_CRIT_SECTION();
-
-				LockBuffer(buffer, GIST_UNLOCK);
-			}
-
-			ReleaseBuffer(buffer);
-		}
-	}
-
-	blockset_free(vstate.emptyLeafPagesMap);
-	blockset_free(vstate.internalPagesMap);
+	stats->stats.num_pages = num_pages;
+	stats->stats.pages_free = vstate.totFreePages;
 }
 
 /*
@@ -413,8 +235,8 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 static void
 gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 {
-	IndexVacuumInfo *info = vstate->info;
-	IndexBulkDeleteResult *stats = vstate->stats;
+	GistBulkDeleteResult *stats = vstate->stats;
+	IndexVacuumInfo *info = stats->info;
 	IndexBulkDeleteCallback callback = vstate->callback;
 	void	   *callback_state = vstate->callback_state;
 	Relation	rel = info->index;
@@ -443,7 +265,7 @@ restart:
 		/* Okay to recycle this page */
 		RecordFreeIndexPage(rel, blkno);
 		vstate->totFreePages++;
-		stats->pages_deleted++;
+		stats->stats.pages_deleted++;
 	}
 	else if (GistPageIsLeaf(page))
 	{
@@ -518,7 +340,7 @@ restart:
 
 			END_CRIT_SECTION();
 
-			stats->tuples_removed += ntodelete;
+			stats->stats.tuples_removed += ntodelete;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
@@ -526,16 +348,14 @@ restart:
 		nremain = maxoff - FirstOffsetNumber + 1;
 		if (nremain == 0)
 		{
-			vstate->emptyLeafPagesMap = blockset_set(vstate->emptyLeafPagesMap, blkno);
-			vstate->emptyPages++;
+			stats->emptyLeafPagesMap = blockset_set(stats->emptyLeafPagesMap, blkno);
+			stats->numEmptyPages++;
 		}
 		else
-			stats->num_index_tuples += nremain;
+			stats->stats.num_index_tuples += nremain;
 	}
 	else
 	{
-		vstate->internalPagesMap = blockset_set(vstate->internalPagesMap, blkno);
-
 		/*
 		 * On an internal page, check for "invalid tuples", left behind by an
 		 * incomplete page split on PostgreSQL 9.0 or below.  These are not
@@ -560,6 +380,8 @@ restart:
 						 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
 						 errhint("Please REINDEX it.")));
 		}
+
+		stats->internalPagesMap = blockset_set(stats->internalPagesMap, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
@@ -577,3 +399,197 @@ restart:
 		goto restart;
 	}
 }
+
+static void
+gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
+{
+	IndexVacuumInfo *info = stats->info;
+	Relation	rel = info->index;
+	BlockNumber	x;
+
+	/* quick exit if no empty pages */
+	if (stats->numEmptyPages == 0)
+		return;
+
+	/* rescan all inner pages to find those that have empty child pages */
+	x = InvalidBlockNumber;
+	while (stats->numEmptyPages > 0 &&
+		   (x = blockset_next(stats->internalPagesMap, x)) != InvalidBlockNumber)
+	{
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber off,
+			maxoff;
+		IndexTuple  idxtuple;
+		ItemId	    iid;
+		OffsetNumber todelete[MaxOffsetNumber];
+		Buffer		buftodelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+		BlockNumber blkno;
+
+		blkno = (BlockNumber) x;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+									info->strategy);
+
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		page = (Page) BufferGetPage(buffer);
+		if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		maxoff = PageGetMaxOffsetNumber(page);
+		/* Check that leaves are still empty and decide what to delete */
+		for (off = FirstOffsetNumber; off <= maxoff && ntodelete < maxoff-1; off = OffsetNumberNext(off))
+		{
+			Buffer		leafBuffer;
+			Page		leafPage;
+			BlockNumber leafBlockNo;
+
+			iid = PageGetItemId(page, off);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+			/* if this page was not empty during the previous scan, skip it */
+			leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+			if (!blockset_get(leafBlockNo, stats->emptyLeafPagesMap))
+				continue;
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
+											RBM_NORMAL, info->strategy);
+
+			buftodelete[ntodelete] = leafBuffer;
+			todelete[ntodelete++] = off;
+
+			LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafBuffer);
+			leafPage = (Page) BufferGetPage(leafBuffer);
+			if (!GistPageIsLeaf(leafPage))
+			{
+				UnlockReleaseBuffer(leafBuffer);
+				continue;
+			}
+
+			if (PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* Nothing left to split */
+				&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
+				&& ntodelete < maxoff-1) /* we must keep at least one downlink on each internal page */
+			{
+				buftodelete[ntodelete] = leafBuffer;
+				todelete[ntodelete++] = off;
+			}
+			else
+				UnlockReleaseBuffer(leafBuffer);
+		}
+
+		/*
+		 * We will have to relock internal page in case of deletes:
+		 * we cannot lock child while holding parent lock without risk
+		 * of a deadlock
+		 */
+		LockBuffer(buffer, GIST_UNLOCK);
+
+		if (ntodelete)
+		{
+			TransactionId txid;
+			int			i;
+
+			for (i = 0; i < ntodelete; i++)
+			{
+				Buffer	leafBuffer = buftodelete[i];
+				Page	leafPage;
+				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+				gistcheckpage(rel, leafBuffer);
+				leafPage = (Page) BufferGetPage(leafBuffer);
+				if (!GistPageIsLeaf(leafPage) /* not a leaf anymore */
+					|| PageGetMaxOffsetNumber(leafPage) != InvalidOffsetNumber /* Page is not empry */
+					|| (GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
+					)
+				{
+					UnlockReleaseBuffer(leafBuffer);
+					buftodelete[i] = InvalidBuffer;
+					todelete[i] = InvalidOffsetNumber;
+				}
+			}
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			page = (Page) BufferGetPage(buffer);
+
+			for (i = 0; i < ntodelete; i++)
+			{
+				Buffer	leafBuffer = buftodelete[i];
+				bool	inconsistent = false;
+
+				if (todelete[i] == InvalidOffsetNumber)
+					continue;
+
+				if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page)
+					|| PageGetMaxOffsetNumber(page) < todelete[i])
+					inconsistent = true;
+
+				if (!inconsistent)
+				{
+					iid = PageGetItemId(page, todelete[i]);
+					idxtuple = (IndexTuple) PageGetItem(page, iid);
+					if (todelete[i] != ItemPointerGetBlockNumber(&(idxtuple->t_tid)))
+						inconsistent = true;
+				}
+
+				if (inconsistent)
+				{
+					UnlockReleaseBuffer(leafBuffer);
+					buftodelete[i] = InvalidBuffer;
+					todelete[i] = InvalidOffsetNumber;
+				}
+			}
+
+			/*
+			 * Like in _bt_unlink_halfdead_page we need an upper bound on xid
+			 * that could hold downlinks to this page. We use
+			 * ReadNewTransactionId() instead of GetCurrentTransactionId()
+			 * since we are in a VACUUM.
+			 */
+			txid = ReadNewTransactionId();
+
+			START_CRIT_SECTION();
+
+			/* Mark pages as deleted dropping references from internal pages */
+			for (i = 0; i < ntodelete; i++)
+			{
+				Page		leafPage;
+				XLogRecPtr	recptr;
+
+				if (todelete[i] == InvalidOffsetNumber)
+					continue;
+
+				leafPage = (Page) BufferGetPage(buftodelete[i]);
+
+				/* Remember xid of last transaction that could see this page */
+				GistPageSetDeleteXid(leafPage,txid);
+
+				GistPageSetDeleted(leafPage);
+				MarkBufferDirty(buftodelete[i]);
+				stats->stats.pages_deleted++;
+				stats->numEmptyPages--;
+
+				MarkBufferDirty(buffer);
+				/* Offsets are changed as long as we delete tuples from internal page */
+				PageIndexTupleDelete(page, todelete[i] - i);
+
+				if (RelationNeedsWAL(rel))
+					recptr 	= gistXLogPageDelete(buftodelete[i],
+												 txid, buffer, todelete[i] - i);
+				else
+					recptr = gistGetFakeLSN(rel);
+				PageSetLSN(page, recptr);
+				PageSetLSN(leafPage, recptr);
+
+				UnlockReleaseBuffer(buftodelete[i]);
+			}
+			END_CRIT_SECTION();
+
+			LockBuffer(buffer, GIST_UNLOCK);
+		}
+
+		ReleaseBuffer(buffer);
+	}
+}
-- 
2.20.1

0005-Remove-numEmptyPages-.-it-s-not-really-needed.patchtext/x-patch; name=0005-Remove-numEmptyPages-.-it-s-not-really-needed.patchDownload
From 54393f2a3c81fe05f56140c9c367e1ff960fe0ba Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 11 Mar 2019 15:41:07 +0200
Subject: [PATCH 5/6] Remove 'numEmptyPages'. it's not really needed.

---
 src/backend/access/gist/gistvacuum.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index b95e755406e..bf0b9d54f69 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -28,7 +28,6 @@ typedef struct
 	IndexBulkDeleteResult stats;
 
 	IndexVacuumInfo *info;
-	BlockNumber numEmptyPages;
 	BlockSet	internalPagesMap;
 	BlockSet	emptyLeafPagesMap;
 } GistBulkDeleteResult;
@@ -349,7 +348,6 @@ restart:
 		if (nremain == 0)
 		{
 			stats->emptyLeafPagesMap = blockset_set(stats->emptyLeafPagesMap, blkno);
-			stats->numEmptyPages++;
 		}
 		else
 			stats->stats.num_index_tuples += nremain;
@@ -408,13 +406,15 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 	BlockNumber	x;
 
 	/* quick exit if no empty pages */
-	if (stats->numEmptyPages == 0)
-		return;
+	/* HEIKKI: I'm assuming the blockset is always NULL if it's empty. That's true
+	 * for the current usage. But add comments, and maybe a blockset_isempty() macro
+	 * for clarity */
+	if (stats->emptyLeafPagesMap == NULL)
+		return;
 
 	/* rescan all inner pages to find those that have empty child pages */
 	x = InvalidBlockNumber;
-	while (stats->numEmptyPages > 0 &&
-		   (x = blockset_next(stats->internalPagesMap, x)) != InvalidBlockNumber)
+	while ((x = blockset_next(stats->internalPagesMap, x)) != InvalidBlockNumber)
 	{
 		Buffer		buffer;
 		Page		page;
@@ -442,7 +442,9 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 
 		maxoff = PageGetMaxOffsetNumber(page);
 		/* Check that leaves are still empty and decide what to delete */
-		for (off = FirstOffsetNumber; off <= maxoff && ntodelete < maxoff-1; off = OffsetNumberNext(off))
+		for (off = FirstOffsetNumber;
+			 off <= maxoff && ntodelete < maxoff-1;
+			 off = OffsetNumberNext(off))
 		{
 			Buffer		leafBuffer;
 			Page		leafPage;
@@ -569,7 +571,6 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 				GistPageSetDeleted(leafPage);
 				MarkBufferDirty(buftodelete[i]);
 				stats->stats.pages_deleted++;
-				stats->numEmptyPages--;
 
 				MarkBufferDirty(buffer);
 				/* Offsets are changed as long as we delete tuples from internal page */
-- 
2.20.1

0006-Misc-cleanup.patchtext/x-patch; name=0006-Misc-cleanup.patchDownload
From b365081e1bc0818445e44b1ad4ba32d311609f06 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 11 Mar 2019 16:58:10 +0200
Subject: [PATCH 6/6] Misc cleanup.

---
 src/backend/access/gist/gistvacuum.c | 32 ++++++++++++++++------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index bf0b9d54f69..54a0adaf71f 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -419,7 +419,7 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 		Buffer		buffer;
 		Page		page;
 		OffsetNumber off,
-			maxoff;
+					maxoff;
 		IndexTuple  idxtuple;
 		ItemId	    iid;
 		OffsetNumber todelete[MaxOffsetNumber];
@@ -436,6 +436,8 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 		page = (Page) BufferGetPage(buffer);
 		if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
 		{
+			/* HEIKKI: This page was an internal page earlier, but now it's something else.
+			 * Shouldn't happen... */
 			UnlockReleaseBuffer(buffer);
 			continue;
 		}
@@ -446,12 +448,13 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 			 off <= maxoff && ntodelete < maxoff-1;
 			 off = OffsetNumberNext(off))
 		{
+			BlockNumber leafBlockNo;
 			Buffer		leafBuffer;
 			Page		leafPage;
-			BlockNumber leafBlockNo;
 
 			iid = PageGetItemId(page, off);
 			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
 			/* if this page was not empty during the previous scan, skip it */
 			leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 			if (!blockset_get(leafBlockNo, stats->emptyLeafPagesMap))
@@ -468,6 +471,7 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 			leafPage = (Page) BufferGetPage(leafBuffer);
 			if (!GistPageIsLeaf(leafPage))
 			{
+				/* it's not a leaf anymore? shouldn't happen.. */
 				UnlockReleaseBuffer(leafBuffer);
 				continue;
 			}
@@ -483,27 +487,29 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 				UnlockReleaseBuffer(leafBuffer);
 		}
 
-		/*
-		 * We will have to relock internal page in case of deletes:
-		 * we cannot lock child while holding parent lock without risk
-		 * of a deadlock
-		 */
-		LockBuffer(buffer, GIST_UNLOCK);
-
-		if (ntodelete)
+		if (ntodelete > 0)
 		{
 			TransactionId txid;
 			int			i;
 
+			/*
+			 * We will have to relock internal page in case of deletes:
+			 * we cannot lock child while holding parent lock without risk
+			 * of a deadlock
+			 */
+			LockBuffer(buffer, GIST_UNLOCK);
+
 			for (i = 0; i < ntodelete; i++)
 			{
 				Buffer	leafBuffer = buftodelete[i];
 				Page	leafPage;
+
+				/* FIXME: Aren't we still holding the lock from the loop above? */
 				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
 				gistcheckpage(rel, leafBuffer);
 				leafPage = (Page) BufferGetPage(leafBuffer);
 				if (!GistPageIsLeaf(leafPage) /* not a leaf anymore */
-					|| PageGetMaxOffsetNumber(leafPage) != InvalidOffsetNumber /* Page is not empry */
+					|| PageGetMaxOffsetNumber(leafPage) != InvalidOffsetNumber /* Page is not empty */
 					|| (GistFollowRight(leafPage) || GistPageGetNSN(page) < GistPageGetNSN(leafPage)) /* No follow-right */
 					)
 				{
@@ -587,10 +593,8 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 				UnlockReleaseBuffer(buftodelete[i]);
 			}
 			END_CRIT_SECTION();
-
-			LockBuffer(buffer, GIST_UNLOCK);
 		}
 
-		ReleaseBuffer(buffer);
+		UnlockReleaseBuffer(buffer);
 	}
 }
-- 
2.20.1

#51Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#50)
6 attachment(s)
Re: GiST VACUUM

Hi!

On 11 Mar 2019, at 20:03, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 10/03/2019 18:40, Andrey Borodin wrote:

Here's a new version of the patch for actual page deletion.
Changes:
1. Fixed a possible concurrency issue:
We were locking the child page while holding a lock on the internal page.
Notes in the GiST README recommend locking the child before the parent.
Thus we now unlock the internal page before locking children for deletion,
and lock it again, check that everything is still in its place, and only
then delete.

Good catch. The implementation is a bit broken, though. This segfaults:

Sorry for the noise. A few lines of old code leaked into the patch between testing and mailing.

BTW, we don't seem to have test coverage for this in the regression tests. The tests that delete some rows, where you updated the comment, don't actually seem to produce any empty pages that could be deleted.

I've updated the test to delete more rows. But it did not trigger the previous bug anyway; we had to delete everything to hit that case.

One thing still bothers me. Let's assume that we have an internal page
with 2 deletable leaves. We lock these leaves in the order of their items on
the internal page. Is it possible that the 2nd page has a follow-right link to
the 1st, and someone locks the 2nd page, tries to lock the 1st, and deadlocks
with VACUUM?

Hmm. If the follow-right flag is set on a page, it means that its right sibling doesn't have a downlink in the parent yet. Nevertheless, I think I'd sleep better if we acquired the locks in left-to-right order, to be safe.

Actually, I did not find any lock coupling in the GiST code. But I decided to lock just two pages at a time (leaf, then parent, for every pair). PFA v22 with this concurrency logic.
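
For every candidate pair the locking now goes roughly like this (condensed from the attached 0002 patch; the follow-right/NSN re-check is elided here):

	LockBuffer(leafBuffer, GIST_EXCLUSIVE);		/* child first */
	gistcheckpage(rel, leafBuffer);
	leafPage = (Page) BufferGetPage(leafBuffer);
	if (GistPageIsLeaf(leafPage) &&
		PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber)
	{
		LockBuffer(buffer, GIST_EXCLUSIVE);		/* then its parent */
		page = (Page) BufferGetPage(buffer);
		if (gistdeletepage(&vstate, buffer, page, todelete[off] - deleted,
						   leafBuffer, leafPage, txid))
			deleted++;
		LockBuffer(buffer, GIST_UNLOCK);
	}
	UnlockReleaseBuffer(leafBuffer);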

An easy cop-out would be to use LWLockConditionalAcquire, and bail out if we can't get the lock. But if it's not too complicated, it'd be nice to acquire the locks in the correct order to begin with.
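
Just to illustrate, a minimal sketch of that cop-out, assuming we use the buffer-level ConditionalLockBuffer() wrapper rather than raw LWLockConditionalAcquire() (hypothetical, not part of the attached patches):

	if (!ConditionalLockBuffer(leafBuffer))
	{
		/* somebody else holds the lock; leave this leaf for the next VACUUM */
		ReleaseBuffer(leafBuffer);
		continue;
	}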

I did a round of cleanup and moving things around before I bumped into the above issue. I'm including them here as separate patches, for easier review, but they should of course be squashed into yours. For convenience, I'm including your patches here as well, so that all the patches you need are in the same place, but they are identical to what you posted.

I've rebased all these patches so that VACUUM should work after every commit. Let's just squash them for the next iteration; it was quite a messy rebase.
BTW, you deleted numEmptyPages. That makes the code cleaner, but the variable let gistvacuum_recycle_pages() stop early once everything had been recycled already.
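
That is, the counter let the rescan loop stop as soon as the last recorded empty leaf had been deleted (as it was in patch 0004):

	x = InvalidBlockNumber;
	while (stats->numEmptyPages > 0 &&
		   (x = blockset_next(stats->internalPagesMap, x)) != InvalidBlockNumber)
	{
		/* re-check and delete empty children of internal page x */
	}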

Thanks!

Best regards, Andrey Borodin.

Attachments:

0001-Add-block-set-data-structure-v22.patchapplication/octet-stream; name=0001-Add-block-set-data-structure-v22.patch; x-unix-mode=0644Download
From 1cec1b62a44cadb0de58a145807ade9635dc9ddf Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Fri, 8 Mar 2019 21:19:32 +0500
Subject: [PATCH 1/6] Add block set data structure v22

Currently we have Bitmapset, which works only with 32-bit ints.
Furthermore, it is not very space-efficient in the case of a sparse
bitmap.

In this commit we invent the block set: a statically typed radix tree
for keeping BlockNumbers. This can still be improved in many ways
applicable to bitmaps; most notably, in-pointer lists could be used.
But for the sake of code simplicity it is plain in its structure for
now.

The block set is introduced to support efficient page deletion in
GiST's VACUUM.
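
A minimal usage sketch of the resulting API (see blockset.h below):

	BlockSet	bs = NULL;
	BlockNumber	blkno;

	bs = blockset_set(bs, 42);		/* first call allocates the set */
	Assert(blockset_get(42, bs));
	blkno = InvalidBlockNumber;		/* i.e. start from the minimal element */
	while ((blkno = blockset_next(bs, blkno)) != InvalidBlockNumber)
		elog(DEBUG1, "block %u is in the set", blkno);
	blockset_free(bs);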
---
 src/backend/lib/Makefile                      |   4 +-
 src/backend/lib/blockset.c                    | 201 ++++++++++++++++++
 src/include/lib/blockset.h                    |  24 +++
 src/test/modules/test_blockset/.gitignore     |   4 +
 src/test/modules/test_blockset/Makefile       |  21 ++
 src/test/modules/test_blockset/README         |   8 +
 .../test_blockset/expected/test_blockset.out  |   7 +
 .../test_blockset/sql/test_blockset.sql       |   3 +
 .../test_blockset/test_blockset--1.0.sql      |   8 +
 .../modules/test_blockset/test_blockset.c     | 182 ++++++++++++++++
 .../test_blockset/test_blockset.control       |   4 +
 11 files changed, 464 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/lib/blockset.c
 create mode 100644 src/include/lib/blockset.h
 create mode 100644 src/test/modules/test_blockset/.gitignore
 create mode 100644 src/test/modules/test_blockset/Makefile
 create mode 100644 src/test/modules/test_blockset/README
 create mode 100644 src/test/modules/test_blockset/expected/test_blockset.out
 create mode 100644 src/test/modules/test_blockset/sql/test_blockset.sql
 create mode 100644 src/test/modules/test_blockset/test_blockset--1.0.sql
 create mode 100644 src/test/modules/test_blockset/test_blockset.c
 create mode 100644 src/test/modules/test_blockset/test_blockset.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca2..9601894f07 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/lib
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-       ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+OBJS = binaryheap.o bipartite_match.o blockset.o bloomfilter.o dshash.o \
+       hyperloglog.o ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/blockset.c b/src/backend/lib/blockset.c
new file mode 100644
index 0000000000..c07b2974b1
--- /dev/null
+++ b/src/backend/lib/blockset.c
@@ -0,0 +1,201 @@
+/*-------------------------------------------------------------------------
+ *
+ * blockset.c
+ *		Data structure for operations on set of block numbers
+ *
+ * This data structure resembles Bitmapset in idea and operations, but
+ * has two main differences:
+ * 1. It handles unsigned BlockNumber as the position in the set instead of
+ * int32_t. This allows working with relation forks bigger than 2 billion
+ * blocks.
+ * 2. It is more space-efficient for sparse bitmaps: regions are allocated
+ * in chunks of 256 items at once.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/blockset.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "lib/blockset.h"
+
+/* Lowest level of radix tree is represented by bitmap */
+typedef struct
+{
+	char data[256/8];
+} BlockSetLevel4Data;
+
+typedef BlockSetLevel4Data *BlockSetLevel4;
+
+/* statically typed inner level chunks points to ground level */
+typedef struct
+{
+	/* null references denote empty subtree */
+	BlockSetLevel4 next[256];
+} BlockSetLevel3Data;
+
+typedef BlockSetLevel3Data *BlockSetLevel3;
+
+/* inner level points to another inner level */
+typedef struct
+{
+	BlockSetLevel3 next[256];
+} BlockSetLevel2Data;
+
+typedef BlockSetLevel2Data *BlockSetLevel2;
+
+/* Radix tree root */
+typedef struct BlockSetData
+{
+	BlockSetLevel2 next[256];
+} BlockSetData;
+
+/* multiplex block number into indexes of radix tree */
+#define BLOCKSET_SPLIT_BLKNO				\
+	uint32_t i1, i2, i3, i4, byte_no, byte_mask;	\
+	i4 = blkno % 256;						\
+	blkno /= 256;							\
+	i3 = blkno % 256;						\
+	blkno /= 256;							\
+	i2 = blkno % 256;						\
+	blkno /= 256;							\
+	i1 = blkno;								\
+	byte_no = i4 / 8;						\
+	byte_mask = 1 << (i4 % 8)				\
+
+/* indicate presence of block number in set */
+BlockSet blockset_set(BlockSet bs, BlockNumber blkno)
+{
+	BLOCKSET_SPLIT_BLKNO;
+	if (bs == NULL)
+	{
+		bs = palloc0(sizeof(BlockSetData));
+	}
+	BlockSetLevel2 bs2 = bs->next[i1];
+	if (bs2 == NULL)
+	{
+		bs2 = palloc0(sizeof(BlockSetLevel2Data));
+		bs->next[i1] = bs2;
+	}
+	BlockSetLevel3 bs3 = bs2->next[i2];
+	if (bs3 == NULL)
+	{
+		bs3 = palloc0(sizeof(BlockSetLevel3Data));
+		bs2->next[i2] = bs3;
+	}
+	BlockSetLevel4 bs4 = bs3->next[i3];
+	if (bs4 == NULL)
+	{
+		bs4 = palloc0(sizeof(BlockSetLevel4Data));
+		bs3->next[i3] = bs4;
+	}
+	bs4->data[byte_no] = byte_mask | bs4->data[byte_no];
+	return bs;
+}
+
+/* Test presence of block in set */
+bool blockset_get(BlockNumber blkno, BlockSet bs)
+{
+	BLOCKSET_SPLIT_BLKNO;
+	if (bs == NULL)
+		return false;
+	BlockSetLevel2 bs2 = bs->next[i1];
+	if (bs2 == NULL)
+		return false;
+	BlockSetLevel3 bs3 = bs2->next[i2];
+	if (bs3 == NULL)
+		return false;
+	BlockSetLevel4 bs4 = bs3->next[i3];
+	if (bs4 == NULL)
+		return false;
+	return (bs4->data[byte_no] & byte_mask);
+}
+
+/*
+ * Find the nearest block number in the set that is greater than blkno.
+ * Returns InvalidBlockNumber if there is none.
+ * If given InvalidBlockNumber, returns the minimal element of the set.
+ */
+BlockNumber blockset_next(BlockSet bs, BlockNumber blkno)
+{
+	if (blkno == InvalidBlockNumber)
+		blkno = 0; /* equivalent to ++, left for clarity */
+	else
+		blkno++;
+
+	BLOCKSET_SPLIT_BLKNO;
+
+	if (bs == NULL)
+		return InvalidBlockNumber;
+	/*
+	 * The increment clauses reset the lower-level positions whenever an
+	 * upper level advances ("continue" executes them too); otherwise a
+	 * stale position could make us skip members in the next subtree.
+	 */
+	for (; i1 < 256; i1++, i2 = 0, i3 = 0, byte_no = 0, byte_mask = 1)
+	{
+		BlockSetLevel2 bs2 = bs->next[i1];
+		if (!bs2)
+			continue;
+		for (; i2 < 256; i2++, i3 = 0, byte_no = 0, byte_mask = 1)
+		{
+			BlockSetLevel3 bs3 = bs2->next[i2];
+			if (!bs3)
+				continue;
+			for (; i3 < 256; i3++, byte_no = 0, byte_mask = 1)
+			{
+				BlockSetLevel4 bs4 = bs3->next[i3];
+				if (!bs4)
+					continue;
+				for (; byte_no < 256 / 8; byte_no++, byte_mask = 1)
+				{
+					if (!bs4->data[byte_no])
+						continue;
+					while (byte_mask < 256)
+					{
+						if ((byte_mask & bs4->data[byte_no]) == byte_mask)
+						{
+							i4 = byte_no * 8;
+							while (byte_mask >>= 1) i4++;
+							return i4 + 256 * (i3 + 256 * (i2 + 256 * i1));
+						}
+						byte_mask <<= 1;
+					}
+				}
+			}
+		}
+	}
+	return InvalidBlockNumber;
+}
+
+/* free anything palloced */
+void blockset_free(BlockSet bs)
+{
+	BlockNumber blkno = 0;
+	BLOCKSET_SPLIT_BLKNO;
+	if (bs == NULL)
+		return;
+	for (; i1 < 256; i1++)
+	{
+		BlockSetLevel2 bs2 = bs->next[i1];
+		if (!bs2)
+			continue;
+		for (i2 = 0; i2 < 256; i2++)
+		{
+			BlockSetLevel3 bs3 = bs2->next[i2];
+			if (!bs3)
+				continue;
+			for (i3 = 0; i3 < 256; i3++)
+			{
+				BlockSetLevel4 bs4 = bs3->next[i3];
+				if (bs4)
+					pfree(bs4);
+			}
+			pfree(bs3);
+		}
+		pfree(bs2);
+	}
+	pfree(bs);
+}
\ No newline at end of file
diff --git a/src/include/lib/blockset.h b/src/include/lib/blockset.h
new file mode 100644
index 0000000000..1966d17a8d
--- /dev/null
+++ b/src/include/lib/blockset.h
@@ -0,0 +1,24 @@
+/*-------------------------------------------------------------------------
+ *
+ * blockset.h
+ *		Data structure for operations on set of block numbers
+ *
+ * IDENTIFICATION
+ *    src/include/lib/blockset.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef BLOCKSET_H
+#define BLOCKSET_H
+
+#include "storage/block.h"
+
+typedef struct BlockSetData *BlockSet;
+
+extern BlockSet blockset_set(BlockSet bs, BlockNumber blkno);
+extern bool blockset_get(BlockNumber blkno, BlockSet bs);
+extern BlockNumber blockset_next(BlockSet bs, BlockNumber blkno);
+extern void blockset_free(BlockSet bs);
+
+#endif							/* BLOCKSET_H */
diff --git a/src/test/modules/test_blockset/.gitignore b/src/test/modules/test_blockset/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_blockset/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_blockset/Makefile b/src/test/modules/test_blockset/Makefile
new file mode 100644
index 0000000000..091cf8c095
--- /dev/null
+++ b/src/test/modules/test_blockset/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_blockset/Makefile
+
+MODULE_big = test_blockset
+OBJS = test_blockset.o $(WIN32RES)
+PGFILEDESC = "test_blockset - test code for block set library"
+
+EXTENSION = test_blockset
+DATA = test_blockset--1.0.sql
+
+REGRESS = test_blockset
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_blockset
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_blockset/README b/src/test/modules/test_blockset/README
new file mode 100644
index 0000000000..730951ff03
--- /dev/null
+++ b/src/test/modules/test_blockset/README
@@ -0,0 +1,8 @@
+test_blockset overview
+=========================
+
+test_blockset is a test harness module for testing the block set data
+structure. There are two main tests:
+1. Test of compliance with Bitmapset on numbers available to Bitmapset
+2. Test of numbers that can exceed INT32_MAX but are just shifted left by
+one bit from the numbers kept in Bitmapset
\ No newline at end of file
diff --git a/src/test/modules/test_blockset/expected/test_blockset.out b/src/test/modules/test_blockset/expected/test_blockset.out
new file mode 100644
index 0000000000..02c29d5fc0
--- /dev/null
+++ b/src/test/modules/test_blockset/expected/test_blockset.out
@@ -0,0 +1,7 @@
+CREATE EXTENSION test_blockset;
+SELECT test_blockset();
+ test_blockset 
+---------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_blockset/sql/test_blockset.sql b/src/test/modules/test_blockset/sql/test_blockset.sql
new file mode 100644
index 0000000000..85d2886676
--- /dev/null
+++ b/src/test/modules/test_blockset/sql/test_blockset.sql
@@ -0,0 +1,3 @@
+CREATE EXTENSION test_blockset;
+
+SELECT test_blockset();
\ No newline at end of file
diff --git a/src/test/modules/test_blockset/test_blockset--1.0.sql b/src/test/modules/test_blockset/test_blockset--1.0.sql
new file mode 100644
index 0000000000..04eea8a614
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_blockset/test_blockset--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_blockset" to load this file. \quit
+
+CREATE FUNCTION test_blockset()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_blockset/test_blockset.c b/src/test/modules/test_blockset/test_blockset.c
new file mode 100644
index 0000000000..7a1d1c86c8
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset.c
@@ -0,0 +1,182 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_blockset.c
+ *		Test block set data structure.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_blockset/test_blockset.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/blockset.h"
+#include "nodes/bitmapset.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_blockset);
+
+static void test_blockset_bms_compliance(int32_t test_limit);
+static void test_blockset_big_block_numbers(int32_t test_limit);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ */
+Datum
+test_blockset(PG_FUNCTION_ARGS)
+{
+	test_blockset_bms_compliance(0);
+	test_blockset_bms_compliance(1);
+	test_blockset_bms_compliance(2);
+	test_blockset_bms_compliance(1337);
+	test_blockset_bms_compliance(100000);
+	test_blockset_big_block_numbers(1337);
+	test_blockset_big_block_numbers(31337);
+	PG_RETURN_VOID();
+}
+
+/*
+ * This test creates a random bitmap with up to test_limit members
+ * and checks that the block set behaves like Bitmapset.
+ */
+static void test_blockset_bms_compliance(int32_t test_limit)
+{
+	BlockSet bs = NULL;
+	Bitmapset *bms = NULL;
+	BlockNumber blockno;
+	int index;
+	int32_t point_index = 0;
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		/* bms does not support block numbers above INT32_MAX */
+		bs = blockset_set(bs, blockno);
+		bms = bms_add_member(bms, (int)blockno);
+	}
+
+	index = -1;
+	blockno = InvalidBlockNumber;
+
+	while (true)
+	{
+		point_index++;
+		BlockNumber next_bn = blockset_next(bs, blockno);
+		int next_index = bms_next_member(bms, index);
+
+
+		if (next_bn == InvalidBlockNumber && next_index == -2)
+			break; /* found everything; fall through to the random probes below */
+
+		if (((BlockNumber)next_index) != next_bn)
+		{
+			elog(ERROR,
+				 "Bitmapset returned value %X different from block set %X,"
+				 " test_limit %d, point index %d",
+				 next_index, next_bn, test_limit, point_index);
+		}
+
+		if (!blockset_get(next_bn, bs))
+		{
+			elog(ERROR,
+				 "Block set did not found present item %X"
+				 " test_limit %d, point index %d",
+				 next_bn, test_limit, point_index);
+		}
+
+		index = next_index;
+		blockno = next_bn;
+	}
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		if (blockset_get(blockno, bs) != bms_is_member((int)blockno, bms))
+		{
+			elog(ERROR,
+				 "Block set did agree with bitmapset item %X"
+				 " test_limit %d, point index %d",
+				 blockno, test_limit, point_index);
+		}
+	}
+
+	blockset_free(bs);
+	bms_free(bms);
+}
+
+/*
+ * This test is similar to test_blockset_bms_compliance(),
+ * except that it shifts BlockNumbers left by one bit to ensure that blockset
+ * operates correctly on values higher than INT32_MAX.
+ * This function is copy-pasted from the previous one, with the exception of
+ * the shifts applied to BlockNumbers. I've tried various refactorings, but
+ * they all looked ugly.
+ */
+static void test_blockset_big_block_numbers(int32_t test_limit)
+{
+	BlockSet bs = NULL;
+	Bitmapset *bms = NULL;
+	BlockNumber blockno;
+	int index;
+	int32_t point_index = 0;
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		/* bms does not support block numbers above INT32_MAX */
+		bs = blockset_set(bs, blockno << 1);
+		bms = bms_add_member(bms, (int)blockno);
+	}
+
+	index = -1;
+	blockno = InvalidBlockNumber;
+
+	while (true)
+	{
+		point_index++;
+		BlockNumber next_bn = blockset_next(bs, blockno);
+		int next_index = bms_next_member(bms, index);
+
+
+		if (next_bn == InvalidBlockNumber && next_index == -2)
+			break; /* found everything; fall through to the random probes below */
+
+		if (((BlockNumber)next_index) != (next_bn >> 1))
+		{
+			elog(ERROR,
+				 "Bitmapset returned value %X different from block set %X,"
+				 " test_limit %d, point index %d",
+				 next_index, next_bn, test_limit, point_index);
+		}
+
+		if (!blockset_get(next_bn, bs))
+		{
+			elog(ERROR,
+				 "Block set did not found present item %X"
+				 " test_limit %d, point index %d",
+				 next_bn, test_limit, point_index);
+		}
+
+		index = next_index;
+		blockno = next_bn;
+	}
+
+	for (int32_t i = 0; i < test_limit; i++)
+	{
+		blockno = random() & INT32_MAX;
+		if (blockset_get(blockno << 1, bs) != bms_is_member((int)blockno, bms))
+		{
+			elog(ERROR,
+				 "Block set did agree with bitmapset item %X"
+				 " test_limit %d, point index %d",
+				 blockno, test_limit, point_index);
+		}
+	}
+
+	blockset_free(bs);
+	bms_free(bms);
+}
diff --git a/src/test/modules/test_blockset/test_blockset.control b/src/test/modules/test_blockset/test_blockset.control
new file mode 100644
index 0000000000..fdb7598c5a
--- /dev/null
+++ b/src/test/modules/test_blockset/test_blockset.control
@@ -0,0 +1,4 @@
+comment = 'Test code for block set library'
+default_version = '1.0'
+module_pathname = '$libdir/test_blockset'
+relocatable = true
-- 
2.20.1

0002-Delete-pages-during-GiST-VACUUM-v22.patchapplication/octet-stream; name=0002-Delete-pages-during-GiST-VACUUM-v22.patch; x-unix-mode=0644Download
From cb3349f5a838099b386ad490a664801f96473678 Mon Sep 17 00:00:00 2001
From: Andrey <amborodin@acm.org>
Date: Fri, 8 Mar 2019 21:24:37 +0500
Subject: [PATCH 2/6] Delete pages during GiST VACUUM v22

This commit teaches GiST to actually delete pages during VACUUM.

To do this we scan the index twice. In the first pass we make note
of empty pages and internal pages. In the second pass we scan through
the internal pages looking for references to empty leaf pages.
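
Roughly, in terms of the code below:

	/* pass 1: physical-order scan; records empty leaves and internal pages */
	for (; blkno < num_pages; blkno++)
		gistvacuumpage(&vstate, blkno, blkno);

	/* pass 2: revisit internal pages and drop downlinks to empty leaves */
	x = InvalidBlockNumber;
	while (vstate.emptyPages > 0 &&
		   (x = blockset_next(vstate.internalPagesMap, x)) != InvalidBlockNumber)
	{
		/* lock each empty leaf, then its parent, re-check, gistdeletepage() */
	}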
---
 src/backend/access/gist/README         |  14 ++
 src/backend/access/gist/gist.c         |  18 +++
 src/backend/access/gist/gistutil.c     |   3 +-
 src/backend/access/gist/gistvacuum.c   | 186 ++++++++++++++++++++++++-
 src/backend/access/gist/gistxlog.c     |  60 ++++++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/include/access/gist.h              |   4 +
 src/include/access/gist_private.h      |   7 +-
 src/include/access/gistxlog.h          |  10 +-
 src/test/regress/expected/gist.out     |   6 +-
 src/test/regress/sql/gist.sql          |   6 +-
 11 files changed, 305 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b8..c84359de31 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,20 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+Function gistbulkdelete() is responsible for marking empty leaf pages as free
+so that they can be reused for newly split pages. To find these pages, the
+function scans the index in physical order.
+
+The physical scan reads the entire index from the first page to the last. The
+scan maintains the information necessary to collect the block numbers of
+internal pages that need cleansing, and the block numbers of empty leaves.
+
+After the scan, each potentially free leaf page is examined, with its parent
+page held under an exclusive lock. gistbulkdelete() never deletes the last
+remaining downlink on an internal page, to preserve the balanced tree shape.
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2ce5425ef9..6f87b4b504 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -704,6 +704,11 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/*
+			 * Currently internal pages are not deleted during vacuum,
+			 * so we do not need to check whether the page is deleted.
+			 */
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -838,6 +843,19 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 				}
 			}
 
+			/*
+			 * Leaf pages can be left deleted but still referenced until their
+			 * space is reused. The downlink to such a page may already be
+			 * removed from the internal page, but this scan can still hold it.
+			 */
+			if(GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index f32e16eed5..4fa44bf2f6 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/float.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 #include "utils/lsyscache.h"
 
 
@@ -834,7 +835,7 @@ gistNewBuffer(Relation r)
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
+			if (GistPageIsDeleted(page) && TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin))
 				return buffer;	/* OK to use */
 
 			LockBuffer(buffer, GIST_UNLOCK);
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 3c1d75691e..85b9d7c219 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,11 +16,15 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
+#include "lib/blockset.h"
 #include "miscadmin.h"
+#include "nodes/bitmapset.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
+
 /* Working state needed by gistbulkdelete */
 typedef struct
 {
@@ -30,6 +34,10 @@ typedef struct
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
+	BlockNumber emptyPages;
+
+	BlockSet	internalPagesMap;
+	BlockSet	emptyLeafPagesMap;
 } GistVacState;
 
 static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -50,6 +58,7 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 	gistvacuumscan(info, stats, callback, callback_state);
 
+
 	return stats;
 }
 
@@ -89,6 +98,57 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	return stats;
 }
 
+/*
+ * gistdeletepage takes a parent page and a leaf page and tries to delete the
+ * leaf. Both pages must be locked. Returns true if the deletion actually
+ * happened. Never removes the last downlink from the parent page.
+ */
+static bool
+gistdeletepage(GistVacState *vstate,
+			   Buffer buffer, Page page, OffsetNumber downlink,
+			   Buffer leafBuffer, Page leafPage, TransactionId txid)
+{
+	ItemId		iid;
+	IndexTuple	idxtuple;
+	XLogRecPtr	recptr;
+	if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page)
+		|| PageGetMaxOffsetNumber(page) < downlink
+		|| PageGetMaxOffsetNumber(page) <= FirstOffsetNumber)
+		return false;
+
+	/* check that the old downlink still points to leafBuffer */
+	iid = PageGetItemId(page, downlink);
+	idxtuple = (IndexTuple) PageGetItem(page, iid);
+	if (BufferGetBlockNumber(leafBuffer) !=
+		ItemPointerGetBlockNumber(&(idxtuple->t_tid)))
+		return false;
+
+	/* Mark the page as deleted and drop its downlink from the internal page */
+	START_CRIT_SECTION();
+
+	/* Remember xid of last transaction that could see this page */
+	GistPageSetDeleteXid(leafPage,txid);
+	GistPageSetDeleted(leafPage);
+	MarkBufferDirty(leafBuffer);
+	vstate->stats->pages_deleted++;
+	vstate->emptyPages--;
+
+	MarkBufferDirty(buffer);
+	/* Offsets are changed as long as we delete tuples from internal page */
+	PageIndexTupleDelete(page, downlink);
+
+	if (RelationNeedsWAL(vstate->info->index))
+		recptr 	= gistXLogSetDeleted(vstate->info->index->rd_node, leafBuffer,
+										txid, buffer, downlink);
+	else
+		recptr = gistGetFakeLSN(vstate->info->index);
+	PageSetLSN(page, recptr);
+	PageSetLSN(leafPage, recptr);
+
+	END_CRIT_SECTION();
+	return true;
+}
+
 /*
  * gistvacuumscan --- scan the index for VACUUMing purposes
  *
@@ -103,6 +163,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * while the index is being expanded, leaving an all-zeros page behind.
  *
  * The caller is responsible for initially allocating/zeroing a stats struct.
+ *
+ * This performs bulk deletion of all index entries pointing to a set of heap
+ * tuples, and checks for invalid tuples left behind by a pre-9.1 upgrade.
+ * The set of target tuples is specified via a callback routine that tells
+ * whether any given heap tuple (identified by ItemPointer) is being deleted.
  */
 static void
 gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
@@ -132,6 +197,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	else
 		vstate.startNSN = gistGetFakeLSN(rel);
 	vstate.totFreePages = 0;
+	vstate.emptyPages = 0;
+	vstate.internalPagesMap = NULL;
+	vstate.emptyLeafPagesMap = NULL;
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -171,6 +239,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (blkno >= num_pages)
 			break;
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; blkno < num_pages; blkno++)
 			gistvacuumpage(&vstate, blkno, blkno);
@@ -194,6 +263,111 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* update statistics */
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
+
+	/* rescan all inner pages to find those that have empty child pages */
+	if (vstate.emptyPages > 0)
+	{
+		BlockNumber			x;
+
+		x = InvalidBlockNumber;
+		while (vstate.emptyPages > 0 &&
+			   (x = blockset_next(vstate.internalPagesMap, x)) != InvalidBlockNumber)
+		{
+			Buffer		buffer;
+			Page		page;
+			OffsetNumber off,
+				maxoff;
+			IndexTuple  idxtuple;
+			ItemId	    iid;
+			OffsetNumber todelete[MaxOffsetNumber];
+			Buffer		buftodelete[MaxOffsetNumber];
+			int			ntodelete = 0;
+
+			blkno = (BlockNumber) x;
+
+			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+										info->strategy);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			page = (Page) BufferGetPage(buffer);
+			if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+			{
+				UnlockReleaseBuffer(buffer);
+				continue;
+			}
+
+			maxoff = PageGetMaxOffsetNumber(page);
+			/* Check that leaves are still empty and decide what to delete */
+			for (off = FirstOffsetNumber; off <= maxoff && ntodelete < maxoff-1; off = OffsetNumberNext(off))
+			{
+				Buffer		leafBuffer;
+				BlockNumber leafBlockNo;
+
+				/* we must keep at least one downlink on each internal page */
+				if (ntodelete >= maxoff-1)
+					continue;
+
+				iid = PageGetItemId(page, off);
+				idxtuple = (IndexTuple) PageGetItem(page, iid);
+				/* if this page was not empty during the previous scan, skip it */
+				leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+				if (!blockset_get(leafBlockNo, vstate.emptyLeafPagesMap))
+					continue;
+
+				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
+												RBM_NORMAL, info->strategy);
+ 
+				buftodelete[ntodelete] = leafBuffer;
+				todelete[ntodelete++] = off;
+			}
+
+			/*
+			 * We will have to relock internal page in case of deletes:
+			 * we cannot lock child while holding parent lock without risk
+			 * of a deadlock
+			 */
+			LockBuffer(buffer, GIST_UNLOCK);
+
+			if (ntodelete)
+			{
+				/*
+				 * Like in _bt_unlink_halfdead_page we need an upper bound on xid
+				 * that could hold downlinks to this page. We use
+				 * ReadNewTransactionId() instead of GetCurrentTransactionId()
+				 * since we are in a VACUUM.
+				 */
+				TransactionId	txid = ReadNewTransactionId();
+
+				int deleted = 0;
+
+				for (off = 0; off < ntodelete; off++)
+				{
+					Buffer	leafBuffer = buftodelete[off];
+					Page	leafPage;
+					LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+					gistcheckpage(rel, leafBuffer);
+					leafPage = (Page) BufferGetPage(leafBuffer);
+					if (GistPageIsLeaf(leafPage) /* still a leaf */
+						&& PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* still empty */
+						&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) > GistPageGetNSN(leafPage)) /* no follow-right */
+						)
+					{
+						LockBuffer(buffer, GIST_EXCLUSIVE);
+						page = (Page) BufferGetPage(buffer);
+						if (gistdeletepage(&vstate, buffer, page, todelete[off] - deleted, leafBuffer, leafPage, txid))
+							deleted++;
+						LockBuffer(buffer, GIST_UNLOCK);
+					}
+					UnlockReleaseBuffer(leafBuffer);
+				}
+			}
+
+			ReleaseBuffer(buffer);
+		}
+	}
+
+	blockset_free(vstate.emptyLeafPagesMap);
+	blockset_free(vstate.internalPagesMap);
 }
 
 /*
@@ -246,6 +420,7 @@ restart:
 	{
 		OffsetNumber todelete[MaxOffsetNumber];
 		int			ntodelete = 0;
+		int			nremain;
 		GISTPageOpaque opaque = GistPageGetOpaque(page);
 		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
 
@@ -319,10 +494,19 @@ restart:
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
 
-		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
+		nremain = maxoff - FirstOffsetNumber + 1;
+		if (nremain == 0)
+		{
+			vstate->emptyLeafPagesMap = blockset_set(vstate->emptyLeafPagesMap, blkno);
+			vstate->emptyPages++;
+		}
+		else
+			stats->num_index_tuples += nremain;
 	}
 	else
 	{
+		vstate->internalPagesMap = blockset_set(vstate->internalPagesMap, blkno);
+
 		/*
 		 * On an internal page, check for "invalid tuples", left behind by an
 		 * incomplete page split on PostgreSQL 9.0 or below.  These are not
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 408bd5390a..3213ea98ea 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -64,6 +64,39 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
+static void
+gistRedoPageSetDeleted(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
 /*
  * redo any page update (except page split)
  */
@@ -116,6 +149,7 @@ gistRedoPageUpdateRecord(XLogReaderState *record)
 			data += sizeof(OffsetNumber) * xldata->ntodelete;
 
 			PageIndexMultiDelete(page, todelete, xldata->ntodelete);
+
 			if (GistPageIsLeaf(page))
 				GistMarkTuplesDeleted(page);
 		}
@@ -535,6 +569,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageSetDeleted(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -653,6 +690,29 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page delete. This also includes removal of
+ * downlink from internal page.
+ */
+XLogRecPtr
+gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
+					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = internalPageOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
+	/* new tuples */
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e468c9e15a..0861f82992 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -76,6 +76,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 3234f24156..ce8bfd83ea 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,10 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+/* For deleted pages we store the last xid that could see the page in a scan */
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 463d2bfc7b..943163ccce 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -414,12 +414,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
+					TransactionId xid, Buffer internalPageBuffer,
+					OffsetNumber internalPageOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
 			   Buffer leftchild);
 
-XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
+extern XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
 			   int ntodelete, RelFileNode hnode);
 
 extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 5117aabf1a..127cff5cb7 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,13 +17,15 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
+/* XLog stuff */
+
 #define XLOG_GIST_PAGE_UPDATE		0x00
 #define XLOG_GIST_DELETE			0x10 /* delete leaf index tuples for a page */
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -76,6 +78,12 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+typedef struct gistxlogPageDelete
+{
+   TransactionId deleteXid; /* last Xid which could see page in scan */
+   OffsetNumber downlinkOffset; /* Offset of the downlink referencing this page */
+} gistxlogPageDelete;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf..0a43449f00 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,10 +27,8 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
-delete from gist_point_tbl where id < 10000;
+-- And also delete some concentration of values.
+delete from gist_point_tbl where id > 5000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
 alter index gist_pointidx SET (fillfactor = 40);
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13..657b195484 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,10 +28,8 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
-delete from gist_point_tbl where id < 10000;
+-- And also delete some concentration of values.
+delete from gist_point_tbl where id > 5000;
 
 vacuum analyze gist_point_tbl;
 
-- 
2.20.1

0003-Minor-cleanup-v22.patchapplication/octet-stream; name=0003-Minor-cleanup-v22.patch; x-unix-mode=0644Download
From b7e482b287f88c2c679f67e7ca46636097252d2a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 11 Mar 2019 15:15:43 +0200
Subject: [PATCH 3/6] Minor cleanup v22

I renamed gistXLogSetDeleted to gistXLogPageDelete. I think that's better,
since the WAL record also includes information about removing the downlink
from the parent, not just setting the flag on the child.
---
 src/backend/access/gist/gist.c       |  7 +--
 src/backend/access/gist/gistvacuum.c |  7 +--
 src/backend/access/gist/gistxlog.c   | 88 +++++++++++++++-------------
 src/include/access/gist_private.h    |  6 +-
 src/include/access/gistxlog.h        | 10 ++--
 5 files changed, 61 insertions(+), 57 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 6f87b4b504..e260c4df4e 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -844,11 +844,10 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			}
 
 			/*
-			 * Leaf pages can be left deleted but still referenced
-			 * until it's space is reused. Downlink to this page may be already
-			 * removed from the internal page, but this scan can posess it.
+			 * The page might have been deleted after we scanned the parent
+			 * and saw the downlink.
 			 */
-			if(GistPageIsDeleted(stack->page))
+			if (GistPageIsDeleted(stack->page))
 			{
 				UnlockReleaseBuffer(stack->buffer);
 				xlocked = false;
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 85b9d7c219..99dbf1cbb7 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -20,11 +20,9 @@
 #include "commands/vacuum.h"
 #include "lib/blockset.h"
 #include "miscadmin.h"
-#include "nodes/bitmapset.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
-
 /* Working state needed by gistbulkdelete */
 typedef struct
 {
@@ -58,7 +56,6 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 
 	gistvacuumscan(info, stats, callback, callback_state);
 
-
 	return stats;
 }
 
@@ -163,7 +160,7 @@ gistdeletepage(GistVacState *vstate,
  * while the index is being expanded, leaving an all-zeros page behind.
  *
  * The caller is responsible for initially allocating/zeroing a stats struct.
- * 
+ *
  * Bulk deletion of all index entries pointing to a set of heap tuples and
  * check invalid tuples left after upgrade.
  * The set of target tuples is specified via a callback routine that tells
@@ -264,7 +261,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	stats->num_pages = num_pages;
 	stats->pages_free = vstate.totFreePages;
 
-	/* rescan all inner pages to find those that has empty child pages */
+	/* rescan all inner pages to find those that have empty child pages */
 	if (vstate.emptyPages > 0)
 	{
 		BlockNumber			x;
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 3213ea98ea..7110c70451 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -64,39 +64,6 @@ gistRedoClearFollowRight(XLogReaderState *record, uint8 block_id)
 		UnlockReleaseBuffer(buffer);
 }
 
-static void
-gistRedoPageSetDeleted(XLogReaderState *record)
-{
-	XLogRecPtr	lsn = record->EndRecPtr;
-	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
-	Buffer		buffer;
-	Page		page;
-
-	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
-	{
-		page = (Page) BufferGetPage(buffer);
-
-		GistPageSetDeleteXid(page, xldata->deleteXid);
-		GistPageSetDeleted(page);
-
-		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-	}
-	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
-
-	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
-	{
-		page = (Page) BufferGetPage(buffer);
-
-		PageIndexTupleDelete(page, xldata->downlinkOffset);
-
-		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-	}
-	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
-}
 /*
  * redo any page update (except page split)
  */
@@ -542,6 +509,43 @@ gistRedoCreateIndex(XLogReaderState *record)
 	UnlockReleaseBuffer(buffer);
 }
 
+/* redo page deletion */
+static void
+gistRedoPageDelete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	/* FIXME: Are we locking the pages in correct order, for hot standby? */
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
 void
 gist_redo(XLogReaderState *record)
 {
@@ -570,7 +574,7 @@ gist_redo(XLogReaderState *record)
 			gistRedoCreateIndex(record);
 			break;
 		case XLOG_GIST_PAGE_DELETE:
-			gistRedoPageSetDeleted(record);
+			gistRedoPageDelete(record);
 			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
@@ -691,25 +695,27 @@ gistXLogSplit(bool page_is_leaf,
 }
 
 /*
- * Write XLOG record describing a page delete. This also includes removal of
- * downlink from internal page.
+ * Write XLOG record describing a page deletion. This also includes removal of
+ * downlink from the parent page.
  */
 XLogRecPtr
-gistXLogSetDeleted(RelFileNode node, Buffer buffer, TransactionId xid,
-					Buffer internalPageBuffer, OffsetNumber internalPageOffset) {
+gistXLogPageDelete(Buffer buffer, TransactionId xid,
+				   Buffer parentBuffer, OffsetNumber downlinkOffset)
+{
 	gistxlogPageDelete xlrec;
 	XLogRecPtr	recptr;
 
 	xlrec.deleteXid = xid;
-	xlrec.downlinkOffset = internalPageOffset;
+	xlrec.downlinkOffset = downlinkOffset;
 
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
 
 	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
-	XLogRegisterBuffer(1, internalPageBuffer, REGBUF_STANDARD);
-	/* new tuples */
+	XLogRegisterBuffer(1, parentBuffer, REGBUF_STANDARD);
+
 	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+
 	return recptr;
 }
 
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 943163ccce..c77d0b4dd8 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -415,9 +415,9 @@ extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
 /* gistxlog.c */
-extern XLogRecPtr gistXLogSetDeleted(RelFileNode node, Buffer buffer,
-					TransactionId xid, Buffer internalPageBuffer,
-					OffsetNumber internalPageOffset);
+extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
+				   TransactionId xid, Buffer parentBuffer,
+				   OffsetNumber downlinkOffset);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 127cff5cb7..939a1ea755 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -17,8 +17,6 @@
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 
-/* XLog stuff */
-
 #define XLOG_GIST_PAGE_UPDATE		0x00
 #define XLOG_GIST_DELETE			0x10 /* delete leaf index tuples for a page */
  /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
@@ -78,10 +76,14 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+/*
+ * Backup Blk 0: page that was deleted.
+ * Backup Blk 1: parent page, containing the downlink to the deleted page.
+ */
 typedef struct gistxlogPageDelete
 {
-   TransactionId deleteXid; /* last Xid which could see page in scan */
-   OffsetNumber downlinkOffset; /* Offset of the downlink referencing this page */
+	TransactionId deleteXid;	/* last Xid which could see page in scan */
+	OffsetNumber downlinkOffset; /* Offset of downlink referencing this page */
 } gistxlogPageDelete;
 
 extern void gist_redo(XLogReaderState *record);
-- 
2.20.1

0004-Move-the-page-deletion-logic-to-separate-function-v2.patchapplication/octet-stream; name=0004-Move-the-page-deletion-logic-to-separate-function-v2.patch; x-unix-mode=0644Download
From c865e25a6367eda8165db9028786b6ea36e855ff Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 11 Mar 2019 15:34:01 +0200
Subject: [PATCH 4/6] Move the page deletion logic to separate function v22

If a VACUUM does multiple index passes, I think we only want to do the
empty page deletion after the final pass. That saves effort, since we
only need to scan the internal pages once. But even if we wanted to do
it on every pass, I think having a separate function makes it more
readable.
---
 src/backend/access/gist/gistvacuum.c | 304 ++++++++++++++-------------
 1 file changed, 157 insertions(+), 147 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 99dbf1cbb7..9439dfb49c 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -23,25 +23,34 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
-/* Working state needed by gistbulkdelete */
 typedef struct
 {
+	IndexBulkDeleteResult stats;
+
 	IndexVacuumInfo *info;
-	IndexBulkDeleteResult *stats;
+	BlockNumber numEmptyPages;
+	BlockSet	internalPagesMap;
+	BlockSet	emptyLeafPagesMap;
+} GistBulkDeleteResult;
+
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	GistBulkDeleteResult *stats;
 	IndexBulkDeleteCallback callback;
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
-	BlockNumber emptyPages;
-
-	BlockSet	internalPagesMap;
-	BlockSet	emptyLeafPagesMap;
 } GistVacState;
 
-static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static bool gistdeletepage(GistBulkDeleteResult *stats,
+			   Buffer buffer, Page page, OffsetNumber downlink,
+			   Buffer leafBuffer, Page leafPage, TransactionId txid);
+static void gistvacuumscan(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state);
 static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
 			   BlockNumber orig_blkno);
+static void gistvacuum_recycle_pages(GistBulkDeleteResult *stats);
 
 /*
  * VACUUM bulkdelete stage: remove index entries.
@@ -50,13 +59,15 @@ IndexBulkDeleteResult *
 gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
+	GistBulkDeleteResult *gist_stats = (GistBulkDeleteResult *) stats;
+
 	/* allocate stats if first time through, else re-use existing struct */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	if (gist_stats == NULL)
+		gist_stats = (GistBulkDeleteResult *) palloc0(sizeof(GistBulkDeleteResult));
 
-	gistvacuumscan(info, stats, callback, callback_state);
+	gistvacuumscan(info, gist_stats, callback, callback_state);
 
-	return stats;
+	return (IndexBulkDeleteResult *) gist_stats;
 }
 
 /*
@@ -65,6 +76,8 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
+	GistBulkDeleteResult *gist_stats = (GistBulkDeleteResult *) stats;
+
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
@@ -74,10 +87,10 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 * stats from the latest gistbulkdelete call.  If it wasn't called, we
 	 * still need to do a pass over the index, to obtain index statistics.
 	 */
-	if (stats == NULL)
+	if (gist_stats == NULL)
 	{
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-		gistvacuumscan(info, stats, NULL, NULL);
+		gist_stats = (GistBulkDeleteResult *) palloc0(sizeof(GistBulkDeleteResult));
+		gistvacuumscan(info, gist_stats, NULL, NULL);
 	}
 
 	/*
@@ -88,11 +101,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 */
 	if (!info->estimated_count)
 	{
-		if (stats->num_index_tuples > info->num_heap_tuples)
-			stats->num_index_tuples = info->num_heap_tuples;
+		if (gist_stats->stats.num_index_tuples > info->num_heap_tuples)
+			gist_stats->stats.num_index_tuples = info->num_heap_tuples;
 	}
 
-	return stats;
+	return (IndexBulkDeleteResult *) gist_stats;
 }
 
 /*
@@ -101,7 +114,7 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * Does not remove last downlink.
  */
 static bool
-gistdeletepage(GistVacState *vstate,
+gistdeletepage(GistBulkDeleteResult *stats,
 			   Buffer buffer, Page page, OffsetNumber downlink,
 			   Buffer leafBuffer, Page leafPage, TransactionId txid)
 {
@@ -127,18 +140,17 @@ gistdeletepage(GistVacState *vstate,
 	GistPageSetDeleteXid(leafPage,txid);
 	GistPageSetDeleted(leafPage);
 	MarkBufferDirty(leafBuffer);
-	vstate->stats->pages_deleted++;
-	vstate->emptyPages--;
+	stats->stats.pages_deleted++;
+	stats->numEmptyPages--;
 
 	MarkBufferDirty(buffer);
 	/* Offsets are changed as long as we delete tuples from internal page */
 	PageIndexTupleDelete(page, downlink);
 
-	if (RelationNeedsWAL(vstate->info->index))
-		recptr 	= gistXLogSetDeleted(vstate->info->index->rd_node, leafBuffer,
-										txid, buffer, downlink);
+	if (RelationNeedsWAL(stats->info->index))
+		recptr 	= gistXLogPageDelete(leafBuffer, txid, buffer, downlink);
 	else
-		recptr = gistGetFakeLSN(vstate->info->index);
+		recptr = gistGetFakeLSN(stats->info->index);
 	PageSetLSN(page, recptr);
 	PageSetLSN(leafPage, recptr);
 
@@ -167,7 +179,7 @@ gistdeletepage(GistVacState *vstate,
  * whether any given heap tuple (identified by ItemPointer) is being deleted.
  */
 static void
-gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+gistvacuumscan(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
@@ -180,12 +192,12 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * Reset counts that will be incremented during the scan; needed in case
 	 * of multiple scans during a single VACUUM command.
 	 */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
-	stats->pages_deleted = 0;
+	stats->stats.estimated_count = false;
+	stats->stats.num_index_tuples = 0;
+	stats->stats.pages_deleted = 0;
 
 	/* Set up info to pass down to gistvacuumpage */
-	vstate.info = info;
+	stats->info = info;
 	vstate.stats = stats;
 	vstate.callback = callback;
 	vstate.callback_state = callback_state;
@@ -194,9 +206,6 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	else
 		vstate.startNSN = gistGetFakeLSN(rel);
 	vstate.totFreePages = 0;
-	vstate.emptyPages = 0;
-	vstate.internalPagesMap = NULL;
-	vstate.emptyLeafPagesMap = NULL;
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -257,114 +266,15 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	if (vstate.totFreePages > 0)
 		IndexFreeSpaceMapVacuum(rel);
 
-	/* update statistics */
-	stats->num_pages = num_pages;
-	stats->pages_free = vstate.totFreePages;
-
-	/* rescan all inner pages to find those that have empty child pages */
-	if (vstate.emptyPages > 0)
-	{
-		BlockNumber			x;
-
-		x = InvalidBlockNumber;
-		while (vstate.emptyPages > 0 &&
-			   (x = blockset_next(vstate.internalPagesMap, x)) != InvalidBlockNumber)
-		{
-			Buffer		buffer;
-			Page		page;
-			OffsetNumber off,
-				maxoff;
-			IndexTuple  idxtuple;
-			ItemId	    iid;
-			OffsetNumber todelete[MaxOffsetNumber];
-			Buffer		buftodelete[MaxOffsetNumber];
-			int			ntodelete = 0;
-
-			blkno = (BlockNumber) x;
-
-			buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
-										info->strategy);
-
-			LockBuffer(buffer, GIST_EXCLUSIVE);
-			page = (Page) BufferGetPage(buffer);
-			if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
-			{
-				UnlockReleaseBuffer(buffer);
-				continue;
-			}
-
-			maxoff = PageGetMaxOffsetNumber(page);
-			/* Check that leaves are still empty and decide what to delete */
-			for (off = FirstOffsetNumber; off <= maxoff && ntodelete < maxoff-1; off = OffsetNumberNext(off))
-			{
-				Buffer		leafBuffer;
-				BlockNumber leafBlockNo;
-
-				/* We must keep at least one downlink on each internal page */
-				if (ntodelete >= maxoff-1)
-					continue;
-
-				iid = PageGetItemId(page, off);
-				idxtuple = (IndexTuple) PageGetItem(page, iid);
-				/* if this page was not empty in previous scan - we do not consider it */
-				leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
-				if (!blockset_get(leafBlockNo, vstate.emptyLeafPagesMap))
-					continue;
-
-				leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
-												RBM_NORMAL, info->strategy);
- 
-				buftodelete[ntodelete] = leafBuffer;
-				todelete[ntodelete++] = off;
-			}
-
-			/*
-			 * We will have to relock the internal page in case of deletes:
-			 * we cannot lock a child while holding the parent's lock without
-			 * risking a deadlock
-			 */
-			LockBuffer(buffer, GIST_UNLOCK);
-
-			if (ntodelete)
-			{
-				/*
-				 * Like in _bt_unlink_halfdead_page we need an upper bound on xid
-			 * that could hold downlinks to this page. We use
-			 * ReadNewTransactionId() instead of GetCurrentTransactionId(),
-			 * since we are in a VACUUM.
-				 */
-				TransactionId	txid = ReadNewTransactionId();
-
-				int deleted = 0;
-
-				for (off = 0; off < ntodelete; off++)
-				{
-					Buffer	leafBuffer = buftodelete[off];
-					Page	leafPage;
-					LockBuffer(leafBuffer, GIST_EXCLUSIVE);
-					gistcheckpage(rel, leafBuffer);
-					leafPage = (Page) BufferGetPage(leafBuffer);
-					if (GistPageIsLeaf(leafPage) /* still a leaf */
-						&& PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* still empty */
-						&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) > GistPageGetNSN(leafPage)) /* no follow-right */
-						)
-					{
-						LockBuffer(buffer, GIST_EXCLUSIVE);
-						page = (Page) BufferGetPage(buffer);
-						if (gistdeletepage(&vstate, buffer, page, todelete[off] - deleted, leafBuffer, leafPage, txid))
-							deleted++;
-						LockBuffer(buffer, GIST_UNLOCK);
-					}
-					UnlockReleaseBuffer(leafBuffer);
-				}
-			}
 
-			ReleaseBuffer(buffer);
-		}
-	}
+	/* Recycle empty pages */
+	gistvacuum_recycle_pages(stats);
 
-	blockset_free(vstate.emptyLeafPagesMap);
-	blockset_free(vstate.internalPagesMap);
+	blockset_free(stats->emptyLeafPagesMap);
+	blockset_free(stats->internalPagesMap);
+	/* update statistics */
+	stats->stats.num_pages = num_pages;
+	stats->stats.pages_free = vstate.totFreePages;
 }
 
 /*
@@ -381,8 +291,8 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 static void
 gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 {
-	IndexVacuumInfo *info = vstate->info;
-	IndexBulkDeleteResult *stats = vstate->stats;
+	GistBulkDeleteResult *stats = vstate->stats;
+	IndexVacuumInfo *info = stats->info;
 	IndexBulkDeleteCallback callback = vstate->callback;
 	void	   *callback_state = vstate->callback_state;
 	Relation	rel = info->index;
@@ -411,7 +321,7 @@ restart:
 		/* Okay to recycle this page */
 		RecordFreeIndexPage(rel, blkno);
 		vstate->totFreePages++;
-		stats->pages_deleted++;
+		stats->stats.pages_deleted++;
 	}
 	else if (GistPageIsLeaf(page))
 	{
@@ -486,7 +396,7 @@ restart:
 
 			END_CRIT_SECTION();
 
-			stats->tuples_removed += ntodelete;
+			stats->stats.tuples_removed += ntodelete;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
@@ -494,16 +404,14 @@ restart:
 		nremain = maxoff - FirstOffsetNumber + 1;
 		if (nremain == 0)
 		{
-			vstate->emptyLeafPagesMap = blockset_set(vstate->emptyLeafPagesMap, blkno);
-			vstate->emptyPages++;
+			stats->emptyLeafPagesMap = blockset_set(stats->emptyLeafPagesMap, blkno);
+			stats->numEmptyPages++;
 		}
 		else
-			stats->num_index_tuples += nremain;
+			stats->stats.num_index_tuples += nremain;
 	}
 	else
 	{
-		vstate->internalPagesMap = blockset_set(vstate->internalPagesMap, blkno);
-
 		/*
 		 * On an internal page, check for "invalid tuples", left behind by an
 		 * incomplete page split on PostgreSQL 9.0 or below.  These are not
@@ -528,6 +436,8 @@ restart:
 						 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
 						 errhint("Please REINDEX it.")));
 		}
+
+		stats->internalPagesMap = blockset_set(stats->internalPagesMap, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
@@ -545,3 +455,103 @@ restart:
 		goto restart;
 	}
 }
+
+static void
+gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
+{
+	IndexVacuumInfo *info = stats->info;
+	Relation	rel = info->index;
+	BlockNumber blkno;
+
+	/* rescan all inner pages to find those that have empty child pages */
+	blkno = InvalidBlockNumber;
+	while (stats->numEmptyPages > 0 &&
+		   (blkno = blockset_next(stats->internalPagesMap, blkno)) != InvalidBlockNumber)
+	{
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber off,
+			maxoff;
+		IndexTuple  idxtuple;
+		ItemId	    iid;
+		OffsetNumber todelete[MaxOffsetNumber];
+		Buffer		buftodelete[MaxOffsetNumber];
+		int			ntodelete = 0;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+									info->strategy);
+
+		LockBuffer(buffer, GIST_EXCLUSIVE);
+		page = (Page) BufferGetPage(buffer);
+		if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+		{
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		maxoff = PageGetMaxOffsetNumber(page);
+		/* Check that leaves are still empty and decide what to delete */
+		for (off = FirstOffsetNumber; off <= maxoff && ntodelete < maxoff-1; off = OffsetNumberNext(off))
+		{
+			Buffer		leafBuffer;
+			BlockNumber leafBlockNo;
+
+			iid = PageGetItemId(page, off);
+			idxtuple = (IndexTuple) PageGetItem(page, iid);
+			/* if this page was not empty in previous scan - we do not consider it */
+			leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+			if (!blockset_get(leafBlockNo, stats->emptyLeafPagesMap))
+				continue;
+
+			leafBuffer = ReadBufferExtended(rel, MAIN_FORKNUM, leafBlockNo,
+											RBM_NORMAL, info->strategy);
+
+			buftodelete[ntodelete] = leafBuffer;
+			todelete[ntodelete++] = off;
+		}
+
+		
+		/*
+		 * We will have to relock the internal page in case of deletes:
+		 * we cannot lock a child while holding the parent's lock without
+		 * risking a deadlock
+		 */
+		LockBuffer(buffer, GIST_UNLOCK);
+
+		if (ntodelete)
+		{
+			/*
+			 * Like in _bt_unlink_halfdead_page we need an upper bound on xid
+			 * that could hold downlinks to this page. We use
+			 * ReadNewTransactionId() instead of GetCurrentTransactionId(),
+			 * since we are in a VACUUM.
+			 */
+			TransactionId	txid = ReadNewTransactionId();
+
+			int deleted = 0;
+
+			for (off = 0; off < ntodelete; off++)
+			{
+				Buffer	leafBuffer = buftodelete[off];
+				Page	leafPage;
+				LockBuffer(leafBuffer, GIST_EXCLUSIVE);
+				gistcheckpage(rel, leafBuffer);
+				leafPage = (Page) BufferGetPage(leafBuffer);
+				if (GistPageIsLeaf(leafPage) /* still a leaf */
+					&& PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber /* still empty */
+					&& !(GistFollowRight(leafPage) || GistPageGetNSN(page) > GistPageGetNSN(leafPage)) /* no follow-right */
+					)
+				{
+					LockBuffer(buffer, GIST_EXCLUSIVE);
+					page = (Page) BufferGetPage(buffer);
+					if (gistdeletepage(stats, buffer, page, todelete[off] - deleted, leafBuffer, leafPage, txid))
+						deleted++;
+					LockBuffer(buffer, GIST_UNLOCK);
+				}
+				UnlockReleaseBuffer(leafBuffer);
+			}
+		}
+
+		ReleaseBuffer(buffer);
+	}
+}
-- 
2.20.1

0005-Remove-numEmptyPages-.-it-s-not-really-needed-v22.patchapplication/octet-stream; name=0005-Remove-numEmptyPages-.-it-s-not-really-needed-v22.patch; x-unix-mode=0644Download
From ec700693399a486daf80153881e57d57bfe9ddbe Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 11 Mar 2019 15:41:07 +0200
Subject: [PATCH 5/6] Remove 'numEmptyPages'. it's not really needed v22

---
 src/backend/access/gist/gistvacuum.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 9439dfb49c..f530369f4e 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -28,7 +28,6 @@ typedef struct
 	IndexBulkDeleteResult stats;
 
 	IndexVacuumInfo *info;
-	BlockNumber numEmptyPages;
 	BlockSet	internalPagesMap;
 	BlockSet	emptyLeafPagesMap;
 } GistBulkDeleteResult;
@@ -141,7 +140,6 @@ gistdeletepage(GistBulkDeleteResult *stats,
 	GistPageSetDeleted(leafPage);
 	MarkBufferDirty(leafBuffer);
 	stats->stats.pages_deleted++;
-	stats->numEmptyPages--;
 
 	MarkBufferDirty(buffer);
 	/* Offsets are changed as long as we delete tuples from internal page */
@@ -405,7 +403,6 @@ restart:
 		if (nremain == 0)
 		{
 			stats->emptyLeafPagesMap = blockset_set(stats->emptyLeafPagesMap, blkno);
-			stats->numEmptyPages++;
 		}
 		else
 			stats->stats.num_index_tuples += nremain;
@@ -461,12 +458,18 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 {
 	IndexVacuumInfo *info = stats->info;
 	Relation	rel = info->index;
-	BlockNumber blkno;
+	BlockNumber	blkno;
+
+	/* quick exit if no empty pages */
+	/* HEIKKI: I'm assuming the blockset is always NULL if it's empty. That's true
+	 * for the current usage. But add comments, and maybe a blockset_isempty() macro
+	 * for clarity */
+	if (stats->emptyLeafPagesMap == NULL)
+		return;
 
 	/* rescan all inner pages to find those that have empty child pages */
 	blkno = InvalidBlockNumber;
-	while (stats->numEmptyPages > 0 &&
-		   (blkno = blockset_next(stats->internalPagesMap, blkno)) != InvalidBlockNumber)
+	while ((blkno = blockset_next(stats->internalPagesMap, blkno)) != InvalidBlockNumber)
 	{
 		Buffer		buffer;
 		Page		page;
@@ -491,7 +494,9 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 
 		maxoff = PageGetMaxOffsetNumber(page);
 		/* Check that leaves are still empty and decide what to delete */
-		for (off = FirstOffsetNumber; off <= maxoff && ntodelete < maxoff-1; off = OffsetNumberNext(off))
+		for (off = FirstOffsetNumber;
+			 off <= maxoff && ntodelete < maxoff-1;
+			 off = OffsetNumberNext(off))
 		{
 			Buffer		leafBuffer;
 			BlockNumber leafBlockNo;
-- 
2.20.1

0006-Misc-cleanup-v22.patchapplication/octet-stream; name=0006-Misc-cleanup-v22.patch; x-unix-mode=0644Download
From 364b5736c6145cafd524784774444c83ef01e6b7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 11 Mar 2019 16:58:10 +0200
Subject: [PATCH 6/6] Misc cleanup v22

---
 src/backend/access/gist/gistvacuum.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index f530369f4e..e2b37cfc2d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -474,7 +474,7 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 		Buffer		buffer;
 		Page		page;
 		OffsetNumber off,
-			maxoff;
+					maxoff;
 		IndexTuple  idxtuple;
 		ItemId	    iid;
 		OffsetNumber todelete[MaxOffsetNumber];
@@ -488,6 +488,8 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 		page = (Page) BufferGetPage(buffer);
 		if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
 		{
+			/* HEIKKI: This page was an internal page earlier, but now it's something else.
+			 * Shouldn't happen... */
 			UnlockReleaseBuffer(buffer);
 			continue;
 		}
@@ -498,11 +500,12 @@ gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
 			 off <= maxoff && ntodelete < maxoff-1;
 			 off = OffsetNumberNext(off))
 		{
-			Buffer		leafBuffer;
 			BlockNumber leafBlockNo;
+			Buffer		leafBuffer;
 
 			iid = PageGetItemId(page, off);
 			idxtuple = (IndexTuple) PageGetItem(page, iid);
+
 			/* if this page was not empty in previous scan - we do not consider it */
 			leafBlockNo = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
 			if (!blockset_get(leafBlockNo, stats->emptyLeafPagesMap))
-- 
2.20.1

#52Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#48)
Re: GiST VACUUM

On Tue, Mar 5, 2019 at 8:21 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 05/03/2019 02:26, Andrey Borodin wrote:

I also tried your amcheck tool with this. It did not report any
errors.

Attached is also the latest version of the patch itself. It is the
same as your latest patch v19, except for some tiny comment
kibitzing. I'll mark this as Ready for Committer in the commitfest
app, and will try to commit it in the next couple of days.

That's cool! I'll work on the 2nd step of this patchset to make
the blockset data structure prettier and less hacky.

Committed the first patch. Thanks for the patch!

Thank you. This is a transformational change; it will allow GiST indexes
larger than RAM to be used in some cases where they were simply not
feasible to use before. On an HDD, it resulted in a 50-fold improvement in
vacuum time, and the machine went from unusably unresponsive to merely
sluggish during the vacuum. On an SSD (albeit a very cheap laptop one,
exposed from a Windows host to Ubuntu via VirtualBox) it is still a 30-fold
improvement, from a far faster baseline. Even on an AWS instance with
a "GP2" SSD volume, which normally shows little benefit from sequential
reads, I get a 3-fold speed-up.

I also ran this through a lot of crash-recovery testing with simulated
torn-page writes, using my traditional testing harness with high concurrency
(AWS c4.4xlarge and a1.4xlarge, with 32 concurrent update processes), and
did not encounter any problems. I tested both with btree_gist on a scalar
int, and on tsvector with each tsvector having 101 tokens.

I did notice that the space freed up in the index by vacuum doesn't seem to
get re-used very efficiently, but that is an ancestral problem independent
of this change.

Cheers,

Jeff

#53Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#51)
2 attachment(s)
Re: GiST VACUUM

On 15/03/2019 20:25, Andrey Borodin wrote:

On 11 March 2019, at 20:03, Heikki Linnakangas <hlinnaka@iki.fi>
wrote:

On 10/03/2019 18:40, Andrey Borodin wrote:

One thing still bothers me. Let's assume that we have an internal
page with 2 deletable leaves. We lock these leaves in the order of
their items on the internal page. Is it possible that the 2nd page
has a follow-right link to the 1st, and someone locks the 2nd page,
tries to lock the 1st, and deadlocks with VACUUM?

Hmm. If the follow-right flag is set on a page, it means that its
right sibling doesn't have a downlink in the parent yet.
Nevertheless, I think I'd sleep better if we acquired the locks in
left-to-right order, just to be safe.

Actually, I did not find lock coupling in the GiST code. But I decided
to lock just two pages at a time (leaf, then parent, for every pair).
PFA v22 with this concurrency logic.

Good. I just noticed that the README actually does say explicitly that
the child must be locked before the parent.
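
For the archives, the per-pair locking pattern in v22 boils down to
roughly this (a simplified sketch of the loop in
gistvacuum_recycle_pages(); the NSN recheck and error handling are
omitted, and the variable names are the ones used in the patch):

    for (i = 0; i < ntodelete; i++)
    {
        Buffer      leafBuffer = buftodelete[i];
        Page        leafPage;

        /* lock the child first, per the rule in the README... */
        LockBuffer(leafBuffer, GIST_EXCLUSIVE);
        leafPage = (Page) BufferGetPage(leafBuffer);

        /* recheck that it is still an empty, deletable leaf */
        if (GistPageIsLeaf(leafPage) &&
            PageGetMaxOffsetNumber(leafPage) == InvalidOffsetNumber &&
            !GistFollowRight(leafPage))
        {
            /* ...and only then its parent, never the other way around */
            LockBuffer(buffer, GIST_EXCLUSIVE);
            page = (Page) BufferGetPage(buffer);
            if (gistdeletepage(stats, buffer, page, todelete[i] - deleted,
                               leafBuffer, leafPage, txid))
                deleted++;
            LockBuffer(buffer, GIST_UNLOCK);
        }
        UnlockReleaseBuffer(leafBuffer);
    }

Only one leaf/parent pair is ever locked at a time, with the child
taken first, matching the README's rule.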

I rebased this over the new IntegerSet implementation, from the other
thread, and did another round of refactoring, cleanups, etc. Attached is
a new version of this patch. I'm also including the IntegerSet patch
here, for convenience, but it's the same patch I posted at [1].
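
For reviewers who don't want to read the whole attachment: the interface
boils down to create / add / test / iterate / free. A rough usage sketch
(nblocks, page_is_empty() and process() are made-up stand-ins here, not
part of the patch):

    IntegerSet *set = intset_create();
    uint64      blkno;
    uint64      x;

    /* values must be added in ascending order */
    for (blkno = 0; blkno < nblocks; blkno++)
    {
        if (page_is_empty(blkno))
            intset_add_member(set, blkno);
    }

    if (intset_is_member(set, (uint64) 42))
        elog(DEBUG1, "block 42 was remembered as empty");

    intset_begin_iterate(set);
    while (intset_iterate_next(set, &x))
        process(x);

    intset_free(set);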

It's in pretty good shape, but there's one remaining issue that needs to
be fixed:

During Hot Standby, the B-tree code writes a WAL record when a deleted
page is recycled, to prevent the deleted page from being recycled too
early in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I
think we need to do something similar in GiST.
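
To sketch what I have in mind (the record type, function name and opcode
below are hypothetical, modeled directly on nbtree's _bt_log_reuse_page()
and xl_btree_reuse_page; this is not code from the attached patches):

    typedef struct gistxlogPageReuse
    {
        RelFileNode node;
        BlockNumber block;
        TransactionId latestRemovedXid;
    } gistxlogPageReuse;

    static void
    gistXLogPageReuse(Relation rel, BlockNumber blkno,
                      TransactionId latestRemovedXid)
    {
        gistxlogPageReuse xlrec_reuse;

        /*
         * We don't register the buffer with the record, because this
         * operation doesn't modify the page.  The record exists only to
         * provide a conflict point for Hot Standby, the same way
         * XLOG_BTREE_REUSE_PAGE does.
         */
        xlrec_reuse.node = rel->rd_node;
        xlrec_reuse.block = blkno;
        xlrec_reuse.latestRemovedXid = latestRemovedXid;

        XLogBeginInsert();
        XLogRegisterData((char *) &xlrec_reuse, sizeof(gistxlogPageReuse));

        XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_REUSE);
    }

On replay, the redo routine would call
ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid, xlrec->node)
before the page can be reused, just like btree_xlog_reuse_page() does.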

I'll try fixing that tomorrow, unless you beat me to it. Making the
changes is pretty straightforward, but it's a bit cumbersome to test.

[1]: /messages/by-id/1035d8e6-cfd1-0c27-8902-40d8d45eb6e8@iki.fi

- Heikki

Attachments:

0001-Add-IntegerSet-to-hold-large-sets-of-64-bit-ints-eff.patchtext/x-patch; name=0001-Add-IntegerSet-to-hold-large-sets-of-64-bit-ints-eff.patchDownload
From 4c05c69020334babdc1aa406c5032ae2861e5cb5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2019 02:26:08 +0200
Subject: [PATCH 1/2] Add IntegerSet, to hold large sets of 64-bit ints
 efficiently.

The set is implemented as a B-tree, with a compact representation at leaf
items, using Simple-8b algorithm, so that clusters of nearby values take
less space.

This doesn't include any use of the code yet, but we have two patches in
the works that would benefit from this:

* the GiST vacuum patch, to track empty GiST pages and internal GiST pages.

* Reducing memory usage, and also allowing more than 1 GB of memory to be
  used, to hold the dead TIDs in VACUUM.

This includes a unit test module, in src/test/modules/test_integerset.
It can be used to verify correctness, as a regression test, but if you run
it manually, it can also print memory usage and execution time of some of
the tests.

Author: Heikki Linnakangas, Andrey Borodin
Discussion: https://www.postgresql.org/message-id/b5e82599-1966-5783-733c-1a947ddb729f@iki.fi
---
 src/backend/lib/Makefile                      |    2 +-
 src/backend/lib/README                        |    2 +
 src/backend/lib/integerset.c                  | 1039 +++++++++++++++++
 src/include/lib/integerset.h                  |   25 +
 src/test/modules/Makefile                     |    1 +
 src/test/modules/test_integerset/.gitignore   |    4 +
 src/test/modules/test_integerset/Makefile     |   21 +
 src/test/modules/test_integerset/README       |    7 +
 .../expected/test_integerset.out              |   14 +
 .../test_integerset/sql/test_integerset.sql   |   11 +
 .../test_integerset/test_integerset--1.0.sql  |    8 +
 .../modules/test_integerset/test_integerset.c |  622 ++++++++++
 .../test_integerset/test_integerset.control   |    4 +
 13 files changed, 1759 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/integerset.c
 create mode 100644 src/include/lib/integerset.h
 create mode 100644 src/test/modules/test_integerset/.gitignore
 create mode 100644 src/test/modules/test_integerset/Makefile
 create mode 100644 src/test/modules/test_integerset/README
 create mode 100644 src/test/modules/test_integerset/expected/test_integerset.out
 create mode 100644 src/test/modules/test_integerset/sql/test_integerset.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset--1.0.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset.c
 create mode 100644 src/test/modules/test_integerset/test_integerset.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca26..3c1ee1df83a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-       ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+       ilist.o integerset.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index ae5debe1bc6..f2fb591237d 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -13,6 +13,8 @@ hyperloglog.c - a streaming cardinality estimator
 
 ilist.c - single and double-linked lists
 
+integerset.c - a data structure for holding a large set of integers
+
 knapsack.c - knapsack problem solver
 
 pairingheap.c - a pairing heap
diff --git a/src/backend/lib/integerset.c b/src/backend/lib/integerset.c
new file mode 100644
index 00000000000..c9172fa2005
--- /dev/null
+++ b/src/backend/lib/integerset.c
@@ -0,0 +1,1039 @@
+/*-------------------------------------------------------------------------
+ *
+ * integerset.c
+ *	  Data structure to hold a large set of 64-bit integers efficiently
+ *
+ * IntegerSet provides an in-memory data structure to hold a set of
+ * arbitrary 64-bit integers.  Internally, the values are stored in a
+ * B-tree, with a special packed representation at the leaf level using
+ * the Simple-8b algorithm, which can pack clusters of nearby values
+ * very tightly.
+ *
+ * Memory consumption depends on the number of values stored, but also
+ * on how far the values are from each other.  In the best case, with
+ * long runs of consecutive integers, memory consumption can be as low as
+ * 0.1 bytes per integer.  In the worst case, if integers are more than
+ * 2^32 apart, it uses about 8 bytes per integer.  In typical use, the
+ * consumption per integer is somewhere between those extremes, depending
+ * on the range of integers stored, and how "clustered" they are.
+ *
+ *
+ * Interface
+ * ---------
+ *
+ *	intset_create			- Create a new empty set.
+ *	intset_add_member		- Add an integer to the set.
+ *	intset_is_member		- Test if an integer is in the set
+ *	intset_begin_iterate	- Begin iterating through all integers in set
+ *	intset_iterate_next		- Return next integer
+ *
+ *
+ * Limitations
+ * -----------
+ *
+ * - Values must be added in order.  (Random insertions would require
+ *   splitting nodes, which hasn't been implemented.)
+ *
+ * - Values cannot be added while iteration is in progress.
+ *
+ * - No support for removing values.
+ *
+ * None of these limitations are fundamental to the data structure, and
+ * could be lifted if needed, by writing some new code.  But the current
+ * users of this facility don't need them.
+ *
+ *
+ * References
+ * ----------
+ *
+ * Simple-8b encoding is based on:
+ *
+ * Vo Ngoc Anh, Alistair Moffat, Index compression using 64-bit words,
+ *   Software - Practice & Experience, v.40 n.2, p.131-147, February 2010
+ *   (https://doi.org/10.1002/spe.948)
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/integerset.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "lib/integerset.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Properties of Simple-8b encoding.  (These are needed here, before
+ * other definitions, so that we can size other arrays accordingly).
+ *
+ * SIMPLE8B_MAX_VALUE is the greatest integer that can be encoded.  Simple-8b
+ * uses 64-bit words, but four of those bits indicate the "mode" of the
+ * codeword, leaving at most 60 bits for the actual integer.
+ *
+ * SIMPLE8B_MAX_VALUES_PER_CODEWORD is the maximum number of integers that
+ * can be encoded in a single codeword.
+ */
+#define SIMPLE8B_MAX_VALUE		((1L << 60) - 1)
+#define SIMPLE8B_MAX_VALUES_PER_CODEWORD 240
+
+/*
+ * Parameters for shape of the in-memory B-tree.
+ *
+ * These set the size of each internal and leaf node.  They don't necessarily
+ * need to be the same, because the tree is just an in-memory structure.
+ * With the default 64, each node is about 1 kb.
+ *
+ * If you change these, you must recalculate MAX_TREE_LEVELS, too!
+ */
+#define MAX_INTERNAL_ITEMS	64
+#define MAX_LEAF_ITEMS	64
+
+/*
+ * Maximum height of the tree.
+ *
+ * MAX_TREE_LEVELS is calculated from the "fan-out" of the B-tree.  The
+ * theoretical maximum number of items that we can store in a set is 2^64,
+ * so MAX_TREE_LEVELS should be set so that:
+ *
+ *   MAX_LEAF_ITEMS * MAX_INTERNAL_ITEMS ^ (MAX_TREE_LEVELS - 1) >= 2^64.
+ *
+ * In practice, we'll need far fewer levels, because you will run out of
+ * memory long before reaching that number, but let's be conservative.
+ */
+#define MAX_TREE_LEVELS		11
+
+/*
+ * Node structures, for the in-memory B-tree.
+ *
+ * An internal node holds a number of downlink pointers to leaf nodes, or
+ * to internal nodes on lower level.  For each downlink, the key value
+ * corresponding the lower level node is stored in a sorted array.  The
+ * stored key values are low keys.  In other words, if the downlink has value
+ * X, then all items stored on that child are >= X.
+ *
+ * Each leaf node holds a number of "items", with a varying number of
+ * integers packed into each item.  Each item consists of two 64-bit words:
+ * The first word holds the first integer stored in the item, in plain format.
+ * The second word contains between 0 and 240 more integers, packed using
+ * Simple-8b encoding.  By storing the first integer in plain, unpacked,
+ * format, we can use binary search to quickly find an item that holds (or
+ * would hold) a particular integer.  And by storing the rest in packed form,
+ * we still get pretty good memory density, if there are clusters of integers
+ * with similar values.
+ *
+ * Each leaf node also has a pointer to the next leaf node, so that the leaf
+ * nodes can be easily walked from beginning to end, when iterating.
+ */
+typedef struct intset_node intset_node;
+typedef struct intset_leaf_node intset_leaf_node;
+typedef struct intset_internal_node intset_internal_node;
+
+/* Common structure of both leaf and internal nodes. */
+struct intset_node
+{
+	uint16		level;
+	uint16		num_items;
+};
+
+/* Internal node */
+struct intset_internal_node
+{
+	/* common header, must match intset_node */
+	uint16		level;			/* >= 1 on internal nodes */
+	uint16		num_items;
+
+	/*
+	 * 'values' is an array of key values, and 'downlinks' are pointers
+	 * to lower-level nodes, corresponding to the key values.
+	 */
+	uint64		values[MAX_INTERNAL_ITEMS];
+	intset_node *downlinks[MAX_INTERNAL_ITEMS];
+};
+
+/* Leaf node */
+typedef struct
+{
+	uint64		first;		/* first integer in this item */
+	uint64		codeword;	/* simple8b encoded differences from 'first' */
+} leaf_item;
+
+#define MAX_VALUES_PER_LEAF_ITEM	(1 + SIMPLE8B_MAX_VALUES_PER_CODEWORD)
+
+struct intset_leaf_node
+{
+	/* common header, must match intset_node */
+	uint16		level;			/* 0 on leafs */
+	uint16		num_items;
+
+	intset_leaf_node *next;	/* right sibling, if any */
+
+	leaf_item	items[MAX_LEAF_ITEMS];
+};
+
+/*
+ * We buffer insertions in a simple array, before packing and inserting them
+ * into the B-tree.  MAX_BUFFERED_VALUES sets the size of the buffer.  The
+ * encoder assumes that it is large enough that we can always fill a leaf
+ * item with buffered new items.  In other words, MAX_BUFFERED_VALUES must be
+ * larger than MAX_VALUES_PER_LEAF_ITEM.
+ */
+#define MAX_BUFFERED_VALUES			(MAX_VALUES_PER_LEAF_ITEM * 2)
+
+/*
+ * IntegerSet is the top-level object representing the set.
+ *
+ * The integers are stored in an in-memory B-tree structure, and an array
+ * for newly-added integers.  IntegerSet also tracks information about memory
+ * usage, as well as the current position, when iterating the set with
+ * intset_begin_iterate / intset_iterate_next.
+ */
+struct IntegerSet
+{
+	/*
+	 * 'context' is a dedicated memory context, used to hold the IntegerSet
+	 * struct itself, as well as all the tree nodes.
+	 *
+	 * 'mem_used' tracks the amount of memory used.  We don't do anything with
+	 * it in integerset.c itself, but the callers can ask for it with
+	 * intset_memory_usage().
+	 */
+	MemoryContext context;		/* memory context holding everything */
+	uint64		mem_used;		/* amount of memory used */
+
+	uint64		num_entries;	/* total # of values in the set */
+	uint64		highest_value;	/* highest value stored in this set */
+
+	/*
+	 * B-tree to hold the packed values.
+	 *
+	 * 'rightmost_nodes' holds pointers to the rightmost node on each level:
+	 * rightmost_nodes[0] is the rightmost leaf, rightmost_nodes[1] is its
+	 * parent, and so forth, all the way up to the root. These are needed when
+	 * adding new values. (Currently, we require that new values are added at
+	 * the end.)
+	 */
+	int			num_levels;		/* height of the tree */
+	intset_node *root;			/* root node */
+	intset_node *rightmost_nodes[MAX_TREE_LEVELS];
+	intset_leaf_node *leftmost_leaf;	/* leftmost leaf node */
+
+	/*
+	 * Holding area for new items that haven't been inserted to the tree yet.
+	 */
+	uint64		buffered_values[MAX_BUFFERED_VALUES];
+	int			num_buffered_values;
+
+	/*
+	 * Iterator support.
+	 *
+	 * 'iter_values' is an array of integers ready to be returned to the
+	 * caller.  'iter_node' and 'iter_itemno' point to the leaf node, and the
+	 * item within the leaf node, to get the next batch of values from.
+	 *
+	 * Normally, 'iter_values' points to 'iter_values_buf', which holds items
+	 * decoded from a leaf item. But after we have scanned the whole B-tree,
+	 * we iterate through all the unbuffered values, too, by pointing
+	 * iter_values to 'buffered_values'.
+	 */
+	uint64	   *iter_values;
+	int			iter_num_values;	/* number of elements in 'iter_values' */
+	int			iter_valueno;		/* index into 'iter_values' */
+	intset_leaf_node *iter_node;	/* current leaf node */
+	int			iter_itemno;		/* next item 'iter_node' to decode */
+
+	uint64		iter_values_buf[MAX_VALUES_PER_LEAF_ITEM];
+};
+
+/*
+ * prototypes for internal functions.
+ */
+static void intset_update_upper(IntegerSet *intset, int level,
+				 intset_node *new_node, uint64 new_node_item);
+static void intset_flush_buffered_values(IntegerSet *intset);
+
+static int intset_binsrch_uint64(uint64 value, uint64 *arr, int arr_elems,
+				bool nextkey);
+static int intset_binsrch_leaf(uint64 value, leaf_item *arr, int arr_elems,
+				bool nextkey);
+
+static uint64 simple8b_encode(uint64 *ints, int *num_encoded, uint64 base);
+static int simple8b_decode(uint64 codeword, uint64 *decoded, uint64 base);
+static bool simple8b_contains(uint64 codeword, uint64 key, uint64 base);
+
+
+/*
+ * Create a new, initially empty, integer set.
+ */
+IntegerSet *
+intset_create(void)
+{
+	MemoryContext context;
+	IntegerSet *intset;
+
+	/*
+	 * Create a new memory context to hold everything.
+	 *
+	 * We never free any nodes, so the generational allocator works well for
+	 * us.
+	 *
+	 * Use a large block size, in the hopes that if we use a lot of memory,
+	 * the libc allocator will give it back to the OS when we free it, rather
+	 * than add it to a free-list. (On glibc, see M_MMAP_THRESHOLD.  As of this
+	 * writing, the effective threshold is somewhere between 128 kB and 4 MB.)
+	 */
+	context = GenerationContextCreate(CurrentMemoryContext,
+									  "integer set",
+									  SLAB_LARGE_BLOCK_SIZE);
+
+	/* Allocate the IntegerSet object itself, in the context. */
+	intset = (IntegerSet *) MemoryContextAlloc(context, sizeof(IntegerSet));
+	intset->context = context;
+	intset->mem_used = GetMemoryChunkSpace(intset);
+
+	intset->num_entries = 0;
+	intset->highest_value = 0;
+
+	intset->num_levels = 0;
+	intset->root = NULL;
+	memset(intset->rightmost_nodes, 0, sizeof(intset->rightmost_nodes));
+	intset->leftmost_leaf = NULL;
+
+	intset->num_buffered_values = 0;
+
+	intset->iter_node = NULL;
+	intset->iter_itemno = 0;
+	intset->iter_valueno = 0;
+	intset->iter_num_values = 0;
+
+	return intset;
+}
+
+/*
+ * Allocate a new node.
+ */
+static intset_internal_node *
+intset_new_internal_node(IntegerSet *intset)
+{
+	intset_internal_node *n;
+
+	n = (intset_internal_node *) MemoryContextAlloc(intset->context,
+													sizeof(intset_internal_node));
+	intset->mem_used += GetMemoryChunkSpace(n);
+
+	n->level = 0;		/* caller must set */
+	n->num_items = 0;
+
+	return n;
+}
+
+static intset_leaf_node *
+intset_new_leaf_node(IntegerSet *intset)
+{
+	intset_leaf_node *n;
+
+	n = (intset_leaf_node *) MemoryContextAlloc(intset->context,
+												sizeof(intset_leaf_node));
+	intset->mem_used += GetMemoryChunkSpace(n);
+
+	n->level = 0;
+	n->num_items = 0;
+	n->next = NULL;
+
+	return n;
+}
+
+/*
+ * Free the integer set
+ */
+void
+intset_free(IntegerSet *intset)
+{
+	/* everything is allocated in the memory context */
+	MemoryContextDelete(intset->context);
+}
+
+/*
+ * Return the number of entries in the integer set.
+ */
+uint64
+intset_num_entries(IntegerSet *intset)
+{
+	return intset->num_entries;
+}
+
+/*
+ * Return the amount of memory used by the integer set.
+ */
+uint64
+intset_memory_usage(IntegerSet *intset)
+{
+	return intset->mem_used;
+}
+
+/*
+ * Add a value to the set.
+ *
+ * Values must be added in order.
+ */
+void
+intset_add_member(IntegerSet *intset, uint64 x)
+{
+	if (intset->iter_node)
+		elog(ERROR, "cannot add new values to integer set when iteration is in progress");
+
+	if (x <= intset->highest_value && intset->num_entries > 0)
+		elog(ERROR, "cannot add value to integer set out of order");
+
+	if (intset->num_buffered_values >= MAX_BUFFERED_VALUES)
+	{
+		/* Time to flush our buffer */
+		intset_flush_buffered_values(intset);
+		Assert(intset->num_buffered_values < MAX_BUFFERED_VALUES);
+	}
+
+	/* Add it to the buffer of newly-added values */
+	intset->buffered_values[intset->num_buffered_values] = x;
+	intset->num_buffered_values++;
+	intset->num_entries++;
+	intset->highest_value = x;
+}
+
+/*
+ * Take a batch of buffered values, and pack them into the B-tree.
+ */
+static void
+intset_flush_buffered_values(IntegerSet *intset)
+{
+	uint64	   *values = intset->buffered_values;
+	uint64		num_values = intset->num_buffered_values;
+	int			num_packed = 0;
+	intset_leaf_node *leaf;
+
+	leaf = (intset_leaf_node *) intset->rightmost_nodes[0];
+
+	/*
+	 * If the tree is completely empty, create the first leaf page, which
+	 * is also the root.
+	 */
+	if (leaf == NULL)
+	{
+		/*
+		 * This is the very first item in the set.
+		 *
+		 * Allocate root node. It's also a leaf.
+		 */
+		leaf = intset_new_leaf_node(intset);
+
+		intset->root = (intset_node *) leaf;
+		intset->leftmost_leaf = leaf;
+		intset->rightmost_nodes[0] = (intset_node *) leaf;
+		intset->num_levels = 1;
+	}
+
+	/*
+	 * If there are fewer than MAX_VALUES_PER_LEAF_ITEM values in the
+	 * buffer, stop.  In most cases, we cannot encode that many values
+	 * in a single codeword, but this way, the encoder doesn't have to
+	 * worry about running out of input.
+	 */
+	while (num_values - num_packed >= MAX_VALUES_PER_LEAF_ITEM)
+	{
+		leaf_item	item;
+		int			num_encoded;
+
+		/*
+		 * Construct the next leaf item, packing as many buffered values
+		 * as possible.
+		 */
+		item.first = values[num_packed];
+		item.codeword = simple8b_encode(&values[num_packed + 1],
+										&num_encoded,
+										item.first);
+
+		/*
+		 * Add the item to the node, allocating a new node if the old one
+		 * is full.
+		 */
+		if (leaf->num_items >= MAX_LEAF_ITEMS)
+		{
+			/* Allocate new leaf and link it to the tree */
+			intset_leaf_node *old_leaf = leaf;
+
+			leaf = intset_new_leaf_node(intset);
+			old_leaf->next = leaf;
+			intset->rightmost_nodes[0] = (intset_node *) leaf;
+			intset_update_upper(intset, 1, (intset_node *) leaf, item.first);
+		}
+		leaf->items[leaf->num_items++] = item;
+
+		num_packed += 1 + num_encoded;
+	}
+
+	/*
+	 * Move any remaining buffered values to the beginning of the array.
+	 */
+	if (num_packed < intset->num_buffered_values)
+	{
+		memmove(&intset->buffered_values[0],
+				&intset->buffered_values[num_packed],
+				(intset->num_buffered_values - num_packed) * sizeof(uint64));
+	}
+	intset->num_buffered_values -= num_packed;
+}
+
+/*
+ * Insert a downlink into parent node, after creating a new node.
+ *
+ * Recurses if the parent node is full, too.
+ */
+static void
+intset_update_upper(IntegerSet *intset, int level, intset_node *child,
+					uint64 child_key)
+{
+	intset_internal_node *parent;
+
+	Assert(level > 0);
+
+	/*
+	 * Create a new root node, if necessary.
+	 */
+	if (level >= intset->num_levels)
+	{
+		intset_node *oldroot = intset->root;
+		uint64		downlink_key;
+
+		/* MAX_TREE_LEVELS should be more than enough, this shouldn't happen */
+		if (intset->num_levels == MAX_TREE_LEVELS)
+			elog(ERROR, "could not expand integer set, maximum number of levels reached");
+		intset->num_levels++;
+
+		/*
+		 * Get the first value on the old root page, to be used as the
+		 * downlink.
+		 */
+		if (intset->root->level == 0)
+			downlink_key = ((intset_leaf_node *) oldroot)->items[0].first;
+		else
+			downlink_key = ((intset_internal_node *) oldroot)->values[0];
+
+		parent = intset_new_internal_node(intset);
+		parent->level = level;
+		parent->values[0] = downlink_key;
+		parent->downlinks[0] = oldroot;
+		parent->num_items = 1;
+
+		intset->root = (intset_node *) parent;
+		intset->rightmost_nodes[level] = (intset_node *) parent;
+	}
+
+	/*
+	 * Place the downlink on the parent page.
+	 */
+	parent = (intset_internal_node *) intset->rightmost_nodes[level];
+
+	if (parent->num_items < MAX_INTERNAL_ITEMS)
+	{
+		parent->values[parent->num_items] = child_key;
+		parent->downlinks[parent->num_items] = child;
+		parent->num_items++;
+	}
+	else
+	{
+		/*
+		 * Doesn't fit.  Allocate a new parent, with the downlink as its
+		 * first item, and recursively insert the downlink to the new parent
+		 * into the grandparent.
+		 */
+		parent = intset_new_internal_node(intset);
+		parent->level = level;
+		parent->values[0] = child_key;
+		parent->downlinks[0] = child;
+		parent->num_items = 1;
+
+		intset->rightmost_nodes[level] = (intset_node *) parent;
+
+		intset_update_upper(intset, level + 1, (intset_node *) parent, child_key);
+	}
+}
+
+/*
+ * Does the set contain the given value?
+ */
+bool
+intset_is_member(IntegerSet *intset, uint64 x)
+{
+	intset_node   *node;
+	intset_leaf_node *leaf;
+	int			level;
+	int			itemno;
+	leaf_item  *item;
+
+	/*
+	 * The value might be in the buffer of newly-added values.
+	 */
+	if (intset->num_buffered_values > 0 && x >= intset->buffered_values[0])
+	{
+		int			itemno;
+
+		itemno = intset_binsrch_uint64(x,
+									   intset->buffered_values,
+									   intset->num_buffered_values,
+									   false);
+		if (itemno >= intset->num_buffered_values)
+			return false;
+		else
+			return intset->buffered_values[itemno] == x;
+	}
+
+	/*
+	 * Start from the root, and walk down the B-tree to find the right leaf
+	 * node.
+	 */
+	if (!intset->root)
+		return false;
+	node = intset->root;
+	for (level = intset->num_levels - 1; level > 0; level--)
+	{
+		intset_internal_node *n = (intset_internal_node *) node;
+
+		Assert(node->level == level);
+
+		itemno = intset_binsrch_uint64(x, n->values, n->num_items, true);
+		if (itemno == 0)
+			return false;
+		node = n->downlinks[itemno - 1];
+	}
+	Assert(node->level == 0);
+	leaf = (intset_leaf_node *) node;
+
+	/*
+	 * Binary search the right item on the leaf page
+	 */
+	itemno = intset_binsrch_leaf(x, leaf->items, leaf->num_items, true);
+	if (itemno == 0)
+		return false;
+	item = &leaf->items[itemno - 1];
+
+	/* Is this a match to the first value on the item? */
+	if (item->first == x)
+		return true;
+	Assert(x > item->first);
+
+	/* Is it in the packed codeword? */
+	if (simple8b_contains(item->codeword, x, item->first))
+		return true;
+
+	return false;
+}
+
+/*
+ * Begin in-order scan through all the values.
+ *
+ * While the iteration is in progress, you cannot add new values to the set.
+ */
+void
+intset_begin_iterate(IntegerSet *intset)
+{
+	intset->iter_node = intset->leftmost_leaf;
+	intset->iter_itemno = 0;
+	intset->iter_valueno = 0;
+	intset->iter_num_values = 0;
+	intset->iter_values = intset->iter_values_buf;
+}
+
+/*
+ * Returns the next integer, when iterating.
+ *
+ * intset_begin_iterate() must be called first.  intset_iterate_next() returns
+ * the next value in the set.  If there are no more values, *found is set
+ * to false.
+ */
+uint64
+intset_iterate_next(IntegerSet *intset, bool *found)
+{
+	for (;;)
+	{
+		if (intset->iter_valueno < intset->iter_num_values)
+		{
+			*found = true;
+			return intset->iter_values[intset->iter_valueno++];
+		}
+
+		/* Our queue is empty, decode next leaf item */
+		if (intset->iter_node && intset->iter_itemno < intset->iter_node->num_items)
+		{
+			/* We have reached end of this packed item.  Step to the next one. */
+			leaf_item  *item;
+			int			num_decoded;
+
+			item = &intset->iter_node->items[intset->iter_itemno++];
+
+			intset->iter_values[0] = item->first;
+			num_decoded = simple8b_decode(item->codeword, &intset->iter_values[1], item->first);
+			intset->iter_num_values = num_decoded + 1;
+
+			intset->iter_valueno = 0;
+			continue;
+		}
+
+		/* No more items on this leaf, step to next node */
+		if (intset->iter_node)
+		{
+			/* No more matches on this bucket. Step to the next node. */
+			intset->iter_node = intset->iter_node->next;
+			intset->iter_itemno = 0;
+			intset->iter_valueno = 0;
+			intset->iter_num_values = 0;
+			continue;
+		}
+
+		/*
+		 * We have reached the end of the B-tree.  But we might still have
+		 * some integers in the buffer of newly-added values.
+		 */
+		if (intset->iter_values == intset->iter_values_buf)
+		{
+			intset->iter_values = intset->buffered_values;
+			intset->iter_num_values = intset->num_buffered_values;
+			continue;
+		}
+
+		break;
+	}
+
+	/* No more results. */
+	*found = false;
+	return 0;
+}
+
+/*
+ * intset_binsrch_uint64() -- search a sorted array of uint64s
+ *
+ * Returns the first position with a key equal to or greater than the given
+ * key.  The returned position would be the "insert" location for the given
+ * key, that is, the position where the new key should be inserted.
+ *
+ * 'nextkey' affects the behavior on equal keys.  If true, and there is an
+ * equal key in the array, this returns the position immediately after the
+ * equal key.  If false, this returns the position of the equal key itself.
+ * For example, with arr = {10, 20, 30}, searching for 20 returns position 2
+ * if 'nextkey' is true, and position 1 if it is false.
+ */
+static int
+intset_binsrch_uint64(uint64 item, uint64 *arr, int arr_elems, bool nextkey)
+{
+	int			low,
+				high,
+				mid;
+
+	low = 0;
+	high = arr_elems;
+	while (high > low)
+	{
+		mid = low + (high - low) / 2;
+
+		if (nextkey)
+		{
+			if (item >= arr[mid])
+				low = mid + 1;
+			else
+				high = mid;
+		}
+		else
+		{
+			if (item > arr[mid])
+				low = mid + 1;
+			else
+				high = mid;
+		}
+	}
+
+	return low;
+}
+
+/* same, but for an array of leaf items */
+static int
+intset_binsrch_leaf(uint64 item, leaf_item *arr, int arr_elems, bool nextkey)
+{
+	int			low,
+				high,
+				mid;
+
+	low = 0;
+	high = arr_elems;
+	while (high > low)
+	{
+		mid = low + (high - low) / 2;
+
+		if (nextkey)
+		{
+			if (item >= arr[mid].first)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+		else
+		{
+			if (item > arr[mid].first)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+	}
+
+	return low;
+}
+
+/*
+ * Simple-8b encoding.
+ *
+ * Simple-8b algorithm packs between 1 and 240 integers into 64-bit words,
+ * called "codewords".  The number of integers packed into a single codeword
+ * depends on the integers being packed: small integers are encoded using
+ * fewer bits than large integers.  A single codeword can store a single
+ * 60-bit integer, or two 30-bit integers, for example.
+ *
+ * Since we're storing a unique, sorted, set of integers, we actually encode
+ * the *differences* between consecutive integers.  That way, clusters of
+ * integers that are close to each other are packed efficiently, regardless
+ * of the absolute values.
+ *
+ * In Simple-8b, each codeword consists of a 4-bit selector, which indicates
+ * how many integers are encoded in the codeword, and the encoded integers
+ * packed into the remaining 60 bits.  The selector allows for 16 different
+ * ways of using the remaining 60 bits, "modes".  The number of integers
+ * packed into a single codeword is listed in the simple8b_modes table below.
+ * For example, consider the following codeword:
+ *
+ *      20-bit integer       20-bit integer       20-bit integer
+ * 1101 00000000000000010010 01111010000100100000 00000000000000010100
+ * ^
+ * selector
+ *
+ * The selector 1101 is 13 in decimal.  From the modes table below, we see
+ * that it means that the codeword encodes three 20-bit integers.  In decimal,
+ * those integers are 18, 500000 and 20.  Because we encode deltas rather than
+ * absolute values (each delta is stored as the distance to the previous
+ * value, minus one), they represent values 19, 500020 and 500041 higher
+ * than the leaf item's 'first' value.
+ *
+ * Modes 0 and 1 are a bit special; they encode a run of 240 or 120 zeros
+ * (which means 240 or 120 consecutive integers, since we're encoding the
+ * deltas between integers), without using the rest of the codeword bits
+ * for anything.
+ *
+ * Simple-8b cannot encode integers larger than 60 bits.  Values larger than
+ * that are always stored in the 'first' field of a leaf item, never in the
+ * packed codeword.  If there is a sequence of integers that are more than
+ * 2^60 apart, the codeword will go unused on those items.  To represent that,
+ * we use a magic EMPTY_CODEWORD codeword.
+ */
+static const struct
+{
+	uint8		bits_per_int;
+	uint8		num_ints;
+} simple8b_modes[17] =
+{
+	{  0, 240 },	/* mode  0: 240 zeros */
+	{  0, 120 },	/* mode  1: 120 zeros */
+	{  1,  60 },	/* mode  2: sixty 1-bit integers */
+	{  2,  30 },	/* mode  3: thirty 2-bit integers */
+	{  3,  20 },	/* mode  4: twenty 3-bit integers */
+	{  4,  15 },	/* mode  5: fifteen 4-bit integers */
+	{  5,  12 },	/* mode  6: twelve 5-bit integers */
+	{  6,  10 },	/* mode  7: ten 6-bit integers */
+	{  7,   8 },	/* mode  8: eight 7-bit integers (four bits are wasted) */
+	{  8,   7 },	/* mode  9: seven 8-bit integers (four bits are wasted) */
+	{ 10,   6 },	/* mode 10: six 10-bit integers */
+	{ 12,   5 },	/* mode 11: five 12-bit integers */
+	{ 15,   4 },	/* mode 12: four 15-bit integers */
+	{ 20,   3 },	/* mode 13: three 20-bit integers */
+	{ 30,   2 },	/* mode 14: two 30-bit integers */
+	{ 60,   1 },	/* mode 15: one 60-bit integer */
+	{ 0,    0 }		/* sentinel value */
+};
+
+/*
+ * EMPTY_CODEWORD is a special value, used to indicate "no values".
+ * It is used if the next value is too large to be encoded with Simple-8b.
+ *
+ * This value looks like a 0-mode codeword, but we check for it
+ * specifically.  (In a real 0-mode codeword, all the unused bits are zero.)
+ */
+#define EMPTY_CODEWORD		(0xFFFFFFFFFFFFFFF0)
+
+/*
+ * Encode a number of integers into a Simple-8b codeword.
+ *
+ * Returns the codeword, and sets *num_encoded to the number of integers
+ * that were encoded into it (zero if even the first delta is too large).
+ */
+static uint64
+simple8b_encode(uint64 *ints, int *num_encoded, uint64 base)
+{
+	int			selector;
+	int			nints;
+	int			bits;
+	uint64		diff;
+	uint64		last_val;
+	uint64		codeword;
+	uint64		diffs[60];
+	int			i;
+
+	Assert(ints[0] > base);
+
+	/*
+	 * Select the "mode" to use for the next codeword.
+	 *
+	 * In each iteration, check if the next value can be represented
+	 * in the current mode we're considering.  If it's too large, then
+	 * step up the mode to a wider one, and repeat.  If it fits, move
+	 * on to next integer.  Repeat until the codeword is full, given
+	 * the current mode.
+	 *
+	 * Note that we don't have any way to represent unused slots in the
+	 * codeword, so we require each codeword to be "full".
+	 */
+	selector = 0;
+	nints = simple8b_modes[0].num_ints;
+	bits = simple8b_modes[0].bits_per_int;
+	diff = ints[0] - base - 1;
+	last_val = ints[0];
+	i = 0;
+	for (;;)
+	{
+		if (diff >= (UINT64CONST(1) << bits))
+		{
+			/* too large, step up to next mode */
+			selector++;
+			nints = simple8b_modes[selector].num_ints;
+			bits = simple8b_modes[selector].bits_per_int;
+			if (i >= nints)
+				break;
+		}
+		else
+		{
+			if (i < 60)
+				diffs[i] = diff;
+			i++;
+			if (i >= nints)
+				break;
+
+			Assert(ints[i] > last_val);
+			diff = ints[i] - last_val - 1;
+			last_val = ints[i];
+		}
+	}
+
+	if (nints == 0)
+	{
+		/* The next value is too large to be encoded with Simple-8b */
+		Assert(i == 0);
+		*num_encoded = 0;
+		return EMPTY_CODEWORD;
+	}
+
+	/*
+	 * Encode the integers using the selected mode.  Note that we shift them
+	 * into the codeword in reverse order, so that they will come out in the
+	 * correct order in the decoder.
+	 */
+	codeword = 0;
+	if (bits > 0)
+	{
+		for (i = nints - 1; i >= 0; i--)
+		{
+			codeword <<= bits;
+			codeword |= diffs[i];
+		}
+	}
+
+	/* add selector to the codeword, and return */
+	codeword <<= 4;
+	codeword |= selector;
+
+	*num_encoded = nints;
+	return codeword;
+}
+
+/*
+ * Decode a codeword into an array of integers.
+ */
+static int
+simple8b_decode(uint64 codeword, uint64 *decoded, uint64 base)
+{
+	int			selector = codeword & 0x0f;
+	int			nints = simple8b_modes[selector].num_ints;
+	uint64		bits = simple8b_modes[selector].bits_per_int;
+	uint64		mask = (UINT64CONST(1) << bits) - 1;
+	uint64		prev_value;
+
+	if (codeword == EMPTY_CODEWORD)
+		return 0;
+
+	codeword >>= 4;		/* shift out the selector */
+
+	prev_value = base;
+	for (int i = 0; i < nints; i++)
+	{
+		uint64		diff = codeword & mask;
+
+		decoded[i] = prev_value + 1L + diff;
+		prev_value = decoded[i];
+		codeword >>= bits;
+	}
+	return nints;
+}
+
+/*
+ * This is very similar to simple8b_decode(), but instead of decoding all
+ * the values to an array, it just checks if the given integer is part of
+ * the codeword.
+ */
+static bool
+simple8b_contains(uint64 codeword, uint64 key, uint64 base)
+{
+	int			selector = codeword & 0x0f;
+	int			nints = simple8b_modes[selector].num_ints;
+	int			bits = simple8b_modes[selector].bits_per_int;
+
+	if (codeword == EMPTY_CODEWORD)
+		return false;
+
+	codeword >>= 4;		/* shift out the selector */
+
+	if (bits == 0)
+	{
+		/* Special handling for 0-bit cases. */
+		return key - base <= nints;
+	}
+	else
+	{
+		uint64		mask = (UINT64CONST(1) << bits) - 1;
+		uint64		prev_value;
+
+		prev_value = base;
+		for (int i = 0; i < nints; i++)
+		{
+			uint64		diff = codeword & mask;
+			uint64		curr_value;
+
+			curr_value = prev_value + 1L + diff;
+
+			if (curr_value >= key)
+			{
+				if (curr_value == key)
+					return true;
+				else
+					return false;
+			}
+
+			codeword >>= bits;
+			prev_value = curr_value;
+		}
+	}
+	return false;
+}
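
To make the Simple-8b worked example in the comments above concrete, here is
a tiny standalone program (an illustration only, not part of the patch) that
packs three deltas into a mode-13 codeword and unpacks it again the same way
simple8b_encode() and simple8b_decode() do, including the gap-minus-one delta
convention:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t	deltas[3] = {18, 500000, 20};	/* stored 20-bit deltas */
	uint64_t	codeword = 0;
	uint64_t	prev = 100;		/* plays the role of the leaf item's 'first' */
	int			i;

	/* pack in reverse order, then append the selector (13: three 20-bit ints) */
	for (i = 2; i >= 0; i--)
		codeword = (codeword << 20) | deltas[i];
	codeword = (codeword << 4) | 13;

	/* decode: each value is previous + 1 + delta; prints 119, 500120, 500141 */
	codeword >>= 4;
	for (i = 0; i < 3; i++)
	{
		prev = prev + 1 + (codeword & ((UINT64_C(1) << 20) - 1));
		printf("%llu\n", (unsigned long long) prev);
		codeword >>= 20;
	}
	return 0;
}
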
diff --git a/src/include/lib/integerset.h b/src/include/lib/integerset.h
new file mode 100644
index 00000000000..27aa3ee883c
--- /dev/null
+++ b/src/include/lib/integerset.h
@@ -0,0 +1,25 @@
+/*
+ * integerset.h
+ *	  In-memory data structure to hold a large set of integers efficiently
+ *
+ * Portions Copyright (c) 2012-2019, PostgreSQL Global Development Group
+ *
+ * src/include/lib/integerset.h
+ */
+#ifndef INTEGERSET_H
+#define INTEGERSET_H
+
+typedef struct IntegerSet IntegerSet;
+
+extern IntegerSet *intset_create(void);
+extern void intset_free(IntegerSet *intset);
+extern void intset_add_member(IntegerSet *intset, uint64 x);
+extern bool intset_is_member(IntegerSet *intset, uint64 x);
+
+extern uint64 intset_num_entries(IntegerSet *intset);
+extern uint64 intset_memory_usage(IntegerSet *intset);
+
+extern void intset_begin_iterate(IntegerSet *intset);
+extern uint64 intset_iterate_next(IntegerSet *intset, bool *found);
+
+#endif							/* INTEGERSET_H */
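
For reference, here is a minimal sketch of how a caller might use this API (a
hypothetical example, not taken from the patch; it assumes "postgres.h" and
"lib/integerset.h" are included). Note that values must be added in ascending
order, and that additions cannot be mixed with iteration:

static void
intset_usage_sketch(void)
{
	IntegerSet *set = intset_create();
	uint64		x;
	bool		found;

	for (x = 1; x < 1000; x += 2)	/* odd numbers, added in ascending order */
		intset_add_member(set, x);

	Assert(intset_is_member(set, 501));
	Assert(!intset_is_member(set, 500));

	/* iteration returns the members in ascending order: 1, 3, 5, ..., 999 */
	intset_begin_iterate(set);
	while ((x = intset_iterate_next(set, &found)), found)
		(void) x;

	intset_free(set);
}
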
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 19d60a506e1..dfd0956aee3 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  test_bloomfilter \
 		  test_ddl_deparse \
 		  test_extensions \
+		  test_integerset \
 		  test_parser \
 		  test_pg_dump \
 		  test_predtest \
diff --git a/src/test/modules/test_integerset/.gitignore b/src/test/modules/test_integerset/.gitignore
new file mode 100644
index 00000000000..5dcb3ff9723
--- /dev/null
+++ b/src/test/modules/test_integerset/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_integerset/Makefile b/src/test/modules/test_integerset/Makefile
new file mode 100644
index 00000000000..3b7c4999d6f
--- /dev/null
+++ b/src/test/modules/test_integerset/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_integerset/Makefile
+
+MODULE_big = test_integerset
+OBJS = test_integerset.o $(WIN32RES)
+PGFILEDESC = "test_integerset - test code for src/backend/lib/integerset.c"
+
+EXTENSION = test_integerset
+DATA = test_integerset--1.0.sql
+
+REGRESS = test_integerset
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_integerset
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_integerset/README b/src/test/modules/test_integerset/README
new file mode 100644
index 00000000000..3e4226adb55
--- /dev/null
+++ b/src/test/modules/test_integerset/README
@@ -0,0 +1,7 @@
+test_integerset contains unit tests for the integer set implementation in
+src/backend/lib/integerset.c.
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark: if you set the 'intset_test_stats' flag in
+test_integerset.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_integerset/expected/test_integerset.out b/src/test/modules/test_integerset/expected/test_integerset.out
new file mode 100644
index 00000000000..d7c88ded092
--- /dev/null
+++ b/src/test/modules/test_integerset/expected/test_integerset.out
@@ -0,0 +1,14 @@
+CREATE EXTENSION test_integerset;
+--
+-- These tests don't produce any interesting output.  We're checking that
+-- the operations complete without crashing or hanging and that none of their
+-- internal sanity tests fail.  They print progress information as NOTICEs,
+-- which are not interesting for automated tests, so suppress those.
+--
+SET client_min_messages = 'warning';
+SELECT test_integerset();
+ test_integerset 
+-----------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_integerset/sql/test_integerset.sql b/src/test/modules/test_integerset/sql/test_integerset.sql
new file mode 100644
index 00000000000..34223afa885
--- /dev/null
+++ b/src/test/modules/test_integerset/sql/test_integerset.sql
@@ -0,0 +1,11 @@
+CREATE EXTENSION test_integerset;
+
+--
+-- These tests don't produce any interesting output.  We're checking that
+-- the operations complete without crashing or hanging and that none of their
+-- internal sanity tests fail.  They print progress information as NOTICEs,
+-- which are not interesting for automated tests, so suppress those.
+--
+SET client_min_messages = 'warning';
+
+SELECT test_integerset();
diff --git a/src/test/modules/test_integerset/test_integerset--1.0.sql b/src/test/modules/test_integerset/test_integerset--1.0.sql
new file mode 100644
index 00000000000..d6d5a3f6cf7
--- /dev/null
+++ b/src/test/modules/test_integerset/test_integerset--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_integerset/test_integerset--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_integerset" to load this file. \quit
+
+CREATE FUNCTION test_integerset()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_integerset/test_integerset.c b/src/test/modules/test_integerset/test_integerset.c
new file mode 100644
index 00000000000..24a2e08c0d1
--- /dev/null
+++ b/src/test/modules/test_integerset/test_integerset.c
@@ -0,0 +1,622 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_integerset.c
+ *		Test integer set data structure.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_integerset/test_integerset.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/integerset.h"
+#include "nodes/bitmapset.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "miscadmin.h"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed.  That can be used as
+ * micro-benchmark of various operations and input patterns.
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool intset_test_stats = true;
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_integerset);
+
+/*
+ * A struct to define a pattern of integers, for use with the test_pattern()
+ * function.
+ */
+typedef struct
+{
+	char	   *test_name;		/* short name of the test, for humans */
+	char	   *pattern_str;	/* a bit pattern */
+	uint64		spacing;		/* pattern repeats at this interval */
+	uint64		num_values;		/* number of integers to set in total */
+} test_spec;
+
+static const test_spec test_specs[] = {
+	{
+		"all ones", "1111111111",
+		10, 100000000
+	},
+	{
+		"alternating bits", "0101010101",
+		10, 100000000
+	},
+	{
+		"clusters of ten", "1111111111",
+		10000, 10000000
+	},
+	{
+		"clusters of hundred",
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+		10000, 100000000
+	},
+	{
+		"clusters of thousand",
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+		10000, 100000000
+	},
+	{
+		"one-every-64k", "1",
+		65536, 10000000
+	},
+	{
+		"sparse", "100000000000000000000000000000001",
+		10000000, 10000000
+	},
+	{
+		"single values, distance > 2^32", "1",
+		10000000000L, 1000000
+	},
+	{
+		"clusters, distance > 2^32", "10101010",
+		10000000000L, 10000000
+	},
+	{
+		"clusters, distance > 2^60", "10101010",
+		2000000000000000000L, 23 /* can't be much higher than this, or we overflow uint64 */
+	}
+};
+
+static void test_pattern(const test_spec *spec);
+static void test_empty(void);
+static void test_single_value(uint64 value);
+static void check_with_filler(IntegerSet *intset, uint64 x, uint64 value, uint64 filler_min, uint64 filler_max);
+static void test_single_value_and_filler(uint64 value, uint64 filler_min, uint64 filler_max);
+static void test_huge_distances(void);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ */
+Datum
+test_integerset(PG_FUNCTION_ARGS)
+{
+	test_huge_distances();
+
+	test_empty();
+
+	test_single_value(0);
+	test_single_value(1);
+	test_single_value(PG_UINT64_MAX - 1);
+	test_single_value(PG_UINT64_MAX);
+
+	/* Same tests, but with some "filler" values, so that the B-tree gets created */
+	test_single_value_and_filler(0, 1000, 2000);
+	test_single_value_and_filler(1, 1000, 2000);
+	test_single_value_and_filler(1, 1000, 2000000);
+	test_single_value_and_filler(PG_UINT64_MAX - 1, 1000, 2000);
+	test_single_value_and_filler(PG_UINT64_MAX, 1000, 2000);
+
+	test_huge_distances();
+
+	for (int i = 0; i < lengthof(test_specs); i++)
+		test_pattern(&test_specs[i]);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+	IntegerSet *intset;
+	MemoryContext test_cxt;
+	MemoryContext old_cxt;
+	TimestampTz starttime;
+	TimestampTz endtime;
+	uint64		n;
+	uint64		last_int;
+	int			patternlen;
+	uint64	   *pattern_values;
+	uint64		pattern_num_values;
+
+	elog(NOTICE, "testing intset with pattern \"%s\"", spec->test_name);
+	if (intset_test_stats)
+		fprintf(stderr, "-----\ntesting intset with pattern \"%s\"\n", spec->test_name);
+
+	/* Pre-process the pattern, creating an array of integers from it. */
+	patternlen = strlen(spec->pattern_str);
+	pattern_values = palloc(patternlen * sizeof(uint64));
+	pattern_num_values = 0;
+	for (int i = 0; i < patternlen; i++)
+	{
+		if (spec->pattern_str[i] == '1')
+			pattern_values[pattern_num_values++] = i;
+	}
+
+	/*
+	 * Allocate the integer set.
+	 *
+	 * Allocate it in a separate memory context, so that we can print its
+	 * memory usage easily.  (intset_create() creates a memory context of its
+	 * own, too, but we don't have direct access to it, so we cannot call
+	 * MemoryContextStats() on it directly).
+	 */
+	test_cxt = AllocSetContextCreate(CurrentMemoryContext,
+									 "intset test",
+									 ALLOCSET_SMALL_SIZES);
+	MemoryContextSetIdentifier(test_cxt, spec->test_name);
+	old_cxt = MemoryContextSwitchTo(test_cxt);
+	intset = intset_create();
+	MemoryContextSwitchTo(old_cxt);
+
+	/*
+	 * Add values to the set.
+	 */
+	starttime = GetCurrentTimestamp();
+
+	n = 0;
+	last_int = 0;
+	while (n < spec->num_values)
+	{
+		uint64		x = 0;
+
+		for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+		{
+			x = last_int + pattern_values[i];
+
+			intset_add_member(intset, x);
+			n++;
+		}
+		last_int += spec->spacing;
+	}
+
+	endtime = GetCurrentTimestamp();
+
+	if (intset_test_stats)
+		fprintf(stderr, "added %lu values in %lu ms\n",
+				spec->num_values, (endtime - starttime) / 1000);
+
+	/*
+	 * Print stats on the amount of memory used.
+	 *
+	 * We print the usage reported by intset_memory_usage(), as well as the
+	 * stats from the memory context. They should be in the same ballpark,
+	 * but it's hard to automate testing that, so if you're making changes
+	 * to the implementation, just observe that manually.
+	 */
+	if (intset_test_stats)
+	{
+		uint64		mem_usage;
+
+		/*
+		 * Also print memory usage as reported by intset_memory_usage().
+		 * It should be in the same ballpark as the usage reported by
+		 * MemoryContextStats().
+		 */
+		mem_usage = intset_memory_usage(intset);
+		fprintf(stderr, "intset_memory_usage() reported %lu (%0.2f bytes / integer)\n",
+				mem_usage, (double) mem_usage / spec->num_values);
+
+		MemoryContextStats(test_cxt);
+	}
+
+	/* Check that intset_num_entries() works */
+	n = intset_num_entries(intset);
+	if (n != spec->num_values)
+		elog(ERROR, "intset_num_entries returned %lu, expected %lu", n, spec->num_values);
+
+	/*
+	 * Test random-access probes with intset_is_member()
+	 */
+	starttime = GetCurrentTimestamp();
+
+	for (n = 0; n < 1000000; n++)
+	{
+		bool		b;
+		bool		expected;
+		uint64		x;
+
+		/*
+		 * Pick next value to probe at random.  We limit the probes to the
+		 * last integer that we added to the set, plus an arbitrary constant
+		 * (1000).  There's no point in probing the whole 0 - 2^64 range, if
+		 * only a small part of the integer space is used.  We would very
+		 * rarely hit values that are actually in the set.
+		 */
+		x = (pg_lrand48() << 31) | pg_lrand48();
+		x = x % (last_int + 1000);
+
+		/* Do we expect this value to be present in the set? */
+		if (x >= last_int)
+			expected = false;
+		else
+		{
+			uint64		idx = x % spec->spacing;
+
+			if (idx >= patternlen)
+				expected = false;
+			else if (spec->pattern_str[idx] == '1')
+				expected = true;
+			else
+				expected = false;
+		}
+
+		/* Is it present according to intset_is_member() ? */
+		b = intset_is_member(intset, x);
+
+		if (b != expected)
+			elog(ERROR, "mismatch at %lu: %d vs %d", x, b, expected);
+	}
+	endtime = GetCurrentTimestamp();
+	if (intset_test_stats)
+		fprintf(stderr, "probed %lu values in %lu ms\n", n, (endtime - starttime) / 1000);
+
+	/*
+	 * Test iterator
+	 */
+	starttime = GetCurrentTimestamp();
+
+	intset_begin_iterate(intset);
+	n = 0;
+	last_int = 0;
+	while (n < spec->num_values)
+	{
+		for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+		{
+			uint64		expected = last_int + pattern_values[i];
+			uint64		x;
+			bool		found;
+
+			x = intset_iterate_next(intset, &found);
+			if (!found)
+				break;
+
+			if (x != expected)
+				elog(ERROR, "iterate returned wrong value; got %lu, expected %lu", x, expected);
+			n++;
+		}
+		last_int += spec->spacing;
+	}
+	endtime = GetCurrentTimestamp();
+	if (intset_test_stats)
+		fprintf(stderr, "iterated %lu values in %lu ms\n", n, (endtime - starttime) / 1000);
+
+	if (n < spec->num_values)
+		elog(ERROR, "iterator stopped short after %lu entries, expected %lu", n, spec->num_values);
+	if (n > spec->num_values)
+		elog(ERROR, "iterator returned %lu entries, %lu was expected", n, spec->num_values);
+
+	intset_free(intset);
+}
+
+/*
+ * Test with a set containing a single integer.
+ */
+static void
+test_single_value(uint64 value)
+{
+	IntegerSet *intset;
+	uint64		x;
+	uint64		num_entries;
+	bool		found;
+
+	elog(NOTICE, "testing intset with single value %lu", value);
+
+	/* Create the set. */
+	intset = intset_create();
+	intset_add_member(intset, value);
+
+	/* Test intset_num_entries() */
+	num_entries = intset_num_entries(intset);
+	if (num_entries != 1)
+		elog(ERROR, "intset_num_entries returned %lu, expected %lu", num_entries, 1L);
+
+	/*
+	 * Test intset_is_member() at various special values, like 0 and the
+	 * maximum possible 64-bit integer, as well as the value itself.
+	 */
+	if (intset_is_member(intset, 0) != (value == 0))
+		elog(ERROR, "intset_is_member failed for 0");
+	if (intset_is_member(intset, 1) != (value == 1))
+		elog(ERROR, "intset_is_member failed for 1");
+	if (intset_is_member(intset, PG_UINT64_MAX) != (value == PG_UINT64_MAX))
+		elog(ERROR, "intset_is_member failed for PG_UINT64_MAX");
+	if (intset_is_member(intset, value) != true)
+		elog(ERROR, "intset_is_member failed for the tested value");
+
+	/*
+	 * Test iterator
+	 */
+	intset_begin_iterate(intset);
+	x = intset_iterate_next(intset, &found);
+	if (!found || x != value)
+		elog(ERROR, "intset_iterate_next failed for %lu", x);
+
+	x = intset_iterate_next(intset, &found);
+	if (found)
+		elog(ERROR, "intset_iterate_next failed %lu", x);
+
+	intset_free(intset);
+}
+
+/*
+ * Test with an integer set that contains:
+ *
+ * - a given single 'value', and
+ * - all integers between 'filler_min' and 'filler_max'.
+ *
+ * This exercises different codepaths than testing just with a single value,
+ * because the implementation buffers newly-added values.  If we add just
+ * single value to the set, we won't test the internal B-tree code at all,
+ * just the code that deals with the buffer.
+ */
+static void
+test_single_value_and_filler(uint64 value, uint64 filler_min, uint64 filler_max)
+{
+	IntegerSet *intset;
+	uint64		x;
+	bool		found;
+	uint64	   *iter_expected;
+	uint64		n = 0;
+	uint64		num_entries = 0;
+	uint64		mem_usage;
+
+	elog(NOTICE, "testing intset with value %lu, and all between %lu and %lu",
+		 value, filler_min, filler_max);
+
+	intset = intset_create();
+
+	iter_expected = palloc(sizeof(uint64) * (filler_max - filler_min + 1));
+	if (value < filler_min)
+	{
+		intset_add_member(intset, value);
+		iter_expected[n++] = value;
+	}
+
+	for (x = filler_min; x < filler_max; x++)
+	{
+		intset_add_member(intset, x);
+		iter_expected[n++] = x;
+	}
+
+	if (value >= filler_max)
+	{
+		intset_add_member(intset, value);
+		iter_expected[n++] = value;
+	}
+
+	/* Test intset_num_entries() */
+	num_entries = intset_num_entries(intset);
+	if (num_entries != n)
+		elog(ERROR, "intset_num_entries returned %lu, expected %lu", num_entries, n);
+
+	/*
+	 * Test intset_is_member() at various spots, at and around the values that we
+	 * expect to be set, as well as 0 and the maximum possible value.
+	 */
+	check_with_filler(intset, 0,                 value, filler_min, filler_max);
+	check_with_filler(intset, 1,                 value, filler_min, filler_max);
+	check_with_filler(intset, filler_min - 1,    value, filler_min, filler_max);
+	check_with_filler(intset, filler_min,        value, filler_min, filler_max);
+	check_with_filler(intset, filler_min + 1,    value, filler_min, filler_max);
+	check_with_filler(intset, value - 1,         value, filler_min, filler_max);
+	check_with_filler(intset, value,             value, filler_min, filler_max);
+	check_with_filler(intset, value + 1,         value, filler_min, filler_max);
+	check_with_filler(intset, filler_max - 1,    value, filler_min, filler_max);
+	check_with_filler(intset, filler_max,        value, filler_min, filler_max);
+	check_with_filler(intset, filler_max + 1,    value, filler_min, filler_max);
+	check_with_filler(intset, PG_UINT64_MAX - 1, value, filler_min, filler_max);
+	check_with_filler(intset, PG_UINT64_MAX,     value, filler_min, filler_max);
+
+	intset_begin_iterate(intset);
+	for (uint64 i = 0; i < n; i++)
+	{
+		x = intset_iterate_next(intset, &found);
+		if (!found || x != iter_expected[i])
+			elog(ERROR, "intset_iterate_next failed for %lu", x);
+	}
+	x = intset_iterate_next(intset, &found);
+	if (found)
+		elog(ERROR, "intset_iterate_next failed %lu", x);
+
+	mem_usage = intset_memory_usage(intset);
+	if (mem_usage < 5000 || mem_usage > 500000000)
+		elog(ERROR, "intset_memory_usage() reported suspicous value: %lu", mem_usage);
+
+	intset_free(intset);
+}
+
+/*
+ * Helper function for test_single_value_and_filler.
+ *
+ * Calls intset_is_member() for value 'x', and checks that the result is what
+ * we expect.
+ */
+static void
+check_with_filler(IntegerSet *intset, uint64 x,
+				  uint64 value, uint64 filler_min, uint64 filler_max)
+{
+	bool		expected;
+	bool		actual;
+
+	expected = (x == value || (filler_min <= x && x < filler_max));
+
+	actual = intset_is_member(intset, x);
+
+	if (actual != expected)
+		elog(ERROR, "intset_is_member failed for %lu", x);
+}
+
+/*
+ * Test empty set
+ */
+static void
+test_empty(void)
+{
+	IntegerSet *intset;
+	bool		found = true;
+	uint64		x;
+
+	elog(NOTICE, "testing intset with empty set");
+
+	intset = intset_create();
+
+	/* Test intset_is_member() */
+	if (intset_is_member(intset, 0) != false)
+		elog(ERROR, "intset_is_member on empty set returned true");
+	if (intset_is_member(intset, 1) != false)
+		elog(ERROR, "intset_is_member on empty set returned true");
+	if (intset_is_member(intset, PG_UINT64_MAX) != false)
+		elog(ERROR, "intset_is_member on empty set returned true");
+
+	/* Test iterator */
+	intset_begin_iterate(intset);
+	x = intset_iterate_next(intset, &found);
+	if (found)
+		elog(ERROR, "intset_iterate_next on empty set returned a value (%lu)", x);
+
+	intset_free(intset);
+}
+
+/*
+ * Test with integers that are more than 2^60 apart.
+ *
+ * The Simple-8b encoding used by the set implementation can only encode
+ * values up to 2^60.  That makes large differences like this interesting
+ * to test.
+ */
+static void
+test_huge_distances(void)
+{
+	IntegerSet *intset;
+	uint64		values[1000];
+	int			num_values = 0;
+	uint64		val = 0;
+	bool		found;
+	uint64		x;
+
+	elog(NOTICE, "testing intset with distances > 2^60 between values");
+
+	val = 0;
+	values[num_values++] = val;
+
+	val += 1152921504606846976L - 1;	/* 2^60 - 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L - 1;	/* 2^60 - 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L;		/* 2^60 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L;		/* 2^60 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L;		/* 2^60 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L + 1;	/* 2^60 + 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L + 1;	/* 2^60 + 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L + 1;	/* 2^60 + 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L;		/* 2^60 */
+	values[num_values++] = val;
+
+	/* we're now very close to 2^64, so can't add large values anymore */
+
+	intset = intset_create();
+
+	/*
+	 * Add many more values to the end, to make sure that all the above
+	 * values get flushed and packed into the tree structure.
+	 */
+	while (num_values < 1000)
+	{
+		val += pg_lrand48();
+		values[num_values++] = val;
+	}
+
+	/* Add these numbers to the set */
+	for (int i = 0; i < num_values; i++)
+		intset_add_member(intset, values[i]);
+
+	/*
+	 * Test intset_is_member() around each of these values
+	 */
+	for (int i = 0; i < num_values; i++)
+	{
+		uint64		x = values[i];
+		bool		result;
+
+		if (x > 0)
+		{
+			result = intset_is_member(intset, x - 1);
+			if (result != false)
+				elog(ERROR, "intset_is_member failed for %lu", x - 1);
+		}
+
+		result = intset_is_member(intset, x);
+		if (result != true)
+			elog(ERROR, "intset_is_member failed for %lu", x);
+
+		result = intset_is_member(intset, x + 1);
+		if (result != false)
+			elog(ERROR, "intset_is_member failed for %lu", x + 1);
+	}
+
+	/*
+	 * Test iterator
+	 */
+	intset_begin_iterate(intset);
+	for (int i = 0; i < num_values; i++)
+	{
+		x = intset_iterate_next(intset, &found);
+		if (!found || x != values[i])
+			elog(ERROR, "intset_iterate_next failed for %lu", x);
+	}
+	x = intset_iterate_next(intset, &found);
+	if (found)
+		elog(ERROR, "intset_iterate_next failed %lu", x);
+}
diff --git a/src/test/modules/test_integerset/test_integerset.control b/src/test/modules/test_integerset/test_integerset.control
new file mode 100644
index 00000000000..7d20c2d7b88
--- /dev/null
+++ b/src/test/modules/test_integerset/test_integerset.control
@@ -0,0 +1,4 @@
+comment = 'Test code for integerset'
+default_version = '1.0'
+module_pathname = '$libdir/test_integerset'
+relocatable = true
-- 
2.20.1

0002-Delete-empty-pages-during-GiST-VACUUM.patchtext/x-patch; name=0002-Delete-empty-pages-during-GiST-VACUUM.patchDownload
From 1a3690be16be340f288c069c452e8428f1cc48ad Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2019 20:24:44 +0200
Subject: [PATCH 2/2] Delete empty pages during GiST VACUUM

This commit teaches GiST to actually delete pages during VACUUM.

To do this we scan the GiST index twice. In the first pass we make note
of empty pages and internal pages. In the second pass we scan through the
internal pages, looking for references to empty leaf pages.

Heikki's CHANGES since v22:

* Only scan the empty pages after the last scan, in a multi-pass vacuum.
  (I think that's what we want...) We could actually be smarter, and
  do this as part of the second pass's scan, in a multi-pass vacuum.

* Call ReadNewTransactionId() while holding lock. I think that's needed
  for correctness.

* Use new IntegerSet implementation.

Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/B1E4DF12-6CD3-4706-BDBD-BF3283328F60@yandex-team.ru
---
 src/backend/access/gist/README         |  48 ++++
 src/backend/access/gist/gist.c         |  15 ++
 src/backend/access/gist/gistutil.c     |  15 +-
 src/backend/access/gist/gistvacuum.c   | 350 +++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c     |  65 +++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/include/access/gist.h              |   4 +
 src/include/access/gist_private.h      |   7 +-
 src/include/access/gistxlog.h          |  12 +-
 src/test/regress/expected/gist.out     |   6 +-
 src/test/regress/sql/gist.sql          |   6 +-
 11 files changed, 489 insertions(+), 42 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b81..501b1c1a77a 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,54 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+VACUUM works in two stages:
+
+In the first stage, we scan the whole index in physical order. To make sure
+that we don't miss any dead tuples because a concurrent page split moved them,
+we check the F_FOLLOW_RIGHT flags and NSN on each page, to detect if the
+page has been concurrently split. If a concurrent page split is detected, and
+one half of the page was moved to a position that we already scanned, we
+"jump" back to scan that page again. This is the same mechanism that B-tree
+VACUUM uses, but because GiST pages already carry NSNs to detect page splits
+during searches, we don't need B-tree's "vacuum cycle ID" concept for that.
+
+While we scan all the pages, we also make note of any completely empty leaf
+pages. We will try to unlink them from the tree in the second stage. We also
+record the block numbers of all internal pages, in an IntegerSet. They are
+needed in the second stage, to locate parents of empty pages.
+
+In the second stage, we try to unlink any empty leaf pages from the tree, so
+that their space can be reused. If we didn't see any empty pages in the first
+stage, the second stage is skipped. In order to delete an empty page, its
+downlink must be removed from the parent. We scan all the internal pages
+whose block numbers we memorized in the first stage, and look for downlinks
+to pages that we memorized as being empty. Whenever we find one, we acquire
+locks on both the parent and the child page and re-check that the child page
+is still empty. If it is, we remove the downlink, mark the child as deleted,
+and release the locks.
+
+The insertion algorithm would get confused if an internal page were
+completely empty, so we never delete the last child of an internal page,
+even if it's empty. Currently, we only support deleting leaf pages.
+
+This page deletion algorithm works on a best-effort basis. It might fail to
+find a downlink, if a concurrent page split moved it after the first stage.
+In that case, we won't be able to remove all empty pages. That's OK, it's
+not expected to happen very often, and hopefully the next VACUUM will clean
+it up, instead.
+
+When we have deleted a page, it's possible that an in-progress search will
+still descend to the page, if it saw the downlink before we removed it. The
+search will see that it is deleted, and ignore it, but as long as that can
+happen, we cannot reuse the page. To "wait out" any in-progress searches, the
+page is labeled at deletion with the current next-transaction counter value,
+and it is not recycled until that XID is no longer visible to
+anyone. That's much more conservative than necessary, but let's keep it
+simple.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
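
Before diving into the code below, here is a condensed sketch of the second
stage described in the README text above, leaning on the IntegerSet API added
by the previous patch. Here get_downlinks(), lock_and_recheck_empty() and
remove_downlink_and_mark_deleted() are hypothetical placeholders for the
buffer access, locking and WAL-logging that the real
gistvacuum_recycle_pages() and gistdeletepage() below take care of:

/* assumes "postgres.h", "lib/integerset.h", "storage/block.h", "storage/off.h" */

/* hypothetical helpers standing in for the real page-level logic */
extern int	get_downlinks(BlockNumber parent, BlockNumber *children);
extern bool lock_and_recheck_empty(BlockNumber parent, BlockNumber child);
extern void remove_downlink_and_mark_deleted(BlockNumber parent, BlockNumber child);

static void
second_stage_sketch(IntegerSet *internal_pages, IntegerSet *empty_leaf_pages)
{
	bool		found;

	intset_begin_iterate(internal_pages);
	for (;;)
	{
		BlockNumber parent;
		BlockNumber children[MaxOffsetNumber];
		int			nchildren,
					i;

		parent = (BlockNumber) intset_iterate_next(internal_pages, &found);
		if (!found)
			break;

		nchildren = get_downlinks(parent, children);
		for (i = 0; i < nchildren; i++)
		{
			/* was this child memorized as empty in the first stage? */
			if (intset_is_member(empty_leaf_pages, children[i]) &&
				lock_and_recheck_empty(parent, children[i]))
				remove_downlink_and_mark_deleted(parent, children[i]);
		}
	}
}

The point of using IntegerSets here is that both memorized sets stay compact
enough to remember every internal page of a large index, while
intset_is_member() keeps the cross-check cheap.
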
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2ce5425ef98..a746e911f37 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -704,6 +704,9 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/* currently, internal pages are never deleted */
+			Assert(!GistPageIsDeleted(stack->page));
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -838,6 +841,18 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 				}
 			}
 
+			/*
+			 * The page might have been deleted after we scanned the parent
+			 * and saw the downlink.
+			 */
+			if (GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index f32e16eed58..4e511dfb8c2 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/float.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 #include "utils/lsyscache.h"
 
 
@@ -829,13 +830,21 @@ gistNewBuffer(Relation r)
 		{
 			Page		page = BufferGetPage(buffer);
 
+			/*
+			 * If the page was never initialized, it's OK to use.
+			 */
 			if (PageIsNew(page))
-				return buffer;	/* OK to use, if never initialized */
+				return buffer;
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
-				return buffer;	/* OK to use */
+			/*
+			 * Otherwise, recycle it if it's deleted, and old enough that no
+			 * in-progress scan can still be interested in it.
+			 */
+			if (GistPageIsDeleted(page) &&
+				TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin))
+				return buffer;
 
 			LockBuffer(buffer, GIST_UNLOCK);
 		}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 3c1d75691e8..531b4b73a45 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,26 +16,49 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
+#include "lib/integerset.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
-/* Working state needed by gistbulkdelete */
+/*
+ * State kept across vacuum stages.
+ */
 typedef struct
 {
+	IndexBulkDeleteResult stats;	/* must be first */
+
 	IndexVacuumInfo *info;
-	IndexBulkDeleteResult *stats;
+
+	/*
+	 * These are used to memorize all internal and empty leaf pages in the 1st
+	 * vacuum phase.  They are used in the 2nd phase, to delete all the empty
+	 * pages.
+	 */
+	IntegerSet *internalPagesSet;
+	IntegerSet *emptyLeafPagesSet;
+} GistBulkDeleteResult;
+
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	GistBulkDeleteResult *stats;
 	IndexBulkDeleteCallback callback;
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
 } GistVacState;
 
-static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void gistvacuumscan(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state);
 static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
 			   BlockNumber orig_blkno);
+static void gistvacuum_recycle_pages(GistBulkDeleteResult *stats);
+static bool gistdeletepage(GistBulkDeleteResult *stats,
+			   Buffer buffer, OffsetNumber downlink,
+			   Buffer leafBuffer);
 
 /*
  * VACUUM bulkdelete stage: remove index entries.
@@ -44,13 +67,15 @@ IndexBulkDeleteResult *
 gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
+	GistBulkDeleteResult *gist_stats = (GistBulkDeleteResult *) stats;
+
 	/* allocate stats if first time through, else re-use existing struct */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	if (gist_stats == NULL)
+		gist_stats = (GistBulkDeleteResult *) palloc0(sizeof(GistBulkDeleteResult));
 
-	gistvacuumscan(info, stats, callback, callback_state);
+	gistvacuumscan(info, gist_stats, callback, callback_state);
 
-	return stats;
+	return (IndexBulkDeleteResult *) gist_stats;
 }
 
 /*
@@ -59,6 +84,8 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
+	GistBulkDeleteResult *gist_stats = (GistBulkDeleteResult *) stats;
+
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
@@ -68,10 +95,26 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 * stats from the latest gistbulkdelete call.  If it wasn't called, we
 	 * still need to do a pass over the index, to obtain index statistics.
 	 */
-	if (stats == NULL)
+	if (gist_stats == NULL)
 	{
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-		gistvacuumscan(info, stats, NULL, NULL);
+		gist_stats = (GistBulkDeleteResult *) palloc0(sizeof(GistBulkDeleteResult));
+		gistvacuumscan(info, gist_stats, NULL, NULL);
+	}
+
+	/*
+	 * If we saw any empty pages that could be recycled, try to unlink them from
+	 * the tree so that they can be reused.
+	 */
+	if (gist_stats->emptyLeafPagesSet)
+	{
+		gistvacuum_recycle_pages(gist_stats);
+		intset_free(gist_stats->emptyLeafPagesSet);
+		gist_stats->emptyLeafPagesSet = NULL;
+	}
+	if (gist_stats->internalPagesSet)
+	{
+		intset_free(gist_stats->internalPagesSet);
+		gist_stats->internalPagesSet = NULL;
 	}
 
 	/*
@@ -82,11 +125,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 */
 	if (!info->estimated_count)
 	{
-		if (stats->num_index_tuples > info->num_heap_tuples)
-			stats->num_index_tuples = info->num_heap_tuples;
+		if (gist_stats->stats.num_index_tuples > info->num_heap_tuples)
+			gist_stats->stats.num_index_tuples = info->num_heap_tuples;
 	}
 
-	return stats;
+	return (IndexBulkDeleteResult *) gist_stats;
 }
 
 /*
@@ -97,15 +140,16 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * btvacuumcleanup invoke this (the latter only if no btbulkdelete call
  * occurred).
  *
- * This also adds unused/delete pages to the free space map, although that
- * is currently not very useful.  There is currently no support for deleting
- * empty pages, so recycleable pages can only be found if an error occurs
- * while the index is being expanded, leaving an all-zeros page behind.
+ * This also makes note of any empty leaf pages, as well as all internal
+ * pages. gistvacuum_recycle_pages() needs that information.  Any deleted
+ * pages are added directly to the free space map.  (They should have been
+ * added there already when they were originally deleted, but it's possible
+ * that the FSM was lost in a crash, for example.)
  *
  * The caller is responsible for initially allocating/zeroing a stats struct.
  */
 static void
-gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+gistvacuumscan(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
@@ -118,12 +162,19 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * Reset counts that will be incremented during the scan; needed in case
 	 * of multiple scans during a single VACUUM command.
 	 */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
-	stats->pages_deleted = 0;
+	stats->stats.estimated_count = false;
+	stats->stats.num_index_tuples = 0;
+	stats->stats.pages_deleted = 0;
+
+	if (stats->internalPagesSet != NULL)
+		intset_free(stats->internalPagesSet);
+	stats->internalPagesSet = intset_create();
+	if (stats->emptyLeafPagesSet != NULL)
+		intset_free(stats->emptyLeafPagesSet);
+	stats->emptyLeafPagesSet = intset_create();
 
 	/* Set up info to pass down to gistvacuumpage */
-	vstate.info = info;
+	stats->info = info;
 	vstate.stats = stats;
 	vstate.callback = callback;
 	vstate.callback_state = callback_state;
@@ -171,6 +222,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (blkno >= num_pages)
 			break;
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; blkno < num_pages; blkno++)
 			gistvacuumpage(&vstate, blkno, blkno);
@@ -192,8 +244,8 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		IndexFreeSpaceMapVacuum(rel);
 
 	/* update statistics */
-	stats->num_pages = num_pages;
-	stats->pages_free = vstate.totFreePages;
+	stats->stats.num_pages = num_pages;
+	stats->stats.pages_free = vstate.totFreePages;
 }
 
 /*
@@ -210,8 +262,8 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 static void
 gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 {
-	IndexVacuumInfo *info = vstate->info;
-	IndexBulkDeleteResult *stats = vstate->stats;
+	GistBulkDeleteResult *stats = vstate->stats;
+	IndexVacuumInfo *info = stats->info;
 	IndexBulkDeleteCallback callback = vstate->callback;
 	void	   *callback_state = vstate->callback_state;
 	Relation	rel = info->index;
@@ -240,12 +292,13 @@ restart:
 		/* Okay to recycle this page */
 		RecordFreeIndexPage(rel, blkno);
 		vstate->totFreePages++;
-		stats->pages_deleted++;
+		stats->stats.pages_deleted++;
 	}
 	else if (GistPageIsLeaf(page))
 	{
 		OffsetNumber todelete[MaxOffsetNumber];
 		int			ntodelete = 0;
+		int			nremain;
 		GISTPageOpaque opaque = GistPageGetOpaque(page);
 		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
 
@@ -314,12 +367,28 @@ restart:
 
 			END_CRIT_SECTION();
 
-			stats->tuples_removed += ntodelete;
+			stats->stats.tuples_removed += ntodelete;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
 
-		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
+		nremain = maxoff - FirstOffsetNumber + 1;
+		if (nremain == 0)
+		{
+			/*
+			 * The page is now completely empty.  Remember its block number,
+			 * we will try to delete the page in the second stage, in
+			 * gistvacuum_recycle_pages().
+			 *
+			 * Skip this when recursing, because IntegerSet requires that the
+			 * values are added in ascending order.  The next VACUUM will pick
+			 * it up...
+			 */
+			if (blkno == orig_blkno)
+				intset_add_member(stats->emptyLeafPagesSet, blkno);
+		}
+		else
+			stats->stats.num_index_tuples += nremain;
 	}
 	else
 	{
@@ -347,6 +416,14 @@ restart:
 						 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
 						 errhint("Please REINDEX it.")));
 		}
+
+		/*
+		 * Remember the block number of this page, so that we can revisit
+		 * it later in gistvacuum_recycle_pages(), when we search for parents
+		 * of empty children.
+		 */
+		if (blkno == orig_blkno)
+			intset_add_member(stats->internalPagesSet, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
@@ -364,3 +441,218 @@ restart:
 		goto restart;
 	}
 }
+
+/*
+ * Scan all internal pages, and try to delete their empty child pages.
+ */
+static void
+gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
+{
+	IndexVacuumInfo *info = stats->info;
+	Relation	rel = info->index;
+	BlockNumber	empty_pages_remaining;
+
+	empty_pages_remaining = intset_num_entries(stats->emptyLeafPagesSet);
+
+	/*
+	 * Rescan all inner pages to find those that have empty child pages.
+	 */
+	intset_begin_iterate(stats->internalPagesSet);
+	while (empty_pages_remaining > 0)
+	{
+		BlockNumber	blkno;
+		bool		found;
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber off,
+					maxoff;
+		OffsetNumber todelete[MaxOffsetNumber];
+		BlockNumber	leafs_to_delete[MaxOffsetNumber];
+		int			ntodelete;
+		int			deleted;
+
+		blkno = intset_iterate_next(stats->internalPagesSet, &found);
+		if (!found)
+			break;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+									info->strategy);
+
+		LockBuffer(buffer, GIST_SHARE);
+		page = (Page) BufferGetPage(buffer);
+
+		if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+		{
+			/*
+			 * This page was an internal page earlier, but now it's something
+			 * else. Shouldn't happen...
+			 */
+			Assert(false);
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		/*
+		 * Scan all the downlinks, and see if any of them point to empty leaf
+		 * pages.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		ntodelete = 0;
+		for (off = FirstOffsetNumber;
+			 off <= maxoff && ntodelete < maxoff - 1;
+			 off = OffsetNumberNext(off))
+		{
+			ItemId		iid = PageGetItemId(page, off);
+			IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
+			BlockNumber leafblk;
+
+			leafblk = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+			if (intset_is_member(stats->emptyLeafPagesSet, leafblk))
+			{
+				leafs_to_delete[ntodelete] = leafblk;
+				todelete[ntodelete++] = off;
+			}
+		}
+
+		/*
+		 * In order to avoid deadlock, the child page must be locked before
+		 * the parent, so we must release the lock on the parent, lock the
+		 * child, and then re-acquire the lock on the parent.  (And we
+		 * wouldn't want to do I/O while holding a lock, anyway.)
+		 *
+		 * While we're not holding a lock on the parent, the downlink might
+		 * get moved by a concurrent insertion, so we must re-check that
+		 * it still points to the same child page after we have acquired both
+		 * locks. Also, another backend might have inserted a tuple to the
+		 * page, so that it is no longer empty. gistdeletepage() re-checks all
+		 * these conditions.
+		 */
+		LockBuffer(buffer, GIST_UNLOCK);
+
+		deleted = 0;
+		for (int i = 0; i < ntodelete; i++)
+		{
+			Buffer		leafbuf;
+
+			/*
+			 * Don't remove the last downlink from the parent. That would
+			 * confuse the insertion code.
+			 */
+			if (PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
+				break;
+
+			leafbuf = ReadBufferExtended(rel, MAIN_FORKNUM, leafs_to_delete[i],
+										 RBM_NORMAL, info->strategy);
+			LockBuffer(leafbuf, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafbuf);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			if (gistdeletepage(stats, buffer, todelete[i] - deleted, leafbuf))
+				deleted++;
+			LockBuffer(buffer, GIST_UNLOCK);
+
+			UnlockReleaseBuffer(leafbuf);
+		}
+		empty_pages_remaining -= deleted;
+
+		ReleaseBuffer(buffer);
+	}
+}
+
+
+/*
+ * gistdeletepage takes a parent page and a leaf page, and tries to delete
+ * the leaf.  Both pages must be locked.  Returns true if the deletion
+ * actually happened.  Never removes the last downlink from the parent.
+ */
+static bool
+gistdeletepage(GistBulkDeleteResult *stats,
+			   Buffer parentBuffer, OffsetNumber downlink,
+			   Buffer leafBuffer)
+{
+	Page		parentPage = BufferGetPage(parentBuffer);
+	Page		leafPage = BufferGetPage(leafBuffer);
+	ItemId		iid;
+	IndexTuple	idxtuple;
+	XLogRecPtr	recptr;
+	TransactionId txid;
+
+	/* Check that the leaf is still empty */
+	if (!GistPageIsLeaf(leafPage))
+	{
+		Assert(false);
+		return false;
+	}
+	if (PageGetMaxOffsetNumber(leafPage) != InvalidOffsetNumber)
+		return false;		/* no longer empty */
+
+	if (GistFollowRight(leafPage)
+		|| GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
+	{
+		/* Don't mess with a concurrent page split. */
+		return false;
+	}
+
+	/*
+	 * Check that the parent page still looks valid.
+	 */
+	if (PageIsNew(parentPage) ||
+		GistPageIsDeleted(parentPage) ||
+		GistPageIsLeaf(parentPage))
+	{
+		Assert(false);
+		return false;
+	}
+
+	/*
+	 * Check that the old downlink still points to leafBuffer.
+	 *
+	 * It might have been moved by a concurrent insert.  We could try to relocate
+	 * it, by scanning the page again, or perhaps even by moving right if
+	 * the page was split, but let's keep it simple and just give up.
+	 * The next VACUUM will pick this up again.
+	 */
+	if (PageGetMaxOffsetNumber(parentPage) < downlink
+		|| PageGetMaxOffsetNumber(parentPage) <= FirstOffsetNumber)
+		return false;
+
+	iid = PageGetItemId(parentPage, downlink);
+	idxtuple = (IndexTuple) PageGetItem(parentPage, iid);
+	if (BufferGetBlockNumber(leafBuffer) !=
+		ItemPointerGetBlockNumber(&(idxtuple->t_tid)))
+		return false;
+
+	/*
+	 * All good.  Proceed with the deletion.
+	 *
+	 * Like in _bt_unlink_halfdead_page, we need an upper bound on the xid
+	 * of any transaction that could still see the downlink to this page.
+	 * We use ReadNewTransactionId() instead of GetCurrentTransactionId(),
+	 * since we are in a VACUUM.
+	 */
+	txid = ReadNewTransactionId();
+
+	/* Mark page as deleted dropping references from internal pages */
+	START_CRIT_SECTION();
+
+	/* Remember xid of last transaction that could see this page */
+	GistPageSetDeleteXid(leafPage, txid);
+	GistPageSetDeleted(leafPage);
+	MarkBufferDirty(leafBuffer);
+	stats->stats.pages_deleted++;
+
+	MarkBufferDirty(parentBuffer);
+	/* Note: offsets shift whenever tuples are deleted from an internal page */
+	PageIndexTupleDelete(parentPage, downlink);
+
+	if (RelationNeedsWAL(stats->info->index))
+		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlinkOffset);
+	else
+		recptr = gistGetFakeLSN(stats->info->index);
+	PageSetLSN(parentPage, recptr);
+	PageSetLSN(leafPage, recptr);
+
+	END_CRIT_SECTION();
+
+	return true;
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 408bd5390af..4dbca41bab1 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -508,6 +508,43 @@ gistRedoCreateIndex(XLogReaderState *record)
 	UnlockReleaseBuffer(buffer);
 }
 
+/* redo page deletion */
+static void
+gistRedoPageDelete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	/* FIXME: Are we locking the pages in correct order, for hot standby? */
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
 void
 gist_redo(XLogReaderState *record)
 {
@@ -535,6 +572,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageDelete(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -653,6 +693,31 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page deletion. This also includes removal of
+ * downlink from the parent page.
+ */
+XLogRecPtr
+gistXLogPageDelete(Buffer buffer, TransactionId xid,
+				   Buffer parentBuffer, OffsetNumber downlinkOffset)
+{
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = downlinkOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, parentBuffer, REGBUF_STANDARD);
+
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e468c9e15aa..0861f829922 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -76,6 +76,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 3234f241560..ce8bfd83ea4 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,10 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+/* For deleted pages we store last xid which could see the page in scan */
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 463d2bfc7b9..c77d0b4dd81 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -414,12 +414,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
+				   TransactionId xid, Buffer parentBuffer,
+				   OffsetNumber downlinkOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
 			   Buffer leftchild);
 
-XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
+extern XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
 			   int ntodelete, RelFileNode hnode);
 
 extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 5117aabf1af..939a1ea7559 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -23,7 +23,7 @@
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -76,6 +76,16 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+/*
+ * Backup Blk 0: page that was deleted.
+ * Backup Blk 1: parent page, containing the downlink to the deleted page.
+ */
+typedef struct gistxlogPageDelete
+{
+	TransactionId deleteXid;	/* last Xid which could see page in scan */
+	OffsetNumber downlinkOffset; /* Offset of downlink referencing this page */
+} gistxlogPageDelete;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf2..0a43449f003 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,10 +27,8 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
-delete from gist_point_tbl where id < 10000;
+-- And also delete some concentration of values.
+delete from gist_point_tbl where id > 5000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
 alter index gist_pointidx SET (fillfactor = 40);
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13c..657b1954847 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,10 +28,8 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
-delete from gist_point_tbl where id < 10000;
+-- And also delete some concentration of values.
+delete from gist_point_tbl where id > 5000;
 
 vacuum analyze gist_point_tbl;
 
-- 
2.20.1

#54Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#51)
2 attachment(s)
Re: GiST VACUUM

On 15/03/2019 20:25, Andrey Borodin wrote:
> On 11 March 2019, at 20:03, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> On 10/03/2019 18:40, Andrey Borodin wrote:
>>> One thing still bothers me. Let's assume that we have an internal
>>> page with 2 deletable leaves. We lock these leaves in the order of
>>> items on the internal page. Is it possible that the 2nd page has a
>>> follow-right link to the 1st, and someone will lock the 2nd page,
>>> try to lock the 1st, and deadlock with VACUUM?
>>
>> Hmm. If the follow-right flag is set on a page, it means that its
>> right sibling doesn't have a downlink in the parent yet.
>> Nevertheless, I think I'd sleep better if we acquired the locks in
>> left-to-right order, to be safe.
>
> Actually, I did not find lock coupling in the GiST code. But I decided
> to lock just two pages at once (leaf, then parent, for every pair).
> PFA v22 with this concurrency logic.

Good. I just noticed that the README actually does say explicitly that
the child must be locked before the parent.
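
In code, the per-pair pattern boils down to something like this (a
simplified sketch based on the v3 patch earlier in the thread; leafblk,
parentbuf and downlinkoff stand in for the real loop variables, and the
re-check details are omitted):

	/* for each (internal page, empty leaf) pair: child first, then parent */
	leafbuf = ReadBufferExtended(rel, MAIN_FORKNUM, leafblk,
								 RBM_NORMAL, info->strategy);
	LockBuffer(leafbuf, GIST_EXCLUSIVE);	/* lock the child... */
	LockBuffer(parentbuf, GIST_EXCLUSIVE);	/* ...then its parent */

	/* gistdeletepage() re-checks emptiness and the downlink */
	(void) gistdeletepage(stats, parentbuf, downlinkoff, leafbuf);

	LockBuffer(parentbuf, GIST_UNLOCK);
	UnlockReleaseBuffer(leafbuf);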

I rebased this over the new IntegerSet implementation, from the other
thread, and did another round of refactoring, cleanups, etc. Attached is
a new version of this patch. I'm also including the IntegerSet patch
here, for convenience, but it's the same patch I posted at [1]/messages/by-id/1035d8e6-cfd1-0c27-8902-40d8d45eb6e8@iki.fi.

It's in pretty good shape, but there's one remaining issue that needs to
be fixed:

During Hot Standby, the B-tree code writes a WAL record when a deleted
page is recycled, to prevent the deletion from being replayed too early
in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I
think we need to do something similar in GiST.
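
To make that concrete, the GiST equivalent could look roughly like the
sketch below, modeled on the B-tree code. (The gistxlogPageReuse struct,
the XLOG_GIST_PAGE_REUSE opcode and the function names here are
assumptions for illustration, not something in the attached patches.)

	typedef struct gistxlogPageReuse
	{
		RelFileNode		node;
		BlockNumber		block;
		TransactionId	latestRemovedXid;
	} gistxlogPageReuse;

	/* called before recycling a deleted page, like _bt_log_reuse_page() */
	static void
	gistXLogPageReuse(Relation rel, BlockNumber blkno,
					  TransactionId latestRemovedXid)
	{
		gistxlogPageReuse xlrec_reuse;

		/*
		 * No buffer is registered, because this record doesn't modify the
		 * page; it only provides a conflict point for hot standby.
		 */
		xlrec_reuse.node = rel->rd_node;
		xlrec_reuse.block = blkno;
		xlrec_reuse.latestRemovedXid = latestRemovedXid;

		XLogBeginInsert();
		XLogRegisterData((char *) &xlrec_reuse, sizeof(xlrec_reuse));
		XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_REUSE);
	}

	/* and redo would resolve standby conflicts, like btree_xlog_reuse_page() */
	static void
	gistRedoPageReuse(XLogReaderState *record)
	{
		gistxlogPageReuse *xlrec = (gistxlogPageReuse *) XLogRecGetData(record);

		if (InHotStandby)
			ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
												xlrec->node);
	}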

I'll try fixing that tomorrow, unless you beat me to it. Making the
changes is pretty straightforward, but it's a bit cumbersome to test.

[1]: /messages/by-id/1035d8e6-cfd1-0c27-8902-40d8d45eb6e8@iki.fi

- Heikki

Attachments:

0001-Add-IntegerSet-to-hold-large-sets-of-64-bit-ints-eff.patchtext/x-patch; name=0001-Add-IntegerSet-to-hold-large-sets-of-64-bit-ints-eff.patchDownload
From 4c05c69020334babdc1aa406c5032ae2861e5cb5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2019 02:26:08 +0200
Subject: [PATCH 1/2] Add IntegerSet, to hold large sets of 64-bit ints
 efficiently.

The set is implemented as a B-tree, with a compact representation at leaf
items, using the Simple-8b algorithm, so that clusters of nearby values take
less space.

This doesn't include any use of the code yet, but we have two patches in
the works that would benefit from this:

* the GiST vacuum patch, to track empty GiST pages and internal GiST pages.

* Reducing memory usage, and also allowing more than 1 GB of memory to be
  used, to hold the dead TIDs in VACUUM.

This includes a unit test module, in src/test/modules/test_integerset.
It can be used to verify correctness, as a regression test, but if you run
it manully, it can also print memory usage and execution time of some of
the tests.

Author: Heikki Linnakangas, Andrey Borodin
Discussion: https://www.postgresql.org/message-id/b5e82599-1966-5783-733c-1a947ddb729f@iki.fi
---
 src/backend/lib/Makefile                      |    2 +-
 src/backend/lib/README                        |    2 +
 src/backend/lib/integerset.c                  | 1039 +++++++++++++++++
 src/include/lib/integerset.h                  |   25 +
 src/test/modules/Makefile                     |    1 +
 src/test/modules/test_integerset/.gitignore   |    4 +
 src/test/modules/test_integerset/Makefile     |   21 +
 src/test/modules/test_integerset/README       |    7 +
 .../expected/test_integerset.out              |   14 +
 .../test_integerset/sql/test_integerset.sql   |   11 +
 .../test_integerset/test_integerset--1.0.sql  |    8 +
 .../modules/test_integerset/test_integerset.c |  622 ++++++++++
 .../test_integerset/test_integerset.control   |    4 +
 13 files changed, 1759 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/lib/integerset.c
 create mode 100644 src/include/lib/integerset.h
 create mode 100644 src/test/modules/test_integerset/.gitignore
 create mode 100644 src/test/modules/test_integerset/Makefile
 create mode 100644 src/test/modules/test_integerset/README
 create mode 100644 src/test/modules/test_integerset/expected/test_integerset.out
 create mode 100644 src/test/modules/test_integerset/sql/test_integerset.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset--1.0.sql
 create mode 100644 src/test/modules/test_integerset/test_integerset.c
 create mode 100644 src/test/modules/test_integerset/test_integerset.control

diff --git a/src/backend/lib/Makefile b/src/backend/lib/Makefile
index 191ea9bca26..3c1ee1df83a 100644
--- a/src/backend/lib/Makefile
+++ b/src/backend/lib/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = binaryheap.o bipartite_match.o bloomfilter.o dshash.o hyperloglog.o \
-       ilist.o knapsack.o pairingheap.o rbtree.o stringinfo.o
+       ilist.o integerset.o knapsack.o pairingheap.o rbtree.o stringinfo.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/lib/README b/src/backend/lib/README
index ae5debe1bc6..f2fb591237d 100644
--- a/src/backend/lib/README
+++ b/src/backend/lib/README
@@ -13,6 +13,8 @@ hyperloglog.c - a streaming cardinality estimator
 
 ilist.c - single and double-linked lists
 
+integerset.c - a data structure for holding large set of integers
+
 knapsack.c - knapsack problem solver
 
 pairingheap.c - a pairing heap
diff --git a/src/backend/lib/integerset.c b/src/backend/lib/integerset.c
new file mode 100644
index 00000000000..c9172fa2005
--- /dev/null
+++ b/src/backend/lib/integerset.c
@@ -0,0 +1,1039 @@
+/*-------------------------------------------------------------------------
+ *
+ * integerset.c
+ *	  Data structure to hold a large set of 64-bit integers efficiently
+ *
+ * IntegerSet provides an in-memory data structure to hold a set of
+ * arbitrary 64-bit integers.  Internally, the values are stored in a
+ * B-tree, with a special packed representation at the leaf level using
+ * the Simple-8b algorithm, which can pack clusters of nearby values
+ * very tightly.
+ *
+ * Memory consumption depends on the number of values stored, but also
+ * on how far the values are from each other.  In the best case, with
+ * long runs of consecutive integers, memory consumption can be as low as
+ * 0.1 bytes per integer.  In the worst case, if integers are more than
+ * 2^32 apart, it uses about 8 bytes per integer.  In typical use, the
+ * consumption per integer is somewhere between those extremes, depending
+ * on the range of integers stored, and how "clustered" they are.
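+ *
+ * (As a worked example of the best case: each leaf item is two 64-bit
+ * words and can hold up to 241 consecutive integers, the plain 'first'
+ * value plus a mode-0 codeword of 240 more, so 16 bytes / 241 values is
+ * about 0.066 bytes per integer, before tree-node overhead.)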
+ *
+ *
+ * Interface
+ * ---------
+ *
+ *	intset_create			- Create a new empty set.
+ *	intset_add_member		- Add an integer to the set.
+ *	intset_is_member		- Test if an integer is in the set
+ *	intset_begin_iterate	- Begin iterating through all integers in set
+ *	intset_iterate_next		- Return next integer
+ *
+ *
+ * Limitations
+ * -----------
+ *
+ * - Values must be added in order.  (Random insertions would require
+ *   splitting nodes, which hasn't been implemented.)
+ *
+ * - Values cannot be added while iteration is in progress.
+ *
+ * - No support for removing values.
+ *
+ * None of these limitations are fundamental to the data structure, and
+ * could be lifted if needed, by writing some new code.  But the current
+ * users of this facility don't need them.
+ *
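+ * A minimal usage sketch (illustrative only, not taken from any caller):
+ *
+ *		IntegerSet *set = intset_create();
+ *		uint64		x;
+ *		bool		found;
+ *
+ *		intset_add_member(set, 1);		(members must be added
+ *		intset_add_member(set, 5);		 in ascending order)
+ *		Assert(intset_is_member(set, 5));
+ *
+ *		intset_begin_iterate(set);
+ *		while ((x = intset_iterate_next(set, &found)), found)
+ *			elog(DEBUG1, "member: " UINT64_FORMAT, x);
+ *		intset_free(set);
+ *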
+ *
+ * References
+ * ----------
+ *
+ * Simple-8b encoding is based on:
+ *
+ * Vo Ngoc Anh , Alistair Moffat, Index compression using 64-bit words,
+ *   Software - Practice & Experience, v.40 n.2, p.131-147, February 2010
+ *   (https://doi.org/10.1002/spe.948)
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/lib/integerset.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "lib/integerset.h"
+#include "port/pg_bitutils.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Properties of Simple-8b encoding.  (These are needed here, before
+ * other definitions, so that we can size other arrays accordingly).
+ *
+ * SIMPLE8B_MAX_VALUE is the greatest integer that can be encoded.  Simple-8B
+ * uses 64-bit words, but uses four bits to indicate the "mode" of the
+ * codeword, leaving at most 60 bits for the actual integer.
+ *
+ * SIMPLE8B_MAX_VALUES_PER_CODEWORD is the maximum number of integers that
+ * can be encoded in a single codeword.
+ */
+#define SIMPLE8B_MAX_VALUE		((1L << 60) - 1)
+#define SIMPLE8B_MAX_VALUES_PER_CODEWORD 240
+
+/*
+ * Parameters for shape of the in-memory B-tree.
+ *
+ * These set the size of each internal and leaf node.  They don't necessarily
+ * need to be the same, because the tree is just an in-memory structure.
+ * With the default 64, each node is about 1 kb.
+ *
+ * If you change these, you must recalculate MAX_TREE_LEVELS, too!
+ */
+#define MAX_INTERNAL_ITEMS	64
+#define MAX_LEAF_ITEMS	64
+
+/*
+ * Maximum height of the tree.
+ *
+ * MAX_TREE_LEVELS is calculated from the "fan-out" of the B-tree.  The
+ * theoretical maximum number of items that we can store in a set is 2^64,
+ * so MAX_TREE_LEVELS should be set so that:
+ *
+ *   MAX_LEAF_ITEMS * MAX_INTERNAL_ITEMS ^ (MAX_TREE_LEVELS - 1) >= 2^64.
+ *
+ * In practice, we'll need far fewer levels, because you will run out of
+ * memory long before reaching that number, but let's be conservative.
+ */
+#define MAX_TREE_LEVELS		11
+
+/*
+ * Node structures, for the in-memory B-tree.
+ *
+ * An internal node holds a number of downlink pointers to leaf nodes, or
+ * to internal nodes on lower level.  For each downlink, the key value
+ * corresponding the lower level node is stored in a sorted array.  The
+ * stored key values are low keys.  In other words, if the downlink has value
+ * X, then all items stored on that child are >= X.
+ *
+ * Each leaf node holds a number of "items", with a varying number of
+ * integers packed into each item.  Each item consists of two 64-bit words:
+ * The first word holds the first integer stored in the item, in plain format.
+ * The second word contains between 0 and 240 more integers, packed using
+ * Simple-8b encoding.  By storing the first integer in plain, unpacked,
+ * format, we can use binary search to quickly find an item that holds (or
+ * would hold) a particular integer.  And by storing the rest in packed form,
+ * we still get pretty good memory density, if there are clusters of integers
+ * with similar values.
+ *
+ * Each leaf node also has a pointer to the next leaf node, so that the leaf
+ * nodes can be easily walked from beginning to end, when iterating.
+ */
+typedef struct intset_node intset_node;
+typedef struct intset_leaf_node intset_leaf_node;
+typedef struct intset_internal_node intset_internal_node;
+
+/* Common structure of both leaf and internal nodes. */
+struct intset_node
+{
+	uint16		level;
+	uint16		num_items;
+};
+
+/* Internal node */
+struct intset_internal_node
+{
+	/* common header, must match intset_node */
+	uint16		level;			/* >= 1 on internal nodes */
+	uint16		num_items;
+
+	/*
+	 * 'values' is an array of key values, and 'downlinks' are pointers
+	 * to lower-level nodes, corresponding to the key values.
+	 */
+	uint64		values[MAX_INTERNAL_ITEMS];
+	intset_node *downlinks[MAX_INTERNAL_ITEMS];
+};
+
+/* Leaf node */
+typedef struct
+{
+	uint64		first;		/* first integer in this item */
+	uint64		codeword;	/* simple8b encoded differences from 'first' */
+} leaf_item;
+
+#define MAX_VALUES_PER_LEAF_ITEM	(1 + SIMPLE8B_MAX_VALUES_PER_CODEWORD)
+
+struct intset_leaf_node
+{
+	/* common header, must match intset_node */
+	uint16		level;			/* 0 on leafs */
+	uint16		num_items;
+
+	intset_leaf_node *next;	/* right sibling, if any */
+
+	leaf_item	items[MAX_LEAF_ITEMS];
+};
+
+/*
+ * We buffer insertions in a simple array, before packing and inserting them
+ * into the B-tree.  MAX_BUFFERED_VALUES sets the size of the buffer.  The
+ * encoder assumes that it is large enough that we can always fill a leaf
+ * item from the buffered values.  In other words, MAX_BUFFERED_VALUES must be
+ * larger than MAX_VALUES_PER_LEAF_ITEM.
+ */
+#define MAX_BUFFERED_VALUES			(MAX_VALUES_PER_LEAF_ITEM * 2)
+
+/*
+ * IntegerSet is the top-level object representing the set.
+ *
+ * The integers are stored in an in-memory B-tree structure, and an array
+ * for newly-added integers.  IntegerSet also tracks information about memory
+ * usage, as well as the current position, when iterating the set with
+ * intset_begin_iterate / intset_iterate_next.
+ */
+struct IntegerSet
+{
+	/*
+	 * 'context' is a dedicated memory context, used to hold the IntegerSet
+	 * struct itself, as well as all the tree nodes.
+	 *
+	 * 'mem_used' tracks the amount of memory used.  We don't do anything with
+	 * it in integerset.c itself, but the callers can ask for it with
+	 * intset_memory_usage().
+	 */
+	MemoryContext context;		/* memory context holding everything */
+	uint64		mem_used;		/* amount of memory used */
+
+	uint64		num_entries;	/* total # of values in the set */
+	uint64		highest_value;	/* highest value stored in this set */
+
+	/*
+	 * B-tree to hold the packed values.
+	 *
+	 * 'rightmost_nodes' holds pointers to the rightmost node on each level.
+	 * rightmost_nodes[0] is the rightmost leaf, rightmost_nodes[1] is its
+	 * parent, and so forth, all the way up to the root. These are needed when
+	 * adding new values. (Currently, we require that new values are added at
+	 * the end.)
+	 */
+	int			num_levels;		/* height of the tree */
+	intset_node *root;			/* root node */
+	intset_node *rightmost_nodes[MAX_TREE_LEVELS];
+	intset_leaf_node *leftmost_leaf;	/* leftmost leaf node */
+
+	/*
+	 * Holding area for new items that haven't been inserted to the tree yet.
+	 */
+	uint64		buffered_values[MAX_BUFFERED_VALUES];
+	int			num_buffered_values;
+
+	/*
+	 * Iterator support.
+	 *
+	 * 'iter_values' is an array of integers ready to be returned to the
+	 * caller.  'iter_node' and 'iter_itemno' point to the leaf node, and
+	 * item within the leaf node, to get the next batch of values from.
+	 *
+	 * Normally, 'iter_values' points to 'iter_values_buf', which holds values
+	 * decoded from a leaf item.  But after we have scanned the whole B-tree,
+	 * we iterate through any still-buffered values, too, by pointing
+	 * iter_values to 'buffered_values'.
+	 */
+	uint64	   *iter_values;
+	int			iter_num_values;	/* number of elements in 'iter_values' */
+	int			iter_valueno;		/* index into 'iter_values' */
+	intset_leaf_node *iter_node;	/* current leaf node */
+	int			iter_itemno;		/* next item 'iter_node' to decode */
+
+	uint64		iter_values_buf[MAX_VALUES_PER_LEAF_ITEM];
+};
+
+/*
+ * prototypes for internal functions.
+ */
+static void intset_update_upper(IntegerSet *intset, int level,
+				 intset_node *new_node, uint64 new_node_item);
+static void intset_flush_buffered_values(IntegerSet *intset);
+
+static int intset_binsrch_uint64(uint64 value, uint64 *arr, int arr_elems,
+				bool nextkey);
+static int intset_binsrch_leaf(uint64 value, leaf_item *arr, int arr_elems,
+				bool nextkey);
+
+static uint64 simple8b_encode(uint64 *ints, int *num_encoded, uint64 base);
+static int simple8b_decode(uint64 codeword, uint64 *decoded, uint64 base);
+static bool simple8b_contains(uint64 codeword, uint64 key, uint64 base);
+
+
+/*
+ * Create a new, initially empty, integer set.
+ */
+IntegerSet *
+intset_create(void)
+{
+	MemoryContext context;
+	IntegerSet *intset;
+
+	/*
+	 * Create a new memory context to hold everything.
+	 *
+	 * We never free any nodes, so the generational allocator works well for
+	 * us.
+	 *
+	 * Use a large block size, in the hopes that if we use a lot of memory,
+	 * the libc allocator will give it back to the OS when we free it, rather
+	 * than add it to a free-list. (On glibc, see M_MMAP_THRESHOLD.  As of this
+	 * writing, the effective threshold is somewhere between 128 kB and 4 MB.)
+	 */
+	context = GenerationContextCreate(CurrentMemoryContext,
+									  "integer set",
+									  SLAB_LARGE_BLOCK_SIZE);
+
+	/* Allocate the IntegerSet object itself, in the context. */
+	intset = (IntegerSet *) MemoryContextAlloc(context, sizeof(IntegerSet));
+	intset->context = context;
+	intset->mem_used = GetMemoryChunkSpace(intset);
+
+	intset->num_entries = 0;
+	intset->highest_value = 0;
+
+	intset->num_levels = 0;
+	intset->root = NULL;
+	memset(intset->rightmost_nodes, 0, sizeof(intset->rightmost_nodes));
+	intset->leftmost_leaf = NULL;
+
+	intset->num_buffered_values = 0;
+
+	intset->iter_node = NULL;
+	intset->iter_itemno = 0;
+	intset->iter_valueno = 0;
+	intset->iter_num_values = 0;
+
+	return intset;
+}
+
+/*
+ * Allocate a new node.
+ */
+static intset_internal_node *
+intset_new_internal_node(IntegerSet *intset)
+{
+	intset_internal_node *n;
+
+	n = (intset_internal_node *) MemoryContextAlloc(intset->context,
+													sizeof(intset_internal_node));
+	intset->mem_used += GetMemoryChunkSpace(n);
+
+	n->level = 0;		/* caller must set */
+	n->num_items = 0;
+
+	return n;
+}
+
+static intset_leaf_node *
+intset_new_leaf_node(IntegerSet *intset)
+{
+	intset_leaf_node *n;
+
+	n = (intset_leaf_node *) MemoryContextAlloc(intset->context,
+												sizeof(intset_leaf_node));
+	intset->mem_used += GetMemoryChunkSpace(n);
+
+	n->level = 0;
+	n->num_items = 0;
+	n->next = NULL;
+
+	return n;
+}
+
+/*
+ * Free the integer set
+ */
+void
+intset_free(IntegerSet *intset)
+{
+	/* everything is allocated in the memory context */
+	MemoryContextDelete(intset->context);
+}
+
+/*
+ * Return the number of entries in the integer set.
+ */
+uint64
+intset_num_entries(IntegerSet *intset)
+{
+	return intset->num_entries;
+}
+
+/*
+ * Return the amount of memory used by the integer set.
+ */
+uint64
+intset_memory_usage(IntegerSet *intset)
+{
+	return intset->mem_used;
+}
+
+/*
+ * Add a value to the set.
+ *
+ * Values must be added in order.
+ */
+void
+intset_add_member(IntegerSet *intset, uint64 x)
+{
+	if (intset->iter_node)
+		elog(ERROR, "cannot add new values to integer set when iteration is in progress");
+
+	if (x <= intset->highest_value && intset->num_entries > 0)
+		elog(ERROR, "cannot add value to integer set out of order");
+
+	if (intset->num_buffered_values >= MAX_BUFFERED_VALUES)
+	{
+		/* Time to flush our buffer */
+		intset_flush_buffered_values(intset);
+		Assert(intset->num_buffered_values < MAX_BUFFERED_VALUES);
+	}
+
+	/* Add it to the buffer of newly-added values */
+	intset->buffered_values[intset->num_buffered_values] = x;
+	intset->num_buffered_values++;
+	intset->num_entries++;
+	intset->highest_value = x;
+}
+
+/*
+ * Take a batch of buffered values, and pack them into the B-tree.
+ */
+static void
+intset_flush_buffered_values(IntegerSet *intset)
+{
+	uint64	   *values = intset->buffered_values;
+	uint64		num_values = intset->num_buffered_values;
+	int			num_packed = 0;
+	intset_leaf_node *leaf;
+
+	leaf = (intset_leaf_node *) intset->rightmost_nodes[0];
+
+	/*
+	 * If the tree is completely empty, create the first leaf page, which
+	 * is also the root.
+	 */
+	if (leaf == NULL)
+	{
+		/*
+		 * This is the very first item in the set.
+		 *
+		 * Allocate root node. It's also a leaf.
+		 */
+		leaf = intset_new_leaf_node(intset);
+
+		intset->root = (intset_node *) leaf;
+		intset->leftmost_leaf = leaf;
+		intset->rightmost_nodes[0] = (intset_node *) leaf;
+		intset->num_levels = 1;
+	}
+
+	/*
+	 * Pack buffered values into leaf items, until fewer than
+	 * MAX_VALUES_PER_LEAF_ITEM values remain in the buffer.  In most cases,
+	 * we cannot encode that many values in a single leaf item, but this
+	 * way the encoder doesn't have to worry about running out of input.
+	 */
+	while (num_values - num_packed >= MAX_VALUES_PER_LEAF_ITEM)
+	{
+		leaf_item	item;
+		int			num_encoded;
+
+		/*
+		 * Construct the next leaf item, packing as many buffered values
+		 * as possible.
+		 */
+		item.first = values[num_packed];
+		item.codeword = simple8b_encode(&values[num_packed + 1],
+										&num_encoded,
+										item.first);
+
+		/*
+		 * Add the item to the node, allocating a new node if the old one
+		 * is full.
+		 */
+		if (leaf->num_items >= MAX_LEAF_ITEMS)
+		{
+			/* Allocate new leaf and link it to the tree */
+			intset_leaf_node *old_leaf = leaf;
+
+			leaf = intset_new_leaf_node(intset);
+			old_leaf->next = leaf;
+			intset->rightmost_nodes[0] = (intset_node *) leaf;
+			intset_update_upper(intset, 1, (intset_node *) leaf, item.first);
+		}
+		leaf->items[leaf->num_items++] = item;
+
+		num_packed += 1 + num_encoded;
+	}
+
+	/*
+	 * Move any remaining buffered values to the beginning of the array.
+	 */
+	if (num_packed < intset->num_buffered_values)
+	{
+		memmove(&intset->buffered_values[0],
+				&intset->buffered_values[num_packed],
+				(intset->num_buffered_values - num_packed) * sizeof(uint64));
+	}
+	intset->num_buffered_values -= num_packed;
+}
+
+/*
+ * Insert a downlink into parent node, after creating a new node.
+ *
+ * Recurses if the parent node is full, too.
+ */
+static void
+intset_update_upper(IntegerSet *intset, int level, intset_node *child,
+					uint64 child_key)
+{
+	intset_internal_node *parent;
+
+	Assert(level > 0);
+
+	/*
+	 * Create a new root node, if necessary.
+	 */
+	if (level >= intset->num_levels)
+	{
+		intset_node *oldroot = intset->root;
+		uint64		downlink_key;
+
+		/* MAX_TREE_LEVELS should be more than enough, this shouldn't happen */
+		if (intset->num_levels == MAX_TREE_LEVELS)
+			elog(ERROR, "could not expand integer set, maximum number of levels reached");
+		intset->num_levels++;
+
+		/*
+		 * Get the first value on the old root page, to be used as the
+		 * downlink.
+		 */
+		if (intset->root->level == 0)
+			downlink_key = ((intset_leaf_node *) oldroot)->items[0].first;
+		else
+			downlink_key = ((intset_internal_node *) oldroot)->values[0];
+
+		parent = intset_new_internal_node(intset);
+		parent->level = level;
+		parent->values[0] = downlink_key;
+		parent->downlinks[0] = oldroot;
+		parent->num_items = 1;
+
+		intset->root = (intset_node *) parent;
+		intset->rightmost_nodes[level] = (intset_node *) parent;
+	}
+
+	/*
+	 * Place the downlink on the parent page.
+	 */
+	parent = (intset_internal_node *) intset->rightmost_nodes[level];
+
+	if (parent->num_items < MAX_INTERNAL_ITEMS)
+	{
+		parent->values[parent->num_items] = child_key;
+		parent->downlinks[parent->num_items] = child;
+		parent->num_items++;
+	}
+	else
+	{
+		/*
+		 * Doesn't fit.  Allocate new parent, with the downlink as the first
+		 * item on it, and recursively insert the downlink to the new parent
+		 * to the grandparent.
+		 */
+		parent = intset_new_internal_node(intset);
+		parent->level = level;
+		parent->values[0] = child_key;
+		parent->downlinks[0] = child;
+		parent->num_items = 1;
+
+		intset->rightmost_nodes[level] = (intset_node *) parent;
+
+		intset_update_upper(intset, level + 1, (intset_node *) parent, child_key);
+	}
+}
+
+/*
+ * Does the set contain the given value?
+ */
+bool
+intset_is_member(IntegerSet *intset, uint64 x)
+{
+	intset_node   *node;
+	intset_leaf_node *leaf;
+	int			level;
+	int			itemno;
+	leaf_item  *item;
+
+	/*
+	 * The value might be in the buffer of newly-added values.
+	 */
+	if (intset->num_buffered_values > 0 && x >= intset->buffered_values[0])
+	{
+		int			itemno;
+
+		itemno = intset_binsrch_uint64(x,
+									   intset->buffered_values,
+									   intset->num_buffered_values,
+									   false);
+		if (itemno >= intset->num_buffered_values)
+			return false;
+		else
+			return intset->buffered_values[itemno] == x;
+	}
+
+	/*
+	 * Start from the root, and walk down the B-tree to find the right leaf
+	 * node.
+	 */
+	if (!intset->root)
+		return false;
+	node = intset->root;
+	for (level = intset->num_levels - 1; level > 0; level--)
+	{
+		intset_internal_node *n = (intset_internal_node *) node;
+
+		Assert(node->level == level);
+
+		itemno = intset_binsrch_uint64(x, n->values, n->num_items, true);
+		if (itemno == 0)
+			return false;
+		node = n->downlinks[itemno - 1];
+	}
+	Assert(node->level == 0);
+	leaf = (intset_leaf_node *) node;
+
+	/*
+	 * Binary search the right item on the leaf page
+	 */
+	itemno = intset_binsrch_leaf(x, leaf->items, leaf->num_items, true);
+	if (itemno == 0)
+		return false;
+	item = &leaf->items[itemno - 1];
+
+	/* Is this a match to the first value on the item? */
+	if (item->first == x)
+		return true;
+	Assert(x > item->first);
+
+	/* Is it in the packed codeword? */
+	if (simple8b_contains(item->codeword, x, item->first))
+		return true;
+
+	return false;
+}
+
+/*
+ * Begin in-order scan through all the values.
+ *
+ * While the iteration is in progress, you cannot add new values to the set.
+ */
+void
+intset_begin_iterate(IntegerSet *intset)
+{
+	intset->iter_node = intset->leftmost_leaf;
+	intset->iter_itemno = 0;
+	intset->iter_valueno = 0;
+	intset->iter_num_values = 0;
+	intset->iter_values = intset->iter_values_buf;
+}
+
+/*
+ * Returns the next integer, when iterating.
+ *
+ * intset_begin_iterate() must be called first.  intset_iterate_next() returns
+ * the next value in the set.  If there are no more values, *found is set
+ * to false.
+ */
+uint64
+intset_iterate_next(IntegerSet *intset, bool *found)
+{
+	for (;;)
+	{
+		if (intset->iter_valueno < intset->iter_num_values)
+		{
+			*found = true;
+			return intset->iter_values[intset->iter_valueno++];
+		}
+
+		/* Our queue is empty, decode next leaf item */
+		if (intset->iter_node && intset->iter_itemno < intset->iter_node->num_items)
+		{
+			/* We have reached end of this packed item.  Step to the next one. */
+			leaf_item  *item;
+			int			num_decoded;
+
+			item = &intset->iter_node->items[intset->iter_itemno++];
+
+			intset->iter_values[0] = item->first;
+			num_decoded = simple8b_decode(item->codeword, &intset->iter_values[1], item->first);
+			intset->iter_num_values = num_decoded + 1;
+
+			intset->iter_valueno = 0;
+			continue;
+		}
+
+		/* No more items on this leaf, step to next node */
+		if (intset->iter_node)
+		{
+			/* No more items on this leaf node.  Step to the next one. */
+			intset->iter_node = intset->iter_node->next;
+			intset->iter_itemno = 0;
+			intset->iter_valueno = 0;
+			intset->iter_num_values = 0;
+			continue;
+		}
+
+		/*
+		 * We have reached the end of the B-tree.  But we might still have
+		 * some integers in the buffer of newly-added values.
+		 */
+		if (intset->iter_values == intset->iter_values_buf)
+		{
+			intset->iter_values = intset->buffered_values;
+			intset->iter_num_values = intset->num_buffered_values;
+			continue;
+		}
+
+		break;
+	}
+
+	/* No more results. */
+	*found = false;
+	return 0;
+}
+
+/*
+ * intset_binsrch_uint64() -- search a sorted array of uint64s
+ *
+ * Returns the first position with key equal to or greater than the given
+ * key.  The returned position would be the "insert" location for the given
+ * key, that is, the position where the new key should be inserted.
+ *
+ * 'nextkey' affects the behavior on equal keys.  If true, and there is an
+ * equal key in the array, this returns the position immediately after the
+ * equal key.  If false, this returns the position of the equal key itself.
+ */
+static int
+intset_binsrch_uint64(uint64 item, uint64 *arr, int arr_elems, bool nextkey)
+{
+	int			low,
+				high,
+				mid;
+
+	low = 0;
+	high = arr_elems;
+	while (high > low)
+	{
+		mid = low + (high - low) / 2;
+
+		if (nextkey)
+		{
+			if (item >= arr[mid])
+				low = mid + 1;
+			else
+				high = mid;
+		}
+		else
+		{
+			if (item > arr[mid])
+				low = mid + 1;
+			else
+				high = mid;
+		}
+	}
+
+	return low;
+}
+
+/* same, but for an array of leaf items */
+static int
+intset_binsrch_leaf(uint64 item, leaf_item *arr, int arr_elems, bool nextkey)
+{
+	int			low,
+				high,
+				mid;
+
+	low = 0;
+	high = arr_elems;
+	while (high > low)
+	{
+		mid = low + (high - low) / 2;
+
+		if (nextkey)
+		{
+			if (item >= arr[mid].first)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+		else
+		{
+			if (item > arr[mid].first)
+				low = mid + 1;
+			else
+				high = mid;
+		}
+	}
+
+	return low;
+}
+
+/*
+ * Simple-8b encoding.
+ *
+ * Simple-8b algorithm packs between 1 and 240 integers into 64-bit words,
+ * called "codewords".  The number of integers packed into a single codeword
+ * depends on the integers being packed: small integers are encoded using
+ * fewer bits than large integers.  A single codeword can store a single
+ * 60-bit integer, or two 30-bit integers, for example.
+ *
+ * Since we're storing a unique, sorted, set of integers, we actually encode
+ * the *differences* between consecutive integers.  That way, clusters of
+ * integers that are close to each other are packed efficiently, regardless
+ * of the absolute values.
+ *
+ * In Simple-8b, each codeword consists of a 4-bit selector, which indicates
+ * how many integers are encoded in the codeword, and the encoded integers
+ * packed into the remaining 60 bits.  The selector allows for 16 different
+ * ways of using the remaining 60 bits, "modes".  The number of integers
+ * packed into a single codeword is listed in the simple8b_modes table below.
+ * For example, consider the following codeword:
+ *
+ *      20-bit integer       20-bit integer       20-bit integer
+ * 1101 00000000000000010010 01111010000100100000 00000000000000010100
+ * ^
+ * selector
+ *
+ * The selector 1101 is 13 in decimal.  From the modes table below, we see
+ * that it means that the codeword encodes three 20-bit integers.  In decimal,
+ * those integers are 18, 500000 and 20.  Because we encode deltas rather than
+ * absolute values, the actual values that they represent are 18, 500018 and
+ * 500038.
+ *
+ * Modes 0 and 1 are a bit special; they encode a run of 240 or 120 zeros
+ * (which means 240 or 120 consecutive integers, since we're encoding
+ * the deltas between integers), without using the rest of the codeword bits
+ * for anything.
+ *
+ * Simple-8b cannot encode integers larger than 60 bits.  Values larger than
+ * that are always stored in the 'first' field of a leaf item, never in the
+ * packed codeword.  If there is a sequence of integers that are more than
+ * 2^60 apart, the codeword will go unused on those items.  To represent that,
+ * we use a magic EMPTY_CODEWORD codeword.
+ */
+static const struct
+{
+	uint8		bits_per_int;
+	uint8		num_ints;
+} simple8b_modes[17] =
+{
+	{  0, 240 },	/* mode  0: 240 zeros */
+	{  0, 120 },	/* mode  1: 120 zeros */
+	{  1,  60 },	/* mode  2: sixty 1-bit integers */
+	{  2,  30 },	/* mode  3: thirty 2-bit integers */
+	{  3,  20 },	/* mode  4: twenty 3-bit integers */
+	{  4,  15 },	/* mode  5: fifteen 4-bit integers */
+	{  5,  12 },	/* mode  6: twelve 5-bit integers */
+	{  6,  10 },	/* mode  7: ten 6-bit integers */
+	{  7,   8 },	/* mode  8: eight 7-bit integers (four bits are wasted) */
+	{  8,   7 },	/* mode  9: seven 8-bit integers (four bits are wasted) */
+	{ 10,   6 },	/* mode 10: six 10-bit integers */
+	{ 12,   5 },	/* mode 11: five 12-bit integers */
+	{ 15,   4 },	/* mode 12: four 15-bit integers */
+	{ 20,   3 },	/* mode 13: three 20-bit integers */
+	{ 30,   2 },	/* mode 14: two 30-bit integers */
+	{ 60,   1 },	/* mode 15: one 60-bit integer */
+	{ 0,    0 }		/* sentinel value */
+};
+
+/*
+ * EMPTY_CODEWORD is a special value, used to indicate "no values".
+ * It is used if the next value is too large to be encoded with Simple-8b.
+ *
+ * This value looks like a 0-mode codeword, but we check for it
+ * specifically.  (In a real 0-mode codeword, all the unused bits are zero.)
+ */
+#define EMPTY_CODEWORD		(0xFFFFFFFFFFFFFFF0)
+
+/*
+ * Encode a number of integers into a Simple-8b codeword.
+ *
+ * Returns the number of integers that were encoded.
+ */
+static uint64
+simple8b_encode(uint64 *ints, int *num_encoded, uint64 base)
+{
+	int			selector;
+	int			nints;
+	int			bits;
+	uint64		diff;
+	uint64		last_val;
+	uint64		codeword;
+	uint64		diffs[60];
+	int			i;
+
+	Assert(ints[0] > base);
+
+	/*
+	 * Select the "mode" to use for the next codeword.
+	 *
+	 * In each iteration, check if the next value can be represented
+	 * in the current mode we're considering.  If it's too large, then
+	 * step up the mode to a wider one, and repeat.  If it fits, move
+	 * on to next integer.  Repeat until the codeword is full, given
+	 * the current mode.
+	 *
+	 * Note that we don't have any way to represent unused slots in the
+	 * codeword, so we require each codeword to be "full".
+	 */
+	selector = 0;
+	nints = simple8b_modes[0].num_ints;
+	bits = simple8b_modes[0].bits_per_int;
+	diff = ints[0] - base - 1;
+	last_val = ints[0];
+	i = 0;
+	for (;;)
+	{
+		if (diff >= (1L << bits))
+		{
+			/* too large, step up to next mode */
+			selector++;
+			nints = simple8b_modes[selector].num_ints;
+			bits = simple8b_modes[selector].bits_per_int;
+			if (i >= nints)
+				break;
+		}
+		else
+		{
+			if (i < 60)
+				diffs[i] = diff;
+			i++;
+			if (i >= nints)
+				break;
+
+			Assert(ints[i] > last_val);
+			diff = ints[i] - last_val - 1;
+			last_val = ints[i];
+		}
+	}
+
+	if (nints == 0)
+	{
+		/* The next value is too large to be encoded with Simple-8b */
+		Assert(i == 0);
+		*num_encoded = 0;
+		return EMPTY_CODEWORD;
+	}
+
+	/*
+	 * Encode the integers using the selected mode.  Note that we shift them
+	 * into the codeword in reverse order, so that they will come out in the
+	 * correct order in the decoder.
+	 */
+	codeword = 0;
+	if (bits > 0)
+	{
+		for (i = nints - 1; i >= 0; i--)
+		{
+			codeword <<= bits;
+			codeword |= diffs[i];
+		}
+	}
+
+	/* add selector to the codeword, and return */
+	codeword <<= 4;
+	codeword |= selector;
+
+	*num_encoded = nints;
+	return codeword;
+}
+
+/*
+ * Decode a codeword into an array of integers.
+ */
+static int
+simple8b_decode(uint64 codeword, uint64 *decoded, uint64 base)
+{
+	int			selector = codeword & 0x0f;
+	int			nints = simple8b_modes[selector].num_ints;
+	uint64		bits = simple8b_modes[selector].bits_per_int;
+	uint64		mask = (1L << bits) - 1;
+	uint64		prev_value;
+
+	if (codeword == EMPTY_CODEWORD)
+		return 0;
+
+	codeword >>= 4;		/* shift out the selector */
+
+	prev_value = base;
+	for (int i = 0; i < nints; i++)
+	{
+		uint64		diff = codeword & mask;
+
+		decoded[i] = prev_value + 1L + diff;
+		prev_value = decoded[i];
+		codeword >>= bits;
+	}
+	return nints;
+}
+
+/*
+ * This is very similar to simple8b_decode(), but instead of decoding all
+ * the values to an array, it just checks if the given integer is part of
+ * the codeword.
+ */
+static bool
+simple8b_contains(uint64 codeword, uint64 key, uint64 base)
+{
+	int			selector = codeword & 0x0f;
+	int			nints = simple8b_modes[selector].num_ints;
+	int			bits = simple8b_modes[selector].bits_per_int;
+
+	if (codeword == EMPTY_CODEWORD)
+		return false;
+
+	codeword >>= 4;		/* shift out the selector */
+
+	if (bits == 0)
+	{
+		/* Special handling for 0-bit cases. */
+		return key - base <= nints;
+	}
+	else
+	{
+		int			mask = (1L << bits) - 1;
+		uint64		prev_value;
+
+		prev_value = base;
+		for (int i = 0; i < nints; i++)
+		{
+			uint64		diff = codeword & mask;
+			uint64		curr_value;
+
+			curr_value = prev_value + 1L + diff;
+
+			if (curr_value >= key)
+			{
+				if (curr_value == key)
+					return true;
+				else
+					return false;
+			}
+
+			codeword >>= bits;
+			prev_value = curr_value;
+		}
+	}
+	return false;
+}
diff --git a/src/include/lib/integerset.h b/src/include/lib/integerset.h
new file mode 100644
index 00000000000..27aa3ee883c
--- /dev/null
+++ b/src/include/lib/integerset.h
@@ -0,0 +1,25 @@
+/*
+ * integerset.h
+ *	  In-memory data structure to hold a large set of integers efficiently
+ *
+ * Portions Copyright (c) 2012-2019, PostgreSQL Global Development Group
+ *
+ * src/include/lib/integerset.h
+ */
+#ifndef INTEGERSET_H
+#define INTEGERSET_H
+
+typedef struct IntegerSet IntegerSet;
+
+extern IntegerSet *intset_create(void);
+extern void intset_free(IntegerSet *intset);
+extern void intset_add_member(IntegerSet *intset, uint64 x);
+extern bool intset_is_member(IntegerSet *intset, uint64 x);
+
+extern uint64 intset_num_entries(IntegerSet *intset);
+extern uint64 intset_memory_usage(IntegerSet *intset);
+
+extern void intset_begin_iterate(IntegerSet *intset);
+extern uint64 intset_iterate_next(IntegerSet *intset, bool *found);
+
+#endif							/* INTEGERSET_H */
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 19d60a506e1..dfd0956aee3 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  test_bloomfilter \
 		  test_ddl_deparse \
 		  test_extensions \
+		  test_integerset \
 		  test_parser \
 		  test_pg_dump \
 		  test_predtest \
diff --git a/src/test/modules/test_integerset/.gitignore b/src/test/modules/test_integerset/.gitignore
new file mode 100644
index 00000000000..5dcb3ff9723
--- /dev/null
+++ b/src/test/modules/test_integerset/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_integerset/Makefile b/src/test/modules/test_integerset/Makefile
new file mode 100644
index 00000000000..3b7c4999d6f
--- /dev/null
+++ b/src/test/modules/test_integerset/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_integerset/Makefile
+
+MODULE_big = test_integerset
+OBJS = test_integerset.o $(WIN32RES)
+PGFILEDESC = "test_integerset - test code for src/backend/lib/integerset.c"
+
+EXTENSION = test_integerset
+DATA = test_integerset--1.0.sql
+
+REGRESS = test_integerset
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_integerset
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_integerset/README b/src/test/modules/test_integerset/README
new file mode 100644
index 00000000000..3e4226adb55
--- /dev/null
+++ b/src/test/modules/test_integerset/README
@@ -0,0 +1,7 @@
+test_integerset contains unit tests for testing the integer set implementation,
+in src/backend/lib/integerset.c
+
+The tests verify the correctness of the implementation, but they can also be
+used as a micro-benchmark:  If you set the 'intset_test_stats' flag in
+test_integerset.c, the tests will print extra information about execution time
+and memory usage.
diff --git a/src/test/modules/test_integerset/expected/test_integerset.out b/src/test/modules/test_integerset/expected/test_integerset.out
new file mode 100644
index 00000000000..d7c88ded092
--- /dev/null
+++ b/src/test/modules/test_integerset/expected/test_integerset.out
@@ -0,0 +1,14 @@
+CREATE EXTENSION test_integerset;
+--
+-- These tests don't produce any interesting output.  We're checking that
+-- the operations complete without crashing or hanging and that none of their
+-- internal sanity tests fail.  They print progress information as INFOs,
+-- which are not interesting for automated tests, so suppress those.
+--
+SET client_min_messages = 'warning';
+SELECT test_integerset();
+ test_integerset 
+-----------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_integerset/sql/test_integerset.sql b/src/test/modules/test_integerset/sql/test_integerset.sql
new file mode 100644
index 00000000000..34223afa885
--- /dev/null
+++ b/src/test/modules/test_integerset/sql/test_integerset.sql
@@ -0,0 +1,11 @@
+CREATE EXTENSION test_integerset;
+
+--
+-- These tests don't produce any interesting output.  We're checking that
+-- the operations complete without crashing or hanging and that none of their
+-- internal sanity tests fail.  They print progress information as INFOs,
+-- which are not interesting for automated tests, so suppress those.
+--
+SET client_min_messages = 'warning';
+
+SELECT test_integerset();
diff --git a/src/test/modules/test_integerset/test_integerset--1.0.sql b/src/test/modules/test_integerset/test_integerset--1.0.sql
new file mode 100644
index 00000000000..d6d5a3f6cf7
--- /dev/null
+++ b/src/test/modules/test_integerset/test_integerset--1.0.sql
@@ -0,0 +1,8 @@
+/* src/test/modules/test_integerset/test_integerset--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_integerset" to load this file. \quit
+
+CREATE FUNCTION test_integerset()
+RETURNS pg_catalog.void STRICT
+AS 'MODULE_PATHNAME' LANGUAGE C;
diff --git a/src/test/modules/test_integerset/test_integerset.c b/src/test/modules/test_integerset/test_integerset.c
new file mode 100644
index 00000000000..24a2e08c0d1
--- /dev/null
+++ b/src/test/modules/test_integerset/test_integerset.c
@@ -0,0 +1,622 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_integerset.c
+ *		Test integer set data structure.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *		src/test/modules/test_integerset/test_integerset.c
+ *
+ * -------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "fmgr.h"
+#include "lib/integerset.h"
+#include "nodes/bitmapset.h"
+#include "utils/memutils.h"
+#include "utils/timestamp.h"
+#include "storage/block.h"
+#include "storage/itemptr.h"
+#include "miscadmin.h"
+
+/*
+ * If you enable this, the "pattern" tests will print information about
+ * how long populating, probing, and iterating the test set takes, and
+ * how much memory the test set consumed.  That can be used as a
+ * micro-benchmark of various operations and input patterns.
+ *
+ * The information is printed to the server's stderr, mostly because
+ * that's where MemoryContextStats() output goes.
+ */
+static const bool intset_test_stats = false;
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_integerset);
+
+/*
+ * A struct to define a pattern of integers, for use with test_pattern()
+ * function.
+ */
+typedef struct
+{
+	char	   *test_name;		/* short name of the test, for humans */
+	char	   *pattern_str;	/* a bit pattern */
+	uint64		spacing;		/* pattern repeats at this interval */
+	uint64		num_values;		/* number of integers to set in total */
+} test_spec;
+
+static const test_spec test_specs[] = {
+	{
+		"all ones", "1111111111",
+		10, 100000000
+	},
+	{
+		"alternating bits", "0101010101",
+		10, 100000000
+	},
+	{
+		"clusters of ten", "1111111111",
+		10000, 10000000
+	},
+	{
+		"clusters of hundred",
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+		10000, 100000000
+	},
+	{
+		"clusters of thousand",
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111"
+		"1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111",
+		10000, 100000000
+	},
+	{
+		"one-every-64k", "1",
+		65536, 10000000
+	},
+	{
+		"sparse", "100000000000000000000000000000001",
+		10000000, 10000000
+	},
+	{
+		"single values, distance > 2^32", "1",
+		10000000000L, 1000000
+	},
+	{
+		"clusters, distance > 2^32", "10101010",
+		10000000000L, 10000000
+	},
+	{
+		"clusters, distance > 2^60", "10101010",
+		2000000000000000000L, 23 /* can't be much higher than this, or we overflow uint64 */
+	}
+};
+
+static void test_pattern(const test_spec *spec);
+static void test_empty(void);
+static void test_single_value(uint64 value);
+static void check_with_filler(IntegerSet *intset, uint64 x, uint64 value, uint64 filler_min, uint64 filler_max);
+static void test_single_value_and_filler(uint64 value, uint64 filler_min, uint64 filler_max);
+static void test_huge_distances(void);
+
+/*
+ * SQL-callable entry point to perform all tests.
+ */
+Datum
+test_integerset(PG_FUNCTION_ARGS)
+{
+	test_huge_distances();
+
+	test_empty();
+
+	test_single_value(0);
+	test_single_value(1);
+	test_single_value(PG_UINT64_MAX - 1);
+	test_single_value(PG_UINT64_MAX);
+
+	/* Same tests, but with some "filler" values, so that the B-tree gets created */
+	test_single_value_and_filler(0, 1000, 2000);
+	test_single_value_and_filler(1, 1000, 2000);
+	test_single_value_and_filler(1, 1000, 2000000);
+	test_single_value_and_filler(PG_UINT64_MAX - 1, 1000, 2000);
+	test_single_value_and_filler(PG_UINT64_MAX, 1000, 2000);
+
+	test_huge_distances();
+
+	for (int i = 0; i < lengthof(test_specs); i++)
+		test_pattern(&test_specs[i]);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Test with a repeating pattern, defined by the 'spec'.
+ */
+static void
+test_pattern(const test_spec *spec)
+{
+	IntegerSet *intset;
+	MemoryContext test_cxt;
+	MemoryContext old_cxt;
+	TimestampTz starttime;
+	TimestampTz endtime;
+	uint64		n;
+	uint64		last_int;
+	int			patternlen;
+	uint64	   *pattern_values;
+	uint64		pattern_num_values;
+
+	elog(NOTICE, "testing intset with pattern \"%s\"", spec->test_name);
+	if (intset_test_stats)
+		fprintf(stderr, "-----\ntesting intset with pattern \"%s\"\n", spec->test_name);
+
+	/* Pre-process the pattern, creating an array of integers from it. */
+	patternlen = strlen(spec->pattern_str);
+	pattern_values = palloc(patternlen * sizeof(uint64));
+	pattern_num_values = 0;
+	for (int i = 0; i < patternlen; i++)
+	{
+		if (spec->pattern_str[i] == '1')
+			pattern_values[pattern_num_values++] = i;
+	}
+
+	/*
+	 * Allocate the integer set.
+	 *
+	 * Allocate it in a separate memory context, so that we can print its
+	 * memory usage easily.  (intset_create() creates a memory context of its
+	 * own, too, but we don't have direct access to it, so we cannot call
+	 * MemoryContextStats() on it directly).
+	 */
+	test_cxt = AllocSetContextCreate(CurrentMemoryContext,
+									 "intset test",
+									 ALLOCSET_SMALL_SIZES);
+	MemoryContextSetIdentifier(test_cxt, spec->test_name);
+	old_cxt = MemoryContextSwitchTo(test_cxt);
+	intset = intset_create();
+	MemoryContextSwitchTo(old_cxt);
+
+	/*
+	 * Add values to the set.
+	 */
+	starttime = GetCurrentTimestamp();
+
+	n = 0;
+	last_int = 0;
+	while (n < spec->num_values)
+	{
+		uint64		x = 0;
+
+		for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+		{
+			x = last_int + pattern_values[i];
+
+			intset_add_member(intset, x);
+			n++;
+		}
+		last_int += spec->spacing;
+	}
+
+	endtime = GetCurrentTimestamp();
+
+	if (intset_test_stats)
+		fprintf(stderr, "added %lu values in %lu ms\n",
+				spec->num_values, (endtime - starttime) / 1000);
+
+	/*
+	 * Print stats on the amount of memory used.
+	 *
+	 * We print the usage reported by intset_memory_usage(), as well as the
+	 * stats from the memory context. They should be in the same ballpark,
+	 * but it's hard to automate testing that, so if you're making changes
+	 * to the implementation, just observe that manually.
+	 */
+	if (intset_test_stats)
+	{
+		uint64		mem_usage;
+
+		/*
+		 * Also print memory usage as reported by intset_memory_usage().
+		 * It should be in the same ballpark as the usage reported by
+		 * MemoryContextStats().
+		 */
+		mem_usage = intset_memory_usage(intset);
+		fprintf(stderr, "intset_memory_usage() reported %lu (%0.2f bytes / integer)\n",
+				mem_usage, (double) mem_usage / spec->num_values);
+
+		MemoryContextStats(test_cxt);
+	}
+
+	/* Check that intset_num_entries() works */
+	n = intset_num_entries(intset);
+	if (n != spec->num_values)
+		elog(ERROR, "intset_num_entries returned %lu, expected %lu", n, spec->num_values);
+
+	/*
+	 * Test random-access probes with intset_is_member()
+	 */
+	starttime = GetCurrentTimestamp();
+
+	for (n = 0; n < 1000000; n++)
+	{
+		bool		b;
+		bool		expected;
+		uint64		x;
+
+		/*
+		 * Pick next value to probe at random.  We limit the probes to the
+		 * last integer that we added to the set, plus an arbitrary constant
+		 * (1000).  There's no point in probing the whole 0 - 2^64 range, if
+		 * only a small part of the integer space is used.  We would very
+		 * rarely hit values that are actually in the set.
+		 */
+		x = (pg_lrand48() << 31) | pg_lrand48();
+		x = x % (last_int + 1000);
+
+		/* Do we expect this value to be present in the set? */
+		if (x >= last_int)
+			expected = false;
+		else
+		{
+			uint64		idx = x % spec->spacing;
+
+			if (idx >= patternlen)
+				expected = false;
+			else if (spec->pattern_str[idx] == '1')
+				expected = true;
+			else
+				expected = false;
+		}
+
+		/* Is it present according to intset_is_member() ? */
+		b = intset_is_member(intset, x);
+
+		if (b != expected)
+			elog(ERROR, "mismatch at %lu: %d vs %d", x, b, expected);
+	}
+	endtime = GetCurrentTimestamp();
+	if (intset_test_stats)
+		fprintf(stderr, "probed %lu values in %lu ms\n", n, (endtime - starttime) / 1000);
+
+	/*
+	 * Test iterator
+	 */
+	starttime = GetCurrentTimestamp();
+
+	intset_begin_iterate(intset);
+	n = 0;
+	last_int = 0;
+	while (n < spec->num_values)
+	{
+		for (int i = 0; i < pattern_num_values && n < spec->num_values; i++)
+		{
+			uint64		expected = last_int + pattern_values[i];
+			uint64		x;
+			bool		found;
+
+			x = intset_iterate_next(intset, &found);
+			if (!found)
+				elog(ERROR, "iterator stopped short after %lu entries, expected %lu",
+					 n, spec->num_values);
+
+			if (x != expected)
+				elog(ERROR, "iterate returned wrong value; got %lu, expected %lu", x, expected);
+			n++;
+		}
+		last_int += spec->spacing;
+	}
+	endtime = GetCurrentTimestamp();
+	if (intset_test_stats)
+		fprintf(stderr, "iterated %lu values in %lu ms\n", n, (endtime - starttime) / 1000);
+
+	if (n < spec->num_values)
+		elog(ERROR, "iterator stopped short after %lu entries, expected %lu", n, spec->num_values);
+	if (n > spec->num_values)
+		elog(ERROR, "iterator returned %lu entries, %lu was expected", n, spec->num_values);
+
+	intset_free(intset);
+}
+
+/*
+ * Test with a set containing a single integer.
+ */
+static void
+test_single_value(uint64 value)
+{
+	IntegerSet *intset;
+	uint64		x;
+	uint64		num_entries;
+	bool		found;
+
+	elog(NOTICE, "testing intset with single value %lu", value);
+
+	/* Create the set. */
+	intset = intset_create();
+	intset_add_member(intset, value);
+
+	/* Test intset_num_entries() */
+	num_entries = intset_num_entries(intset);
+	if (num_entries != 1)
+		elog(ERROR, "intset_num_entries returned %lu, expected %lu", num_entries, 1L);
+
+	/*
+	 * Test intset_is_member() at various special values, like 0 and the maximum
+	 * possible 64-bit integer, as well as the value itself.
+	 */
+	if (intset_is_member(intset, 0) != (value == 0))
+		elog(ERROR, "intset_is_member failed for 0");
+	if (intset_is_member(intset, 1) != (value == 1))
+		elog(ERROR, "intset_is_member failed for 1");
+	if (intset_is_member(intset, PG_UINT64_MAX) != (value == PG_UINT64_MAX))
+		elog(ERROR, "intset_is_member failed for PG_UINT64_MAX");
+	if (intset_is_member(intset, value) != true)
+		elog(ERROR, "intset_is_member failed for the tested value");
+
+	/*
+	 * Test iterator
+	 */
+	intset_begin_iterate(intset);
+	x = intset_iterate_next(intset, &found);
+	if (!found || x != value)
+		elog(ERROR, "intset_iterate_next failed for %lu", x);
+
+	x = intset_iterate_next(intset, &found);
+	if (found)
+		elog(ERROR, "intset_iterate_next failed %lu", x);
+
+	intset_free(intset);
+}
+
+/*
+ * Test with an integer set that contains:
+ *
+ * - a given single 'value', and
+ * - all integers between 'filler_min' (inclusive) and 'filler_max' (exclusive).
+ *
+ * This exercises different codepaths than testing just with a single value,
+ * because the implementation buffers newly-added values.  If we add just a
+ * single value to the set, we won't test the internal B-tree code at all,
+ * just the code that deals with the buffer.
+ */
+static void
+test_single_value_and_filler(uint64 value, uint64 filler_min, uint64 filler_max)
+{
+	IntegerSet *intset;
+	uint64		x;
+	bool		found;
+	uint64	   *iter_expected;
+	uint64		n = 0;
+	uint64		num_entries = 0;
+	uint64		mem_usage;
+
+	elog(NOTICE, "testing intset with value %lu, and all between %lu and %lu",
+		 value, filler_min, filler_max);
+
+	intset = intset_create();
+
+	iter_expected = palloc(sizeof(uint64) * (filler_max - filler_min + 1));
+	if (value < filler_min)
+	{
+		intset_add_member(intset, value);
+		iter_expected[n++] = value;
+	}
+
+	for (x = filler_min; x < filler_max; x++)
+	{
+		intset_add_member(intset, x);
+		iter_expected[n++] = x;
+	}
+
+	if (value >= filler_max)
+	{
+		intset_add_member(intset, value);
+		iter_expected[n++] = value;
+	}
+
+	/* Test intset_num_entries() */
+	num_entries = intset_num_entries(intset);
+	if (num_entries != n)
+		elog(ERROR, "intset_num_entries returned %lu, expected %lu", num_entries, n);
+
+	/*
+	 * Test intset_is_member() at various spots, at and around the values that we
+	 * expect to be set, as well as 0 and the maximum possible value.
+	 */
+	check_with_filler(intset, 0,                 value, filler_min, filler_max);
+	check_with_filler(intset, 1,                 value, filler_min, filler_max);
+	check_with_filler(intset, filler_min - 1,    value, filler_min, filler_max);
+	check_with_filler(intset, filler_min,        value, filler_min, filler_max);
+	check_with_filler(intset, filler_min + 1,    value, filler_min, filler_max);
+	check_with_filler(intset, value - 1,         value, filler_min, filler_max);
+	check_with_filler(intset, value,             value, filler_min, filler_max);
+	check_with_filler(intset, value + 1,         value, filler_min, filler_max);
+	check_with_filler(intset, filler_max - 1,    value, filler_min, filler_max);
+	check_with_filler(intset, filler_max,        value, filler_min, filler_max);
+	check_with_filler(intset, filler_max + 1,    value, filler_min, filler_max);
+	check_with_filler(intset, PG_UINT64_MAX - 1, value, filler_min, filler_max);
+	check_with_filler(intset, PG_UINT64_MAX,     value, filler_min, filler_max);
+
+	intset_begin_iterate(intset);
+	for (uint64 i = 0; i < n; i++)
+	{
+		x = intset_iterate_next(intset, &found);
+		if (!found || x != iter_expected[i])
+			elog(ERROR, "intset_iterate_next failed for %lu", x);
+	}
+	x = intset_iterate_next(intset, &found);
+	if (found)
+		elog(ERROR, "intset_iterate_next failed %lu", x);
+
+	mem_usage = intset_memory_usage(intset);
+	if (mem_usage < 5000 || mem_usage > 500000000)
+		elog(ERROR, "intset_memory_usage() reported suspicous value: %lu", mem_usage);
+
+	intset_free(intset);
+}
+
+/*
+ * Helper function for test_single_value_and_filler.
+ *
+ * Calls intset_is_member() for value 'x', and checks that the result is what
+ * we expect.
+ */
+static void
+check_with_filler(IntegerSet *intset, uint64 x,
+				  uint64 value, uint64 filler_min, uint64 filler_max)
+{
+	bool		expected;
+	bool		actual;
+
+	expected = (x == value || (filler_min <= x && x < filler_max));
+
+	actual = intset_is_member(intset, x);
+
+	if (actual != expected)
+		elog(ERROR, "intset_is_member failed for %lu", x);
+}
+
+/*
+ * Test empty set
+ */
+static void
+test_empty(void)
+{
+	IntegerSet *intset;
+	bool		found = true;
+	uint64		x;
+
+	elog(NOTICE, "testing intset with empty set");
+
+	intset = intset_create();
+
+	/* Test intset_is_member() */
+	if (intset_is_member(intset, 0) != false)
+		elog(ERROR, "intset_is_member on empty set returned true");
+	if (intset_is_member(intset, 1) != false)
+		elog(ERROR, "intset_is_member on empty set returned true");
+	if (intset_is_member(intset, PG_UINT64_MAX) != false)
+		elog(ERROR, "intset_is_member on empty set returned true");
+
+	/* Test iterator */
+	intset_begin_iterate(intset);
+	x = intset_iterate_next(intset, &found);
+	if (found)
+		elog(ERROR, "intset_iterate_next on empty set returned a value (%lu)", x);
+
+	intset_free(intset);
+}
+
+/*
+ * Test with integers that are more than 2^60 apart.
+ *
+ * The Simple-8b encoding used by the set implementation can only encode
+ * deltas up to 2^60.  That makes large distances like this interesting
+ * to test.
+ */
+static void
+test_huge_distances(void)
+{
+	IntegerSet *intset;
+	uint64		values[1000];
+	int			num_values = 0;
+	uint64		val = 0;
+	bool		found;
+	uint64		x;
+
+	elog(NOTICE, "testing intset with distances > 2^60 between values");
+
+	val = 0;
+	values[num_values++] = val;
+
+	val += 1152921504606846976L - 1;	/* 2^60 - 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L - 1;	/* 2^60 - 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L;		/* 2^60 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L;		/* 2^60 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L;		/* 2^60 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L + 1;	/* 2^60 + 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L + 1;	/* 2^60 + 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L + 1;	/* 2^60 + 1 */
+	values[num_values++] = val;
+
+	val += 1152921504606846976L;		/* 2^60 */
+	values[num_values++] = val;
+
+	/* we're now very close to 2^64, so can't add large values anymore */
+
+	intset = intset_create();
+
+	/*
+	 * Add many more values to the end, to make sure that all the above
+	 * values get flushed and packed into the tree structure.
+	 */
+	while (num_values < 1000)
+	{
+		val += pg_lrand48();
+		values[num_values++] = val;
+	}
+
+	/* Add these numbers to the set */
+	for (int i = 0; i < num_values; i++)
+		intset_add_member(intset, values[i]);
+
+	/*
+	 * Test intset_is_member() around each of these values
+	 */
+	for (int i = 0; i < num_values; i++)
+	{
+		uint64		x = values[i];
+		bool		result;
+
+		if (x > 0)
+		{
+			result = intset_is_member(intset, x - 1);
+			if (result != false)
+				elog(ERROR, "intset_is_member failed for %lu", x - 1);
+		}
+
+		result = intset_is_member(intset, x);
+		if (result != true)
+			elog(ERROR, "intset_is_member failed for %lu", x);
+
+		result = intset_is_member(intset, x + 1);
+		if (result != false)
+			elog(ERROR, "intset_is_member failed for %lu", x + 1);
+	}
+
+	/*
+	 * Test iterator
+	 */
+	intset_begin_iterate(intset);
+	for (int i = 0; i < num_values; i++)
+	{
+		x = intset_iterate_next(intset, &found);
+		if (!found || x != values[i])
+			elog(ERROR, "intset_iterate_next failed for %lu", x);
+	}
+	x = intset_iterate_next(intset, &found);
+	if (found)
+		elog(ERROR, "intset_iterate_next failed %lu", x);
+}
diff --git a/src/test/modules/test_integerset/test_integerset.control b/src/test/modules/test_integerset/test_integerset.control
new file mode 100644
index 00000000000..7d20c2d7b88
--- /dev/null
+++ b/src/test/modules/test_integerset/test_integerset.control
@@ -0,0 +1,4 @@
+comment = 'Test code for integerset'
+default_version = '1.0'
+module_pathname = '$libdir/test_integerset'
+relocatable = true
-- 
2.20.1

0002-Delete-empty-pages-during-GiST-VACUUM.patchtext/x-patch; name=0002-Delete-empty-pages-during-GiST-VACUUM.patchDownload
From 1a3690be16be340f288c069c452e8428f1cc48ad Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2019 20:24:44 +0200
Subject: [PATCH 2/2] Delete empty pages during GiST VACUUM

This commit teaches GiST to actually delete pages during VACUUM.

To do this, we scan the GiST index twice. In the first pass we make note
of empty pages and internal pages. In the second pass we scan through the
internal pages, looking for references to empty leaf pages.

Heikki's CHANGES since v22:

* Only scan the empty pages after the last scan, in a multi-pass vacuum.
  (I think that's what we want...) We could actually be smarter, and
  do this as part of the second pass's scan, in a multi-pass vacuum.

* Call ReadNewTransactionId() while holding lock. I think that's needed
  for correctness.

* Use new IntegerSet implementation.

Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/B1E4DF12-6CD3-4706-BDBD-BF3283328F60@yandex-team.ru
---
 src/backend/access/gist/README         |  48 ++++
 src/backend/access/gist/gist.c         |  15 ++
 src/backend/access/gist/gistutil.c     |  15 +-
 src/backend/access/gist/gistvacuum.c   | 350 +++++++++++++++++++++++--
 src/backend/access/gist/gistxlog.c     |  65 +++++
 src/backend/access/rmgrdesc/gistdesc.c |   3 +
 src/include/access/gist.h              |   4 +
 src/include/access/gist_private.h      |   7 +-
 src/include/access/gistxlog.h          |  12 +-
 src/test/regress/expected/gist.out     |   6 +-
 src/test/regress/sql/gist.sql          |   6 +-
 11 files changed, 489 insertions(+), 42 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b81..501b1c1a77a 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,54 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+VACUUM works in two stages:
+
+In the first stage, we scan the whole index in physical order. To make sure
+that we don't miss any dead tuples because a concurrent page split moved them,
+we check the F_FOLLOW_RIGHT flags and NSN on each page, to detect if the
+page has been concurrently split. If a concurrent page split is detected, and
+one half of the page was moved to a position that we already scanned, we "jump"
+to scan the page again. This is the same mechanism that B-tree VACUUM uses,
+but because we already have NSNs on pages, to detect page splits during
+searches, we don't need a "vacuum cycle ID" concept for that like B-tree does.
+
+While we scan all the pages, we also make note of any completely empty leaf
+pages. We will try to unlink them from the tree in the second stage. We also
+record the block numbers of all internal pages, in an IntegerSet. They are
+needed in the second stage, to locate parents of empty pages.
+
+In the second stage, we try to unlink any empty leaf pages from the tree, so
+that their space can be reused. If we didn't see any empty pages in the first
+stage, the second stage is skipped. In order to delete an empty page, its
+downlink must be removed from the parent. We scan all the internal pages
+whose block numbers we memorized in the first stage, and look for downlinks
+to pages that we have memorized as being empty. Whenever we find one, we
+acquire a lock on the parent and child page, and re-check that the child
+page is still empty. Then, we
+remove the downlink and mark the child as deleted, and release the locks.
+
+The insertion algorithm would get confused if an internal page was completely
+empty. So we never delete the last child of an internal page, even if it's
+empty. Currently, we only support deleting leaf pages.
+
+This page deletion algorithm works on a best-effort basis. It might fail to
+find a downlink if a concurrent page split moved it after the first stage.
+In that case, we won't be able to remove all empty pages. That's OK; it's
+not expected to happen very often, and hopefully the next VACUUM will clean
+it up instead.
+
+When we have deleted a page, it's possible that an in-progress search will
+still descend to the page if it saw the downlink before we removed it. The
+search will see that it is deleted, and ignore it, but as long as that can
+happen, we cannot reuse the page. To "wait out" any in-progress searches, when
+the page is deleted, it's labeled with the current next-transaction counter
+value. The page is not recycled until that XID is no longer visible to
+anyone. That's much more conservative than necessary, but let's keep it
+simple.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2ce5425ef98..a746e911f37 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -704,6 +704,9 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/* currently, internal pages are never deleted */
+			Assert(!GistPageIsDeleted(stack->page));
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -838,6 +841,18 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 				}
 			}
 
+			/*
+			 * The page might have been deleted after we scanned the parent
+			 * and saw the downlink.
+			 */
+			if (GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index f32e16eed58..4e511dfb8c2 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/float.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 #include "utils/lsyscache.h"
 
 
@@ -829,13 +830,21 @@ gistNewBuffer(Relation r)
 		{
 			Page		page = BufferGetPage(buffer);
 
+			/*
+			 * If the page was never initialized, it's OK to use.
+			 */
 			if (PageIsNew(page))
-				return buffer;	/* OK to use, if never initialized */
+				return buffer;
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
-				return buffer;	/* OK to use */
+			/*
+			 * Otherwise, recycle it if it is deleted, and old enough that no
+			 * process can still be interested in it.
+			 */
+			if (GistPageIsDeleted(page) &&
+				TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin))
+				return buffer;
 
 			LockBuffer(buffer, GIST_UNLOCK);
 		}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 3c1d75691e8..531b4b73a45 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,26 +16,49 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
+#include "lib/integerset.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
-/* Working state needed by gistbulkdelete */
+/*
+ * State kept across vacuum stages.
+ */
 typedef struct
 {
+	IndexBulkDeleteResult stats;	/* must be first */
+
 	IndexVacuumInfo *info;
-	IndexBulkDeleteResult *stats;
+
+	/*
+	 * These are used to memorize all internal and empty leaf pages in the 1st
+	 * vacuum phase.  They are used in the 2nd phase, to delete all the empty
+	 * pages.
+	 */
+	IntegerSet *internalPagesSet;
+	IntegerSet *emptyLeafPagesSet;
+} GistBulkDeleteResult;
+
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	GistBulkDeleteResult *stats;
 	IndexBulkDeleteCallback callback;
 	void	   *callback_state;
 	GistNSN		startNSN;
 	BlockNumber totFreePages;	/* true total # of free pages */
 } GistVacState;
 
-static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void gistvacuumscan(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state);
 static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
 			   BlockNumber orig_blkno);
+static void gistvacuum_recycle_pages(GistBulkDeleteResult *stats);
+static bool gistdeletepage(GistBulkDeleteResult *stats,
+			   Buffer buffer, OffsetNumber downlink,
+			   Buffer leafBuffer);
 
 /*
  * VACUUM bulkdelete stage: remove index entries.
@@ -44,13 +67,15 @@ IndexBulkDeleteResult *
 gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
+	GistBulkDeleteResult *gist_stats = (GistBulkDeleteResult *) stats;
+
 	/* allocate stats if first time through, else re-use existing struct */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	if (gist_stats == NULL)
+		gist_stats = (GistBulkDeleteResult *) palloc0(sizeof(GistBulkDeleteResult));
 
-	gistvacuumscan(info, stats, callback, callback_state);
+	gistvacuumscan(info, gist_stats, callback, callback_state);
 
-	return stats;
+	return (IndexBulkDeleteResult *) gist_stats;
 }
 
 /*
@@ -59,6 +84,8 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
+	GistBulkDeleteResult *gist_stats = (GistBulkDeleteResult *) stats;
+
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
@@ -68,10 +95,26 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 * stats from the latest gistbulkdelete call.  If it wasn't called, we
 	 * still need to do a pass over the index, to obtain index statistics.
 	 */
-	if (stats == NULL)
+	if (gist_stats == NULL)
 	{
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-		gistvacuumscan(info, stats, NULL, NULL);
+		gist_stats = (GistBulkDeleteResult *) palloc0(sizeof(GistBulkDeleteResult));
+		gistvacuumscan(info, gist_stats, NULL, NULL);
+	}
+
+	/*
+	 * If we saw any empty pages that could be recycled, try to unlink them from
+	 * the tree so that they can be reused.
+	 */
+	if (gist_stats->emptyLeafPagesSet)
+	{
+		gistvacuum_recycle_pages(gist_stats);
+		intset_free(gist_stats->emptyLeafPagesSet);
+		gist_stats->emptyLeafPagesSet = NULL;
+	}
+	if (gist_stats->internalPagesSet)
+	{
+		intset_free(gist_stats->internalPagesSet);
+		gist_stats->internalPagesSet = NULL;
 	}
 
 	/*
@@ -82,11 +125,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 */
 	if (!info->estimated_count)
 	{
-		if (stats->num_index_tuples > info->num_heap_tuples)
-			stats->num_index_tuples = info->num_heap_tuples;
+		if (gist_stats->stats.num_index_tuples > info->num_heap_tuples)
+			gist_stats->stats.num_index_tuples = info->num_heap_tuples;
 	}
 
-	return stats;
+	return (IndexBulkDeleteResult *) gist_stats;
 }
 
 /*
@@ -97,15 +140,16 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * btvacuumcleanup invoke this (the latter only if no btbulkdelete call
  * occurred).
  *
- * This also adds unused/delete pages to the free space map, although that
- * is currently not very useful.  There is currently no support for deleting
- * empty pages, so recycleable pages can only be found if an error occurs
- * while the index is being expanded, leaving an all-zeros page behind.
+ * This also makes note of any empty leaf pages, as well as all internal
+ * pages. gistvacuum_recycle_pages() needs that information.  Any deleted
+ * pages are added directly to the free space map.  (They should already
+ * have been added there when they were originally deleted, but it's
+ * possible that the FSM was lost in a crash, for example.)
  *
  * The caller is responsible for initially allocating/zeroing a stats struct.
  */
 static void
-gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+gistvacuumscan(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
@@ -118,12 +162,19 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * Reset counts that will be incremented during the scan; needed in case
 	 * of multiple scans during a single VACUUM command.
 	 */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
-	stats->pages_deleted = 0;
+	stats->stats.estimated_count = false;
+	stats->stats.num_index_tuples = 0;
+	stats->stats.pages_deleted = 0;
+
+	if (stats->internalPagesSet != NULL)
+		intset_free(stats->internalPagesSet);
+	stats->internalPagesSet = intset_create();
+	if (stats->emptyLeafPagesSet != NULL)
+		intset_free(stats->emptyLeafPagesSet);
+	stats->emptyLeafPagesSet = intset_create();
 
 	/* Set up info to pass down to gistvacuumpage */
-	vstate.info = info;
+	stats->info = info;
 	vstate.stats = stats;
 	vstate.callback = callback;
 	vstate.callback_state = callback_state;
@@ -171,6 +222,7 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		/* Quit if we've scanned the whole relation */
 		if (blkno >= num_pages)
 			break;
+
 		/* Iterate over pages, then loop back to recheck length */
 		for (; blkno < num_pages; blkno++)
 			gistvacuumpage(&vstate, blkno, blkno);
@@ -192,8 +244,8 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		IndexFreeSpaceMapVacuum(rel);
 
 	/* update statistics */
-	stats->num_pages = num_pages;
-	stats->pages_free = vstate.totFreePages;
+	stats->stats.num_pages = num_pages;
+	stats->stats.pages_free = vstate.totFreePages;
 }
 
 /*
@@ -210,8 +262,8 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 static void
 gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 {
-	IndexVacuumInfo *info = vstate->info;
-	IndexBulkDeleteResult *stats = vstate->stats;
+	GistBulkDeleteResult *stats = vstate->stats;
+	IndexVacuumInfo *info = stats->info;
 	IndexBulkDeleteCallback callback = vstate->callback;
 	void	   *callback_state = vstate->callback_state;
 	Relation	rel = info->index;
@@ -240,12 +292,13 @@ restart:
 		/* Okay to recycle this page */
 		RecordFreeIndexPage(rel, blkno);
 		vstate->totFreePages++;
-		stats->pages_deleted++;
+		stats->stats.pages_deleted++;
 	}
 	else if (GistPageIsLeaf(page))
 	{
 		OffsetNumber todelete[MaxOffsetNumber];
 		int			ntodelete = 0;
+		int			nremain;
 		GISTPageOpaque opaque = GistPageGetOpaque(page);
 		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
 
@@ -314,12 +367,28 @@ restart:
 
 			END_CRIT_SECTION();
 
-			stats->tuples_removed += ntodelete;
+			stats->stats.tuples_removed += ntodelete;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
 
-		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
+		nremain = maxoff - FirstOffsetNumber + 1;
+		if (nremain == 0)
+		{
+			/*
+			 * The page is now completely empty.  Remember its block number,
+			 * we will try to delete the page in the second stage, in
+			 * gistvacuum_recycle_pages().
+			 *
+			 * Skip this when recursing, because IntegerSet requires that the
+			 * values are added in ascending order.  The next VACUUM will pick
+			 * it up...
+			 */
+			if (blkno == orig_blkno)
+				intset_add_member(stats->emptyLeafPagesSet, blkno);
+		}
+		else
+			stats->stats.num_index_tuples += nremain;
 	}
 	else
 	{
@@ -347,6 +416,14 @@ restart:
 						 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
 						 errhint("Please REINDEX it.")));
 		}
+
+		/*
+		 * Remember the block number of this page, so that we can revisit
+		 * it later in gistvacuum_recycle_pages(), when we search for parents
+		 * of empty children.
+		 */
+		if (blkno == orig_blkno)
+			intset_add_member(stats->internalPagesSet, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
@@ -364,3 +441,218 @@ restart:
 		goto restart;
 	}
 }
+
+/*
+ * Scan all internal pages, and try to delete their empty child pages.
+ */
+static void
+gistvacuum_recycle_pages(GistBulkDeleteResult *stats)
+{
+	IndexVacuumInfo *info = stats->info;
+	Relation	rel = info->index;
+	BlockNumber	empty_pages_remaining;
+
+	empty_pages_remaining = intset_num_entries(stats->emptyLeafPagesSet);
+
+	/*
+	 * Rescan all inner pages to find those that have empty child pages.
+	 */
+	intset_begin_iterate(stats->internalPagesSet);
+	while (empty_pages_remaining > 0)
+	{
+		BlockNumber	blkno;
+		bool		found;
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber off,
+					maxoff;
+		OffsetNumber todelete[MaxOffsetNumber];
+		BlockNumber	leafs_to_delete[MaxOffsetNumber];
+		int			ntodelete;
+		int			deleted;
+
+		blkno = intset_iterate_next(stats->internalPagesSet, &found);
+		if (!found)
+			break;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+									info->strategy);
+
+		LockBuffer(buffer, GIST_SHARE);
+		page = (Page) BufferGetPage(buffer);
+
+		if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+		{
+			/*
+			 * This page was an internal page earlier, but now it's something
+			 * else. Shouldn't happen...
+			 */
+			Assert(false);
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		/*
+		 * Scan all the downlinks, and see if any of them point to empty leaf
+		 * pages.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		ntodelete = 0;
+		for (off = FirstOffsetNumber;
+			 off <= maxoff && ntodelete < maxoff - 1;
+			 off = OffsetNumberNext(off))
+		{
+			ItemId		iid = PageGetItemId(page, off);
+			IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
+			BlockNumber leafblk;
+
+			leafblk = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+			if (intset_is_member(stats->emptyLeafPagesSet, leafblk))
+			{
+				leafs_to_delete[ntodelete] = leafblk;
+				todelete[ntodelete++] = off;
+			}
+		}
+
+		/*
+		 * In order to avoid deadlock, the child page must be locked before
+		 * the parent, so we must release the lock on the parent, lock the
+		 * child, and then re-acquire the lock on the parent. (And we wouldn't
+		 * want to do I/O while holding a lock, anyway.)
+		 *
+		 * At the instant that we're not holding a lock on the parent, the
+		 * downlink might get moved by a concurrent insert, so we must re-check that
+		 * it still points to the same child page after we have acquired both
+		 * locks. Also, another backend might have inserted a tuple to the
+		 * page, so that it is no longer empty. gistdeletepage() re-checks all
+		 * these conditions.
+		 */
+		LockBuffer(buffer, GIST_UNLOCK);
+
+		deleted = 0;
+		for (int i = 0; i < ntodelete; i++)
+		{
+			Buffer		leafbuf;
+
+			/*
+			 * Don't remove the last downlink from the parent. That would
+			 * confuse the insertion code.
+			 */
+			if (PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
+				break;
+
+			leafbuf = ReadBufferExtended(rel, MAIN_FORKNUM, leafs_to_delete[i],
+										 RBM_NORMAL, info->strategy);
+			LockBuffer(leafbuf, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafbuf);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			if (gistdeletepage(stats, buffer, todelete[i] - deleted, leafbuf))
+				deleted++;
+			LockBuffer(buffer, GIST_UNLOCK);
+
+			UnlockReleaseBuffer(leafbuf);
+		}
+		empty_pages_remaining -= deleted;
+
+		ReleaseBuffer(buffer);
+	}
+}
+
+
+/*
+ * gistdeletepage takes a parent page and a leaf page, and tries to delete
+ * the leaf.  Both pages must be locked.  Returns true if the deletion
+ * actually happened.  Will not remove the last downlink from the parent.
+ */
+static bool
+gistdeletepage(GistBulkDeleteResult *stats,
+			   Buffer parentBuffer, OffsetNumber downlink,
+			   Buffer leafBuffer)
+{
+	Page		parentPage = BufferGetPage(parentBuffer);
+	Page		leafPage = BufferGetPage(leafBuffer);
+	ItemId		iid;
+	IndexTuple	idxtuple;
+	XLogRecPtr	recptr;
+	TransactionId txid;
+
+	/* Check that the leaf is still empty */
+	if (!GistPageIsLeaf(leafPage))
+	{
+		Assert(false);
+		return false;
+	}
+	if (PageGetMaxOffsetNumber(leafPage) != InvalidOffsetNumber)
+		return false;		/* no longer empty */
+
+	if (GistFollowRight(leafPage)
+		|| GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
+	{
+		/* Don't mess with a concurrent page split. */
+		return false;
+	}
+
+	/*
+	 * Check that the parent page still looks valid.
+	 */
+	if (PageIsNew(parentPage) ||
+		GistPageIsDeleted(parentPage) ||
+		GistPageIsLeaf(parentPage))
+	{
+		Assert(false);
+		return false;
+	}
+
+	/*
+	 * Check that old downlink is still pointing to leafBuffer.
+	 *
+	 * It might have been moved by a concurrent insert. We could try to relocate
+	 * it, by scanning the page again, or perhaps even by moving right if
+	 * the page was split, but let's keep it simple and just give up.
+	 * The next VACUUM will pick this up again.
+	 */
+	if (PageGetMaxOffsetNumber(parentPage) < downlink
+		|| PageGetMaxOffsetNumber(parentPage) <= FirstOffsetNumber)
+		return false;
+
+	iid = PageGetItemId(parentPage, downlink);
+	idxtuple = (IndexTuple) PageGetItem(parentPage, iid);
+	if (BufferGetBlockNumber(leafBuffer) !=
+		ItemPointerGetBlockNumber(&(idxtuple->t_tid)))
+		return false;
+
+	/*
+	 * All good.  Proceed with the deletion.
+	 *
+	 * Like in _bt_unlink_halfdead_page, we need an upper bound on the xid
+	 * of any transaction that could still see a downlink to this page.  We
+	 * use ReadNewTransactionId() instead of GetCurrentTransactionId(),
+	 * since we are in a VACUUM.
+	 */
+	txid = ReadNewTransactionId();
+
+	/* Mark the page as deleted, and remove its downlink from the parent */
+	START_CRIT_SECTION();
+
+	/* Remember xid of last transaction that could see this page */
+	GistPageSetDeleteXid(leafPage, txid);
+	GistPageSetDeleted(leafPage);
+	MarkBufferDirty(leafBuffer);
+	stats->stats.pages_deleted++;
+
+	MarkBufferDirty(parentBuffer);
+	/* Deleting a tuple shifts the offsets of later tuples on the page */
+	PageIndexTupleDelete(parentPage, downlink);
+
+	if (RelationNeedsWAL(stats->info->index))
+		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
+	else
+		recptr = gistGetFakeLSN(stats->info->index);
+	PageSetLSN(parentPage, recptr);
+	PageSetLSN(leafPage, recptr);
+
+	END_CRIT_SECTION();
+
+	return true;
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 408bd5390af..4dbca41bab1 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -508,6 +508,43 @@ gistRedoCreateIndex(XLogReaderState *record)
 	UnlockReleaseBuffer(buffer);
 }
 
+/* redo page deletion */
+static void
+gistRedoPageDelete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		buffer;
+	Page		page;
+
+	/* FIXME: Are we locking the pages in correct order, for hot standby? */
+
+	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+
+	if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(buffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buffer);
+	}
+	if (BufferIsValid(buffer))
+		UnlockReleaseBuffer(buffer);
+}
+
 void
 gist_redo(XLogReaderState *record)
 {
@@ -535,6 +572,9 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageDelete(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -653,6 +693,31 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page deletion. This also includes removal
+ * of the downlink from the parent page.
+ */
+XLogRecPtr
+gistXLogPageDelete(Buffer buffer, TransactionId xid,
+				   Buffer parentBuffer, OffsetNumber downlinkOffset)
+{
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = downlinkOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(gistxlogPageDelete));
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, parentBuffer, REGBUF_STANDARD);
+
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+
+	return recptr;
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e468c9e15aa..0861f829922 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -76,6 +76,9 @@ gist_identify(uint8 info)
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 3234f241560..ce8bfd83ea4 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,10 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+/* For deleted pages, we store the last xid that could have seen the page */
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 463d2bfc7b9..c77d0b4dd81 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -414,12 +414,17 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
+				   TransactionId xid, Buffer parentBuffer,
+				   OffsetNumber downlinkOffset);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
 			   Buffer leftchild);
 
-XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
+extern XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
 			   int ntodelete, RelFileNode hnode);
 
 extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 5117aabf1af..939a1ea7559 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -23,7 +23,7 @@
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -76,6 +76,16 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+/*
+ * Backup Blk 0: page that was deleted.
+ * Backup Blk 1: parent page, containing the downlink to the deleted page.
+ */
+typedef struct gistxlogPageDelete
+{
+	TransactionId deleteXid;	/* last xid that could have seen the page */
+	OffsetNumber downlinkOffset; /* Offset of downlink referencing this page */
+} gistxlogPageDelete;
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf2..0a43449f003 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,10 +27,8 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
-delete from gist_point_tbl where id < 10000;
+-- And also delete some concentration of values.
+delete from gist_point_tbl where id > 5000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
 alter index gist_pointidx SET (fillfactor = 40);
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13c..657b1954847 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,10 +28,8 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
-delete from gist_point_tbl where id < 10000;
+-- And also delete some concentration of values.
+delete from gist_point_tbl where id > 5000;
 
 vacuum analyze gist_point_tbl;
 
-- 
2.20.1

#55Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#54)
Re: GiST VACUUM

On 21 Mar 2019, at 2:30, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
one remaining issue that needs to be fixed:

During Hot Standby, the B-tree code writes a WAL record when a deleted
page is recycled, to prevent the deletion from being replayed too early
in the hot standby. See _bt_getbuf() and btree_xlog_reuse_page(). I
think we need to do something similar in GiST.

I'll try fixing that tomorrow, unless you beat me to it. Making the
changes is pretty straightforward, but it's a bit cumbersome to test.

I've tried to deal with it and got stuck... I think we should make the B-tree WAL record for this shared: remove the BlockNumber and other unused stuff, leaving just the xid and the database oid.
And reuse this record for B-tree, GiST and GIN (yeah, GIN is not checking for that conflict).
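
For illustration, the shared record could be as small as this (just a
sketch; the struct and field names here are my own, not from any patch):

	/*
	 * Hypothetical WAL record for logging the reuse of a deleted index
	 * page, shared by B-tree, GiST and GIN: just the database oid and
	 * the xid horizon needed to resolve recovery conflicts on a hot
	 * standby.
	 */
	typedef struct xl_index_page_reuse
	{
		Oid			dboid;			/* database containing the index */
		TransactionId latestRemovedXid;	/* conflict horizon */
	} xl_index_page_reuse;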

Though, I'm not sure it is important for GIN. The scariest thing that can happen is that it returns the same tid twice. But it is doing a bitmap scan; you cannot return the same bit twice...

Eventually, hash, spgist and others will need the same thing too.

Best regards, Andrey Borodin.

#56Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#55)
1 attachment(s)
Re: GiST VACUUM

On 21/03/2019 18:06, Andrey Borodin wrote:

On 21 Mar 2019, at 2:30, Heikki Linnakangas <hlinnaka@iki.fi>
wrote: one remaining issue that needs to be fixed:

During Hot Standby, the B-tree code writes a WAL record when a
deleted page is recycled, to prevent the deletion from being
replayed too early in the hot standby. See _bt_getbuf() and
btree_xlog_reuse_page(). I think we need to do something similar in
GiST.

I'll try fixing that tomorrow, unless you beat me to it. Making
the changes is pretty straightforward, but it's a bit cumbersome to
test.

I've tried to deal with it and got stuck...

So, I came up with the attached. I basically copy-pasted the page-reuse
WAL-logging stuff from nbtree.

When I started testing this, I quickly noticed that empty pages were not
being deleted nearly as much as I expected. I tracked it to this check
in gistdeletepage:

+       if (GistFollowRight(leafPage)
+               || GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
+       {
+               /* Don't mess with a concurrent page split. */
+               return false;
+       }

That NSN test was bogus. It prevented the leaf page from being reused
if the parent page was *ever* split after the leaf page was created. I
don't see any reason to check the NSN here. The NSN is normally used to
detect if a (leaf) page has been concurrently split, when you descend
the tree. We don't need to care about that here; as long as the
FOLLOW_RIGHT flag is not set, the page has a downlink, and if we can
find the downlink and the page is empty, we can delete it.

After removing that bogus NSN check, page reuse became much more
effective. I've been testing this by running this test script repeatedly:

----------
/*
create sequence gist_test_seq;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
*/

insert into gist_point_tbl (id, p)
select nextval('gist_test_seq'), point(nextval('gist_test_seq'),
1000 + g) from generate_series(1, 10000) g;

delete from gist_point_tbl where id < currval('gist_test_seq') - 20000;
vacuum gist_point_tbl;

select pg_table_size('gist_point_tbl'), pg_indexes_size('gist_point_tbl');
----------

It inserts a bunch of rows, deletes a bunch of older rows, and vacuums.
The interesting thing here is that the key values keep "moving", so that
new tuples are added to different places than where old ones are
removed. That's the case where page reuse is needed.

Before this patch, the index bloated very quickly. With the patch, it
still bloats, because we still don't delete internal pages. But it's a
small fraction of the bloat you got before.

Attached is the latest patch version, to be applied on top of the
IntegerSet patch.

I think we should make the B-tree WAL record for this shared: remove
the BlockNumber and other unused stuff, leaving just the xid and the
database oid. And reuse this record for B-tree, GiST and GIN (yeah,
GIN is not checking for that conflict).

Good point. For now, I didn't try to generalize this, but perhaps we
should.

Though, I'm not sure it is important for GIN. The scariest thing that
can happen is that it returns the same tid twice. But it is doing a
bitmap scan; you cannot return the same bit twice...

Hmm. Could it return a completely unrelated tuple? We don't always
recheck the original index quals in a bitmap index scan, IIRC. Also, a
search might get confused if it's descending down a posting tree and
lands on a different kind of page altogether.

Alexander, you added the mechanism to GIN recently, to prevent pages
from being reused too early (commit 52ac6cd2d0). Do we need something
like B-tree's REUSE_PAGE records in GIN, too, to prevent the same bug
from happening in hot standby?

PS. For GiST, we could almost use the LSN / NSN mechanism to detect the
case that a deleted page is reused: Add a new field to the GiST page
header, to store a new "deleteNSN" field. When a page is deleted, the
deleted page's deleteNSN is set to the LSN of the deletion record. When
the page is reused, the deleteNSN field is kept unchanged. When you
follow a downlink during search, if you see that the page's deleteNSN >
parent's LSN, you know that it was concurrently deleted and recycled,
and should be ignored. That would allow reusing deleted pages
immediately. Unfortunately that would require adding a new field to the
gist page header/footer, which requires upgrade work :-(. Maybe one day,
we'll bite the bullet. Something to keep in mind, if we have to change
the page format anyway, for some reason.
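
To illustrate, a search following a downlink could then detect a
concurrently deleted-and-recycled child with a check along these lines
(just a sketch; GistPageGetDeleteNSN() is a hypothetical accessor for the
new field, and 'parentlsn' is the parent page's LSN, remembered while the
parent was still locked):

	/* after locking and reading the child page the downlink pointed to */
	if (GistPageGetDeleteNSN(childpage) > parentlsn)
	{
		/*
		 * The child was deleted, and possibly recycled for unrelated
		 * data, after we read its downlink in the parent. Skip it.
		 */
		UnlockReleaseBuffer(childbuf);
		continue;			/* move on to the next downlink */
	}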

- Heikki

Attachments:

0002-Delete-empty-pages-during-GiST-VACUUM.patchtext/x-patch; name=0002-Delete-empty-pages-during-GiST-VACUUM.patchDownload
From d7a77ad483251b62514778d2235516f6f9237ad7 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 20 Mar 2019 20:24:44 +0200
Subject: [PATCH 2/2] Delete empty pages during GiST VACUUM

This commit teaches GiST to actually delete pages during VACUUM.

To do this, we scan the GiST index twice. In the first pass we make note
of empty pages and internal pages. In the second pass we scan through the
internal pages, looking for references to empty leaf pages.

Heikki's CHANGES since v22:

* Only scan the empty pages after the last scan, in a multi-pass vacuum.
  (I think that's what we want...) We could actually be smarter, and
  do this as part of the second pass's scan, in a multi-pass vacuum.

* Call ReadNewTransactionId() while holding lock. I think that's needed
  for correctness.

* Use new IntegerSet implementation.

* Removed the check between the parent and child's NSNs. That seemed
  bogus, and with that, we would only very rarely actually be able to
  delete a page.

* Write "page reuse" records, for hot standby

Author: Andrey Borodin
Discussion: https://www.postgresql.org/message-id/B1E4DF12-6CD3-4706-BDBD-BF3283328F60@yandex-team.ru
---
 src/backend/access/gist/README         |  48 ++++
 src/backend/access/gist/gist.c         |  15 +
 src/backend/access/gist/gistutil.c     |  34 ++-
 src/backend/access/gist/gistvacuum.c   | 375 ++++++++++++++++++++++---
 src/backend/access/gist/gistxlog.c     | 115 ++++++++
 src/backend/access/rmgrdesc/gistdesc.c |  28 ++
 src/include/access/gist.h              |   4 +
 src/include/access/gist_private.h      |  11 +-
 src/include/access/gistxlog.h          |  30 +-
 src/test/regress/expected/gist.out     |   6 +-
 src/test/regress/sql/gist.sql          |   6 +-
 11 files changed, 623 insertions(+), 49 deletions(-)

diff --git a/src/backend/access/gist/README b/src/backend/access/gist/README
index 02228662b81..501b1c1a77a 100644
--- a/src/backend/access/gist/README
+++ b/src/backend/access/gist/README
@@ -413,6 +413,54 @@ emptied yet; tuples never move upwards in the tree. The final emptying loops
 through buffers at a given level until all buffers at that level have been
 emptied, and then moves down to the next level.
 
+Bulk delete algorithm (VACUUM)
+------------------------------
+
+VACUUM works in two stages:
+
+In the first stage, we scan the whole index in physical order. To make sure
+that we don't miss any dead tuples because a concurrent page split moved them,
+we check the F_FOLLOW_RIGHT flags and NSN on each page, to detect if the
+page has been concurrently split. If a concurrent page split is detected, and
+one half of the page was moved to a position that we already scanned, we "jump"
+to scan the page again. This is the same mechanism that B-tree VACUUM uses,
+but because we already have NSNs on pages, to detect page splits during
+searches, we don't need a "vacuum cycle ID" concept for that like B-tree does.
+
+While we scan all the pages, we also make note of any completely empty leaf
+pages. We will try to unlink them from the tree in the second stage. We also
+record the block numbers of all internal pages, in an IntegerSet. They are
+needed in the second stage, to locate parents of empty pages.
+
+In the second stage, we try to unlink any empty leaf pages from the tree, so
+that their space can be reused. If we didn't see any empty pages in the first
+stage, the second stage is skipped. In order to delete an empty page, its
+downlink must be removed from the parent. We scan all the internal pages
+whose block numbers we memorized in the first stage, and look for downlinks
+to pages that we have
+memorized as being empty. Whenever we find one, we acquire a lock on the
+parent and child page, re-check that the child page is still empty. Then, we
+remove the downlink and mark the child as deleted, and release the locks.
+
+The insertion algorithm would get confused, if an internal page was completely
+empty. So we never delete the last child of an internal page, even if it's
+empty. Currently, we only support deleting leaf pages.
+
+This page deletion algorithm works on a best-effort basis. It might fail to
+find a downlink, if a concurrent page split moved it after the first stage.
+In that case, we won't be able to remove all empty pages. That's OK, it's
+not expected to happen very often, and hopefully the next VACUUM will clean
+it up, instead.
+
+When we have deleted a page, it's possible that an in-progress search will
+still descend on the page, if it saw the downlink before we removed it. The
+search will see that it is deleted, and ignore it, but as long as that can
+happen, we cannot reuse the page. To "wait out" any in-progress searches, when
+the page is deleted, it's labeled with the current next-transaction counter
+value. The page is not recycled, until that XID is no longer visible to
+anyone. That's much more conservative than necessary, but let's keep it
+simple.
+
 
 Authors:
 	Teodor Sigaev	<teodor@sigaev.ru>
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2ce5425ef98..a746e911f37 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -704,6 +704,9 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
+			/* currently, internal pages are never deleted */
+			Assert(!GistPageIsDeleted(stack->page));
+
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -838,6 +841,18 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 				}
 			}
 
+			/*
+			 * The page might have been deleted after we scanned the parent
+			 * and saw the downlink.
+			 */
+			if (GistPageIsDeleted(stack->page))
+			{
+				UnlockReleaseBuffer(stack->buffer);
+				xlocked = false;
+				state.stack = stack = stack->parent;
+				continue;
+			}
+
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index f32e16eed58..2163cc482d8 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -23,6 +23,7 @@
 #include "storage/lmgr.h"
 #include "utils/float.h"
 #include "utils/syscache.h"
+#include "utils/snapmgr.h"
 #include "utils/lsyscache.h"
 
 
@@ -829,13 +830,31 @@ gistNewBuffer(Relation r)
 		{
 			Page		page = BufferGetPage(buffer);
 
+			/*
+			 * If the page was never initialized, it's OK to use.
+			 */
 			if (PageIsNew(page))
-				return buffer;	/* OK to use, if never initialized */
+				return buffer;
 
 			gistcheckpage(r, buffer);
 
-			if (GistPageIsDeleted(page))
-				return buffer;	/* OK to use */
+			/*
+			 * Otherwise, recycle it if deleted, and too old to have any processes
+			 * interested in it.
+			 */
+			if (gistPageRecyclable(page))
+			{
+				/*
+				 * If we are generating WAL for Hot Standby then create a
+				 * WAL record that will allow us to conflict with queries
+				 * running on standby, in case they have snapshots older
+				 * than the page's deleteXid.
+				 */
+				if (XLogStandbyInfoActive() && RelationNeedsWAL(r))
+					gistXLogPageReuse(r, blkno, GistPageGetDeleteXid(page));
+
+				return buffer;
+			}
 
 			LockBuffer(buffer, GIST_UNLOCK);
 		}
@@ -859,6 +878,15 @@ gistNewBuffer(Relation r)
 	return buffer;
 }
 
+/* Can this page be recycled yet? */
+bool
+gistPageRecyclable(Page page)
+{
+	return PageIsNew(page) ||
+		(GistPageIsDeleted(page) &&
+		 TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
+}
+
 bytea *
 gistoptions(Datum reloptions, bool validate)
 {
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 3c1d75691e8..d35060d41cd 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -16,26 +16,48 @@
 
 #include "access/genam.h"
 #include "access/gist_private.h"
+#include "access/transam.h"
 #include "commands/vacuum.h"
+#include "lib/integerset.h"
 #include "miscadmin.h"
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 
-/* Working state needed by gistbulkdelete */
+/*
+ * State kept across vacuum stages.
+ */
 typedef struct
 {
+	IndexBulkDeleteResult stats;	/* must be first */
+
 	IndexVacuumInfo *info;
-	IndexBulkDeleteResult *stats;
+
+	/*
+	 * These are used to memorize all internal and empty leaf pages in the 1st
+	 * vacuum phase.  They are used in the 2nd phase, to delete all the empty
+	 * pages.
+	 */
+	IntegerSet *internalPagesSet;
+	IntegerSet *emptyLeafPagesSet;
+} GistBulkDeleteResult;
+
+/* Working state needed by gistbulkdelete */
+typedef struct
+{
+	GistBulkDeleteResult *stats;
 	IndexBulkDeleteCallback callback;
 	void	   *callback_state;
 	GistNSN		startNSN;
-	BlockNumber totFreePages;	/* true total # of free pages */
 } GistVacState;
 
-static void gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+static void gistvacuumscan(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state);
 static void gistvacuumpage(GistVacState *vstate, BlockNumber blkno,
 			   BlockNumber orig_blkno);
+static void gistvacuum_delete_empty_pages(GistBulkDeleteResult *stats);
+static bool gistdeletepage(GistBulkDeleteResult *stats,
+			   Buffer buffer, OffsetNumber downlink,
+			   Buffer leafBuffer);
 
 /*
  * VACUUM bulkdelete stage: remove index entries.
@@ -44,21 +66,25 @@ IndexBulkDeleteResult *
 gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
+	GistBulkDeleteResult *gist_stats = (GistBulkDeleteResult *) stats;
+
 	/* allocate stats if first time through, else re-use existing struct */
-	if (stats == NULL)
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
+	if (gist_stats == NULL)
+		gist_stats = (GistBulkDeleteResult *) palloc0(sizeof(GistBulkDeleteResult));
 
-	gistvacuumscan(info, stats, callback, callback_state);
+	gistvacuumscan(info, gist_stats, callback, callback_state);
 
-	return stats;
+	return (IndexBulkDeleteResult *) gist_stats;
 }
 
 /*
- * VACUUM cleanup stage: update index statistics.
+ * VACUUM cleanup stage: delete empty pages, and update index statistics.
  */
 IndexBulkDeleteResult *
 gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 {
+	GistBulkDeleteResult *gist_stats = (GistBulkDeleteResult *) stats;
+
 	/* No-op in ANALYZE ONLY mode */
 	if (info->analyze_only)
 		return stats;
@@ -68,10 +94,26 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 * stats from the latest gistbulkdelete call.  If it wasn't called, we
 	 * still need to do a pass over the index, to obtain index statistics.
 	 */
-	if (stats == NULL)
+	if (gist_stats == NULL)
 	{
-		stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
-		gistvacuumscan(info, stats, NULL, NULL);
+		gist_stats = (GistBulkDeleteResult *) palloc0(sizeof(GistBulkDeleteResult));
+		gistvacuumscan(info, gist_stats, NULL, NULL);
+	}
+
+	/*
+	 * If we saw any empty pages that could be recycled, try to unlink them
+	 * from the tree so that they can be reused.
+	 */
+	if (gist_stats->emptyLeafPagesSet)
+	{
+		gistvacuum_delete_empty_pages(gist_stats);
+		intset_free(gist_stats->emptyLeafPagesSet);
+		gist_stats->emptyLeafPagesSet = NULL;
+	}
+	if (gist_stats->internalPagesSet)
+	{
+		intset_free(gist_stats->internalPagesSet);
+		gist_stats->internalPagesSet = NULL;
 	}
 
 	/*
@@ -82,11 +124,11 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 */
 	if (!info->estimated_count)
 	{
-		if (stats->num_index_tuples > info->num_heap_tuples)
-			stats->num_index_tuples = info->num_heap_tuples;
+		if (gist_stats->stats.num_index_tuples > info->num_heap_tuples)
+			gist_stats->stats.num_index_tuples = info->num_heap_tuples;
 	}
 
-	return stats;
+	return (IndexBulkDeleteResult *) gist_stats;
 }
 
 /*
@@ -97,15 +139,16 @@ gistvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
  * btvacuumcleanup invoke this (the latter only if no btbulkdelete call
  * occurred).
  *
- * This also adds unused/delete pages to the free space map, although that
- * is currently not very useful.  There is currently no support for deleting
- * empty pages, so recycleable pages can only be found if an error occurs
- * while the index is being expanded, leaving an all-zeros page behind.
+ * This also makes note of any empty leaf pages, as well as all internal
+ * pages. gistvacuum_delete_empty_pages() needs that information.  Any deleted
+ * pages are added directly to the free space map.  (They should've been
+ * added there already when they were originally deleted, but it's possible
+ * that the FSM was lost at a crash, for example.)
  *
  * The caller is responsible for initially allocating/zeroing a stats struct.
  */
 static void
-gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
+gistvacuumscan(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 			   IndexBulkDeleteCallback callback, void *callback_state)
 {
 	Relation	rel = info->index;
@@ -118,12 +161,20 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * Reset counts that will be incremented during the scan; needed in case
 	 * of multiple scans during a single VACUUM command.
 	 */
-	stats->estimated_count = false;
-	stats->num_index_tuples = 0;
-	stats->pages_deleted = 0;
+	stats->stats.estimated_count = false;
+	stats->stats.num_index_tuples = 0;
+	stats->stats.pages_deleted = 0;
+	stats->stats.pages_free = 0;
+
+	if (stats->internalPagesSet != NULL)
+		intset_free(stats->internalPagesSet);
+	stats->internalPagesSet = intset_create();
+	if (stats->emptyLeafPagesSet != NULL)
+		intset_free(stats->emptyLeafPagesSet);
+	stats->emptyLeafPagesSet = intset_create();
 
 	/* Set up info to pass down to gistvacuumpage */
-	vstate.info = info;
+	stats->info = info;
 	vstate.stats = stats;
 	vstate.callback = callback;
 	vstate.callback_state = callback_state;
@@ -131,7 +182,6 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 		vstate.startNSN = GetInsertRecPtr();
 	else
 		vstate.startNSN = gistGetFakeLSN(rel);
-	vstate.totFreePages = 0;
 
 	/*
 	 * The outer loop iterates over all index pages, in physical order (we
@@ -188,12 +238,11 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * Note that if no recyclable pages exist, we don't bother vacuuming the
 	 * FSM at all.
 	 */
-	if (vstate.totFreePages > 0)
+	if (stats->stats.pages_free > 0)
 		IndexFreeSpaceMapVacuum(rel);
 
 	/* update statistics */
-	stats->num_pages = num_pages;
-	stats->pages_free = vstate.totFreePages;
+	stats->stats.num_pages = num_pages;
 }
 
 /*
@@ -210,8 +259,8 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 static void
 gistvacuumpage(GistVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 {
-	IndexVacuumInfo *info = vstate->info;
-	IndexBulkDeleteResult *stats = vstate->stats;
+	GistBulkDeleteResult *stats = vstate->stats;
+	IndexVacuumInfo *info = stats->info;
 	IndexBulkDeleteCallback callback = vstate->callback;
 	void	   *callback_state = vstate->callback_state;
 	Relation	rel = info->index;
@@ -235,17 +284,23 @@ restart:
 	LockBuffer(buffer, GIST_EXCLUSIVE);
 	page = (Page) BufferGetPage(buffer);
 
-	if (PageIsNew(page) || GistPageIsDeleted(page))
+	if (gistPageRecyclable(page))
 	{
 		/* Okay to recycle this page */
 		RecordFreeIndexPage(rel, blkno);
-		vstate->totFreePages++;
-		stats->pages_deleted++;
+		stats->stats.pages_free++;
+		stats->stats.pages_deleted++;
+	}
+	else if (GistPageIsDeleted(page))
+	{
+		/* Already deleted, but can't recycle yet */
+		stats->stats.pages_deleted++;
 	}
 	else if (GistPageIsLeaf(page))
 	{
 		OffsetNumber todelete[MaxOffsetNumber];
 		int			ntodelete = 0;
+		int			nremain;
 		GISTPageOpaque opaque = GistPageGetOpaque(page);
 		OffsetNumber maxoff = PageGetMaxOffsetNumber(page);
 
@@ -314,12 +369,28 @@ restart:
 
 			END_CRIT_SECTION();
 
-			stats->tuples_removed += ntodelete;
+			stats->stats.tuples_removed += ntodelete;
 			/* must recompute maxoff */
 			maxoff = PageGetMaxOffsetNumber(page);
 		}
 
-		stats->num_index_tuples += maxoff - FirstOffsetNumber + 1;
+		nremain = maxoff - FirstOffsetNumber + 1;
+		if (nremain == 0)
+		{
+			/*
+			 * The page is now completely empty.  Remember its block number,
+			 * so that we will try to delete the page in the second stage, in
+			 * gistvacuum_delete_empty_pages().
+			 *
+			 * Skip this when recursing, because IntegerSet requires that the
+			 * values are added in ascending order.  The next VACUUM will pick
+			 * it up.
+			 */
+			if (blkno == orig_blkno)
+				intset_add_member(stats->emptyLeafPagesSet, blkno);
+		}
+		else
+			stats->stats.num_index_tuples += nremain;
 	}
 	else
 	{
@@ -347,6 +418,14 @@ restart:
 						 errdetail("This is caused by an incomplete page split at crash recovery before upgrading to PostgreSQL 9.1."),
 						 errhint("Please REINDEX it.")));
 		}
+
+		/*
+		 * Remember the block number of this page, so that we can revisit it
+		 * later in gistvacuum_delete_empty_pages(), when we search for
+		 * parents of empty children.
+		 */
+		if (blkno == orig_blkno)
+			intset_add_member(stats->internalPagesSet, blkno);
 	}
 
 	UnlockReleaseBuffer(buffer);
@@ -364,3 +443,229 @@ restart:
 		goto restart;
 	}
 }
+
+/*
+ * Scan all internal pages, and try to delete their empty child pages.
+ */
+static void
+gistvacuum_delete_empty_pages(GistBulkDeleteResult *stats)
+{
+	IndexVacuumInfo *info = stats->info;
+	Relation	rel = info->index;
+	BlockNumber empty_pages_remaining;
+
+	/*
+	 * Rescan all inner pages to find those that have empty child pages.
+	 */
+	empty_pages_remaining = intset_num_entries(stats->emptyLeafPagesSet);
+	intset_begin_iterate(stats->internalPagesSet);
+	while (empty_pages_remaining)
+	{
+		BlockNumber blkno;
+		bool		found;
+		Buffer		buffer;
+		Page		page;
+		OffsetNumber off,
+					maxoff;
+		OffsetNumber todelete[MaxOffsetNumber];
+		BlockNumber leafs_to_delete[MaxOffsetNumber];
+		int			ntodelete;
+		int			deleted;
+
+		blkno = intset_iterate_next(stats->internalPagesSet, &found);
+		if (!found)
+			break;
+
+		buffer = ReadBufferExtended(rel, MAIN_FORKNUM, blkno, RBM_NORMAL,
+									info->strategy);
+
+		LockBuffer(buffer, GIST_SHARE);
+		page = (Page) BufferGetPage(buffer);
+
+		if (PageIsNew(page) || GistPageIsDeleted(page) || GistPageIsLeaf(page))
+		{
+			/*
+			 * This page was an internal page earlier, but now it's something
+			 * else. Shouldn't happen...
+			 */
+			Assert(false);
+			UnlockReleaseBuffer(buffer);
+			continue;
+		}
+
+		/*
+		 * Scan all the downlinks, and see if any of them point to empty leaf
+		 * pages.
+		 */
+		maxoff = PageGetMaxOffsetNumber(page);
+		ntodelete = 0;
+		for (off = FirstOffsetNumber;
+			 off <= maxoff && ntodelete < maxoff - 1;
+			 off = OffsetNumberNext(off))
+		{
+			ItemId		iid = PageGetItemId(page, off);
+			IndexTuple	idxtuple = (IndexTuple) PageGetItem(page, iid);
+			BlockNumber leafblk;
+
+			leafblk = ItemPointerGetBlockNumber(&(idxtuple->t_tid));
+			if (intset_is_member(stats->emptyLeafPagesSet, leafblk))
+			{
+				leafs_to_delete[ntodelete] = leafblk;
+				todelete[ntodelete++] = off;
+			}
+		}
+
+		/*
+		 * In order to avoid deadlock, child page must be locked before
+		 * parent, so we must release the lock on the parent, lock the child,
+		 * and then re-acquire the lock on the parent. (And we wouldn't want to
+		 * do I/O, while holding a lock, anyway.)
+		 *
+		 * At the instant that we're not holding a lock on the parent, the
+		 * downlink might get moved by a concurrent page split, so we must re-check that
+		 * it still points to the same child page after we have acquired both
+		 * locks. Also, another backend might have inserted a tuple to the
+		 * page, so that it is no longer empty. gistdeletepage() re-checks all
+		 * these conditions.
+		 */
+		LockBuffer(buffer, GIST_UNLOCK);
+
+		deleted = 0;
+		for (int i = 0; i < ntodelete; i++)
+		{
+			Buffer		leafbuf;
+
+			/*
+			 * Don't remove the last downlink from the parent. That would
+			 * confuse the insertion code.
+			 */
+			if (PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
+				break;
+
+			leafbuf = ReadBufferExtended(rel, MAIN_FORKNUM, leafs_to_delete[i],
+										 RBM_NORMAL, info->strategy);
+			LockBuffer(leafbuf, GIST_EXCLUSIVE);
+			gistcheckpage(rel, leafbuf);
+
+			LockBuffer(buffer, GIST_EXCLUSIVE);
+			if (gistdeletepage(stats, buffer, todelete[i] - deleted, leafbuf))
+				deleted++;
+			LockBuffer(buffer, GIST_UNLOCK);
+
+			UnlockReleaseBuffer(leafbuf);
+		}
+
+		ReleaseBuffer(buffer);
+
+		/* update stats */
+		stats->stats.pages_removed += deleted;
+
+		/*
+		 * We can stop the scan as soon as we have seen the downlinks, even if
+		 * we were not able to remove them all.
+		 */
+		empty_pages_remaining -= ntodelete;
+	}
+}
+
+
+/*
+ * gistdeletepage takes a leaf page, and its parent, and tries to delete the
+ * leaf.  Both pages must be locked.
+ *
+ * Even if the page was empty when we first saw it, a concurrent inserter might
+ * have added a tuple to it since.  Similarly, the downlink might have moved.
+ * We re-check all the conditions, to make sure the page is still deletable,
+ * before modifying anything.
+ *
+ * Returns true, if the page was deleted, and false if a concurrent update
+ * prevented it.
+ */
+static bool
+gistdeletepage(GistBulkDeleteResult *stats,
+			   Buffer parentBuffer, OffsetNumber downlink,
+			   Buffer leafBuffer)
+{
+	Page		parentPage = BufferGetPage(parentBuffer);
+	Page		leafPage = BufferGetPage(leafBuffer);
+	ItemId		iid;
+	IndexTuple	idxtuple;
+	XLogRecPtr	recptr;
+	TransactionId txid;
+
+	/*
+	 * Check that the leaf is still empty and deletable.
+	 */
+	if (!GistPageIsLeaf(leafPage))
+	{
+		/* a leaf page should never become a non-leaf page */
+		Assert(false);
+		return false;
+	}
+
+	if (GistFollowRight(leafPage))
+		return false;			/* don't mess with a concurrent page split */
+
+	if (PageGetMaxOffsetNumber(leafPage) != InvalidOffsetNumber)
+		return false;			/* not empty anymore */
+
+	/*
+	 * Ok, the leaf is deletable.  Is the downlink in the parent page still
+	 * valid?  It might have been moved by a concurrent insert.  We could try
+	 * to re-find it by scanning the page again, possibly moving right if the
+	 * page was split.  But for now, let's keep it simple and just give up.  The
+	 * next VACUUM will pick it up.
+	 */
+	if (PageIsNew(parentPage) || GistPageIsDeleted(parentPage) ||
+		GistPageIsLeaf(parentPage))
+	{
+		/* shouldn't happen, internal pages are never deleted */
+		Assert(false);
+		return false;
+	}
+
+	if (PageGetMaxOffsetNumber(parentPage) < downlink
+		|| PageGetMaxOffsetNumber(parentPage) <= FirstOffsetNumber)
+		return false;
+
+	iid = PageGetItemId(parentPage, downlink);
+	idxtuple = (IndexTuple) PageGetItem(parentPage, iid);
+	if (BufferGetBlockNumber(leafBuffer) !=
+		ItemPointerGetBlockNumber(&(idxtuple->t_tid)))
+		return false;
+
+	/*
+	 * All good, proceed with the deletion.
+	 *
+	 * The page cannot be immediately recycled, because in-progress scans that
+	 * saw the downlink might still visit it.  Mark the page with the current
+	 * next-XID counter, so that we know when it can be recycled.  Once that
+	 * XID becomes older than GlobalXmin, we know that all scans that are
+	 * currently in progress must have ended.  (That's much more conservative
+	 * than needed, but let's keep it safe and simple.)
+	 */
+	txid = ReadNewTransactionId();
+
+	START_CRIT_SECTION();
+
+	/* mark the page as deleted */
+	MarkBufferDirty(leafBuffer);
+	GistPageSetDeleteXid(leafPage, txid);
+	GistPageSetDeleted(leafPage);
+	stats->stats.pages_deleted++;
+
+	/* remove the downlink from the parent */
+	MarkBufferDirty(parentBuffer);
+	PageIndexTupleDelete(parentPage, downlink);
+
+	if (RelationNeedsWAL(stats->info->index))
+		recptr = gistXLogPageDelete(leafBuffer, txid, parentBuffer, downlink);
+	else
+		recptr = gistGetFakeLSN(stats->info->index);
+	PageSetLSN(parentPage, recptr);
+	PageSetLSN(leafPage, recptr);
+
+	END_CRIT_SECTION();
+
+	return true;
+}
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 408bd5390af..cb80ab00cd7 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -23,6 +23,7 @@
 #include "miscadmin.h"
 #include "storage/procarray.h"
 #include "utils/memutils.h"
+#include "utils/rel.h"
 
 static MemoryContext opCtx;		/* working memory for operations */
 
@@ -508,6 +509,64 @@ gistRedoCreateIndex(XLogReaderState *record)
 	UnlockReleaseBuffer(buffer);
 }
 
+/* redo page deletion */
+static void
+gistRedoPageDelete(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	gistxlogPageDelete *xldata = (gistxlogPageDelete *) XLogRecGetData(record);
+	Buffer		parentBuffer;
+	Buffer		leafBuffer;
+
+	if (XLogReadBufferForRedo(record, 0, &leafBuffer) == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(leafBuffer);
+
+		GistPageSetDeleteXid(page, xldata->deleteXid);
+		GistPageSetDeleted(page);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(leafBuffer);
+	}
+
+	if (XLogReadBufferForRedo(record, 1, &parentBuffer) == BLK_NEEDS_REDO)
+	{
+		Page		page = (Page) BufferGetPage(parentBuffer);
+
+		PageIndexTupleDelete(page, xldata->downlinkOffset);
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(parentBuffer);
+	}
+
+	if (BufferIsValid(parentBuffer))
+		UnlockReleaseBuffer(parentBuffer);
+	if (BufferIsValid(leafBuffer))
+		UnlockReleaseBuffer(leafBuffer);
+}
+
+static void
+gistRedoPageReuse(XLogReaderState *record)
+{
+	gistxlogPageReuse *xlrec = (gistxlogPageReuse *) XLogRecGetData(record);
+
+	/*
+	 * PAGE_REUSE records exist to provide a conflict point when we reuse
+	 * pages in the index via the FSM.  That's all they do though.
+	 *
+	 * latestRemovedXid was the page's deleteXid.  The deleteXid <
+	 * RecentGlobalXmin test in gistPageRecyclable() conceptually mirrors the
+	 * pgxact->xmin > limitXmin test in GetConflictingVirtualXIDs().
+	 * Consequently, one XID value achieves the same exclusion effect on
+	 * master and standby.
+	 */
+	if (InHotStandby)
+	{
+		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
+											xlrec->node);
+	}
+}
+
 void
 gist_redo(XLogReaderState *record)
 {
@@ -529,12 +588,18 @@ gist_redo(XLogReaderState *record)
 		case XLOG_GIST_DELETE:
 			gistRedoDeleteRecord(record);
 			break;
+		case XLOG_GIST_PAGE_REUSE:
+			gistRedoPageReuse(record);
+			break;
 		case XLOG_GIST_PAGE_SPLIT:
 			gistRedoPageSplitRecord(record);
 			break;
 		case XLOG_GIST_CREATE_INDEX:
 			gistRedoCreateIndex(record);
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			gistRedoPageDelete(record);
+			break;
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
@@ -653,6 +718,56 @@ gistXLogSplit(bool page_is_leaf,
 	return recptr;
 }
 
+/*
+ * Write XLOG record describing a page deletion. This also includes removal of
+ * downlink from the parent page.
+ */
+XLogRecPtr
+gistXLogPageDelete(Buffer buffer, TransactionId xid,
+				   Buffer parentBuffer, OffsetNumber downlinkOffset)
+{
+	gistxlogPageDelete xlrec;
+	XLogRecPtr	recptr;
+
+	xlrec.deleteXid = xid;
+	xlrec.downlinkOffset = downlinkOffset;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, SizeOfGistxlogPageDelete);
+
+	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+	XLogRegisterBuffer(1, parentBuffer, REGBUF_STANDARD);
+
+	recptr = XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_DELETE);
+
+	return recptr;
+}
+
+/*
+ * Write XLOG record about reuse of a deleted page.
+ */
+void
+gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid)
+{
+	gistxlogPageReuse xlrec_reuse;
+
+	/*
+	 * Note that we don't register the buffer with the record, because this
+	 * operation doesn't modify the page. This record only exists to provide a
+	 * conflict point for Hot Standby.
+	 */
+
+	/* XLOG stuff */
+	xlrec_reuse.node = rel->rd_node;
+	xlrec_reuse.block = blkno;
+	xlrec_reuse.latestRemovedXid = latestRemovedXid;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec_reuse, SizeOfGistxlogPageReuse);
+
+	XLogInsert(RM_GIST_ID, XLOG_GIST_PAGE_REUSE);
+}
+
 /*
  * Write XLOG record describing a page update. The update can include any
  * number of deletions and/or insertions of tuples on a single index page.
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index e468c9e15aa..3ff4f83d387 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -23,6 +23,15 @@ out_gistxlogPageUpdate(StringInfo buf, gistxlogPageUpdate *xlrec)
 {
 }
 
+static void
+out_gistxlogPageReuse(StringInfo buf, gistxlogPageReuse *xlrec)
+{
+	appendStringInfo(buf, "rel %u/%u/%u; blk %u; latestRemovedXid %u",
+					 xlrec->node.spcNode, xlrec->node.dbNode,
+					 xlrec->node.relNode, xlrec->block,
+					 xlrec->latestRemovedXid);
+}
+
 static void
 out_gistxlogDelete(StringInfo buf, gistxlogPageUpdate *xlrec)
 {
@@ -35,6 +44,13 @@ out_gistxlogPageSplit(StringInfo buf, gistxlogPageSplit *xlrec)
 					 xlrec->npage);
 }
 
+static void
+out_gistxlogPageDelete(StringInfo buf, gistxlogPageDelete *xlrec)
+{
+	appendStringInfo(buf, "deleteXid %u; downlink %u",
+					 xlrec->deleteXid, xlrec->downlinkOffset);
+}
+
 void
 gist_desc(StringInfo buf, XLogReaderState *record)
 {
@@ -46,6 +62,9 @@ gist_desc(StringInfo buf, XLogReaderState *record)
 		case XLOG_GIST_PAGE_UPDATE:
 			out_gistxlogPageUpdate(buf, (gistxlogPageUpdate *) rec);
 			break;
+		case XLOG_GIST_PAGE_REUSE:
+			out_gistxlogPageReuse(buf, (gistxlogPageReuse *) rec);
+			break;
 		case XLOG_GIST_DELETE:
 			out_gistxlogDelete(buf, (gistxlogPageUpdate *) rec);
 			break;
@@ -54,6 +73,9 @@ gist_desc(StringInfo buf, XLogReaderState *record)
 			break;
 		case XLOG_GIST_CREATE_INDEX:
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			out_gistxlogPageDelete(buf, (gistxlogPageDelete *) rec);
+			break;
 	}
 }
 
@@ -70,12 +92,18 @@ gist_identify(uint8 info)
 		case XLOG_GIST_DELETE:
 			id = "DELETE";
 			break;
+		case XLOG_GIST_PAGE_REUSE:
+			id = "PAGE_REUSE";
+			break;
 		case XLOG_GIST_PAGE_SPLIT:
 			id = "PAGE_SPLIT";
 			break;
 		case XLOG_GIST_CREATE_INDEX:
 			id = "CREATE_INDEX";
 			break;
+		case XLOG_GIST_PAGE_DELETE:
+			id = "PAGE_DELETE";
+			break;
 	}
 
 	return id;
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 3234f241560..ce8bfd83ea4 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -151,6 +151,10 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
+/* For deleted pages we store last xid which could see the page in scan */
+#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
+#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
  * take it as one of their arguments
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 463d2bfc7b9..02dc285a78a 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -414,12 +414,20 @@ extern bool gistplacetopage(Relation rel, Size freespace, GISTSTATE *giststate,
 extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 		  int len, GISTSTATE *giststate);
 
+/* gistxlog.c */
+extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
+				   TransactionId xid, Buffer parentBuffer,
+				   OffsetNumber downlinkOffset);
+
+extern void gistXLogPageReuse(Relation rel, BlockNumber blkno,
+				  TransactionId latestRemovedXid);
+
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
 			   IndexTuple *itup, int ntup,
 			   Buffer leftchild);
 
-XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
+extern XLogRecPtr gistXLogDelete(Buffer buffer, OffsetNumber *todelete,
 			   int ntodelete, RelFileNode hnode);
 
 extern XLogRecPtr gistXLogSplit(bool page_is_leaf,
@@ -451,6 +459,7 @@ extern bool gistfitpage(IndexTuple *itvec, int len);
 extern bool gistnospace(Page page, IndexTuple *itvec, int len, OffsetNumber todelete, Size freespace);
 extern void gistcheckpage(Relation rel, Buffer buf);
 extern Buffer gistNewBuffer(Relation r);
+extern bool gistPageRecyclable(Page page);
 extern void gistfillbuffer(Page page, IndexTuple *itup, int len,
 			   OffsetNumber off);
 extern IndexTuple *gistextractpage(Page page, int *len /* out */ );
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 5117aabf1af..2f87b67a53a 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -19,11 +19,12 @@
 
 #define XLOG_GIST_PAGE_UPDATE		0x00
 #define XLOG_GIST_DELETE			0x10 /* delete leaf index tuples for a page */
- /* #define XLOG_GIST_NEW_ROOT			 0x20 */	/* not used anymore */
+#define XLOG_GIST_PAGE_REUSE		0x20 /* old page is about to be reused from
+										  * FSM */
 #define XLOG_GIST_PAGE_SPLIT		0x30
  /* #define XLOG_GIST_INSERT_COMPLETE	 0x40 */	/* not used anymore */
 #define XLOG_GIST_CREATE_INDEX		0x50
- /* #define XLOG_GIST_PAGE_DELETE		 0x60 */	/* not used anymore */
+#define XLOG_GIST_PAGE_DELETE		0x60
 
 /*
  * Backup Blk 0: updated page.
@@ -76,6 +77,31 @@ typedef struct gistxlogPageSplit
 	 */
 } gistxlogPageSplit;
 
+/*
+ * Backup Blk 0: page that was deleted.
+ * Backup Blk 1: parent page, containing the downlink to the deleted page.
+ */
+typedef struct gistxlogPageDelete
+{
+	TransactionId deleteXid;	/* last Xid which could see page in scan */
+	OffsetNumber downlinkOffset; /* Offset of downlink referencing this page */
+} gistxlogPageDelete;
+
+#define SizeOfGistxlogPageDelete	(offsetof(gistxlogPageDelete, downlinkOffset) + sizeof(OffsetNumber))
+
+
+/*
+ * This is what we need to know about page reuse, for hot standby.
+ */
+typedef struct gistxlogPageReuse
+{
+	RelFileNode node;
+	BlockNumber block;
+	TransactionId latestRemovedXid;
+} gistxlogPageReuse;
+
+#define SizeOfGistxlogPageReuse	(offsetof(gistxlogPageReuse, latestRemovedXid) + sizeof(TransactionId))
+
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
diff --git a/src/test/regress/expected/gist.out b/src/test/regress/expected/gist.out
index f5a2993aaf2..0a43449f003 100644
--- a/src/test/regress/expected/gist.out
+++ b/src/test/regress/expected/gist.out
@@ -27,10 +27,8 @@ insert into gist_point_tbl (id, p)
 select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
-delete from gist_point_tbl where id < 10000;
+-- And also delete some concentration of values.
+delete from gist_point_tbl where id > 5000;
 vacuum analyze gist_point_tbl;
 -- rebuild the index with a different fillfactor
 alter index gist_pointidx SET (fillfactor = 40);
diff --git a/src/test/regress/sql/gist.sql b/src/test/regress/sql/gist.sql
index bae722fe13c..657b1954847 100644
--- a/src/test/regress/sql/gist.sql
+++ b/src/test/regress/sql/gist.sql
@@ -28,10 +28,8 @@ select g+100000, point(g*10+1, g*10+1) from generate_series(1, 10000) g;
 -- To test vacuum, delete some entries from all over the index.
 delete from gist_point_tbl where id % 2 = 1;
 
--- And also delete some concentration of values. (GiST doesn't currently
--- attempt to delete pages even when they become empty, but if it did, this
--- would exercise it)
-delete from gist_point_tbl where id < 10000;
+-- And also delete some concentration of values.
+delete from gist_point_tbl where id > 5000;
 
 vacuum analyze gist_point_tbl;
 
-- 
2.20.1

#57Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#56)
Re: GiST VACUUM

On 22 Mar 2019, at 1:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
...
When I started testing this, I quickly noticed that empty pages were not being deleted nearly as much as I expected. I tracked it to this check in gistdeletepage:

+       if (GistFollowRight(leafPage)
+               || GistPageGetNSN(parentPage) > GistPageGetNSN(leafPage))
+       {
+               /* Don't mess with a concurrent page split. */
+               return false;
+       }

That NSN test was bogus. It prevented the leaf page from being reused, if the parent page was *ever* split after the leaf page was created. I don't see any reason to check the NSN here.

That's true. This check made sense only when the parent page was not locked (and it seems the comparison should be the opposite). When both pages are locked, this test makes no sense at all. The check was incorrectly "fixed" by me when transitioning from single-scan delete to two-scan delete during summer 2018. Just wondering how hard it is to simply delete a page...

Though, I'm not sure it is important for GIN. The scariest thing that can
happen: it will return the same tid twice. But it is doing a bitmap scan;
you cannot return the same bit twice...

Hmm. Could it return a completely unrelated tuple?

No, I do not think so; it will do comparisons on posting tree tuples.

We don't always recheck the original index quals in a bitmap index scan, IIRC. Also, a search might get confused if it's descending down a posting tree and lands on a different kind of page altogether.

Yes, the search will return an error, the user will reissue the query, and everything will be almost fine.

PS. for Gist, we could almost use the LSN / NSN mechanism to detect the case that a deleted page is reused: Add a new field to the GiST page header, to store a new "deleteNSN" field. When a page is deleted, the deleted page's deleteNSN is set to the LSN of the deletion record. When the page is reused, the deleteNSN field is kept unchanged. When you follow a downlink during search, if you see that the page's deleteNSN > parent's LSN, you know that it was concurrently deleted and recycled, and should be ignored. That would allow reusing deleted pages immediately. Unfortunately that would require adding a new field to the gist page header/footer, which requires upgrade work :-(. Maybe one day, we'll bite the bullet. Something to keep in mind, if we have to change the page format anyway, for some reason.
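
To make the idea concrete, here is a minimal sketch of that descent-time check. The deleteNSN field and the GistPageGetDeleteNSN() accessor are assumptions; no such field exists in the current page format:

	/* hypothetical: deleteNSN would survive page reuse, unlike the deleted flag */
	if (GistPageGetDeleteNSN(stack->page) > stack->parent->lsn)
	{
		/*
		 * The page was deleted, and possibly already recycled, after we
		 * read the downlink in the parent: ignore it and rechoose the
		 * child from the parent.
		 */
		UnlockReleaseBuffer(stack->buffer);
		state.stack = stack = stack->parent;
		continue;
	}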

Yeah, the same day we will get rid of invalid tuples. I can make a patch for v13. Actually, I have a lot of patches that I want in GiST in v13. Or v14.

Best regards, Andrey Borodin.

#58Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#57)
Re: GiST VACUUM

On 22/03/2019 10:00, Andrey Borodin wrote:

On 22 Mar 2019, at 1:04, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

PS. for Gist, we could almost use the LSN / NSN mechanism to detect
the case that a deleted page is reused: Add a new field to the GiST
page header, to store a new "deleteNSN" field. When a page is
deleted, the deleted page's deleteNSN is set to the LSN of the
deletion record. When the page is reused, the deleteNSN field is
kept unchanged. When you follow a downlink during search, if you
see that the page's deleteNSN > parent's LSN, you know that it was
concurrently deleted and recycled, and should be ignored. That
would allow reusing deleted pages immediately. Unfortunately that
would require adding a new field to the gist page header/footer,
which requires upgrade work :-(. Maybe one day, we'll bite the
bullet. Something to keep in mind, if we have to change the page
format anyway, for some reason.

Yeah, the same day we will get rid of invalid tuples. I can make a
patch for v13. Actually, I have a lot of patches that I want in GiST
in v13. Or v14.

Cool! Here's my wishlist:

* That deleteNSN thing
* Add a metapage to blk #0.
* Add a "level"-field to page header.
* Currently, a search needs to scan all items on a page. If the keys are
small, that can be pretty slow. Divide each page further into e.g. 4
sub-pages, with a "bounding box" key for each sub-page, to speed up search.

- Heikki

#59Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#56)
Re: GiST VACUUM

On 21/03/2019 19:04, Heikki Linnakangas wrote:

Attached is the latest patch version, to be applied on top of the
IntegerSet patch.

And committed. Thanks Andrey!

- Heikki

#60Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#59)
Re: GiST VACUUM

On 22/03/2019 13:37, Heikki Linnakangas wrote:

On 21/03/2019 19:04, Heikki Linnakangas wrote:

Attached is the latest patch version, to be applied on top of the
IntegerSet patch.

And committed. Thanks Andrey!

This caused the buildfarm to go pink... I was able to reproduce it, by
running the regression tests in one terminal, and repeatedly running
"VACUUM;" in another terminal. Strange that it seems to happen all the
time on the buildfarm, but never happened to me locally.

Anyway, I'm investigating.

- Heikki

#61Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#59)
Re: GiST VACUUM

On 22 Mar 2019, at 19:37, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 21/03/2019 19:04, Heikki Linnakangas wrote:

Attached is the latest patch version, to be applied on top of the
IntegerSet patch.

And committed. Thanks Andrey!

- Heikki

Cool! Thank you very much! At the beginning I could not imagine how many caveats there are.

On 22 Mar 2019, at 18:28, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

* Currently, a search needs to scan all items on a page. If the keys are small, that can be pretty slow. Divide each page further into e.g. 4 sub-pages, with a "bounding box" key for each sub-page, to speed up search.

BTW, I already have an intra-page indexing patch. But now it obviously needs a rebase :) Along with the gist amcheck patch.

Thanks!

Best regards, Andrey Borodin.

#62Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#60)
Re: GiST VACUUM

On 22 Mar 2019, at 17:03, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I was working on a new version of the gist check in amcheck and realized one more thing:

/* Can this page be recycled yet? */
bool
gistPageRecyclable(Page page)
{
	return PageIsNew(page) ||
		(GistPageIsDeleted(page) &&
		 TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
}

Here RecentGlobalXmin can wrap around, and the page will become unrecyclable for half of the xid cycle. Can we prevent that by resetting PageDeleteXid to InvalidTransactionId before doing RecordFreeIndexPage()?
(Seems like the same applies to GIN...)
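
To spell the hazard out (a sketch with made-up values; TransactionIdPrecedes() compares normal 32-bit XIDs as a signed difference modulo 2^32):

	TransactionId deleteXid = 100;		/* stamped when the page was deleted */

	/* shortly after the deletion: recyclable */
	TransactionIdPrecedes(deleteXid, 1000);				/* true */

	/* a bit more than 2^31 XIDs later: "unrecyclable" again */
	TransactionIdPrecedes(deleteXid, 100 + 0x80000001);	/* false */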

Best regards, Andrey Borodin.

#63Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#62)
Re: GiST VACUUM

On 24/03/2019 18:50, Andrey Borodin wrote:

I was working on a new version of the gist check in amcheck and realized one more thing:

/* Can this page be recycled yet? */
bool
gistPageRecyclable(Page page)
{
	return PageIsNew(page) ||
		(GistPageIsDeleted(page) &&
		 TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
}

Here RecentGlobalXmin can wrap around, and the page will become unrecyclable for half of the xid cycle. Can we prevent that by resetting PageDeleteXid to InvalidTransactionId before doing RecordFreeIndexPage()?
(Seems like the same applies to GIN...)

True, and B-tree has the same issue. I thought I saw a comment somewhere
in the B-tree code about that earlier, but now I can't find it. I
must've imagined it.

We could reset it, but that would require dirtying the page. That would
be just extra I/O overhead, if the page gets reused before XID
wraparound. We could avoid that if we stored the full XID+epoch, not
just XID. I think we should do that in GiST, at least, where this is
new. In the B-tree, it would require some extra code to deal with
backwards-compatibility, but maybe it would be worth it even there.
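
In essence, something like this (a sketch; the gistPageRecyclable() change in the patch in the next message does exactly this):

	FullTransactionId nextxid = ReadNextFullTransactionId();
	uint32		epoch = EpochFromFullTransactionId(nextxid);

	/*
	 * RecentGlobalXmin cannot be ahead of the next XID; if it compares
	 * higher, it belongs to the previous epoch.
	 */
	if (RecentGlobalXmin > XidFromFullTransactionId(nextxid))
		epoch--;

	return FullTransactionIdPrecedes(GistPageGetDeleteXid(page),
									 FullTransactionIdFromEpochAndXid(epoch, RecentGlobalXmin));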

- Heikki

#64Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#63)
2 attachment(s)
Re: GiST VACUUM

On 25/03/2019 15:20, Heikki Linnakangas wrote:

On 24/03/2019 18:50, Andrey Borodin wrote:

I was working on a new version of the gist check in amcheck and realized one more thing:

/* Can this page be recycled yet? */
bool
gistPageRecyclable(Page page)
{
	return PageIsNew(page) ||
		(GistPageIsDeleted(page) &&
		 TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
}

Here RecentGlobalXmin can wrap around, and the page will become unrecyclable for half of the xid cycle. Can we prevent that by resetting PageDeleteXid to InvalidTransactionId before doing RecordFreeIndexPage()?
(Seems like the same applies to GIN...)

True, and B-tree has the same issue. I thought I saw a comment somewhere
in the B-tree code about that earlier, but now I can't find it. I
must've imagined it.

We could reset it, but that would require dirtying the page. That would
be just extra I/O overhead, if the page gets reused before XID
wraparound. We could avoid that if we stored the full XID+epoch, not
just XID. I think we should do that in GiST, at least, where this is
new. In the B-tree, it would require some extra code to deal with
backwards-compatibility, but maybe it would be worth it even there.

I suggest that we do the attached. It fixes this for GiST. The patch
expands the "deletion XID" to 64 bits, and changes where it's stored.
Instead of storing it in pd_prune_xid, it's stored in the page
contents. Luckily, a deleted page has no real content.

I think we should fix this in a similar manner in B-tree, too, but that
can be done separately. For B-tree, we need to worry about
backwards-compatibility, but that seems simple enough; we just need to
continue to understand old deleted pages, where the deletion XID is
stored in the page opaque field.

- Heikki

Attachments:

0001-Refactor-checks-for-deleted-GiST-pages.patch (text/x-patch)
From b7897577c83a81ec04394ce7113d1d8a47804086 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 4 Apr 2019 18:06:48 +0300
Subject: [PATCH 1/2] Refactor checks for deleted GiST pages.

The explicit check in gistScanPage() isn't currently really necessary, as
a deleted page is always empty, so the loop would fall through without
doing anything, anyway. But it's a marginal optimization, and it gives a
nice place to attach a comment to explain how it works.
---
 src/backend/access/gist/gist.c    | 40 ++++++++++++-------------------
 src/backend/access/gist/gistget.c | 14 +++++++++++
 2 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 2db790c840..028b06b264 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -693,14 +693,15 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			continue;
 		}
 
-		if (stack->blkno != GIST_ROOT_BLKNO &&
-			stack->parent->lsn < GistPageGetNSN(stack->page))
+		if ((stack->blkno != GIST_ROOT_BLKNO &&
+			 stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+			GistPageIsDeleted(stack->page))
 		{
 			/*
-			 * Concurrent split detected. There's no guarantee that the
-			 * downlink for this page is consistent with the tuple we're
-			 * inserting anymore, so go back to parent and rechoose the best
-			 * child.
+			 * Concurrent split or page deletion detected. There's no
+			 * guarantee that the downlink for this page is consistent with
+			 * the tuple we're inserting anymore, so go back to parent and
+			 * rechoose the best child.
 			 */
 			UnlockReleaseBuffer(stack->buffer);
 			xlocked = false;
@@ -719,9 +720,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
-			/* currently, internal pages are never deleted */
-			Assert(!GistPageIsDeleted(stack->page));
-
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -842,12 +840,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 					 * leaf/inner is enough to recognize split for root
 					 */
 				}
-				else if (GistFollowRight(stack->page) ||
-						 stack->parent->lsn < GistPageGetNSN(stack->page))
+				else if (GistFollowRight(stack->page) ||
+						 stack->parent->lsn < GistPageGetNSN(stack->page) ||
+						 GistPageIsDeleted(stack->page))
 				{
 					/*
-					 * The page was split while we momentarily unlocked the
-					 * page. Go back to parent.
+					 * The page was split or deleted while we momentarily
+					 * unlocked the page. Go back to parent.
 					 */
 					UnlockReleaseBuffer(stack->buffer);
 					xlocked = false;
@@ -856,18 +855,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 				}
 			}
 
-			/*
-			 * The page might have been deleted after we scanned the parent
-			 * and saw the downlink.
-			 */
-			if (GistPageIsDeleted(stack->page))
-			{
-				UnlockReleaseBuffer(stack->buffer);
-				xlocked = false;
-				state.stack = stack = stack->parent;
-				continue;
-			}
-
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
@@ -931,6 +918,9 @@ gistFindPath(Relation r, BlockNumber child, OffsetNumber *downlinkoffnum)
 			break;
 		}
 
+		/* currently, internal pages are never deleted */
+		Assert(!GistPageIsDeleted(page));
+
 		top->lsn = BufferGetLSNAtomic(buffer);
 
 		/*
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 8108fbb7d8..77ae2fb339 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -377,6 +377,20 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 		MemoryContextSwitchTo(oldcxt);
 	}
 
+	/*
+	 * Check if the page was deleted after we saw the downlink. There's
+	 * nothing of interest on a deleted page. Note that we must do this
+	 * after checking the NSN for concurrent splits! It's possible that
+	 * the page originally contained some tuples that are visible to us,
+	 * but was split so that all the visible tuples were moved to another
+	 * page, and then this page was deleted.
+	 */
+	if (GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		return;
+	}
+
 	so->nPageData = so->curPageData = 0;
 	scan->xs_hitup = NULL;		/* might point into pageDataCxt */
 	if (so->pageDataCxt)
-- 
2.20.1

0002-Use-full-64-bit-XID-for-checking-if-a-deleted-GiST-p.patch (text/x-patch)
From 38e676b20ab57370e3761898f3657dc64329f211 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 4 Apr 2019 18:05:48 +0300
Subject: [PATCH 2/2] Use full 64-bit XID for checking if a deleted GiST page
 is old enough.

Otherwise, after a deleted page gets even older, it becomes unrecyclable
again. B-tree has the same problem, and has had it since time immemorial,
but let's at least fix this in GiST, where this is new.
---
 src/backend/access/gist/gistutil.c     | 63 ++++++++++++++++++++++----
 src/backend/access/gist/gistvacuum.c   |  4 +-
 src/backend/access/gist/gistxlog.c     | 20 ++++++--
 src/backend/access/rmgrdesc/gistdesc.c |  6 ++-
 src/include/access/gist.h              | 20 +++++++-
 src/include/access/gist_private.h      |  4 +-
 src/include/access/gistxlog.h          |  2 +-
 7 files changed, 98 insertions(+), 21 deletions(-)

diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 94b6ad6a59..f6bbc0426c 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -839,16 +839,16 @@ gistNewBuffer(Relation r)
 			gistcheckpage(r, buffer);
 
 			/*
-			 * Otherwise, recycle it if deleted, and too old to have any processes
-			 * interested in it.
+			 * Otherwise, recycle it if deleted, and too old to have any
+			 * processes interested in it.
 			 */
 			if (gistPageRecyclable(page))
 			{
 				/*
-				 * If we are generating WAL for Hot Standby then create a
-				 * WAL record that will allow us to conflict with queries
-				 * running on standby, in case they have snapshots older
-				 * than the page's deleteXid.
+				 * If we are generating WAL for Hot Standby then create a WAL
+				 * record that will allow us to conflict with queries running
+				 * on standby, in case they have snapshots older than the
+				 * page's deleteXid.
 				 */
 				if (XLogStandbyInfoActive() && RelationNeedsWAL(r))
 					gistXLogPageReuse(r, blkno, GistPageGetDeleteXid(page));
@@ -882,9 +882,54 @@ gistNewBuffer(Relation r)
 bool
 gistPageRecyclable(Page page)
 {
-	return PageIsNew(page) ||
-		(GistPageIsDeleted(page) &&
-		 TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
+	if (PageIsNew(page))
+		return true;
+	if (GistPageIsDeleted(page))
+	{
+		/*
+		 * The page was deleted, but when? If it was just deleted, a scan
+		 * might have seen the downlink to it, and will read the page later.
+		 * As long as that can happen, we must keep the deleted page around as
+		 * a tombstone.
+		 *
+		 * Compare the deletion XID with RecentGlobalXmin. If deleteXid <
+		 * RecentGlobalXmin, then no scan that's still in progress could have
+		 * seen its downlink, and we can recycle it.
+		 *
+		 * One complication here is that the delete XID is a "full" 64-bit
+		 * transaction ID, but RecentGlobalXmin doesn't include the epoch. So
+		 * we first have to form a full 64-bit version of RecentGlobalXmin to
+		 * compare with.
+		 */
+		FullTransactionId deletexid_full = GistPageGetDeleteXid(page);
+		FullTransactionId nextxid_full;
+		uint32		nextxid_epoch;
+		TransactionId nextxid_xid;
+		FullTransactionId recentxmin_full;
+		uint32		recentxmin_epoch;
+		TransactionId recentxmin_xid;
+
+		nextxid_full = ReadNextFullTransactionId();
+		nextxid_epoch = EpochFromFullTransactionId(nextxid_full);
+		nextxid_xid = XidFromFullTransactionId(nextxid_full);
+
+		recentxmin_xid = RecentGlobalXmin;
+		if (recentxmin_xid > nextxid_xid)
+			recentxmin_epoch = nextxid_epoch - 1;
+		else
+			recentxmin_epoch = nextxid_epoch;
+		recentxmin_full =
+			FullTransactionIdFromEpochAndXid(recentxmin_epoch,
+											 recentxmin_xid);
+
+		/*
+		 * Now we have a 64-bit version of RecentGlobalXmin. Compare deletion
+		 * XID against it.
+		 */
+		if (FullTransactionIdPrecedes(deletexid_full, recentxmin_full))
+			return true;
+	}
+	return false;
 }
 
 bytea *
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index e2029d842c..07096194b2 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -595,7 +595,7 @@ gistdeletepage(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 	ItemId		iid;
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
-	TransactionId txid;
+	FullTransactionId txid;
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -648,7 +648,7 @@ gistdeletepage(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 	 * currently in progress must have ended.  (That's much more conservative
 	 * than needed, but let's keep it safe and simple.)
 	 */
-	txid = ReadNewTransactionId();
+	txid = ReadNextFullTransactionId();
 
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 4fb1855e89..37a190219e 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -701,7 +701,7 @@ gistXLogSplit(bool page_is_leaf,
  * downlink from the parent page.
  */
 XLogRecPtr
-gistXLogPageDelete(Buffer buffer, TransactionId xid,
+gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
 				   Buffer parentBuffer, OffsetNumber downlinkOffset)
 {
 	gistxlogPageDelete xlrec;
@@ -725,9 +725,23 @@ gistXLogPageDelete(Buffer buffer, TransactionId xid,
  * Write XLOG record about reuse of a deleted page.
  */
 void
-gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid)
+gistXLogPageReuse(Relation rel, BlockNumber blkno, FullTransactionId latestRemovedXid)
 {
 	gistxlogPageReuse xlrec_reuse;
+	FullTransactionId nextxid;
+	uint64		diff;
+
+	/*
+	 * We can skip this if the page was deleted so long ago that no scan can possibly
+	 * still see it, even in a standby. One measure might be anything older than the
+	 * table's frozen-xid, but we don't have that at hand here. But anything older than
+	 * 2 billion, from the next XID, is surely old enough, because you would hit XID
+	 * wraparound at that point.
+	 */
+	nextxid = ReadNextFullTransactionId();
+	diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
+	if (diff < 0x7fffffff)
+		return;
 
 	/*
 	 * Note that we don't register the buffer with the record, because this
@@ -738,7 +752,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
 	/* XLOG stuff */
 	xlrec_reuse.node = rel->rd_node;
 	xlrec_reuse.block = blkno;
-	xlrec_reuse.latestRemovedXid = latestRemovedXid;
+	xlrec_reuse.latestRemovedXid = XidFromFullTransactionId(latestRemovedXid);
 
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec_reuse, SizeOfGistxlogPageReuse);
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index eb308c72d6..ba00315260 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -47,8 +47,10 @@ out_gistxlogPageSplit(StringInfo buf, gistxlogPageSplit *xlrec)
 static void
 out_gistxlogPageDelete(StringInfo buf, gistxlogPageDelete *xlrec)
 {
-	appendStringInfo(buf, "deleteXid %u; downlink %u",
-					 xlrec->deleteXid, xlrec->downlinkOffset);
+	appendStringInfo(buf, "deleteXid %u:%u; downlink %u",
+					 EpochFromFullTransactionId(xlrec->deleteXid),
+					 XidFromFullTransactionId(xlrec->deleteXid),
+					 xlrec->downlinkOffset);
 }
 
 void
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 6902f4115b..8cdc1f0fa9 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -16,6 +16,7 @@
 #ifndef GIST_H
 #define GIST_H
 
+#include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogdefs.h"
 #include "storage/block.h"
@@ -159,8 +160,23 @@ typedef struct GISTENTRY
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
 /* For deleted pages we store last xid which could see the page in scan */
-#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
-#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+static inline FullTransactionId
+GistPageGetDeleteXid(Page page)
+{
+	Assert(GistPageIsDeleted(page));
+	Assert(((PageHeader) page)->pd_lower == MAXALIGN(SizeOfPageHeaderData) + sizeof(FullTransactionId));
+
+	return *(FullTransactionId *) PageGetContents(page);
+}
+
+static inline void
+GistPageSetDeleteXid(Page page, FullTransactionId deletexid)
+{
+	Assert(PageIsEmpty(page));
+	((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + sizeof(FullTransactionId);
+
+	*((FullTransactionId *) PageGetContents(page)) = deletexid;
+}
 
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 78e2e3fb31..2c54f208ca 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -419,11 +419,11 @@ extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 
 /* gistxlog.c */
 extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
-				   TransactionId xid, Buffer parentBuffer,
+				   FullTransactionId xid, Buffer parentBuffer,
 				   OffsetNumber downlinkOffset);
 
 extern void gistXLogPageReuse(Relation rel, BlockNumber blkno,
-				  TransactionId latestRemovedXid);
+				  FullTransactionId latestRemovedXid);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 			   OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 9990d97cbd..9967bdcf3f 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -83,7 +83,7 @@ typedef struct gistxlogPageSplit
  */
 typedef struct gistxlogPageDelete
 {
-	TransactionId deleteXid;	/* last Xid which could see page in scan */
+	FullTransactionId deleteXid; /* last Xid which could see page in scan */
 	OffsetNumber downlinkOffset; /* Offset of downlink referencing this page */
 } gistxlogPageDelete;
 
-- 
2.20.1

#65Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#64)
Re: GiST VACUUM

Hi!

On 4 April 2019, at 20:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 25/03/2019 15:20, Heikki Linnakangas wrote:

On 24/03/2019 18:50, Andrey Borodin wrote:

I was working on a new version of the GiST check in amcheck and realized one more thing:

/* Can this page be recycled yet? */
bool
gistPageRecyclable(Page page)
{
return PageIsNew(page) ||
(GistPageIsDeleted(page) &&
TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
}

Here RecentGlobalXmin can wrap around, and then the page will become unrecyclable for half of the xid cycle. Can we prevent that by resetting the page's delete XID to InvalidTransactionId before doing RecordFreeIndexPage()?
(Seems like the same applies to GIN...)

True, and B-tree has the same issue. I thought I saw a comment somewhere
in the B-tree code about that earlier, but now I can't find it. I
must've imagined it.
We could reset it, but that would require dirtying the page. That would
be just extra I/O overhead, if the page gets reused before XID
wraparound. We could avoid that if we stored the full XID+epoch, not
just XID. I think we should do that in GiST, at least, where this is
new. In the B-tree, it would require some extra code to deal with
backwards-compatibility, but maybe it would be worth it even there.
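
For concreteness, TransactionIdPrecedes() compares normal 32-bit xids by
signed difference modulo 2^32, which is why "precedes" flips once the
counter advances about 2^31 past the stored deletion xid. A minimal
sketch, with a hypothetical xid_precedes() standing in for the logic of
the real function in transam.c:

static bool
xid_precedes(uint32 id1, uint32 id2)
{
	/* the same modulo-2^32 trick that TransactionIdPrecedes() uses */
	int32		diff = (int32) (id1 - id2);

	return diff < 0;
}

/*
 * deleteXid = 100, RecentGlobalXmin = 1000000:
 *   xid_precedes(100, 1000000) is true, so the page is recyclable.
 * About 2^31 transactions later, RecentGlobalXmin wraps to 0x80000065:
 *   xid_precedes(100, 0x80000065) is false, so the page is stuck
 *   again until the counter comes back around.
 */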

I suggest that we do the attached. It fixes this for GiST. The patch expands the "deletion XID" to 64 bits and changes where it's stored. Instead of storing it in pd_prune_xid, it's stored in the page contents. Luckily, a deleted page has no real content.

So, we store the full xid right after the page header?
+static inline void
+GistPageSetDeleteXid(Page page, FullTransactionId deletexid)
+{
+	Assert(PageIsEmpty(page));
+	((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + sizeof(FullTransactionId);
+
+	*((FullTransactionId *) PageGetContents(page)) = deletexid;
+}

Usually we leave one ItemId (located at an invalid offset number) untouched. I do not know whether that is done for a reason or not...

Also, I did not understand this optimization:
+	/*
+	 * We can skip this if the page was deleted so long ago, that no scan can possibly
+	 * still see it, even in a standby. One measure might be anything older than the
+	 * table's frozen-xid, but we don't have that at hand here. But anything older than
+	 * 2 billion, from the next XID, is surely old enough, because you would hit XID
+	 * wraparound at that point.
+	 */
+	nextxid = ReadNextFullTransactionId();
+	diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
+	if (diff < 0x7fffffff)
+		return;

A standby can be lagging months behind the primary and, theoretically, close the gap in one sudden WAL leap... Also, I think that comparison sign should be >, not <: diff is small when the deletion is recent, so as written the check skips the conflict logging in exactly the case where it is needed.

Best regards, Andrey Borodin.

#66Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#64)
Re: GiST VACUUM

Heikki,

On Thu, Apr 04, 2019 at 06:15:21PM +0300, Heikki Linnakangas wrote:

I think we should fix this in a similar manner in B-tree, too, but that can
be done separately. For B-tree, we need to worry about
backwards-compatibility, but that seems simple enough; we just need to
continue to understand old deleted pages, where the deletion XID is stored
in the page opaque field.

This has been an open item for a couple of weeks already. Are you
planning to tackle it soon?
--
Michael

#67Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#65)
Re: GiST VACUUM

(Thanks for the reminder on this, Michael!)

On 05/04/2019 08:39, Andrey Borodin wrote:

On 4 April 2019, at 20:15, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
I suggest that we do the attached. It fixes this for GiST. The
patch expands the "deletion XID" to 64 bits and changes
where it's stored. Instead of storing it in pd_prune_xid, it's stored
in the page contents. Luckily, a deleted page has no real content.

So, we store the full xid right after the page header?

Yep.

+static inline void
+GistPageSetDeleteXid(Page page, FullTransactionId deletexid)
+{
+	Assert(PageIsEmpty(page));
+	((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + sizeof(FullTransactionId);
+
+	*((FullTransactionId *) PageGetContents(page)) = deletexid;
+}

Usually we leave one ItemId (located at an invalid offset number)
untouched. I do not know whether that is done for a reason or not...

No. Take a look at the PageGetItemId() macro; it subtracts one from the
offset number. But in any case, that's not really relevant here, because
this patch stores the transaction id directly as the page content. There
are no itemids at all on a deleted page.
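
For reference, the definition in bufpage.h is (give or take whitespace):

#define PageGetItemId(page, offsetNumber) \
	((ItemId) (&((PageHeader) (page))->pd_linp[(offsetNumber) - 1]))

So offset number 1 maps to pd_linp[0]; no line pointer is reserved at an
"invalid" offset.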

Also, I did not understand this optimization:
+	/*
+	 * We can skip this if the page was deleted so long ago, that no scan can possibly
+	 * still see it, even in a standby. One measure might be anything older than the
+	 * table's frozen-xid, but we don't have that at hand here. But anything older than
+	 * 2 billion, from the next XID, is surely old enough, because you would hit XID
+	 * wraparound at that point.
+	 */
+	nextxid = ReadNextFullTransactionId();
+	diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
+	if (diff < 0x7fffffff)
+		return;

A standby can be lagging months behind the primary and, theoretically,
close the gap in one sudden WAL leap...

It would still process the WAL one WAL record at a time, even if it's
lagging months behind. It can't just jump over 2 billion XIDs.

Also, I think that comparison sign should be >, not <.

Ah, good catch! And it shows that this needs more testing...

- Heikki

#68Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#67)
Re: GiST VACUUM

Hi!

Thanks for the clarification; now I understand these patches better.

On 25 June 2019, at 13:10, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Also, I did not understand this optimization:
+	/*
+	 * We can skip this if the page was deleted so long ago, that no scan can possibly
+	 * still see it, even in a standby. One measure might be anything older than the
+	 * table's frozen-xid, but we don't have that at hand here. But anything older than
+	 * 2 billion, from the next XID, is surely old enough, because you would hit XID
+	 * wraparound at that point.
+	 */
+	nextxid = ReadNextFullTransactionId();
+	diff = U64FromFullTransactionId(nextxid) - U64FromFullTransactionId(latestRemovedXid);
+	if (diff < 0x7fffffff)
+		return;
A standby can be lagging months behind the primary and, theoretically,
close the gap in one sudden WAL leap...

It would still process the WAL one WAL record at a time, even if it's lagging months behind. It can't just jump over 2 billion XIDs.

I feel a little uncomfortable with the number 0x7fffffff right in the code.

Thanks!

Best regards, Andrey Borodin.

#69Michael Paquier
michael@paquier.xyz
In reply to: Andrey Borodin (#68)
Re: GiST VACUUM

On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:

I feel a little uncomfortable with the number 0x7fffffff right in the code.

PG_INT32_MAX...
--
Michael

#70Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#69)
Re: GiST VACUUM

On Wed, Jun 26, 2019 at 2:33 PM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:

I feel a little uncomfortable with the number 0x7fffffff right in the code.

PG_INT32_MAX...

MaxTransactionId / 2?

--
Thomas Munro
https://enterprisedb.com

#71Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Thomas Munro (#70)
2 attachment(s)
Re: GiST VACUUM

On 26/06/2019 06:07, Thomas Munro wrote:

On Wed, Jun 26, 2019 at 2:33 PM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Jun 25, 2019 at 02:38:43PM +0500, Andrey Borodin wrote:

I feel a little uncomfortable with the number 0x7fffffff right in the code.

PG_INT32_MAX...

MaxTransactionId / 2?

Yeah, that makes sense.

Here's a new version of the patches. Changes:

* I changed the reuse-logging so that we always write a reuse WAL
record, even if the deleteXid is very old. I tried to avoid that with
the check for MaxTransactionId / 2 or 0x7fffffff, but it had some
problems. In the previous patch version, it wasn't just an optimization.
Without the check, we would write 32-bit XIDs to the log that had
already wrapped around, and that caused the standby to unnecessarily
wait for or kill backends. But checking for MaxTransactionId / 2 isn't
quite enough: there was a small chance that the next XID counter
advanced just after checking for that, so that we still logged an XID
that had just wrapped around. A more robust way to deal with this is to
log a full 64-bit XID, and check for wraparound at redo in the standby.
And if we do that, trying to optimize this in the master doesn't seem
that important anymore. So in this patch version, we always log the
64-bit XID, and check against MaxTransactionId / 2 when replaying the WAL
record instead.

* I moved the logic to extend a 32-bit XID to 64-bits to a new helper
function in varsup.c.

* Instead of storing just a naked FullTransactionId in the "page
contents" of a deleted page, I created a new struct for that. The effect
is the same, but I think the struct clarifies what's happening, and it's
a good location to attach a comment explaining it.

* Fixed the mixup between < and >

I haven't done any testing on this yet. Andrey, would you happen to have
an environment ready to test this?

- Heikki

Attachments:

0002-Use-full-64-bit-XID-for-checking-if-a-deleted-GiST-p.patchtext/x-patch; name=0002-Use-full-64-bit-XID-for-checking-if-a-deleted-GiST-p.patchDownload
From 0cc869d0f5e7ef33f0ccdcd2ccbd8f89d5711590 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 27 Jun 2019 14:37:56 +0300
Subject: [PATCH 2/2] Use full 64-bit XID for checking if a deleted GiST page
 is old enough.

Otherwise, after a deleted page gets even older, it becomes unrecyclable
again. B-tree has the same problem, and has had since time immemorial,
but let's at least fix this in GiST, where this is new.
---
 src/backend/access/gist/gistutil.c     | 26 +++++++++++++++--
 src/backend/access/gist/gistvacuum.c   |  4 +--
 src/backend/access/gist/gistxlog.c     | 29 +++++++++++++++----
 src/backend/access/rmgrdesc/gistdesc.c | 11 +++++---
 src/backend/access/transam/varsup.c    | 39 ++++++++++++++++++++++++++
 src/include/access/gist.h              | 35 +++++++++++++++++++++--
 src/include/access/gist_private.h      |  4 +--
 src/include/access/gistxlog.h          |  6 ++--
 src/include/access/transam.h           |  1 +
 9 files changed, 133 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 49df05653b3..3a09197a242 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -882,9 +882,29 @@ gistNewBuffer(Relation r)
 bool
 gistPageRecyclable(Page page)
 {
-	return PageIsNew(page) ||
-		(GistPageIsDeleted(page) &&
-		 TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
+	if (PageIsNew(page))
+		return true;
+	if (GistPageIsDeleted(page))
+	{
+		/*
+		 * The page was deleted, but when? If it was just deleted, a scan
+		 * might have seen the downlink to it, and will read the page later.
+		 * As long as that can happen, we must keep the deleted page around as
+		 * a tombstone.
+		 *
+		 * Compare the deletion XID with RecentGlobalXmin. If deleteXid <
+		 * RecentGlobalXmin, then no scan that's still in progress could have
+		 * seen its downlink, and we can recycle it.
+		 */
+		FullTransactionId deletexid_full = GistPageGetDeleteXid(page);
+		FullTransactionId recentxmin_full;
+
+		recentxmin_full = FullTransactionIdFromRecentXid(RecentGlobalXmin);
+
+		if (FullTransactionIdPrecedes(deletexid_full, recentxmin_full))
+			return true;
+	}
+	return false;
 }
 
 bytea *
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 4270226eee2..ad1576da5b8 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -595,7 +595,7 @@ gistdeletepage(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 	ItemId		iid;
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
-	TransactionId txid;
+	FullTransactionId txid;
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -648,7 +648,7 @@ gistdeletepage(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 	 * currently in progress must have ended.  (That's much more conservative
 	 * than needed, but let's keep it safe and simple.)
 	 */
-	txid = ReadNewTransactionId();
+	txid = ReadNextFullTransactionId();
 
 	START_CRIT_SECTION();
 
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 503db34d863..9fc7282200d 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -396,8 +396,27 @@ gistRedoPageReuse(XLogReaderState *record)
 	 */
 	if (InHotStandby)
 	{
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
-											xlrec->node);
+		FullTransactionId latestRemovedFullXid = xlrec->latestRemovedFullXid;
+		FullTransactionId nextFullXid = ReadNextFullTransactionId();
+		uint64		diff;
+
+		/*
+		 * ResolveRecoveryConflictWithSnapshot operates on 32-bit
+		 * TransactionIds, so truncate the logged FullTransactionId. If the
+		 * logged value is very old, so that XID wrap-around already happened
+		 * on it, there can't be any snapshots that still see it.
+		 */
+		nextFullXid = ReadNextFullTransactionId();
+		diff = U64FromFullTransactionId(nextFullXid) -
+			U64FromFullTransactionId(latestRemovedFullXid);
+		if (diff < MaxTransactionId / 2)
+		{
+			TransactionId latestRemovedXid;
+
+			latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
+			ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+												xlrec->node);
+		}
 	}
 }
 
@@ -554,7 +573,7 @@ gistXLogSplit(bool page_is_leaf,
  * downlink from the parent page.
  */
 XLogRecPtr
-gistXLogPageDelete(Buffer buffer, TransactionId xid,
+gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
 				   Buffer parentBuffer, OffsetNumber downlinkOffset)
 {
 	gistxlogPageDelete xlrec;
@@ -578,7 +597,7 @@ gistXLogPageDelete(Buffer buffer, TransactionId xid,
  * Write XLOG record about reuse of a deleted page.
  */
 void
-gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid)
+gistXLogPageReuse(Relation rel, BlockNumber blkno, FullTransactionId latestRemovedXid)
 {
 	gistxlogPageReuse xlrec_reuse;
 
@@ -591,7 +610,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
 	/* XLOG stuff */
 	xlrec_reuse.node = rel->rd_node;
 	xlrec_reuse.block = blkno;
-	xlrec_reuse.latestRemovedXid = latestRemovedXid;
+	xlrec_reuse.latestRemovedFullXid = latestRemovedXid;
 
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec_reuse, SizeOfGistxlogPageReuse);
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index 767864b58e6..eccb6fd9428 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -26,10 +26,11 @@ out_gistxlogPageUpdate(StringInfo buf, gistxlogPageUpdate *xlrec)
 static void
 out_gistxlogPageReuse(StringInfo buf, gistxlogPageReuse *xlrec)
 {
-	appendStringInfo(buf, "rel %u/%u/%u; blk %u; latestRemovedXid %u",
+	appendStringInfo(buf, "rel %u/%u/%u; blk %u; latestRemovedXid %u:%u",
 					 xlrec->node.spcNode, xlrec->node.dbNode,
 					 xlrec->node.relNode, xlrec->block,
-					 xlrec->latestRemovedXid);
+					 EpochFromFullTransactionId(xlrec->latestRemovedFullXid),
+					 XidFromFullTransactionId(xlrec->latestRemovedFullXid));
 }
 
 static void
@@ -50,8 +51,10 @@ out_gistxlogPageSplit(StringInfo buf, gistxlogPageSplit *xlrec)
 static void
 out_gistxlogPageDelete(StringInfo buf, gistxlogPageDelete *xlrec)
 {
-	appendStringInfo(buf, "deleteXid %u; downlink %u",
-					 xlrec->deleteXid, xlrec->downlinkOffset);
+	appendStringInfo(buf, "deleteXid %u:%u; downlink %u",
+					 EpochFromFullTransactionId(xlrec->deleteXid),
+					 XidFromFullTransactionId(xlrec->deleteXid),
+					 xlrec->downlinkOffset);
 }
 
 void
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 5b759ec7f3f..e305a3be028 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -300,6 +300,45 @@ AdvanceNextFullTransactionIdPastXid(TransactionId xid)
 	LWLockRelease(XidGenLock);
 }
 
+/*
+ * Extend a 32-bit TransactionId into a 64-bit FullTransactionId.
+ *
+ * This assumes that the xid is "recent", not older than 2 billion XIDs
+ * from the next xid.
+ */
+FullTransactionId
+FullTransactionIdFromRecentXid(TransactionId xid)
+{
+	FullTransactionId nextxid_full;
+	uint32		nextxid_epoch;
+	TransactionId nextxid_xid;
+	uint32		xid_epoch;
+
+	if (TransactionIdIsNormal(xid))
+	{
+		/*
+		 * Compute the epoch of the target xid from the next XID's epoch.
+		 * This assumes that the target XID is within the 2 billion XID
+		 * horizon from the next XID.
+		 */
+		nextxid_full = ReadNextFullTransactionId();
+		nextxid_epoch = EpochFromFullTransactionId(nextxid_full);
+		nextxid_xid = XidFromFullTransactionId(nextxid_full);
+
+		if (xid > nextxid_xid)
+			xid_epoch = nextxid_epoch - 1;
+		else
+			xid_epoch = nextxid_epoch;
+	}
+	else
+	{
+		/* Use epoch 0 for special XIDs. */
+		xid_epoch = 0;
+	}
+
+	return FullTransactionIdFromEpochAndXid(xid_epoch, xid);
+}
+
 /*
  * Advance the cluster-wide value for the oldest valid clog entry.
  *
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 6902f4115b7..14fa9646b23 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -16,6 +16,7 @@
 #ifndef GIST_H
 #define GIST_H
 
+#include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogdefs.h"
 #include "storage/block.h"
@@ -158,9 +159,37 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
-/* For deleted pages we store last xid which could see the page in scan */
-#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
-#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
+/*
+ * On a deleted page, we store this struct. A deleted page doesn't contain any
+ * tuples, so we don't use the normal page layout with line pointers. Instead,
+ * this struct is stored right after the standard page header. pd_lower points
+ * to the end of this struct. If we add fields to this struct in the future, we
+ * can distinguish the old and new formats by pd_lower.
+ */
+typedef struct GISTDeletedPageContents
+{
+	/* last xid which could see the page in a scan */
+	FullTransactionId deleteXid;
+} GISTDeletedPageContents;
+
+static inline FullTransactionId
+GistPageGetDeleteXid(Page page)
+{
+	Assert(GistPageIsDeleted(page));
+	Assert(((PageHeader) page)->pd_lower == MAXALIGN(SizeOfPageHeaderData) + sizeof(GISTDeletedPageContents));
+
+	return ((GISTDeletedPageContents *) PageGetContents(page))->deleteXid;
+}
+
+static inline void
+GistPageSetDeleteXid(Page page, FullTransactionId deletexid)
+{
+	Assert(PageIsEmpty(page));
+	((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + sizeof(GISTDeletedPageContents);
+
+	((GISTDeletedPageContents *) PageGetContents(page))->deleteXid = deletexid;
+}
 
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index f80694bf9a8..f84ea71ecba 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -426,11 +426,11 @@ extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 
 /* gistxlog.c */
 extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
-									 TransactionId xid, Buffer parentBuffer,
+									 FullTransactionId xid, Buffer parentBuffer,
 									 OffsetNumber downlinkOffset);
 
 extern void gistXLogPageReuse(Relation rel, BlockNumber blkno,
-							  TransactionId latestRemovedXid);
+							  FullTransactionId latestRemovedXid);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 								 OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 969a5376b5e..e44922d915c 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -83,7 +83,7 @@ typedef struct gistxlogPageSplit
  */
 typedef struct gistxlogPageDelete
 {
-	TransactionId deleteXid;	/* last Xid which could see page in scan */
+	FullTransactionId deleteXid;	/* last Xid which could see page in scan */
 	OffsetNumber downlinkOffset;	/* Offset of downlink referencing this
 									 * page */
 } gistxlogPageDelete;
@@ -98,10 +98,10 @@ typedef struct gistxlogPageReuse
 {
 	RelFileNode node;
 	BlockNumber block;
-	TransactionId latestRemovedXid;
+	FullTransactionId latestRemovedFullXid;
 } gistxlogPageReuse;
 
-#define SizeOfGistxlogPageReuse	(offsetof(gistxlogPageReuse, latestRemovedXid) + sizeof(TransactionId))
+#define SizeOfGistxlogPageReuse	(offsetof(gistxlogPageReuse, latestRemovedFullXid) + sizeof(FullTransactionId))
 
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/access/transam.h b/src/include/access/transam.h
index 6cbb0c82c73..e5e5e4b3bf3 100644
--- a/src/include/access/transam.h
+++ b/src/include/access/transam.h
@@ -225,6 +225,7 @@ extern XLogRecPtr TransactionIdGetCommitLSN(TransactionId xid);
 extern FullTransactionId GetNewTransactionId(bool isSubXact);
 extern void AdvanceNextFullTransactionIdPastXid(TransactionId xid);
 extern FullTransactionId ReadNextFullTransactionId(void);
+extern FullTransactionId FullTransactionIdFromRecentXid(TransactionId xid);
 extern void SetTransactionIdLimit(TransactionId oldest_datfrozenxid,
 								  Oid oldest_datoid);
 extern void AdvanceOldestClogXid(TransactionId oldest_datfrozenxid);
-- 
2.20.1

0001-Refactor-checks-for-deleted-GiST-pages.patchtext/x-patch; name=0001-Refactor-checks-for-deleted-GiST-pages.patchDownload
From 7fd5901e15ac9e0f1928eeecbb9ae1212bacf3f3 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Thu, 4 Apr 2019 18:06:48 +0300
Subject: [PATCH 1/2] Refactor checks for deleted GiST pages.

The explicit check in gistScanPage() isn't currently really necessary, as
a deleted page is always empty, so the loop would fall through without
doing anything, anyway. But it's a marginal optimization, and it gives a
nice place to attach a comment to explain how it works.
---
 src/backend/access/gist/gist.c    | 40 ++++++++++++-------------------
 src/backend/access/gist/gistget.c | 14 +++++++++++
 2 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 470b121e7da..79030e77153 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -709,14 +709,15 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			continue;
 		}
 
-		if (stack->blkno != GIST_ROOT_BLKNO &&
-			stack->parent->lsn < GistPageGetNSN(stack->page))
+		if ((stack->blkno != GIST_ROOT_BLKNO &&
+			 stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+			GistPageIsDeleted(stack->page))
 		{
 			/*
-			 * Concurrent split detected. There's no guarantee that the
-			 * downlink for this page is consistent with the tuple we're
-			 * inserting anymore, so go back to parent and rechoose the best
-			 * child.
+			 * Concurrent split or page deletion detected. There's no
+			 * guarantee that the downlink for this page is consistent with
+			 * the tuple we're inserting anymore, so go back to parent and
+			 * rechoose the best child.
 			 */
 			UnlockReleaseBuffer(stack->buffer);
 			xlocked = false;
@@ -735,9 +736,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
-			/* currently, internal pages are never deleted */
-			Assert(!GistPageIsDeleted(stack->page));
-
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -858,12 +856,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 					 * leaf/inner is enough to recognize split for root
 					 */
 				}
-				else if (GistFollowRight(stack->page) ||
-						 stack->parent->lsn < GistPageGetNSN(stack->page))
+				else if (GistFollowRight(stack->page) ||
+						 stack->parent->lsn < GistPageGetNSN(stack->page) ||
+						 GistPageIsDeleted(stack->page))
 				{
 					/*
-					 * The page was split while we momentarily unlocked the
-					 * page. Go back to parent.
+					 * The page was split or deleted while we momentarily
+					 * unlocked the page. Go back to parent.
 					 */
 					UnlockReleaseBuffer(stack->buffer);
 					xlocked = false;
@@ -872,18 +871,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 				}
 			}
 
-			/*
-			 * The page might have been deleted after we scanned the parent
-			 * and saw the downlink.
-			 */
-			if (GistPageIsDeleted(stack->page))
-			{
-				UnlockReleaseBuffer(stack->buffer);
-				xlocked = false;
-				state.stack = stack = stack->parent;
-				continue;
-			}
-
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
@@ -947,6 +934,9 @@ gistFindPath(Relation r, BlockNumber child, OffsetNumber *downlinkoffnum)
 			break;
 		}
 
+		/* currently, internal pages are never deleted */
+		Assert(!GistPageIsDeleted(page));
+
 		top->lsn = BufferGetLSNAtomic(buffer);
 
 		/*
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 8108fbb7d8e..77ae2fb3395 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -377,6 +377,20 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 		MemoryContextSwitchTo(oldcxt);
 	}
 
+	/*
+	 * Check if the page was deleted after we saw the downlink. There's
+	 * nothing of interest on a deleted page. Note that we must do this
+	 * after checking the NSN for concurrent splits! It's possible that
+	 * the page originally contained some tuples that are visible to us,
+	 * but was split so that all the visible tuples were moved to another
+	 * page, and then this page was deleted.
+	 */
+	if (GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		return;
+	}
+
 	so->nPageData = so->curPageData = 0;
 	scan->xs_hitup = NULL;		/* might point into pageDataCxt */
 	if (so->pageDataCxt)
-- 
2.20.1

#72Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#71)
Re: GiST VACUUM

On 27 June 2019, at 16:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I haven't done any testing on this yet. Andrey, would you happen to have an environment ready to test this?

Thanks!

I will do some testing this evening (UTC+5). But I'm not sure I can reliably test wraparound of xids...
I will try to break the code with the usual setup that we used to stress vacuum with deletes and inserts.

Best regards, Andrey Borodin.

#73Andrey Borodin
x4mmm@yandex-team.ru
In reply to: Heikki Linnakangas (#71)
Re: GiST VACUUM

On 27 June 2019, at 16:38, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I haven't done any testing on this yet. Andrey, would you happen to have an environment ready to test this?

The patches do not deadlock, and they do delete pages on the "rescan test" - the setup that we used to detect "left jumps" during development of the physical vacuum. check-world is happy on my machine.
I really like that now there is GISTDeletedPageContents, and that we do not just cast *(FullTransactionId *) PageGetContents(page).

But I have a stupid question again, about this code:

https://github.com/x4m/postgres_g/commit/096d5586537d29ff316ca8ce074bbe1b325879ee#diff-754126824470cb8e68fd5e32af6d3bcaR417

nextFullXid = ReadNextFullTransactionId();
diff = U64FromFullTransactionId(nextFullXid) -
U64FromFullTransactionId(latestRemovedFullXid);
if (diff < MaxTransactionId / 2)
{
TransactionId latestRemovedXid;

// sleep(100500 hours); latestRemovedXid becomes xid from future

latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
xlrec->node);
}

Do we have a race condition here? Can latestRemovedXid wrap around and become an xid from the future?
I understand that it is purely hypothetical, but still, latestRemovedXid is already from the ancient past.

Best regards, Andrey Borodin.

#74Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Andrey Borodin (#73)
Re: GiST VACUUM

On 27/06/2019 20:15, Andrey Borodin wrote:

But I have a stupid question again, about this code:

https://github.com/x4m/postgres_g/commit/096d5586537d29ff316ca8ce074bbe1b325879ee#diff-754126824470cb8e68fd5e32af6d3bcaR417

nextFullXid = ReadNextFullTransactionId();
diff = U64FromFullTransactionId(nextFullXid) -
U64FromFullTransactionId(latestRemovedFullXid);
if (diff < MaxTransactionId / 2)
{
TransactionId latestRemovedXid;

// sleep(100500 hours); latestRemovedXid becomes xid from future

latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
xlrec->node);
}

Do we have a race condition here? Can latestRemovedXid wrap around and become an xid from the future?
I understand that it is purely hypothetical, but still, latestRemovedXid is already from the ancient past.

Good question. No, that can't happen, because this code is in the WAL
redo function. In a standby, the next XID counter only moves forward
when a WAL record is replayed that advances it, and all WAL records are
replayed serially, so that can't happen when we're in the middle of
replaying this record. A comment on that would be good, though.

When I originally had the check like above in the code that created the
WAL record, it had exactly that problem, because in the master the next
XID counter can advance concurrently.

- Heikki

#75Thomas Munro
thomas.munro@gmail.com
In reply to: Heikki Linnakangas (#71)
Re: GiST VACUUM

On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

* I moved the logic to extend a 32-bit XID to 64-bits to a new helper
function in varsup.c.

I'm a bit uneasy about making this into a generally available function
for reuse. I think we should instead come up with a very small number
of sources of fxids that are known to be free of races because of some
well-explained interlocking.

For example, I believe it is safe to convert an xid obtained from a
WAL record during recovery, because (for now) recovery is a single
thread of execution and the next fxid can't advance underneath you.
So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
to propose in another thread (though I've forgotten who wrote it,
maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
would have written) is a safe blessed source of fxids. The rationale
for writing this function instead of an earlier code that looked much
like your proposed helper function, during EDB-internal review of
zheap stuff, was this: if we provide a general purpose xid->fxid
facility, it's virtually guaranteed that someone will eventually use
it to shoot footwards, by acquiring an xid from somewhere, and then
some arbitrary time later extending it to a fxid when no interlocking
guarantees that the later conversion has the correct epoch.
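
For reference, the helper is only a few lines. A sketch, assuming it runs
in the single-threaded startup process so the next fxid cannot advance
underneath it (the version in [1] may differ in detail):

FullTransactionId
XLogRecGetFullXid(XLogReaderState *record)
{
	TransactionId xid = XLogRecGetXid(record);
	FullTransactionId nextFullXid = ReadNextFullTransactionId();
	uint32		epoch = EpochFromFullTransactionId(nextFullXid);

	/*
	 * If the record's xid is numerically greater than the next xid, it
	 * must have been allocated in the previous epoch.
	 */
	if (xid > XidFromFullTransactionId(nextFullXid))
		epoch--;

	return FullTransactionIdFromEpochAndXid(epoch, xid);
}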

I'd like to figure out how to maintain full versions of the
procarray.c-managed xid horizons, without widening the shared memory
representation. I was planning to think harder about that for 13.
Obviously that doesn't help you now. So I'm wondering if you should
just open-code this for now, and put in a comment about why it's safe
and a note that there'll hopefully be a fxid horizon available in a
later release?

[1]: https://github.com/EnterpriseDB/zheap/commit/1203c2fa49f5f872f42ea4a02ccba2b191c45f32

--
Thomas Munro
https://enterprisedb.com

#76Peter Geoghegan
pg@bowt.ie
In reply to: Heikki Linnakangas (#64)
Re: GiST VACUUM

On Thu, Apr 4, 2019 at 8:15 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I think we should fix this in a similar manner in B-tree, too, but that
can be done separately. For B-tree, we need to worry about
backwards-compatibility, but that seems simple enough; we just need to
continue to understand old deleted pages, where the deletion XID is
stored in the page opaque field.

What Postgres versions will the B-Tree fix end up targeting? Sounds
like you plan to backpatch all the way?

--
Peter Geoghegan

#77Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#75)
Re: GiST VACUUM

On Fri, Jun 28, 2019 at 3:32 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

* I moved the logic to extend a 32-bit XID to 64-bits to a new helper
function in varsup.c.

I'm a bit uneasy about making this into a generally available function
for reuse. I think we should instead come up with a very small number
of sources of fxids that are known to be free of races because of some
well-explained interlocking.

I have two more cases in the undo patch series where the same function is
needed and it is safe to use it there. The first place is twophase.c,
for rolling back prepared transactions, where we know that we don't
support aborted xacts that are older than 2 billion, so we can rely on
such a function. We also need it in undodiscard.c to compute the
value of oldestFullXidHavingUnappliedUndo. See the usage of
GetEpochForXid in the undo-processing patches. Now, we might find a way
to avoid using it in one of these places by doing some more work, but I
am not sure we can avoid it in all three places (one proposed by this
patch and two by the undo patchset).

For example, I believe it is safe to convert an xid obtained from a
WAL record during recovery, because (for now) recovery is a single
thread of execution and the next fxid can't advance underneath you.
So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
to propose in another thread (though I've forgotten who wrote it,
maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
would have written) is a safe blessed source of fxids. The rationale
for writing this function instead of an earlier code that looked much
like your proposed helper function, during EDB-internal review of
zheap stuff, was this: if we provide a general purpose xid->fxid
facility, it's virtually guaranteed that someone will eventually use
it to shoot footwards, by acquiring an xid from somewhere, and then
some arbitrary time later extending it to a fxid when no interlocking
guarantees that the later conversion has the correct epoch.

I'd like to figure out how to maintain full versions of the
procarray.c-managed xid horizons, without widening the shared memory
representation. I was planning to think harder about that for 13.
Obviously that doesn't help you now. So I'm wondering if you should
just open-code this for now, and put in a comment about why it's safe
and a note that there'll hopefully be a fxid horizon available in a
later release?

Do you suggest open-coding it in all three places for now? I am
not against open-coding the logic for now, but I am not sure we can
eliminate the need for it, because even if we can do what you are saying
with the procarray.c-managed xid horizons, I think we need to do something
about clog.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#78Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Thomas Munro (#75)
4 attachment(s)
Re: GiST VACUUM

On 28/06/2019 01:02, Thomas Munro wrote:

On Thu, Jun 27, 2019 at 11:38 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

* I moved the logic to extend a 32-bit XID to 64-bits to a new helper
function in varsup.c.

I'm a bit uneasy about making this into a generally available function
for reuse. I think we should instead come up with a very small number
of sources of fxids that are known to be free of races because of some
well-explained interlocking.

For example, I believe it is safe to convert an xid obtained from a
WAL record during recovery, because (for now) recovery is a single
thread of execution and the next fxid can't advance underneath you.
So I think XLogRecGetFullXid(XLogReaderState *record)[1] as I'm about
to propose in another thread (though I've forgotten who wrote it,
maybe Andres, Amit or me, but if it wasn't me then it's exactly what I
would have written) is a safe blessed source of fxids. The rationale
for writing this function instead of an earlier code that looked much
like your proposed helper function, during EDB-internal review of
zheap stuff, was this: if we provide a general purpose xid->fxid
facility, it's virtually guaranteed that someone will eventually use
it to shoot footwards, by acquiring an xid from somewhere, and then
some arbitrary time later extending it to a fxid when no interlocking
guarantees that the later conversion has the correct epoch.

Fair point.

I'd like to figure out how to maintain full versions of the
procarray.c-managed xid horizons, without widening the shared memory
representation. I was planning to think harder about that for 13.
Obviously that doesn't help you now. So I'm wondering if you should
just open-code this for now, and put in a comment about why it's safe
and a note that there'll hopefully be a fxid horizon available in a
later release?

I came up with the attached. Instead of having a general-purpose "widen"
function, it adds GetFullRecentGlobalXmin(), which returns a 64-bit version
of RecentGlobalXmin. That's enough for this GiST vacuum patch.

In addition to that change, I modified the GistPageSetDeleted(),
GistPageSetDeleteXid(), GistPageGetDeleteXid() inline functions a bit. I
merged GistPageSetDeleted() and GistPageSetDeleteXid() into a single
function, to make sure that the delete-XID is always set when a page is
marked as deleted. And I modified GistPageGetDeleteXid() to return
NormalTransactionId (or a FullTransactionId version of it, to be
precise), for Gist pages that were deleted with older PostgreSQL v12
beta versions. The previous patch tripped an assertion in that case,
which is not nice for people binary-upgrading from earlier beta versions.

I did some testing of this with XID wraparound, and it seems to work. I
used the attached bash script for the testing, with a helper contrib
module to consume XIDs faster. It's not very well commented and probably
not too useful for anyone, but I'm attaching it here mainly as a note to
my future self, in case we need to revisit this.
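
For anyone reproducing this without the helper module, a much slower
stand-in is to burn one xid per transaction with pgbench. A hypothetical
two-liner, not the attached script:

echo 'SELECT txid_current();' > burn_xids.sql
pgbench -n -f burn_xids.sql -c 8 -t 10000000 postgres

Each pgbench transaction calls txid_current(), which forces an xid to be
assigned, so the counter advances by one per transaction.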

Unless something comes up, I'll commit this tomorrow.

- Heikki

Attachments:

0001-Refactor-checks-for-deleted-GiST-pages.patchtext/x-patch; name=0001-Refactor-checks-for-deleted-GiST-pages.patchDownload
From bdeb2467211a1ab9e733e070d54dce97c05cf18c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2019 15:57:01 +0300
Subject: [PATCH 1/2] Refactor checks for deleted GiST pages.

The explicit check in gistScanPage() isn't currently really necessary, as
a deleted page is always empty, so the loop would fall through without
doing anything, anyway. But it's a marginal optimization, and it gives a
nice place to attach a comment to explain how it works.
---
 src/backend/access/gist/gist.c    | 40 ++++++++++++-------------------
 src/backend/access/gist/gistget.c | 14 +++++++++++
 2 files changed, 29 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 169bf6fcfed..e9ca4b82527 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -709,14 +709,15 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			continue;
 		}
 
-		if (stack->blkno != GIST_ROOT_BLKNO &&
-			stack->parent->lsn < GistPageGetNSN(stack->page))
+		if ((stack->blkno != GIST_ROOT_BLKNO &&
+			 stack->parent->lsn < GistPageGetNSN(stack->page)) ||
+			GistPageIsDeleted(stack->page))
 		{
 			/*
-			 * Concurrent split detected. There's no guarantee that the
-			 * downlink for this page is consistent with the tuple we're
-			 * inserting anymore, so go back to parent and rechoose the best
-			 * child.
+			 * Concurrent split or page deletion detected. There's no
+			 * guarantee that the downlink for this page is consistent with
+			 * the tuple we're inserting anymore, so go back to parent and
+			 * rechoose the best child.
 			 */
 			UnlockReleaseBuffer(stack->buffer);
 			xlocked = false;
@@ -735,9 +736,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 			GISTInsertStack *item;
 			OffsetNumber downlinkoffnum;
 
-			/* currently, internal pages are never deleted */
-			Assert(!GistPageIsDeleted(stack->page));
-
 			downlinkoffnum = gistchoose(state.r, stack->page, itup, giststate);
 			iid = PageGetItemId(stack->page, downlinkoffnum);
 			idxtuple = (IndexTuple) PageGetItem(stack->page, iid);
@@ -858,12 +856,13 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 					 * leaf/inner is enough to recognize split for root
 					 */
 				}
-				else if (GistFollowRight(stack->page) ||
-						 stack->parent->lsn < GistPageGetNSN(stack->page))
+				else if (GistFollowRight(stack->page) ||
+						 stack->parent->lsn < GistPageGetNSN(stack->page) ||
+						 GistPageIsDeleted(stack->page))
 				{
 					/*
-					 * The page was split while we momentarily unlocked the
-					 * page. Go back to parent.
+					 * The page was split or deleted while we momentarily
+					 * unlocked the page. Go back to parent.
 					 */
 					UnlockReleaseBuffer(stack->buffer);
 					xlocked = false;
@@ -872,18 +871,6 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace,
 				}
 			}
 
-			/*
-			 * The page might have been deleted after we scanned the parent
-			 * and saw the downlink.
-			 */
-			if (GistPageIsDeleted(stack->page))
-			{
-				UnlockReleaseBuffer(stack->buffer);
-				xlocked = false;
-				state.stack = stack = stack->parent;
-				continue;
-			}
-
 			/* now state.stack->(page, buffer and blkno) points to leaf page */
 
 			gistinserttuple(&state, stack, giststate, itup,
@@ -947,6 +934,9 @@ gistFindPath(Relation r, BlockNumber child, OffsetNumber *downlinkoffnum)
 			break;
 		}
 
+		/* currently, internal pages are never deleted */
+		Assert(!GistPageIsDeleted(page));
+
 		top->lsn = BufferGetLSNAtomic(buffer);
 
 		/*
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index 46d08e06350..c5fe2ea3998 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -377,6 +377,20 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances,
 		MemoryContextSwitchTo(oldcxt);
 	}
 
+	/*
+	 * Check if the page was deleted after we saw the downlink. There's
+	 * nothing of interest on a deleted page. Note that we must do this
+	 * after checking the NSN for concurrent splits! It's possible that
+	 * the page originally contained some tuples that are visible to us,
+	 * but was split so that all the visible tuples were moved to another
+	 * page, and then this page was deleted.
+	 */
+	if (GistPageIsDeleted(page))
+	{
+		UnlockReleaseBuffer(buffer);
+		return;
+	}
+
 	so->nPageData = so->curPageData = 0;
 	scan->xs_hitup = NULL;		/* might point into pageDataCxt */
 	if (so->pageDataCxt)
-- 
2.20.1

0002-Use-full-64-bit-XID-for-checking-if-a-deleted-GiST-p.patchtext/x-patch; name=0002-Use-full-64-bit-XID-for-checking-if-a-deleted-GiST-p.patchDownload
From d9f03ccbf30eb25afac78771ba30690554dd97d6 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 22 Jul 2019 15:57:02 +0300
Subject: [PATCH 2/2] Use full 64-bit XID for checking if a deleted GiST page
 is old enough.

Otherwise, after a deleted page gets even older, it becomes unrecyclable
again. B-tree has the same problem, and has had since time immemorial,
but let's at least fix this in GiST, where this is new.
---
 src/backend/access/gist/gistutil.c     | 24 ++++++++++++--
 src/backend/access/gist/gistvacuum.c   |  7 ++--
 src/backend/access/gist/gistxlog.c     | 32 ++++++++++++++----
 src/backend/access/rmgrdesc/gistdesc.c | 11 ++++---
 src/backend/utils/time/snapmgr.c       | 30 +++++++++++++++++
 src/include/access/gist.h              | 45 +++++++++++++++++++++++---
 src/include/access/gist_private.h      |  4 +--
 src/include/access/gistxlog.h          |  6 ++--
 src/include/utils/snapmgr.h            |  3 ++
 9 files changed, 134 insertions(+), 28 deletions(-)

diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 7d1b219bbc8..97260201dcd 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -882,9 +882,27 @@ gistNewBuffer(Relation r)
 bool
 gistPageRecyclable(Page page)
 {
-	return PageIsNew(page) ||
-		(GistPageIsDeleted(page) &&
-		 TransactionIdPrecedes(GistPageGetDeleteXid(page), RecentGlobalXmin));
+	if (PageIsNew(page))
+		return true;
+	if (GistPageIsDeleted(page))
+	{
+		/*
+		 * The page was deleted, but when? If it was just deleted, a scan
+		 * might have seen the downlink to it, and will read the page later.
+		 * As long as that can happen, we must keep the deleted page around as
+		 * a tombstone.
+		 *
+		 * Compare the deletion XID with RecentGlobalXmin. If deleteXid <
+		 * RecentGlobalXmin, then no scan that's still in progress could have
+		 * seen its downlink, and we can recycle it.
+		 */
+		FullTransactionId deletexid_full = GistPageGetDeleteXid(page);
+		FullTransactionId recentxmin_full = GetFullRecentGlobalXmin();
+
+		if (FullTransactionIdPrecedes(deletexid_full, recentxmin_full))
+			return true;
+	}
+	return false;
 }
 
 bytea *
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 4270226eee2..bf754ea6d0d 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -595,7 +595,7 @@ gistdeletepage(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 	ItemId		iid;
 	IndexTuple	idxtuple;
 	XLogRecPtr	recptr;
-	TransactionId txid;
+	FullTransactionId txid;
 
 	/*
 	 * Check that the leaf is still empty and deletable.
@@ -648,14 +648,13 @@ gistdeletepage(IndexVacuumInfo *info, GistBulkDeleteResult *stats,
 	 * currently in progress must have ended.  (That's much more conservative
 	 * than needed, but let's keep it safe and simple.)
 	 */
-	txid = ReadNewTransactionId();
+	txid = ReadNextFullTransactionId();
 
 	START_CRIT_SECTION();
 
 	/* mark the page as deleted */
 	MarkBufferDirty(leafBuffer);
-	GistPageSetDeleteXid(leafPage, txid);
-	GistPageSetDeleted(leafPage);
+	GistPageSetDeleted(leafPage, txid);
 	stats->stats.pages_deleted++;
 
 	/* remove the downlink from the parent */
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 503db34d863..3b28f546465 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -356,8 +356,7 @@ gistRedoPageDelete(XLogReaderState *record)
 	{
 		Page		page = (Page) BufferGetPage(leafBuffer);
 
-		GistPageSetDeleteXid(page, xldata->deleteXid);
-		GistPageSetDeleted(page);
+		GistPageSetDeleted(page, xldata->deleteXid);
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(leafBuffer);
@@ -396,8 +395,27 @@ gistRedoPageReuse(XLogReaderState *record)
 	 */
 	if (InHotStandby)
 	{
-		ResolveRecoveryConflictWithSnapshot(xlrec->latestRemovedXid,
-											xlrec->node);
+		FullTransactionId latestRemovedFullXid = xlrec->latestRemovedFullXid;
+		FullTransactionId nextFullXid = ReadNextFullTransactionId();
+		uint64		diff;
+
+		/*
+		 * ResolveRecoveryConflictWithSnapshot operates on 32-bit
+		 * TransactionIds, so truncate the logged FullTransactionId. If the
+		 * logged value is very old, so that XID wrap-around already happened
+		 * on it, there can't be any snapshots that still see it.
+		 */
+		nextFullXid = ReadNextFullTransactionId();
+		diff = U64FromFullTransactionId(nextFullXid) -
+			U64FromFullTransactionId(latestRemovedFullXid);
+		if (diff < MaxTransactionId / 2)
+		{
+			TransactionId latestRemovedXid;
+
+			latestRemovedXid = XidFromFullTransactionId(latestRemovedFullXid);
+			ResolveRecoveryConflictWithSnapshot(latestRemovedXid,
+												xlrec->node);
+		}
 	}
 }
 
@@ -554,7 +572,7 @@ gistXLogSplit(bool page_is_leaf,
  * downlink from the parent page.
  */
 XLogRecPtr
-gistXLogPageDelete(Buffer buffer, TransactionId xid,
+gistXLogPageDelete(Buffer buffer, FullTransactionId xid,
 				   Buffer parentBuffer, OffsetNumber downlinkOffset)
 {
 	gistxlogPageDelete xlrec;
@@ -578,7 +596,7 @@ gistXLogPageDelete(Buffer buffer, TransactionId xid,
  * Write XLOG record about reuse of a deleted page.
  */
 void
-gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXid)
+gistXLogPageReuse(Relation rel, BlockNumber blkno, FullTransactionId latestRemovedXid)
 {
 	gistxlogPageReuse xlrec_reuse;
 
@@ -591,7 +609,7 @@ gistXLogPageReuse(Relation rel, BlockNumber blkno, TransactionId latestRemovedXi
 	/* XLOG stuff */
 	xlrec_reuse.node = rel->rd_node;
 	xlrec_reuse.block = blkno;
-	xlrec_reuse.latestRemovedXid = latestRemovedXid;
+	xlrec_reuse.latestRemovedFullXid = latestRemovedXid;
 
 	XLogBeginInsert();
 	XLogRegisterData((char *) &xlrec_reuse, SizeOfGistxlogPageReuse);
diff --git a/src/backend/access/rmgrdesc/gistdesc.c b/src/backend/access/rmgrdesc/gistdesc.c
index 767864b58e6..eccb6fd9428 100644
--- a/src/backend/access/rmgrdesc/gistdesc.c
+++ b/src/backend/access/rmgrdesc/gistdesc.c
@@ -26,10 +26,11 @@ out_gistxlogPageUpdate(StringInfo buf, gistxlogPageUpdate *xlrec)
 static void
 out_gistxlogPageReuse(StringInfo buf, gistxlogPageReuse *xlrec)
 {
-	appendStringInfo(buf, "rel %u/%u/%u; blk %u; latestRemovedXid %u",
+	appendStringInfo(buf, "rel %u/%u/%u; blk %u; latestRemovedXid %u:%u",
 					 xlrec->node.spcNode, xlrec->node.dbNode,
 					 xlrec->node.relNode, xlrec->block,
-					 xlrec->latestRemovedXid);
+					 EpochFromFullTransactionId(xlrec->latestRemovedFullXid),
+					 XidFromFullTransactionId(xlrec->latestRemovedFullXid));
 }
 
 static void
@@ -50,8 +51,10 @@ out_gistxlogPageSplit(StringInfo buf, gistxlogPageSplit *xlrec)
 static void
 out_gistxlogPageDelete(StringInfo buf, gistxlogPageDelete *xlrec)
 {
-	appendStringInfo(buf, "deleteXid %u; downlink %u",
-					 xlrec->deleteXid, xlrec->downlinkOffset);
+	appendStringInfo(buf, "deleteXid %u:%u; downlink %u",
+					 EpochFromFullTransactionId(xlrec->deleteXid),
+					 XidFromFullTransactionId(xlrec->deleteXid),
+					 xlrec->downlinkOffset);
 }
 
 void
diff --git a/src/backend/utils/time/snapmgr.c b/src/backend/utils/time/snapmgr.c
index 6690d781379..35d99857871 100644
--- a/src/backend/utils/time/snapmgr.c
+++ b/src/backend/utils/time/snapmgr.c
@@ -956,6 +956,36 @@ xmin_cmp(const pairingheap_node *a, const pairingheap_node *b, void *arg)
 		return 0;
 }
 
+/*
+ * Get current RecentGlobalXmin value, as a FullTransactionId.
+ */
+FullTransactionId
+GetFullRecentGlobalXmin(void)
+{
+	FullTransactionId nextxid_full;
+	uint32		nextxid_epoch;
+	TransactionId nextxid_xid;
+	uint32		epoch;
+
+	Assert(TransactionIdIsNormal(RecentGlobalXmin));
+
+	/*
+	 * Compute the epoch of the target xid from the next XID's epoch.
+	 * This relies on the fact that RecentGlobalXmin must be within the 2
+	 * billion XID horizon from the next XID.
+	 */
+	nextxid_full = ReadNextFullTransactionId();
+	nextxid_epoch = EpochFromFullTransactionId(nextxid_full);
+	nextxid_xid = XidFromFullTransactionId(nextxid_full);
+
+	if (RecentGlobalXmin > nextxid_xid)
+		epoch = nextxid_epoch - 1;
+	else
+		epoch = nextxid_epoch;
+
+	return FullTransactionIdFromEpochAndXid(epoch, RecentGlobalXmin);
+}
+
 /*
  * SnapshotResetXmin
  *
diff --git a/src/include/access/gist.h b/src/include/access/gist.h
index 6902f4115b7..8292956cc09 100644
--- a/src/include/access/gist.h
+++ b/src/include/access/gist.h
@@ -16,6 +16,7 @@
 #ifndef GIST_H
 #define GIST_H
 
+#include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xlogdefs.h"
 #include "storage/block.h"
@@ -140,8 +141,6 @@ typedef struct GISTENTRY
 #define GIST_LEAF(entry) (GistPageIsLeaf((entry)->page))
 
 #define GistPageIsDeleted(page) ( GistPageGetOpaque(page)->flags & F_DELETED)
-#define GistPageSetDeleted(page)	( GistPageGetOpaque(page)->flags |= F_DELETED)
-#define GistPageSetNonDeleted(page) ( GistPageGetOpaque(page)->flags &= ~F_DELETED)
 
 #define GistTuplesDeleted(page) ( GistPageGetOpaque(page)->flags & F_TUPLES_DELETED)
 #define GistMarkTuplesDeleted(page) ( GistPageGetOpaque(page)->flags |= F_TUPLES_DELETED)
@@ -158,9 +157,45 @@ typedef struct GISTENTRY
 #define GistPageGetNSN(page) ( PageXLogRecPtrGet(GistPageGetOpaque(page)->nsn))
 #define GistPageSetNSN(page, val) ( PageXLogRecPtrSet(GistPageGetOpaque(page)->nsn, val))
 
-/* For deleted pages we store last xid which could see the page in scan */
-#define GistPageGetDeleteXid(page) ( ((PageHeader) (page))->pd_prune_xid )
-#define GistPageSetDeleteXid(page, val) ( ((PageHeader) (page))->pd_prune_xid = val)
+
+/*
+ * On a deleted page, we store this struct. A deleted page doesn't contain any
+ * tuples, so we don't use the normal page layout with line pointers. Instead,
+ * this struct is stored right after the standard page header. pd_lower points
+ * to the end of this struct. If we add fields to this struct in the future, we
+ * can distinguish the old and new formats by pd_lower.
+ */
+typedef struct GISTDeletedPageContents
+{
+	/* last xid which could see the page in a scan */
+	FullTransactionId deleteXid;
+} GISTDeletedPageContents;
+
+static inline void
+GistPageSetDeleted(Page page, FullTransactionId deletexid)
+{
+	Assert(PageIsEmpty(page));
+
+	GistPageGetOpaque(page)->flags |= F_DELETED;
+	((PageHeader) page)->pd_lower = MAXALIGN(SizeOfPageHeaderData) + sizeof(GISTDeletedPageContents);
+
+	((GISTDeletedPageContents *) PageGetContents(page))->deleteXid = deletexid;
+}
+
+static inline FullTransactionId
+GistPageGetDeleteXid(Page page)
+{
+	Assert(GistPageIsDeleted(page));
+
+	/* Is the deleteXid field present? */
+	if (((PageHeader) page)->pd_lower >= MAXALIGN(SizeOfPageHeaderData) +
+		offsetof(GISTDeletedPageContents, deleteXid) + sizeof(FullTransactionId))
+	{
+		return ((GISTDeletedPageContents *) PageGetContents(page))->deleteXid;
+	}
+	else
+		return FullTransactionIdFromEpochAndXid(0, FirstNormalTransactionId);
+}
 
 /*
  * Vector of GISTENTRY structs; user-defined methods union and picksplit
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 9e3958398ef..0488d01c9b8 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -426,11 +426,11 @@ extern SplitedPageLayout *gistSplit(Relation r, Page page, IndexTuple *itup,
 
 /* gistxlog.c */
 extern XLogRecPtr gistXLogPageDelete(Buffer buffer,
-									 TransactionId xid, Buffer parentBuffer,
+									 FullTransactionId xid, Buffer parentBuffer,
 									 OffsetNumber downlinkOffset);
 
 extern void gistXLogPageReuse(Relation rel, BlockNumber blkno,
-							  TransactionId latestRemovedXid);
+							  FullTransactionId latestRemovedXid);
 
 extern XLogRecPtr gistXLogUpdate(Buffer buffer,
 								 OffsetNumber *todelete, int ntodelete,
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index 969a5376b5e..e44922d915c 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -83,7 +83,7 @@ typedef struct gistxlogPageSplit
  */
 typedef struct gistxlogPageDelete
 {
-	TransactionId deleteXid;	/* last Xid which could see page in scan */
+	FullTransactionId deleteXid;	/* last Xid which could see page in scan */
 	OffsetNumber downlinkOffset;	/* Offset of downlink referencing this
 									 * page */
 } gistxlogPageDelete;
@@ -98,10 +98,10 @@ typedef struct gistxlogPageReuse
 {
 	RelFileNode node;
 	BlockNumber block;
-	TransactionId latestRemovedXid;
+	FullTransactionId latestRemovedFullXid;
 } gistxlogPageReuse;
 
-#define SizeOfGistxlogPageReuse	(offsetof(gistxlogPageReuse, latestRemovedXid) + sizeof(TransactionId))
+#define SizeOfGistxlogPageReuse	(offsetof(gistxlogPageReuse, latestRemovedFullXid) + sizeof(FullTransactionId))
 
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
diff --git a/src/include/utils/snapmgr.h b/src/include/utils/snapmgr.h
index 58ae3b0c7a1..6641ee510a1 100644
--- a/src/include/utils/snapmgr.h
+++ b/src/include/utils/snapmgr.h
@@ -13,6 +13,7 @@
 #ifndef SNAPMGR_H
 #define SNAPMGR_H
 
+#include "access/transam.h"
 #include "fmgr.h"
 #include "utils/relcache.h"
 #include "utils/resowner.h"
@@ -122,6 +123,8 @@ extern void UnregisterSnapshot(Snapshot snapshot);
 extern Snapshot RegisterSnapshotOnOwner(Snapshot snapshot, ResourceOwner owner);
 extern void UnregisterSnapshotFromOwner(Snapshot snapshot, ResourceOwner owner);
 
+extern FullTransactionId GetFullRecentGlobalXmin(void);
+
 extern void AtSubCommit_Snapshot(int level);
 extern void AtSubAbort_Snapshot(int level);
 extern void AtEOXact_Snapshot(bool isCommit, bool resetXmin);
-- 
2.20.1

gist-vacuum-wraparound-test.shapplication/x-shellscript; name=gist-vacuum-wraparound-test.shDownload
xidtest.tar.gzapplication/gzip; name=xidtest.tar.gzDownload
#79Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#78)
Re: GiST VACUUM

On 22/07/2019 16:09, Heikki Linnakangas wrote:

Unless something comes up, I'll commit this tomorrow.

Pushed this now, to master and REL_12_STABLE.

Now, B-tree indexes still have the same problem, in all versions. Any
volunteers to write a similar fix for B-trees?

- Heikki

#80Peter Geoghegan
pg@bowt.ie
In reply to: Heikki Linnakangas (#79)
Re: GiST VACUUM

On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Pushed this now, to master and REL_12_STABLE.

Now, B-tree indexes still have the same problem, in all versions. Any
volunteers to write a similar fix for B-trees?

I was hoping that you'd work on it. :-)

Any reason to think that it's much different to what you've done with GiST?

--
Peter Geoghegan

#81Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Peter Geoghegan (#80)
Re: GiST VACUUM

On 24/07/2019 21:02, Peter Geoghegan wrote:

On Wed, Jul 24, 2019 at 10:30 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Pushed this now, to master and REL_12_STABLE.

Now, B-tree indexes still have the same problem, in all versions. Any
volunteers to write a similar fix for B-trees?

I was hoping that you'd work on it. :-)

That's probably how it's going to go, but hey, doesn't hurt to ask :-).

Any reason to think that it's much different to what you've done with GiST?

No, it should be very similar.

- Heikki

#82Peter Geoghegan
pg@bowt.ie
In reply to: Heikki Linnakangas (#81)
Re: GiST VACUUM

On Wed, Jul 24, 2019 at 11:33 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

That's probably how it's going to go, but hey, doesn't hurt to ask :-).

I think that it would be fine to be conservative with nbtree, and only
target the master branch. The problem is annoying, certainly, but it's
not likely to make a huge difference for most real world workloads.
OTOH, perhaps the risk is so low that we might as well target
backbranches.

How do you feel about it?

--
Peter Geoghegan