GIN non-intrusive vacuum of posting tree

Started by Andrew Borodinabout 9 years ago4 messages
#1Andrew Borodin
borodin@octonica.com
1 attachment(s)

Hi hackers!

Here is a patch with more concurrency-friendly vacuum of GIN.

===Short problem statement===
Currently GIN vacuum during cleaning posting tree can lock this tree
for a long time without real need.

===Problem statement===
Vacuum of posting tree (function ginVacuumPostingTree() in ginvacuum.c
) is doing two passes through posting tree:

1. ginVacuumPostingTreeLeaves() takes LockBufferForCleanup,
effectively excluding all inserts. Then it traverses down trough tree
taking exclusive lock, effectively excluding all reads. On leaf level
it calls ginVacuumPostingTreeLeaf(), which deletes all dead tuples. If
there are any empty page, root lock is not relaesed, it passes to
stage two.

Interim: vacuum_delay_point(), which can hand for a while, holding
CleanupLock on root.

2. If there are any empty pages, ginScanToDelete() scans through tree,
deleting empty lead pages, then deleting empty inner pages, if they
emerged after leaf page deletion.

===Proposed algorithm===
ginVacuumPostingTreeLeaves() takes SHARED locks on inner pages and
EXCLUSIVE locks on leaf pages. On leaf pages it calls
ginVacuumPostingTreeLeaf().

If ginVacuumPostingTreeLeaves() encounters subtree with all leafs
empty, then it takes LockBufferForCleanup() on page P pointing to
empty subtree. After taking lock on P, ginScanToDelete() is called for
P. For every parent of P we can guarantee that this procedure will not
be necessary: if P was empty subtree itself, we would pass this
procedure to parent (unless P is root, than behavior is effectively
equal to before-patched).

===Testing===
This patch solved a problem encountered by Evgeniy Efimkin and
Vladimir Borodin from Yandex.Mail.
They implemented testbed with GIN index

CREATE TABLE BOX (uid bigint,
lids integer[] NOT NULL
CHECK (array_ndims(lids) = 1));

CREATE OR REPLACE FUNCTION ulids(
i_uid bigint,
i_lids integer[]
) RETURNS bigint[] AS $$
SELECT array_agg((i_uid << 32) | lid)
FROM unnest(i_lids) lid;
$$ LANGUAGE SQL IMMUTABLE STRICT;

CREATE INDEX i_box_uid_lids
ON box USING gin (ulids(uid, lids)) WITH (FASTUPDATE=OFF);

Executing by a pgbench following query

\setrandom uid 1 1500000
\setrandom lid 1 1000
\setrandom lid2 1 1000
\setrandom lid3 1 1000
BEGIN;
insert into box values(:uid,'{:lid,:lid2,:lid3}');
insert into box values(:uid,'{}');
END;

and eventually deleting some of data. This testbed showed VACUUM
holding inserts for up to tenths of seconds. They claim that proposed
patch made vacuum invisible in this test.

Evgeniy, Vladimir, if I missed something or you have something to add,
please join discussion.

===Known issues===
1. Probably, there exists better algorithms. I could not adapt B-tree
vacuum wizardy to GIN, but that does not mean it is impossible.

2. There may be more CleanUp locks taken. Under write-dense scenarios,
this can lead to longer vacuum, but it should not consume more
resources, just wait.

3. I have changed locks sequence during page deleteion. I think that
it is safe, but comments there were in fear of inserts (excluded by
CleanupLock). More details in a code of ginDeletePage().

4. ginVacuumPostingTreeLeaves() traverses children pages after release
of parent lock (event SHARED). This is safe if there is only one
vacuum at a time. Is there a reason to be afraid of concurrent
vacuums?

I will be happy if someone points me were is a best place to read
about B-tree vacuum process. Or if someone will explain me how it
works in a few words.
Dev process is here
https://github.com/x4m/postgres_g/pull/2

Thank you for reading, I'm looking forward to hear your thought on the matter.

Best regards, Andrey Borodin.

Attachments:

gin_nonintrusive_vacuum_v0.difftext/plain; charset=US-ASCII; name=gin_nonintrusive_vacuum_v0.diffDownload
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index a0afec4..dc28c81 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -30,7 +30,7 @@ static void ginFinishSplit(GinBtree btree, GinBtreeStack *stack,
 /*
  * Lock buffer by needed method for search.
  */
-static int
+int
 ginTraverseLock(Buffer buffer, bool searchMode)
 {
 	Page		page;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 2685a1c..bc54284 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -108,75 +108,17 @@ xlogVacuumPage(Relation index, Buffer buffer)
 	PageSetLSN(page, recptr);
 }
 
-static bool
-ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot, Buffer *rootBuffer)
-{
-	Buffer		buffer;
-	Page		page;
-	bool		hasVoidPage = FALSE;
-	MemoryContext oldCxt;
-
-	buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
-								RBM_NORMAL, gvs->strategy);
-	page = BufferGetPage(buffer);
-
-	/*
-	 * We should be sure that we don't concurrent with inserts, insert process
-	 * never release root page until end (but it can unlock it and lock
-	 * again). New scan can't start but previously started ones work
-	 * concurrently.
-	 */
-	if (isRoot)
-		LockBufferForCleanup(buffer);
-	else
-		LockBuffer(buffer, GIN_EXCLUSIVE);
-
-	Assert(GinPageIsData(page));
-
-	if (GinPageIsLeaf(page))
-	{
-		oldCxt = MemoryContextSwitchTo(gvs->tmpCxt);
-		ginVacuumPostingTreeLeaf(gvs->index, buffer, gvs);
-		MemoryContextSwitchTo(oldCxt);
-		MemoryContextReset(gvs->tmpCxt);
-
-		/* if root is a leaf page, we don't desire further processing */
-		if (!isRoot && !hasVoidPage && GinDataLeafPageIsEmpty(page))
-			hasVoidPage = TRUE;
-	}
-	else
-	{
-		OffsetNumber i;
-		bool		isChildHasVoid = FALSE;
-
-		for (i = FirstOffsetNumber; i <= GinPageGetOpaque(page)->maxoff; i++)
-		{
-			PostingItem *pitem = GinDataPageGetPostingItem(page, i);
 
-			if (ginVacuumPostingTreeLeaves(gvs, PostingItemGetBlockNumber(pitem), FALSE, NULL))
-				isChildHasVoid = TRUE;
-		}
-
-		if (isChildHasVoid)
-			hasVoidPage = TRUE;
-	}
+typedef struct DataPageDeleteStack
+{
+	struct DataPageDeleteStack *child;
+	struct DataPageDeleteStack *parent;
 
-	/*
-	 * if we have root and there are empty pages in tree, then we don't
-	 * release lock to go further processing and guarantee that tree is unused
-	 */
-	if (!(isRoot && hasVoidPage))
-	{
-		UnlockReleaseBuffer(buffer);
-	}
-	else
-	{
-		Assert(rootBuffer);
-		*rootBuffer = buffer;
-	}
+	BlockNumber blkno;			/* current block number */
+	BlockNumber leftBlkno;		/* rightest non-deleted page on left */
+	bool		isRoot;
+} DataPageDeleteStack;
 
-	return hasVoidPage;
-}
 
 /*
  * Delete a posting tree page.
@@ -193,8 +135,16 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	BlockNumber rightlink;
 
 	/*
-	 * Lock the pages in the same order as an insertion would, to avoid
+	 * OBSOLETE COMMENT: Lock the pages in the same order as an insertion would, to avoid
 	 * deadlocks: left, then right, then parent.
+	 *
+	 * AB: I will delete this comment and all following single line comments.
+	 * they are here to highlight changes in locking
+	 *
+	 * This function MUST be called only if someone of parent pages hold
+	 * exclusive cleanup lock. This guarantees that no insertions currently
+	 * happen in this subtree. We still acquire Exclusive lock to exclude
+	 * reads. Parent and this page is already locked.
 	 */
 	lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
 								 RBM_NORMAL, gvs->strategy);
@@ -204,10 +154,10 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 								 RBM_NORMAL, gvs->strategy);
 
 	LockBuffer(lBuffer, GIN_EXCLUSIVE);
-	LockBuffer(dBuffer, GIN_EXCLUSIVE);
-	if (!isParentRoot)			/* parent is already locked by
-								 * LockBufferForCleanup() */
-		LockBuffer(pBuffer, GIN_EXCLUSIVE);
+	//LockBuffer(dBuffer, GIN_EXCLUSIVE);
+	//if (!isParentRoot)			/* parent is already locked by
+	//							 * LockBufferForCleanup() */
+	//	LockBuffer(pBuffer, GIN_EXCLUSIVE);
 
 	START_CRIT_SECTION();
 
@@ -271,26 +221,18 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 		PageSetLSN(BufferGetPage(lBuffer), recptr);
 	}
 
-	if (!isParentRoot)
-		LockBuffer(pBuffer, GIN_UNLOCK);
+	//if (!isParentRoot)
+	//	LockBuffer(pBuffer, GIN_UNLOCK);
+	// These comments will be deleted, explanation is upper
 	ReleaseBuffer(pBuffer);
 	UnlockReleaseBuffer(lBuffer);
-	UnlockReleaseBuffer(dBuffer);
+	ReleaseBuffer(dBuffer);//UnlockReleaseBuffer(dBuffer);
 
 	END_CRIT_SECTION();
 
 	gvs->result->pages_deleted++;
 }
 
-typedef struct DataPageDeleteStack
-{
-	struct DataPageDeleteStack *child;
-	struct DataPageDeleteStack *parent;
-
-	BlockNumber blkno;			/* current block number */
-	BlockNumber leftBlkno;		/* rightest non-deleted page on left */
-	bool		isRoot;
-} DataPageDeleteStack;
 
 /*
  * scans posting tree and deletes empty pages
@@ -324,6 +266,10 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
 
 	buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
 								RBM_NORMAL, gvs->strategy);
+
+	if(!isRoot)
+		LockBuffer(buffer, GIN_EXCLUSIVE);
+
 	page = BufferGetPage(buffer);
 
 	Assert(GinPageIsData(page));
@@ -358,6 +304,9 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
 		}
 	}
 
+	if(!isRoot)
+			LockBuffer(buffer, GIN_UNLOCK);
+
 	ReleaseBuffer(buffer);
 
 	if (!meDelete)
@@ -366,37 +315,124 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
 	return meDelete;
 }
 
-static void
-ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno)
+
+/*
+ * Scan through posting tree, delete empty tuples from leaf pages.
+ * Also, this function collects empty subtrees (with all empty leafs).
+ * For parents of these subtrees CleanUp lock is taken, then we call
+ * ScanToDelete. This is done for every inner page, which points to
+ * empty subtree.
+ */
+static bool
+ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot)
 {
-	Buffer		rootBuffer = InvalidBuffer;
-	DataPageDeleteStack root,
-			   *ptr,
-			   *tmp;
+	Buffer		buffer;
+	Page		page;
+	bool		hasVoidPage = FALSE;
+	MemoryContext oldCxt;
 
-	if (ginVacuumPostingTreeLeaves(gvs, rootBlkno, TRUE, &rootBuffer) == FALSE)
+	buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
+								RBM_NORMAL, gvs->strategy);
+	page = BufferGetPage(buffer);
+
+	ginTraverseLock(buffer,false);
+
+	Assert(GinPageIsData(page));
+
+	if (GinPageIsLeaf(page))
 	{
-		Assert(rootBuffer == InvalidBuffer);
-		return;
+		oldCxt = MemoryContextSwitchTo(gvs->tmpCxt);
+		ginVacuumPostingTreeLeaf(gvs->index, buffer, gvs);
+		MemoryContextSwitchTo(oldCxt);
+		MemoryContextReset(gvs->tmpCxt);
+
+		/* if root is a leaf page, we don't desire further processing */
+		if (GinDataLeafPageIsEmpty(page))
+			hasVoidPage = TRUE;
+
+		UnlockReleaseBuffer(buffer);
+
+		return hasVoidPage;
 	}
+	else
+	{
+		OffsetNumber 	i;
+		bool			isChildHasVoid = FALSE;
+		bool			isAnyNonempy = FALSE;
+		OffsetNumber	maxoff = GinPageGetOpaque(page)->maxoff;
+		BlockNumber*	children = palloc(sizeof(BlockNumber) * (maxoff + 1));
 
-	memset(&root, 0, sizeof(DataPageDeleteStack));
-	root.leftBlkno = InvalidBlockNumber;
-	root.isRoot = TRUE;
+		/*
+		 * Read all children BlockNumbers.
+		 * Not sure it is safe if there are many concurrent vacuums.
+		 */
 
-	vacuum_delay_point();
+		for (i = FirstOffsetNumber; i <= maxoff; i++)
+		{
+			PostingItem *pitem = GinDataPageGetPostingItem(page, i);
 
-	ginScanToDelete(gvs, rootBlkno, TRUE, &root, InvalidOffsetNumber);
+			children[i] = PostingItemGetBlockNumber(pitem);
+		}
 
-	ptr = root.child;
-	while (ptr)
-	{
-		tmp = ptr->child;
-		pfree(ptr);
-		ptr = tmp;
+		UnlockReleaseBuffer(buffer);
+
+		for (i = FirstOffsetNumber; i <= maxoff; i++)
+		{
+			if (ginVacuumPostingTreeLeaves(gvs, children[i], FALSE))
+				isChildHasVoid = TRUE;
+			else
+				isAnyNonempy = TRUE;
+		}
+
+		pfree(children);
+
+		vacuum_delay_point();
+
+		/*
+		 * All subtree is empty - just return TRUE to indicate that parent must
+		 * do a cleanup. Unless we are ROOT an there is way to go upper.
+		 */
+
+		if(isChildHasVoid && !isAnyNonempy && !isRoot)
+			return TRUE;
+
+		if(isChildHasVoid)
+		{
+			DataPageDeleteStack root,
+					   *ptr,
+					   *tmp;
+
+			buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
+											RBM_NORMAL, gvs->strategy);
+			LockBufferForCleanup(buffer);
+
+			memset(&root, 0, sizeof(DataPageDeleteStack));
+				root.leftBlkno = InvalidBlockNumber;
+				root.isRoot = TRUE;
+
+			ginScanToDelete(gvs, blkno, TRUE, &root, InvalidOffsetNumber);
+
+			ptr = root.child;
+
+			while (ptr)
+			{
+				tmp = ptr->child;
+				pfree(ptr);
+				ptr = tmp;
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Here we have deleted all empty subtrees */
+		return FALSE;
 	}
+}
 
-	UnlockReleaseBuffer(rootBuffer);
+static void
+ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno)
+{
+	ginVacuumPostingTreeLeaves(gvs, rootBlkno, TRUE);
 }
 
 /*
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index bf589ab..b738f47 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -981,4 +981,6 @@ ginCompareItemPointers(ItemPointer a, ItemPointer b)
 		return -1;
 }
 
+extern int ginTraverseLock(Buffer buffer, bool searchMode);
+
 #endif   /* GIN_PRIVATE_H */
#2Vladimir Borodin
root@simply.name
In reply to: Andrew Borodin (#1)
Re: GIN non-intrusive vacuum of posting tree

28 нояб. 2016 г., в 20:31, Andrew Borodin <borodin@octonica.com> написал(а):

This patch solved a problem encountered by Evgeniy Efimkin and
Vladimir Borodin from Yandex.Mail.

and eventually deleting some of data. This testbed showed VACUUM
holding inserts for up to tenths of seconds. They claim that proposed
patch made vacuum invisible in this test.

Evgeniy, Vladimir, if I missed something or you have something to add,
please join discussion.

Yep, in our load environment the table is not so big (~ 50 GB), GIN index size is ~ 1 GB. And under our load profile we have seen 90 seconds of impossibility to do an insert into the table because of vacuuming this index. I confirm that with this patch we now don’t see any spikes of errors as it was previously.

--
May the force be with you…
https://simply.name

#3Andrew Borodin
borodin@octonica.com
In reply to: Vladimir Borodin (#2)
1 attachment(s)
Re: GIN non-intrusive vacuum of posting tree

Here is v1 of the patch. Now it has changes for README and contains
more comments clarifying changes of locking model.

Also I will elaborate some more on what is patched. Main portion of
changes is made to function ginVacuumPostingTreeLeaves(). Before the
patch it was traversing tree in depth-first fashion, acquiring
exclusive lock on each page and removing dead tuples from leafs. Also
this function used to hold cleanup lock. Now this function is doing
same DFS, but without cleanup lock, acquiring only read locks on inner
pages and exclusive lock on leafs before eliminating dead tuples. If
this function finds empty leafs, it computes minimal subtree,
containing only empty pages and start scan for empty pages from parent
page pointing to found subtree.

This scan acquires cleanup lock on root of scan (not necessarily root
of posting tree). Cleanup lock ensures no inserts are inside subtree.
Scan traverses subtree DF taking exclusive locks from left to right.
For any page being deleted all leftmost pages were locked and unlocked
before. New reads cannot enter subtree, all old readscans were
excluded by lock\unlock. Thus there shall not be deadlocks with
ginStepRight().

We get lock on page being deleted, then on a left page.
ginStepRight() takes lock on left page, than on right page. But it’s
presence is excluded by cleanup lock and DFS scan with locks of upper
and left parts of tree.

Thank you for reading this. Concurrency bothers me a lot. If you see
that anything is wrong or suspicious, please do not hesitate to post
your thoughts.

Best regards, Andrey Borodin.

Attachments:

gin_nonintrusive_vacuum_v1.difftext/plain; charset=US-ASCII; name=gin_nonintrusive_vacuum_v1.diffDownload
diff --git a/src/backend/access/gin/README b/src/backend/access/gin/README
index fade0cb..29dafce 100644
--- a/src/backend/access/gin/README
+++ b/src/backend/access/gin/README
@@ -314,10 +314,16 @@ deleted.
 The previous paragraph's reasoning only applies to searches, and only to
 posting trees. To protect from inserters following a downlink to a deleted
 page, vacuum simply locks out all concurrent insertions to the posting tree,
-by holding a super-exclusive lock on the posting tree root. Inserters hold a
-pin on the root page, but searches do not, so while new searches cannot begin
-while root page is locked, any already-in-progress scans can continue
-concurrently with vacuum. In the entry tree, we never delete pages.
+by holding a super-exclusive lock on the parent page of subtree with deletable
+pages. Inserters hold a pin on the root page, but searches do not, so while
+new searches cannot begin while root page is locked, any already-in-progress
+scans can continue concurrently with vacuum in corresponding subtree of
+posting tree. To exclude interference with readers vacuum takes exclusive
+locks in a depth-first scan in leaf-to-right order of page tuples. Leftmost
+page is never deleted. Thus before deleting any page we obtain exclusive
+lock on any left page, effectively excluding deadlock with any reader, despite
+takinf parent lock first and not holding left lock at all.
+In the entry tree, we never delete pages.
 
 (This is quite different from the mechanism the btree indexam uses to make
 page-deletions safe; it stamps the deleted pages with an XID and keeps the
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index a0afec4..dc28c81 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -30,7 +30,7 @@ static void ginFinishSplit(GinBtree btree, GinBtreeStack *stack,
 /*
  * Lock buffer by needed method for search.
  */
-static int
+int
 ginTraverseLock(Buffer buffer, bool searchMode)
 {
 	Page		page;
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index 2685a1c..9bd2f50 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -108,75 +108,17 @@ xlogVacuumPage(Relation index, Buffer buffer)
 	PageSetLSN(page, recptr);
 }
 
-static bool
-ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot, Buffer *rootBuffer)
-{
-	Buffer		buffer;
-	Page		page;
-	bool		hasVoidPage = FALSE;
-	MemoryContext oldCxt;
-
-	buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
-								RBM_NORMAL, gvs->strategy);
-	page = BufferGetPage(buffer);
-
-	/*
-	 * We should be sure that we don't concurrent with inserts, insert process
-	 * never release root page until end (but it can unlock it and lock
-	 * again). New scan can't start but previously started ones work
-	 * concurrently.
-	 */
-	if (isRoot)
-		LockBufferForCleanup(buffer);
-	else
-		LockBuffer(buffer, GIN_EXCLUSIVE);
-
-	Assert(GinPageIsData(page));
-
-	if (GinPageIsLeaf(page))
-	{
-		oldCxt = MemoryContextSwitchTo(gvs->tmpCxt);
-		ginVacuumPostingTreeLeaf(gvs->index, buffer, gvs);
-		MemoryContextSwitchTo(oldCxt);
-		MemoryContextReset(gvs->tmpCxt);
-
-		/* if root is a leaf page, we don't desire further processing */
-		if (!isRoot && !hasVoidPage && GinDataLeafPageIsEmpty(page))
-			hasVoidPage = TRUE;
-	}
-	else
-	{
-		OffsetNumber i;
-		bool		isChildHasVoid = FALSE;
-
-		for (i = FirstOffsetNumber; i <= GinPageGetOpaque(page)->maxoff; i++)
-		{
-			PostingItem *pitem = GinDataPageGetPostingItem(page, i);
 
-			if (ginVacuumPostingTreeLeaves(gvs, PostingItemGetBlockNumber(pitem), FALSE, NULL))
-				isChildHasVoid = TRUE;
-		}
-
-		if (isChildHasVoid)
-			hasVoidPage = TRUE;
-	}
+typedef struct DataPageDeleteStack
+{
+	struct DataPageDeleteStack *child;
+	struct DataPageDeleteStack *parent;
 
-	/*
-	 * if we have root and there are empty pages in tree, then we don't
-	 * release lock to go further processing and guarantee that tree is unused
-	 */
-	if (!(isRoot && hasVoidPage))
-	{
-		UnlockReleaseBuffer(buffer);
-	}
-	else
-	{
-		Assert(rootBuffer);
-		*rootBuffer = buffer;
-	}
+	BlockNumber blkno;			/* current block number */
+	BlockNumber leftBlkno;		/* rightest non-deleted page on left */
+	bool		isRoot;
+} DataPageDeleteStack;
 
-	return hasVoidPage;
-}
 
 /*
  * Delete a posting tree page.
@@ -193,8 +135,17 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	BlockNumber rightlink;
 
 	/*
-	 * Lock the pages in the same order as an insertion would, to avoid
+	 * OBSOLETE COMMENT: Lock the pages in the same order as an insertion would, to avoid
 	 * deadlocks: left, then right, then parent.
+	 *
+	 * AB: I will delete this comment and all following single line comments.
+	 * they are here to highlight changes in locking
+	 *
+	 * This function MUST be called only if someone of parent pages hold
+	 * exclusive cleanup lock. This guarantees that no insertions currently
+	 * happen in this subtree. Caller also acquire Exclusive lock on deletable
+	 * page and is acquiring and releasing exclusive lock on left page before.
+	 * Left page was locked and released. Then parent and this page are locked.
 	 */
 	lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
 								 RBM_NORMAL, gvs->strategy);
@@ -203,11 +154,11 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 	pBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, parentBlkno,
 								 RBM_NORMAL, gvs->strategy);
 
-	LockBuffer(lBuffer, GIN_EXCLUSIVE);
-	LockBuffer(dBuffer, GIN_EXCLUSIVE);
-	if (!isParentRoot)			/* parent is already locked by
-								 * LockBufferForCleanup() */
-		LockBuffer(pBuffer, GIN_EXCLUSIVE);
+	//LockBuffer(lBuffer, GIN_EXCLUSIVE);
+	//LockBuffer(dBuffer, GIN_EXCLUSIVE);
+	//if (!isParentRoot)			/* parent is already locked by
+	//							 * LockBufferForCleanup() */
+	//	LockBuffer(pBuffer, GIN_EXCLUSIVE);
 
 	START_CRIT_SECTION();
 
@@ -271,26 +222,18 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 		PageSetLSN(BufferGetPage(lBuffer), recptr);
 	}
 
-	if (!isParentRoot)
-		LockBuffer(pBuffer, GIN_UNLOCK);
+	//if (!isParentRoot)
+	//	LockBuffer(pBuffer, GIN_UNLOCK);
+	// These comments will be deleted, explanation is upper
 	ReleaseBuffer(pBuffer);
-	UnlockReleaseBuffer(lBuffer);
-	UnlockReleaseBuffer(dBuffer);
+	ReleaseBuffer(lBuffer);//UnlockReleaseBuffer(lBuffer);
+	ReleaseBuffer(dBuffer);//UnlockReleaseBuffer(dBuffer);
 
 	END_CRIT_SECTION();
 
 	gvs->result->pages_deleted++;
 }
 
-typedef struct DataPageDeleteStack
-{
-	struct DataPageDeleteStack *child;
-	struct DataPageDeleteStack *parent;
-
-	BlockNumber blkno;			/* current block number */
-	BlockNumber leftBlkno;		/* rightest non-deleted page on left */
-	bool		isRoot;
-} DataPageDeleteStack;
 
 /*
  * scans posting tree and deletes empty pages
@@ -324,6 +267,10 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
 
 	buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
 								RBM_NORMAL, gvs->strategy);
+
+	if(!isRoot)
+		LockBuffer(buffer, GIN_EXCLUSIVE);
+
 	page = BufferGetPage(buffer);
 
 	Assert(GinPageIsData(page));
@@ -358,6 +305,9 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
 		}
 	}
 
+	if(!isRoot)
+			LockBuffer(buffer, GIN_UNLOCK);
+
 	ReleaseBuffer(buffer);
 
 	if (!meDelete)
@@ -366,37 +316,124 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
 	return meDelete;
 }
 
-static void
-ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno)
+
+/*
+ * Scan through posting tree, delete empty tuples from leaf pages.
+ * Also, this function collects empty subtrees (with all empty leafs).
+ * For parents of these subtrees CleanUp lock is taken, then we call
+ * ScanToDelete. This is done for every inner page, which points to
+ * empty subtree.
+ */
+static bool
+ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot)
 {
-	Buffer		rootBuffer = InvalidBuffer;
-	DataPageDeleteStack root,
-			   *ptr,
-			   *tmp;
+	Buffer		buffer;
+	Page		page;
+	bool		hasVoidPage = FALSE;
+	MemoryContext oldCxt;
 
-	if (ginVacuumPostingTreeLeaves(gvs, rootBlkno, TRUE, &rootBuffer) == FALSE)
+	buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
+								RBM_NORMAL, gvs->strategy);
+	page = BufferGetPage(buffer);
+
+	ginTraverseLock(buffer,false);
+
+	Assert(GinPageIsData(page));
+
+	if (GinPageIsLeaf(page))
 	{
-		Assert(rootBuffer == InvalidBuffer);
-		return;
+		oldCxt = MemoryContextSwitchTo(gvs->tmpCxt);
+		ginVacuumPostingTreeLeaf(gvs->index, buffer, gvs);
+		MemoryContextSwitchTo(oldCxt);
+		MemoryContextReset(gvs->tmpCxt);
+
+		/* if root is a leaf page, we don't desire further processing */
+		if (GinDataLeafPageIsEmpty(page))
+			hasVoidPage = TRUE;
+
+		UnlockReleaseBuffer(buffer);
+
+		return hasVoidPage;
 	}
+	else
+	{
+		OffsetNumber 	i;
+		bool			isChildHasVoid = FALSE;
+		bool			isAnyNonempy = FALSE;
+		OffsetNumber	maxoff = GinPageGetOpaque(page)->maxoff;
+		BlockNumber*	children = palloc(sizeof(BlockNumber) * (maxoff + 1));
 
-	memset(&root, 0, sizeof(DataPageDeleteStack));
-	root.leftBlkno = InvalidBlockNumber;
-	root.isRoot = TRUE;
+		/*
+		 * Read all children BlockNumbers.
+		 * Not sure it is safe if there are many concurrent vacuums.
+		 */
 
-	vacuum_delay_point();
+		for (i = FirstOffsetNumber; i <= maxoff; i++)
+		{
+			PostingItem *pitem = GinDataPageGetPostingItem(page, i);
 
-	ginScanToDelete(gvs, rootBlkno, TRUE, &root, InvalidOffsetNumber);
+			children[i] = PostingItemGetBlockNumber(pitem);
+		}
 
-	ptr = root.child;
-	while (ptr)
-	{
-		tmp = ptr->child;
-		pfree(ptr);
-		ptr = tmp;
+		UnlockReleaseBuffer(buffer);
+
+		for (i = FirstOffsetNumber; i <= maxoff; i++)
+		{
+			if (ginVacuumPostingTreeLeaves(gvs, children[i], FALSE))
+				isChildHasVoid = TRUE;
+			else
+				isAnyNonempy = TRUE;
+		}
+
+		pfree(children);
+
+		vacuum_delay_point();
+
+		/*
+		 * All subtree is empty - just return TRUE to indicate that parent must
+		 * do a cleanup. Unless we are ROOT an there is way to go upper.
+		 */
+
+		if(isChildHasVoid && !isAnyNonempy && !isRoot)
+			return TRUE;
+
+		if(isChildHasVoid)
+		{
+			DataPageDeleteStack root,
+					   *ptr,
+					   *tmp;
+
+			buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, blkno,
+											RBM_NORMAL, gvs->strategy);
+			LockBufferForCleanup(buffer);
+
+			memset(&root, 0, sizeof(DataPageDeleteStack));
+				root.leftBlkno = InvalidBlockNumber;
+				root.isRoot = TRUE;
+
+			ginScanToDelete(gvs, blkno, TRUE, &root, InvalidOffsetNumber);
+
+			ptr = root.child;
+
+			while (ptr)
+			{
+				tmp = ptr->child;
+				pfree(ptr);
+				ptr = tmp;
+			}
+
+			UnlockReleaseBuffer(buffer);
+		}
+
+		/* Here we have deleted all empty subtrees */
+		return FALSE;
 	}
+}
 
-	UnlockReleaseBuffer(rootBuffer);
+static void
+ginVacuumPostingTree(GinVacuumState *gvs, BlockNumber rootBlkno)
+{
+	ginVacuumPostingTreeLeaves(gvs, rootBlkno, TRUE);
 }
 
 /*
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index bf589ab..b738f47 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -981,4 +981,6 @@ ginCompareItemPointers(ItemPointer a, ItemPointer b)
 		return -1;
 }
 
+extern int ginTraverseLock(Buffer buffer, bool searchMode);
+
 #endif   /* GIN_PRIVATE_H */
#4Robert Haas
robertmhaas@gmail.com
In reply to: Andrew Borodin (#3)
Re: GIN non-intrusive vacuum of posting tree

On Wed, Nov 30, 2016 at 11:38 AM, Andrew Borodin <borodin@octonica.com> wrote:

This scan acquires cleanup lock on root of scan (not necessarily root
of posting tree). Cleanup lock ensures no inserts are inside subtree.
Scan traverses subtree DF taking exclusive locks from left to right.
For any page being deleted all leftmost pages were locked and unlocked
before. New reads cannot enter subtree, all old readscans were
excluded by lock\unlock. Thus there shall not be deadlocks with
ginStepRight().

We get lock on page being deleted, then on a left page.
ginStepRight() takes lock on left page, than on right page. But it’s
presence is excluded by cleanup lock and DFS scan with locks of upper
and left parts of tree.

Thank you for reading this. Concurrency bothers me a lot. If you see
that anything is wrong or suspicious, please do not hesitate to post
your thoughts.

I don't know much about GIN, but this seems like an interesting
improvement. I hope somebody who knows more about GIN will step up to
review it in depth.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers