B-tree parent pointer and checkpoints

Started by Heikki Linnakangas about 15 years ago · 34 messages
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

We have the rm_safe_restartpoint mechanism to ensure that we don't use a
checkpoint that splits a multi-level B-tree insertion as a restart
point. But to my surprise, we don't have anything to protect against the
analogous case during normal operation. This is possible:

1. Split child page. Write WAL records for the child pages.
2. Begin and finish a checkpoint
3. Crash, before writing the WAL record of inserting the child pointer
in the parent B-tree page.
4. Recovery begins at the new checkpoint, never sees the incomplete
split, so it stays incomplete.

In practice that's pretty hard to hit, because a checkpoint takes some
time, while locking the parent page and writing the child pointer is
usually very quick. But it's possible.

It surprises me that we thought of this when we introduced
restartpoints, but this more obvious case during normal operation seems
to have been there forever. Nothing very bad happens if you lose the
parent update, but this would be nice to fix nevertheless.

I bumped into this while thinking about archive recovery - the above can
happen at archive recovery too if the checkpoint is caused by
pg_start_backup().

I think we can fix this by requiring that any multi-WAL-record actions
that are in-progress when a checkpoint starts (at the REDO-pointer) must
finish before the checkpoint record is written. That will close the
issue with restartpoints, archive recovery etc. as well, so we no longer
need to worry about this anywhere else than while performing an online
checkpoint.
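
For concreteness, here's roughly what the backend side of that rule would
look like, as a stand-alone toy sketch in plain C (all names are invented;
this is not the actual nbtree code):

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model: one flag per backend, visible to the checkpointer, that says
 * "I am in the middle of a multi-WAL-record action".  The checkpointer,
 * after deciding the REDO pointer, must not write its checkpoint record
 * while any action that was already flagged at that moment is unfinished.
 */
static volatile bool in_multi_record_action;

static void write_wal_record(const char *what) { printf("WAL: %s\n", what); }

static void
split_page_and_link_parent(void)
{
	/* First WAL record of the action: from here on we must be visible. */
	in_multi_record_action = true;
	write_wal_record("child pages after split");

	/* ...lock the parent page, insert the child pointer... */
	write_wal_record("child pointer inserted into parent");

	/* The action is complete; a checkpoint may now proceed past us. */
	in_multi_record_action = false;
}

int
main(void)
{
	split_page_and_link_parent();
	return 0;
}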

I'm thinking of using the inCommit flag for this, to delay writing the
checkpoint record until all incomplete splits are finished. inCommit
protects against a similar race condition between writing the commit
record and flushing the clog page. Will
obviously need to rename it, and double-check that it's safe: b-tree
splits take longer, and there's no critical section there like there is
in the commit codepath.
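
For reference, the shape of the existing commit-path protection, reduced to
the same kind of toy sketch (stubs only, not the real xact.c functions):

#include <stdbool.h>
#include <stdio.h>

static volatile bool in_commit;		/* stand-in for PGPROC's inCommit flag */

/* Stubs standing in for the real WAL and clog routines. */
static void write_commit_wal_record(void) { printf("commit WAL record\n"); }
static void flush_wal(void)               { printf("WAL flushed\n"); }
static void update_clog(void)             { printf("pg_clog updated\n"); }

static void
record_transaction_commit(void)
{
	/*
	 * While the flag is set, a concurrent checkpoint must not complete.
	 * Otherwise it could set its REDO pointer after the commit WAL record
	 * yet fail to flush the pg_clog update to disk, and a crash would then
	 * lose the commit: the record isn't replayed and clog never made it out.
	 */
	in_commit = true;
	write_commit_wal_record();
	flush_wal();
	update_clog();
	in_commit = false;
}

int
main(void)
{
	record_transaction_commit();
	return 0;
}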

Comments?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#2 Noname
ghatpande@vsnl.net
In reply to: Heikki Linnakangas (#1)
Intelligent RDBMS

Hello All,
I am Vijay Ghatpande from Pune, India. I have been in the IT field for the last 30 years, mainly in ERP design, development and implementation. I have worked on JD Edwards, Oracle Apps and SAP. I was involved in the design and development of an ERP package called ISP – Integrated Software for Production. Now I am independent; I have started my own consultancy, AMS ERP Solutions, and work for SME industries in and around Pune. I am technical and have worked with everything from COBOL and DGL to .NET and C++. I have also worked as an ORACLE and DB2 DBA for many years.

I have a dream project in my mind. It is about relational database management. Please read the paragraph below from the angle of users of applications developed on an RDBMS.
Any application contains many tables, and they are related to each other. The RDBMS keeps the relations. Developers and users have to know these relations to extract the proper data and convert it into information. If multiple tables are involved, the query can be more complex. The main problem with an RDBMS is writing SQL. End users avoid SQL and think that it is technical work. This is a limitation of RDBMS. If SQL became free of tables and relations, I am sure everyone would like it.

For example, MS Dynamics has many tables to store its data. We can keep this data in such a way that, for the developer, MS Dynamics would be one table, say MSD, and the user could extract any data from that table. He need not know the internal table relations. If this happens, life will be easy for many. Application development will be very easy and cost effective. This can be achieved by creating views. I am thinking of an intelligent database where the relations are built in. If this is available in the product itself, then the domain knowledge will be in the RDBMS. My intention is not to hide relations from users, but this will be one of the advantages of an intelligent RDBMS. This will be the next generation of RDBMS. This is going to change the RDBMS and ERP world.

I am requesting you to give me feedback.

With Warm Regards,

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I think we can fix this by requiring that any multi-WAL-record actions
that are in-progress when a checkpoint starts (at the REDO-pointer) must
finish before the checkpoint record is written.

What happens if someone wants to start a new split while the checkpoint
is hanging fire?

regards, tom lane

#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#3)
Re: B-tree parent pointer and checkpoints

On 02.11.2010 16:30, Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I think we can fix this by requiring that any multi-WAL-record actions
that are in-progress when a checkpoint starts (at the REDO-pointer) must
finish before the checkpoint record is written.

What happens if someone wants to start a new split while the checkpoint
is hanging fire?

You mean after CreateCheckPoint has determined the redo pointer, but
before it has written the checkpoint record? The new split can go ahead,
and the checkpoint doesn't need to care about it. Recovery will start at
the redo pointer, so it will see the split record, and will know to
finish the incomplete split if necessary.

The logic is the same as with inCommit. Checkpoint will fetch the list
of in-progress splits some time after determining the redo-pointer. It
will then wait until all of those splits have finished. Any new splits
that begin after fetching the list don't affect the checkpoint.

inCommit can't be used as is, because it's tied to the Xid, but
something similar should work.
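
Roughly, that checkpointer-side logic would be (a toy sketch with invented
names, not tied to xids or PGPROC):

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_BACKENDS 8

/* One flag per backend: set while a multi-WAL-record action is in progress. */
static volatile bool in_multi_record_action[MAX_BACKENDS];

/* Taken by the checkpointer right after the REDO pointer is decided. */
static void
snapshot_actions(bool waiting_for[])
{
	for (int i = 0; i < MAX_BACKENDS; i++)
		waiting_for[i] = in_multi_record_action[i];
}

/*
 * Wait only for the snapshotted actions.  A split that begins after the
 * snapshot doesn't matter: its first WAL record lies after the REDO pointer,
 * so recovery will see it and can finish it if it stays incomplete.
 * (With only a boolean per backend, a new action started by the same backend
 * is indistinguishable from the old one.)
 */
static void
wait_for_snapshot(const bool waiting_for[])
{
	for (int i = 0; i < MAX_BACKENDS; i++)
		while (waiting_for[i] && in_multi_record_action[i])
			usleep(10000);		/* the other backend clears the flag */
}

int
main(void)
{
	bool	waiting_for[MAX_BACKENDS];

	in_multi_record_action[3] = true;	/* backend 3 is mid-split */
	snapshot_actions(waiting_for);		/* REDO pointer decided around here */
	in_multi_record_action[3] = false;	/* backend 3 finishes its split */
	in_multi_record_action[5] = true;	/* backend 5 starts a new split */
	wait_for_snapshot(waiting_for);		/* ignores backend 5, returns */
	printf("safe to write the checkpoint record\n");
	return 0;
}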

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#4)
1 attachment(s)
Re: B-tree parent pointer and checkpoints

On 02.11.2010 16:40, Heikki Linnakangas wrote:

On 02.11.2010 16:30, Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I think we can fix this by requiring that any multi-WAL-record actions
that are in-progress when a checkpoint starts (at the REDO-pointer) must
finish before the checkpoint record is written.

What happens if someone wants to start a new split while the checkpoint
is hanging fire?

You mean after CreateCheckPoint has determined the redo pointer, but
before it has written the checkpoint record? The new split can go ahead,
and the checkpoint doesn't need to care about it. Recovery will start at
the redo pointer, so it will see the split record, and will know to
finish the incomplete split if necessary.

The logic is the same as with inCommit. Checkpoint will fetch the list
of in-progress splits some time after determining the redo-pointer. It
will then wait until all of those splits have finished. Any new splits
that begin after fetching the list don't affect the checkpoint.

inCommit can't be used as is, because it's tied to the Xid, but
something similar should work.

Here's a first draft of this, using the inCommit flag as is. It works,
but suffers from starvation if you have a lot of concurrent
multi-WAL-record actions. I tested that by running INSERTs into a table
with a tsvector field and a GiST index on it, from five concurrent
sessions, and saw checkpoints regularly busy-waiting for over a minute.

To avoid that, we need something a little bit more complicated than a
boolean flag. I'm thinking of adding a counter beside the inCommit flag
that's incremented every time a new multi-WAL-record action begins, so
that the checkpoint process can distinguish between a new action that
was started after deciding the REDO pointer and an old one that's still
running.

(inCommit is a misnomer now, of course. Will need to find a better name..)
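
Sketched in the same toy style (invented names; the real version would live
in PGPROC and the proc array), the counter idea looks like this:

#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_BACKENDS 8

typedef struct
{
	volatile bool	holding;	/* inside a multi-WAL-record action */
	volatile int	counter;	/* bumped each time a new action begins */
} ActionState;

static ActionState backends[MAX_BACKENDS];

/* Backend side: bracket each multi-WAL-record action. */
static void
begin_action(int i)
{
	backends[i].counter++;
	backends[i].holding = true;
}

static void
end_action(int i)
{
	backends[i].holding = false;
}

/* Checkpointer side: snapshot (flag, counter) pairs right after REDO. */
static void
snapshot_actions(bool waiting[], int seen[])
{
	for (int i = 0; i < MAX_BACKENDS; i++)
	{
		waiting[i] = backends[i].holding;
		seen[i] = backends[i].counter;
	}
}

/*
 * Stop waiting for a backend as soon as its flag is clear *or* its counter
 * differs from the snapshot: either way the action that was in progress when
 * REDO was decided has finished, and whatever the backend is doing now began
 * after REDO, so recovery will see it.
 */
static void
wait_for_snapshot(const bool waiting[], const int seen[])
{
	for (int i = 0; i < MAX_BACKENDS; i++)
		while (waiting[i] &&
			   backends[i].holding &&
			   backends[i].counter == seen[i])
			usleep(10000);
}

int
main(void)
{
	bool	waiting[MAX_BACKENDS];
	int		seen[MAX_BACKENDS];

	begin_action(2);					/* old action in progress */
	snapshot_actions(waiting, seen);	/* REDO pointer decided around here */
	end_action(2);						/* old action finishes */
	begin_action(2);					/* same backend starts a new one */
	wait_for_snapshot(waiting, seen);	/* returns: counter has moved on */
	printf("safe to write the checkpoint record\n");
	return 0;
}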

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

split-delay-checkpoint-1.patch (text/x-diff)
diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c
index 070cd92..ec95f89 100644
--- a/src/backend/access/gin/ginbtree.c
+++ b/src/backend/access/gin/ginbtree.c
@@ -17,6 +17,7 @@
 #include "access/gin.h"
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
+#include "storage/proc.h"
 #include "utils/rel.h"
 
 /*
@@ -281,6 +282,7 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
 	Page		page,
 				rpage,
 				lpage;
+	bool		splitInProgress = false;
 
 	/* remember root BlockNumber */
 	while (parent)
@@ -318,7 +320,7 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
 
 			freeGinBtreeStack(stack);
 
-			return;
+			break; /* don't need to recurse to parent */
 		}
 		else
 		{
@@ -326,6 +328,20 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
 			Page		newlpage;
 
 			/*
+			 * Mark ourselves as within a "checkpoint critical section". This
+			 * forces any concurrent checkpoint to wait until we've inserted the
+			 * parent pointer too. Without this, it is possible for the checkpoint to
+			 * set REDO after the split WAL record and quickly finish the checkpoint
+			 * before the insertion of the parent pointer has been WAL-logged. Replay
+			 * from such a checkpoint would not know that the split is incomplete.
+			 */
+			if (!btree->index->rd_istemp)
+			{
+				splitInProgress = true;
+				MyProc->inCommit = true;
+			}
+
+			/*
 			 * newlpage is a pointer to memory page, it doesn't associate with
 			 * buffer, stack->buffer should be untouched
 			 */
@@ -402,7 +418,7 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
 						buildStats->nEntryPages++;
 				}
 
-				return;
+				break; /* don't need to recurse to parent */
 			}
 			else
 			{
@@ -472,4 +488,11 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
 		pfree(stack);
 		stack = parent;
 	}
+
+	/*
+	 * If we had to split a page, let checkpointer know that we're finished
+	 * now.
+	 */
+	if (splitInProgress)
+		MyProc->inCommit = false;
 }
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 9fd6167..a774a67 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -866,11 +866,3 @@ gin_xlog_cleanup(void)
 	MemoryContextDelete(opCtx);
 	incomplete_splits = NIL;
 }
-
-bool
-gin_safe_restartpoint(void)
-{
-	if (incomplete_splits)
-		return false;
-	return true;
-}
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 3054f98..4d4d964 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/indexfsm.h"
+#include "storage/proc.h"
 #include "utils/memutils.h"
 
 const XLogRecPtr XLogRecPtrForTemp = {1, 1};
@@ -272,12 +273,27 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
 	state.r = r;
 	state.key = itup->t_tid;
 	state.needInsertComplete = true;
+	/*
+	 * Mark ourselves as within a "checkpoint critical section". This
+	 * forces any concurrent checkpoint to wait until we've completed the
+	 * insert. Without this, it is possible for the checkpoint to set REDO
+	 * in the middle of a multi-record insert sequence, and quickly finish
+	 * the checkpoint before the next related page update record has been
+	 * WAL-logged. Replay from such a checkpoint would not see any of the
+	 * GiST WAL records, and would therefore not know that the operation
+	 * is incomplete.
+	 */
+	if (!r->rd_istemp)
+		MyProc->inCommit = true;
 
 	state.stack = (GISTInsertStack *) palloc0(sizeof(GISTInsertStack));
 	state.stack->blkno = GIST_ROOT_BLKNO;
 
 	gistfindleaf(&state, giststate);
 	gistmakedeal(&state, giststate);
+
+	if (!r->rd_istemp)
+		MyProc->inCommit = false;
 }
 
 static bool
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index a90303e..7fdbe77 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -838,14 +838,6 @@ gist_xlog_cleanup(void)
 	MemoryContextDelete(insertCtx);
 }
 
-bool
-gist_safe_restartpoint(void)
-{
-	if (incomplete_inserts)
-		return false;
-	return true;
-}
-
 
 XLogRecData *
 formSplitRdata(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index eaad812..eb510bb 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -21,6 +21,7 @@
 #include "miscadmin.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
+#include "storage/proc.h"
 #include "utils/inval.h"
 #include "utils/tqual.h"
 
@@ -63,7 +64,8 @@ static void _bt_insertonpg(Relation rel, Buffer buf,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
-			   bool split_only_page);
+			   bool split_only_page,
+			   bool isleaf);
 static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 		  OffsetNumber newitemoff, Size newitemsz,
 		  IndexTuple newitem, bool newitemonleft);
@@ -176,7 +178,7 @@ top:
 	{
 		/* do the insertion */
 		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, heapRel);
-		_bt_insertonpg(rel, buf, stack, itup, offset, false);
+		_bt_insertonpg(rel, buf, stack, itup, offset, false, true);
 	}
 	else
 	{
@@ -660,7 +662,8 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
-			   bool split_only_page)
+			   bool split_only_page,
+			   bool isleaf)
 {
 	Page		page;
 	BTPageOpaque lpageop;
@@ -714,6 +717,10 @@ _bt_insertonpg(Relation rel,
 		 *----------
 		 */
 		_bt_insert_parent(rel, buf, rbuf, stack, is_root, is_only);
+
+		/* Finish the "checkpoint critical section" started in _bt_split() */
+		if (!rel->rd_istemp && isleaf)
+			MyProc->inCommit = false;
 	}
 	else
 	{
@@ -1139,6 +1146,17 @@ _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
 	START_CRIT_SECTION();
 
 	/*
+	 * Mark ourselves as within a "checkpoint critical section". This
+	 * forces any concurrent checkpoint to wait until we've inserted the
+	 * parent pointer too. Without this, it is possible for the checkpoint to
+	 * set REDO after the split WAL record and quickly finish the checkpoint
+	 * before the insertion of the parent pointer has been WAL-logged. Replay
+	 * from such a checkpoint would not know that the split is incomplete.
+	 */
+	if (!rel->rd_istemp)
+		MyProc->inCommit = true;
+
+	/*
 	 * By here, the original data page has been split into two new halves, and
 	 * these are correct.  The algorithm requires that the left page never
 	 * move during a split, so we copy the new left page back on top of the
@@ -1683,7 +1701,7 @@ _bt_insert_parent(Relation rel,
 		/* Recursively update the parent */
 		_bt_insertonpg(rel, pbuf, stack->bts_parent,
 					   new_item, stack->bts_offset + 1,
-					   is_only);
+					   is_only, false);
 
 		/* be tidy */
 		pfree(new_item);
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 0822f5c..bbe01e1 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -1257,11 +1257,3 @@ btree_xlog_cleanup(void)
 	}
 	incomplete_actions = NIL;
 }
-
-bool
-btree_safe_restartpoint(void)
-{
-	if (incomplete_actions)
-		return false;
-	return true;
-}
diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index eaac139..c72b534 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -542,6 +542,19 @@ replay code has to do the insertion on its own to restore the index to
 consistency.  Such insertions occur after WAL is operational, so they can
 and should write WAL records for the additional generated actions.
 
+Finishing incomplete actions at the end of recovery only works if recovery
+sees the WAL record that starts the action. Because of that, you need to
+ensure that a checkpoint doesn't happen in the middle of such an operation.
+To be more precise, a checkpoint mustn't begin and finish while a
+multi-record WAL action is in progress but no WAL record belonging to the
+action lies between the REDO pointer and the checkpoint record. Recovery
+from such a checkpoint would not see any evidence of the still-incomplete
+multi-record action, so it would not know that it needs to be finished at
+the end of recovery. To prevent that, you must set MyProc->inCommit before
+writing the first WAL record of the operation, and clear it after the
+operation is finished. The checkpoint code checks for that after deciding
+the REDO pointer, waiting until it sees that the inCommit flag of every
+backend is clear before writing the checkpoint record.
 
 Write-Ahead Logging for Filesystem Actions
 ------------------------------------------
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index c706e97..c3fcfb4 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -26,20 +26,20 @@
 
 
 const RmgrData RmgrTable[RM_MAX_ID + 1] = {
-	{"XLOG", xlog_redo, xlog_desc, NULL, NULL, NULL},
-	{"Transaction", xact_redo, xact_desc, NULL, NULL, NULL},
-	{"Storage", smgr_redo, smgr_desc, NULL, NULL, NULL},
-	{"CLOG", clog_redo, clog_desc, NULL, NULL, NULL},
-	{"Database", dbase_redo, dbase_desc, NULL, NULL, NULL},
-	{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
-	{"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
-	{"RelMap", relmap_redo, relmap_desc, NULL, NULL, NULL},
-	{"Standby", standby_redo, standby_desc, NULL, NULL, NULL},
-	{"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},
-	{"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
-	{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},
-	{"Hash", hash_redo, hash_desc, NULL, NULL, NULL},
-	{"Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup, gin_safe_restartpoint},
-	{"Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, gist_safe_restartpoint},
-	{"Sequence", seq_redo, seq_desc, NULL, NULL, NULL}
+	{"XLOG", xlog_redo, xlog_desc, NULL, NULL},
+	{"Transaction", xact_redo, xact_desc, NULL, NULL},
+	{"Storage", smgr_redo, smgr_desc, NULL, NULL},
+	{"CLOG", clog_redo, clog_desc, NULL, NULL},
+	{"Database", dbase_redo, dbase_desc, NULL, NULL},
+	{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL},
+	{"MultiXact", multixact_redo, multixact_desc, NULL, NULL},
+	{"RelMap", relmap_redo, relmap_desc, NULL, NULL},
+	{"Standby", standby_redo, standby_desc, NULL, NULL},
+	{"Heap2", heap2_redo, heap2_desc, NULL, NULL},
+	{"Heap", heap_redo, heap_desc, NULL, NULL},
+	{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup},
+	{"Hash", hash_redo, hash_desc, NULL, NULL},
+	{"Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup},
+	{"Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup},
+	{"Sequence", seq_redo, seq_desc, NULL, NULL}
 };
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 786d0c6..15091af 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7302,8 +7302,11 @@ CreateCheckPoint(int flags)
 	nInCommit = GetTransactionsInCommit(&inCommitXids);
 	if (nInCommit > 0)
 	{
+		int i = 0;
 		do
 		{
+			if ((++i) % 10 == 0)
+				elog(LOG, "waiting, %d try", i);
 			pg_usleep(10000L);	/* wait for 10 msec */
 		} while (HaveTransactionsInCommit(inCommitXids, nInCommit));
 	}
@@ -7554,31 +7557,10 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 static void
 RecoveryRestartPoint(const CheckPoint *checkPoint)
 {
-	int			rmid;
-
 	/* use volatile pointer to prevent code rearrangement */
 	volatile XLogCtlData *xlogctl = XLogCtl;
 
 	/*
-	 * Is it safe to checkpoint?  We must ask each of the resource managers
-	 * whether they have any partial state information that might prevent a
-	 * correct restart from this point.  If so, we skip this opportunity, but
-	 * return at the next checkpoint record for another try.
-	 */
-	for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
-	{
-		if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
-			if (!(RmgrTable[rmid].rm_safe_restartpoint()))
-			{
-				elog(trace_recovery(DEBUG2), "RM %d not safe to record restart point at %X/%X",
-					 rmid,
-					 checkPoint->redo.xlogid,
-					 checkPoint->redo.xrecoff);
-				return;
-			}
-	}
-
-	/*
 	 * Copy the checkpoint record to shared memory, so that bgwriter can use
 	 * it the next time it wants to perform a restartpoint.
 	 */
diff --git a/src/include/access/gin.h b/src/include/access/gin.h
index e2d7b45..b5c62df 100644
--- a/src/include/access/gin.h
+++ b/src/include/access/gin.h
@@ -400,7 +400,6 @@ extern void gin_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void gin_desc(StringInfo buf, uint8 xl_info, char *rec);
 extern void gin_xlog_startup(void);
 extern void gin_xlog_cleanup(void);
-extern bool gin_safe_restartpoint(void);
 
 /* ginbtree.c */
 
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 34cc5d5..9f0b18c 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -252,7 +252,6 @@ extern void gist_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void gist_desc(StringInfo buf, uint8 xl_info, char *rec);
 extern void gist_xlog_startup(void);
 extern void gist_xlog_cleanup(void);
-extern bool gist_safe_restartpoint(void);
 extern IndexTuple gist_form_invalid_tuple(BlockNumber blkno);
 
 extern XLogRecData *formUpdateRdata(RelFileNode node, Buffer buffer,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 3bbc4d1..4fdc25d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -647,6 +647,5 @@ extern void btree_redo(XLogRecPtr lsn, XLogRecord *record);
 extern void btree_desc(StringInfo buf, uint8 xl_info, char *rec);
 extern void btree_xlog_startup(void);
 extern void btree_xlog_cleanup(void);
-extern bool btree_safe_restartpoint(void);
 
 #endif   /* NBTREE_H */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 370c989..866b0d2 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -250,7 +250,6 @@ typedef struct RmgrData
 	void		(*rm_desc) (StringInfo buf, uint8 xl_info, char *rec);
 	void		(*rm_startup) (void);
 	void		(*rm_cleanup) (void);
-	bool		(*rm_safe_restartpoint) (void);
 } RmgrData;
 
 extern const RmgrData RmgrTable[];
#6 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#5)
1 attachment(s)
Re: B-tree parent pointer and checkpoints

On 08.11.2010 15:40, Heikki Linnakangas wrote:

Here's a first draft of this, using the inCommit flag as is. It works,
but suffers from starvation if you have a lot of concurrent
multi-WAL-record actions. I tested that by running INSERTs into a table
with a tsvector field and a GiST index on it, from five concurrent
sessions, and saw checkpoints regularly busy-waiting for over a minute.

To avoid that, we need something a little bit more complicated than a
boolean flag. I'm thinking of adding a counter beside the inCommit flag
that's incremented every time a new multi-WAL-record action begins, so
that the checkpoint process can distinguish between a new action that
was started after deciding the REDO pointer and an old one that's still
running.

(inCommit is a misnomer now, of course. Will need to find a better name..)

Here's a 2nd version, with an additional counter in PGPROC to avoid
starving the checkpoint in the face of a constant stream of e.g. GiST inserts.

The new rule is that before you start a multi-WAL-record operation that
needs to be completed at end of recovery if you crash in the middle, you
call HoldCheckpoint(), and once you're finished, ResumeCheckpoint().
rm_safe_restartpoint() is gone.
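
The calling convention, reduced to a stand-alone sketch with stub definitions
in place of the patch's functions (the split steps are placeholders):

#include <stdbool.h>
#include <stdio.h>

/*
 * Stub definitions standing in for the patch's HoldCheckpoint() and
 * ResumeCheckpoint(); in the patch they set MyProc->holdCheckpoint and bump
 * MyProc->chkptHoldCounter in shared memory.
 */
static bool hold_checkpoint;

static void HoldCheckpoint(void)   { hold_checkpoint = true; }
static void ResumeCheckpoint(void) { hold_checkpoint = false; }

/* Placeholders for the WAL-logged steps of the operation. */
static void write_split_wal_record(void)  { printf("split WAL record\n"); }
static void write_parent_wal_record(void) { printf("parent pointer WAL record\n"); }

/*
 * The calling convention: bracket the whole multi-WAL-record operation,
 * skipping the interlock for temporary relations, which aren't WAL-logged.
 */
static void
split_and_link_parent(bool rel_is_temp)
{
	if (!rel_is_temp)
		HoldCheckpoint();

	write_split_wal_record();
	write_parent_wal_record();

	if (!rel_is_temp)
		ResumeCheckpoint();
}

int
main(void)
{
	split_and_link_parent(false);
	return 0;
}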

This is a pre-existing bug, but given the lack of field reports and the
fact that it's pretty darn hard to run into this in real life, I'm
inclined to not backpatch this.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

split-delay-checkpoint-2.patch (text/x-diff)
*** a/src/backend/access/gin/ginbtree.c
--- b/src/backend/access/gin/ginbtree.c
***************
*** 17,22 ****
--- 17,23 ----
  #include "access/gin.h"
  #include "miscadmin.h"
  #include "storage/bufmgr.h"
+ #include "storage/proc.h"
  #include "utils/rel.h"
  
  /*
***************
*** 281,286 **** ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
--- 282,288 ----
  	Page		page,
  				rpage,
  				lpage;
+ 	bool		splitInProgress = false;
  
  	/* remember root BlockNumber */
  	while (parent)
***************
*** 318,324 **** ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  
  			freeGinBtreeStack(stack);
  
! 			return;
  		}
  		else
  		{
--- 320,326 ----
  
  			freeGinBtreeStack(stack);
  
! 			break; /* don't need to recurse to parent */
  		}
  		else
  		{
***************
*** 326,331 **** ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
--- 328,344 ----
  			Page		newlpage;
  
  			/*
+ 			 * Hold off checkpoints until we've finished this split by
+ 			 * inserting the parent pointer. Replay might miss the incomplete
+ 			 * split otherwise.
+ 			 */
+ 			if (!btree->index->rd_istemp)
+ 			{
+ 				splitInProgress = true;
+ 				HoldCheckpoint();
+ 			}
+ 
+ 			/*
  			 * newlpage is a pointer to memory page, it doesn't associate with
  			 * buffer, stack->buffer should be untouched
  			 */
***************
*** 402,408 **** ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
  						buildStats->nEntryPages++;
  				}
  
! 				return;
  			}
  			else
  			{
--- 415,421 ----
  						buildStats->nEntryPages++;
  				}
  
! 				break; /* don't need to recurse to parent */
  			}
  			else
  			{
***************
*** 472,475 **** ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
--- 485,495 ----
  		pfree(stack);
  		stack = parent;
  	}
+ 
+ 	/*
+ 	 * If we had to split a page, let checkpointer know that we're now
+ 	 * finished with it.
+ 	 */
+ 	if (splitInProgress)
+ 		ResumeCheckpoint();
  }
*** a/src/backend/access/gin/ginxlog.c
--- b/src/backend/access/gin/ginxlog.c
***************
*** 866,876 **** gin_xlog_cleanup(void)
  	MemoryContextDelete(opCtx);
  	incomplete_splits = NIL;
  }
- 
- bool
- gin_safe_restartpoint(void)
- {
- 	if (incomplete_splits)
- 		return false;
- 	return true;
- }
--- 866,868 ----
*** a/src/backend/access/gist/gist.c
--- b/src/backend/access/gist/gist.c
***************
*** 20,25 ****
--- 20,26 ----
  #include "miscadmin.h"
  #include "storage/bufmgr.h"
  #include "storage/indexfsm.h"
+ #include "storage/proc.h"
  #include "utils/memutils.h"
  
  const XLogRecPtr XLogRecPtrForTemp = {1, 1};
***************
*** 272,283 **** gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate)
--- 273,290 ----
  	state.r = r;
  	state.key = itup->t_tid;
  	state.needInsertComplete = true;
+ 	/* Hold off checkpoints until we've completed this insertion. */
+ 	if (!r->rd_istemp)
+ 		HoldCheckpoint();
  
  	state.stack = (GISTInsertStack *) palloc0(sizeof(GISTInsertStack));
  	state.stack->blkno = GIST_ROOT_BLKNO;
  
  	gistfindleaf(&state, giststate);
  	gistmakedeal(&state, giststate);
+ 
+ 	if (!r->rd_istemp)
+ 		ResumeCheckpoint();
  }
  
  static bool
*** a/src/backend/access/gist/gistxlog.c
--- b/src/backend/access/gist/gistxlog.c
***************
*** 838,851 **** gist_xlog_cleanup(void)
  	MemoryContextDelete(insertCtx);
  }
  
- bool
- gist_safe_restartpoint(void)
- {
- 	if (incomplete_inserts)
- 		return false;
- 	return true;
- }
- 
  
  XLogRecData *
  formSplitRdata(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
--- 838,843 ----
*** a/src/backend/access/nbtree/nbtinsert.c
--- b/src/backend/access/nbtree/nbtinsert.c
***************
*** 21,26 ****
--- 21,27 ----
  #include "miscadmin.h"
  #include "storage/bufmgr.h"
  #include "storage/lmgr.h"
+ #include "storage/proc.h"
  #include "utils/inval.h"
  #include "utils/tqual.h"
  
***************
*** 63,69 **** static void _bt_insertonpg(Relation rel, Buffer buf,
  			   BTStack stack,
  			   IndexTuple itup,
  			   OffsetNumber newitemoff,
! 			   bool split_only_page);
  static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
  		  OffsetNumber newitemoff, Size newitemsz,
  		  IndexTuple newitem, bool newitemonleft);
--- 64,71 ----
  			   BTStack stack,
  			   IndexTuple itup,
  			   OffsetNumber newitemoff,
! 			   bool split_only_page,
! 			   bool isleaf);
  static Buffer _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
  		  OffsetNumber newitemoff, Size newitemsz,
  		  IndexTuple newitem, bool newitemonleft);
***************
*** 176,182 **** top:
  	{
  		/* do the insertion */
  		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, heapRel);
! 		_bt_insertonpg(rel, buf, stack, itup, offset, false);
  	}
  	else
  	{
--- 178,184 ----
  	{
  		/* do the insertion */
  		_bt_findinsertloc(rel, &buf, &offset, natts, itup_scankey, itup, heapRel);
! 		_bt_insertonpg(rel, buf, stack, itup, offset, false, true);
  	}
  	else
  	{
***************
*** 660,666 **** _bt_insertonpg(Relation rel,
  			   BTStack stack,
  			   IndexTuple itup,
  			   OffsetNumber newitemoff,
! 			   bool split_only_page)
  {
  	Page		page;
  	BTPageOpaque lpageop;
--- 662,669 ----
  			   BTStack stack,
  			   IndexTuple itup,
  			   OffsetNumber newitemoff,
! 			   bool split_only_page,
! 			   bool isleaf)
  {
  	Page		page;
  	BTPageOpaque lpageop;
***************
*** 714,719 **** _bt_insertonpg(Relation rel,
--- 717,726 ----
  		 *----------
  		 */
  		_bt_insert_parent(rel, buf, rbuf, stack, is_root, is_only);
+ 
+ 		/* Finish the "checkpoint critical section" started in _bt_split() */
+ 		if (!rel->rd_istemp && isleaf)
+ 			ResumeCheckpoint();
  	}
  	else
  	{
***************
*** 870,875 **** _bt_insertonpg(Relation rel,
--- 877,886 ----
   *
   *		Returns the new right sibling of buf, pinned and write-locked.
   *		The pin and lock on buf are maintained.
+  *
+  *		If it's not a temporary index, _bt_split() calls HoldCheckpoint(),
+  *		and the caller is responsible for calling ResumeCheckpoint() after
+  *		inserting the parent pointer.
   */
  static Buffer
  _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
***************
*** 1139,1144 **** _bt_split(Relation rel, Buffer buf, OffsetNumber firstright,
--- 1150,1167 ----
  	START_CRIT_SECTION();
  
  	/*
+ 	 * Hold off checkpoints until we've inserted the parent pointer. Without
+ 	 * this, it would be possible for the checkpoint to set REDO after the
+ 	 * split WAL record and quickly finish the checkpoint before the insertion
+ 	 * of the parent pointer has been WAL-logged. Incomplete splits are
+ 	 * normally completed at the end of recovery, but if there's no trace of
+ 	 * the split between the redo pointer and checkpoint record, recovery
+ 	 * would not know that the split is incomplete.
+ 	 */
+ 	if (!rel->rd_istemp)
+ 		HoldCheckpoint();
+ 
+ 	/*
  	 * By here, the original data page has been split into two new halves, and
  	 * these are correct.  The algorithm requires that the left page never
  	 * move during a split, so we copy the new left page back on top of the
***************
*** 1683,1689 **** _bt_insert_parent(Relation rel,
  		/* Recursively update the parent */
  		_bt_insertonpg(rel, pbuf, stack->bts_parent,
  					   new_item, stack->bts_offset + 1,
! 					   is_only);
  
  		/* be tidy */
  		pfree(new_item);
--- 1706,1712 ----
  		/* Recursively update the parent */
  		_bt_insertonpg(rel, pbuf, stack->bts_parent,
  					   new_item, stack->bts_offset + 1,
! 					   is_only, false);
  
  		/* be tidy */
  		pfree(new_item);
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
***************
*** 1257,1267 **** btree_xlog_cleanup(void)
  	}
  	incomplete_actions = NIL;
  }
- 
- bool
- btree_safe_restartpoint(void)
- {
- 	if (incomplete_actions)
- 		return false;
- 	return true;
- }
--- 1257,1259 ----
*** a/src/backend/access/transam/README
--- b/src/backend/access/transam/README
***************
*** 542,547 **** replay code has to do the insertion on its own to restore the index to
--- 542,560 ----
  consistency.  Such insertions occur after WAL is operational, so they can
  and should write WAL records for the additional generated actions.
  
+ Finishing incomplete actions at the end of recovery only works if recovery
+ sees the WAL record that starts the action. Because of that, you need to
+ ensure that a checkpoint doesn't happen in the middle of such an operation.
+ To be precise, a checkpoint mustn't begin and finish while a multi-record
+ WAL action is in progress but no WAL record belonging to the action lies
+ between the REDO pointer and the checkpoint record. Recovery from such a
+ checkpoint would not see any evidence of the still-incomplete multi-record
+ action, so it would not know that the operation needs to be finished at the
+ end of recovery. To prevent that, you must call HoldCheckpoint() before
+ writing the first WAL record of the operation, and ResumeCheckpoint() after
+ it's finished. The checkpoint code checks for that after deciding the REDO
+ pointer, waiting until all the operations that were in progress when the
+ checkpoint started have finished, before it writes the checkpoint record.
  
  Write-Ahead Logging for Filesystem Actions
  ------------------------------------------
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 26,45 ****
  
  
  const RmgrData RmgrTable[RM_MAX_ID + 1] = {
! 	{"XLOG", xlog_redo, xlog_desc, NULL, NULL, NULL},
! 	{"Transaction", xact_redo, xact_desc, NULL, NULL, NULL},
! 	{"Storage", smgr_redo, smgr_desc, NULL, NULL, NULL},
! 	{"CLOG", clog_redo, clog_desc, NULL, NULL, NULL},
! 	{"Database", dbase_redo, dbase_desc, NULL, NULL, NULL},
! 	{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
! 	{"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
! 	{"RelMap", relmap_redo, relmap_desc, NULL, NULL, NULL},
! 	{"Standby", standby_redo, standby_desc, NULL, NULL, NULL},
! 	{"Heap2", heap2_redo, heap2_desc, NULL, NULL, NULL},
! 	{"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
! 	{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},
! 	{"Hash", hash_redo, hash_desc, NULL, NULL, NULL},
! 	{"Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup, gin_safe_restartpoint},
! 	{"Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, gist_safe_restartpoint},
! 	{"Sequence", seq_redo, seq_desc, NULL, NULL, NULL}
  };
--- 26,45 ----
  
  
  const RmgrData RmgrTable[RM_MAX_ID + 1] = {
! 	{"XLOG", xlog_redo, xlog_desc, NULL, NULL},
! 	{"Transaction", xact_redo, xact_desc, NULL, NULL},
! 	{"Storage", smgr_redo, smgr_desc, NULL, NULL},
! 	{"CLOG", clog_redo, clog_desc, NULL, NULL},
! 	{"Database", dbase_redo, dbase_desc, NULL, NULL},
! 	{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL},
! 	{"MultiXact", multixact_redo, multixact_desc, NULL, NULL},
! 	{"RelMap", relmap_redo, relmap_desc, NULL, NULL},
! 	{"Standby", standby_redo, standby_desc, NULL, NULL},
! 	{"Heap2", heap2_redo, heap2_desc, NULL, NULL},
! 	{"Heap", heap_redo, heap_desc, NULL, NULL},
! 	{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup},
! 	{"Hash", hash_redo, hash_desc, NULL, NULL},
! 	{"Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup},
! 	{"Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup},
! 	{"Sequence", seq_redo, seq_desc, NULL, NULL}
  };
*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 313,319 **** MarkAsPreparing(TransactionId xid, const char *gid,
  	gxact->proc.backendId = InvalidBackendId;
  	gxact->proc.databaseId = databaseid;
  	gxact->proc.roleId = owner;
! 	gxact->proc.inCommit = false;
  	gxact->proc.vacuumFlags = 0;
  	gxact->proc.lwWaiting = false;
  	gxact->proc.lwExclusive = false;
--- 313,320 ----
  	gxact->proc.backendId = InvalidBackendId;
  	gxact->proc.databaseId = databaseid;
  	gxact->proc.roleId = owner;
! 	gxact->proc.holdCheckpoint = false;
! 	gxact->proc.chkptHoldCounter = 0;
  	gxact->proc.vacuumFlags = 0;
  	gxact->proc.lwWaiting = false;
  	gxact->proc.lwExclusive = false;
***************
*** 1018,1024 **** EndPrepare(GlobalTransaction gxact)
  	 */
  	START_CRIT_SECTION();
  
! 	MyProc->inCommit = true;
  
  	gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE,
  									records.head);
--- 1019,1025 ----
  	 */
  	START_CRIT_SECTION();
  
! 	MyProc->holdCheckpoint = true;
  
  	gxact->prepare_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE,
  									records.head);
***************
*** 1066,1072 **** EndPrepare(GlobalTransaction gxact)
  	 * checkpoint starting after this will certainly see the gxact as a
  	 * candidate for fsyncing.
  	 */
! 	MyProc->inCommit = false;
  
  	END_CRIT_SECTION();
  
--- 1067,1073 ----
  	 * checkpoint starting after this will certainly see the gxact as a
  	 * candidate for fsyncing.
  	 */
! 	MyProc->holdCheckpoint = false;
  
  	END_CRIT_SECTION();
  
***************
*** 1959,1965 **** RecordTransactionCommitPrepared(TransactionId xid,
  	START_CRIT_SECTION();
  
  	/* See notes in RecordTransactionCommit */
! 	MyProc->inCommit = true;
  
  	/* Emit the XLOG commit record */
  	xlrec.xid = xid;
--- 1960,1966 ----
  	START_CRIT_SECTION();
  
  	/* See notes in RecordTransactionCommit */
! 	MyProc->holdCheckpoint = true;
  
  	/* Emit the XLOG commit record */
  	xlrec.xid = xid;
***************
*** 2024,2030 **** RecordTransactionCommitPrepared(TransactionId xid,
  	TransactionIdCommitTree(xid, nchildren, children);
  
  	/* Checkpoint can proceed now */
! 	MyProc->inCommit = false;
  
  	END_CRIT_SECTION();
  }
--- 2025,2031 ----
  	TransactionIdCommitTree(xid, nchildren, children);
  
  	/* Checkpoint can proceed now */
! 	MyProc->holdCheckpoint = false;
  
  	END_CRIT_SECTION();
  }
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 968,975 **** RecordTransactionCommit(void)
  		xlrec.tsId = MyDatabaseTableSpace;
  
  		/*
! 		 * Mark ourselves as within our "commit critical section".	This
! 		 * forces any concurrent checkpoint to wait until we've updated
  		 * pg_clog.  Without this, it is possible for the checkpoint to set
  		 * REDO after the XLOG record but fail to flush the pg_clog update to
  		 * disk, leading to loss of the transaction commit if the system
--- 968,974 ----
  		xlrec.tsId = MyDatabaseTableSpace;
  
  		/*
! 		 * Hold off any concurrent checkpoints until we've updated
  		 * pg_clog.  Without this, it is possible for the checkpoint to set
  		 * REDO after the XLOG record but fail to flush the pg_clog update to
  		 * disk, leading to loss of the transaction commit if the system
***************
*** 979,991 **** RecordTransactionCommit(void)
  		 * RecordTransactionAbort.	That's because loss of a transaction abort
  		 * is noncritical; the presumption would be that it aborted, anyway.
  		 *
! 		 * It's safe to change the inCommit flag of our own backend without
! 		 * holding the ProcArrayLock, since we're the only one modifying it.
! 		 * This makes checkpoint's determination of which xacts are inCommit a
! 		 * bit fuzzy, but it doesn't matter.
  		 */
  		START_CRIT_SECTION();
! 		MyProc->inCommit = true;
  
  		SetCurrentTransactionStopTimestamp();
  		xlrec.xact_time = xactStopTimestamp;
--- 978,992 ----
  		 * RecordTransactionAbort.	That's because loss of a transaction abort
  		 * is noncritical; the presumption would be that it aborted, anyway.
  		 *
! 		 * We don't bother bumping chkptHoldCounter like HoldCheckpoint()
! 		 * does, as the transaction is about to end soon anyway. At worst it
! 		 * means that checkpoint thinks that we're still in some previous
! 		 * multi-record operation that happened earlier in this transaction,
! 		 * and has to wait a bit longer than necessary. Better to avoid any
! 		 * unnecessary overhead in this performance-critical path.
  		 */
  		START_CRIT_SECTION();
! 		MyProc->holdCheckpoint = true;
  
  		SetCurrentTransactionStopTimestamp();
  		xlrec.xact_time = xactStopTimestamp;
***************
*** 1100,1106 **** RecordTransactionCommit(void)
  	 */
  	if (markXidCommitted)
  	{
! 		MyProc->inCommit = false;
  		END_CRIT_SECTION();
  	}
  
--- 1101,1107 ----
  	 */
  	if (markXidCommitted)
  	{
! 		MyProc->holdCheckpoint = false;
  		END_CRIT_SECTION();
  	}
  
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 7111,7118 **** CreateCheckPoint(int flags)
  	uint32		freespace;
  	uint32		_logId;
  	uint32		_logSeg;
- 	TransactionId *inCommitXids;
- 	int			nInCommit;
  
  	/*
  	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
--- 7111,7116 ----
***************
*** 7299,7313 **** CreateCheckPoint(int flags)
  	 * and we will correctly flush the update below.  So we cannot miss any
  	 * xacts we need to wait for.
  	 */
! 	nInCommit = GetTransactionsInCommit(&inCommitXids);
! 	if (nInCommit > 0)
! 	{
! 		do
! 		{
! 			pg_usleep(10000L);	/* wait for 10 msec */
! 		} while (HaveTransactionsInCommit(inCommitXids, nInCommit));
! 	}
! 	pfree(inCommitXids);
  
  	/*
  	 * Get the other info we need for the checkpoint record.
--- 7297,7303 ----
  	 * and we will correctly flush the update below.  So we cannot miss any
  	 * xacts we need to wait for.
  	 */
! 	ProcArrayCheckpointBarrier();
  
  	/*
  	 * Get the other info we need for the checkpoint record.
***************
*** 7554,7584 **** CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
  static void
  RecoveryRestartPoint(const CheckPoint *checkPoint)
  {
- 	int			rmid;
- 
  	/* use volatile pointer to prevent code rearrangement */
  	volatile XLogCtlData *xlogctl = XLogCtl;
  
  	/*
- 	 * Is it safe to checkpoint?  We must ask each of the resource managers
- 	 * whether they have any partial state information that might prevent a
- 	 * correct restart from this point.  If so, we skip this opportunity, but
- 	 * return at the next checkpoint record for another try.
- 	 */
- 	for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
- 	{
- 		if (RmgrTable[rmid].rm_safe_restartpoint != NULL)
- 			if (!(RmgrTable[rmid].rm_safe_restartpoint()))
- 			{
- 				elog(trace_recovery(DEBUG2), "RM %d not safe to record restart point at %X/%X",
- 					 rmid,
- 					 checkPoint->redo.xlogid,
- 					 checkPoint->redo.xrecoff);
- 				return;
- 			}
- 	}
- 
- 	/*
  	 * Copy the checkpoint record to shared memory, so that bgwriter can use
  	 * it the next time it wants to perform a restartpoint.
  	 */
--- 7544,7553 ----
*** a/src/backend/storage/ipc/procarray.c
--- b/src/backend/storage/ipc/procarray.c
***************
*** 367,373 **** ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
  		proc->xmin = InvalidTransactionId;
  		/* must be cleared with xid/xmin: */
  		proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
! 		proc->inCommit = false; /* be sure this is cleared in abort */
  		proc->recoveryConflictPending = false;
  
  		/* Clear the subtransaction-XID cache too while holding the lock */
--- 367,373 ----
  		proc->xmin = InvalidTransactionId;
  		/* must be cleared with xid/xmin: */
  		proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
! 		proc->holdCheckpoint = false; /* be sure this is cleared in abort */
  		proc->recoveryConflictPending = false;
  
  		/* Clear the subtransaction-XID cache too while holding the lock */
***************
*** 394,400 **** ProcArrayEndTransaction(PGPROC *proc, TransactionId latestXid)
  		proc->xmin = InvalidTransactionId;
  		/* must be cleared with xid/xmin: */
  		proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
! 		proc->inCommit = false; /* be sure this is cleared in abort */
  		proc->recoveryConflictPending = false;
  
  		Assert(proc->subxids.nxids == 0);
--- 394,400 ----
  		proc->xmin = InvalidTransactionId;
  		/* must be cleared with xid/xmin: */
  		proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
! 		proc->holdCheckpoint = false; /* be sure this is cleared in abort */
  		proc->recoveryConflictPending = false;
  
  		Assert(proc->subxids.nxids == 0);
***************
*** 427,433 **** ProcArrayClearTransaction(PGPROC *proc)
  
  	/* redundant, but just in case */
  	proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
! 	proc->inCommit = false;
  
  	/* Clear the subtransaction-XID cache too */
  	proc->subxids.nxids = 0;
--- 427,433 ----
  
  	/* redundant, but just in case */
  	proc->vacuumFlags &= ~PROC_VACUUM_STATE_MASK;
! 	proc->holdCheckpoint = false;
  
  	/* Clear the subtransaction-XID cache too */
  	proc->subxids.nxids = 0;
***************
*** 1537,1569 **** GetRunningTransactionData(void)
  }
  
  /*
!  * GetTransactionsInCommit -- Get the XIDs of transactions that are committing
!  *
!  * Constructs an array of XIDs of transactions that are currently in commit
!  * critical sections, as shown by having inCommit set in their PGPROC entries.
!  *
!  * *xids_p is set to a palloc'd array that should be freed by the caller.
!  * The return value is the number of valid entries.
!  *
!  * Note that because backends set or clear inCommit without holding any lock,
!  * the result is somewhat indeterminate, but we don't really care.  Even in
!  * a multiprocessor with delayed writes to shared memory, it should be certain
!  * that setting of inCommit will propagate to shared memory when the backend
!  * takes the WALInsertLock, so we cannot fail to see an xact as inCommit if
!  * it's already inserted its commit record.  Whether it takes a little while
!  * for clearing of inCommit to propagate is unimportant for correctness.
   */
! int
! GetTransactionsInCommit(TransactionId **xids_p)
  {
  	ProcArrayStruct *arrayP = procArray;
! 	TransactionId *xids;
  	int			nxids;
  	int			index;
  
! 	xids = (TransactionId *) palloc(arrayP->maxProcs * sizeof(TransactionId));
  	nxids = 0;
  
  	LWLockAcquire(ProcArrayLock, LW_SHARED);
  
  	for (index = 0; index < arrayP->numProcs; index++)
--- 1537,1586 ----
  }
  
  /*
!  * ProcArrayCheckpointBarrier -- Wait until all multi-record actions finish
!  *
!  * Waits until all backends finish any multi-WAL-record actions, or other
!  * incomplete actions, so that it is safe to perform a checkpoint.
!  * This is called in the checkpoint process after deciding REDO pointer,
!  * but before writing the checkpoint record.
!  *
!  * We don't need to wait for any new multi-record actions that begin after
!  * we've started waiting, they will be seen on recovery from the REDO
!  * pointer.
!  *
!  * Note that because backends set or clear the holdCheckpoint flag and
!  * chkptHoldCounter without holding any lock, the result is somewhat
!  * indeterminate, but we don't really care.  Even in a multiprocessor with
!  * delayed writes to shared memory, it should be certain that setting of
!  * holdCheckpoint will propagate to shared memory when the backend takes the
!  * WALInsertLock to write the first WAL record in the action, so we cannot
!  * fail to see a multi-step action as holding checkpoints if it's already
!  * inserted its first WAL record.  Whether it takes a little while for
!  * clearing of holdCheckpoint to propagate is unimportant for correctness.
!  * Likewise we don't care much whether the change of chkptHoldCounter is
!  * seen as atomic; we just care whether it's the same as or different from
!  * what it was when we started. By the time the backend takes the
!  * WALInsertLock, it should be updated in shared memory, and until then we
!  * don't really need to wait for the backend anyway.
   */
! void
! ProcArrayCheckpointBarrier(void)
  {
  	ProcArrayStruct *arrayP = procArray;
! 	struct xid_barrier {
! 		TransactionId xid;
! 		int32 counter;
! 	} *xids;
  	int			nxids;
  	int			index;
  
! 	xids = (struct xid_barrier *) palloc(arrayP->maxProcs * sizeof(struct xid_barrier));
  	nxids = 0;
  
+ 	/*
+ 	 * Take a snapshot of the chkptHoldCounter values of all backends
+ 	 * currently holding off checkpoints.
+ 	 */
  	LWLockAcquire(ProcArrayLock, LW_SHARED);
  
  	for (index = 0; index < arrayP->numProcs; index++)
***************
*** 1573,1633 **** GetTransactionsInCommit(TransactionId **xids_p)
  		/* Fetch xid just once - see GetNewTransactionId */
  		TransactionId pxid = proc->xid;
  
! 		if (proc->inCommit && TransactionIdIsValid(pxid))
! 			xids[nxids++] = pxid;
  	}
  
  	LWLockRelease(ProcArrayLock);
  
! 	*xids_p = xids;
! 	return nxids;
! }
! 
! /*
!  * HaveTransactionsInCommit -- Are any of the specified XIDs in commit?
!  *
!  * This is used with the results of GetTransactionsInCommit to see if any
!  * of the specified XIDs are still in their commit critical sections.
!  *
!  * Note: this is O(N^2) in the number of xacts that are/were in commit, but
!  * those numbers should be small enough for it not to be a problem.
!  */
! bool
! HaveTransactionsInCommit(TransactionId *xids, int nxids)
! {
! 	bool		result = false;
! 	ProcArrayStruct *arrayP = procArray;
! 	int			index;
! 
! 	LWLockAcquire(ProcArrayLock, LW_SHARED);
! 
! 	for (index = 0; index < arrayP->numProcs; index++)
  	{
! 		volatile PGPROC *proc = arrayP->procs[index];
! 
! 		/* Fetch xid just once - see GetNewTransactionId */
! 		TransactionId pxid = proc->xid;
! 
! 		if (proc->inCommit && TransactionIdIsValid(pxid))
  		{
! 			int			i;
  
! 			for (i = 0; i < nxids; i++)
  			{
! 				if (xids[i] == pxid)
  				{
! 					result = true;
! 					break;
  				}
  			}
! 			if (result)
! 				break;
! 		}
  	}
! 
! 	LWLockRelease(ProcArrayLock);
! 
! 	return result;
  }
  
  /*
--- 1590,1651 ----
  		/* Fetch xid just once - see GetNewTransactionId */
  		TransactionId pxid = proc->xid;
  
! 		if (proc->holdCheckpoint && TransactionIdIsValid(pxid))
! 		{
! 			xids[nxids].xid = pxid;
! 			xids[nxids].counter = proc->chkptHoldCounter;
! 			nxids++;
! 		}
  	}
  
  	LWLockRelease(ProcArrayLock);
  
! 	/*
! 	 * Wait until all operations that were in-progress when we took the
! 	 * snapshot have finished.
! 	 *
! 	 * Note: this is O(N^2) in the number of operations that are/were in
! 	 * progress, but those numbers should be small enough for it not to be
! 	 * a problem.
! 	 */
! 	if (nxids > 0)
  	{
! 		bool stillHeld;
! 		do
  		{
! 			pg_usleep(10000L);	/* wait for 10 msec */
! 
! 			stillHeld = false;
  
! 			LWLockAcquire(ProcArrayLock, LW_SHARED);
! 			for (index = 0; index < arrayP->numProcs; index++)
  			{
! 				volatile PGPROC *proc = arrayP->procs[index];
! 
! 				/* Fetch xid just once - see GetNewTransactionId */
! 				TransactionId pxid = proc->xid;
! 
! 				if (proc->holdCheckpoint && TransactionIdIsValid(pxid))
  				{
! 					int			i;
! 
! 					for (i = 0; i < nxids; i++)
! 					{
! 						if (xids[i].xid == pxid)
! 						{
! 							if (xids[i].counter == proc->chkptHoldCounter)
! 								stillHeld = true;
! 							break;
! 						}
! 					}
! 					if (stillHeld)
! 						break;
  				}
  			}
! 			LWLockRelease(ProcArrayLock);
! 		} while (stillHeld);
  	}
! 	pfree(xids);
  }
  
  /*
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
***************
*** 312,318 **** InitProcess(void)
  	MyProc->backendId = InvalidBackendId;
  	MyProc->databaseId = InvalidOid;
  	MyProc->roleId = InvalidOid;
! 	MyProc->inCommit = false;
  	MyProc->vacuumFlags = 0;
  	/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
  	if (IsAutoVacuumWorkerProcess())
--- 312,319 ----
  	MyProc->backendId = InvalidBackendId;
  	MyProc->databaseId = InvalidOid;
  	MyProc->roleId = InvalidOid;
! 	MyProc->holdCheckpoint = false;
! 	MyProc->chkptHoldCounter = 0;
  	MyProc->vacuumFlags = 0;
  	/* NB -- autovac launcher intentionally does not set IS_AUTOVACUUM */
  	if (IsAutoVacuumWorkerProcess())
***************
*** 450,456 **** InitAuxiliaryProcess(void)
  	MyProc->backendId = InvalidBackendId;
  	MyProc->databaseId = InvalidOid;
  	MyProc->roleId = InvalidOid;
! 	MyProc->inCommit = false;
  	MyProc->vacuumFlags = 0;
  	MyProc->lwWaiting = false;
  	MyProc->lwExclusive = false;
--- 451,458 ----
  	MyProc->backendId = InvalidBackendId;
  	MyProc->databaseId = InvalidOid;
  	MyProc->roleId = InvalidOid;
! 	MyProc->holdCheckpoint = false;
! 	MyProc->chkptHoldCounter = 0;
  	MyProc->vacuumFlags = 0;
  	MyProc->lwWaiting = false;
  	MyProc->lwExclusive = false;
***************
*** 1784,1786 **** handle_standby_sig_alarm(SIGNAL_ARGS)
--- 1786,1829 ----
  
  	errno = save_errno;
  }
+ 
+ /*
+  * Hold off checkpoints. A checkpoint that begins after this call won't
+  * finish until ResumeCheckpoint() is called. A checkpoint that is already
+  * in progress is not affected.
+  *
+  * You must call HoldCheckpoint() before starting an operation that consists
+  * of multiple WAL records, relying on the rmgr cleanup to finish an
+  * incomplete operation at end of recovery.
+  */
+ void
+ HoldCheckpoint(void)
+ {
+ 	volatile PGPROC *p = MyProc;
+ 	Assert(!p->holdCheckpoint);
+ 
+ 	/*
+ 	 * Increment the counter, so that checkpoint can differentiate between
+ 	 * an operation that was in progress when the checkpoint began, and one
+ 	 * in the same transaction that began after the checkpoint started.
+ 	 *
+ 	 * It's safe to change these for our own backend without holding the
+ 	 * ProcArrayLock, since we're the only one modifying them. This makes
+ 	 * checkpoint's determination of which xacts are holding off checkpoints
+ 	 * a bit fuzzy, but it doesn't matter. See also comments in
+ 	 * ProcArrayCheckpointBarrier().
+ 	 */
+ 	p->chkptHoldCounter++;
+ 	p->holdCheckpoint = true;
+ }
+ 
+ /*
+  * Allow a possible concurrent checkpoint to finish.
+  */
+ void
+ ResumeCheckpoint(void)
+ {
+ 	volatile PGPROC *p = MyProc;
+ 	Assert(p->holdCheckpoint);
+ 	p->holdCheckpoint = false;
+ }
*** a/src/include/access/gin.h
--- b/src/include/access/gin.h
***************
*** 400,406 **** extern void gin_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void gin_desc(StringInfo buf, uint8 xl_info, char *rec);
  extern void gin_xlog_startup(void);
  extern void gin_xlog_cleanup(void);
- extern bool gin_safe_restartpoint(void);
  
  /* ginbtree.c */
  
--- 400,405 ----
*** a/src/include/access/gist_private.h
--- b/src/include/access/gist_private.h
***************
*** 252,258 **** extern void gist_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void gist_desc(StringInfo buf, uint8 xl_info, char *rec);
  extern void gist_xlog_startup(void);
  extern void gist_xlog_cleanup(void);
- extern bool gist_safe_restartpoint(void);
  extern IndexTuple gist_form_invalid_tuple(BlockNumber blkno);
  
  extern XLogRecData *formUpdateRdata(RelFileNode node, Buffer buffer,
--- 252,257 ----
*** a/src/include/access/nbtree.h
--- b/src/include/access/nbtree.h
***************
*** 647,652 **** extern void btree_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void btree_desc(StringInfo buf, uint8 xl_info, char *rec);
  extern void btree_xlog_startup(void);
  extern void btree_xlog_cleanup(void);
- extern bool btree_safe_restartpoint(void);
  
  #endif   /* NBTREE_H */
--- 647,651 ----
*** a/src/include/access/xlog_internal.h
--- b/src/include/access/xlog_internal.h
***************
*** 250,256 **** typedef struct RmgrData
  	void		(*rm_desc) (StringInfo buf, uint8 xl_info, char *rec);
  	void		(*rm_startup) (void);
  	void		(*rm_cleanup) (void);
- 	bool		(*rm_safe_restartpoint) (void);
  } RmgrData;
  
  extern const RmgrData RmgrTable[];
--- 250,255 ----
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
***************
*** 91,97 **** struct PGPROC
  	Oid			databaseId;		/* OID of database this backend is using */
  	Oid			roleId;			/* OID of role using this backend */
  
! 	bool		inCommit;		/* true if within commit critical section */
  
  	uint8		vacuumFlags;	/* vacuum-related flags, see above */
  
--- 91,99 ----
  	Oid			databaseId;		/* OID of database this backend is using */
  	Oid			roleId;			/* OID of role using this backend */
  
! 	/* Fields for HoldCheckpoint()/ResumeCheckpoint() */
! 	int32		chkptHoldCounter;
! 	bool		holdCheckpoint;
  
  	uint8		vacuumFlags;	/* vacuum-related flags, see above */
  
***************
*** 204,207 **** extern bool enable_standby_sig_alarm(TimestampTz now,
--- 206,212 ----
  extern bool disable_standby_sig_alarm(void);
  extern void handle_standby_sig_alarm(SIGNAL_ARGS);
  
+ extern void HoldCheckpoint(void);
+ extern void ResumeCheckpoint(void);
+ 
  #endif   /* PROC_H */
*** a/src/include/storage/procarray.h
--- b/src/include/storage/procarray.h
***************
*** 48,55 **** extern bool TransactionIdIsInProgress(TransactionId xid);
  extern bool TransactionIdIsActive(TransactionId xid);
  extern TransactionId GetOldestXmin(bool allDbs, bool ignoreVacuum);
  
! extern int	GetTransactionsInCommit(TransactionId **xids_p);
! extern bool HaveTransactionsInCommit(TransactionId *xids, int nxids);
  
  extern PGPROC *BackendPidGetProc(int pid);
  extern int	BackendXidGetPid(TransactionId xid);
--- 48,54 ----
  extern bool TransactionIdIsActive(TransactionId xid);
  extern TransactionId GetOldestXmin(bool allDbs, bool ignoreVacuum);
  
! extern void ProcArrayCheckpointBarrier(void);
  
  extern PGPROC *BackendPidGetProc(int pid);
  extern int	BackendXidGetPid(TransactionId xid);
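
To make the intent of the patch concrete, here is a toy, self-contained C model
of the counter-plus-flag bookkeeping (this is not PostgreSQL code; ToyProc,
snapshot_at_redo() and must_wait() are invented names, and real code would use
shared memory and proper locking). The idea, as I read it, is that the
checkpointer records each backend's hold counter when the REDO pointer is
chosen, and later delays writing the checkpoint record only for backends that
are still inside the *same* multi-record operation:

#include <stdbool.h>
#include <stdio.h>

#define MAX_BACKENDS 4

typedef struct ToyProc
{
	int		chkptHoldCounter;	/* bumped each time a hold begins */
	bool	holdCheckpoint;		/* true while a multi-record op is in flight */
} ToyProc;

static ToyProc procs[MAX_BACKENDS];

/* snapshot taken when the checkpoint's REDO pointer is chosen */
static int	redoCounters[MAX_BACKENDS];
static bool	redoHolding[MAX_BACKENDS];

static void
snapshot_at_redo(void)
{
	for (int i = 0; i < MAX_BACKENDS; i++)
	{
		redoCounters[i] = procs[i].chkptHoldCounter;
		redoHolding[i] = procs[i].holdCheckpoint;
	}
}

/*
 * True if some backend is still in the middle of an operation that had
 * already started at the REDO point; in that case the checkpoint record
 * must not be written yet.
 */
static bool
must_wait(void)
{
	for (int i = 0; i < MAX_BACKENDS; i++)
	{
		if (redoHolding[i] &&
			procs[i].holdCheckpoint &&
			procs[i].chkptHoldCounter == redoCounters[i])
			return true;
	}
	return false;
}

int
main(void)
{
	/* backend 0 starts a multi-record operation, then the checkpoint begins */
	procs[0].chkptHoldCounter++;
	procs[0].holdCheckpoint = true;
	snapshot_at_redo();
	printf("wait? %s\n", must_wait() ? "yes" : "no");	/* yes */

	/* that operation finishes; a new one starting later doesn't block */
	procs[0].holdCheckpoint = false;
	procs[0].chkptHoldCounter++;
	procs[0].holdCheckpoint = true;
	printf("wait? %s\n", must_wait() ? "yes" : "no");	/* no */
	return 0;
}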
#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#6)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

The new rule is that before you start a multi-WAL-record operation that
needs to be completed at end of recovery if you crash in the middle, you
call HoldCheckpoint(), and once you're finished, ResumeCheckpoint().

What happens if you error out in between? Or is it assumed that the
*entire* sequence is a critical section? If it has to be that way,
one might wonder what's the point of trying to split it into multiple
WAL records.

regards, tom lane

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#7)
Re: B-tree parent pointer and checkpoints

I wrote:

What happens if you error out in between? Or is it assumed that the
*entire* sequence is a critical section? If it has to be that way,
one might wonder what's the point of trying to split it into multiple
WAL records.

Or, to be more concrete: I'm wondering if this *entire* mechanism isn't
a bad idea that we should just rip out.

The question that ought to be asked here, I think, is whether it
shouldn't be required that every inter-WAL-record state is a valid
consistent state that doesn't require post-crash fixups. If that
isn't the case, then a simple ERROR or FATAL exit out of the backend
that was creating the sequence originally will leave the system in
an unacceptable state. We could prevent such an exit by wrapping the
whole sequence in a critical section, but if you have to do that then
it's not apparent why you shouldn't fold it into one WAL record.

IOW, forget this patch. Take out the logic that tries to complete
pending splits during replay, instead. I believe this is perfectly safe
for btree: loss of a parent record isn't fatal, as proven by the fact
that searches don't have to be locked out while a split proceeds.
(We might want to make btree_page_del not think that a missing parent
record is an error, but it shouldn't think that anyway, because of the
possibility of a non-crashing failure during the original split.)
This approach might not be safe for GIST or GIN; but if it isn't, they
need fixes anyway.

regards, tom lane

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#8)
Re: B-tree parent pointer and checkpoints

On 11.11.2010 00:49, Tom Lane wrote:

I wrote:

What happens if you error out in between? Or is it assumed that the
*entire* sequence is a critical section? If it has to be that way,
one might wonder what's the point of trying to split it into multiple
WAL records.

Or, to be more concrete: I'm wondering if this *entire* mechanism isn't
a bad idea that we should just rip out.

The question that ought to be asked here, I think, is whether it
shouldn't be required that every inter-WAL-record state is a valid
consistent state that doesn't require post-crash fixups. If that
isn't the case, then a simple ERROR or FATAL exit out of the backend
that was creating the sequence originally will leave the system in
an unacceptable state. We could prevent such an exit by wrapping the
whole sequence in a critical section, but if you have to do that then
it's not apparent why you shouldn't fold it into one WAL record.

IOW, forget this patch. Take out the logic that tries to complete
pending splits during replay, instead. I believe this is perfectly safe
for btree: loss of a parent record isn't fatal, as proven by the fact
that searches don't have to be locked out while a split proceeds.
(We might want to make btree_page_del not think that a missing parent
record is an error, but it shouldn't think that anyway, because of the
possibility of a non-crashing failure during the original split.)
This approach might not be safe for GIST or GIN; but if it isn't, they
need fixes anyway.

GIN is similar to b-tree: the incomplete split logic there is for
inserting the parent pointers in the b-tree within the GIN index, just
like nbtree.

GiST is different. When you insert a key to a leaf page, you (sometimes)
need to adjust the parent pointer to reflect the new key as well. B-tree
tolerates incomplete splits with the 'next page' pointer, but that is
not applicable to gist. Teodor described the issue back in 2005 when
WAL-logging was added to GiST
(http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):

The problem with incomplete inserts is: when a new entry is installed into a
leaf page, in the worst case the whole chain of keys from the root page down
to the leaf should be updated. Of course, updating the entire chain is rare;
usually only part of it is updated. Each key on an inner page contains the
union of the keys (a bounding box, in the case of rtree, for example) on the
page it points to. This union can be formed only with the help of a
user-defined function of the opclass, because GiST doesn't know anything
about the nature of the keys. Returning to WAL: the GiST core writes an xlog
entry with all the information necessary for restoration before writing the
page, but at that moment it doesn't know whether it should update the keys on
the parent page or whether the key is unchanged. So GiST's WAL restoration
code should remember this page's update as an incomplete insert. When the
insert is completed, the GiST core writes to the log that the insert is
completed, and the restore code can clean up the stored incomplete insert. If
there was a crash, then the sign of a completed insert can be absent from the
log, and GiST's restore code should continue it. While continuing, it knows
which page was changed and should walk up to the root. On each step of the
walk it should form the union for the page and insert it into the parent.

Reading that I wonder: what harm would an incomplete insert cause if we
just left it in the tree? Imagine that you insert a key to a leaf page,
but crash before updating the parent. If you search for the key starting
from the root, you'll fail to find it, because the parent pointer claims
that there are no entries with such a key on the child page. But that's
OK, the inserting transaction aborted with the crash!

Do some of the other GiST algorithms get confused if there's a key on a
page that's not represented by the parent pointer? It's possible that
you insert a key to the leaf, update the leaf's immediate parent, but
crash before updating the parent's parent. As far as search is
concerned, that's OK as well, but it creates a hazard for subsequent
inserts. If you later insert a tuple with the same key to the same leaf
page, the insertion will see that the parent pointer already includes
the key, and will fail to update the parent's parent. That's a problem.

Would it be hard to change the algorithm to update the parent keys
top-down rather than bottom-up?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#10)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

GiST is different. When you insert a key to a leaf page, you (sometimes)
need to adjust the parent pointer to reflect the new key as well. B-tree
tolerates incomplete splits with the 'next page' pointer, but that is
not applicable to gist. Teodor described the issue back in 2005 when
WAL-logging was added to GiST
(http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):
Reading that I wonder: what harm would an incomplete insert cause if we
just left it in the tree? Imagine that you insert a key to a leaf page,
but crash before updating the parent. If you search for the key starting
from the root, you'll fail to find it, because the parent pointer claims
that there are no entries with such a key on the child page. But that's
OK, the inserting transaction aborted with the crash!

I think it'd be okay as far as that one entry is concerned, since as you
say it doesn't matter whether a search finds it. (We'd have to be sure
that VACUUM would still find it to remove it, of course, but that
doesn't use a normal search.) You're right that it poses a hazard of
subsequent inserts deciding that they don't need to do work on upper
levels because the lower ones look OK already. But depending on the
details of the search algorithm, this might be a non-problem: if you
remember that the upper level entries didn't cover your key when you
descended, you'd still know you need to recompute them.

Something else I just noticed is that WAL replay isn't capable of
completely fixing the index anyway:

* To complete insert we can't use basic insertion algorithm because
* during insertion we can't call user-defined support functions of opclass.
* So, we insert 'invalid' tuples without real key and do it by separate algorithm.
* 'invalid' tuple should be updated by vacuum full.

Given that there's no more vacuum full, and nobody has been expected to
run it routinely for a long time anyway, this fixup approach seems
pretty completely broken anyhow.

regards, tom lane

#12Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#11)
Re: B-tree parent pointer and checkpoints

On 11.11.2010 17:16, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

GiST is different. When you insert a key to a leaf page, you (sometimes)
need to adjust the parent pointer to reflect the new key as well. B-tree
tolerates incomplete splits with the 'next page' pointer, but that is
not applicable to gist. Teodor described the issue back in 2005 when
WAL-logging was added to GiST
(http://archives.postgresql.org/pgsql-hackers/2005-06/msg00555.php):
Reading that I wonder: what harm would an incomplete insert cause if we
just left it in the tree? Imagine that you insert a key to a leaf page,
but crash before updating the parent. If you search for the key starting
from the root, you'll fail to find it, because the parent pointer claims
that there are no entries with such a key on the child page. But that's
OK, the inserting transaction aborted with the crash!

I think it'd be okay as far as that one entry is concerned, since as you
say it doesn't matter whether a search finds it. (We'd have to be sure
that VACUUM would still find it to remove it, of course, but that
doesn't use a normal search.) You're right that it poses a hazard of
subsequent inserts deciding that they don't need to do work on upper
levels because the lower ones look OK already. But depending on the
details of the search algorithm, this might be a non-problem: if you
remember that the upper level entries didn't cover your key when you
descended, you'd still know you need to recompute them.

Hmm, we don't currently keep track of that when we descend the tree to
choose the target page, but perhaps an extra Consistent call to check
that would be acceptable. We already call Penalty for every tuple on
each internal node on the way, so compared to that one more call should
not be too expensive. If we do that, I think it would simplify the
algorithm quite a bit to just update all the parents on the way down,
instead of traversing up from the bottom after inserting the tuple to
the leaf.
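
To illustrate the top-down adjustment being suggested here, a minimal
standalone sketch (not GiST code): 1-D intervals stand in for the
opclass-defined unions such as bounding boxes, covers() plays the role of the
extra Consistent-style check on the way down, and enlarge() plays the role of
Union:

#include <stdbool.h>
#include <stdio.h>

typedef struct Interval { int lo, hi; } Interval;

static bool
covers(Interval parent, Interval key)	/* the "Consistent"-style check */
{
	return parent.lo <= key.lo && key.hi <= parent.hi;
}

static Interval
enlarge(Interval parent, Interval key)	/* the "Union"-style adjustment */
{
	if (key.lo < parent.lo) parent.lo = key.lo;
	if (key.hi > parent.hi) parent.hi = key.hi;
	return parent;
}

int
main(void)
{
	Interval path[3] = {{0, 100}, {40, 60}, {45, 50}};	/* root .. leaf's parent */
	Interval newkey = {70, 75};

	/* top-down: fix every ancestor that doesn't cover the key while descending */
	for (int level = 0; level < 3; level++)
	{
		if (!covers(path[level], newkey))
		{
			path[level] = enlarge(path[level], newkey);
			printf("level %d enlarged to [%d,%d]\n",
				   level, path[level].lo, path[level].hi);
		}
	}
	return 0;
}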

Something else I just noticed is that WAL replay isn't capable of
completely fixing the index anyway:

* To complete insert we can't use basic insertion algorithm because
* during insertion we can't call user-defined support functions of opclass.
* So, we insert 'invalid' tuples without real key and do it by separate algorithm.
* 'invalid' tuple should be updated by vacuum full.

Given that there's no more vacuum full, and nobody has been expected to
run it routinely for a long time anyway, this fixup approach seems
pretty completely broken anyhow.

The 'invalid' tuples don't affect correctness, but are a drag on
performance, so they are similar to incomplete b-tree splits. I suspect
the overhead of an invalid gist pointer is much bigger than the overhead
of an incomplete b-tree split, though. I agree we should get rid of
that; it's not comforting to get a stream of messages in the logs saying
you should run VACUUM FULL.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#12)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Hmm, we don't currently keep track of that when we descend the tree to
choose the target page, but perhaps an extra Consistent call to check
that would be acceptable. We already call Penalty for every tuple on
each internal node on the way, so compared to that one more call should
not be too expensive. If we do that, I think it would simplify the
algorithm quite a bit to just update all the parents on the way down,
instead of traversing up from the bottom after inserting the tuple to
the leaf.

Oh, that's a really good idea, I think. But what about page splits?
I guess in the case of a split, you'd be replacing the parent entry
anyway, so having previously updated it to something larger doesn't
really cause a problem other than wasting a few cycles --- which are
probably still less than you save by not having to traverse back up.

If we supported UNIQUE GIST indexes then you could criticize this plan
on the grounds that parent entries would get uselessly enlarged before
detecting a uniqueness failure; but we don't and I know of no plans to.
So on the whole I think it sounds good. Teodor, what do you think?

regards, tom lane

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#13)
Re: B-tree parent pointer and checkpoints

On 11.11.2010 20:34, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

Hmm, we don't currently keep track of that when we descend the tree to
choose the target page, but perhaps an extra Consistent call to check
that would be acceptable. We already call Penalty for every tuple on
each internal node on the way, so compared to that one more call should
not be too expensive. If we do that, I think it would simplify the
algorithm quite a bit to just update all the parents on the way down,
instead of traversing up from the bottom after inserting the tuple to
the leaf.

Oh, that's a really good idea, I think. But what about page splits?
I guess in the case of a split, you'd be replacing the parent entry
anyway, so having previously updated it to something larger doesn't
really cause a problem other than wasting a few cycles --- which are
probably still less than you save by not having to traverse back up.

I started looking at this, and ran into a problem with page splits. The
right-links in GiST work differently from b-tree: a right-link is only
followed if we detect a concurrent page split. A concurrent split is
detected by comparing the "NSN" field on the child page against the LSN
that we saw on the parent when walking down. It means that if you just
leave the incompletely split page in the tree, where one half is missing
the parent pointer, scans will not find any tuples on that page. They
would at first, but as soon as the parent page is updated due to
some unrelated insertion, the LSN of the parent is bumped above the NSN
stored on the child, and the page becomes invisible to scanners.

We avoid that problem during normal operation by keeping the parent page
locked while the child is split, until the downlink is inserted into the
parent. That blocks any other modifications to the parent page that
would bump the LSN, until our downlink has been inserted. That doesn't
work after crash recovery, as all the locks are released.

I think we can work around that with a small modification to the page
split algorithm. In a nutshell, when the child page is split, put a flag
on the left half indicating that the rightlink must always be followed,
regardless of the NSN. When the downlink is inserted to the parent,
clear the flag. Setting and clearing of these flags need to be performed
during WAL replay as well.

So to split a page:

(0. Lock the page to be split)
1. Split the page. Mark the rightlink in the left half with a flag
indicating that it always needs to be followed.
2. Lock the parent.
3. Insert downlink. (The parent may need to be split too)
4. Clear the flag in the child, and update NSN to the LSN of the
downlink insertion record.
5. Release child.

6. If the parent was split in step 3, goto 2.

If we crash between steps 1 and 3, the rightlink will have the flag, so
scans will know to always follow it. If we crash after step 3, recovery
will replay steps 3 and 4, so scans will see the downlinks as usual.

After a crash, the tree can be fixed up later by vacuum or subsequent
inserts, by performing steps 2-4.
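
A minimal model of the flag bookkeeping in that sequence (toy types and
invented names, not the real GiST code; the "WAL" here is just a counter
handing out LSNs):

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long Lsn;

typedef struct ToyPage
{
	bool	followRight;	/* "always follow the rightlink" flag */
	Lsn		nsn;			/* stamped when the downlink reaches the parent */
} ToyPage;

static Lsn currentLsn = 100;

static Lsn
write_wal_record(const char *what)
{
	printf("WAL: %s at %lu\n", what, ++currentLsn);
	return currentLsn;
}

int
main(void)
{
	ToyPage left = {0};

	/* step 1: split the page and mark the left half */
	left.followRight = true;
	write_wal_record("page split");

	/* a crash here leaves followRight set, which is safe for scans */

	/* steps 3-4: insert the downlink into the parent, then clear the flag
	 * and stamp the NSN with the LSN of that insertion */
	left.nsn = write_wal_record("insert downlink in parent");
	left.followRight = false;

	printf("followRight=%d nsn=%lu\n", left.followRight, left.nsn);
	return 0;
}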

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#15Greg Stark
gsstark@mit.edu
In reply to: Heikki Linnakangas (#14)
Re: B-tree parent pointer and checkpoints

On Fri, Nov 12, 2010 at 7:20 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I think we can work around that with a small modification to the page split
algorithm. In a nutshell, when the child page is split, put a flag on the
left half indicating that the rightlink must always be followed, regardless
of the NSN. When the downlink is inserted to the parent, clear the flag.
Setting and clearing of these flags need to be performed during WAL replay
as well.

Does this not cause duplicate results? Or does GIST already have to be
prepared to deal with duplicate results?

--
greg

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Greg Stark (#15)
Re: B-tree parent pointer and checkpoints

On 13.11.2010 00:34, Greg Stark wrote:

On Fri, Nov 12, 2010 at 7:20 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I think we can work around that with a small modification to the page split
algorithm. In a nutshell, when the child page is split, put a flag on the
left half indicating that the rightlink must always be followed, regardless
of the NSN. When the downlink is inserted to the parent, clear the flag.
Setting and clearing of these flags need to be performed during WAL replay
as well.

Does this not cause duplicate results? Or does GIST already have to be
prepared to deal with duplicate results?

The GiST search algorithm avoids duplicate results by remembering the
LSN on the parent page when it follows a downlink. The split currently
happens like this:

0. (the child page is locked)
1. The parent page is locked.
2. The child page is split. The original page becomes the left half, and
new buffers are allocated for the right halves.
3. The downlink is inserted on the parent page (and the original
downlink is updated to reflect only the keys that stayed on the left
page). While keeping the child pages locked, the NSN field on the
children are updated with the new LSN of the parent page.

To avoid duplicates, when a scan looks at the child page, it needs to
know if it saw the parent page before or after the downlink was
inserted. If it saw it before, the scan needs to follow the rightlink to
the right half, otherwise it will follow the downlink as usual (if it
matched). The scan checks that by comparing the LSN it saw on the parent
page with the NSN on the child page. If parent LSN < NSN, we saw the
parent before the downlink was inserted.

Now, the problem with crash recovery is that the above algorithm depends
on the split to keep the parent and child locked until the downlink is
inserted in the parent. If you crash between steps 2 and 3, the locks
are gone. If a later insert then updates the parent page, because of a
split on some unrelated child page, that will bump the LSN of the parent
above the NSN on the child. Scans will see that the parent LSN > child
NSN, and will no longer follow the rightlink.

And the fix for that is to set a flag on the child page indicating that
rightlink has to be always followed regardless of the LSN/NSN, because
the downlink hasn't been inserted yet. When the downlink is inserted,
the flag is cleared and we rely on the existing LSN/NSN mechanism to
avoid duplicate results.
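
The scan-side rule can be written down as a small standalone check. This is
only an illustration of the decision described above, with invented toy
types; followRight is the proposed "always follow the rightlink" flag:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned long Lsn;

typedef struct ToyChildPage
{
	bool	followRight;	/* set between the split and the downlink insertion */
	Lsn		nsn;			/* parent LSN at the time the downlink was inserted */
} ToyChildPage;

static bool
must_follow_rightlink(const ToyChildPage *child, Lsn parentLsnSeen)
{
	if (child->followRight)
		return true;					/* downlink not there yet: always follow */
	return parentLsnSeen < child->nsn;	/* saw the parent before the downlink */
}

int
main(void)
{
	ToyChildPage incomplete = { true, 0 };		/* split, downlink still missing */
	ToyChildPage complete   = { false, 200 };	/* downlink inserted at LSN 200 */

	printf("%d %d %d\n",
		   must_follow_rightlink(&incomplete, 500),	/* 1: flag wins */
		   must_follow_rightlink(&complete, 150),	/* 1: saw parent before downlink */
		   must_follow_rightlink(&complete, 250));	/* 0: downlink already visible */
	return 0;
}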

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#17Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#16)
Re: B-tree parent pointer and checkpoints

Has this been addressed?

---------------------------------------------------------------------------

Heikki Linnakangas wrote:

On 13.11.2010 00:34, Greg Stark wrote:

On Fri, Nov 12, 2010 at 7:20 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I think we can work around that with a small modification to the page split
algorithm. In a nutshell, when the child page is split, put a flag on the
left half indicating that the rightlink must always be followed, regardless
of the NSN. When the downlink is inserted to the parent, clear the flag.
Setting and clearing of these flags need to be performed during WAL replay
as well.

Does this not cause duplicate results? Or does GIST already have to be
prepared to deal with duplicate results?

The GiST search algorithm avoids duplicate results by remembering the
LSN on the parent page when it follows a downlink. The split currently
happens like this:

0. (the child page is locked)
1. The parent page is locked.
2. The child page is split. The original page becomes the left half, and
new buffers are allocated for the right halves.
3. The downlink is inserted on the parent page (and the original
downlink is updated to reflect only the keys that stayed on the left
page). While keeping the child pages locked, the NSN field on the
children are updated with the new LSN of the parent page.

To avoid duplicates, when a scan looks at the child page, it needs to
know if it saw the parent page before or after the downlink was
inserted. If it saw it before, the scan needs to follow the rightlink to
the right half, otherwise it will follow the downlink as usual (if it
matched). The scan checks that by comparing the LSN it saw on the parent
page with the NSN on the child page. If parent LSN < NSN, we saw the
parent before the downlink was inserted.

Now, the problem with crash recovery is that the above algorithm depends
on the split to keep the parent and child locked until the downlink is
inserted in the parent. If you crash between steps 2 and 3, the locks
are gone. If a later insert then updates the parent page, because of a
split on some unrelated child page, that will bump the LSN of the parent
above the NSN on the child. Scans will see that the parent LSN > child
NSN, and will no longer follow the rightlink.

And the fix for that is to set a flag on the child page indicating that
rightlink has to be always followed regardless of the LSN/NSN, because
the downlink hasn't been inserted yet. When the downlink is inserted,
the flag is cleared and we rely on the existing LSN/NSN mechanism to
avoid duplicate results.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#18Bruce Momjian
bruce@momjian.us
In reply to: Bruce Momjian (#17)
Re: B-tree parent pointer and checkpoints

Bruce Momjian wrote:

Has this been addressed?

I see we have with this commit:

9de3aa65f01fb51cbc725e8508ea233e4e92c46c

---------------------------------------------------------------------------

---------------------------------------------------------------------------

Heikki Linnakangas wrote:

On 13.11.2010 00:34, Greg Stark wrote:

On Fri, Nov 12, 2010 at 7:20 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I think we can work around that with a small modification to the page split
algorithm. In a nutshell, when the child page is split, put a flag on the
left half indicating that the rightlink must always be followed, regardless
of the NSN. When the downlink is inserted to the parent, clear the flag.
Setting and clearing of these flags need to be performed during WAL replay
as well.

Does this not cause duplicate results? Or does GIST already have to be
prepared to deal with duplicate results?

The GiST search algorithm avoids duplicate results by remembering the
LSN on the parent page when it follows a downlink. The split currently
happens like this:

0. (the child page is locked)
1. The parent page is locked.
2. The child page is split. The original page becomes the left half, and
new buffers are allocated for the right halves.
3. The downlink is inserted on the parent page (and the original
downlink is updated to reflect only the keys that stayed on the left
page). While keeping the child pages locked, the NSN field on the
children are updated with the new LSN of the parent page.

To avoid duplicates, when a scan looks at the child page, it needs to
know if it saw the parent page before or after the downlink was
inserted. If it saw it before, the scan needs to follow the rightlink to
the right half, otherwise it will follow the downlink as usual (if it
matched). The scan checks that by comparing the LSN it saw on the parent
page with the NSN on the child page. If parent LSN < NSN, we saw the
parent before the downlink was inserted.

Now, the problem with crash recovery is that the above algorithm depends
on the split to keep the parent and child locked until the downlink is
inserted in the parent. If you crash between steps 2 and 3, the locks
are gone. If a later insert then updates the parent page, because of a
split on some unrelated child page, that will bump the LSN of the parent
above the NSN on the child. Scans will see that the parent LSN > child
NSN, and will no longer follow the rightlink.

And the fix for that is to set a flag on the child page indicating that
rightlink has to be always followed regardless of the LSN/NSN, because
the downlink hasn't been inserted yet. When the downlink is inserted,
the flag is cleared and we rely on the existing LSN/NSN mechanism to
avoid duplicate results.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#19Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#18)
Re: B-tree parent pointer and checkpoints

On 10.03.2011 22:50, Bruce Momjian wrote:

Bruce Momjian wrote:

Has this been addressed?

I see we have with this commit:

9de3aa65f01fb51cbc725e8508ea233e4e92c46c

We fixed GiST. B-tree still has the issue that if you have a checkpoint
in the middle of an insert, and crash, you might be left with a leaf
node without a downlink in the parent.

That's relatively harmless: index searches and insertions work without
the downlink. However, there's code in page deletion that ERRORs if the
parent can't be found. That should be fixed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#19)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

On 10.03.2011 22:50, Bruce Momjian wrote:

Bruce Momjian wrote:

Has this been addressed?

I see we have with this commit:

9de3aa65f01fb51cbc725e8508ea233e4e92c46c

We fixed GiST. B-tree still has the issue that if you have a checkpoint
in the middle of an insert, and crash, you might be left with a leaf
node without a downlink in the parent.

But that will be fixed during WAL replay.

That's relatively harmless, index searches and insertions work without
the downlink. However, there's code in page deletion that ERRORs if the
parent can't be found. That should be fixed.

I don't like the idea of removing that consistency check, and I don't
think it's necessary to do so.

regards, tom lane

#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#20)
Re: B-tree parent pointer and checkpoints

On 11.03.2011 17:59, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

On 10.03.2011 22:50, Bruce Momjian wrote:

Bruce Momjian wrote:

Has this been addressed?

I see we have with this commit:

9de3aa65f01fb51cbc725e8508ea233e4e92c46c

We fixed GiST. B-tree still has the issue that if you have a checkpoint
in the middle of an insert, and crash, you might be left with a leaf
node without a downlink in the parent.

But that will be fixed during WAL replay.

Not under the circumstances that started the original thread:

1. Backend splits a page
2. Checkpoint starts
3. Checkpoint runs to completion
4. Crash
(5. Backend never got to insert the parent pointer)

WAL replay starts at the checkpoint redo pointer, which is after the
page split record, so WAL replay won't insert the parent pointer. That's
an incredibly tight window to hit in practice, but it's possible in theory.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#21)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

On 11.03.2011 17:59, Tom Lane wrote:

But that will be fixed during WAL replay.

Not under the circumstances that started the original thread:

1. Backend splits a page
2. Checkpoint starts
3. Checkpoint runs to completion
4. Crash
(5. Backend never got to insert the parent pointer)

WAL replay starts at the checkpoint redo pointer, which is after the
page split record, so WAL replay won't insert the parent pointer. That's
an incredibly tight window to hit in practice, but it's possible in theory.

Hmm. It's not so improbable that checkpoint would start inside that
window, but that the parent insertion is still pending by the time the
checkpoint finishes is pretty improbable.

How about just reducing the deletion-time ERROR for missing downlink to a LOG?

regards, tom lane

#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#22)
Re: B-tree parent pointer and checkpoints

On 11.03.2011 19:41, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

On 11.03.2011 17:59, Tom Lane wrote:

But that will be fixed during WAL replay.

Not under the circumstances that started the original thread:

1. Backend splits a page
2. Checkpoint starts
3. Checkpoint runs to completion
4. Crash
(5. Backend never got to insert the parent pointer)

WAL replay starts at the checkpoint redo pointer, which is after the
page split record, so WAL replay won't insert the parent pointer. That's
an incredibly tight window to hit in practice, but it's possible in theory.

Hmm. It's not so improbable that checkpoint would start inside that
window, but that the parent insertion is still pending by the time the
checkpoint finishes is pretty improbable.

How about just reducing the deletion-time ERROR for missing downlink to a LOG?

Well, the code that follows expects to have a valid parent page locked,
so you can't literally do just that. But yeah, LOG and aborting the page
deletion seems fine to me.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#24Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#23)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas wrote:

On 11.03.2011 19:41, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

On 11.03.2011 17:59, Tom Lane wrote:

But that will be fixed during WAL replay.

Not under the circumstances that started the original thread:

1. Backend splits a page
2. Checkpoint starts
3. Checkpoint runs to completion
4. Crash
(5. Backend never got to insert the parent pointer)

WAL replay starts at the checkpoint redo pointer, which is after the
page split record, so WAL replay won't insert the parent pointer. That's
an incredibly tight window to hit in practice, but it's possible in theory.

Hmm. It's not so improbable that checkpoint would start inside that
window, but that the parent insertion is still pending by the time the
checkpoint finishes is pretty improbable.

How about just reducing the deletion-time ERROR for missing downlink to a LOG?

Well, the code that follows expects to have a valid parent page locked,
so you can't literally do just that. But yeah, LOG and aborting the page
deletion seems fine to me.

Did this get fixed?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#25Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Bruce Momjian (#24)
Re: B-tree parent pointer and checkpoints

On 05.09.2011 21:55, Bruce Momjian wrote:

Heikki Linnakangas wrote:

On 11.03.2011 19:41, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

On 11.03.2011 17:59, Tom Lane wrote:

But that will be fixed during WAL replay.

Not under the circumstances that started the original thread:

1. Backend splits a page
2. Checkpoint starts
3. Checkpoint runs to completion
4. Crash
(5. Backend never got to insert the parent pointer)

WAL replay starts at the checkpoint redo pointer, which is after the
page split record, so WAL replay won't insert the parent pointer. That's
an incredibly tight window to hit in practice, but it's possible in theory.

Hmm. It's not so improbable that checkpoint would start inside that
window, but that the parent insertion is still pending by the time the
checkpoint finishes is pretty improbable.

How about just reducing the deletion-time ERROR for missing downlink to a LOG?

Well, the code that follows expects to have a valid parent page locked,
so you can't literally do just that. But yeah, LOG and aborting the page
deletion seems fine to me.

Did this get fixed?

Nope.

On a closer look, this isn't only a problem for page deletion. Page
splitting also barfs if it can't find the parent of a page. As the code
stands, a missing downlink is not harmless, but causes all sorts of trouble.

The window for this to happen with a checkpoint is extremely tight, but
there's another situation where you can end up with a missing downlink:
if you run out of disk space while splitting a parent page in order to
insert a downlink into it.

I think we should do a similar fix to b-tree that I did to GiST, and put
a flag on pages with missing downlinks. Then we can fix the missing
downlinks in vacuum and insertion, and get rid of the code to fix
incomplete splits after WAL replay.

The way it would work is that on page split the right page is flagged
with a MISSING_DOWNLINK flag. When the downlink is inserted into the
parent, the flag is cleared in the same critical section in which the WAL
record for the parent insertion is written. Normally, a backend
would never see the flag set, because the locks on the split pages are
not released until the parent record is written and the flag cleared
again. But if inserting the downlink fails for any reason, the next
inserter or vacuum that steps on the page can finish the split by
inserting the downlink.
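
A hand-wavy sketch of that lazy finish-the-split idea, with invented toy
types rather than nbtree's real structures; the point is just that whoever
finds the flag completes the old split before doing its own work:

#include <stdbool.h>
#include <stdio.h>

typedef struct ToyBtPage
{
	bool	missingDownlink;	/* set at split, cleared once the parent is updated */
} ToyBtPage;

/* stand-in for the real WAL-logged parent insertion */
static void
insert_downlink_into_parent(ToyBtPage *rightHalf)
{
	(void) rightHalf;
	printf("downlink inserted into parent\n");
}

/*
 * Anyone who steps on a page whose downlink is missing finishes the old
 * split first; the flag is cleared together with logging the insertion.
 */
static void
maybe_finish_split(ToyBtPage *rightHalf)
{
	if (!rightHalf->missingDownlink)
		return;
	insert_downlink_into_parent(rightHalf);
	rightHalf->missingDownlink = false;
}

int
main(void)
{
	ToyBtPage page = { true };		/* as left behind by a failed split */
	maybe_finish_split(&page);		/* repairs it */
	maybe_finish_split(&page);		/* second visit is a no-op */
	return 0;
}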

Unfortunately that means holding the locks on the split pages longer
than we do at the moment. Currently they are released as soon as the
parent page is locked; with this change they would need to be held until
the WAL record of the downlink insertion is done. B-tree is so heavily
used that I'm a bit hesitant to sacrifice any concurrency there, but I
don't think it would be noticeable in practice.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#26Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#25)
Re: B-tree parent pointer and checkpoints

On Tue, Sep 6, 2011 at 6:21 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Nope.

On a closer look, this isn't only a problem for page deletion. Page
splitting also barfs if it can't find the parent of a page. As the code
stands, a missing downlink is not harmless, but causes all sorts of trouble.

The window for this to happen with a checkpoint is extremely tight, but
there's another situation where you can end up with a missing downlink: if
you run out of disk space while splitting a parent page, to insert a
downlink to it.

I think we should do a similar fix to b-tree that I did to GiST, and put a
flag on pages with missing downlinks. Then we can fix the missing downlinks
in vacuum and insertion, and get rid of the code to fix incomplete splits
after WAL replay.

The way it would work is that on page split the right page is flagged with
MISSING_DOWNLINK flag. When the downlink is inserted into the parent, the
flag is cleared in the same critical section as the WAL record for the
insertion of the parent is written. Normally, a backend would never see the
flag set, because the locks on the split pages are not released until the
parent record is written and the flag cleared again. But if inserting the
downlink fails for any reason, the next inserter or vacuum that steps on the
page can finish the split by inserting the downlink.

Unfortunately that means holding the locks on the split pages longer than we
do at the moment. Currently they are released as soon as the parent page is
locked; with this change they would need to be held until the WAL record of
the downlink insertion is done. B-tree is so heavily used that I'm a bit
hesitant to sacrifice any concurrency there, but I don't think it would be
noticeable in practice.

Do you really need to hold the page locks for all that time, or could
you cheat? Like... release the locks on the split pages but then go
back and reacquire them to clear the flag...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#27Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#26)
Re: B-tree parent pointer and checkpoints

On 06.09.2011 16:40, Robert Haas wrote:

On Tue, Sep 6, 2011 at 6:21 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

The way it would work is that on page split the right page is flagged with
MISSING_DOWNLINK flag. When the downlink is inserted into the parent, the
flag is cleared in the same critical section as the WAL record for the
insertion of the parent is written. Normally, a backend would never see the
flag set, because the locks on the split pages are not released until the
parent record is written and the flag cleared again. But if inserting the
downlink fails for any reason, the next inserter or vacuum that steps on the
page can finish the split by inserting the downlink.

Unfortunately that means holding the locks on the split pages longer than we
do at the moment. Currently they are released as soon as the parent page is
locked; with this change they would need to be held until the WAL record of
the downlink insertion is done. B-tree is so heavily used that I'm a bit
hesitant to sacrifice any concurrency there, but I don't think it would be
noticeable in practice.

Do you really need to hold the page locks for all that time, or could
you cheat? Like... release the locks on the split pages but then go
back and reacquire them to clear the flag...

Hmm, there's two issues with that:

1. While you're not holding the locks on the child pages, someone can
step onto the page and see that the MISSING_DOWNLINK flag is set, and
try to finish the split for you.

2. If you don't hold the page locked while you clear the flag, someone
can start and finish a checkpoint after you've inserted the downlink,
and before you've cleared the flag. You end up in a scenario where the
flag is set, but the page in fact *does* have a downlink in the parent.

So, nope, we can't cheat.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#28Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#27)
Re: B-tree parent pointer and checkpoints

On Tue, Sep 6, 2011 at 9:45 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Do you really need to hold the page locks for all that time, or could
you cheat?  Like... release the locks on the split pages but then go
back and reacquire them to clear the flag...

Hmm, there's two issues with that:

1. While you're not holding the locks on the child pages, someone can step
onto the page and see that the MISSING_DOWNLINK flag is set, and try to
finish the split for you.

2. If you don't hold the page locked while you clear the flag, someone can
start and finish a checkpoint after you've inserted the downlink, and before
you've cleared the flag. You end up in a scenario where the flag is set, but
the page in fact *does* have a downlink in the parent.

It seems like both of these could be handled by making the code that
repairs the damage insert the downlink into the parent only if it's
not already present.
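
A sketch of that idempotent repair, again with invented toy types: the
repair path looks for the downlink first and inserts it only if it is
genuinely missing, so running it twice, or after the insertion but before
the flag is cleared, is harmless:

#include <stdbool.h>
#include <stdio.h>

typedef unsigned int BlockNo;

typedef struct ToyParentPage
{
	BlockNo	downlinks[32];
	int		ndownlinks;
} ToyParentPage;

static bool
has_downlink(const ToyParentPage *parent, BlockNo child)
{
	for (int i = 0; i < parent->ndownlinks; i++)
		if (parent->downlinks[i] == child)
			return true;
	return false;
}

/* insert the downlink only if it's genuinely missing */
static void
repair_missing_downlink(ToyParentPage *parent, BlockNo child)
{
	if (!has_downlink(parent, child))
		parent->downlinks[parent->ndownlinks++] = child;	/* WAL-logged for real */
}

int
main(void)
{
	ToyParentPage parent = { {7}, 1 };
	repair_missing_downlink(&parent, 42);	/* adds it */
	repair_missing_downlink(&parent, 42);	/* no-op */
	printf("ndownlinks = %d\n", parent.ndownlinks);	/* 2 */
	return 0;
}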

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#29Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#28)
Re: B-tree parent pointer and checkpoints

Robert Haas wrote:

On Tue, Sep 6, 2011 at 9:45 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Do you really need to hold the page locks for all that time, or could
you cheat? Like... release the locks on the split pages but then go
back and reacquire them to clear the flag...

Hmm, there's two issues with that:

1. While you're not holding the locks on the child pages, someone can step
onto the page and see that the MISSING_DOWNLINK flag is set, and try to
finish the split for you.

2. If you don't hold the page locked while you clear the flag, someone can
start and finish a checkpoint after you've inserted the downlink, and before
you've cleared the flag. You end up in a scenario where the flag is set, but
the page in fact *does* have a downlink in the parent.

It seems like both of these could be handled by making the code that
repairs the damage insert the downlink into the parent only if it's
not already present.

I am sorry to be dumping all these new open issues so late in the 9.1
cycle --- I only now got time to go back over my emails. Many are from
March and later.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#30Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#29)
Re: B-tree parent pointer and checkpoints

On Tue, Sep 6, 2011 at 10:03 AM, Bruce Momjian <bruce@momjian.us> wrote:

Robert Haas wrote:

On Tue, Sep 6, 2011 at 9:45 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Do you really need to hold the page locks for all that time, or could
you cheat? Like... release the locks on the split pages but then go
back and reacquire them to clear the flag...

Hmm, there's two issues with that:

1. While you're not holding the locks on the child pages, someone can step
onto the page and see that the MISSING_DOWNLINK flag is set, and try to
finish the split for you.

2. If you don't hold the page locked while you clear the flag, someone can
start and finish a checkpoint after you've inserted the downlink, and before
you've cleared the flag. You end up in a scenario where the flag is set, but
the page in fact *does* have a downlink in the parent.

It seems like both of these could be handled by making the code that
repairs the damage insert the downlink into the parent only if it's
not already present.

I am sorry to be dumping all these new open issues so late in the 9.1
cycle --- I only now got time to go back over my emails.  Many are from
March and later.

Well, I don't think we're likely to do anything about this for 9.1.
For 9.2, possibly, we can improve it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#31Bruce Momjian
bruce@momjian.us
In reply to: Heikki Linnakangas (#23)
Re: B-tree parent pointer and checkpoints

Heikki Linnakangas wrote:

On 11.03.2011 19:41, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

On 11.03.2011 17:59, Tom Lane wrote:

But that will be fixed during WAL replay.

Not under the circumstances that started the original thread:

1. Backend splits a page
2. Checkpoint starts
3. Checkpoint runs to completion
4. Crash
(5. Backend never got to insert the parent pointer)

WAL replay starts at the checkpoint redo pointer, which is after the
page split record, so WAL replay won't insert the parent pointer. That's
an incredibly tight window to hit in practice, but it's possible in theory.

Hmm. It's not so improbable that checkpoint would start inside that
window, but that the parent insertion is still pending by the time the
checkpoint finishes is pretty improbable.

How about just reducing the deletion-time ERROR for missing downlink to a LOG?

Well, the code that follows expects to have a valid parent page locked,
so you can't literally do just that. But yeah, LOG and aborting the page
deletion seems fine to me.

Added to TODO:

Fix problem with btree page splits during checkpoints

http://archives.postgresql.org/pgsql-hackers/2010-11/msg00052.php
http://archives.postgresql.org/pgsql-hackers/2011-09/msg00184.php

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#32Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#28)
Re: B-tree parent pointer and checkpoints

Has this been addressed? A TODO?

---------------------------------------------------------------------------

On Tue, Sep 6, 2011 at 09:49:39AM -0400, Robert Haas wrote:

On Tue, Sep 6, 2011 at 9:45 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Do you really need to hold the page locks for all that time, or could
you cheat? Like... release the locks on the split pages but then go
back and reacquire them to clear the flag...

Hmm, there's two issues with that:

1. While you're not holding the locks on the child pages, someone can step
onto the page and see that the MISSING_DOWNLINK flag is set, and try to
finish the split for you.

2. If you don't hold the page locked while you clear the flag, someone can
start and finish a checkpoint after you've inserted the downlink, and before
you've cleared the flag. You end up in a scenario where the flag is set, but
the page in fact *does* have a downlink in the parent.

It seems like both of these could be handled by making the code that
repairs the damage insert the downlink into the parent only if it's
not already present.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +

#33Robert Haas
robertmhaas@gmail.com
In reply to: Bruce Momjian (#32)
Re: B-tree parent pointer and checkpoints

On Wed, Aug 15, 2012 at 6:23 PM, Bruce Momjian <bruce@momjian.us> wrote:

Has this been addressed? A TODO?

I don't think anything's been done about it. According to your email
of October 11, 2011, you already did add a TODO for this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#34Bruce Momjian
bruce@momjian.us
In reply to: Robert Haas (#33)
Re: B-tree parent pointer and checkpoints

On Tue, Aug 21, 2012 at 11:55:20AM -0400, Robert Haas wrote:

On Wed, Aug 15, 2012 at 6:23 PM, Bruce Momjian <bruce@momjian.us> wrote:

Has this been addressed? A TODO?

I don't think anything's been done about it. According to your email
of October 11, 2011, you already did add a TODO for this.

Ah, I see that now. Thanks.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +