New WAL record to detect the checkpoint redo location

Started by Dilip Kumarover 2 years ago39 messages
#1Dilip Kumar
dilipbalaut@gmail.com
1 attachment(s)

As discussed [1 ][2]/messages/by-id/20230614194717.jyuw3okxup4cvtbt@awork3.anarazel.de currently, the checkpoint-redo LSN can not be
accurately detected while processing the WAL. Although we have a
checkpoint WAL record containing the exact redo LSN, other WAL records
may be inserted between the checkpoint-redo LSN and the actual
checkpoint record. If we want to stop processing wal exactly at the
checkpoint-redo location then we cannot do that because we would have
already processed some extra records that got added after the redo
LSN.

The patch inserts a special wal record named CHECKPOINT_REDO WAL,
which is located exactly at the checkpoint-redo location. We can
guarantee this record to be exactly at the checkpoint-redo point
because we already hold the exclusive WAL insertion lock while
identifying the checkpoint redo point and can insert this special
record exactly at the same time so that there are no concurrent WAL
insertions.

[1]: /messages/by-id/CA+TgmoYOYZfMCyOXFyC-P+-mdrZqm5pP2N7S-r0z3_402h9rsA@mail.gmail.com
[2]: /messages/by-id/20230614194717.jyuw3okxup4cvtbt@awork3.anarazel.de

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v1-0001-New-WAL-record-to-identify-check-redo-location.patchapplication/octet-stream; name=v1-0001-New-WAL-record-to-identify-check-redo-location.patchDownload
From d84cab5275d36a4eb857067f05c63391fdcc661d Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 15 Jun 2023 12:23:06 +0530
Subject: [PATCH v1] New WAL record to identify check redo location

Currently, the checkpoint-redo LSN cannot be accurately detected
while processing the WAL.  Although we have a checkpoint WAL record
containing the exact redo LSN, other WAL records may be inserted
between the checkpoint-redo LSN and the actual checkpoint record.
If we want to stop processing wal exactly at the checkpoint-redo
location then we cannot do that because we would have already
processed some extra records that got added after the redo LSN.

The patch inserts a special wal record named CHECKPOINT_REDO
WAL, which is located exactly at the checkpoint-redo location.
We can guarantee this record to be exactly at the checkpoint-redo
point because we already hold the exclusive WAL insertion lock
while identifying the checkpoint redo point and can insert this
special record exactly at the same time so that there are no
concurrent WAL insertions.
---
 src/backend/access/rmgrdesc/xlogdesc.c   |  7 +++
 src/backend/access/transam/xlog.c        | 60 +++++++++++++++++++++---
 src/backend/replication/logical/decode.c |  1 +
 src/include/catalog/pg_control.h         |  1 +
 4 files changed, 62 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f390c177e4..7868ec7633 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -148,6 +148,10 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
 						 timestamptz_to_str(xlrec.overwrite_time));
 	}
+	else if(info == XLOG_CHECKPOINT_REDO)
+	{
+		/* No details to write out */
+	}
 }
 
 const char *
@@ -196,6 +200,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_REDO:
+			id = "CHECKPOINT_REDO";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b2430f617c..a025fb91e2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -744,6 +744,7 @@ XLogInsertRecord(XLogRecData *rdata,
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 	bool		prevDoPageWrites = doPageWrites;
+	bool		callerHoldingExlock = holdingAllLocks;
 	TimeLineID	insertTLI;
 
 	/* we assume that all of the record header is in the first chunk */
@@ -792,10 +793,18 @@ XLogInsertRecord(XLogRecData *rdata,
 	 *----------
 	 */
 	START_CRIT_SECTION();
-	if (isLogSwitch)
-		WALInsertLockAcquireExclusive();
-	else
-		WALInsertLockAcquire();
+
+	/*
+	 * Acquire wal insertion lock, nothing to do if the caller is already
+	 * holding the exclusive lock.
+	 */
+	if (!callerHoldingExlock)
+	{
+		if (isLogSwitch)
+			WALInsertLockAcquireExclusive();
+		else
+			WALInsertLockAcquire();
+	}
 
 	/*
 	 * Check to see if my copy of RedoRecPtr is out of date. If so, may have
@@ -828,7 +837,10 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * Oops, some buffer now needs to be backed up that the caller didn't
 		 * back up.  Start over.
 		 */
-		WALInsertLockRelease();
+
+		/* release the wal insertion lock if we have acquired it here */
+		if (!callerHoldingExlock)
+			WALInsertLockRelease();
 		END_CRIT_SECTION();
 		return InvalidXLogRecPtr;
 	}
@@ -886,9 +898,12 @@ XLogInsertRecord(XLogRecData *rdata,
 	}
 
 	/*
-	 * Done! Let others know that we're finished.
+	 * Done! Let others know that we're finished.  But if we haven't acquire
+	 * the lock in this function then don't release it now, the caller will
+	 * take care of that.
 	 */
-	WALInsertLockRelease();
+	if (!callerHoldingExlock)
+		WALInsertLockRelease();
 
 	END_CRIT_SECTION();
 
@@ -6597,6 +6612,32 @@ CreateCheckPoint(int flags)
 	 */
 	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
 
+	/*
+	 * Insert a special purpose CHECKPOINT_REDO record as the first record at
+	 * checkpoint redo lsn.  Although we have the checkpoint record that
+	 * contains the exact redo lsn, there might have been some other records
+	 * those got inserted between the redo lsn and the actual checkpoint
+	 * record.  So when processing the wal, we cannot rely on the checkpoint
+	 * record if we want to stop at the checkpoint-redo LSN.
+	 *
+	 * This special record, however, is not required when we doing a shutdown
+	 * checkpoint, as there will be no concurrent wal insertions during that
+	 * time.  So, the shutdown checkpoint LSN will be the same as
+	 * checkpoint-redo LSN.
+	 *
+	 * This record is guaranteed to be the first record at checkpoint redo lsn
+	 * because we are inserting this while holding the exclusive wal insertion
+	 * lock.
+	 */
+	if (!shutdown)
+	{
+		int			dummy = 0;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &dummy, sizeof(dummy));
+		recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+	}
+
 	/*
 	 * Now we can release the WAL insertion locks, allowing other xacts to
 	 * proceed while we are flushing disk buffers.
@@ -8059,6 +8100,11 @@ xlog_redo(XLogReaderState *record)
 		/* Keep track of full_page_writes */
 		lastFullPageWrites = fpw;
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* nothing to do here */
+	}
+
 }
 
 /*
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index d91055a440..a126dc3c18 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,6 +190,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_FPI_FOR_HINT:
 		case XLOG_FPI:
 		case XLOG_OVERWRITE_CONTRECORD:
+		case XLOG_CHECKPOINT_REDO:
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", info);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index dc953977c5..1136613259 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -78,6 +78,7 @@ typedef struct CheckPoint
 #define XLOG_FPI						0xB0
 /* 0xC0 is used in Postgres 9.5-11 */
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
+#define XLOG_CHECKPOINT_REDO			0xE0
 
 
 /*
-- 
2.39.0

#2Andres Freund
andres@anarazel.de
In reply to: Dilip Kumar (#1)
Re: New WAL record to detect the checkpoint redo location

Hi,

As I think I mentioned before, I like this idea. However, I don't like the
implementation too much.

On 2023-06-15 13:11:57 +0530, Dilip Kumar wrote:

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b2430f617c..a025fb91e2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -744,6 +744,7 @@ XLogInsertRecord(XLogRecData *rdata,
XLogRecPtr	StartPos;
XLogRecPtr	EndPos;
bool		prevDoPageWrites = doPageWrites;
+	bool		callerHoldingExlock = holdingAllLocks;
TimeLineID	insertTLI;
/* we assume that all of the record header is in the first chunk */
@@ -792,10 +793,18 @@ XLogInsertRecord(XLogRecData *rdata,
*----------
*/
START_CRIT_SECTION();
-	if (isLogSwitch)
-		WALInsertLockAcquireExclusive();
-	else
-		WALInsertLockAcquire();
+
+	/*
+	 * Acquire wal insertion lock, nothing to do if the caller is already
+	 * holding the exclusive lock.
+	 */
+	if (!callerHoldingExlock)
+	{
+		if (isLogSwitch)
+			WALInsertLockAcquireExclusive();
+		else
+			WALInsertLockAcquire();
+	}

/*
* Check to see if my copy of RedoRecPtr is out of date. If so, may have

This might work right now, but doesn't really seem maintainable. Nor do I like
adding branches into this code a whole lot.

@@ -6597,6 +6612,32 @@ CreateCheckPoint(int flags)

I think the commentary above this function would need a fair bit of
revising...

*/
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

+	/*
+	 * Insert a special purpose CHECKPOINT_REDO record as the first record at
+	 * checkpoint redo lsn.  Although we have the checkpoint record that
+	 * contains the exact redo lsn, there might have been some other records
+	 * those got inserted between the redo lsn and the actual checkpoint
+	 * record.  So when processing the wal, we cannot rely on the checkpoint
+	 * record if we want to stop at the checkpoint-redo LSN.
+	 *
+	 * This special record, however, is not required when we doing a shutdown
+	 * checkpoint, as there will be no concurrent wal insertions during that
+	 * time.  So, the shutdown checkpoint LSN will be the same as
+	 * checkpoint-redo LSN.
+	 *
+	 * This record is guaranteed to be the first record at checkpoint redo lsn
+	 * because we are inserting this while holding the exclusive wal insertion
+	 * lock.
+	 */
+	if (!shutdown)
+	{
+		int			dummy = 0;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &dummy, sizeof(dummy));
+		recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+	}

It seems to me that we should be able to do better than this.

I suspect we might be able to get rid of the need for exclusive inserts
here. If we rid of that, we could determine the redoa location based on the
LSN determined by the XLogInsert().

Alternatively, I think we should split XLogInsertRecord() so that the part
with the insertion locks held is a separate function, that we could use here.

Greetings,

Andres Freund

#3Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andres Freund (#2)
1 attachment(s)
Re: New WAL record to detect the checkpoint redo location

On Fri, Jul 14, 2023 at 8:46 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

As I think I mentioned before, I like this idea. However, I don't like the
implementation too much.

Thanks for looking into it.

This might work right now, but doesn't really seem maintainable. Nor do I like
adding branches into this code a whole lot.

Okay, Now I have moved the XlogInsert for the special record outside
the WalInsertLock so we don't need this special handling here.

@@ -6597,6 +6612,32 @@ CreateCheckPoint(int flags)

I think the commentary above this function would need a fair bit of
revising...

*/
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

+     /*
+      * Insert a special purpose CHECKPOINT_REDO record as the first record at
+      * checkpoint redo lsn.  Although we have the checkpoint record that
+      * contains the exact redo lsn, there might have been some other records
+      * those got inserted between the redo lsn and the actual checkpoint
+      * record.  So when processing the wal, we cannot rely on the checkpoint
+      * record if we want to stop at the checkpoint-redo LSN.
+      *
+      * This special record, however, is not required when we doing a shutdown
+      * checkpoint, as there will be no concurrent wal insertions during that
+      * time.  So, the shutdown checkpoint LSN will be the same as
+      * checkpoint-redo LSN.
+      *
+      * This record is guaranteed to be the first record at checkpoint redo lsn
+      * because we are inserting this while holding the exclusive wal insertion
+      * lock.
+      */
+     if (!shutdown)
+     {
+             int                     dummy = 0;
+
+             XLogBeginInsert();
+             XLogRegisterData((char *) &dummy, sizeof(dummy));
+             recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+     }

It seems to me that we should be able to do better than this.

I suspect we might be able to get rid of the need for exclusive inserts
here. If we rid of that, we could determine the redoa location based on the
LSN determined by the XLogInsert().

Yeah, good idea, actually we can do this insert outside of the
exclusive insert lock and set the LSN of this insert as the
checkpoint. redo location. So now we do not need to compute the
checkpoint. redo based on the current insertion point we can directly
use the LSN of this record. I have modified this and I have also
moved some other code that is not required to be inside the WAL
insertion lock.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v2-0001-New-WAL-record-for-checkpoint-redo-location.patchapplication/octet-stream; name=v2-0001-New-WAL-record-for-checkpoint-redo-location.patchDownload
From 9424765162c6f698658fac7c58cc0be960a7a37c Mon Sep 17 00:00:00 2001
From: Dilip kumar <dilipkumar@dkmac.local>
Date: Tue, 15 Aug 2023 14:10:45 +0530
Subject: [PATCH v2] New WAL record for checkpoint redo location

Currently, the checkpoint-redo LSN cannot be accurately detected
while processing the WAL.  Although we have a checkpoint WAL record
containing the exact redo LSN, other WAL records may be inserted
between the checkpoint-redo LSN and the actual checkpoint record.
If we want to stop processing wal exactly at the checkpoint-redo
location then we cannot do that because we would have already
processed some extra records that got added after the redo LSN.

The patch inserts a special wal record named CHECKPOINT_REDO
WAL, and the checkpoint-redo location is set at LSN of this record.

Also after this patch now we do not need to compute the checkpoint
redo lsn based on the current WAL insert location as we are setting
this to the CHECKPOINT_REDO record.
---
 src/backend/access/rmgrdesc/xlogdesc.c   |  7 +++
 src/backend/access/transam/xlog.c        | 72 ++++++++++--------------
 src/backend/replication/logical/decode.c |  1 +
 src/include/catalog/pg_control.h         |  1 +
 4 files changed, 38 insertions(+), 43 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f390c177e4..7868ec7633 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -148,6 +148,10 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
 						 timestamptz_to_str(xlrec.overwrite_time));
 	}
+	else if(info == XLOG_CHECKPOINT_REDO)
+	{
+		/* No details to write out */
+	}
 }
 
 const char *
@@ -196,6 +200,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_REDO:
+			id = "CHECKPOINT_REDO";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3a..d25b7ca812 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6454,11 +6454,9 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
  * position of the WAL record (redo ptr) is the same or earlier than the
  * physical position. When we replay WAL we locate the checkpoint via its
  * physical position then read the redo ptr and actually start replay at the
- * earlier logical position. Note that we don't write *anything* to WAL at
- * the logical position, so that location could be any other kind of WAL record.
- * All of this mechanism allows us to continue working while we checkpoint.
- * As a result, timing of actions is critical here and be careful to note that
- * this function will likely take minutes to execute on a busy system.
+ * earlier logical position. Note that we do write a dummy wal record
+ * called XLOG_CHECKPOINT_REDO record and the checkpoint.redo LSN is always
+ * set to the LSN of this record.
  */
 void
 CreateCheckPoint(int flags)
@@ -6468,11 +6466,9 @@ CreateCheckPoint(int flags)
 	XLogRecPtr	recptr;
 	XLogSegNo	_logSegNo;
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
-	XLogRecPtr	curInsert;
-	XLogRecPtr	last_important_lsn;
 	VirtualTransactionId *vxids;
+	int			dummy = 0;
 	int			nvxids;
 	int			oldXLogAllowed = 0;
 
@@ -6534,19 +6530,6 @@ CreateCheckPoint(int flags)
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
-	/*
-	 * Get location of last important record before acquiring insert locks (as
-	 * GetLastImportantRecPtr() also locks WAL locks).
-	 */
-	last_important_lsn = GetLastImportantRecPtr();
-
-	/*
-	 * We must block concurrent insertions while examining insert state to
-	 * determine the checkpoint REDO pointer.
-	 */
-	WALInsertLockAcquireExclusive();
-	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-
 	/*
 	 * If this isn't a shutdown or forced checkpoint, and if there has been no
 	 * WAL activity requiring a checkpoint, skip it.  The idea here is to
@@ -6555,9 +6538,8 @@ CreateCheckPoint(int flags)
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
 				  CHECKPOINT_FORCE)) == 0)
 	{
-		if (last_important_lsn == ControlFile->checkPoint)
+		if (GetLastImportantRecPtr() == ControlFile->checkPoint)
 		{
-			WALInsertLockRelease();
 			END_CRIT_SECTION();
 			ereport(DEBUG1,
 					(errmsg_internal("checkpoint skipped because system is idle")));
@@ -6573,6 +6555,24 @@ CreateCheckPoint(int flags)
 	if (flags & CHECKPOINT_END_OF_RECOVERY)
 		oldXLogAllowed = LocalSetXLogInsertAllowed();
 
+	/*
+	 * Insert a dummy CHECKPOINT_REDO record and set LSN of this record as
+	 * checkpoint.redo.  Although we have the checkpoint record that also
+	 * contains the exact redo lsn, there might have been some other records
+	 * those got inserted between the redo lsn and the actual checkpoint
+	 * record.  So when processing the wal, we cannot rely on the checkpoint
+	 * record if we want to stop at the checkpoint-redo LSN.
+	 */
+	XLogBeginInsert();
+	XLogRegisterData((char *) &dummy, sizeof(dummy));
+	recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+
+	/*
+	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
+	 * must be done while holding all the insertion locks.
+	 */
+	WALInsertLockAcquireExclusive();
+
 	checkPoint.ThisTimeLineID = XLogCtl->InsertTimeLineID;
 	if (flags & CHECKPOINT_END_OF_RECOVERY)
 		checkPoint.PrevTimeLineID = XLogCtl->PrevTimeLineID;
@@ -6581,28 +6581,9 @@ CreateCheckPoint(int flags)
 
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
-	/*
-	 * Compute new REDO record ptr = location of next XLOG record.
-	 *
-	 * NB: this is NOT necessarily where the checkpoint record itself will be,
-	 * since other backends may insert more XLOG records while we're off doing
-	 * the buffer flush work.  Those XLOG records are logically after the
-	 * checkpoint, even though physically before it.  Got that?
-	 */
-	freespace = INSERT_FREESPACE(curInsert);
-	if (freespace == 0)
-	{
-		if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
-			curInsert += SizeOfXLogLongPHD;
-		else
-			curInsert += SizeOfXLogShortPHD;
-	}
-	checkPoint.redo = curInsert;
+	checkPoint.redo = recptr;
 
 	/*
-	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
-	 * must be done while holding all the insertion locks.
-	 *
 	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
 	 * pointing past where it really needs to point.  This is okay; the only
 	 * consequence is that XLogInsert might back up whole buffers that it
@@ -8074,6 +8055,11 @@ xlog_redo(XLogReaderState *record)
 		/* Keep track of full_page_writes */
 		lastFullPageWrites = fpw;
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* nothing to do here */
+	}
+
 }
 
 /*
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7039d425e2..89978c2fd8 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,6 +190,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_FPI_FOR_HINT:
 		case XLOG_FPI:
 		case XLOG_OVERWRITE_CONTRECORD:
+		case XLOG_CHECKPOINT_REDO:
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", info);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index dc953977c5..1136613259 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -78,6 +78,7 @@ typedef struct CheckPoint
 #define XLOG_FPI						0xB0
 /* 0xC0 is used in Postgres 9.5-11 */
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
+#define XLOG_CHECKPOINT_REDO			0xE0
 
 
 /*
-- 
2.39.2 (Apple Git-143)

#4Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#3)
Re: New WAL record to detect the checkpoint redo location

On Tue, Aug 15, 2023 at 02:23:43PM +0530, Dilip Kumar wrote:

Yeah, good idea, actually we can do this insert outside of the
exclusive insert lock and set the LSN of this insert as the
checkpoint. redo location. So now we do not need to compute the
checkpoint. redo based on the current insertion point we can directly
use the LSN of this record. I have modified this and I have also
moved some other code that is not required to be inside the WAL
insertion lock.

Looking at this patch, I am bit surprised to see that the redo point
maps with the end location of the CHECKPOINT_REDO record as it is the
LSN returned by XLogInsert(), not its start LSN. For example after a
checkpoint:
=# CREATE EXTENSION pg_walinspect;
CREATE EXTENSION;
=# SELECT redo_lsn, checkpoint_lsn from pg_control_checkpoint();
redo_lsn | checkpoint_lsn
-----------+----------------
0/19129D0 | 0/1912A08
(1 row)
=# SELECT start_lsn, prev_lsn, end_lsn, record_type
from pg_get_wal_record_info('0/19129D0');
start_lsn | prev_lsn | end_lsn | record_type
-----------+-----------+-----------+---------------
0/19129D0 | 0/19129B0 | 0/1912A08 | RUNNING_XACTS
(1 row)

In this case it matches with the previous record:
=# SELECT start_lsn, prev_lsn, end_lsn, record_type
from pg_get_wal_record_info('0/19129B0');
start_lsn | prev_lsn | end_lsn | record_type
-----------+-----------+-----------+-----------------
0/19129B0 | 0/1912968 | 0/19129D0 | CHECKPOINT_REDO
(1 row)

This could be used to cross-check that the first record replayed is of
the correct type. The commit message of this patch tells that "the
checkpoint-redo location is set at LSN of this record", which implies
the start LSN of the record tracked as the redo LSN, not the end of
it? What's the intention here?
--
Michael

#5Dilip Kumar
dilipbalaut@gmail.com
In reply to: Michael Paquier (#4)
Re: New WAL record to detect the checkpoint redo location

On Thu, Aug 17, 2023 at 10:52 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Aug 15, 2023 at 02:23:43PM +0530, Dilip Kumar wrote:

Yeah, good idea, actually we can do this insert outside of the
exclusive insert lock and set the LSN of this insert as the
checkpoint. redo location. So now we do not need to compute the
checkpoint. redo based on the current insertion point we can directly
use the LSN of this record. I have modified this and I have also
moved some other code that is not required to be inside the WAL
insertion lock.

Looking at this patch, I am bit surprised to see that the redo point
maps with the end location of the CHECKPOINT_REDO record as it is the
LSN returned by XLogInsert(), not its start LSN.

Yeah right, actually I was confused, I assumed it will return the
start LSN of the record. And I do not see any easy way to identify
the Start LSN of this record so maybe this solution will not work. I
will have to think of something else. Thanks for pointing it out.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#6Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#5)
Re: New WAL record to detect the checkpoint redo location

On Thu, Aug 17, 2023 at 01:11:50PM +0530, Dilip Kumar wrote:

Yeah right, actually I was confused, I assumed it will return the
start LSN of the record. And I do not see any easy way to identify
the Start LSN of this record so maybe this solution will not work. I
will have to think of something else. Thanks for pointing it out.

About that. One thing to consider may be ReserveXLogInsertLocation()
while holding the WAL insert lock, but you can just rely on
ProcLastRecPtr for the job after inserting the REDO record, no?
--
Michael

#7Dilip Kumar
dilipbalaut@gmail.com
In reply to: Michael Paquier (#6)
Re: New WAL record to detect the checkpoint redo location

On Fri, Aug 18, 2023 at 5:24 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Aug 17, 2023 at 01:11:50PM +0530, Dilip Kumar wrote:

Yeah right, actually I was confused, I assumed it will return the
start LSN of the record. And I do not see any easy way to identify
the Start LSN of this record so maybe this solution will not work. I
will have to think of something else. Thanks for pointing it out.

About that. One thing to consider may be ReserveXLogInsertLocation()
while holding the WAL insert lock, but you can just rely on
ProcLastRecPtr for the job after inserting the REDO record, no?

Yeah right, we can use ProcLastRecPtr. I will send the updated patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#8Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#7)
1 attachment(s)
Re: New WAL record to detect the checkpoint redo location

On Fri, Aug 18, 2023 at 10:12 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Fri, Aug 18, 2023 at 5:24 AM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Aug 17, 2023 at 01:11:50PM +0530, Dilip Kumar wrote:

Yeah right, actually I was confused, I assumed it will return the
start LSN of the record. And I do not see any easy way to identify
the Start LSN of this record so maybe this solution will not work. I
will have to think of something else. Thanks for pointing it out.

About that. One thing to consider may be ReserveXLogInsertLocation()
while holding the WAL insert lock, but you can just rely on
ProcLastRecPtr for the job after inserting the REDO record, no?

Yeah right, we can use ProcLastRecPtr. I will send the updated patch.

Here is the updated version of the patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v3-0001-New-WAL-record-for-checkpoint-redo-location.patchapplication/octet-stream; name=v3-0001-New-WAL-record-for-checkpoint-redo-location.patchDownload
From 2fa9f78c95bbe043ba30642d3db6afb959dca418 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 25 Aug 2023 05:16:54 +0000
Subject: [PATCH v3] New WAL record for checkpoint redo location

Currently, the checkpoint-redo LSN cannot be accurately detected
while processing the WAL.  Although we have a checkpoint WAL record
containing the exact redo LSN, other WAL records may be inserted
between the checkpoint-redo LSN and the actual checkpoint record.
If we want to stop processing wal exactly at the checkpoint-redo
location then we cannot do that because we would have already
processed some extra records that got added after the redo LSN.

The patch inserts a special WAL record named CHECKPOINT_REDO
WAL, and the checkpoint-redo location is set at the start LSN of
this record.

However for shutdown checkpoint we don't need this special record
because during shutdown there could not be any concurrent WAL
insertion so the shutdown checkpoint record should be at the
checkpoint-redo location.
---
 src/backend/access/rmgrdesc/xlogdesc.c   |  7 +++
 src/backend/access/transam/xlog.c        | 63 ++++++++++++++++++------
 src/backend/replication/logical/decode.c |  1 +
 src/include/catalog/pg_control.h         |  1 +
 4 files changed, 58 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f390c177e4..7868ec7633 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -148,6 +148,10 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
 						 timestamptz_to_str(xlrec.overwrite_time));
 	}
+	else if(info == XLOG_CHECKPOINT_REDO)
+	{
+		/* No details to write out */
+	}
 }
 
 const char *
@@ -196,6 +200,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_REDO:
+			id = "CHECKPOINT_REDO";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3a..856dc55baf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6534,6 +6534,31 @@ CreateCheckPoint(int flags)
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
+	/*
+	 * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record
+	 * as checkpoint.redo.  Although we have the checkpoint record that also
+	 * contains the exact redo lsn, there might have been some other records
+	 * those got inserted between the redo lsn and the actual checkpoint
+	 * record.  So when processing the wal, we cannot rely on the checkpoint
+	 * record if we want to stop at the checkpoint-redo LSN.
+	 *
+	 * This special record, however, is not required when we doing a shutdown
+	 * checkpoint, as there will be no concurrent wal insertions during that
+	 * time.  So, the shutdown checkpoint LSN will be the same as
+	 * checkpoint-redo LSN.
+	 */
+	if (!shutdown)
+	{
+		int		dummy = 0;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &dummy, sizeof(dummy));
+		XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+
+		/* store CHECKPOINT_REDO record's start location as checkpoint.redo */
+		checkPoint.redo = ProcLastRecPtr;
+	}
+
 	/*
 	 * Get location of last important record before acquiring insert locks (as
 	 * GetLastImportantRecPtr() also locks WAL locks).
@@ -6545,7 +6570,14 @@ CreateCheckPoint(int flags)
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
-	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+
+	/*
+	 * For regular checkpoint we have already computed the checkpoint.redo
+	 * location so for shutdown checkpoint set it to the next insert position
+	 * in the wal.
+	 */
+	if (shutdown)
+		curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
 
 	/*
 	 * If this isn't a shutdown or forced checkpoint, and if there has been no
@@ -6582,22 +6614,21 @@ CreateCheckPoint(int flags)
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
 	/*
-	 * Compute new REDO record ptr = location of next XLOG record.
-	 *
-	 * NB: this is NOT necessarily where the checkpoint record itself will be,
-	 * since other backends may insert more XLOG records while we're off doing
-	 * the buffer flush work.  Those XLOG records are logically after the
-	 * checkpoint, even though physically before it.  Got that?
+	 * In case of shutdown checkpoint, compute new REDO record
+	 * ptr = location of next XLOG record.
 	 */
-	freespace = INSERT_FREESPACE(curInsert);
-	if (freespace == 0)
+	if (shutdown)
 	{
-		if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
-			curInsert += SizeOfXLogLongPHD;
-		else
-			curInsert += SizeOfXLogShortPHD;
+		freespace = INSERT_FREESPACE(curInsert);
+		if (freespace == 0)
+		{
+			if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
+				curInsert += SizeOfXLogLongPHD;
+			else
+				curInsert += SizeOfXLogShortPHD;
+		}
+		checkPoint.redo = curInsert;
 	}
-	checkPoint.redo = curInsert;
 
 	/*
 	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
@@ -8074,6 +8105,10 @@ xlog_redo(XLogReaderState *record)
 		/* Keep track of full_page_writes */
 		lastFullPageWrites = fpw;
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* nothing to do here */
+	}
 }
 
 /*
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7039d425e2..89978c2fd8 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,6 +190,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_FPI_FOR_HINT:
 		case XLOG_FPI:
 		case XLOG_OVERWRITE_CONTRECORD:
+		case XLOG_CHECKPOINT_REDO:
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", info);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index dc953977c5..1136613259 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -78,6 +78,7 @@ typedef struct CheckPoint
 #define XLOG_FPI						0xB0
 /* 0xC0 is used in Postgres 9.5-11 */
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
+#define XLOG_CHECKPOINT_REDO			0xE0
 
 
 /*
-- 
2.34.1

#9Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#8)
Re: New WAL record to detect the checkpoint redo location

On Fri, Aug 25, 2023 at 11:08:25AM +0530, Dilip Kumar wrote:

Here is the updated version of the patch.

The concept of the patch looks sound to me. I have a few comments.

+	 * This special record, however, is not required when we doing a shutdown
+	 * checkpoint, as there will be no concurrent wal insertions during that
+	 * time.  So, the shutdown checkpoint LSN will be the same as
+	 * checkpoint-redo LSN.

This is missing "are", as in "when we are doing a shutdown
checkpoint".

- freespace = INSERT_FREESPACE(curInsert);
- if (freespace == 0)

The variable "freespace" can be moved within the if block introduced
by this patch when calculating the REDO location for the shutdown
case. And you can do the same with curInsert?

-	 * Compute new REDO record ptr = location of next XLOG record.
-	 *
-	 * NB: this is NOT necessarily where the checkpoint record itself will be,
-	 * since other backends may insert more XLOG records while we're off doing
-	 * the buffer flush work.  Those XLOG records are logically after the
-	 * checkpoint, even though physically before it.  Got that?
+	 * In case of shutdown checkpoint, compute new REDO record
+	 * ptr = location of next XLOG record.

It seems to me that keeping around this comment is important,
particularly for the case where we have a shutdown checkpoint and we
expect nothing to run, no?

How about adding a test in pg_walinspect?  There is an online
checkpoint running there, so you could just add something like that
to check that the REDO record is at the expected LSN stored in the
control file:
+-- Check presence of REDO record.
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
+  FROM pg_get_wal_record_info(:'redo_lsn');
--
Michael
#10Dilip Kumar
dilipbalaut@gmail.com
In reply to: Michael Paquier (#9)
1 attachment(s)
Re: New WAL record to detect the checkpoint redo location

On Mon, Aug 28, 2023 at 5:14 AM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Aug 25, 2023 at 11:08:25AM +0530, Dilip Kumar wrote:

Here is the updated version of the patch.

The concept of the patch looks sound to me. I have a few comments.

Thanks for the review

+        * This special record, however, is not required when we doing a shutdown
+        * checkpoint, as there will be no concurrent wal insertions during that
+        * time.  So, the shutdown checkpoint LSN will be the same as
+        * checkpoint-redo LSN.

This is missing "are", as in "when we are doing a shutdown
checkpoint".

Fixed

- freespace = INSERT_FREESPACE(curInsert);
- if (freespace == 0)

The variable "freespace" can be moved within the if block introduced
by this patch when calculating the REDO location for the shutdown
case. And you can do the same with curInsert?

Done, I have also moved code related to computing curInsert in the
same if (shutdown) block.

-        * Compute new REDO record ptr = location of next XLOG record.
-        *
-        * NB: this is NOT necessarily where the checkpoint record itself will be,
-        * since other backends may insert more XLOG records while we're off doing
-        * the buffer flush work.  Those XLOG records are logically after the
-        * checkpoint, even though physically before it.  Got that?
+        * In case of shutdown checkpoint, compute new REDO record
+        * ptr = location of next XLOG record.

It seems to me that keeping around this comment is important,
particularly for the case where we have a shutdown checkpoint and we
expect nothing to run, no?

I removed this mainly because now in other comments[1]+ /* + * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record + * as checkpoint.redo. Although we have the checkpoint record that also + * contains the exact redo lsn, there might have been some other records + * those got inserted between the redo lsn and the actual checkpoint + * record. So when processing the wal, we cannot rely on the checkpoint + * record if we want to stop at the checkpoint-redo LSN. + * + * This special record, however, is not required when we are doing a + * shutdown checkpoint, as there will be no concurrent wal insertions + * during that time. So, the shutdown checkpoint LSN will be the same as + * checkpoint-redo LSN. + */ where we are
introducing this new CHECKPOINT_REDO record we are explaining the
problem
that the redo location and the actual checkpoint records are not at
the same place and that is because of the concurrent xlog insertion.
I think we are explaining in more
detail by also stating that in case of a shutdown checkpoint, there
would not be any concurrent insertion so the shutdown checkpoint redo
will be at the same place. So I feel keeping old comments is not
required. And we are explaining it when we are setting this for
non-shutdown checkpoint because for shutdown checkpoint this statement
is anyway not correct because for the shutdown checkpoint the
checkpoint record will be at the same location and there will be no
concurrent wal insertion, what do you think?

[1]
+ /*
+ * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record
+ * as checkpoint.redo.  Although we have the checkpoint record that also
+ * contains the exact redo lsn, there might have been some other records
+ * those got inserted between the redo lsn and the actual checkpoint
+ * record.  So when processing the wal, we cannot rely on the checkpoint
+ * record if we want to stop at the checkpoint-redo LSN.
+ *
+ * This special record, however, is not required when we are doing a
+ * shutdown checkpoint, as there will be no concurrent wal insertions
+ * during that time.  So, the shutdown checkpoint LSN will be the same as
+ * checkpoint-redo LSN.
+ */
How about adding a test in pg_walinspect?  There is an online
checkpoint running there, so you could just add something like that
to check that the REDO record is at the expected LSN stored in the
control file:
+-- Check presence of REDO record.
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
+  FROM pg_get_wal_record_info(:'redo_lsn');
--

Done, thanks.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v4-0001-New-WAL-record-for-checkpoint-redo-location.patchapplication/octet-stream; name=v4-0001-New-WAL-record-for-checkpoint-redo-location.patchDownload
From c58cb77ec8ca903b2a5502be6646b6137cba3281 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 25 Aug 2023 05:16:54 +0000
Subject: [PATCH v4] New WAL record for checkpoint redo location

Currently, the checkpoint-redo LSN cannot be accurately detected
while processing the WAL.  Although we have a checkpoint WAL record
containing the exact redo LSN, other WAL records may be inserted
between the checkpoint-redo LSN and the actual checkpoint record.
If we want to stop processing wal exactly at the checkpoint-redo
location then we cannot do that because we would have already
processed some extra records that got added after the redo LSN.

The patch inserts a special WAL record named CHECKPOINT_REDO
WAL, and the checkpoint-redo location is set at the start LSN of
this record.

However for shutdown checkpoint we don't need this special record
because during shutdown there could not be any concurrent WAL
insertion so the shutdown checkpoint record should always be at the
checkpoint-redo location.
---
 .../pg_walinspect/expected/pg_walinspect.out  | 11 ++++
 contrib/pg_walinspect/sql/pg_walinspect.sql   |  7 +++
 src/backend/access/rmgrdesc/xlogdesc.c        |  7 +++
 src/backend/access/transam/xlog.c             | 60 ++++++++++++++-----
 src/backend/replication/logical/decode.c      |  1 +
 src/include/catalog/pg_control.h              |  1 +
 6 files changed, 71 insertions(+), 16 deletions(-)

diff --git a/contrib/pg_walinspect/expected/pg_walinspect.out b/contrib/pg_walinspect/expected/pg_walinspect.out
index a8f4c91060..f49c6b4844 100644
--- a/contrib/pg_walinspect/expected/pg_walinspect.out
+++ b/contrib/pg_walinspect/expected/pg_walinspect.out
@@ -140,6 +140,17 @@ SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_block_info(:'wal_lsn5', :'wal_lsn6')
  t
 (1 row)
 
+-- ===================================================================
+-- Test presence of new REDO record at the checkpoint redo location
+-- ===================================================================
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
+  FROM pg_get_wal_record_info(:'redo_lsn');
+ same_lsn |   record_type   
+----------+-----------------
+ t        | CHECKPOINT_REDO
+(1 row)
+
 -- ===================================================================
 -- Tests for permissions
 -- ===================================================================
diff --git a/contrib/pg_walinspect/sql/pg_walinspect.sql b/contrib/pg_walinspect/sql/pg_walinspect.sql
index f987ca31c4..443702d26e 100644
--- a/contrib/pg_walinspect/sql/pg_walinspect.sql
+++ b/contrib/pg_walinspect/sql/pg_walinspect.sql
@@ -89,6 +89,13 @@ SELECT pg_current_wal_lsn() AS wal_lsn6 \gset
 SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_block_info(:'wal_lsn5', :'wal_lsn6')
   WHERE relfilenode = :'sample_tbl_oid' AND block_fpi_data IS NOT NULL;
 
+-- ===================================================================
+-- Test presence of new REDO record at the checkpoint redo location
+-- ===================================================================
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
+  FROM pg_get_wal_record_info(:'redo_lsn');
+
 -- ===================================================================
 -- Tests for permissions
 -- ===================================================================
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f390c177e4..7868ec7633 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -148,6 +148,10 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
 						 timestamptz_to_str(xlrec.overwrite_time));
 	}
+	else if(info == XLOG_CHECKPOINT_REDO)
+	{
+		/* No details to write out */
+	}
 }
 
 const char *
@@ -196,6 +200,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_REDO:
+			id = "CHECKPOINT_REDO";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 60c0b7ec3a..7168e505fe 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6468,9 +6468,7 @@ CreateCheckPoint(int flags)
 	XLogRecPtr	recptr;
 	XLogSegNo	_logSegNo;
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
-	XLogRecPtr	curInsert;
 	XLogRecPtr	last_important_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
@@ -6534,6 +6532,31 @@ CreateCheckPoint(int flags)
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
+	/*
+	 * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record
+	 * as checkpoint.redo.  Although we have the checkpoint record that also
+	 * contains the exact redo lsn, there might have been some other records
+	 * those got inserted between the redo lsn and the actual checkpoint
+	 * record.  So when processing the wal, we cannot rely on the checkpoint
+	 * record if we want to stop at the checkpoint-redo LSN.
+	 *
+	 * This special record, however, is not required when we are doing a
+	 * shutdown checkpoint, as there will be no concurrent wal insertions
+	 * during that time.  So, the shutdown checkpoint LSN will be the same as
+	 * checkpoint-redo LSN.
+	 */
+	if (!shutdown)
+	{
+		int		dummy = 0;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &dummy, sizeof(dummy));
+		XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+
+		/* store CHECKPOINT_REDO record's start location as checkpoint.redo */
+		checkPoint.redo = ProcLastRecPtr;
+	}
+
 	/*
 	 * Get location of last important record before acquiring insert locks (as
 	 * GetLastImportantRecPtr() also locks WAL locks).
@@ -6545,7 +6568,6 @@ CreateCheckPoint(int flags)
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
-	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
 
 	/*
 	 * If this isn't a shutdown or forced checkpoint, and if there has been no
@@ -6582,22 +6604,24 @@ CreateCheckPoint(int flags)
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
 	/*
-	 * Compute new REDO record ptr = location of next XLOG record.
-	 *
-	 * NB: this is NOT necessarily where the checkpoint record itself will be,
-	 * since other backends may insert more XLOG records while we're off doing
-	 * the buffer flush work.  Those XLOG records are logically after the
-	 * checkpoint, even though physically before it.  Got that?
+	 * In case of shutdown checkpoint, compute new REDO record
+	 * ptr = location of next XLOG record.  For regular checkpoint we have
+	 * already set it to the CHECKPOINT_REDO so nothing to be done for that.
 	 */
-	freespace = INSERT_FREESPACE(curInsert);
-	if (freespace == 0)
+	if (shutdown)
 	{
-		if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
-			curInsert += SizeOfXLogLongPHD;
-		else
-			curInsert += SizeOfXLogShortPHD;
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+		uint32		freespace = INSERT_FREESPACE(curInsert);
+
+		if (freespace == 0)
+		{
+			if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
+				curInsert += SizeOfXLogLongPHD;
+			else
+				curInsert += SizeOfXLogShortPHD;
+		}
+		checkPoint.redo = curInsert;
 	}
-	checkPoint.redo = curInsert;
 
 	/*
 	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
@@ -8074,6 +8098,10 @@ xlog_redo(XLogReaderState *record)
 		/* Keep track of full_page_writes */
 		lastFullPageWrites = fpw;
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* nothing to do here */
+	}
 }
 
 /*
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7039d425e2..89978c2fd8 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,6 +190,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_FPI_FOR_HINT:
 		case XLOG_FPI:
 		case XLOG_OVERWRITE_CONTRECORD:
+		case XLOG_CHECKPOINT_REDO:
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", info);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index dc953977c5..1136613259 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -78,6 +78,7 @@ typedef struct CheckPoint
 #define XLOG_FPI						0xB0
 /* 0xC0 is used in Postgres 9.5-11 */
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
+#define XLOG_CHECKPOINT_REDO			0xE0
 
 
 /*
-- 
2.34.1

#11Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#10)
Re: New WAL record to detect the checkpoint redo location

On Mon, Aug 28, 2023 at 01:47:18PM +0530, Dilip Kumar wrote:

I removed this mainly because now in other comments[1] where we are
introducing this new CHECKPOINT_REDO record we are explaining the
problem
that the redo location and the actual checkpoint records are not at
the same place and that is because of the concurrent xlog insertion.
I think we are explaining in more
detail by also stating that in case of a shutdown checkpoint, there
would not be any concurrent insertion so the shutdown checkpoint redo
will be at the same place. So I feel keeping old comments is not
required.
And we are explaining it when we are setting this for
non-shutdown checkpoint because for shutdown checkpoint this statement
is anyway not correct because for the shutdown checkpoint the
checkpoint record will be at the same location and there will be no
concurrent wal insertion, what do you think?

+    * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record
+    * as checkpoint.redo.

I would add a "for a non-shutdown checkpoint" at the end of this
sentence.

+    * record.  So when processing the wal, we cannot rely on the checkpoint
+    * record if we want to stop at the checkpoint-redo LSN.

The term "checkpoint-redo" is also a bit confusing, I guess, because
you just mean to refer to the "redo" LSN here? Maybe rework the last
sentence as:
"So, when processing WAL, we cannot rely on the checkpoint record if
we want to stop at the same position as the redo LSN".

+    * This special record, however, is not required when we are doing a
+    * shutdown checkpoint, as there will be no concurrent wal insertions
+    * during that time.  So, the shutdown checkpoint LSN will be the same as
+    * checkpoint-redo LSN.

Perhaps the last sentence could be merged with the first one, if we
are tweaking things, say:
"This special record is not required when doing a shutdown checkpoint;
the redo LSN is the same LSN as the checkpoint record as there cannot
be any WAL activity in a shutdown sequence."
--
Michael

#12Dilip Kumar
dilipbalaut@gmail.com
In reply to: Michael Paquier (#11)
1 attachment(s)
Re: New WAL record to detect the checkpoint redo location

On Wed, Aug 30, 2023 at 1:03 PM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Aug 28, 2023 at 01:47:18PM +0530, Dilip Kumar wrote:

I removed this mainly because now in other comments[1] where we are
introducing this new CHECKPOINT_REDO record we are explaining the
problem
that the redo location and the actual checkpoint records are not at
the same place and that is because of the concurrent xlog insertion.
I think we are explaining in more
detail by also stating that in case of a shutdown checkpoint, there
would not be any concurrent insertion so the shutdown checkpoint redo
will be at the same place. So I feel keeping old comments is not
required.
And we are explaining it when we are setting this for
non-shutdown checkpoint because for shutdown checkpoint this statement
is anyway not correct because for the shutdown checkpoint the
checkpoint record will be at the same location and there will be no
concurrent wal insertion, what do you think?

+    * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record
+    * as checkpoint.redo.

I would add a "for a non-shutdown checkpoint" at the end of this
sentence.

+    * record.  So when processing the wal, we cannot rely on the checkpoint
+    * record if we want to stop at the checkpoint-redo LSN.

The term "checkpoint-redo" is also a bit confusing, I guess, because
you just mean to refer to the "redo" LSN here? Maybe rework the last
sentence as:
"So, when processing WAL, we cannot rely on the checkpoint record if
we want to stop at the same position as the redo LSN".

+    * This special record, however, is not required when we are doing a
+    * shutdown checkpoint, as there will be no concurrent wal insertions
+    * during that time.  So, the shutdown checkpoint LSN will be the same as
+    * checkpoint-redo LSN.

Perhaps the last sentence could be merged with the first one, if we
are tweaking things, say:
"This special record is not required when doing a shutdown checkpoint;
the redo LSN is the same LSN as the checkpoint record as there cannot
be any WAL activity in a shutdown sequence."

Your suggestions LGTM so modified accordingly

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

v5-0001-New-WAL-record-for-checkpoint-redo-location.patchapplication/octet-stream; name=v5-0001-New-WAL-record-for-checkpoint-redo-location.patchDownload
From b8a946291b130e3d1fe5f92c4d79a20443ca82c2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Fri, 25 Aug 2023 05:16:54 +0000
Subject: [PATCH v5] New WAL record for checkpoint redo location

Currently, for non shutdown checkpoints the checkpoint.redo
LSN cannot be accurately detected while processing the WAL.
Although we have a checkpoint WAL record containing the exact
redo LSN, other WAL records may be inserted between the redo
LSN and the actual checkpoint record.

So while processing the WAL, If we want to stop exactly at the
that LSN then we can't do that because we would have already
processed some extra records that got added after the redo LSN.

The patch inserts a special WAL record named CHECKPOINT_REDO
WAL, and the redo location is set at the start LSN of this record.

However for shutdown checkpoint we don't need this special record
because during shutdown there could not be any concurrent WAL
insertion so the shutdown checkpoint record should always be at the
redo location.
---
 .../pg_walinspect/expected/pg_walinspect.out  | 11 ++++
 contrib/pg_walinspect/sql/pg_walinspect.sql   |  7 +++
 src/backend/access/rmgrdesc/xlogdesc.c        |  7 +++
 src/backend/access/transam/xlog.c             | 60 ++++++++++++++-----
 src/backend/replication/logical/decode.c      |  1 +
 src/include/catalog/pg_control.h              |  1 +
 6 files changed, 71 insertions(+), 16 deletions(-)

diff --git a/contrib/pg_walinspect/expected/pg_walinspect.out b/contrib/pg_walinspect/expected/pg_walinspect.out
index a8f4c91060..f49c6b4844 100644
--- a/contrib/pg_walinspect/expected/pg_walinspect.out
+++ b/contrib/pg_walinspect/expected/pg_walinspect.out
@@ -140,6 +140,17 @@ SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_block_info(:'wal_lsn5', :'wal_lsn6')
  t
 (1 row)
 
+-- ===================================================================
+-- Test presence of new REDO record at the checkpoint redo location
+-- ===================================================================
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
+  FROM pg_get_wal_record_info(:'redo_lsn');
+ same_lsn |   record_type   
+----------+-----------------
+ t        | CHECKPOINT_REDO
+(1 row)
+
 -- ===================================================================
 -- Tests for permissions
 -- ===================================================================
diff --git a/contrib/pg_walinspect/sql/pg_walinspect.sql b/contrib/pg_walinspect/sql/pg_walinspect.sql
index f987ca31c4..443702d26e 100644
--- a/contrib/pg_walinspect/sql/pg_walinspect.sql
+++ b/contrib/pg_walinspect/sql/pg_walinspect.sql
@@ -89,6 +89,13 @@ SELECT pg_current_wal_lsn() AS wal_lsn6 \gset
 SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_block_info(:'wal_lsn5', :'wal_lsn6')
   WHERE relfilenode = :'sample_tbl_oid' AND block_fpi_data IS NOT NULL;
 
+-- ===================================================================
+-- Test presence of new REDO record at the checkpoint redo location
+-- ===================================================================
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
+  FROM pg_get_wal_record_info(:'redo_lsn');
+
 -- ===================================================================
 -- Tests for permissions
 -- ===================================================================
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f390c177e4..7868ec7633 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -148,6 +148,10 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
 						 timestamptz_to_str(xlrec.overwrite_time));
 	}
+	else if(info == XLOG_CHECKPOINT_REDO)
+	{
+		/* No details to write out */
+	}
 }
 
 const char *
@@ -196,6 +200,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_REDO:
+			id = "CHECKPOINT_REDO";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..77945ed855 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6481,9 +6481,7 @@ CreateCheckPoint(int flags)
 	XLogRecPtr	recptr;
 	XLogSegNo	_logSegNo;
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
-	XLogRecPtr	curInsert;
 	XLogRecPtr	last_important_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
@@ -6547,6 +6545,31 @@ CreateCheckPoint(int flags)
 	else
 		checkPoint.oldestActiveXid = InvalidTransactionId;
 
+	/*
+	 * Insert a dummy CHECKPOINT_REDO record and set start LSN of this record
+	 * as checkpoint.redo for a non-shutdown checkpoint.  Although we have the
+	 * checkpoint record that also contains the exact redo LSN, there might
+	 * have been some other records those got inserted between the redo LSN and
+	 * the actual checkpoint record.  So when processing the WAL, we cannot
+	 * rely on the checkpoint record if we want to stop at the same position as
+	 * the redo LSN.
+	 *
+	 * This special record is not required when doing a shutdown checkpoint;
+	 * the redo LSN is the same LSN as the checkpoint record as there cannot
+	 * be any WAL activity in a shutdown sequence.
+	 */
+	if (!shutdown)
+	{
+		int		dummy = 0;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &dummy, sizeof(dummy));
+		XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+
+		/* store CHECKPOINT_REDO record's start location as checkpoint.redo */
+		checkPoint.redo = ProcLastRecPtr;
+	}
+
 	/*
 	 * Get location of last important record before acquiring insert locks (as
 	 * GetLastImportantRecPtr() also locks WAL locks).
@@ -6558,7 +6581,6 @@ CreateCheckPoint(int flags)
 	 * determine the checkpoint REDO pointer.
 	 */
 	WALInsertLockAcquireExclusive();
-	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
 
 	/*
 	 * If this isn't a shutdown or forced checkpoint, and if there has been no
@@ -6595,22 +6617,24 @@ CreateCheckPoint(int flags)
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
 	/*
-	 * Compute new REDO record ptr = location of next XLOG record.
-	 *
-	 * NB: this is NOT necessarily where the checkpoint record itself will be,
-	 * since other backends may insert more XLOG records while we're off doing
-	 * the buffer flush work.  Those XLOG records are logically after the
-	 * checkpoint, even though physically before it.  Got that?
+	 * In case of shutdown checkpoint, compute new REDO record
+	 * ptr = location of next XLOG record.  For regular checkpoint we have
+	 * already set it to the CHECKPOINT_REDO so nothing to be done for that.
 	 */
-	freespace = INSERT_FREESPACE(curInsert);
-	if (freespace == 0)
+	if (shutdown)
 	{
-		if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
-			curInsert += SizeOfXLogLongPHD;
-		else
-			curInsert += SizeOfXLogShortPHD;
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+		uint32		freespace = INSERT_FREESPACE(curInsert);
+
+		if (freespace == 0)
+		{
+			if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
+				curInsert += SizeOfXLogLongPHD;
+			else
+				curInsert += SizeOfXLogShortPHD;
+		}
+		checkPoint.redo = curInsert;
 	}
-	checkPoint.redo = curInsert;
 
 	/*
 	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
@@ -8087,6 +8111,10 @@ xlog_redo(XLogReaderState *record)
 		/* Keep track of full_page_writes */
 		lastFullPageWrites = fpw;
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* nothing to do here */
+	}
 }
 
 /*
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 7039d425e2..89978c2fd8 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,6 +190,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_FPI_FOR_HINT:
 		case XLOG_FPI:
 		case XLOG_OVERWRITE_CONTRECORD:
+		case XLOG_CHECKPOINT_REDO:
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", info);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index dc953977c5..1136613259 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -78,6 +78,7 @@ typedef struct CheckPoint
 #define XLOG_FPI						0xB0
 /* 0xC0 is used in Postgres 9.5-11 */
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
+#define XLOG_CHECKPOINT_REDO			0xE0
 
 
 /*
-- 
2.39.2 (Apple Git-143)

#13Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#12)
Re: New WAL record to detect the checkpoint redo location

On Wed, Aug 30, 2023 at 04:51:19PM +0530, Dilip Kumar wrote:

Your suggestions LGTM so modified accordingly

I have been putting my HEAD on this patch for a few hours, reviewing
the surroundings, and somewhat missed that this computation is done
while we do not hold the WAL insert locks:
+ checkPoint.redo = ProcLastRecPtr;

Then a few lines down the shared Insert.RedoRecPtr is updated while
holding an exclusive lock.
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

If we have a bunch of records inserted between the moment when the
REDO record is inserted and the moment when the checkpointer takes the
exclusive WAL lock, aren't we potentially missing a lot of FPW's that
should exist since the redo LSN?
--
Michael

#14Dilip Kumar
dilipbalaut@gmail.com
In reply to: Michael Paquier (#13)
Re: New WAL record to detect the checkpoint redo location

On Thu, Aug 31, 2023 at 9:36 AM Michael Paquier <michael@paquier.xyz> wrote:

On Wed, Aug 30, 2023 at 04:51:19PM +0530, Dilip Kumar wrote:

Your suggestions LGTM so modified accordingly

I have been putting my HEAD on this patch for a few hours, reviewing
the surroundings, and somewhat missed that this computation is done
while we do not hold the WAL insert locks:
+ checkPoint.redo = ProcLastRecPtr;

Then a few lines down the shared Insert.RedoRecPtr is updated while
holding an exclusive lock.
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

If we have a bunch of records inserted between the moment when the
REDO record is inserted and the moment when the checkpointer takes the
exclusive WAL lock, aren't we potentially missing a lot of FPW's that
should exist since the redo LSN?

Yeah, good catch. With this, it seems like we can not move this new
WAL Insert out of the Exclusive WAL insertion lock right? because if
we want to set the LSN of this record as the checkpoint. redo then
there should not be any concurrent insertion until we expose the
XLogCtl->Insert.RedoRecPtr. Otherwise, we will miss the FPW for all
the record which has been inserted after the checkpoint. redo before
we acquired the exclusive WAL insertion lock.

So maybe I need to restart from the first version of the patch but
instead of moving the insertion of the new record out of the exclusive
lock need to do some better refactoring so that XLogInsertRecord()
doesn't look ugly.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#15Michael Paquier
michael@paquier.xyz
In reply to: Dilip Kumar (#14)
Re: New WAL record to detect the checkpoint redo location

On Thu, Aug 31, 2023 at 09:55:45AM +0530, Dilip Kumar wrote:

Yeah, good catch. With this, it seems like we can not move this new
WAL Insert out of the Exclusive WAL insertion lock right? Because if
we want to set the LSN of this record as the checkpoint.redo then
there should not be any concurrent insertion until we expose the
XLogCtl->Insert.RedoRecPtr. Otherwise, we will miss the FPW for all
the record which has been inserted after the checkpoint.redo before
we acquired the exclusive WAL insertion lock.

Yes.

So maybe I need to restart from the first version of the patch but
instead of moving the insertion of the new record out of the exclusive
lock need to do some better refactoring so that XLogInsertRecord()
doesn't look ugly.

Yes, I am not sure which interface would be less ugli-ish, but that's
enough material for a refactoring patch of the WAL insert routines on
top of the main patch that introduces the REDO record.
--
Michael

#16Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#2)
Re: New WAL record to detect the checkpoint redo location

On Fri, Jul 14, 2023 at 11:16 AM Andres Freund <andres@anarazel.de> wrote:

I suspect we might be able to get rid of the need for exclusive inserts
here. If we rid of that, we could determine the redoa location based on the
LSN determined by the XLogInsert().

I've been brainstorming about this today, trying to figure out some
ideas to make it work.

As Michael Paquier correctly noted downthread, we need to make sure
that a backend inserting a WAL record knows whether it needs to
contain an FPI. The comments in the do...while loop in XLogInsert are
pretty helpful here: doPageWrites can't change once XLogInsertRecord
acquires a WAL insertion lock. For that to be true, the redo pointer
can only move when holding all WAL insertion locks. That means that if
we add an XLOG_CHECKPOINT_REDO to mark the location of the redo
pointer, we've got to either (a) insert the record *and* update our
notion of the last redo pointer while holding all the WAL insertion
locks or (b) change the concurrency model in some way.

Let's explore (b) first. Perhaps my imagination is too limited here,
but I'm having trouble seeing a good way of doing this. One idea that
occurred to me was to make either the insertion of the
XLOG_CHECKPOINT_REDO record fail softly if somebody inserts a record
after it that omits FPIs, but that doesn't really work because then
we're left with a hole in the WAL. It's too late to move the later
record earlier. We could convert the intended XLOG_CHECKPOINT_REDO
record into a dummy record but that seems complex and wasteful.
Similarly, you could try to make the insertion of the later record
fail, but that basically has the same problem: there could be an even
later record being inserted after that which it's already too late to
reposition. Basically, it feels like once we get to the point where we
have a range of LSNs and we're copying data into wal_buffers, it's
awfully late to be trying to back out. Other people can already be
depending on us to put the amount of WAL that we promised to insert at
the place where we promised to put it.

The only other approach to (b) that I can think of is to force FPIs on
for all backends from just before to just after we insert the
XLOG_CHECKPOINT_REDO record. However, since we currently require
taking all the WAL insertion locks to start requiring full page
writes, this doesn't seem like it gains much. In theory perhaps we
could have an approach where we flip full page writes to sorta-on,
then wait until we've seen each WAL insertion lock unheld at least
once, and then at that point we know all new WAL insertions will see
them and can deem them fully on. However, when I start to think along
these lines, I feel like maybe I'm losing the plot. Checkpoints are
rare enough that the overhead of taking all the WAL insertion locks at
the same time isn't really a big problem, or at least I don't think it
is. I think the concern here is more about avoiding useless branches
in hot paths that potentially cost something for every single record
whether it has anything to do with this mechanism or not.

OK, so let's suppose we abandon the idea of changing the concurrency
model in any fundamental way and just try to figure out how to both
insert the record and update our notion of the last redo pointer while
holding all the WAL insertion locks i.e. (a) from the two options
above. Dilip's patch approaches this problem by pulling acquisition of
the WAL insertion locks up to the place where we're already setting
the redo pointer. I wonder if we could also consider the other
possible approach of pushing the update to Insert->RedoRecPtr down
into XLogInsertRecord(), which already has a special case for
acquiring all locks when the record being inserted is an XLOG_SWITCH
record. That would have the advantage of holding all of the insertion
locks for a smaller period of time than what Dilip's patch does -- in
particular, it wouldn't need to hold the lock across the
XLOG_CHECKPOINT_REDO's XLogRecordAssemble -- or across the rather
lengthy tail of XLogInsertRecord. But the obvious objection is that it
would put more branches into XLogInsertRecord which nobody wants.

But ... what if it didn't? Suppose we pulled the XLOG_SWITCH case out
of XLogInsertRecord and made a separate function for that case. It
looks to me like that would avoid 5 branches in that function in the
normal case where we're not inserting XLOG_SWITCH. We would want to
move some logic, particularly the WAL_DEBUG stuff and maybe other
things, into reusable subroutines. Then, we could decide to treat
XLOG_CHECKPOINT_REDO either in the normal path -- adding a couple of
those branches back again -- or in the XLOG_SWITCH function and either
way I think the normal path would have fewer branches than it does
today. One idea that I had was to create a new rmgr for "weird
records," initially XLOG_SWITCH and XLOG_CHECKPOINT_REDO. Then the
test as to whether to use the "normal" version of XLogInsertRecord or
the "weird" version could just be based on the rmid, and the "normal"
function wouldn't need to concern itself with anything specific to the
"weird" cases.

A variant on this idea would be to just accept a few extra branches
and hope it's not really that big of a deal. For instance, instead of
this:

bool isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
info == XLOG_SWITCH);

We could have this:

bool isAllLocks = (rechdr->xl_rmid == RM_BIZARRE_ID);
bool isLogSwitch = (isAllLocks && info == XLOG_SWITCH);

...and then conditionalize on either isAllLocks or isLogSwitch as
apppropriate. You'd still need an extra branch someplace to update
Insert->RedoRecPtr when isAllLocks && info == XLOG_CHECKPOINT_REDO,
but maybe that's not so bad?

Alternatively, I think we should split XLogInsertRecord() so that the part
with the insertion locks held is a separate function, that we could use here.

The difficulty that I see with this is that the function does a lot
more stuff after calling WALInsertLockRelease(). So just pushing the
part that's between acquiring and releasing WAL insertion locks into a
separate function wouldn't actually avoid a lot of code duplication,
if the goal was to do everything else that XLogInsertRecord() does
except for the lock manipulation. To get there, I think we'd need to
move all of the stuff after the lock release into one or more static
functions, too. Which is possibly an OK approach. I haven't checked
how much additional parameter passing we'd end up doing if we went
that way.

--
Robert Haas
EDB: http://www.enterprisedb.com

#17Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#16)
3 attachment(s)
Re: New WAL record to detect the checkpoint redo location

On Mon, Sep 18, 2023 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:

I've been brainstorming about this today, trying to figure out some
ideas to make it work.

Here are some patches.

0001 refactors XLogInsertRecord to unify a couple of separate tests of
isLogSwitch, hopefully making it cleaner and cheaper to add more
special cases.

0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using
it for anything.

0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO
record for any non-shutdown checkpoint, and modifies
XLogInsertRecord() to treat that as a new special case, wherein after
inserting the record the redo pointer is reset while still holding the
WAL insertion locks.

I've tested this to the extent of running the regression tests, and I
also did one (1) manual test where it looked like the right thing was
happening, but that's it, so this might be buggy or perform like
garbage for all I know. But my hope is that it isn't buggy and
performs adequately. If there's any chance of getting some comments on
the basic design choices before I spend time testing and polishing it,
that would be very helpful.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v6-0001-Unify-two-isLogSwitch-tests-in-XLogInsertRecord.patchapplication/octet-stream; name=v6-0001-Unify-two-isLogSwitch-tests-in-XLogInsertRecord.patchDownload
From 474f96734c8c86a3d01979b6a3f8c293eee4e927 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 20 Sep 2023 13:48:19 -0400
Subject: [PATCH v6 1/3] Unify two isLogSwitch tests in XLogInsertRecord.

An upcoming patch wants to introduce an additional special case in this
function. To keep that as cheap as possible, minimize the amount of branching
that we do based on whether this is an XLOG_SWITCH record.
---
 src/backend/access/transam/xlog.c | 97 ++++++++++++++++++-------------
 1 file changed, 56 insertions(+), 41 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f6f8adc72a..348cbebe3a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -792,57 +792,72 @@ XLogInsertRecord(XLogRecData *rdata,
 	 *----------
 	 */
 	START_CRIT_SECTION();
+
 	if (isLogSwitch)
+	{
+		/*
+		 * In order to insert an XLOG_SWITCH record, we need to hold all of
+		 * the WAL insertion locks, not just one, so that no one else can
+		 * begin inserting a record until we've figured out how much space
+		 * remains in the current WAL segment and claimed all of it.
+		 *
+		 * Nonetheless, this case is simpler than the normal cases handled
+		 * below, which must check for changes in doPageWrites and RedoRecPtr.
+		 * Those checks are only needed for records that can contain
+		 * full-pages images, and an XLOG_SWITCH record never does.
+		 */
+		Assert(fpw_lsn == InvalidXLogRecPtr);
 		WALInsertLockAcquireExclusive();
+		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
+	}
 	else
-		WALInsertLockAcquire();
-
-	/*
-	 * Check to see if my copy of RedoRecPtr is out of date. If so, may have
-	 * to go back and have the caller recompute everything. This can only
-	 * happen just after a checkpoint, so it's better to be slow in this case
-	 * and fast otherwise.
-	 *
-	 * Also check to see if fullPageWrites was just turned on or there's a
-	 * running backup (which forces full-page writes); if we weren't already
-	 * doing full-page writes then go back and recompute.
-	 *
-	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
-	 * affect the contents of the XLOG record, so we'll update our local copy
-	 * but not force a recomputation.  (If doPageWrites was just turned off,
-	 * we could recompute the record without full pages, but we choose not to
-	 * bother.)
-	 */
-	if (RedoRecPtr != Insert->RedoRecPtr)
 	{
-		Assert(RedoRecPtr < Insert->RedoRecPtr);
-		RedoRecPtr = Insert->RedoRecPtr;
-	}
-	doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
+		WALInsertLockAcquire();
 
-	if (doPageWrites &&
-		(!prevDoPageWrites ||
-		 (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
-	{
 		/*
-		 * Oops, some buffer now needs to be backed up that the caller didn't
-		 * back up.  Start over.
+		 * Check to see if my copy of RedoRecPtr is out of date. If so, may
+		 * have to go back and have the caller recompute everything. This can
+		 * only happen just after a checkpoint, so it's better to be slow in
+		 * this case and fast otherwise.
+		 *
+		 * Also check to see if fullPageWrites was just turned on or there's a
+		 * running backup (which forces full-page writes); if we weren't
+		 * already doing full-page writes then go back and recompute.
+		 *
+		 * If we aren't doing full-page writes then RedoRecPtr doesn't
+		 * actually affect the contents of the XLOG record, so we'll update
+		 * our local copy but not force a recomputation.  (If doPageWrites was
+		 * just turned off, we could recompute the record without full pages,
+		 * but we choose not to bother.)
 		 */
-		WALInsertLockRelease();
-		END_CRIT_SECTION();
-		return InvalidXLogRecPtr;
-	}
+		if (RedoRecPtr != Insert->RedoRecPtr)
+		{
+			Assert(RedoRecPtr < Insert->RedoRecPtr);
+			RedoRecPtr = Insert->RedoRecPtr;
+		}
+		doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
 
-	/*
-	 * Reserve space for the record in the WAL. This also sets the xl_prev
-	 * pointer.
-	 */
-	if (isLogSwitch)
-		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
-	else
-	{
+		if (doPageWrites &&
+			(!prevDoPageWrites ||
+			 (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
+		{
+			/*
+			 * Oops, some buffer now needs to be backed up that the caller
+			 * didn't back up.  Start over.
+			 */
+			WALInsertLockRelease();
+			END_CRIT_SECTION();
+			return InvalidXLogRecPtr;
+		}
+
+		/*
+		 * Reserve space for the record in the WAL. This also sets the xl_prev
+		 * pointer.
+		 */
 		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
 								  &rechdr->xl_prev);
+
+		/* Normal records are always inserted. */
 		inserted = true;
 	}
 
-- 
2.37.1 (Apple Git-137.1)

v6-0002-Minimally-add-XLOG_CHECKPOINT_REDO.patchapplication/octet-stream; name=v6-0002-Minimally-add-XLOG_CHECKPOINT_REDO.patchDownload
From af6b150041ff399fda83cf9c43da880e8d850921 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 20 Sep 2023 13:55:28 -0400
Subject: [PATCH v6 2/3] Minimally add XLOG_CHECKPOINT_REDO.

Not to be committed separately.

Based on a previous patch by Dilip Kumar.
---
 src/backend/access/rmgrdesc/xlogdesc.c   | 7 +++++++
 src/backend/access/transam/xlog.c        | 4 ++++
 src/backend/replication/logical/decode.c | 1 +
 src/include/catalog/pg_control.h         | 1 +
 4 files changed, 13 insertions(+)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f390c177e4..37f59bda7e 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -148,6 +148,10 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
 						 timestamptz_to_str(xlrec.overwrite_time));
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* No details to write out */
+	}
 }
 
 const char *
@@ -196,6 +200,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_REDO:
+			id = "CHECKPOINT_REDO";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 348cbebe3a..026e2fc9da 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8102,6 +8102,10 @@ xlog_redo(XLogReaderState *record)
 		/* Keep track of full_page_writes */
 		lastFullPageWrites = fpw;
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* nothing to do here, just for informational purposes */
+	}
 }
 
 /*
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 730061c9da..24b712aa66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,6 +190,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_FPI_FOR_HINT:
 		case XLOG_FPI:
 		case XLOG_OVERWRITE_CONTRECORD:
+		case XLOG_CHECKPOINT_REDO:
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", info);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index dc953977c5..1136613259 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -78,6 +78,7 @@ typedef struct CheckPoint
 #define XLOG_FPI						0xB0
 /* 0xC0 is used in Postgres 9.5-11 */
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
+#define XLOG_CHECKPOINT_REDO			0xE0
 
 
 /*
-- 
2.37.1 (Apple Git-137.1)

v6-0003-WIP-Insert-XLOG_CHECKPOINT_REDO-at-the-redo-point.patchapplication/octet-stream; name=v6-0003-WIP-Insert-XLOG_CHECKPOINT_REDO-at-the-redo-point.patchDownload
From 04e712d73b32b7148fa8fcf019367ea3230050f3 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 20 Sep 2023 16:11:31 -0400
Subject: [PATCH v6 3/3] WIP: Insert XLOG_CHECKPOINT_REDO at the redo point.

Merge this into previous commit.
---
 src/backend/access/transam/xlog.c | 129 ++++++++++++++++++++++--------
 1 file changed, 95 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 026e2fc9da..292646d1cf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -559,6 +559,16 @@ typedef struct XLogCtlData
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
+/*
+ * Classification of XLogRecordInsert operations.
+ */
+typedef enum
+{
+	WALINSERT_NORMAL,
+	WALINSERT_SPECIAL_SWITCH,
+	WALINSERT_SPECIAL_CHECKPOINT
+} WalInsertClass;
+
 static XLogCtlData *XLogCtl = NULL;
 
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
@@ -739,13 +749,21 @@ XLogInsertRecord(XLogRecData *rdata,
 	bool		inserted;
 	XLogRecord *rechdr = (XLogRecord *) rdata->data;
 	uint8		info = rechdr->xl_info & ~XLR_INFO_MASK;
-	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
-							   info == XLOG_SWITCH);
+	WalInsertClass	class = WALINSERT_NORMAL;
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 	bool		prevDoPageWrites = doPageWrites;
 	TimeLineID	insertTLI;
 
+	/* Does this record type require special handling? */
+	if (rechdr->xl_rmid == RM_XLOG_ID)
+	{
+		if (info == XLOG_SWITCH)
+			class = WALINSERT_SPECIAL_SWITCH;
+		else if (XLOG_CHECKPOINT_REDO)
+			class = WALINSERT_SPECIAL_CHECKPOINT;
+	}
+
 	/* we assume that all of the record header is in the first chunk */
 	Assert(rdata->len >= SizeOfXLogRecord);
 
@@ -793,7 +811,7 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	START_CRIT_SECTION();
 
-	if (isLogSwitch)
+	if (class == WALINSERT_SPECIAL_SWITCH)
 	{
 		/*
 		 * In order to insert an XLOG_SWITCH record, we need to hold all of
@@ -804,12 +822,27 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * Nonetheless, this case is simpler than the normal cases handled
 		 * below, which must check for changes in doPageWrites and RedoRecPtr.
 		 * Those checks are only needed for records that can contain
-		 * full-pages images, and an XLOG_SWITCH record never does.
+		 * buffer references, and an XLOG_SWITCH record never does.
 		 */
 		Assert(fpw_lsn == InvalidXLogRecPtr);
 		WALInsertLockAcquireExclusive();
 		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
 	}
+	else if (class == WALINSERT_SPECIAL_CHECKPOINT)
+	{
+		/*
+		 * We need to update both the local and shared copies of RedoRecPtr, which
+		 * means that we need to hold all the WAL insertion locks. However, there
+		 * can't be any buffer references, so as above, we need not check RedoRecPtr
+		 * before inserting the record; we just need to update it afterwards.
+		 */
+		Assert(fpw_lsn == InvalidXLogRecPtr);
+		WALInsertLockAcquireExclusive();
+		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
+								  &rechdr->xl_prev);
+		RedoRecPtr = Insert->RedoRecPtr = StartPos;
+		inserted = true;
+	}
 	else
 	{
 		WALInsertLockAcquire();
@@ -876,7 +909,8 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * All the record data, including the header, is now ready to be
 		 * inserted. Copy the record in the space reserved.
 		 */
-		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
+		CopyXLogRecordToWAL(rechdr->xl_tot_len,
+							class == WALINSERT_SPECIAL_SWITCH, rdata,
 							StartPos, EndPos, insertTLI);
 
 		/*
@@ -935,7 +969,7 @@ XLogInsertRecord(XLogRecData *rdata,
 	 * padding space that fills the rest of the segment, and perform
 	 * end-of-segment actions (eg, notifying archiver).
 	 */
-	if (isLogSwitch)
+	if (class == WALINSERT_SPECIAL_SWITCH)
 	{
 		TRACE_POSTGRESQL_WAL_SWITCH();
 		XLogFlush(EndPos);
@@ -6487,6 +6521,8 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
  * All of this mechanism allows us to continue working while we checkpoint.
  * As a result, timing of actions is critical here and be careful to note that
  * this function will likely take minutes to execute on a busy system.
+ *
+ * XXX FIX ABOVE COMMENTS
  */
 void
 CreateCheckPoint(int flags)
@@ -6609,36 +6645,37 @@ CreateCheckPoint(int flags)
 
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
-	/*
-	 * Compute new REDO record ptr = location of next XLOG record.
-	 *
-	 * NB: this is NOT necessarily where the checkpoint record itself will be,
-	 * since other backends may insert more XLOG records while we're off doing
-	 * the buffer flush work.  Those XLOG records are logically after the
-	 * checkpoint, even though physically before it.  Got that?
-	 */
-	freespace = INSERT_FREESPACE(curInsert);
-	if (freespace == 0)
+	if (shutdown)
 	{
-		if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
-			curInsert += SizeOfXLogLongPHD;
-		else
-			curInsert += SizeOfXLogShortPHD;
-	}
-	checkPoint.redo = curInsert;
+		/*
+		 * Compute new REDO record ptr = location of next XLOG record.
+		 *
+		 * Since this is a shutdown checkpoint, there can't be any concurrent WAL
+		 * insertion.
+		 */
+		freespace = INSERT_FREESPACE(curInsert);
+		if (freespace == 0)
+		{
+			if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
+				curInsert += SizeOfXLogLongPHD;
+			else
+				curInsert += SizeOfXLogShortPHD;
+		}
+		checkPoint.redo = curInsert;
 
-	/*
-	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
-	 * must be done while holding all the insertion locks.
-	 *
-	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
-	 * pointing past where it really needs to point.  This is okay; the only
-	 * consequence is that XLogInsert might back up whole buffers that it
-	 * didn't really need to.  We can't postpone advancing RedoRecPtr because
-	 * XLogInserts that happen while we are dumping buffers must assume that
-	 * their buffer changes are not included in the checkpoint.
-	 */
-	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
+		/*
+		 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
+		 * must be done while holding all the insertion locks.
+		 *
+		 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
+		 * pointing past where it really needs to point.  This is okay; the only
+		 * consequence is that XLogInsert might back up whole buffers that it
+		 * didn't really need to.  We can't postpone advancing RedoRecPtr because
+		 * XLogInserts that happen while we are dumping buffers must assume that
+		 * their buffer changes are not included in the checkpoint.
+		 */
+		RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
+	}
 
 	/*
 	 * Now we can release the WAL insertion locks, allowing other xacts to
@@ -6646,6 +6683,30 @@ CreateCheckPoint(int flags)
 	 */
 	WALInsertLockRelease();
 
+	/*
+	 * If this is an online checkpoint, we have not yet determined the redo
+	 * point. We do so now by inserting the special XLOG_CHECKPOINT_REDO record;
+	 * the LSN at which it starts becomes the new redo pointer. We don't do this
+	 * for a shutdown checkpoint, because in that case no WAL can be written
+	 * between the redo point and the insertion of the checkpoint record itself,
+	 * so the checkpoint record itself services to mark the redo point.
+	 */
+	if (!shutdown)
+	{
+		int		dummy = 0;
+
+		/* Record must have payload to avoid assertion failure. */
+		XLogBeginInsert();
+		XLogRegisterData((char *) &dummy, sizeof(dummy));
+		(void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+
+		/*
+		 * XLogInsertRecord will have updated RedoRecPtr, but we need to copy that
+		 * into the record that will be inserted when the checkpoint is complete.
+		 */
+		checkPoint.redo = RedoRecPtr;
+	}
+
 	/* Update the info_lck-protected copy of RedoRecPtr as well */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->RedoRecPtr = checkPoint.redo;
-- 
2.37.1 (Apple Git-137.1)

#18Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#17)
Re: New WAL record to detect the checkpoint redo location

On Thu, Sep 21, 2023 at 7:05 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 18, 2023 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:

I've been brainstorming about this today, trying to figure out some
ideas to make it work.

Here are some patches.

0001 refactors XLogInsertRecord to unify a couple of separate tests of
isLogSwitch, hopefully making it cleaner and cheaper to add more
special cases.

0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using
it for anything.

0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO
record for any non-shutdown checkpoint, and modifies
XLogInsertRecord() to treat that as a new special case, wherein after
inserting the record the redo pointer is reset while still holding the
WAL insertion locks.

After the 0003 patch, do we need acquire exclusive lock via
WALInsertLockAcquireExclusive() for non-shutdown checkpoints. Even the
comment "We must block concurrent insertions while examining insert
state to determine the checkpoint REDO pointer." seems to indicate
that it is not required. If it is required then we may want to change
the comments and also acquiring the locks twice will have more cost
than acquiring it once and write the new WAL record under that lock.

One minor comment:
+ else if (XLOG_CHECKPOINT_REDO)
+ class = WALINSERT_SPECIAL_CHECKPOINT;
+ }

Isn't the check needs to compare the record type with info?

Your v6-0001* patch looks like an improvement to me even without the
other two patches.

BTW, I would like to mention that there is a slight interaction of
this work with the patch to upgrade/migrate slots [1]/messages/by-id/TYAPR01MB586615579356A84A8CF29A00F5F9A@TYAPR01MB5866.jpnprd01.prod.outlook.com. Basically in
[1]: /messages/by-id/TYAPR01MB586615579356A84A8CF29A00F5F9A@TYAPR01MB5866.jpnprd01.prod.outlook.com
ensure that all the WAL has been consumed by the slots before clean
shutdown. However, during upgrade we can generate few records like
checkpoint which we will ignore for the slot consistency checking as
such records doesn't matter for data consistency after upgrade. We
probably need to add this record to that list. I'll keep an eye on
both the patches so that we don't miss that interaction but mentioned
it here to make others also aware of the same.

[1]: /messages/by-id/TYAPR01MB586615579356A84A8CF29A00F5F9A@TYAPR01MB5866.jpnprd01.prod.outlook.com

--
With Regards,
Amit Kapila.

#19Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#18)
Re: New WAL record to detect the checkpoint redo location

On Thu, Sep 21, 2023 at 4:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

After the 0003 patch, do we need acquire exclusive lock via
WALInsertLockAcquireExclusive() for non-shutdown checkpoints. Even the
comment "We must block concurrent insertions while examining insert
state to determine the checkpoint REDO pointer." seems to indicate
that it is not required. If it is required then we may want to change
the comments and also acquiring the locks twice will have more cost
than acquiring it once and write the new WAL record under that lock.

I think the comment needs updating. I don't think we can do curInsert
= XLogBytePosToRecPtr(Insert->CurrBytePos) without taking the locks.
Same for Insert->fullPageWrites.

I agree that it looks a little wasteful to release the lock and then
reacquire it, but I suppose checkpoints don't happen often enough for
it to matter. You're not going to notice an extra set of insertion
lock acquisitions once every 5 minutes, or every half hour, or even
every 1 minute if your checkpoints are super-frequent.

Also notice that the current code is also quite inefficient in this
way. GetLastImportantRecPtr() acquires and releases each lock one at a
time, and then we immediately turn around and do
WALInsertLockAcquireExclusive(). If the overhead that you're concerned
about here were big enough to matter, we could reclaim what we're
losing by having a version of GetLastImportantRecPtr() that expects to
be called with all locks already held. But when I asked Andres, he
thought that it didn't matter, and I bet he's right.

One minor comment:
+ else if (XLOG_CHECKPOINT_REDO)
+ class = WALINSERT_SPECIAL_CHECKPOINT;
+ }

Isn't the check needs to compare the record type with info?

Yeah wow. That's a big mistake.

Your v6-0001* patch looks like an improvement to me even without the
other two patches.

Good to know, thanks.

BTW, I would like to mention that there is a slight interaction of
this work with the patch to upgrade/migrate slots [1]. Basically in
[1], to allow slots migration from lower to higher version, we need to
ensure that all the WAL has been consumed by the slots before clean
shutdown. However, during upgrade we can generate few records like
checkpoint which we will ignore for the slot consistency checking as
such records doesn't matter for data consistency after upgrade. We
probably need to add this record to that list. I'll keep an eye on
both the patches so that we don't miss that interaction but mentioned
it here to make others also aware of the same.

If your approach requires a code change every time someone adds a new
WAL record that doesn't modify table data, you might want to rethink
the approach a bit.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Dilip Kumar
dilipbalaut@gmail.com
In reply to: Robert Haas (#17)
Re: New WAL record to detect the checkpoint redo location

On Thu, Sep 21, 2023 at 1:50 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Sep 18, 2023 at 2:57 PM Robert Haas <robertmhaas@gmail.com> wrote:

I've been brainstorming about this today, trying to figure out some
ideas to make it work.

Here are some patches.

0001 refactors XLogInsertRecord to unify a couple of separate tests of
isLogSwitch, hopefully making it cleaner and cheaper to add more
special cases.

Yeah, this looks improvement as it removes one isLogSwitch from the code.

0002 is a very minimal patch to add XLOG_CHECKPOINT_REDO without using
it for anything.

0003 modifies CreateCheckPoint() to insert an XLOG_CHECKPOINT_REDO
record for any non-shutdown checkpoint, and modifies
XLogInsertRecord() to treat that as a new special case, wherein after
inserting the record the redo pointer is reset while still holding the
WAL insertion locks.

Yeah from a design POV, it looks fine to me because the main goal was
to insert the XLOG_CHECKPOINT_REDO record and set the "RedoRecPtr"
under the same exclusive wal insertion lock and this patch is doing
this. As you already mentioned it is an improvement over my first
patch because a) it holds the exclusive WAL insersion lock for a very
short duration b) not increasing the number of branches in
XLogInsertRecord().

Some review
1.
I feel we can reduce one more branch to the normal path by increasing
one branch in this special case i.e.

Your code is
if (class == WALINSERT_SPECIAL_SWITCH)
{
/*execute isSwitch case */
}
else if (class == WALINSERT_SPECIAL_CHECKPOINT)
{
/*execute checkpoint redo case */
}
else
{
/* common case*/
}

My suggestion
if (xl_rmid == RM_XLOG_ID)
{
if (class == WALINSERT_SPECIAL_SWITCH)
{
/*execute isSwitch case */
}
else if (class == WALINSERT_SPECIAL_CHECKPOINT)
{
/*execute checkpoint redo case */
}
}
else
{
/* common case*/
}

2.
In fact, I feel that we can remove this branch as well right? I mean
why do we need to have this separate thing called "class"? we can
very much use "info" for that purpose. right?

+ /* Does this record type require special handling? */
+ if (rechdr->xl_rmid == RM_XLOG_ID)
+ {
+ if (info == XLOG_SWITCH)
+ class = WALINSERT_SPECIAL_SWITCH;
+ else if (XLOG_CHECKPOINT_REDO)
+ class = WALINSERT_SPECIAL_CHECKPOINT;
+ }

So if we remove this then we do not have this class and the above case
would look like

if (xl_rmid == RM_XLOG_ID)
{
if (info == XLOG_SWITCH)
{
/*execute isSwitch case */
}
else if (info == XLOG_CHECKPOINT_REDO)
{
/*execute checkpoint redo case */
}
}
else
{
/* common case*/
}

3.
+ /* Does this record type require special handling? */
+ if (rechdr->xl_rmid == RM_XLOG_ID)
+ {
+ if (info == XLOG_SWITCH)
+ class = WALINSERT_SPECIAL_SWITCH;
+ else if (XLOG_CHECKPOINT_REDO)
+ class = WALINSERT_SPECIAL_CHECKPOINT;
+ }
+

the above check-in else if is wrong I mean
else if (XLOG_CHECKPOINT_REDO) should be else if (info == XLOG_CHECKPOINT_REDO)

That's all I have for now.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#21Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#19)
Re: New WAL record to detect the checkpoint redo location

On Thu, Sep 21, 2023 at 9:06 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Sep 21, 2023 at 4:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

After the 0003 patch, do we need acquire exclusive lock via
WALInsertLockAcquireExclusive() for non-shutdown checkpoints. Even the
comment "We must block concurrent insertions while examining insert
state to determine the checkpoint REDO pointer." seems to indicate
that it is not required. If it is required then we may want to change
the comments and also acquiring the locks twice will have more cost
than acquiring it once and write the new WAL record under that lock.

I think the comment needs updating. I don't think we can do curInsert
= XLogBytePosToRecPtr(Insert->CurrBytePos) without taking the locks.
Same for Insert->fullPageWrites.

If we can't do those without taking all the locks then it is fine but
just wanted to give it a try to see if there is a way to avoid in case
of online (non-shutdown) checkpoints. For example, curInsert is used
only for the shutdown path, so we don't need to acquire all locks for
it in the cases except for the shutdown case. Here, we are reading
Insert->fullPageWrites which requires an insertion lock but not all
the locks (as per comments in structure XLogCtlInsert). Now, I haven't
done detailed analysis for
XLogCtl->InsertTimeLineID/XLogCtl->PrevTimeLineID but some places
reading InsertTimeLineID have a comment like "Given that we're not in
recovery, InsertTimeLineID is set and can't change, so we can read it
without a lock." which suggests that some analysis is required whether
reading those requires all locks in this code path. OTOH, it won't
matter to acquire all locks in this code path for the reasons
mentioned by you and it may help in keeping the code simple. So, it is
up to you to take the final call on this matter. I am fine with your
decision.

BTW, I would like to mention that there is a slight interaction of
this work with the patch to upgrade/migrate slots [1]. Basically in
[1], to allow slots migration from lower to higher version, we need to
ensure that all the WAL has been consumed by the slots before clean
shutdown. However, during upgrade we can generate few records like
checkpoint which we will ignore for the slot consistency checking as
such records doesn't matter for data consistency after upgrade. We
probably need to add this record to that list. I'll keep an eye on
both the patches so that we don't miss that interaction but mentioned
it here to make others also aware of the same.

If your approach requires a code change every time someone adds a new
WAL record that doesn't modify table data, you might want to rethink
the approach a bit.

I understand your hesitation and we have discussed several approaches
that do not rely on the WAL record type to determine if the slots have
caught up but the other approaches seem to have different other
downsides. I know it may not be a good idea to discuss those here but
as there was a slight interaction with this work, so I thought to
bring it up. To be precise, we need to ensure that we ignore WAL
records that got generated during pg_upgrade operation (say during
pg_upgrade --check).

The approach we initially followed was to check if the slot's
confirmed_flush_lsn is equal to the latest checkpoint in
pg_controldata (which is the shutdown checkpoint after stopping the
server). This approach doesn't work for the use case where the user
runs pg_upgrade --check before actually performing the upgrade [1]/messages/by-id/CAA4eK1LzeZLoTLaAuadmuiggc5mq39oLY6fK95oFKiPBPBf+eQ@mail.gmail.com.
This is because during the upgrade check, the server will be
stopped/started and update the position of the latest checkpoint,
causing the check to fail in the actual upgrade and leading pg_upgrade
to believe that the slot has not been caught up.

To address the issues in the above approach, we also discussed several
alternative approaches[2]/messages/by-id/OS0PR01MB571640E1B58741979A5E586594F7A@OS0PR01MB5716.jpnprd01.prod.outlook.com[3]/messages/by-id/TYAPR01MB5866EF7398CB13FFDBF230E7F5F0A@TYAPR01MB5866.jpnprd01.prod.outlook.com: a) Adding a new field in pg_controldata
to record the last checkpoint that happens in non-upgrade mode, so
that we can compare the slot's confirmed_flush_lsn with this value.
However, we were not sure if this was a good enough reason to add a
new field in controldata field and sprinkle IsBinaryUpgrade check in
checkpointer code path. b) Advancing each slot's confirmed_flush_lsn
to the latest position if the first upgrade check passes. This way,
when performing the actual upgrade, the confirmed_flush_lsn will also
pass. However, internally advancing the LSN seems unconventional. c)
Introducing a new pg_upgrade option to skip the check for slot
catch-up so that if it is already done at the time of pg_upgrade
--check, we can avoid rechecking during actual upgrade. Although this
might work, the user would need to specify this manually, which is not
ideal. d) Document this and suggest users consume the WALs, but this
doesn't look acceptable to users.

All the above approaches have their downsides, prompting us to
consider the WAL scan approach which is to scan the end of the WAL for
records that should have been streamed out. This approach was first
proposed by Andres[4]/messages/by-id/20230725170319.h423jbthfohwgnf7@awork3.anarazel.de and was chosen[5]/messages/by-id/CAA4eK1KqqWayKtRhvyRgkhEHvAUemW_dEqgFn7UOG3D4B6f0ew@mail.gmail.com after considering all other
approaches. If we don't like relying on WAL record types then I think
the alternative (a) to add a new field in ControlDataFile is worth
considering.

[1]: /messages/by-id/CAA4eK1LzeZLoTLaAuadmuiggc5mq39oLY6fK95oFKiPBPBf+eQ@mail.gmail.com
[2]: /messages/by-id/OS0PR01MB571640E1B58741979A5E586594F7A@OS0PR01MB5716.jpnprd01.prod.outlook.com
[3]: /messages/by-id/TYAPR01MB5866EF7398CB13FFDBF230E7F5F0A@TYAPR01MB5866.jpnprd01.prod.outlook.com
[4]: /messages/by-id/20230725170319.h423jbthfohwgnf7@awork3.anarazel.de
[5]: /messages/by-id/CAA4eK1KqqWayKtRhvyRgkhEHvAUemW_dEqgFn7UOG3D4B6f0ew@mail.gmail.com

--
With Regards,
Amit Kapila.

#22Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#17)
3 attachment(s)
Re: New WAL record to detect the checkpoint redo location

On Wed, Sep 20, 2023 at 4:20 PM Robert Haas <robertmhaas@gmail.com> wrote:

Here are some patches.

Here are some updated patches. Following some off-list conversation
with Andres, I restructured 0003 to put the common case first and use
likely(), and I fixed the brown-paper-bag noted by Amit. I then turned
my attention to performance testing. I was happy to find out when I
did a bunch of testing on Friday that my branch with these patches
applied outperformed master. I was then less happy to find that when I
repeated the same tests today, master outperformed the branch. So now
I don't know what is going on, but it doesn't seem like my test
results are stable enough to draw meaningful conclusions.

I was trying to think of a test case where XLogInsertRecord would be
exercised as heavily as possible, so I really wanted to generate a lot
of WAL while doing as little real work as possible. The best idea that
I had was to run pg_create_restore_point() in a loop. Initially,
performance was dominated by the log messages which that function
emits, so I set log_min_messages='FATAL' to suppress those. To try to
further reduce other bottlenecks, I also set max_wal_size='50GB',
fsync='off', synchronous_commit='off', and wal_buffers='256MB'. Then I
ran this query:

select count(*) from (SELECT pg_create_restore_point('banana') from
generate_series(1,100000000) g) x;

I can't help laughing at the comedy of creating 100 million
banana-named restore points with no fsyncs or logging, but here we
are. All of my test runs with master, and with the patches, and with
just the first patch run in between 34 and 39 seconds. As I say, I
can't really separate out which versions are faster and slower with
any confidence. Before I fixed the brown-paper bag that Amit pointed
out, it was using WALInsertLockAcquireExclusive() instead of
WALInsertLockAcquire() for *all* WAL records, and that created an
extremely large and obvious increase in the runtime of the tests. So
I'm relatively confident that this test case is sensitive to changes
in execution time of XLogInsertRecord(), but apparently the changes
caused by rearranging the branches are a bit too marginal for them to
show up here.

One possible conclusion is that the differences here aren't actually
big enough to get stressed about, but I don't want to jump to that
conclusion without investigating the competing hypothesis that this
isn't the right way to test this, and that some better test would show
clearer results. Suggestions?

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v7-0001-Unify-two-isLogSwitch-tests-in-XLogInsertRecord.patchapplication/octet-stream; name=v7-0001-Unify-two-isLogSwitch-tests-in-XLogInsertRecord.patchDownload
From bc3c62a96e9e476f3a2095d148aca75bd51bc3a6 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 20 Sep 2023 13:48:19 -0400
Subject: [PATCH v7 1/3] Unify two isLogSwitch tests in XLogInsertRecord.

An upcoming patch wants to introduce an additional special case in this
function. To keep that as cheap as possible, minimize the amount of branching
that we do based on whether this is an XLOG_SWITCH record.
---
 src/backend/access/transam/xlog.c | 97 ++++++++++++++++++-------------
 1 file changed, 56 insertions(+), 41 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..0924afa57f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -792,57 +792,72 @@ XLogInsertRecord(XLogRecData *rdata,
 	 *----------
 	 */
 	START_CRIT_SECTION();
+
 	if (isLogSwitch)
+	{
+		/*
+		 * In order to insert an XLOG_SWITCH record, we need to hold all of
+		 * the WAL insertion locks, not just one, so that no one else can
+		 * begin inserting a record until we've figured out how much space
+		 * remains in the current WAL segment and claimed all of it.
+		 *
+		 * Nonetheless, this case is simpler than the normal cases handled
+		 * below, which must check for changes in doPageWrites and RedoRecPtr.
+		 * Those checks are only needed for records that can contain
+		 * full-pages images, and an XLOG_SWITCH record never does.
+		 */
+		Assert(fpw_lsn == InvalidXLogRecPtr);
 		WALInsertLockAcquireExclusive();
+		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
+	}
 	else
-		WALInsertLockAcquire();
-
-	/*
-	 * Check to see if my copy of RedoRecPtr is out of date. If so, may have
-	 * to go back and have the caller recompute everything. This can only
-	 * happen just after a checkpoint, so it's better to be slow in this case
-	 * and fast otherwise.
-	 *
-	 * Also check to see if fullPageWrites was just turned on or there's a
-	 * running backup (which forces full-page writes); if we weren't already
-	 * doing full-page writes then go back and recompute.
-	 *
-	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
-	 * affect the contents of the XLOG record, so we'll update our local copy
-	 * but not force a recomputation.  (If doPageWrites was just turned off,
-	 * we could recompute the record without full pages, but we choose not to
-	 * bother.)
-	 */
-	if (RedoRecPtr != Insert->RedoRecPtr)
 	{
-		Assert(RedoRecPtr < Insert->RedoRecPtr);
-		RedoRecPtr = Insert->RedoRecPtr;
-	}
-	doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
+		WALInsertLockAcquire();
 
-	if (doPageWrites &&
-		(!prevDoPageWrites ||
-		 (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
-	{
 		/*
-		 * Oops, some buffer now needs to be backed up that the caller didn't
-		 * back up.  Start over.
+		 * Check to see if my copy of RedoRecPtr is out of date. If so, may
+		 * have to go back and have the caller recompute everything. This can
+		 * only happen just after a checkpoint, so it's better to be slow in
+		 * this case and fast otherwise.
+		 *
+		 * Also check to see if fullPageWrites was just turned on or there's a
+		 * running backup (which forces full-page writes); if we weren't
+		 * already doing full-page writes then go back and recompute.
+		 *
+		 * If we aren't doing full-page writes then RedoRecPtr doesn't
+		 * actually affect the contents of the XLOG record, so we'll update
+		 * our local copy but not force a recomputation.  (If doPageWrites was
+		 * just turned off, we could recompute the record without full pages,
+		 * but we choose not to bother.)
 		 */
-		WALInsertLockRelease();
-		END_CRIT_SECTION();
-		return InvalidXLogRecPtr;
-	}
+		if (RedoRecPtr != Insert->RedoRecPtr)
+		{
+			Assert(RedoRecPtr < Insert->RedoRecPtr);
+			RedoRecPtr = Insert->RedoRecPtr;
+		}
+		doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
 
-	/*
-	 * Reserve space for the record in the WAL. This also sets the xl_prev
-	 * pointer.
-	 */
-	if (isLogSwitch)
-		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
-	else
-	{
+		if (doPageWrites &&
+			(!prevDoPageWrites ||
+			 (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
+		{
+			/*
+			 * Oops, some buffer now needs to be backed up that the caller
+			 * didn't back up.  Start over.
+			 */
+			WALInsertLockRelease();
+			END_CRIT_SECTION();
+			return InvalidXLogRecPtr;
+		}
+
+		/*
+		 * Reserve space for the record in the WAL. This also sets the xl_prev
+		 * pointer.
+		 */
 		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
 								  &rechdr->xl_prev);
+
+		/* Normal records are always inserted. */
 		inserted = true;
 	}
 
-- 
2.37.1 (Apple Git-137.1)

v7-0003-WIP-Insert-XLOG_CHECKPOINT_REDO-at-the-redo-point.patchapplication/octet-stream; name=v7-0003-WIP-Insert-XLOG_CHECKPOINT_REDO-at-the-redo-point.patchDownload
From f1be36ff8099ea2eed2460cb4de94ce13e5435dc Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 29 Sep 2023 16:43:38 -0400
Subject: [PATCH v7 3/3] WIP: Insert XLOG_CHECKPOINT_REDO at the redo point.

Merge this into previous commit.
---
 src/backend/access/transam/xlog.c | 167 +++++++++++++++++++++---------
 src/tools/pgindent/typedefs.list  |   1 +
 2 files changed, 118 insertions(+), 50 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 64bf3d7eea..12ba6cdaa6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -559,6 +559,16 @@ typedef struct XLogCtlData
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
+/*
+ * Classification of XLogRecordInsert operations.
+ */
+typedef enum
+{
+	WALINSERT_NORMAL,
+	WALINSERT_SPECIAL_SWITCH,
+	WALINSERT_SPECIAL_CHECKPOINT
+} WalInsertClass;
+
 static XLogCtlData *XLogCtl = NULL;
 
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
@@ -739,13 +749,21 @@ XLogInsertRecord(XLogRecData *rdata,
 	bool		inserted;
 	XLogRecord *rechdr = (XLogRecord *) rdata->data;
 	uint8		info = rechdr->xl_info & ~XLR_INFO_MASK;
-	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
-							   info == XLOG_SWITCH);
+	WalInsertClass class = WALINSERT_NORMAL;
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 	bool		prevDoPageWrites = doPageWrites;
 	TimeLineID	insertTLI;
 
+	/* Does this record type require special handling? */
+	if (rechdr->xl_rmid == RM_XLOG_ID)
+	{
+		if (info == XLOG_SWITCH)
+			class = WALINSERT_SPECIAL_SWITCH;
+		else if (info == XLOG_CHECKPOINT_REDO)
+			class = WALINSERT_SPECIAL_CHECKPOINT;
+	}
+
 	/* we assume that all of the record header is in the first chunk */
 	Assert(rdata->len >= SizeOfXLogRecord);
 
@@ -793,24 +811,7 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	START_CRIT_SECTION();
 
-	if (isLogSwitch)
-	{
-		/*
-		 * In order to insert an XLOG_SWITCH record, we need to hold all of
-		 * the WAL insertion locks, not just one, so that no one else can
-		 * begin inserting a record until we've figured out how much space
-		 * remains in the current WAL segment and claimed all of it.
-		 *
-		 * Nonetheless, this case is simpler than the normal cases handled
-		 * below, which must check for changes in doPageWrites and RedoRecPtr.
-		 * Those checks are only needed for records that can contain
-		 * full-pages images, and an XLOG_SWITCH record never does.
-		 */
-		Assert(fpw_lsn == InvalidXLogRecPtr);
-		WALInsertLockAcquireExclusive();
-		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
-	}
-	else
+	if (likely(class == WALINSERT_NORMAL))
 	{
 		WALInsertLockAcquire();
 
@@ -860,6 +861,41 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* Normal records are always inserted. */
 		inserted = true;
 	}
+	else if (class == WALINSERT_SPECIAL_SWITCH)
+	{
+		/*
+		 * In order to insert an XLOG_SWITCH record, we need to hold all of
+		 * the WAL insertion locks, not just one, so that no one else can
+		 * begin inserting a record until we've figured out how much space
+		 * remains in the current WAL segment and claimed all of it.
+		 *
+		 * Nonetheless, this case is simpler than the normal cases handled
+		 * below, which must check for changes in doPageWrites and RedoRecPtr.
+		 * Those checks are only needed for records that can contain buffer
+		 * references, and an XLOG_SWITCH record never does.
+		 */
+		Assert(fpw_lsn == InvalidXLogRecPtr);
+		WALInsertLockAcquireExclusive();
+		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
+	}
+	else
+	{
+		Assert(class == WALINSERT_SPECIAL_CHECKPOINT);
+
+		/*
+		 * We need to update both the local and shared copies of RedoRecPtr,
+		 * which means that we need to hold all the WAL insertion locks.
+		 * However, there can't be any buffer references, so as above, we need
+		 * not check RedoRecPtr before inserting the record; we just need to
+		 * update it afterwards.
+		 */
+		Assert(fpw_lsn == InvalidXLogRecPtr);
+		WALInsertLockAcquireExclusive();
+		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
+								  &rechdr->xl_prev);
+		RedoRecPtr = Insert->RedoRecPtr = StartPos;
+		inserted = true;
+	}
 
 	if (inserted)
 	{
@@ -876,7 +912,8 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * All the record data, including the header, is now ready to be
 		 * inserted. Copy the record in the space reserved.
 		 */
-		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
+		CopyXLogRecordToWAL(rechdr->xl_tot_len,
+							class == WALINSERT_SPECIAL_SWITCH, rdata,
 							StartPos, EndPos, insertTLI);
 
 		/*
@@ -935,7 +972,7 @@ XLogInsertRecord(XLogRecData *rdata,
 	 * padding space that fills the rest of the segment, and perform
 	 * end-of-segment actions (eg, notifying archiver).
 	 */
-	if (isLogSwitch)
+	if (class == WALINSERT_SPECIAL_SWITCH)
 	{
 		TRACE_POSTGRESQL_WAL_SWITCH();
 		XLogFlush(EndPos);
@@ -6486,6 +6523,8 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
  * All of this mechanism allows us to continue working while we checkpoint.
  * As a result, timing of actions is critical here and be careful to note that
  * this function will likely take minutes to execute on a busy system.
+ *
+ * XXX FIX ABOVE COMMENTS
  */
 void
 CreateCheckPoint(int flags)
@@ -6608,36 +6647,38 @@ CreateCheckPoint(int flags)
 
 	checkPoint.fullPageWrites = Insert->fullPageWrites;
 
-	/*
-	 * Compute new REDO record ptr = location of next XLOG record.
-	 *
-	 * NB: this is NOT necessarily where the checkpoint record itself will be,
-	 * since other backends may insert more XLOG records while we're off doing
-	 * the buffer flush work.  Those XLOG records are logically after the
-	 * checkpoint, even though physically before it.  Got that?
-	 */
-	freespace = INSERT_FREESPACE(curInsert);
-	if (freespace == 0)
+	if (shutdown)
 	{
-		if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
-			curInsert += SizeOfXLogLongPHD;
-		else
-			curInsert += SizeOfXLogShortPHD;
-	}
-	checkPoint.redo = curInsert;
+		/*
+		 * Compute new REDO record ptr = location of next XLOG record.
+		 *
+		 * Since this is a shutdown checkpoint, there can't be any concurrent
+		 * WAL insertion.
+		 */
+		freespace = INSERT_FREESPACE(curInsert);
+		if (freespace == 0)
+		{
+			if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
+				curInsert += SizeOfXLogLongPHD;
+			else
+				curInsert += SizeOfXLogShortPHD;
+		}
+		checkPoint.redo = curInsert;
 
-	/*
-	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
-	 * must be done while holding all the insertion locks.
-	 *
-	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
-	 * pointing past where it really needs to point.  This is okay; the only
-	 * consequence is that XLogInsert might back up whole buffers that it
-	 * didn't really need to.  We can't postpone advancing RedoRecPtr because
-	 * XLogInserts that happen while we are dumping buffers must assume that
-	 * their buffer changes are not included in the checkpoint.
-	 */
-	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
+		/*
+		 * Here we update the shared RedoRecPtr for future XLogInsert calls;
+		 * this must be done while holding all the insertion locks.
+		 *
+		 * Note: if we fail to complete the checkpoint, RedoRecPtr will be
+		 * left pointing past where it really needs to point.  This is okay;
+		 * the only consequence is that XLogInsert might back up whole buffers
+		 * that it didn't really need to.  We can't postpone advancing
+		 * RedoRecPtr because XLogInserts that happen while we are dumping
+		 * buffers must assume that their buffer changes are not included in
+		 * the checkpoint.
+		 */
+		RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
+	}
 
 	/*
 	 * Now we can release the WAL insertion locks, allowing other xacts to
@@ -6645,6 +6686,32 @@ CreateCheckPoint(int flags)
 	 */
 	WALInsertLockRelease();
 
+	/*
+	 * If this is an online checkpoint, we have not yet determined the redo
+	 * point. We do so now by inserting the special XLOG_CHECKPOINT_REDO
+	 * record; the LSN at which it starts becomes the new redo pointer. We
+	 * don't do this for a shutdown checkpoint, because in that case no WAL
+	 * can be written between the redo point and the insertion of the
+	 * checkpoint record itself, so the checkpoint record itself services to
+	 * mark the redo point.
+	 */
+	if (!shutdown)
+	{
+		int			dummy = 0;
+
+		/* Record must have payload to avoid assertion failure. */
+		XLogBeginInsert();
+		XLogRegisterData((char *) &dummy, sizeof(dummy));
+		(void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+
+		/*
+		 * XLogInsertRecord will have updated RedoRecPtr, but we need to copy
+		 * that into the record that will be inserted when the checkpoint is
+		 * complete.
+		 */
+		checkPoint.redo = RedoRecPtr;
+	}
+
 	/* Update the info_lck-protected copy of RedoRecPtr as well */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->RedoRecPtr = checkPoint.redo;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8de90c4958..3252b3aedd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2971,6 +2971,7 @@ VolatileFunctionStatus
 Vsrt
 WAIT_ORDER
 WALAvailability
+WalInsertClass
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
-- 
2.37.1 (Apple Git-137.1)

v7-0002-Minimally-add-XLOG_CHECKPOINT_REDO.patchapplication/octet-stream; name=v7-0002-Minimally-add-XLOG_CHECKPOINT_REDO.patchDownload
From c9469b53afc200ad3714676d046f6e8235516e48 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 20 Sep 2023 13:55:28 -0400
Subject: [PATCH v7 2/3] Minimally add XLOG_CHECKPOINT_REDO.

Not to be committed separately.

Based on a previous patch by Dilip Kumar.
---
 src/backend/access/rmgrdesc/xlogdesc.c   | 7 +++++++
 src/backend/access/transam/xlog.c        | 4 ++++
 src/backend/replication/logical/decode.c | 1 +
 src/include/catalog/pg_control.h         | 1 +
 4 files changed, 13 insertions(+)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f390c177e4..37f59bda7e 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -148,6 +148,10 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
 						 timestamptz_to_str(xlrec.overwrite_time));
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* No details to write out */
+	}
 }
 
 const char *
@@ -196,6 +200,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_REDO:
+			id = "CHECKPOINT_REDO";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0924afa57f..64bf3d7eea 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8101,6 +8101,10 @@ xlog_redo(XLogReaderState *record)
 		/* Keep track of full_page_writes */
 		lastFullPageWrites = fpw;
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* nothing to do here, just for informational purposes */
+	}
 }
 
 /*
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 730061c9da..24b712aa66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,6 +190,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_FPI_FOR_HINT:
 		case XLOG_FPI:
 		case XLOG_OVERWRITE_CONTRECORD:
+		case XLOG_CHECKPOINT_REDO:
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", info);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index dc953977c5..1136613259 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -78,6 +78,7 @@ typedef struct CheckPoint
 #define XLOG_FPI						0xB0
 /* 0xC0 is used in Postgres 9.5-11 */
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
+#define XLOG_CHECKPOINT_REDO			0xE0
 
 
 /*
-- 
2.37.1 (Apple Git-137.1)

#23Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#22)
Re: New WAL record to detect the checkpoint redo location

Hi,

On 2023-10-02 10:42:37 -0400, Robert Haas wrote:

I was trying to think of a test case where XLogInsertRecord would be
exercised as heavily as possible, so I really wanted to generate a lot
of WAL while doing as little real work as possible. The best idea that
I had was to run pg_create_restore_point() in a loop.

What I use for that is pg_logical_emit_message(). Something like

SELECT count(*)
FROM
(
SELECT pg_logical_emit_message(false, '1', 'short'), generate_series(1, 10000)
);

run via pgbench does seem to exercise that path nicely.

One possible conclusion is that the differences here aren't actually
big enough to get stressed about, but I don't want to jump to that
conclusion without investigating the competing hypothesis that this
isn't the right way to test this, and that some better test would show
clearer results. Suggestions?

I saw some small differences in runtime running pgbench with the above query,
with a single client. Comparing profiles showed a surprising degree of
difference. That turns out to mostly a consequence of the fact that
ReserveXLogInsertLocation() isn't inlined anymore, because there now are two
callers of the function in XLogInsertRecord().

Unfortunately, I still see a small performance difference after that. To get
the most reproducible numbers, I disable turbo boost, bound postgres to one
cpu core, bound pgbench to another core. Over a few runs I quite reproducibly
get ~319.323 tps with your patches applied (+ always inline), and ~324.674
with master.

If I add an unlikely around if (rechdr->xl_rmid == RM_XLOG_ID), the
performance does improve. But that "only" brings it up to 322.406. Not sure
what the rest is.

One thing that's notable, but not related to the patch, is that we waste a
fair bit of cpu time below XLogInsertRecord() with divisions. I think they're
all due to the use of UsableBytesInSegment in
XLogBytePosToRecPtr/XLogBytePosToEndRecPtr. The multiplication of
XLogSegNoOffsetToRecPtr() also shows.

Greetings,

Andres Freund

#24Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#23)
Re: New WAL record to detect the checkpoint redo location

On Thu, Oct 5, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:

If I add an unlikely around if (rechdr->xl_rmid == RM_XLOG_ID), the
performance does improve. But that "only" brings it up to 322.406. Not sure
what the rest is.

I don't really think this is worth worrying about. A sub-one-percent
regression on a highly artificial test case doesn't seem like a big
deal. Anybody less determined than you would have been unable to
measure that there even is a regression in the first place, and that's
basically everyone.

--
Robert Haas
EDB: http://www.enterprisedb.com

#25Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#23)
Re: New WAL record to detect the checkpoint redo location

On Thu, Oct 5, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:

One thing that's notable, but not related to the patch, is that we waste a
fair bit of cpu time below XLogInsertRecord() with divisions. I think they're
all due to the use of UsableBytesInSegment in
XLogBytePosToRecPtr/XLogBytePosToEndRecPtr. The multiplication of
XLogSegNoOffsetToRecPtr() also shows.

Despite what I said in my earlier email, and with a feeling like unto
that created by the proximity of the sword of Damocles or some ghostly
albatross, I spent some time reflecting on this. Some observations:

1. The reason why we're doing this multiplication and division is to
make sure that the code in ReserveXLogInsertLocation which executes
while holding insertpos_lck remains as simple and brief as possible.
We could eliminate the conversion between usable byte positions and
LSNs if we replaced Insert->{Curr,Prev}BytePos with LSNs and had
ReserveXLogInsertLocation work out by how much to advance the LSN, but
it would have to be worked out while holding insertpos_lck (or some
replacement lwlock, perhaps) and that cure seems worse than the
disease. Given that, I think we're stuck with converting between
usable bye positions and LSNs, and that intrinsically needs some
multiplication and division.

2. It seems possible to remove one branch in each of
XLogBytePosToRecPtr and XLogBytePosToEndRecPtr. Rather than testing
whether bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD, we could simply
increment bytesleft by SizeOfXLogLongPHD - SizeOfXLogShortPHD. Then
the rest of the calculations can be performed as if every page in the
segment had a header of length SizeOfXLogShortPHD, with no need to
special-case the first page. However, that doesn't get rid of any
multiplication or division, just a branch.

3. Aside from that, there seems to be no simple way to reduce the
complexity of an individual calculation, but ReserveXLogInsertLocation
does perform 3 rather similar computations, and I believe that we know
that it will always be the case that *PrevPtr < *StartPos < *EndPos.
Maybe we could have a fast-path for the case where they are all in the
same segment. We could take prevbytepos modulo UsableBytesInSegment;
call the result prevsegoff. If UsableBytesInSegment - prevsegoff >
endbytepos - prevbytepos, then all three pointers are in the same
segment, and maybe we could take advantage of that to avoid performing
the segment calculations more than once, but still needing to repeat
the page calculations. Or, instead or in addition, I think we could by
a similar technique check whether all three pointers are on the same
page; if so, then *StartPos and *EndPos can be computed from *PrevPtr
by just adding the difference between the corresponding byte
positions.

I'm not really sure whether that would come out cheaper. It's just the
only idea that I have. It did also occur to me to wonder whether the
apparent delays performing multiplication and division here were
really the result of the arithmetic itself being slow or whether they
were synchronization-related, SpinLockRelease(&Insert->insertpos_lck)
being a memory barrier just before. But I assume you thought about
that and concluded that wasn't the issue here.

--
Robert Haas
EDB: http://www.enterprisedb.com

#26Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#24)
Re: New WAL record to detect the checkpoint redo location

Hi,

On 2023-10-06 13:44:55 -0400, Robert Haas wrote:

On Thu, Oct 5, 2023 at 2:34 PM Andres Freund <andres@anarazel.de> wrote:

If I add an unlikely around if (rechdr->xl_rmid == RM_XLOG_ID), the
performance does improve. But that "only" brings it up to 322.406. Not sure
what the rest is.

I don't really think this is worth worrying about. A sub-one-percent
regression on a highly artificial test case doesn't seem like a big
deal.

I agree. I think it's worth measuring and looking at, after all the fix might
be trivial (like the case of the unlikely for the earlier if()). But it
shouldn't block progress on significant features.

I think this "issue" might be measurable in some other, not quite as artifical
cases, like INSERT ... SELECT or such. But even then it's going to be tiny.

Greetings,

Andres Freund

#27Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#25)
Re: New WAL record to detect the checkpoint redo location

Hi,

As noted in my email from a few minutes ago, I agree that optimizing this
shouldn't be a requirement for merging the patch.

On 2023-10-09 15:58:36 -0400, Robert Haas wrote:

1. The reason why we're doing this multiplication and division is to
make sure that the code in ReserveXLogInsertLocation which executes
while holding insertpos_lck remains as simple and brief as possible.
We could eliminate the conversion between usable byte positions and
LSNs if we replaced Insert->{Curr,Prev}BytePos with LSNs and had
ReserveXLogInsertLocation work out by how much to advance the LSN, but
it would have to be worked out while holding insertpos_lck (or some
replacement lwlock, perhaps) and that cure seems worse than the
disease. Given that, I think we're stuck with converting between
usable bye positions and LSNs, and that intrinsically needs some
multiplication and division.

Right, that's absolutely crucial for scalability.

2. It seems possible to remove one branch in each of
XLogBytePosToRecPtr and XLogBytePosToEndRecPtr. Rather than testing
whether bytesleft < XLOG_BLCKSZ - SizeOfXLogLongPHD, we could simply
increment bytesleft by SizeOfXLogLongPHD - SizeOfXLogShortPHD. Then
the rest of the calculations can be performed as if every page in the
segment had a header of length SizeOfXLogShortPHD, with no need to
special-case the first page. However, that doesn't get rid of any
multiplication or division, just a branch.

This reminded me about something I've been bugged by for a while: The whole
idea of short xlog page headers seems like a completely premature
optimization. The page header is a very small amount of the overall data
(long: 40/8192 ~= 0.00488, short: 24/8192 ~= 0.00292), compared to the space
we waste in many other places, including on a per-record level, it doesn't
seem worth the complexity.

3. Aside from that, there seems to be no simple way to reduce the
complexity of an individual calculation, but ReserveXLogInsertLocation
does perform 3 rather similar computations, and I believe that we know
that it will always be the case that *PrevPtr < *StartPos < *EndPos.
Maybe we could have a fast-path for the case where they are all in the
same segment. We could take prevbytepos modulo UsableBytesInSegment;
call the result prevsegoff. If UsableBytesInSegment - prevsegoff >
endbytepos - prevbytepos, then all three pointers are in the same
segment, and maybe we could take advantage of that to avoid performing
the segment calculations more than once, but still needing to repeat
the page calculations. Or, instead or in addition, I think we could by
a similar technique check whether all three pointers are on the same
page; if so, then *StartPos and *EndPos can be computed from *PrevPtr
by just adding the difference between the corresponding byte
positions.

I think we might be able to speed some of this up by pre-compute values so we
can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we
already insist on power-of-two segment sizes, so instead of needing to divide
by a runtime value, we should be able to shift by a runtime value (and the
modulo should be a mask).

I'm not really sure whether that would come out cheaper. It's just the
only idea that I have. It did also occur to me to wonder whether the
apparent delays performing multiplication and division here were
really the result of the arithmetic itself being slow or whether they
were synchronization-related, SpinLockRelease(&Insert->insertpos_lck)
being a memory barrier just before. But I assume you thought about
that and concluded that wasn't the issue here.

I did verify that they continue to be a bottleneck even after (incorrectly
obviously), removing the spinlock. It's also not too surprising, the latency
of 64bit divs is just high, particularly on intel from a few years ago (my
cascade lake workstation) and IIRC there's just a single execution port for it
too, so multiple instructions can't be fully parallelized.

https://uops.info/table.html documents a worst case latency of 89 cycles on
cascade lake, with the division broken up into 36 uops (reducing what's
available to track other in-flight instructions). It's much better on alter
lake (9 cycles and 7 uops on the perf cores, 44 cycles and 4 uops on
efficiency cores) and on zen 3+ (19 cycles, 2 uops).

Greetings,

Andres Freund

#28Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#27)
Re: New WAL record to detect the checkpoint redo location

On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <andres@anarazel.de> wrote:

I think we might be able to speed some of this up by pre-compute values so we
can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we
already insist on power-of-two segment sizes, so instead of needing to divide
by a runtime value, we should be able to shift by a runtime value (and the
modulo should be a mask).

Huh, is there a general technique for this when dividing by a
non-power-of-two? The segment size is a power of two, as is the page
size, but UsableBytesIn{Page,Segment} are some random value slightly
less than a power of two.

--
Robert Haas
EDB: http://www.enterprisedb.com

#29Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#28)
Re: New WAL record to detect the checkpoint redo location

Hi,

On 2023-10-09 18:31:11 -0400, Robert Haas wrote:

On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <andres@anarazel.de> wrote:

I think we might be able to speed some of this up by pre-compute values so we
can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we
already insist on power-of-two segment sizes, so instead of needing to divide
by a runtime value, we should be able to shift by a runtime value (and the
modulo should be a mask).

Huh, is there a general technique for this when dividing by a
non-power-of-two?

There is, but I was just having a brainfart, forgetting that UsableBytesInPage
isn't itself a power of two. The general technique is used by compilers, but
doesn't iirc lend itself well to be done at runtime.

Greetings,

Andres Freund

#30Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#27)
2 attachment(s)
Re: New WAL record to detect the checkpoint redo location

On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <andres@anarazel.de> wrote:

As noted in my email from a few minutes ago, I agree that optimizing this
shouldn't be a requirement for merging the patch.

Here's a new patch set. I think I've incorporated the performance
fixes that you've suggested so far into this version. I also adjusted
a couple of other things:

- After further study of a point previously raised by Amit, I adjusted
CreateCheckPoint slightly to call WALInsertLockAcquireExclusive
significantly later than before. I think that there's no real reason
to do it so early and that the current coding is probably just a
historical leftover, but it would be good to have some review here.

- I added a cross-check that when starting redo from a checkpoint
whose redo pointer points to an earlier LSN that the checkpoint
itself, the record we read from that LSN must an XLOG_CHECKPOINT_REDO
record.

- I combined what were previously 0002 and 0003 into a single patch,
since that's how this would get committed.

- I fixed up some comments.

- I updated commit messages.

Hopefully this is getting close to good enough.

I did verify that they continue to be a bottleneck even after (incorrectly
obviously), removing the spinlock. It's also not too surprising, the latency
of 64bit divs is just high, particularly on intel from a few years ago (my
cascade lake workstation) and IIRC there's just a single execution port for it
too, so multiple instructions can't be fully parallelized.

The chipset on my laptop is even older. Coffee Lake, I think.

I'm not really sure that there's a whole lot we can reasonably do
about the divs unless you like the fastpath idea that I proposed
earlier, or unless you want to write a patch to either get rid of
short page headers or make long and short page headers the same number
of bytes. I have to admit I'm surprised by how visible the division
overhead is in this code path -- but I'm also somewhat inclined to
view that less as evidence that division is something we should be
desperate to eliminate and more as evidence that this code path is
quite fast already. In light of your findings, it doesn't seem
completely impossible to me that the speed of integer division in this
code path could be part of what limits performance for some users, but
I'm also not sure it's all that likely or all that serious, because
we're deliberating creating test cases that insert unreasonable
amounts of WAL without doing any actual work. In the real world,
there's going to be a lot more other code running along with this code
- probably at least the executor and some heap AM code - and I bet not
all of that is as well-optimized as this is already. And it's also
quite likely for many users that the real limits on the speed of the
workload will be related to I/O or lock contention rather than CPU
cost in any form. I'm not saying it's not worth worrying about it. I'm
just saying that we should make sure the amount of worrying we do is
calibrated to the true importance of the issue.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v8-0001-Unify-two-isLogSwitch-tests-in-XLogInsertRecord.patchapplication/octet-stream; name=v8-0001-Unify-two-isLogSwitch-tests-in-XLogInsertRecord.patchDownload
From 414cabf6730936763e8f6cd2e9b5c761949eb376 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 10 Oct 2023 11:30:20 -0400
Subject: [PATCH v8 1/2] Unify two isLogSwitch tests in XLogInsertRecord.

An upcoming patch wants to introduce an additional special case in
this function. To keep that as cheap as possible, minimize the amount
of branching that we do based on whether this is an XLOG_SWITCH
record.

Additionally, and also in the interest of keeping the overhead of
special-case code paths as low as possible, apply likely() to the
non-XLOG_SWITCH case, since only a very tiny fraction of WAL records
will be XLOG_SWITCH records.

Patch by me, reviewed by Dilip Kumar, Amit Kapila, and Andres Freund.
Earlier work in this area by Dilip Kumar, reviewed by Michael Paquier.

Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
---
 src/backend/access/transam/xlog.c | 101 +++++++++++++++++-------------
 1 file changed, 58 insertions(+), 43 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fcbde10529..39aec70b62 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -792,59 +792,74 @@ XLogInsertRecord(XLogRecData *rdata,
 	 *----------
 	 */
 	START_CRIT_SECTION();
-	if (isLogSwitch)
-		WALInsertLockAcquireExclusive();
-	else
-		WALInsertLockAcquire();
 
-	/*
-	 * Check to see if my copy of RedoRecPtr is out of date. If so, may have
-	 * to go back and have the caller recompute everything. This can only
-	 * happen just after a checkpoint, so it's better to be slow in this case
-	 * and fast otherwise.
-	 *
-	 * Also check to see if fullPageWrites was just turned on or there's a
-	 * running backup (which forces full-page writes); if we weren't already
-	 * doing full-page writes then go back and recompute.
-	 *
-	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
-	 * affect the contents of the XLOG record, so we'll update our local copy
-	 * but not force a recomputation.  (If doPageWrites was just turned off,
-	 * we could recompute the record without full pages, but we choose not to
-	 * bother.)
-	 */
-	if (RedoRecPtr != Insert->RedoRecPtr)
+	if (likely(!isLogSwitch))
 	{
-		Assert(RedoRecPtr < Insert->RedoRecPtr);
-		RedoRecPtr = Insert->RedoRecPtr;
-	}
-	doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
+		WALInsertLockAcquire();
 
-	if (doPageWrites &&
-		(!prevDoPageWrites ||
-		 (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
-	{
 		/*
-		 * Oops, some buffer now needs to be backed up that the caller didn't
-		 * back up.  Start over.
+		 * Check to see if my copy of RedoRecPtr is out of date. If so, may
+		 * have to go back and have the caller recompute everything. This can
+		 * only happen just after a checkpoint, so it's better to be slow in
+		 * this case and fast otherwise.
+		 *
+		 * Also check to see if fullPageWrites was just turned on or there's a
+		 * running backup (which forces full-page writes); if we weren't
+		 * already doing full-page writes then go back and recompute.
+		 *
+		 * If we aren't doing full-page writes then RedoRecPtr doesn't
+		 * actually affect the contents of the XLOG record, so we'll update
+		 * our local copy but not force a recomputation.  (If doPageWrites was
+		 * just turned off, we could recompute the record without full pages,
+		 * but we choose not to bother.)
 		 */
-		WALInsertLockRelease();
-		END_CRIT_SECTION();
-		return InvalidXLogRecPtr;
-	}
+		if (RedoRecPtr != Insert->RedoRecPtr)
+		{
+			Assert(RedoRecPtr < Insert->RedoRecPtr);
+			RedoRecPtr = Insert->RedoRecPtr;
+		}
+		doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
 
-	/*
-	 * Reserve space for the record in the WAL. This also sets the xl_prev
-	 * pointer.
-	 */
-	if (isLogSwitch)
-		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
-	else
-	{
+		if (doPageWrites &&
+			(!prevDoPageWrites ||
+			 (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr)))
+		{
+			/*
+			 * Oops, some buffer now needs to be backed up that the caller
+			 * didn't back up.  Start over.
+			 */
+			WALInsertLockRelease();
+			END_CRIT_SECTION();
+			return InvalidXLogRecPtr;
+		}
+
+		/*
+		 * Reserve space for the record in the WAL. This also sets the xl_prev
+		 * pointer.
+		 */
 		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
 								  &rechdr->xl_prev);
+
+		/* Normal records are always inserted. */
 		inserted = true;
 	}
+	else
+	{
+		/*
+		 * In order to insert an XLOG_SWITCH record, we need to hold all of
+		 * the WAL insertion locks, not just one, so that no one else can
+		 * begin inserting a record until we've figured out how much space
+		 * remains in the current WAL segment and claimed all of it.
+		 *
+		 * Nonetheless, this case is simpler than the normal cases handled
+		 * above, which must check for changes in doPageWrites and RedoRecPtr.
+		 * Those checks are only needed for records that can contain
+		 * full-pages images, and an XLOG_SWITCH record never does.
+		 */
+		Assert(fpw_lsn == InvalidXLogRecPtr);
+		WALInsertLockAcquireExclusive();
+		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
+	}
 
 	if (inserted)
 	{
-- 
2.37.1 (Apple Git-137.1)

v8-0002-During-online-checkpoints-insert-XLOG_CHECKPOINT_.patchapplication/octet-stream; name=v8-0002-During-online-checkpoints-insert-XLOG_CHECKPOINT_.patchDownload
From 92d72a726c6d7006d694fca3fe2c4fe4db172d15 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 10 Oct 2023 13:35:02 -0400
Subject: [PATCH v8 2/2] During online checkpoints, insert XLOG_CHECKPOINT_REDO
 at redo point.

This allows tools that read the WAL sequentially to identify (possible)
redo points when they're reached, rather than only being able to
detect them in retrospect when XLOG_CHECKPOINT_ONLINE is found, possibly
much later in the WAL stream.

Any redo location that precedes the checkpoint location should now point
to an XLOG_CHECKPOINT_REDO record, so add a cross-check to verify this.

While adjusting the code in CreateCheckPoint() for this patch, I made it
call WALInsertLockAcquireExclusive a bit later than before, since there
appears to be no need for it to be held while checking whether the system
is idle, whether this is an end-of-recovery checkpoint, or what the current
timeline is.

Patch by me, based in part on earlier work from Dilip Kumar. Review by
Dilip Kumar, Amit Kapila and Andres Freund. Dilip's earlier patch was
also reviewed by Michael Paquier.

Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
---
 src/backend/access/rmgrdesc/xlogdesc.c    |   7 +
 src/backend/access/transam/xlog.c         | 192 +++++++++++++++-------
 src/backend/access/transam/xlogrecovery.c |  10 ++
 src/backend/replication/logical/decode.c  |   1 +
 src/include/catalog/pg_control.h          |   1 +
 src/tools/pgindent/typedefs.list          |   1 +
 6 files changed, 154 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f390c177e4..37f59bda7e 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -148,6 +148,10 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
 						 timestamptz_to_str(xlrec.overwrite_time));
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* No details to write out */
+	}
 }
 
 const char *
@@ -196,6 +200,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_REDO:
+			id = "CHECKPOINT_REDO";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 39aec70b62..6bffaf387c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -559,6 +559,16 @@ typedef struct XLogCtlData
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
+/*
+ * Classification of XLogRecordInsert operations.
+ */
+typedef enum
+{
+	WALINSERT_NORMAL,
+	WALINSERT_SPECIAL_SWITCH,
+	WALINSERT_SPECIAL_CHECKPOINT
+} WalInsertClass;
+
 static XLogCtlData *XLogCtl = NULL;
 
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
@@ -739,13 +749,21 @@ XLogInsertRecord(XLogRecData *rdata,
 	bool		inserted;
 	XLogRecord *rechdr = (XLogRecord *) rdata->data;
 	uint8		info = rechdr->xl_info & ~XLR_INFO_MASK;
-	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
-							   info == XLOG_SWITCH);
+	WalInsertClass class = WALINSERT_NORMAL;
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 	bool		prevDoPageWrites = doPageWrites;
 	TimeLineID	insertTLI;
 
+	/* Does this record type require special handling? */
+	if (unlikely(rechdr->xl_rmid == RM_XLOG_ID))
+	{
+		if (info == XLOG_SWITCH)
+			class = WALINSERT_SPECIAL_SWITCH;
+		else if (info == XLOG_CHECKPOINT_REDO)
+			class = WALINSERT_SPECIAL_CHECKPOINT;
+	}
+
 	/* we assume that all of the record header is in the first chunk */
 	Assert(rdata->len >= SizeOfXLogRecord);
 
@@ -793,7 +811,7 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	START_CRIT_SECTION();
 
-	if (likely(!isLogSwitch))
+	if (likely(class == WALINSERT_NORMAL))
 	{
 		WALInsertLockAcquire();
 
@@ -843,7 +861,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* Normal records are always inserted. */
 		inserted = true;
 	}
-	else
+	else if (class == WALINSERT_SPECIAL_SWITCH)
 	{
 		/*
 		 * In order to insert an XLOG_SWITCH record, we need to hold all of
@@ -852,14 +870,32 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * remains in the current WAL segment and claimed all of it.
 		 *
 		 * Nonetheless, this case is simpler than the normal cases handled
-		 * above, which must check for changes in doPageWrites and RedoRecPtr.
-		 * Those checks are only needed for records that can contain
-		 * full-pages images, and an XLOG_SWITCH record never does.
+		 * below, which must check for changes in doPageWrites and RedoRecPtr.
+		 * Those checks are only needed for records that can contain buffer
+		 * references, and an XLOG_SWITCH record never does.
 		 */
 		Assert(fpw_lsn == InvalidXLogRecPtr);
 		WALInsertLockAcquireExclusive();
 		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
 	}
+	else
+	{
+		Assert(class == WALINSERT_SPECIAL_CHECKPOINT);
+
+		/*
+		 * We need to update both the local and shared copies of RedoRecPtr,
+		 * which means that we need to hold all the WAL insertion locks.
+		 * However, there can't be any buffer references, so as above, we need
+		 * not check RedoRecPtr before inserting the record; we just need to
+		 * update it afterwards.
+		 */
+		Assert(fpw_lsn == InvalidXLogRecPtr);
+		WALInsertLockAcquireExclusive();
+		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
+								  &rechdr->xl_prev);
+		RedoRecPtr = Insert->RedoRecPtr = StartPos;
+		inserted = true;
+	}
 
 	if (inserted)
 	{
@@ -876,7 +912,8 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * All the record data, including the header, is now ready to be
 		 * inserted. Copy the record in the space reserved.
 		 */
-		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
+		CopyXLogRecordToWAL(rechdr->xl_tot_len,
+							class == WALINSERT_SPECIAL_SWITCH, rdata,
 							StartPos, EndPos, insertTLI);
 
 		/*
@@ -935,7 +972,7 @@ XLogInsertRecord(XLogRecData *rdata,
 	 * padding space that fills the rest of the segment, and perform
 	 * end-of-segment actions (eg, notifying archiver).
 	 */
-	if (isLogSwitch)
+	if (class == WALINSERT_SPECIAL_SWITCH)
 	{
 		TRACE_POSTGRESQL_WAL_SWITCH();
 		XLogFlush(EndPos);
@@ -1054,8 +1091,12 @@ XLogInsertRecord(XLogRecData *rdata,
  *
  * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
  * where we actually copy the record to the reserved space.
+ *
+ * NB: Testing shows that XLogInsertRecord runs faster if this code is inlined;
+ * however, because there are two call sites, the compiler is reluctant to
+ * inline. We use pg_attribute_always_inline here to try to convince it.
  */
-static void
+static pg_attribute_always_inline void
 ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 						  XLogRecPtr *PrevPtr)
 {
@@ -6475,17 +6516,22 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
  * In particular note that this routine is synchronous and does not pay
  * attention to CHECKPOINT_WAIT.
  *
- * If !shutdown then we are writing an online checkpoint. This is a very special
- * kind of operation and WAL record because the checkpoint action occurs over
- * a period of time yet logically occurs at just a single LSN. The logical
- * position of the WAL record (redo ptr) is the same or earlier than the
- * physical position. When we replay WAL we locate the checkpoint via its
- * physical position then read the redo ptr and actually start replay at the
- * earlier logical position. Note that we don't write *anything* to WAL at
- * the logical position, so that location could be any other kind of WAL record.
- * All of this mechanism allows us to continue working while we checkpoint.
- * As a result, timing of actions is critical here and be careful to note that
- * this function will likely take minutes to execute on a busy system.
+ * If !shutdown then we are writing an online checkpoint. An XLOG_CHECKPOINT_REDO
+ * record is inserted into WAL at the logical location of the checkpoint, before
+ * flushing anything to disk, and when the checkpoint is eventually completed,
+ * and it is from this point that WAL replay will begin in the case of a recovery
+ * from this checkpoint. Once everything is written to disk, an
+ * XLOG_CHECKPOINT_ONLINE record is written to complete the checkpoint, and
+ * points back to the earlier XLOG_CHECKPOINT_REDO record. This mechanism allows
+ * other write-ahead log records to be written while the checkpoint is in
+ * progress, but we must be very careful about order of operations. This function
+ * may take many minutes to execute on a busy system.
+ *
+ * On the other hand, when shutdown is true, concurrent insertion into the
+ * write-ahead log is impossible, so there is no need for two separate records.
+ * In this case, we only insert an XLOG_CHECKPOINT_SHUTDOWN record, and it's
+ * both the record marking the completion of the checkpoint and the location
+ * from which WAL replay would begin if needed.
  */
 void
 CreateCheckPoint(int flags)
@@ -6497,7 +6543,6 @@ CreateCheckPoint(int flags)
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
-	XLogRecPtr	curInsert;
 	XLogRecPtr	last_important_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
@@ -6567,13 +6612,6 @@ CreateCheckPoint(int flags)
 	 */
 	last_important_lsn = GetLastImportantRecPtr();
 
-	/*
-	 * We must block concurrent insertions while examining insert state to
-	 * determine the checkpoint REDO pointer.
-	 */
-	WALInsertLockAcquireExclusive();
-	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-
 	/*
 	 * If this isn't a shutdown or forced checkpoint, and if there has been no
 	 * WAL activity requiring a checkpoint, skip it.  The idea here is to
@@ -6584,7 +6622,6 @@ CreateCheckPoint(int flags)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
-			WALInsertLockRelease();
 			END_CRIT_SECTION();
 			ereport(DEBUG1,
 					(errmsg_internal("checkpoint skipped because system is idle")));
@@ -6606,38 +6643,47 @@ CreateCheckPoint(int flags)
 	else
 		checkPoint.PrevTimeLineID = checkPoint.ThisTimeLineID;
 
-	checkPoint.fullPageWrites = Insert->fullPageWrites;
-
 	/*
-	 * Compute new REDO record ptr = location of next XLOG record.
-	 *
-	 * NB: this is NOT necessarily where the checkpoint record itself will be,
-	 * since other backends may insert more XLOG records while we're off doing
-	 * the buffer flush work.  Those XLOG records are logically after the
-	 * checkpoint, even though physically before it.  Got that?
+	 * We must block concurrent insertions while examining insert state.
 	 */
-	freespace = INSERT_FREESPACE(curInsert);
-	if (freespace == 0)
+	WALInsertLockAcquireExclusive();
+
+	checkPoint.fullPageWrites = Insert->fullPageWrites;
+
+	if (shutdown)
 	{
-		if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
-			curInsert += SizeOfXLogLongPHD;
-		else
-			curInsert += SizeOfXLogShortPHD;
-	}
-	checkPoint.redo = curInsert;
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
 
-	/*
-	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
-	 * must be done while holding all the insertion locks.
-	 *
-	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
-	 * pointing past where it really needs to point.  This is okay; the only
-	 * consequence is that XLogInsert might back up whole buffers that it
-	 * didn't really need to.  We can't postpone advancing RedoRecPtr because
-	 * XLogInserts that happen while we are dumping buffers must assume that
-	 * their buffer changes are not included in the checkpoint.
-	 */
-	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
+		/*
+		 * Compute new REDO record ptr = location of next XLOG record.
+		 *
+		 * Since this is a shutdown checkpoint, there can't be any concurrent
+		 * WAL insertion.
+		 */
+		freespace = INSERT_FREESPACE(curInsert);
+		if (freespace == 0)
+		{
+			if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
+				curInsert += SizeOfXLogLongPHD;
+			else
+				curInsert += SizeOfXLogShortPHD;
+		}
+		checkPoint.redo = curInsert;
+
+		/*
+		 * Here we update the shared RedoRecPtr for future XLogInsert calls;
+		 * this must be done while holding all the insertion locks.
+		 *
+		 * Note: if we fail to complete the checkpoint, RedoRecPtr will be
+		 * left pointing past where it really needs to point.  This is okay;
+		 * the only consequence is that XLogInsert might back up whole buffers
+		 * that it didn't really need to.  We can't postpone advancing
+		 * RedoRecPtr because XLogInserts that happen while we are dumping
+		 * buffers must assume that their buffer changes are not included in
+		 * the checkpoint.
+		 */
+		RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
+	}
 
 	/*
 	 * Now we can release the WAL insertion locks, allowing other xacts to
@@ -6645,6 +6691,32 @@ CreateCheckPoint(int flags)
 	 */
 	WALInsertLockRelease();
 
+	/*
+	 * If this is an online checkpoint, we have not yet determined the redo
+	 * point. We do so now by inserting the special XLOG_CHECKPOINT_REDO
+	 * record; the LSN at which it starts becomes the new redo pointer. We
+	 * don't do this for a shutdown checkpoint, because in that case no WAL
+	 * can be written between the redo point and the insertion of the
+	 * checkpoint record itself, so the checkpoint record itself serves to
+	 * mark the redo point.
+	 */
+	if (!shutdown)
+	{
+		int			dummy = 0;
+
+		/* Record must have payload to avoid assertion failure. */
+		XLogBeginInsert();
+		XLogRegisterData((char *) &dummy, sizeof(dummy));
+		(void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+
+		/*
+		 * XLogInsertRecord will have updated RedoRecPtr, but we need to copy
+		 * that into the record that will be inserted when the checkpoint is
+		 * complete.
+		 */
+		checkPoint.redo = RedoRecPtr;
+	}
+
 	/* Update the info_lck-protected copy of RedoRecPtr as well */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->RedoRecPtr = checkPoint.redo;
@@ -8101,6 +8173,10 @@ xlog_redo(XLogReaderState *record)
 		/* Keep track of full_page_writes */
 		lastFullPageWrites = fpw;
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* nothing to do here, just for informational purposes */
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..28003c852c 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1638,6 +1638,16 @@ PerformWalRecovery(void)
 		replayTLI = RedoStartTLI;
 		XLogPrefetcherBeginRead(xlogprefetcher, RedoStartLSN);
 		record = ReadRecord(xlogprefetcher, PANIC, false, replayTLI);
+
+		/*
+		 * If a checkpoint record's redo pointer points back to an earlier LSN,
+		 * the record at that LSN should be an XLOG_CHECKPOINT_REDO record.
+		 */
+		if (record->xl_rmid != RM_XLOG_ID ||
+			(record->xl_info & ~XLR_INFO_MASK) != XLOG_CHECKPOINT_REDO)
+			ereport(FATAL,
+					(errmsg("unexpected record type found at redo point %X/%X",
+							LSN_FORMAT_ARGS(xlogreader->ReadRecPtr))));
 	}
 	else
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 730061c9da..24b712aa66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,6 +190,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_FPI_FOR_HINT:
 		case XLOG_FPI:
 		case XLOG_OVERWRITE_CONTRECORD:
+		case XLOG_CHECKPOINT_REDO:
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", info);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index dc953977c5..1136613259 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -78,6 +78,7 @@ typedef struct CheckPoint
 #define XLOG_FPI						0xB0
 /* 0xC0 is used in Postgres 9.5-11 */
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
+#define XLOG_CHECKPOINT_REDO			0xE0
 
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8de90c4958..3252b3aedd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2971,6 +2971,7 @@ VolatileFunctionStatus
 Vsrt
 WAIT_ORDER
 WALAvailability
+WalInsertClass
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
-- 
2.37.1 (Apple Git-137.1)

#31Thomas Munro
thomas.munro@gmail.com
In reply to: Robert Haas (#28)
Re: New WAL record to detect the checkpoint redo location

On Tue, Oct 10, 2023 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Oct 9, 2023 at 4:47 PM Andres Freund <andres@anarazel.de> wrote:

I think we might be able to speed some of this up by pre-compute values so we
can implement things like bytesleft / UsableBytesInPage with shifts. IIRC we
already insist on power-of-two segment sizes, so instead of needing to divide
by a runtime value, we should be able to shift by a runtime value (and the
modulo should be a mask).

Huh, is there a general technique for this when dividing by a
non-power-of-two? The segment size is a power of two, as is the page
size, but UsableBytesIn{Page,Segment} are some random value slightly
less than a power of two.

BTW in case someone is interested, Hacker's Delight (a book that has
come up on this list a few times before) devotes a couple of chapters
of magical incantations to this topic. Compilers know that magic, and
one thought I had when I first saw this discussion was that we could
specialise the code for the permissible wal segment sizes. But nuking
the variable sized page headers sounds better.

#32Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#30)
Re: New WAL record to detect the checkpoint redo location

On Tue, Oct 10, 2023 at 02:43:34PM -0400, Robert Haas wrote:

- I combined what were previously 0002 and 0003 into a single patch,
since that's how this would get committed.

- I fixed up some comments.

- I updated commit messages.

Hopefully this is getting close to good enough.

I have looked at 0001, for now.. And it looks OK to me.

+        * Nonetheless, this case is simpler than the normal cases handled
+        * above, which must check for changes in doPageWrites and RedoRecPtr.
+        * Those checks are only needed for records that can contain
+        * full-pages images, and an XLOG_SWITCH record never does.
+        Assert(fpw_lsn == InvalidXLogRecPtr);

Right, that's the core reason behind the refactoring. The assertion
is a good idea.
--
Michael

#33Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#32)
Re: New WAL record to detect the checkpoint redo location

On Thu, Oct 12, 2023 at 3:27 AM Michael Paquier <michael@paquier.xyz> wrote:

I have looked at 0001, for now.. And it looks OK to me.

Cool. I've committed that one. Thanks for the review.

--
Robert Haas
EDB: http://www.enterprisedb.com

#34Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#30)
Re: New WAL record to detect the checkpoint redo location

On Tue, Oct 10, 2023 at 02:43:34PM -0400, Robert Haas wrote:

Here's a new patch set. I think I've incorporated the performance
fixes that you've suggested so far into this version. I also adjusted
a couple of other things:

Now looking at 0002, where you should be careful about the code
indentation or koel will complain.

- After further study of a point previously raised by Amit, I adjusted
CreateCheckPoint slightly to call WALInsertLockAcquireExclusive
significantly later than before. I think that there's no real reason
to do it so early and that the current coding is probably just a
historical leftover, but it would be good to have some review here.

This makes the new code call LocalSetXLogInsertAllowed() and what we
set for checkPoint.PrevTimeLineID after taking the insertion locks,
which should be OK.

- I added a cross-check that when starting redo from a checkpoint
whose redo pointer points to an earlier LSN that the checkpoint
itself, the record we read from that LSN must an XLOG_CHECKPOINT_REDO
record.

I've mentioned as well a test in pg_walinspect after one of the
checkpoints generated there, but what you do here is enough for the
online case.

+       /*
+        * XLogInsertRecord will have updated RedoRecPtr, but we need to copy
+        * that into the record that will be inserted when the checkpoint is
+        * complete.
+        */
+       checkPoint.redo = RedoRecPtr;

For online checkpoints, a very important point is that
XLogCtl->Insert.RedoRecPtr is also updated in XLogInsertRecord().
Perhaps that's worth an addition? I was a bit confused first that we
do the following for shutdown checkpoints:
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

Then repeat this pattern for non-shutdown checkpoints a few lines down
without touching the copy of the redo LSN in XLogCtl->Insert, because
of course we don't hold the WAL insert locks in an exclusive fashion
here:
checkPoint.redo = RedoRecPtr;

My point is that this is not only about RedoRecPtr, but also about
XLogCtl->Insert.RedoRecPtr here. The comment in ReserveXLogSwitch()
says that.
--
Michael

#35Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#34)
1 attachment(s)
Re: New WAL record to detect the checkpoint redo location

On Fri, Oct 13, 2023 at 3:29 AM Michael Paquier <michael@paquier.xyz> wrote:

Now looking at 0002, where you should be careful about the code
indentation or koel will complain.

Fixed in the attached version.

This makes the new code call LocalSetXLogInsertAllowed() and what we
set for checkPoint.PrevTimeLineID after taking the insertion locks,
which should be OK.

Cool.

I've mentioned as well a test in pg_walinspect after one of the
checkpoints generated there, but what you do here is enough for the
online case.

I don't quite understand what you're saying here. If you're suggesting
a potential improvement, can you be a bit more clear and explicit
about what the suggestion is?

+       /*
+        * XLogInsertRecord will have updated RedoRecPtr, but we need to copy
+        * that into the record that will be inserted when the checkpoint is
+        * complete.
+        */
+       checkPoint.redo = RedoRecPtr;

For online checkpoints, a very important point is that
XLogCtl->Insert.RedoRecPtr is also updated in XLogInsertRecord().
Perhaps that's worth an addition? I was a bit confused first that we
do the following for shutdown checkpoints:
RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

Then repeat this pattern for non-shutdown checkpoints a few lines down
without touching the copy of the redo LSN in XLogCtl->Insert, because
of course we don't hold the WAL insert locks in an exclusive fashion
here:
checkPoint.redo = RedoRecPtr;

My point is that this is not only about RedoRecPtr, but also about
XLogCtl->Insert.RedoRecPtr here. The comment in ReserveXLogSwitch()
says that.

I have adjusted the comment in CreateCheckPoint to hopefully address
this concern. I don't understand what you mean about
ReserveXLogSwitch(), though.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v9-0001-During-online-checkpoints-insert-XLOG_CHECKPOINT_.patchapplication/octet-stream; name=v9-0001-During-online-checkpoints-insert-XLOG_CHECKPOINT_.patchDownload
From cd54e492f19fe9384850e73ab4ee3b3e4881bb58 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Tue, 17 Oct 2023 12:27:03 -0400
Subject: [PATCH v9] During online checkpoints, insert XLOG_CHECKPOINT_REDO at
 redo point.

This allows tools that read the WAL sequentially to identify (possible)
redo points when they're reached, rather than only being able to
detect them in retrospect when XLOG_CHECKPOINT_ONLINE is found, possibly
much later in the WAL stream.

Any redo location that precedes the checkpoint location should now point
to an XLOG_CHECKPOINT_REDO record, so add a cross-check to verify this.

While adjusting the code in CreateCheckPoint() for this patch, I made it
call WALInsertLockAcquireExclusive a bit later than before, since there
appears to be no need for it to be held while checking whether the system
is idle, whether this is an end-of-recovery checkpoint, or what the current
timeline is.

Patch by me, based in part on earlier work from Dilip Kumar. Review by
Dilip Kumar, Amit Kapila and Andres Freund. Dilip's earlier patch was
also reviewed by Michael Paquier.

Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
---
 src/backend/access/rmgrdesc/xlogdesc.c    |   7 +
 src/backend/access/transam/xlog.c         | 193 +++++++++++++++-------
 src/backend/access/transam/xlogrecovery.c |  11 ++
 src/backend/replication/logical/decode.c  |   1 +
 src/include/catalog/pg_control.h          |   1 +
 src/tools/pgindent/typedefs.list          |   1 +
 6 files changed, 156 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f390c177e4..37f59bda7e 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -148,6 +148,10 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
 						 timestamptz_to_str(xlrec.overwrite_time));
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* No details to write out */
+	}
 }
 
 const char *
@@ -196,6 +200,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_REDO:
+			id = "CHECKPOINT_REDO";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c0e4ca5089..cea13e3d58 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -559,6 +559,16 @@ typedef struct XLogCtlData
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
+/*
+ * Classification of XLogRecordInsert operations.
+ */
+typedef enum
+{
+	WALINSERT_NORMAL,
+	WALINSERT_SPECIAL_SWITCH,
+	WALINSERT_SPECIAL_CHECKPOINT
+} WalInsertClass;
+
 static XLogCtlData *XLogCtl = NULL;
 
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
@@ -739,13 +749,21 @@ XLogInsertRecord(XLogRecData *rdata,
 	bool		inserted;
 	XLogRecord *rechdr = (XLogRecord *) rdata->data;
 	uint8		info = rechdr->xl_info & ~XLR_INFO_MASK;
-	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
-							   info == XLOG_SWITCH);
+	WalInsertClass class = WALINSERT_NORMAL;
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 	bool		prevDoPageWrites = doPageWrites;
 	TimeLineID	insertTLI;
 
+	/* Does this record type require special handling? */
+	if (unlikely(rechdr->xl_rmid == RM_XLOG_ID))
+	{
+		if (info == XLOG_SWITCH)
+			class = WALINSERT_SPECIAL_SWITCH;
+		else if (info == XLOG_CHECKPOINT_REDO)
+			class = WALINSERT_SPECIAL_CHECKPOINT;
+	}
+
 	/* we assume that all of the record header is in the first chunk */
 	Assert(rdata->len >= SizeOfXLogRecord);
 
@@ -793,7 +811,7 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	START_CRIT_SECTION();
 
-	if (likely(!isLogSwitch))
+	if (likely(class == WALINSERT_NORMAL))
 	{
 		WALInsertLockAcquire();
 
@@ -843,7 +861,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* Normal records are always inserted. */
 		inserted = true;
 	}
-	else
+	else if (class == WALINSERT_SPECIAL_SWITCH)
 	{
 		/*
 		 * In order to insert an XLOG_SWITCH record, we need to hold all of
@@ -852,14 +870,32 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * remains in the current WAL segment and claimed all of it.
 		 *
 		 * Nonetheless, this case is simpler than the normal cases handled
-		 * above, which must check for changes in doPageWrites and RedoRecPtr.
-		 * Those checks are only needed for records that can contain
-		 * full-pages images, and an XLOG_SWITCH record never does.
+		 * below, which must check for changes in doPageWrites and RedoRecPtr.
+		 * Those checks are only needed for records that can contain buffer
+		 * references, and an XLOG_SWITCH record never does.
 		 */
 		Assert(fpw_lsn == InvalidXLogRecPtr);
 		WALInsertLockAcquireExclusive();
 		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
 	}
+	else
+	{
+		Assert(class == WALINSERT_SPECIAL_CHECKPOINT);
+
+		/*
+		 * We need to update both the local and shared copies of RedoRecPtr,
+		 * which means that we need to hold all the WAL insertion locks.
+		 * However, there can't be any buffer references, so as above, we need
+		 * not check RedoRecPtr before inserting the record; we just need to
+		 * update it afterwards.
+		 */
+		Assert(fpw_lsn == InvalidXLogRecPtr);
+		WALInsertLockAcquireExclusive();
+		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
+								  &rechdr->xl_prev);
+		RedoRecPtr = Insert->RedoRecPtr = StartPos;
+		inserted = true;
+	}
 
 	if (inserted)
 	{
@@ -876,7 +912,8 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * All the record data, including the header, is now ready to be
 		 * inserted. Copy the record in the space reserved.
 		 */
-		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
+		CopyXLogRecordToWAL(rechdr->xl_tot_len,
+							class == WALINSERT_SPECIAL_SWITCH, rdata,
 							StartPos, EndPos, insertTLI);
 
 		/*
@@ -935,7 +972,7 @@ XLogInsertRecord(XLogRecData *rdata,
 	 * padding space that fills the rest of the segment, and perform
 	 * end-of-segment actions (eg, notifying archiver).
 	 */
-	if (isLogSwitch)
+	if (class == WALINSERT_SPECIAL_SWITCH)
 	{
 		TRACE_POSTGRESQL_WAL_SWITCH();
 		XLogFlush(EndPos);
@@ -1054,8 +1091,12 @@ XLogInsertRecord(XLogRecData *rdata,
  *
  * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
  * where we actually copy the record to the reserved space.
+ *
+ * NB: Testing shows that XLogInsertRecord runs faster if this code is inlined;
+ * however, because there are two call sites, the compiler is reluctant to
+ * inline. We use pg_attribute_always_inline here to try to convince it.
  */
-static void
+static pg_attribute_always_inline void
 ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 						  XLogRecPtr *PrevPtr)
 {
@@ -6475,17 +6516,22 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
  * In particular note that this routine is synchronous and does not pay
  * attention to CHECKPOINT_WAIT.
  *
- * If !shutdown then we are writing an online checkpoint. This is a very special
- * kind of operation and WAL record because the checkpoint action occurs over
- * a period of time yet logically occurs at just a single LSN. The logical
- * position of the WAL record (redo ptr) is the same or earlier than the
- * physical position. When we replay WAL we locate the checkpoint via its
- * physical position then read the redo ptr and actually start replay at the
- * earlier logical position. Note that we don't write *anything* to WAL at
- * the logical position, so that location could be any other kind of WAL record.
- * All of this mechanism allows us to continue working while we checkpoint.
- * As a result, timing of actions is critical here and be careful to note that
- * this function will likely take minutes to execute on a busy system.
+ * If !shutdown then we are writing an online checkpoint. An XLOG_CHECKPOINT_REDO
+ * record is inserted into WAL at the logical location of the checkpoint, before
+ * flushing anything to disk, and when the checkpoint is eventually completed,
+ * and it is from this point that WAL replay will begin in the case of a recovery
+ * from this checkpoint. Once everything is written to disk, an
+ * XLOG_CHECKPOINT_ONLINE record is written to complete the checkpoint, and
+ * points back to the earlier XLOG_CHECKPOINT_REDO record. This mechanism allows
+ * other write-ahead log records to be written while the checkpoint is in
+ * progress, but we must be very careful about order of operations. This function
+ * may take many minutes to execute on a busy system.
+ *
+ * On the other hand, when shutdown is true, concurrent insertion into the
+ * write-ahead log is impossible, so there is no need for two separate records.
+ * In this case, we only insert an XLOG_CHECKPOINT_SHUTDOWN record, and it's
+ * both the record marking the completion of the checkpoint and the location
+ * from which WAL replay would begin if needed.
  */
 void
 CreateCheckPoint(int flags)
@@ -6497,7 +6543,6 @@ CreateCheckPoint(int flags)
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
-	XLogRecPtr	curInsert;
 	XLogRecPtr	last_important_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
@@ -6567,13 +6612,6 @@ CreateCheckPoint(int flags)
 	 */
 	last_important_lsn = GetLastImportantRecPtr();
 
-	/*
-	 * We must block concurrent insertions while examining insert state to
-	 * determine the checkpoint REDO pointer.
-	 */
-	WALInsertLockAcquireExclusive();
-	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-
 	/*
 	 * If this isn't a shutdown or forced checkpoint, and if there has been no
 	 * WAL activity requiring a checkpoint, skip it.  The idea here is to
@@ -6584,7 +6622,6 @@ CreateCheckPoint(int flags)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
-			WALInsertLockRelease();
 			END_CRIT_SECTION();
 			ereport(DEBUG1,
 					(errmsg_internal("checkpoint skipped because system is idle")));
@@ -6606,38 +6643,47 @@ CreateCheckPoint(int flags)
 	else
 		checkPoint.PrevTimeLineID = checkPoint.ThisTimeLineID;
 
-	checkPoint.fullPageWrites = Insert->fullPageWrites;
-
 	/*
-	 * Compute new REDO record ptr = location of next XLOG record.
-	 *
-	 * NB: this is NOT necessarily where the checkpoint record itself will be,
-	 * since other backends may insert more XLOG records while we're off doing
-	 * the buffer flush work.  Those XLOG records are logically after the
-	 * checkpoint, even though physically before it.  Got that?
+	 * We must block concurrent insertions while examining insert state.
 	 */
-	freespace = INSERT_FREESPACE(curInsert);
-	if (freespace == 0)
+	WALInsertLockAcquireExclusive();
+
+	checkPoint.fullPageWrites = Insert->fullPageWrites;
+
+	if (shutdown)
 	{
-		if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
-			curInsert += SizeOfXLogLongPHD;
-		else
-			curInsert += SizeOfXLogShortPHD;
-	}
-	checkPoint.redo = curInsert;
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
 
-	/*
-	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
-	 * must be done while holding all the insertion locks.
-	 *
-	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
-	 * pointing past where it really needs to point.  This is okay; the only
-	 * consequence is that XLogInsert might back up whole buffers that it
-	 * didn't really need to.  We can't postpone advancing RedoRecPtr because
-	 * XLogInserts that happen while we are dumping buffers must assume that
-	 * their buffer changes are not included in the checkpoint.
-	 */
-	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
+		/*
+		 * Compute new REDO record ptr = location of next XLOG record.
+		 *
+		 * Since this is a shutdown checkpoint, there can't be any concurrent
+		 * WAL insertion.
+		 */
+		freespace = INSERT_FREESPACE(curInsert);
+		if (freespace == 0)
+		{
+			if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
+				curInsert += SizeOfXLogLongPHD;
+			else
+				curInsert += SizeOfXLogShortPHD;
+		}
+		checkPoint.redo = curInsert;
+
+		/*
+		 * Here we update the shared RedoRecPtr for future XLogInsert calls;
+		 * this must be done while holding all the insertion locks.
+		 *
+		 * Note: if we fail to complete the checkpoint, RedoRecPtr will be
+		 * left pointing past where it really needs to point.  This is okay;
+		 * the only consequence is that XLogInsert might back up whole buffers
+		 * that it didn't really need to.  We can't postpone advancing
+		 * RedoRecPtr because XLogInserts that happen while we are dumping
+		 * buffers must assume that their buffer changes are not included in
+		 * the checkpoint.
+		 */
+		RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
+	}
 
 	/*
 	 * Now we can release the WAL insertion locks, allowing other xacts to
@@ -6645,6 +6691,33 @@ CreateCheckPoint(int flags)
 	 */
 	WALInsertLockRelease();
 
+	/*
+	 * If this is an online checkpoint, we have not yet determined the redo
+	 * point. We do so now by inserting the special XLOG_CHECKPOINT_REDO
+	 * record; the LSN at which it starts becomes the new redo pointer. We
+	 * don't do this for a shutdown checkpoint, because in that case no WAL
+	 * can be written between the redo point and the insertion of the
+	 * checkpoint record itself, so the checkpoint record itself serves to
+	 * mark the redo point.
+	 */
+	if (!shutdown)
+	{
+		int			dummy = 0;
+
+		/* Record must have payload to avoid assertion failure. */
+		XLogBeginInsert();
+		XLogRegisterData((char *) &dummy, sizeof(dummy));
+		(void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+
+		/*
+		 * XLogInsertRecord will have updated XLogCtl->Insert.RedoRecPtr in
+		 * shared memory and RedoRecPtr in backend-local memory, but we need
+		 * to copy that into the record that will be inserted when the
+		 * checkpoint is complete.
+		 */
+		checkPoint.redo = RedoRecPtr;
+	}
+
 	/* Update the info_lck-protected copy of RedoRecPtr as well */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->RedoRecPtr = checkPoint.redo;
@@ -8105,6 +8178,10 @@ xlog_redo(XLogReaderState *record)
 		/* Keep track of full_page_writes */
 		lastFullPageWrites = fpw;
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* nothing to do here, just for informational purposes */
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..d6f2bb8286 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1638,6 +1638,17 @@ PerformWalRecovery(void)
 		replayTLI = RedoStartTLI;
 		XLogPrefetcherBeginRead(xlogprefetcher, RedoStartLSN);
 		record = ReadRecord(xlogprefetcher, PANIC, false, replayTLI);
+
+		/*
+		 * If a checkpoint record's redo pointer points back to an earlier
+		 * LSN, the record at that LSN should be an XLOG_CHECKPOINT_REDO
+		 * record.
+		 */
+		if (record->xl_rmid != RM_XLOG_ID ||
+			(record->xl_info & ~XLR_INFO_MASK) != XLOG_CHECKPOINT_REDO)
+			ereport(FATAL,
+					(errmsg("unexpected record type found at redo point %X/%X",
+							LSN_FORMAT_ARGS(xlogreader->ReadRecPtr))));
 	}
 	else
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 730061c9da..24b712aa66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,6 +190,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_FPI_FOR_HINT:
 		case XLOG_FPI:
 		case XLOG_OVERWRITE_CONTRECORD:
+		case XLOG_CHECKPOINT_REDO:
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", info);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index dc953977c5..1136613259 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -78,6 +78,7 @@ typedef struct CheckPoint
 #define XLOG_FPI						0xB0
 /* 0xC0 is used in Postgres 9.5-11 */
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
+#define XLOG_CHECKPOINT_REDO			0xE0
 
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e69bb671bf..06b25617bc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2971,6 +2971,7 @@ VolatileFunctionStatus
 Vsrt
 WAIT_ORDER
 WALAvailability
+WalInsertClass
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
-- 
2.37.1 (Apple Git-137.1)

#36Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#35)
Re: New WAL record to detect the checkpoint redo location

On Tue, Oct 17, 2023 at 12:45:52PM -0400, Robert Haas wrote:

On Fri, Oct 13, 2023 at 3:29 AM Michael Paquier <michael@paquier.xyz> wrote:

I've mentioned as well a test in pg_walinspect after one of the
checkpoints generated there, but what you do here is enough for the
online case.

I don't quite understand what you're saying here. If you're suggesting
a potential improvement, can you be a bit more clear and explicit
about what the suggestion is?

Suggestion is from here, with a test for pg_walinspect after it runs
its online checkpoint (see the full-page case):
/messages/by-id/ZOvf1tu6rfL/B2PW@paquier.xyz

+-- Check presence of REDO record.
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
+  FROM pg_get_wal_record_info(:'redo_lsn');

Then repeat this pattern for non-shutdown checkpoints a few lines down
without touching the copy of the redo LSN in XLogCtl->Insert, because
of course we don't hold the WAL insert locks in an exclusive fashion
here:
checkPoint.redo = RedoRecPtr;

My point is that this is not only about RedoRecPtr, but also about
XLogCtl->Insert.RedoRecPtr here. The comment in ReserveXLogSwitch()
says that.

I have adjusted the comment in CreateCheckPoint to hopefully address
this concern.

-        * XLogInsertRecord will have updated RedoRecPtr, but we need to copy
-        * that into the record that will be inserted when the checkpoint is
-        * complete.
+        * XLogInsertRecord will have updated XLogCtl->Insert.RedoRecPtr in
+        * shared memory and RedoRecPtr in backend-local memory, but we need
+        * to copy that into the record that will be inserted when the
+        * checkpoint is complete. 

This comment diff between v8 and v9 looks OK to me. Thanks.

I don't understand what you mean about
ReserveXLogSwitch(), though.

I am not sure either, looking back at that :p
--
Michael

#37Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#36)
1 attachment(s)
Re: New WAL record to detect the checkpoint redo location

On Tue, Oct 17, 2023 at 8:35 PM Michael Paquier <michael@paquier.xyz> wrote:

Suggestion is from here, with a test for pg_walinspect after it runs
its online checkpoint (see the full-page case):
/messages/by-id/ZOvf1tu6rfL/B2PW@paquier.xyz

+-- Check presence of REDO record.
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, record_type
+  FROM pg_get_wal_record_info(:'redo_lsn');

I added a variant of this test case. Here's v10.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v10-0001-During-online-checkpoints-insert-XLOG_CHECKPOINT.patchapplication/octet-stream; name=v10-0001-During-online-checkpoints-insert-XLOG_CHECKPOINT.patchDownload
From a8827253d3101d8234f37cc13b9d59bda8fd2c28 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 18 Oct 2023 10:18:44 -0400
Subject: [PATCH v10] During online checkpoints, insert XLOG_CHECKPOINT_REDO at
 redo point.

This allows tools that read the WAL sequentially to identify (possible)
redo points when they're reached, rather than only being able to
detect them in retrospect when XLOG_CHECKPOINT_ONLINE is found, possibly
much later in the WAL stream.

Any redo location that precedes the checkpoint location should now point
to an XLOG_CHECKPOINT_REDO record, so add a cross-check to verify this.

While adjusting the code in CreateCheckPoint() for this patch, I made it
call WALInsertLockAcquireExclusive a bit later than before, since there
appears to be no need for it to be held while checking whether the system
is idle, whether this is an end-of-recovery checkpoint, or what the current
timeline is.

Patch by me, based in part on earlier work from Dilip Kumar. Review by
Dilip Kumar, Amit Kapila, Andres Freund, and Michael Paquier.

Discussion: http://postgr.es/m/CA+TgmoYy-Vc6G9QKcAKNksCa29cv__czr+N9X_QCxEfQVpp_8w@mail.gmail.com
---
 .../pg_walinspect/expected/pg_walinspect.out  |  13 +-
 contrib/pg_walinspect/sql/pg_walinspect.sql   |  10 +-
 src/backend/access/rmgrdesc/xlogdesc.c        |   7 +
 src/backend/access/transam/xlog.c             | 193 ++++++++++++------
 src/backend/access/transam/xlogrecovery.c     |  11 +
 src/backend/replication/logical/decode.c      |   1 +
 src/include/catalog/pg_control.h              |   1 +
 src/tools/pgindent/typedefs.list              |   1 +
 8 files changed, 177 insertions(+), 60 deletions(-)

diff --git a/contrib/pg_walinspect/expected/pg_walinspect.out b/contrib/pg_walinspect/expected/pg_walinspect.out
index a8f4c91060..c010eed8c5 100644
--- a/contrib/pg_walinspect/expected/pg_walinspect.out
+++ b/contrib/pg_walinspect/expected/pg_walinspect.out
@@ -127,9 +127,20 @@ SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_block_info(:'wal_lsn3', :'wal_lsn4')
  t
 (1 row)
 
--- Force full-page image on the next update.
+-- Force a checkpoint so that the next update will log a full-page image.
 SELECT pg_current_wal_lsn() AS wal_lsn5 \gset
 CHECKPOINT;
+-- Verify that an XLOG_CHECKPOINT_REDO record begins at precisely the redo LSN
+-- of the checkpoint we just performed.
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, resource_manager,
+    record_type FROM pg_get_wal_record_info(:'redo_lsn');
+ same_lsn | resource_manager |   record_type   
+----------+------------------+-----------------
+ t        | XLOG             | CHECKPOINT_REDO
+(1 row)
+
+-- This update should produce a full-page image because of the checkpoint.
 UPDATE sample_tbl SET col1 = col1 + 1 WHERE col1 = 2;
 SELECT pg_current_wal_lsn() AS wal_lsn6 \gset
 -- Check if we get FPI from WAL record.
diff --git a/contrib/pg_walinspect/sql/pg_walinspect.sql b/contrib/pg_walinspect/sql/pg_walinspect.sql
index f987ca31c4..1e64a22d29 100644
--- a/contrib/pg_walinspect/sql/pg_walinspect.sql
+++ b/contrib/pg_walinspect/sql/pg_walinspect.sql
@@ -80,9 +80,17 @@ SELECT pg_current_wal_lsn() AS wal_lsn4 \gset
 SELECT COUNT(*) >= 1 AS ok FROM pg_get_wal_block_info(:'wal_lsn3', :'wal_lsn4')
   WHERE relfilenode = :'sample_tbl_oid' AND block_data IS NOT NULL;
 
--- Force full-page image on the next update.
+-- Force a checkpoint so that the next update will log a full-page image.
 SELECT pg_current_wal_lsn() AS wal_lsn5 \gset
 CHECKPOINT;
+
+-- Verify that an XLOG_CHECKPOINT_REDO record begins at precisely the redo LSN
+-- of the checkpoint we just performed.
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, resource_manager,
+    record_type FROM pg_get_wal_record_info(:'redo_lsn');
+
+-- This update should produce a full-page image because of the checkpoint.
 UPDATE sample_tbl SET col1 = col1 + 1 WHERE col1 = 2;
 SELECT pg_current_wal_lsn() AS wal_lsn6 \gset
 -- Check if we get FPI from WAL record.
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index f390c177e4..37f59bda7e 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -148,6 +148,10 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 LSN_FORMAT_ARGS(xlrec.overwritten_lsn),
 						 timestamptz_to_str(xlrec.overwrite_time));
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* No details to write out */
+	}
 }
 
 const char *
@@ -196,6 +200,9 @@ xlog_identify(uint8 info)
 		case XLOG_FPI_FOR_HINT:
 			id = "FPI_FOR_HINT";
 			break;
+		case XLOG_CHECKPOINT_REDO:
+			id = "CHECKPOINT_REDO";
+			break;
 	}
 
 	return id;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c0e4ca5089..cea13e3d58 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -559,6 +559,16 @@ typedef struct XLogCtlData
 	slock_t		info_lck;		/* locks shared variables shown above */
 } XLogCtlData;
 
+/*
+ * Classification of XLogRecordInsert operations.
+ */
+typedef enum
+{
+	WALINSERT_NORMAL,
+	WALINSERT_SPECIAL_SWITCH,
+	WALINSERT_SPECIAL_CHECKPOINT
+} WalInsertClass;
+
 static XLogCtlData *XLogCtl = NULL;
 
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
@@ -739,13 +749,21 @@ XLogInsertRecord(XLogRecData *rdata,
 	bool		inserted;
 	XLogRecord *rechdr = (XLogRecord *) rdata->data;
 	uint8		info = rechdr->xl_info & ~XLR_INFO_MASK;
-	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
-							   info == XLOG_SWITCH);
+	WalInsertClass class = WALINSERT_NORMAL;
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
 	bool		prevDoPageWrites = doPageWrites;
 	TimeLineID	insertTLI;
 
+	/* Does this record type require special handling? */
+	if (unlikely(rechdr->xl_rmid == RM_XLOG_ID))
+	{
+		if (info == XLOG_SWITCH)
+			class = WALINSERT_SPECIAL_SWITCH;
+		else if (info == XLOG_CHECKPOINT_REDO)
+			class = WALINSERT_SPECIAL_CHECKPOINT;
+	}
+
 	/* we assume that all of the record header is in the first chunk */
 	Assert(rdata->len >= SizeOfXLogRecord);
 
@@ -793,7 +811,7 @@ XLogInsertRecord(XLogRecData *rdata,
 	 */
 	START_CRIT_SECTION();
 
-	if (likely(!isLogSwitch))
+	if (likely(class == WALINSERT_NORMAL))
 	{
 		WALInsertLockAcquire();
 
@@ -843,7 +861,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		/* Normal records are always inserted. */
 		inserted = true;
 	}
-	else
+	else if (class == WALINSERT_SPECIAL_SWITCH)
 	{
 		/*
 		 * In order to insert an XLOG_SWITCH record, we need to hold all of
@@ -852,14 +870,32 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * remains in the current WAL segment and claimed all of it.
 		 *
 		 * Nonetheless, this case is simpler than the normal cases handled
-		 * above, which must check for changes in doPageWrites and RedoRecPtr.
-		 * Those checks are only needed for records that can contain
-		 * full-pages images, and an XLOG_SWITCH record never does.
+		 * below, which must check for changes in doPageWrites and RedoRecPtr.
+		 * Those checks are only needed for records that can contain buffer
+		 * references, and an XLOG_SWITCH record never does.
 		 */
 		Assert(fpw_lsn == InvalidXLogRecPtr);
 		WALInsertLockAcquireExclusive();
 		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
 	}
+	else
+	{
+		Assert(class == WALINSERT_SPECIAL_CHECKPOINT);
+
+		/*
+		 * We need to update both the local and shared copies of RedoRecPtr,
+		 * which means that we need to hold all the WAL insertion locks.
+		 * However, there can't be any buffer references, so as above, we need
+		 * not check RedoRecPtr before inserting the record; we just need to
+		 * update it afterwards.
+		 */
+		Assert(fpw_lsn == InvalidXLogRecPtr);
+		WALInsertLockAcquireExclusive();
+		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
+								  &rechdr->xl_prev);
+		RedoRecPtr = Insert->RedoRecPtr = StartPos;
+		inserted = true;
+	}
 
 	if (inserted)
 	{
@@ -876,7 +912,8 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * All the record data, including the header, is now ready to be
 		 * inserted. Copy the record in the space reserved.
 		 */
-		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
+		CopyXLogRecordToWAL(rechdr->xl_tot_len,
+							class == WALINSERT_SPECIAL_SWITCH, rdata,
 							StartPos, EndPos, insertTLI);
 
 		/*
@@ -935,7 +972,7 @@ XLogInsertRecord(XLogRecData *rdata,
 	 * padding space that fills the rest of the segment, and perform
 	 * end-of-segment actions (eg, notifying archiver).
 	 */
-	if (isLogSwitch)
+	if (class == WALINSERT_SPECIAL_SWITCH)
 	{
 		TRACE_POSTGRESQL_WAL_SWITCH();
 		XLogFlush(EndPos);
@@ -1054,8 +1091,12 @@ XLogInsertRecord(XLogRecData *rdata,
  *
  * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
  * where we actually copy the record to the reserved space.
+ *
+ * NB: Testing shows that XLogInsertRecord runs faster if this code is inlined;
+ * however, because there are two call sites, the compiler is reluctant to
+ * inline. We use pg_attribute_always_inline here to try to convince it.
  */
-static void
+static pg_attribute_always_inline void
 ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 						  XLogRecPtr *PrevPtr)
 {
@@ -6475,17 +6516,22 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
  * In particular note that this routine is synchronous and does not pay
  * attention to CHECKPOINT_WAIT.
  *
- * If !shutdown then we are writing an online checkpoint. This is a very special
- * kind of operation and WAL record because the checkpoint action occurs over
- * a period of time yet logically occurs at just a single LSN. The logical
- * position of the WAL record (redo ptr) is the same or earlier than the
- * physical position. When we replay WAL we locate the checkpoint via its
- * physical position then read the redo ptr and actually start replay at the
- * earlier logical position. Note that we don't write *anything* to WAL at
- * the logical position, so that location could be any other kind of WAL record.
- * All of this mechanism allows us to continue working while we checkpoint.
- * As a result, timing of actions is critical here and be careful to note that
- * this function will likely take minutes to execute on a busy system.
+ * If !shutdown then we are writing an online checkpoint. An XLOG_CHECKPOINT_REDO
+ * record is inserted into WAL at the logical location of the checkpoint, before
+ * flushing anything to disk, and when the checkpoint is eventually completed,
+ * and it is from this point that WAL replay will begin in the case of a recovery
+ * from this checkpoint. Once everything is written to disk, an
+ * XLOG_CHECKPOINT_ONLINE record is written to complete the checkpoint, and
+ * points back to the earlier XLOG_CHECKPOINT_REDO record. This mechanism allows
+ * other write-ahead log records to be written while the checkpoint is in
+ * progress, but we must be very careful about order of operations. This function
+ * may take many minutes to execute on a busy system.
+ *
+ * On the other hand, when shutdown is true, concurrent insertion into the
+ * write-ahead log is impossible, so there is no need for two separate records.
+ * In this case, we only insert an XLOG_CHECKPOINT_SHUTDOWN record, and it's
+ * both the record marking the completion of the checkpoint and the location
+ * from which WAL replay would begin if needed.
  */
 void
 CreateCheckPoint(int flags)
@@ -6497,7 +6543,6 @@ CreateCheckPoint(int flags)
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	uint32		freespace;
 	XLogRecPtr	PriorRedoPtr;
-	XLogRecPtr	curInsert;
 	XLogRecPtr	last_important_lsn;
 	VirtualTransactionId *vxids;
 	int			nvxids;
@@ -6567,13 +6612,6 @@ CreateCheckPoint(int flags)
 	 */
 	last_important_lsn = GetLastImportantRecPtr();
 
-	/*
-	 * We must block concurrent insertions while examining insert state to
-	 * determine the checkpoint REDO pointer.
-	 */
-	WALInsertLockAcquireExclusive();
-	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
-
 	/*
 	 * If this isn't a shutdown or forced checkpoint, and if there has been no
 	 * WAL activity requiring a checkpoint, skip it.  The idea here is to
@@ -6584,7 +6622,6 @@ CreateCheckPoint(int flags)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
-			WALInsertLockRelease();
 			END_CRIT_SECTION();
 			ereport(DEBUG1,
 					(errmsg_internal("checkpoint skipped because system is idle")));
@@ -6606,38 +6643,47 @@ CreateCheckPoint(int flags)
 	else
 		checkPoint.PrevTimeLineID = checkPoint.ThisTimeLineID;
 
-	checkPoint.fullPageWrites = Insert->fullPageWrites;
-
 	/*
-	 * Compute new REDO record ptr = location of next XLOG record.
-	 *
-	 * NB: this is NOT necessarily where the checkpoint record itself will be,
-	 * since other backends may insert more XLOG records while we're off doing
-	 * the buffer flush work.  Those XLOG records are logically after the
-	 * checkpoint, even though physically before it.  Got that?
+	 * We must block concurrent insertions while examining insert state.
 	 */
-	freespace = INSERT_FREESPACE(curInsert);
-	if (freespace == 0)
+	WALInsertLockAcquireExclusive();
+
+	checkPoint.fullPageWrites = Insert->fullPageWrites;
+
+	if (shutdown)
 	{
-		if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
-			curInsert += SizeOfXLogLongPHD;
-		else
-			curInsert += SizeOfXLogShortPHD;
-	}
-	checkPoint.redo = curInsert;
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
 
-	/*
-	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
-	 * must be done while holding all the insertion locks.
-	 *
-	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
-	 * pointing past where it really needs to point.  This is okay; the only
-	 * consequence is that XLogInsert might back up whole buffers that it
-	 * didn't really need to.  We can't postpone advancing RedoRecPtr because
-	 * XLogInserts that happen while we are dumping buffers must assume that
-	 * their buffer changes are not included in the checkpoint.
-	 */
-	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
+		/*
+		 * Compute new REDO record ptr = location of next XLOG record.
+		 *
+		 * Since this is a shutdown checkpoint, there can't be any concurrent
+		 * WAL insertion.
+		 */
+		freespace = INSERT_FREESPACE(curInsert);
+		if (freespace == 0)
+		{
+			if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
+				curInsert += SizeOfXLogLongPHD;
+			else
+				curInsert += SizeOfXLogShortPHD;
+		}
+		checkPoint.redo = curInsert;
+
+		/*
+		 * Here we update the shared RedoRecPtr for future XLogInsert calls;
+		 * this must be done while holding all the insertion locks.
+		 *
+		 * Note: if we fail to complete the checkpoint, RedoRecPtr will be
+		 * left pointing past where it really needs to point.  This is okay;
+		 * the only consequence is that XLogInsert might back up whole buffers
+		 * that it didn't really need to.  We can't postpone advancing
+		 * RedoRecPtr because XLogInserts that happen while we are dumping
+		 * buffers must assume that their buffer changes are not included in
+		 * the checkpoint.
+		 */
+		RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
+	}
 
 	/*
 	 * Now we can release the WAL insertion locks, allowing other xacts to
@@ -6645,6 +6691,33 @@ CreateCheckPoint(int flags)
 	 */
 	WALInsertLockRelease();
 
+	/*
+	 * If this is an online checkpoint, we have not yet determined the redo
+	 * point. We do so now by inserting the special XLOG_CHECKPOINT_REDO
+	 * record; the LSN at which it starts becomes the new redo pointer. We
+	 * don't do this for a shutdown checkpoint, because in that case no WAL
+	 * can be written between the redo point and the insertion of the
+	 * checkpoint record itself, so the checkpoint record itself serves to
+	 * mark the redo point.
+	 */
+	if (!shutdown)
+	{
+		int			dummy = 0;
+
+		/* Record must have payload to avoid assertion failure. */
+		XLogBeginInsert();
+		XLogRegisterData((char *) &dummy, sizeof(dummy));
+		(void) XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_REDO);
+
+		/*
+		 * XLogInsertRecord will have updated XLogCtl->Insert.RedoRecPtr in
+		 * shared memory and RedoRecPtr in backend-local memory, but we need
+		 * to copy that into the record that will be inserted when the
+		 * checkpoint is complete.
+		 */
+		checkPoint.redo = RedoRecPtr;
+	}
+
 	/* Update the info_lck-protected copy of RedoRecPtr as well */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->RedoRecPtr = checkPoint.redo;
@@ -8105,6 +8178,10 @@ xlog_redo(XLogReaderState *record)
 		/* Keep track of full_page_writes */
 		lastFullPageWrites = fpw;
 	}
+	else if (info == XLOG_CHECKPOINT_REDO)
+	{
+		/* nothing to do here, just for informational purposes */
+	}
 }
 
 /*
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index becc2bda62..d6f2bb8286 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1638,6 +1638,17 @@ PerformWalRecovery(void)
 		replayTLI = RedoStartTLI;
 		XLogPrefetcherBeginRead(xlogprefetcher, RedoStartLSN);
 		record = ReadRecord(xlogprefetcher, PANIC, false, replayTLI);
+
+		/*
+		 * If a checkpoint record's redo pointer points back to an earlier
+		 * LSN, the record at that LSN should be an XLOG_CHECKPOINT_REDO
+		 * record.
+		 */
+		if (record->xl_rmid != RM_XLOG_ID ||
+			(record->xl_info & ~XLR_INFO_MASK) != XLOG_CHECKPOINT_REDO)
+			ereport(FATAL,
+					(errmsg("unexpected record type found at redo point %X/%X",
+							LSN_FORMAT_ARGS(xlogreader->ReadRecPtr))));
 	}
 	else
 	{
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 730061c9da..24b712aa66 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -190,6 +190,7 @@ xlog_decode(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_FPI_FOR_HINT:
 		case XLOG_FPI:
 		case XLOG_OVERWRITE_CONTRECORD:
+		case XLOG_CHECKPOINT_REDO:
 			break;
 		default:
 			elog(ERROR, "unexpected RM_XLOG_ID record type: %u", info);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index dc953977c5..1136613259 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -78,6 +78,7 @@ typedef struct CheckPoint
 #define XLOG_FPI						0xB0
 /* 0xC0 is used in Postgres 9.5-11 */
 #define XLOG_OVERWRITE_CONTRECORD		0xD0
+#define XLOG_CHECKPOINT_REDO			0xE0
 
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e69bb671bf..06b25617bc 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2971,6 +2971,7 @@ VolatileFunctionStatus
 Vsrt
 WAIT_ORDER
 WALAvailability
+WalInsertClass
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
-- 
2.37.1 (Apple Git-137.1)

#38Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#37)
Re: New WAL record to detect the checkpoint redo location

On Wed, Oct 18, 2023 at 10:24:50AM -0400, Robert Haas wrote:

I added a variant of this test case. Here's v10.

+-- Verify that an XLOG_CHECKPOINT_REDO record begins at precisely the redo LSN
+-- of the checkpoint we just performed.
+SELECT redo_lsn FROM pg_control_checkpoint() \gset
+SELECT start_lsn = :'redo_lsn'::pg_lsn AS same_lsn, resource_manager,
+    record_type FROM pg_get_wal_record_info(:'redo_lsn');
+ same_lsn | resource_manager |   record_type   
+----------+------------------+-----------------
+ t        | XLOG             | CHECKPOINT_REDO
+(1 row)

Seems fine to me. Thanks for considering the idea.
--
Michael

#39Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#38)
Re: New WAL record to detect the checkpoint redo location

On Thu, Oct 19, 2023 at 1:53 AM Michael Paquier <michael@paquier.xyz> wrote:

Seems fine to me. Thanks for considering the idea.

I think it was a good idea!

I've committed the patch.

--
Robert Haas
EDB: http://www.enterprisedb.com