[RFC] Lock-free XLog Reservation from WAL

Started by Zhou, Zhiguo · about 1 year ago · 47 messages
#1 Zhou, Zhiguo
zhiguo.zhou@intel.com
1 attachment(s)

Hi all,

I am reaching out to solicit your insights and comments on a recent proposal regarding "Lock-free XLog Reservation from WAL." We have identified a bottleneck in the current WAL insertion path: every insertion must reserve space in the WAL buffer, which involves updating two shared-memory fields in XLogCtlInsert: CurrBytePos (the start position of the current XLog record) and PrevBytePos (the prev-link to the previous record). The XLogCtlInsert.insertpos_lck spinlock currently keeps these two fields consistent, but it introduces lock contention and hinders the parallelism of XLog insertions.

To address this issue, we propose the following changes:

1. Removal of PrevBytePos: This will allow increments of the CurrBytePos (a single uint64 field) to be implemented with an atomic operation (fetch_add).
2. Updating Prev-Link of next XLog: Since the prev-link of the next XLog record always points to the head of the current one, the current inserter will write slightly beyond its reserved memory range to fill in the next record's prev-link, regardless of which backend acquires the next memory space. The next inserter will wait until its prev-link has been filled in, which it needs for its CRC calculation, before starting its own XLog copy into the WAL (see the condensed sketch after this list).
3. Breaking Sequential Write Convention: Each backend will update the prev-link of the next XLog first, then return to the header position to insert its own record. This change reduces the dependency of each XLog write on the preceding ones, compared with strictly sequential writes.
4. Revised GetXLogBuffer: To support #3, we need to update this function to separate the LSN it intends to access from the LSN it advertises in the insertingAt field.
5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance the parallelism.
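
To make items 1-3 concrete, here is a condensed sketch of the reservation fast path, simplified from the attached patch (the log-switch path, the large-record case and error handling are omitted):

static void
ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos)
{
	XLogCtlInsert *Insert = &XLogCtl->Insert;
	uint64		startbytepos;
	uint64		endbytepos;

	size = MAXALIGN(size);

	/* A single atomic fetch_add replaces the insertpos_lck critical section. */
	startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
	endbytepos = startbytepos + size;

	*StartPos = XLogBytePosToRecPtr(startbytepos);
	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
}

/* In XLogInsertRecord(), after reserving [StartPos, EndPos): */

/* Fill in the prev-link of the *next* record, which must point at StartPos. */
SetPrevRecPtr(EndPos, StartPos, insertTLI, false);

/* Spin until the previous inserter has published our own prev-link. */
while (!(rechdr->xl_prev = GetPrevRecPtr(StartPos, insertTLI)))
	;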

The attached patch passes the regression tests (make check, make check-world), and a performance test of this POC on SPR (480 vCPUs) shows that the optimization helps the TPCC benchmark scale better with the core count; with all cores enabled, performance improves by 2.04x.

Before we proceed with further patch validation and refinement work, we are eager to hear the community's thoughts and comments on this optimization so that we can confirm our current work aligns with expectations.

Attachments:

0001-Lock-free-XLog-Reservation-from-WAL.patch (application/octet-stream)
From 2f4a32b1d09419167fe5040465c8a3464010e012 Mon Sep 17 00:00:00 2001
From: Zhiguo Zhou <zhiguo.zhou@intel.com>
Date: Thu, 2 Jan 2025 12:04:02 +0800
Subject: [PATCH] Lock-free XLog Reservation from WAL

Removed PrevBytePos to eliminate lock contention, allowing atomic updates
to CurrBytePos. Adjusted XLog reservation to exceed current XLog memory
slightly, enabling the next XLog's prev-link update without waiting for
the current XLog. Backends now update next XLog's prev-link before
inserting current log, breaking sequential write convention. Updated
GetXLogBuffer to handle separate access and update LSNs. Increased
NUM_XLOGINSERT_LOCKS.
---
 src/backend/access/transam/xlog.c | 186 +++++++++++++++++++++---------
 1 file changed, 131 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bca..6215ea6977 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -147,7 +147,7 @@ int			wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
  * to happen concurrently, but adds some CPU overhead to flushing the WAL,
  * which needs to iterate all the locks.
  */
-#define NUM_XLOGINSERT_LOCKS  8
+#define NUM_XLOGINSERT_LOCKS  128
 
 /*
  * Max distance from last checkpoint, before triggering a new xlog-based
@@ -404,8 +404,7 @@ typedef struct XLogCtlInsert
 	 * prev-link of the next record. These are stored as "usable byte
 	 * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).
 	 */
-	uint64		CurrBytePos;
-	uint64		PrevBytePos;
+	pg_atomic_uint64		CurrBytePos;
 
 	/*
 	 * Make sure the above heavily-contended spinlock and byte positions are
@@ -700,12 +699,14 @@ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
 								XLogRecData *rdata,
 								XLogRecPtr StartPos, XLogRecPtr EndPos,
 								TimeLineID tli);
+static XLogRecPtr GetPrevRecPtr(XLogRecPtr hdr, TimeLineID tli);
+static void SetPrevRecPtr(XLogRecPtr hdr, XLogRecPtr PrevPos,
+						  TimeLineID tli, bool in_order);
 static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
-									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
-static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
-							  XLogRecPtr *PrevPtr);
+									  XLogRecPtr *EndPos);
+static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos);
 static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
-static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
+static char *GetXLogBuffer(XLogRecPtr ptr, XLogRecPtr upto, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
@@ -862,8 +863,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		 * Reserve space for the record in the WAL. This also sets the xl_prev
 		 * pointer.
 		 */
-		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
-								  &rechdr->xl_prev);
+		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos);
 
 		/* Normal records are always inserted. */
 		inserted = true;
@@ -883,7 +883,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		 */
 		Assert(fpw_lsn == InvalidXLogRecPtr);
 		WALInsertLockAcquireExclusive();
-		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
+		inserted = ReserveXLogSwitch(&StartPos, &EndPos);
 	}
 	else
 	{
@@ -898,14 +898,22 @@ XLogInsertRecord(XLogRecData *rdata,
 		 */
 		Assert(fpw_lsn == InvalidXLogRecPtr);
 		WALInsertLockAcquireExclusive();
-		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
-								  &rechdr->xl_prev);
+		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos);
 		RedoRecPtr = Insert->RedoRecPtr = StartPos;
 		inserted = true;
 	}
 
 	if (inserted)
 	{
+		bool		islargerecord;
+
+		islargerecord = EndPos - StartPos >= (Size) XLOGbuffers * XLOG_BLCKSZ;
+		if (!islargerecord)
+		{
+			SetPrevRecPtr(EndPos, StartPos, insertTLI, false);
+		}
+		while (!(rechdr->xl_prev = GetPrevRecPtr(StartPos, insertTLI)));
+
 		/*
 		 * Now that xl_prev has been filled in, calculate CRC of the record
 		 * header.
@@ -923,6 +931,11 @@ XLogInsertRecord(XLogRecData *rdata,
 							class == WALINSERT_SPECIAL_SWITCH, rdata,
 							StartPos, EndPos, insertTLI);
 
+		if (islargerecord)
+		{
+			SetPrevRecPtr(EndPos, StartPos, insertTLI, true);
+		}
+
 		/*
 		 * Unless record is flagged as not important, update LSN of last
 		 * important record in the current slot. When holding all locks, just
@@ -1105,13 +1118,11 @@ XLogInsertRecord(XLogRecData *rdata,
  * inline. We use pg_attribute_always_inline here to try to convince it.
  */
 static pg_attribute_always_inline void
-ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
-						  XLogRecPtr *PrevPtr)
+ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	uint64		startbytepos;
 	uint64		endbytepos;
-	uint64		prevbytepos;
 
 	size = MAXALIGN(size);
 
@@ -1128,19 +1139,11 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	 * because the usable byte position doesn't include any headers, reserving
 	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
 	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
-
-	startbytepos = Insert->CurrBytePos;
+	startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
-
-	SpinLockRelease(&Insert->insertpos_lck);
 
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
-	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
 
 	/*
 	 * Check that the conversions between "usable byte positions" and
@@ -1148,7 +1151,6 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	 */
 	Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
 	Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
-	Assert(XLogRecPtrToBytePos(*PrevPtr) == prevbytepos);
 }
 
 /*
@@ -1161,12 +1163,11 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
  * reserving any space, and the function returns false.
 */
 static bool
-ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
+ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos)
 {
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	uint64		startbytepos;
 	uint64		endbytepos;
-	uint64		prevbytepos;
 	uint32		size = MAXALIGN(SizeOfXLogRecord);
 	XLogRecPtr	ptr;
 	uint32		segleft;
@@ -1179,7 +1180,7 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 	 */
 	SpinLockAcquire(&Insert->insertpos_lck);
 
-	startbytepos = Insert->CurrBytePos;
+	startbytepos = pg_atomic_read_u64(&Insert->CurrBytePos);
 
 	ptr = XLogBytePosToEndRecPtr(startbytepos);
 	if (XLogSegmentOffset(ptr, wal_segment_size) == 0)
@@ -1190,7 +1191,6 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 	}
 
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
 
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
@@ -1202,17 +1202,13 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 		*EndPos += segleft;
 		endbytepos = XLogRecPtrToBytePos(*EndPos);
 	}
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
+	pg_atomic_write_u64(&Insert->CurrBytePos, endbytepos);
 
 	SpinLockRelease(&Insert->insertpos_lck);
 
-	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
-
 	Assert(XLogSegmentOffset(*EndPos, wal_segment_size) == 0);
 	Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
 	Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
-	Assert(XLogRecPtrToBytePos(*PrevPtr) == prevbytepos);
 
 	return true;
 }
@@ -1236,7 +1232,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 	 * inserting to.
 	 */
 	CurrPos = StartPos;
-	currpos = GetXLogBuffer(CurrPos, tli);
+	currpos = GetXLogBuffer(CurrPos, CurrPos, tli);
 	freespace = INSERT_FREESPACE(CurrPos);
 
 	/*
@@ -1273,7 +1269,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 			 * page was initialized, in AdvanceXLInsertBuffer, and we're the
 			 * only backend that needs to set the contrecord flag.
 			 */
-			currpos = GetXLogBuffer(CurrPos, tli);
+			currpos = GetXLogBuffer(CurrPos, CurrPos, tli);
 			pagehdr = (XLogPageHeader) currpos;
 			pagehdr->xlp_rem_len = write_len - written;
 			pagehdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;
@@ -1346,7 +1342,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 			 * (which itself calls the two methods we need) to get the pointer
 			 * and zero most of the page.  Then we just zero the page header.
 			 */
-			currpos = GetXLogBuffer(CurrPos, tli);
+			currpos = GetXLogBuffer(CurrPos, CurrPos, tli);
 			MemSet(currpos, 0, SizeOfXLogShortPHD);
 
 			CurrPos += XLOG_BLCKSZ;
@@ -1364,6 +1360,90 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 				errmsg_internal("space reserved for WAL record does not match what was written"));
 }
 
+static XLogRecPtr
+GetPrevRecPtr(XLogRecPtr hdr, TimeLineID tli)
+{
+	uint64		xlprevbytepos;
+	char		*xl_prev_pos;
+	int			freespace;
+
+	XLogRecPtr	res;
+	XLogRecPtr	xl_prev_ptr;
+
+	xlprevbytepos = XLogRecPtrToBytePos(hdr) + offsetof(XLogRecord, xl_prev);
+	xl_prev_ptr = XLogBytePosToRecPtr(xlprevbytepos);
+	xl_prev_pos = GetXLogBuffer(xl_prev_ptr, hdr, tli);
+	freespace = INSERT_FREESPACE(xl_prev_ptr);
+
+	if (freespace >= sizeof(XLogRecPtr))
+	{
+		res = *((XLogRecPtr *) xl_prev_pos);
+	}
+	else
+	{
+		char	*res_data;
+
+		res_data = (char *) &res;
+		memcpy(res_data, xl_prev_pos, freespace);
+
+		if (XLogSegmentOffset(xl_prev_ptr + freespace, wal_segment_size) == 0)
+		{
+			xl_prev_pos += SizeOfXLogLongPHD;
+		}
+		else
+		{
+			xl_prev_pos += SizeOfXLogShortPHD;
+		}
+
+		memcpy(res_data + freespace, xl_prev_pos, sizeof(XLogRecPtr) - freespace);
+	}
+
+	return res;
+}
+
+static void
+SetPrevRecPtr(XLogRecPtr hdr, XLogRecPtr xl_prev_to_insert, TimeLineID tli, bool in_order)
+{
+	uint64		xlprevbytepos;
+	char		*xl_prev_pos;
+	int			freespace;
+
+	XLogRecPtr	xl_prev_ptr;
+
+	xlprevbytepos = XLogRecPtrToBytePos(hdr) + offsetof(XLogRecord, xl_prev);
+	xl_prev_ptr = XLogBytePosToRecPtr(xlprevbytepos);
+
+	if (in_order)
+		xl_prev_pos = GetXLogBuffer(xl_prev_ptr, xl_prev_ptr, tli);
+	else
+		xl_prev_pos = GetXLogBuffer(xl_prev_ptr, xl_prev_to_insert, tli);
+
+	freespace = INSERT_FREESPACE(xl_prev_ptr);
+
+	if (freespace >= sizeof(XLogRecPtr))
+	{
+		*((XLogRecPtr *) xl_prev_pos) = xl_prev_to_insert;
+	}
+	else
+	{
+		char	*xl_prev_to_insert_data;
+
+		xl_prev_to_insert_data = (char *) &xl_prev_to_insert;
+		memcpy(xl_prev_pos, xl_prev_to_insert_data, freespace);
+
+		if (XLogSegmentOffset(xl_prev_ptr + freespace, wal_segment_size) == 0)
+		{
+			xl_prev_pos += SizeOfXLogLongPHD;
+		}
+		else
+		{
+			xl_prev_pos += SizeOfXLogShortPHD;
+		}
+
+		memcpy(xl_prev_pos, xl_prev_to_insert_data + freespace, sizeof(XLogRecPtr) - freespace);
+	}
+}
+
 /*
  * Acquire a WAL insertion lock, for inserting to WAL.
  */
@@ -1522,9 +1602,7 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 		return inserted;
 
 	/* Read the current insert position */
-	SpinLockAcquire(&Insert->insertpos_lck);
-	bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
+	bytepos = pg_atomic_read_u64(&Insert->CurrBytePos);
 	reservedUpto = XLogBytePosToEndRecPtr(bytepos);
 
 	/*
@@ -1629,7 +1707,7 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
  * later, because older buffers might be recycled already)
  */
 static char *
-GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
+GetXLogBuffer(XLogRecPtr ptr, XLogRecPtr upto, TimeLineID tli)
 {
 	int			idx;
 	XLogRecPtr	endptr;
@@ -1690,14 +1768,14 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 		 * sure that it's initialized, before we let insertingAt to move past
 		 * the page header.
 		 */
-		if (ptr % XLOG_BLCKSZ == SizeOfXLogShortPHD &&
-			XLogSegmentOffset(ptr, wal_segment_size) > XLOG_BLCKSZ)
-			initializedUpto = ptr - SizeOfXLogShortPHD;
-		else if (ptr % XLOG_BLCKSZ == SizeOfXLogLongPHD &&
-				 XLogSegmentOffset(ptr, wal_segment_size) < XLOG_BLCKSZ)
-			initializedUpto = ptr - SizeOfXLogLongPHD;
+		if (upto % XLOG_BLCKSZ == SizeOfXLogShortPHD &&
+			XLogSegmentOffset(upto, wal_segment_size) > XLOG_BLCKSZ)
+			initializedUpto = upto - SizeOfXLogShortPHD;
+		else if (upto % XLOG_BLCKSZ == SizeOfXLogLongPHD &&
+				 XLogSegmentOffset(upto, wal_segment_size) < XLOG_BLCKSZ)
+			initializedUpto = upto - SizeOfXLogLongPHD;
 		else
-			initializedUpto = ptr;
+			initializedUpto = upto;
 
 		WALInsertLockUpdateInsertingAt(initializedUpto);
 
@@ -6018,8 +6096,7 @@ StartupXLOG(void)
 	 * previous incarnation.
 	 */
 	Insert = &XLogCtl->Insert;
-	Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
-	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+	pg_atomic_write_u64(&Insert->CurrBytePos, XLogRecPtrToBytePos(EndOfLog));
 
 	/*
 	 * Tricky point here: lastPage contains the *last* block that the LastRec
@@ -6053,6 +6130,7 @@ StartupXLOG(void)
 		 */
 		XLogCtl->InitializedUpTo = EndOfLog;
 	}
+	SetPrevRecPtr(EndOfLog, endOfRecoveryInfo->lastRec, newTLI, true);
 
 	/*
 	 * Update local and shared status.  This is OK to do without any locks
@@ -7005,7 +7083,7 @@ CreateCheckPoint(int flags)
 
 	if (shutdown)
 	{
-		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(pg_atomic_read_u64(&Insert->CurrBytePos));
 
 		/*
 		 * Compute new REDO record ptr = location of next XLOG record.
@@ -7473,7 +7551,7 @@ CreateOverwriteContrecordRecord(XLogRecPtr aborted_lsn, XLogRecPtr pagePtr,
 	 * insertion lock is just pro forma.
 	 */
 	WALInsertLockAcquire();
-	pagehdr = (XLogPageHeader) GetXLogBuffer(pagePtr, newTLI);
+	pagehdr = (XLogPageHeader) GetXLogBuffer(pagePtr, pagePtr, newTLI);
 	pagehdr->xlp_info |= XLP_FIRST_IS_OVERWRITE_CONTRECORD;
 	WALInsertLockRelease();
 
@@ -9437,9 +9515,7 @@ GetXLogInsertRecPtr(void)
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	uint64		current_bytepos;
 
-	SpinLockAcquire(&Insert->insertpos_lck);
-	current_bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
+	current_bytepos = pg_atomic_read_u64(&Insert->CurrBytePos);
 
 	return XLogBytePosToRecPtr(current_bytepos);
 }
-- 
2.43.0

#2 Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Zhou, Zhiguo (#1)
1 attachment(s)


#3 Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Zhou, Zhiguo (#2)
RE: [RFC] Lock-free XLog Reservation from WAL

This message is a duplicate of PH7PR11MB5796659F654F9BE983F3AD97EF142@PH7PR11MB5796.namprd11.prod.outlook.com. Please consider dropping this thread and reviewing the original one instead.

Sorry for your inconvenience.

-----Original Message-----
From: Zhou, Zhiguo <zhiguo.zhou@intel.com>
Sent: Thursday, January 2, 2025 3:20 PM
To: pgsql-hackers@lists.postgresql.org
Subject: [RFC] Lock-free XLog Reservation from WAL

Hi all,

I am reaching out to solicit your insights and comments on a recent proposal regarding the "Lock-free XLog Reservation from WAL." We have identified some challenges with the current WAL insertions, which require space reservations in the WAL buffer which involve updating two shared-memory statuses in XLogCtlInsert: CurrBytePos (the start position of the current XLog) and PrevBytePos (the prev-link to the previous XLog). Currently, the use of XLogCtlInsert.insertpos_lck ensures consistency but introduces lock contention, hindering the parallelism of XLog insertions.

To address this issue, we propose the following changes:

1. Removal of PrevBytePos: This will allow increments of the CurrBytePos (a single uint64 field) to be implemented with an atomic operation (fetch_add).
2. Updating Prev-Link of next XLog: Based on the fact that the prev-link of the next XLog always points to the head of the current Xlog,we will slightly exceed the reserved memory range of the current XLog to update the prev-link of the next XLog, regardless of which backend acquires the next memory space. The next XLog inserter will wait until its prev-link is updated for CRC calculation before starting its own XLog copy into the WAL.
3. Breaking Sequential Write Convention: Each backend will update the prev-link of its next XLog first, then return to the header position for the current log insertion. This change will reduce the dependency of XLog writes on previous ones (compared with the sequential writes).
4. Revised GetXLogBuffer: To support #3, we need update this function to separate the LSN it intends to access from the LSN it expects to update in the insertingAt field.
5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance the parallelism.

The attached patch could pass the regression tests (make check, make check-world), and in the performance test of this POC on SPR (480 vCPU) shows that this optimization could help the TPCC benchmark better scale with the core count and as a result the performance with full cores enabled could be improved by 2.04x.

Before we proceed with further patch validation and refinement work, we are eager to hear the community's thoughts and comments on this optimization so that we can confirm our current work aligns with expectations.

#4 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Zhou, Zhiguo (#1)
Re: [RFC] Lock-free XLog Reservation from WAL

02.01.2025 10:05, Zhou, Zhiguo wrote:

Hi all,

I am reaching out to solicit your insights and comments on a recent proposal regarding the "Lock-free XLog Reservation from WAL." We have identified some challenges with the current WAL insertions, which require space reservations in the WAL buffer which involve updating two shared-memory statuses in XLogCtlInsert: CurrBytePos (the start position of the current XLog) and PrevBytePos (the prev-link to the previous XLog). Currently, the use of XLogCtlInsert.insertpos_lck ensures consistency but introduces lock contention, hindering the parallelism of XLog insertions.

To address this issue, we propose the following changes:

1. Removal of PrevBytePos: This will allow increments of the CurrBytePos (a single uint64 field) to be implemented with an atomic operation (fetch_add).
2. Updating Prev-Link of next XLog: Based on the fact that the prev-link of the next XLog always points to the head of the current Xlog,we will slightly exceed the reserved memory range of the current XLog to update the prev-link of the next XLog, regardless of which backend acquires the next memory space. The next XLog inserter will wait until its prev-link is updated for CRC calculation before starting its own XLog copy into the WAL.
3. Breaking Sequential Write Convention: Each backend will update the prev-link of its next XLog first, then return to the header position for the current log insertion. This change will reduce the dependency of XLog writes on previous ones (compared with the sequential writes).
4. Revised GetXLogBuffer: To support #3, we need update this function to separate the LSN it intends to access from the LSN it expects to update in the insertingAt field.
5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance the parallelism.

The attached patch could pass the regression tests (make check, make check-world), and in the performance test of this POC on SPR (480 vCPU) shows that this optimization could help the TPCC benchmark better scale with the core count and as a result the performance with full cores enabled could be improved by 2.04x.

Before we proceed with further patch validation and refinement work, we are eager to hear the community's thoughts and comments on this optimization so that we can confirm our current work aligns with expectations.

Good day, Zhiguo.

Idea looks great.

Minor issue:
- you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.

I initially thought it became un-synchronized against
`ReserveXLogInsertLocation`, but looking closer I found it is
synchronized with `WALInsertLockAcquireExclusive`.
Since there are no other `insertpos_lck` usages after your patch, I
don't see why it should exist or be used in `ReserveXLogSwitch`.

Still, I'd prefer to see a CAS loop in this place, to be consistent with
the other non-locking accesses. It would also allow getting rid of
`WALInsertLockAcquireExclusive` (though that is probably not a big issue).
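
For illustration, such a CAS loop might look roughly like this (a
hypothetical sketch, not part of the posted patch; it keeps the existing
segment-boundary logic and just swaps the spinlock for
pg_atomic_compare_exchange_u64):

static bool
ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos)
{
	XLogCtlInsert *Insert = &XLogCtl->Insert;
	uint32		size = MAXALIGN(SizeOfXLogRecord);
	uint64		startbytepos;
	uint64		endbytepos;
	XLogRecPtr	ptr;
	uint32		segleft;

	startbytepos = pg_atomic_read_u64(&Insert->CurrBytePos);

	for (;;)
	{
		ptr = XLogBytePosToEndRecPtr(startbytepos);
		if (XLogSegmentOffset(ptr, wal_segment_size) == 0)
		{
			/* Already at a segment boundary: nothing to reserve. */
			*EndPos = *StartPos = ptr;
			return false;
		}

		endbytepos = startbytepos + size;
		*StartPos = XLogBytePosToRecPtr(startbytepos);
		*EndPos = XLogBytePosToEndRecPtr(endbytepos);

		if (XLogSegmentOffset(*EndPos, wal_segment_size) != 0)
		{
			segleft = wal_segment_size -
				XLogSegmentOffset(*EndPos, wal_segment_size);
			*EndPos += segleft;
			endbytepos = XLogRecPtrToBytePos(*EndPos);
		}

		/*
		 * On failure pg_atomic_compare_exchange_u64() reloads startbytepos
		 * with the current value, so we simply recompute and retry.
		 */
		if (pg_atomic_compare_exchange_u64(&Insert->CurrBytePos,
										   &startbytepos, endbytepos))
			break;
	}

	Assert(XLogSegmentOffset(*EndPos, wal_segment_size) == 0);
	return true;
}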

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic writes/reads on
platforms where MAXALIGN != 8 or without native 64-bit load/store. The
branch with `memcpy` is the obvious case, but even the pointer
dereference in the "lucky case" is not safe either.

I have no idea how to fix it at the moment.

Readability issue:
- It would be good to add `Assert(ptr >= upto)` into `GetXLogBuffer`.
I had a hard time recognizing that `upto` is never in the future.
- Certainly, the final version has to have fixed and improved comments.
Many of the patch's ideas are far from obvious. I had a hard time
convincing myself the patch is not a piece of ... (excuse me for the swear sentence).

Indeed, the patch is much better than it looks at first sight.
I came up with an alternative idea yesterday, but looking closer at your patch
today I see it is superior to mine (provided the atomic access is fixed).

----

regards,
Yura Sokolov aka funny-falcon

#5 wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Yura Sokolov (#4)
Re: [RFC] Lock-free XLog Reservation from WAL

Hi,
Thank you for your patch. Increasing NUM_XLOGINSERT_LOCKS to 128 will, I
think, be challenged; should we make it a GUC?

On Fri, 3 Jan 2025 at 20:36, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:


02.01.2025 10:05, Zhou, Zhiguo wrote:

Hi all,

I am reaching out to solicit your insights and comments on a recent

proposal regarding the "Lock-free XLog Reservation from WAL." We have
identified some challenges with the current WAL insertions, which require
space reservations in the WAL buffer which involve updating two
shared-memory statuses in XLogCtlInsert: CurrBytePos (the start position of
the current XLog) and PrevBytePos (the prev-link to the previous XLog).
Currently, the use of XLogCtlInsert.insertpos_lck ensures consistency but
introduces lock contention, hindering the parallelism of XLog insertions.

To address this issue, we propose the following changes:

1. Removal of PrevBytePos: This will allow increments of the CurrBytePos

(a single uint64 field) to be implemented with an atomic operation
(fetch_add).

2. Updating Prev-Link of next XLog: Based on the fact that the prev-link

of the next XLog always points to the head of the current Xlog,we will
slightly exceed the reserved memory range of the current XLog to update the
prev-link of the next XLog, regardless of which backend acquires the next
memory space. The next XLog inserter will wait until its prev-link is
updated for CRC calculation before starting its own XLog copy into the WAL.

3. Breaking Sequential Write Convention: Each backend will update the

prev-link of its next XLog first, then return to the header position for
the current log insertion. This change will reduce the dependency of XLog
writes on previous ones (compared with the sequential writes).

4. Revised GetXLogBuffer: To support #3, we need update this function to

separate the LSN it intends to access from the LSN it expects to update in
the insertingAt field.

5. Increase NUM_XLOGINSERT_LOCKS: With the above changes, increasing

NUM_XLOGINSERT_LOCKS, for example to 128, could effectively enhance the
parallelism.

The attached patch could pass the regression tests (make check, make

check-world), and in the performance test of this POC on SPR (480 vCPU)
shows that this optimization could help the TPCC benchmark better scale
with the core count and as a result the performance with full cores enabled
could be improved by 2.04x.

Before we proceed with further patch validation and refinement work, we

are eager to hear the community's thoughts and comments on this
optimization so that we can confirm our current work aligns with
expectations.

Good day, Zhiguo.

Idea looks great.

Minor issue:
- you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.

I initially thought it became un-synchronized against
`ReserveXLogInsertLocation`, but looking closer I found it is
synchronized with `WALInsertLockAcquireExclusive`.
Since there are no other `insertpos_lck` usages after your patch, I
don't see why it should exists and be used in `ReserveXLogSwitch`.

Still I'd prefer to see CAS loop in this place to be consistent with
other non-locking access. And it will allow to get rid of
`WALInsertLockAcquireExclusive`, (though probably it is not a big issue).

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read with on
platforms where MAXALIGN != 8 or without native 64 load/store. Branch
with 'memcpy` is rather obvious, but even pointer de-referencing on
"lucky case" is not safe either.

I have no idea how to fix it at the moment.

Readability issue:
- It would be good to add `Assert(ptr >= upto)` into `GetXLogBuffer`.
I had hard time to recognize `upto` is strictly not in the future.
- Certainly, final version have to have fixed and improved comments.
Many patch's ideas are strictly non-obvious. I had hard time to
recognize patch is not a piece of ... (excuse me for the swear sentence).

Indeed, patch is much better than it looks on first sight.
I came with alternative idea yesterday, but looking closer to your patch
today I see it is superior to mine (if atomic access will be fixed).

----

regards,
Yura Sokolov aka funny-falcon

#6 Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: wenhui qiu (#5)
Re: [RFC] Lock-free XLog Reservation from WAL

Hi Yura and Wenhui,

Thanks for kindly reviewing this work!

On 1/3/2025 9:01 PM, wenhui qiu wrote:

Hi
    Thank you for your path,NUM_XLOGINSERT_LOCKS increase to 128,I
think it will be challenged,do we make it guc ?

I noticed there have been some discussions (for example, [1] and its
responses) about making NUM_XLOGINSERT_LOCKS a GUC, which seems to be a
controversial proposal. Given that, we may first focus on the lock-free
XLog reservation implementation, and leave the increase of
NUM_XLOGINSERT_LOCKS for a future patch, where we would provide more
quantitative evidence for the various implementations. WDYT?

On Fri, 3 Jan 2025 at 20:36, Yura Sokolov <y.sokolov@postgrespro.ru
<mailto:y.sokolov@postgrespro.ru>> wrote:

Good day, Zhiguo.

Idea looks great.

Minor issue:
- you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.

I initially thought it became un-synchronized against
`ReserveXLogInsertLocation`, but looking closer I found it is
synchronized with `WALInsertLockAcquireExclusive`.
Since there are no other `insertpos_lck` usages after your patch, I
don't see why it should exists and be used in `ReserveXLogSwitch`.

Still I'd prefer to see CAS loop in this place to be consistent with
other non-locking access. And it will allow to get rid of
`WALInsertLockAcquireExclusive`, (though probably it is not a big
issue).

Exactly, it should be safe to remove `insertpos_lck`. And I agree with
you on getting rid of `WALInsertLockAcquireExclusive` with a CAS loop,
which should significantly reduce the synchronization cost here,
especially when we intend to increase NUM_XLOGINSERT_LOCKS. I will try
it in the next version of the patch.

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read with on
platforms where MAXALIGN != 8 or without native 64 load/store. Branch
with 'memcpy` is rather obvious, but even pointer de-referencing on
"lucky case" is not safe either.

I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the
prev-link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev-link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What do
you think of this as a viable solution?
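
Roughly, the ordering I have in mind is something like the following
minimal sketch (the flag's location and the names are made up purely to
illustrate the barrier usage):

/* Writer side: publish the (possibly unaligned or split) prev-link, then the flag. */
static void
publish_prev_link(char *xl_prev_pos, XLogRecPtr prev, pg_atomic_uint32 *valid_flag)
{
	/* Plain, possibly non-atomic store; a copy split across pages would be fine too. */
	memcpy(xl_prev_pos, &prev, sizeof(XLogRecPtr));
	pg_write_barrier();					/* order the prev-link before the flag */
	pg_atomic_write_u32(valid_flag, 1);
}

/* Reader side: wait for the flag, after which the prev-link can be read safely. */
static XLogRecPtr
read_prev_link(const char *xl_prev_pos, pg_atomic_uint32 *valid_flag)
{
	XLogRecPtr	prev;

	while (pg_atomic_read_u32(valid_flag) == 0)
		;								/* spin until the previous inserter publishes */
	pg_read_barrier();					/* order the flag before the prev-link */
	memcpy(&prev, xl_prev_pos, sizeof(XLogRecPtr));
	return prev;
}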

Readability issue:
- It would be good to add `Assert(ptr >= upto)` into `GetXLogBuffer`.
I had hard time to recognize `upto` is strictly not in the future.
- Certainly, final version have to have fixed and improved comments.
Many patch's ideas are strictly non-obvious. I had hard time to
recognize patch is not a piece of ... (excuse me for the swear
sentence).

Thanks for the suggestion and patience. It's really more readable after
inserting the assertion; I will fix it and improve the other comments in
the following patches.

Indeed, patch is much better than it looks on first sight.
I came with alternative idea yesterday, but looking closer to your
patch
today I see it is superior to mine (if atomic access will be fixed).

----

regards,
Yura Sokolov aka funny-falcon

Regards,
Zhiguo

[1]: /messages/by-id/2266698.1704854297@sss.pgh.pa.us

#7 wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Zhou, Zhiguo (#6)
Re: [RFC] Lock-free XLog Reservation from WAL

Hi Zhiguo,
Thank you for your reply. Then you'll have to prove that 128 is the
optimal value; otherwise they'll have a hard time agreeing with you on this
patch.

Thanks

On Mon, Jan 6, 2025 at 2:46 PM Zhou, Zhiguo <zhiguo.zhou@intel.com> wrote:


Hi Yura and Wenhui,

Thanks for kindly reviewing this work!

On 1/3/2025 9:01 PM, wenhui qiu wrote:

Hi
Thank you for your path,NUM_XLOGINSERT_LOCKS increase to 128,I
think it will be challenged,do we make it guc ?

I noticed there have been some discussions (for example, [1] and its
responses) about making NUM_XLOGINSERT_LOCKS a GUC, which seems to be a
controversial proposal. Given that, we may first focus on the lock-free
XLog reservation implementation, and leave the increase of
NUM_XLOGINSERT_LOCKS for a future patch, where we would provide more
quantitative evidence for the various implementations. WDYT?

On Fri, 3 Jan 2025 at 20:36, Yura Sokolov <y.sokolov@postgrespro.ru
<mailto:y.sokolov@postgrespro.ru>> wrote:

Good day, Zhiguo.

Idea looks great.

Minor issue:
- you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.

I initially thought it became un-synchronized against
`ReserveXLogInsertLocation`, but looking closer I found it is
synchronized with `WALInsertLockAcquireExclusive`.
Since there are no other `insertpos_lck` usages after your patch, I
don't see why it should exists and be used in `ReserveXLogSwitch`.

Still I'd prefer to see CAS loop in this place to be consistent with
other non-locking access. And it will allow to get rid of
`WALInsertLockAcquireExclusive`, (though probably it is not a big
issue).

Exactly, it should be safe to remove `insertpos_lck`. And I agree with
you on getting rid of `WALInsertLockAcquireExclusive` with CAS loop
which should significantly reduce the synchronization cost here
especially when we intend to increase NUM_XLOGINSERT_LOCKS. I will try
it in the next version of patch.

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read with

on

platforms where MAXALIGN != 8 or without native 64 load/store. Branch
with 'memcpy` is rather obvious, but even pointer de-referencing on
"lucky case" is not safe either.

I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the
prev-link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev-link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What do
you think of this as a viable solution?

Readability issue:
- It would be good to add `Assert(ptr >= upto)` into `GetXLogBuffer`.
I had hard time to recognize `upto` is strictly not in the future.
- Certainly, final version have to have fixed and improved comments.
Many patch's ideas are strictly non-obvious. I had hard time to
recognize patch is not a piece of ... (excuse me for the swear
sentence).

Thanks for the suggestion and patience. It's really more readable after
inserting the assertion, I will fix it and improve other comments in the
following patches.

Indeed, patch is much better than it looks on first sight.
I came with alternative idea yesterday, but looking closer to your
patch
today I see it is superior to mine (if atomic access will be fixed).

----

regards,
Yura Sokolov aka funny-falcon

Regards,
Zhiguo

[1]
/messages/by-id/2266698.1704854297@sss.pgh.pa.us

#8 Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: wenhui qiu (#7)
Re: [RFC] Lock-free XLog Reservation from WAL

Maybe we could leave NUM_XLOGINSERT_LOCKS unchanged in this patch,
as it is not a hard dependency of the lock-free algorithm. Once this
patch has been fully accepted, we could then investigate the proper
way of increasing NUM_XLOGINSERT_LOCKS. WDYT?


On 1/6/2025 4:35 PM, wenhui qiu wrote:

HI Zhiguo
    Thank you for your reply ,Then you'll have to prove that 128 is the
optimal value, otherwise they'll have a hard time agreeing with you on
this patch.

Thanks


#9wenhui qiu
qiuwenhuifx@gmail.com
In reply to: Zhou, Zhiguo (#8)
Re: [RFC] Lock-free XLog Reservation from WAL

Hi Zhiguo

Maybe we could leave the NUM_XLOGINSERT_LOCKS unchanged in this patch,
as it is not a hard dependency of the lock-free algorithm. And when this
patch has been fully accepted, we could then investigate the more proper
way of increasing NUM_XLOGINSERT_LOCKS. WDYT?

If the value is not a strong dependency, then the best way is not to change
it.

Thanks


#10Юрий Соколов
y.sokolov@postgrespro.ru
In reply to: Zhou, Zhiguo (#6)
1 attachment(s)
Re: [RFC] Lock-free XLog Reservation from WAL

On 6 Jan 2025, at 09:46, Zhou, Zhiguo <zhiguo.zhou@intel.com> wrote:

Hi Yura and Wenhui,

Thanks for kindly reviewing this work!

On 1/3/2025 9:01 PM, wenhui qiu wrote:

Hi
Thank you for your path,NUM_XLOGINSERT_LOCKS increase to 128,I think it will be challenged,do we make it guc ?

I noticed there have been some discussions (for example, [1] and its responses) about making NUM_XLOGINSERT_LOCKS a GUC, which seems to be a controversial proposal. Given that, we may first focus on the lock-free XLog reservation implementation, and leave the increase of NUM_XLOGINSERT_LOCKS for a future patch, where we would provide more quantitative evidence for the various implementations. WDYT?

On Fri, 3 Jan 2025 at 20:36, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:
Good day, Zhiguo.
Idea looks great.
Minor issue:
- you didn't remove use of `insertpos_lck` from `ReserveXLogSwitch`.
I initially thought it became un-synchronized against
`ReserveXLogInsertLocation`, but looking closer I found it is
synchronized with `WALInsertLockAcquireExclusive`.
Since there are no other `insertpos_lck` usages after your patch, I
don't see why it should exists and be used in `ReserveXLogSwitch`.
Still I'd prefer to see CAS loop in this place to be consistent with
other non-locking access. And it will allow to get rid of
`WALInsertLockAcquireExclusive`, (though probably it is not a big
issue).

Exactly, it should be safe to remove `insertpos_lck`. And I agree with you on getting rid of `WALInsertLockAcquireExclusive` with CAS loop which should significantly reduce the synchronization cost here especially when we intend to increase NUM_XLOGINSERT_LOCKS. I will try it in the next version of patch.

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read with on
platforms where MAXALIGN != 8 or without native 64 load/store. Branch
with 'memcpy` is rather obvious, but even pointer de-referencing on
"lucky case" is not safe either.
I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in some situations. My initial thought is to define a bit near the prev-link to flag the completion of the update. In this way, we could allow non-atomic or even discontinuous write/read operations on the prev-link, while simultaneously guaranteeing its atomicity through atomic operations (as well as memory barriers) on the flag bit. What do you think of this as a viable solution?

Readability issue:
- It would be good to add `Assert(ptr >= upto)` into `GetXLogBuffer`.
I had hard time to recognize `upto` is strictly not in the future.
- Certainly, final version have to have fixed and improved comments.
Many patch's ideas are strictly non-obvious. I had hard time to
recognize patch is not a piece of ... (excuse me for the swear
sentence).

Thanks for the suggestion and patience. It's really more readable after inserting the assertion, I will fix it and improve other comments in the following patches.

Indeed, patch is much better than it looks on first sight.
I came with alternative idea yesterday, but looking closer to your
patch
today I see it is superior to mine (if atomic access will be fixed).

[1] /messages/by-id/2266698.1704854297@sss.pgh.pa.us

Good day, Zhiguo.

Here’s my attempt to organise link to previous record without messing with xlog buffers:
- link is stored in lock-free hash table instead.

I don’t claim it is any better than using xlog buffers.
It is just alternative vision.

Some tricks in the implementation:
- Relying on the byte-position nature, a position can be folded into a 32-bit
identifier with `(uint32)(pos ^ (pos>>32))`. It is not globally unique, but it
is unique within any 32GB of consecutive log.
- PrevBytePos can be stored as the difference between the two positions, and
this difference is certainly less than 4GB, so it also fits in a 32-bit value
(PrevSize).
- Since xlog records are MAXALIGN-ed, the lowest bit of PrevSize can be used as a lock.
- While Cuckoo Hashing can suffer from unsolvable cycle conflicts, this implementation relies on concurrent deleters, which will eventually break such cycles if any arise.

I have a version without the 32-bit conversion trick; it is a bit lighter on atomic instruction count, but it performs badly in the absence of native 64-bit atomics.
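
For illustration only, a minimal sketch of the folding described above
(hypothetical names; the real structures and locking are in the attached
patch, and PostgreSQL's uint32/uint64 typedefs and Assert() are assumed):

    typedef struct
    {
        uint32      id;         /* (uint32)(pos ^ (pos >> 32)), identifies CurrBytePos */
        uint32      prev_size;  /* CurrBytePos - PrevBytePos, always < 4GB */
    } PrevLinkSketch;

    /* encode: remember that the record at "prev" immediately precedes "curr" */
    static inline PrevLinkSketch
    encode_link(uint64 curr, uint64 prev)
    {
        PrevLinkSketch link;

        link.id = (uint32) (curr ^ (curr >> 32));
        link.prev_size = (uint32) (curr - prev);
        return link;
    }

    /* decode: given "curr" and its stored entry, recover "prev" */
    static inline uint64
    decode_prev(uint64 curr, PrevLinkSketch link)
    {
        Assert(link.id == (uint32) (curr ^ (curr >> 32)));
        return curr - link.prev_size;
    }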

——
regards
Yura Sokolov aka funny-falcon

Attachments:

v1-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchapplication/octet-stream; name=v1-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patch; x-unix-mode=0644Download
From c41f7b72d57b4fa4211079f7637f0a8470ec9348 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=D0=AE=D1=80=D0=B8=D0=B9=20=D0=A1=D0=BE=D0=BA=D0=BE=D0=BB?=
 =?UTF-8?q?=D0=BE=D0=B2?= <yura@Urijs-MacBook-Air.local>
Date: Mon, 6 Jan 2025 21:54:06 +0300
Subject: [PATCH v1] Lock-free XLog Reservation using lock-free hash-table

Removed PrevBytePos to eliminate lock contention, allowing atomic updates
to CurrBytePos. Use lock-free hash-table based on 4-way Cuckoo Hashing
to store link to PrevBytePos.
---
 src/backend/access/transam/xlog.c | 346 ++++++++++++++++++++++++------
 1 file changed, 285 insertions(+), 61 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b9ea92a542..2f35e1645a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -68,6 +68,8 @@
 #include "catalog/pg_database.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
+#include "common/hashfn.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -147,7 +149,7 @@ int			wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
  * to happen concurrently, but adds some CPU overhead to flushing the WAL,
  * which needs to iterate all the locks.
  */
-#define NUM_XLOGINSERT_LOCKS  8
+#define NUM_XLOGINSERT_LOCKS  128
 
 /*
  * Max distance from last checkpoint, before triggering a new xlog-based
@@ -384,6 +386,36 @@ typedef union WALInsertLockPadded
 	char		pad[PG_CACHE_LINE_SIZE];
 } WALInsertLockPadded;
 
+/*
+ * It links current position with previous one.
+ * - CurrPosId is (CurrBytePos ^ (CurrBytePos>>32))
+ *   Since CurrBytePos grows monotonically and it is aligned to MAXALIGN,
+ *   CurrPosId correctly identifies CurrBytePos for at least 4*2^32 = 32GB of
+ *   WAL logs.
+ * - PrevSize is difference between CurrBytePos and PrevBytePos
+ */
+typedef struct
+{
+	uint32		CurrPosId;
+	uint32		PrevSize;
+}			WALPrevPosLinkVal;
+
+/*
+ * This is an element of lock-free hash-table.
+ * PrevSize's lowest bit is used as a lock, relying on fact it is MAXALIGN-ed.
+ */
+typedef struct
+{
+	pg_atomic_uint32 CurrPosId;
+	pg_atomic_uint32 PrevSize;
+}			WALPrevPosLink;
+
+#define PREV_LINKS_HASH_CAPA 256
+StaticAssertDecl(!(PREV_LINKS_HASH_CAPA & (PREV_LINKS_HASH_CAPA - 1)),
+				 "PREV_LINKS_HASH_CAPA should be power of two");
+
+#define SWAP_ONCE_IN 128
+
 /*
  * Session status of running backup, used for sanity checks in SQL-callable
  * functions to start and stop backups.
@@ -395,26 +427,31 @@ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
  */
 typedef struct XLogCtlInsert
 {
-	slock_t		insertpos_lck;	/* protects CurrBytePos and PrevBytePos */
-
 	/*
 	 * CurrBytePos is the end of reserved WAL. The next record will be
-	 * inserted at that position. PrevBytePos is the start position of the
-	 * previously inserted (or rather, reserved) record - it is copied to the
-	 * prev-link of the next record. These are stored as "usable byte
-	 * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).
+	 * inserted at that position.
+	 *
+	 * The start position of the previously inserted (or rather, reserved)
+	 * record (it is copied to the prev-link of the next record) will be
+	 * stored in PrevLinksHash.
+	 *
+	 * These are stored as "usable byte positions" rather than XLogRecPtrs
+	 * (see XLogBytePosToRecPtr()).
 	 */
-	uint64		CurrBytePos;
-	uint64		PrevBytePos;
+	pg_atomic_uint64 CurrBytePos pg_attribute_aligned(PG_CACHE_LINE_SIZE);
 
 	/*
-	 * Make sure the above heavily-contended spinlock and byte positions are
-	 * on their own cache line. In particular, the RedoRecPtr and full page
-	 * write variables below should be on a different cache line. They are
-	 * read on every WAL insertion, but updated rarely, and we don't want
-	 * those reads to steal the cache line containing Curr/PrevBytePos.
+	 * PrevLinksHash is a lock-free hash table based on Cuckoo algorith. It is
+	 * mostly 4 way: for every element computed two positions h1, h2, and
+	 * neighbour h1+1 and h2+2 are used as well. This way even on collision we
+	 * have 3 distinct position, which provide us ~75% fill rate without
+	 * unsolvable cycles (due to Cuckoo's theory).
+	 *
+	 * Certainly, we rely on the fact we will delete elements with same speed
+	 * as we add them, and even unsolvable cycles will be destroyed soon by
+	 * concurrent deletions.
 	 */
-	char		pad[PG_CACHE_LINE_SIZE];
+	WALPrevPosLink PrevLinksHash[PREV_LINKS_HASH_CAPA];
 
 	/*
 	 * fullPageWrites is the authoritative value used by all backends to
@@ -700,6 +737,15 @@ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
 								XLogRecData *rdata,
 								XLogRecPtr StartPos, XLogRecPtr EndPos,
 								TimeLineID tli);
+
+static bool WALPrevPosLinkInsert(WALPrevPosLink * link, WALPrevPosLinkVal val);
+static bool WALPrevPosLinkConsume(WALPrevPosLink * link, WALPrevPosLinkVal * val);
+static bool WALPrevPosLinkSwap(WALPrevPosLink * link, WALPrevPosLinkVal * val);
+static void LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos,
+							   XLogRecPtr *PrevPtr);
+static void LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec);
+static XLogRecPtr ReadInsertCurrBytePos(void);
+
 static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
@@ -1086,6 +1132,192 @@ XLogInsertRecord(XLogRecData *rdata,
 	return EndPos;
 }
 
+/*
+ * Attempt to write into empty link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkInsert(WALPrevPosLink * link, WALPrevPosLinkVal val)
+{
+	uint32		empty = 0;
+
+	/* first check it read-only */
+	if (pg_atomic_read_u32(&link->PrevSize) != 0)
+		return false;
+	if (!pg_atomic_compare_exchange_u32(&link->PrevSize, &empty, 1))
+		/* someone else occupied the entry */
+		return false;
+
+	pg_atomic_write_u32(&link->CurrPosId, val.CurrPosId);
+	/* This write acts as unlock as well. */
+	pg_atomic_write_membarrier_u32(&link->PrevSize, val.PrevSize);
+	return true;
+}
+
+/*
+ * Attempt to consume matched link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkConsume(WALPrevPosLink * link, WALPrevPosLinkVal * val)
+{
+
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId)
+		return false;
+
+	/* Try lock */
+	val->PrevSize = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (val->PrevSize & 1)
+		/* Lock failed */
+		return false;
+
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId)
+	{
+		/* unlock with old value */
+		pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+		return false;
+	}
+
+	pg_atomic_write_u32(&link->CurrPosId, 0);
+	/* This write acts as unlock as well. */
+	pg_atomic_write_membarrier_u32(&link->PrevSize, 0);
+	return true;
+}
+
+/*
+ * Attempt to swap entry: remember existing link and write our.
+ * It could happen we consume empty entry. Caller will detect it by checking
+ * remembered value.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkSwap(WALPrevPosLink * link, WALPrevPosLinkVal * val)
+{
+	uint32		oldCur;
+	uint32		oldPrev;
+
+	/* Attempt to lock entry against concurrent consumer or swapper */
+	oldPrev = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (oldPrev & 1)
+		/* Lock failed */
+		return false;
+
+	oldCur = pg_atomic_read_u32(&link->CurrPosId);
+	pg_atomic_write_u32(&link->CurrPosId, val->CurrPosId);
+	/* This write acts as unlock as well. */
+	pg_atomic_write_membarrier_u32(&link->PrevSize, val->PrevSize);
+
+	val->CurrPosId = oldCur;
+	val->PrevSize = oldPrev;
+	return true;
+}
+
+static pg_attribute_always_inline void
+CalcCuckooPositions(uint32 ptr, uint32 pos[4])
+{
+	uint32		hash = murmurhash32(ptr);
+
+	pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos[1] = (pos[0] + 1) % PREV_LINKS_HASH_CAPA;
+	pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
+	pos[3] = (pos[2] + 2) % PREV_LINKS_HASH_CAPA;
+}
+
+/*
+ * Write new link (EndPos, StartPos) and find PrevPtr for StartPos.
+ *
+ * Links are stored in lock-free Cuckoo based hash-table.
+ * We use mostly-4 way Cuckoo hashing which provides high fill rate without
+ * hard cycle collisions. Also we rely on concurrent consumers of existing
+ * entry, so cycles will be broken in mean time.
+ *
+ * Cuckoo hashing relies on re-insertion for balancing, so we occasionally
+ * swaps entry and try to insert swapped instead of our.
+ */
+static void
+LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos, XLogRecPtr *PrevPtr)
+{
+	WALPrevPosLink *hashtable = XLogCtl->Insert.PrevLinksHash;
+	WALPrevPosLinkVal lookup = {
+		.CurrPosId = StartPos ^ (StartPos >> 32),
+	};
+	WALPrevPosLinkVal insert = {
+		.CurrPosId = EndPos ^ (EndPos >> 32),
+		.PrevSize = EndPos - StartPos
+	};
+	pg_prng_state prng;
+	uint32		lookup_pos[4];
+	uint32		insert_pos[4];
+	uint32		i;
+	bool		inserted = false;
+	bool		found = false;
+
+	CalcCuckooPositions(lookup.CurrPosId, lookup_pos);
+	CalcCuckooPositions(insert.CurrPosId, insert_pos);
+	pg_prng_seed(&prng, StartPos);
+
+	while (!inserted || !found)
+	{
+		for (i = 0; !found && i < 4; i++)
+			found = WALPrevPosLinkConsume(&hashtable[lookup_pos[i]], &lookup);
+
+		if (inserted)
+			goto next;
+
+		for (i = 0; !inserted && i < 4; i++)
+			inserted = WALPrevPosLinkInsert(&hashtable[insert_pos[i]], insert);
+
+		if (inserted)
+			goto next;
+
+		if (pg_prng_uint32(&prng) % SWAP_ONCE_IN != 0)
+			goto next;
+
+		i = pg_prng_uint32(&prng) % 4;
+		if (!WALPrevPosLinkSwap(&hashtable[insert_pos[i]], &insert))
+			goto next;
+
+		if (insert.PrevSize == 0)
+			/* Lucky case: entry become empty and we inserted into */
+			inserted = true;
+		else if (insert.CurrPosId == lookup.CurrPosId)
+		{
+			/*
+			 * We occasionally replaced entry we looked for. No need to insert
+			 * it again.
+			 */
+			inserted = true;
+			Assert(!found);
+			found = true;
+			lookup.PrevSize = insert.PrevSize;
+			break;
+		}
+		else
+			CalcCuckooPositions(insert.CurrPosId, insert_pos);
+
+next:
+		pg_spin_delay();
+	}
+
+	*PrevPtr = StartPos - lookup.PrevSize;
+}
+
+static pg_attribute_always_inline void
+LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec)
+{
+	WALPrevPosLink *hashtable = XLogCtl->Insert.PrevLinksHash;
+	uint32		insert_pos[4];
+
+	CalcCuckooPositions(EndOfLog ^ (EndOfLog >> 32), insert_pos);
+	pg_atomic_write_u32(&hashtable[insert_pos[0]].CurrPosId,
+						EndOfLog ^ (EndOfLog >> 32));
+	pg_atomic_write_u32(&hashtable[insert_pos[0]].PrevSize,
+						EndOfLog - LastRec);
+}
+
+static pg_attribute_always_inline XLogRecPtr
+ReadInsertCurrBytePos(void)
+{
+	return pg_atomic_read_u64(&XLogCtl->Insert.CurrBytePos);
+}
+
 /*
  * Reserves the right amount of space for a record of given size from the WAL.
  * *StartPos is set to the beginning of the reserved section, *EndPos to
@@ -1118,25 +1350,9 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	/* All (non xlog-switch) records should contain data. */
 	Assert(size > SizeOfXLogRecord);
 
-	/*
-	 * The duration the spinlock needs to be held is minimized by minimizing
-	 * the calculations that have to be done while holding the lock. The
-	 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
-	 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
-	 * page headers. The mapping between "usable" byte positions and physical
-	 * positions (XLogRecPtrs) can be done outside the locked region, and
-	 * because the usable byte position doesn't include any headers, reserving
-	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
-	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
-
-	startbytepos = Insert->CurrBytePos;
+	startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
-
-	SpinLockRelease(&Insert->insertpos_lck);
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
@@ -1172,26 +1388,24 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 	uint32		segleft;
 
 	/*
-	 * These calculations are a bit heavy-weight to be done while holding a
-	 * spinlock, but since we're holding all the WAL insertion locks, there
-	 * are no other inserters competing for it. GetXLogInsertRecPtr() does
-	 * compete for it, but that's not called very frequently.
+	 * Currently ReserveXLogInsertLocation is protected with exclusive
+	 * insertion lock, so there is no contention against CurrBytePos, But we
+	 * still do CAS loop for being uniform.
+	 *
+	 * Probably we'll get rid of exclusive lock in a future.
 	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
 
-	startbytepos = Insert->CurrBytePos;
+repeat:
+	startbytepos = pg_atomic_read_u64(&Insert->CurrBytePos);
 
 	ptr = XLogBytePosToEndRecPtr(startbytepos);
 	if (XLogSegmentOffset(ptr, wal_segment_size) == 0)
 	{
-		SpinLockRelease(&Insert->insertpos_lck);
 		*EndPos = *StartPos = ptr;
 		return false;
 	}
 
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
 
@@ -1202,10 +1416,19 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 		*EndPos += segleft;
 		endbytepos = XLogRecPtrToBytePos(*EndPos);
 	}
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
 
-	SpinLockRelease(&Insert->insertpos_lck);
+	if (!pg_atomic_compare_exchange_u64(&Insert->CurrBytePos,
+										&startbytepos,
+										endbytepos))
+	{
+		/*
+		 * Don't use spin delay here: perform_spin_delay primary case is for
+		 * solving single core contention. But on single core we will succeed
+		 * on the next attempt.
+		 */
+		goto repeat;
+	}
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
 
@@ -1507,7 +1730,6 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	XLogRecPtr	inserted;
 	XLogRecPtr	reservedUpto;
 	XLogRecPtr	finishedUpto;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	int			i;
 
 	if (MyProc == NULL)
@@ -1522,9 +1744,7 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 		return inserted;
 
 	/* Read the current insert position */
-	SpinLockAcquire(&Insert->insertpos_lck);
-	bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
+	bytepos = ReadInsertCurrBytePos();
 	reservedUpto = XLogBytePosToEndRecPtr(bytepos);
 
 	/*
@@ -5017,12 +5237,18 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
-	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
+	pg_atomic_init_u64(&XLogCtl->Insert.CurrBytePos, 0);
 	pg_atomic_init_u64(&XLogCtl->logInsertResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
+
+	for (i = 0; i < PREV_LINKS_HASH_CAPA; i++)
+	{
+		pg_atomic_init_u32(&XLogCtl->Insert.PrevLinksHash[i].CurrPosId, 0);
+		pg_atomic_init_u32(&XLogCtl->Insert.PrevLinksHash[i].PrevSize, 0);
+	}
 }
 
 /*
@@ -6018,8 +6244,13 @@ StartupXLOG(void)
 	 * previous incarnation.
 	 */
 	Insert = &XLogCtl->Insert;
-	Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
-	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+	{
+		XLogRecPtr	endOfLog = XLogRecPtrToBytePos(EndOfLog);
+		XLogRecPtr	lastRec = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
+
+		pg_atomic_write_u64(&Insert->CurrBytePos, endOfLog);
+		LinkStartPrevPos(endOfLog, lastRec);
+	}
 
 	/*
 	 * Tricky point here: lastPage contains the *last* block that the LastRec
@@ -7005,7 +7236,7 @@ CreateCheckPoint(int flags)
 
 	if (shutdown)
 	{
-		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 
 		/*
 		 * Compute new REDO record ptr = location of next XLOG record.
@@ -9434,14 +9665,7 @@ register_persistent_abort_backup_handler(void)
 XLogRecPtr
 GetXLogInsertRecPtr(void)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	uint64		current_bytepos;
-
-	SpinLockAcquire(&Insert->insertpos_lck);
-	current_bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
-
-	return XLogBytePosToRecPtr(current_bytepos);
+	return XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 }
 
 /*
-- 
2.39.3 (Apple Git-146)

#11Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Юрий Соколов (#10)
Re: [RFC] Lock-free XLog Reservation from WAL


Good day, Yura!

Your implementation based on the lock-free hash table is truly
impressive! One of the aspects I particularly admire is how your
solution doesn't require breaking the current convention of XLog
insertion, whose revision is quite error-prone and ungraceful. My minor
concern is that the limited number of entries (256) in the hash table
would be a bottleneck for parallel memory reservation, but I believe
this is not a critical issue.

I will soon try to evaluate the performance impact of your patch on my
device with the TPCC benchmark and also profile it to see if there are
any changes that could be made to further improve it.

BTW, do you have a plan to merge this patch to the master branch? Thanks!

Regards,
Zhiguo

#12Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Zhou, Zhiguo (#11)
Re: [RFC] Lock-free XLog Reservation from WAL

09.01.2025 19:03, Zhou, Zhiguo wrote:

Good day, Yura!

Your implementation based on the lock-free hash table is truly
impressive! One of the aspects I particularly admire is how your
solution doesn't require breaking the current convention of XLog
insertion, whose revision is quite error-prone and ungraceful.

That is the main benefit of my approach, though it is not strictly better
than yours.

My minor
concern is that the limited number of entries (256) in the hash table
would be a bottleneck for parallel memory reservation, but I believe
this is not a critical issue.

If you consider the hash-table fill rate, then 256 entries are quite enough
for 128 concurrent inserters.

But I agree that 8 items per cache line could lead to false sharing.
Items could be stretched to 16 bytes (and then CurrPosId could be made fully
unique), so there would be just 4 entries per cache line.
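
For concreteness, a minimal sketch of such a stretched entry (the name
WALPrevPosLink16 is illustrative; the v2 patch later in this thread uses
essentially this layout, with pg_atomic_uint32 and StaticAssertDecl from the
usual PostgreSQL headers):

    /*
     * With PG_CACHE_LINE_SIZE = 64 this packs 4 entries per cache line
     * instead of 8, halving the chance that two inserters touching
     * unrelated entries bounce the same line between cores.
     */
    typedef struct
    {
        pg_atomic_uint32 CurrPosId;     /* low 32-bit identifier of CurrBytePos */
        uint32      CurrPosHigh;        /* high 32 bits, makes the key fully unique */
        pg_atomic_uint32 PrevSize;      /* distance to PrevBytePos; lowest bit = lock */
        uint32      pad;                /* pad the entry to 16 bytes */
    } WALPrevPosLink16;

    StaticAssertDecl(sizeof(WALPrevPosLink16) == 16,
                     "WALPrevPosLink16 should be 16 bytes");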

I will soon try to evaluate the performance impact of your patch on my
device with the TPCC benchmark and also profile it to see if there are
any changes that could be made to further improve it.

It would be great. On my notebook (Mac Air M1) I don't see any benefits
neither from mine, nor from yours patch ))
My colleague will also test it on 20 core virtual machine (but
backported to v15).

BTW, do you have a plan to merge this patch to the master branch? Thanks!

I'm not committer )) We are both will struggle to make something
committed for many months ;-)

BTW, your version could use a similar trick for guaranteed atomicity:
- change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset`
and store the offset to the previous record's start.

Since there are two limits:

#define XLogRecordMaxSize (1020 * 1024 * 1024)
#define WalSegMaxSize 1024 * 1024 * 1024

the offset to the previous record cannot be larger than 2GB.

Yes, it is a format change that some backup utilities will have to adapt to.
But it saves 4 bytes in XLogRecord (which could be spent to store a
FullTransactionId instead of a TransactionId) and it is better compressible.

And your version then will not need to handle the case where this value is
split between two buffers (since MAXALIGN is not less than 4), and PostgreSQL
already relies on 4-byte read/write atomicity (in some places even without
use of pg_atomic_uint32).
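
A minimal sketch of the conversion this change would imply (the helper names
are hypothetical; only the field name xl_prev_offset comes from the proposal
above):

    /*
     * The gap between the starts of two consecutive records is bounded by
     * the previous record's length (XLogRecordMaxSize plus page-header
     * overhead), so it always fits in 32 bits.
     */
    static inline uint32
    encode_prev_offset(XLogRecPtr self, XLogRecPtr prev)
    {
        Assert(prev < self);
        Assert(self - prev < (uint64) 2 * 1024 * 1024 * 1024);  /* < 2GB */
        return (uint32) (self - prev);
    }

    static inline XLogRecPtr
    decode_prev_ptr(XLogRecPtr self, uint32 xl_prev_offset)
    {
        return self - xl_prev_offset;
    }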

----

regards
Sokolov Yura aka funny-falcon

#13Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Yura Sokolov (#12)
Re: [RFC] Lock-free XLog Reservation from WAL

On Fri, 10 Jan 2025 at 13:42, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

BTW, your version could use a similar trick for guaranteed atomicity:
- change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset`
and store the offset to the previous record's start.

-1, I don't think that is possible without degrading what our current
WAL system protects against.

For intra-record torn write protection we have the checksum, but that
same protection doesn't cover the multiple WAL records on each page.
That is what the xl_prev pointer is used for - detecting that this
part of the page doesn't contain the correct data (e.g. the data of a
previous version of this recycled segment).
If we replaced xl_prev with just an offset into the segment, then this
protection would be much less effective, as the previous version of
the segment realistically used the same segment offsets at the same
offsets into the file.

To protect against torn writes while still only using record segment
offsets, you'd have to zero and then fsync any segment before reusing it,
which would severely reduce the benefits we get from recycling
segments.
Note that we can't expect the page header to help here, as write tears
can happen at nearly any offset into the page - not just 8k intervals
- and so the page header is not always representative of the origins
of all bytes on the page - only the first 24 (if even that).
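
For concreteness, a rough sketch of the kind of check a full xl_prev enables
during replay (a simplification, not the exact xlogreader code; record and
PrevRecPtr are assumed to be the reader's current record and the known start
of the previous one):

    /*
     * Stale bytes left at the same file offset by the segment's previous
     * life carry an xl_prev pointing into a completely different part of
     * the WAL, so the mismatch marks the end of valid WAL.
     */
    if (record->xl_prev != PrevRecPtr)
    {
        /* "record with incorrect prev-link" -- stop replay here */
        return false;
    }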

Kind regards,

Matthias van de Meent

#14Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Matthias van de Meent (#13)
Re: [RFC] Lock-free XLog Reservation from WAL

10.01.2025 19:53, Matthias van de Meent wrote:

On Fri, 10 Jan 2025 at 13:42, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

BTW, your version could use a similar trick for guaranteed atomicity:
- change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset`
and store the offset to the previous record's start.

-1, I don't think that is possible without degrading what our current
WAL system protects against.

For intra-record torn write protection we have the checksum, but that
same protection doesn't cover the multiple WAL records on each page.
That is what the xl_prev pointer is used for - detecting that this
part of the page doesn't contain the correct data (e.g. the data of a
previous version of this recycled segment).
If we replaced xl_prev with just an offset into the segment, then this
protection would be much less effective, as the previous version of
the segment realistically used the same segment offsets at the same
offsets into the file.

Well, to protect against a "torn write" it is enough to have a "self-lsn"
field, not "prev-lsn". So an 8-byte "self-lsn" + "offset-to-prev" would work.

But this way the header would be increased by 4 bytes compared to the
current one, not decreased.

Just a thought:
If XLogRecord alignment were stricter (for example, 32 bytes), then an LSN
could mean a 32-byte offset rather than a byte offset, and the low 32 bits
of an LSN would cover 128GB of WAL (2^32 units of 32 bytes = 2^37 bytes).
For most installations the re-use distance for WAL segments is doubtfully
longer than 128GB, but I believe there are some with a larger one. So it is
not reliable.

To protect against torn writes while still only using record segment
offsets, you'd have to zero and then fsync any segment before reusing it,
which would severely reduce the benefits we get from recycling
segments.
Note that we can't expect the page header to help here, as write tears
can happen at nearly any offset into the page - not just 8k intervals
- and so the page header is not always representative of the origins
of all bytes on the page - only the first 24 (if even that).

-----

regards,
Yura

#15Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Yura Sokolov (#12)
Re: [RFC] Lock-free XLog Reservation from WAL

Good day, Yura!

On 1/10/2025 8:42 PM, Yura Sokolov wrote:

If you consider the hash-table fill rate, then 256 entries are quite enough
for 128 concurrent inserters.

The profile of your patch didn't show significant hotspots in the hash
table functions, so I believe the 256 entries should be enough.

I will soon try to evaluate the performance impact of your patch on my
device with the TPCC benchmark and also profile it to see if there are
any changes that could be made to further improve it.

It would be great. On my notebook (Mac Air M1) I don't see any benefits
neither from mine, nor from yours patch ))
My colleague will also test it on 20 core virtual machine (but
backported to v15).

I've tested the performance impact of our patches on an Intel Sapphire
Rapids device with 480 vCPUs using a HammerDB TPC-C workload (256 VUs).
The results show a 72.3% improvement (average of 3 rounds, RSD: 1.5%)
with your patch and a 76.0% boost (average of 3 rounds, RSD: 2.95%) with
mine, applied to the latest codebase. This optimization is most
effective on systems with over 64 cores, as our core-scaling experiments
suggest minimal impact on lower-core setups like your notebook or a
20-core VM.

BTW, do you have a plan to merge this patch to the master branch? Thanks!

I'm not committer )) We are both will struggle to make something
committed for many months ;-)

BTW, your version could use a similar trick for guaranteed atomicity:
- change XLogRecord's `XLogRecPtr xl_prev` to `uint32 xl_prev_offset`
and store the offset to the previous record's start.

Since there are two limits:

    #define XLogRecordMaxSize    (1020 * 1024 * 1024)
    #define WalSegMaxSize 1024 * 1024 * 1024

the offset to the previous record cannot be larger than 2GB.

Yes, it is a format change that some backup utilities will have to adapt to.
But it saves 4 bytes in XLogRecord (which could be spent to store a
FullTransactionId instead of a TransactionId) and it is better compressible.

And your version then will not need to handle the case where this value is
split between two buffers (since MAXALIGN is not less than 4), and PostgreSQL
already relies on 4-byte read/write atomicity (in some places even without
use of pg_atomic_uint32).

----

regards
Sokolov Yura aka funny-falcon

Thanks for the great suggestion!

I think we've arrived at a critical juncture where we need to decide
which patch to move forward with for our optimization efforts. I've
evaluated the pros and cons of my implementation:

Pros:
- Achieves an additional 4% performance improvement.

Cons:
- Breaks the current convention of XLog insertions.
- TAP tests do not fully pass yet and will require time to resolve.
- May necessitate changes to the format and backup tools, potentially
leading to backward compatibility issues.

Given these considerations, I believe your implementation is superior to
mine. I'd greatly appreciate it if you could share your insights on this
matter.

Regards,
Zhiguo

#16Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Zhou, Zhiguo (#15)
Re: [RFC] Lock-free XLog Reservation from WAL


Good day, Zhiguo.

Excuse me, I feel a bit sneaky, but I've started another thread just
about increasing NUM_XLOGINSERT_LOCKS, because I can measure its effect
even on my working notebook (it is another one: Ryzen 5825U limited to
@2GHz).

/messages/by-id/flat/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

-----

regards
Yura Sokolov aka funny-falcon

#17Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Yura Sokolov (#16)
Re: [RFC] Lock-free XLog Reservation from WAL

On 1/16/2025 10:00 PM, Yura Sokolov wrote:

Good day, Zhiguo.

Excuse me, I feel a bit sneaky, but I've started another thread just
about increasing NUM_XLOGINSERT_LOCKS, because I can measure its effect
even on my working notebook (it is another one: Ryzen 5825U limited to
@2GHz).

/messages/by-id/flat/3b11fdc2-9793-403d-
b3d4-67ff9a00d447%40postgrespro.ru

-----

regards
Yura Sokolov aka funny-falcon

Good day, Yura!

Thank you for keeping me informed. I appreciate your proactive approach
and understand the importance of exploring different angles for
optimization. Your patch is indeed fundamental to our ongoing work on
the lock-free xlog reservation, and I'm eager to see how it can further
enhance our efforts.

I will proceed to test the performance impact of your latest patch when
combined with the lock-free xlog reservation patch. This will help us
determine if there's potential for additional optimization.
Concurrently, with your permission, I'll try to refine the
hash-table-based implementation for your further review. WDYT?

Regards,
Zhiguo

#18Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Zhou, Zhiguo (#17)
Re: [RFC] Lock-free XLog Reservation from WAL

17.01.2025 17:00, Zhou, Zhiguo wrote:

Good day, Yura!

Thank you for keeping me informed. I appreciate your proactive approach
and understand the importance of exploring different angles for
optimization. Your patch is indeed fundamental to our ongoing work on
the lock-free xlog reservation, and I'm eager to see how it can further
enhance our efforts.

I will proceed to test the performance impact of your latest patch when
combined with the lock-free xlog reservation patch. This will help us
determine if there's potential for additional optimization.
Concurrently, with your permission, I'll try to refine the hash-table-
based implementation for your further review. WDYT?

Certainly.

And I will send my version of 64-bit operations on hash-table entries...

tomorrow.

Today is 3am at the moment...

I was doing "removal of WALBufMappingLock" [1]
and I want to sleep a lot...

[1]: /messages/by-id/flat/39b39e7a-41b4-4f34-b3f5-db735e74a723@postgrespro.ru

-----
regards
Yura Sokolov aka funny-falcon

#19Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Zhou, Zhiguo (#17)
1 attachment(s)
Re: [RFC] Lock-free XLog Reservation from WAL


Good day, Zhiguo.

Here's a version of the "hash-table reservation" with both 32-bit and 64-bit
operations (depending on PG_HAVE_ATOMIC_U64_SIMULATION, or it may be
switched by hand).

The 64-bit version uses a different protocol with slightly fewer atomic
operations. I suppose it could be a bit faster, but I can't prove it now.

btw, you wrote:

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read

with on

platforms where MAXALIGN != 8 or without native 64 load/store.

Branch

with 'memcpy` is rather obvious, but even pointer de-referencing on
"lucky case" is not safe either.

I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the
prev-link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev-link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What
do you think of this as a viable solution?

There is a way to order the operations:
- since SetPrevRecPtr stores the start of the record as an LSN, its lower 32
bits are certainly non-zero (a record cannot start at the beginning of a page);
- so SetPrevRecPtr should write the high 32 bits, issue a write barrier, and
then write the lower 32 bits;
- and then GetPrevRecPtr should first read the lower 32 bits, and if they are
not zero, issue a read barrier and read the upper 32 bits.

This way you will always read a correct prev-rec-ptr on platforms without
64-bit atomics (because MAXALIGN >= 4 and PostgreSQL has required 4-byte
atomicity for several years).
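
A minimal sketch of that ordering, assuming the prev-link is kept as two
uint32 halves at a MAXALIGN-ed location (the function names are illustrative;
pg_write_barrier()/pg_read_barrier() are the usual PostgreSQL barriers):

    /* writer: publish the high half first, the (non-zero) low half last */
    static inline void
    SetPrevRecPtrSplit(volatile uint32 *hi, volatile uint32 *lo, XLogRecPtr prev)
    {
        *hi = (uint32) (prev >> 32);
        pg_write_barrier();
        *lo = (uint32) prev;    /* non-zero: a record never starts at a page boundary */
    }

    /* reader: a non-zero low half guarantees the high half is already visible */
    static inline bool
    GetPrevRecPtrSplit(volatile uint32 *hi, volatile uint32 *lo, XLogRecPtr *prev)
    {
        uint32      low = *lo;

        if (low == 0)
            return false;       /* prev-link not published yet */
        pg_read_barrier();
        *prev = ((uint64) *hi << 32) | low;
        return true;
    }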

------
regards
Yura Sokolov aka funny-falcon

Attachments:

v2-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchtext/x-patch; charset=UTF-8; name=v2-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchDownload
From 24520c5bf4f88271dfbe72221f50083f3ec9ca8e Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Sun, 19 Jan 2025 17:40:28 +0300
Subject: [PATCH v2] Lock-free XLog Reservation using lock-free hash-table

Removed PrevBytePos to eliminate lock contention, allowing atomic updates
to CurrBytePos. Use lock-free hash-table based on 4-way Cuckoo Hashing
to store link to PrevBytePos.
---
 ...-Increase-NUM_XLOGINSERT_LOCKS-to-64.patch |  38 ++
 src/backend/access/transam/xlog.c             | 546 ++++++++++++++++--
 src/tools/pgindent/typedefs.list              |   2 +
 3 files changed, 523 insertions(+), 63 deletions(-)
 create mode 100644 patches/v0-0001-Increase-NUM_XLOGINSERT_LOCKS-to-64.patch

diff --git a/patches/v0-0001-Increase-NUM_XLOGINSERT_LOCKS-to-64.patch b/patches/v0-0001-Increase-NUM_XLOGINSERT_LOCKS-to-64.patch
new file mode 100644
index 00000000000..c6fa8bf830c
--- /dev/null
+++ b/patches/v0-0001-Increase-NUM_XLOGINSERT_LOCKS-to-64.patch
@@ -0,0 +1,38 @@
+From 93a4d4a7e2219a952c2a544047c19db9f0f0f5c0 Mon Sep 17 00:00:00 2001
+From: Yura Sokolov <y.sokolov@postgrespro.ru>
+Date: Thu, 16 Jan 2025 15:06:59 +0300
+Subject: [PATCH v0 1/2] Increase NUM_XLOGINSERT_LOCKS to 64
+
+---
+ src/backend/access/transam/xlog.c | 8 ++++++--
+ 1 file changed, 6 insertions(+), 2 deletions(-)
+
+diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
+index bf3dbda901d..39381693db6 100644
+--- a/src/backend/access/transam/xlog.c
++++ b/src/backend/access/transam/xlog.c
+@@ -147,7 +147,7 @@ int			wal_segment_size = DEFAULT_XLOG_SEG_SIZE;
+  * to happen concurrently, but adds some CPU overhead to flushing the WAL,
+  * which needs to iterate all the locks.
+  */
+-#define NUM_XLOGINSERT_LOCKS  8
++#define NUM_XLOGINSERT_LOCKS  64
+ 
+ /*
+  * Max distance from last checkpoint, before triggering a new xlog-based
+@@ -1448,7 +1448,11 @@ WALInsertLockRelease(void)
+ 	{
+ 		int			i;
+ 
+-		for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
++		/*
++		 * LWLockRelease hopes we will release in reverse order for faster
++		 * search in held_lwlocks.
++		 */
++		for (i = NUM_XLOGINSERT_LOCKS - 1; i >= 0; i--)
+ 			LWLockReleaseClearVar(&WALInsertLocks[i].l.lock,
+ 								  &WALInsertLocks[i].l.insertingAt,
+ 								  0);
+-- 
+2.43.0
+
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..91067002447 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -68,6 +68,8 @@
 #include "catalog/pg_database.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
+#include "common/hashfn.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -384,6 +386,78 @@ typedef union WALInsertLockPadded
 	char		pad[PG_CACHE_LINE_SIZE];
 } WALInsertLockPadded;
 
+/* #define WAL_LINK_64 0 */
+#ifndef WAL_LINK_64
+#ifdef PG_HAVE_ATOMIC_U64_SIMULATION
+#define WAL_LINK_64 0
+#else
+#define WAL_LINK_64 1
+#endif
+#endif
+
+/*
+ * It links current position with previous one.
+ * - CurrPosId is (CurrBytePos ^ (CurrBytePos>>32))
+ *   Since CurrBytePos grows monotonically and it is aligned to MAXALIGN,
+ *   CurrPosId correctly identifies CurrBytePos for at least 4*2^32 = 32GB of
+ *   WAL logs.
+ * - CurrPosHigh is (CurrBytePos>>32), it is stored for strong uniqueness check.
+ * - PrevSize is difference between CurrBytePos and PrevBytePos
+ */
+typedef struct
+{
+#if WAL_LINK_64
+	uint64		CurrPos;
+	uint64		PrevPos;
+#define WAL_PREV_EMPTY (~((uint64)0))
+#define WALLinkEmpty(l) ((l).PrevPos == WAL_PREV_EMPTY)
+#define WALLinkSamePos(a, b) ((a).CurrPos == (b).CurrPos)
+#define WALLinkCopyPrev(a, b) do {(a).PrevPos = (b).PrevPos;} while(0)
+#else
+	uint32		CurrPosId;
+	uint32		CurrPosHigh;
+	uint32		PrevSize;
+#define WALLinkEmpty(l) ((l).PrevSize == 0)
+#define WALLinkSamePos(a, b) ((a).CurrPosId == (b).CurrPosId && (a).CurrPosHigh == (b).CurrPosHigh)
+#define WALLinkCopyPrev(a, b) do {(a).PrevSize = (b).PrevSize;} while(0)
+#endif
+} WALPrevPosLinkVal;
+
+/*
+ * This is an element of lock-free hash-table.
+ * In 32 bit mode PrevSize's lowest bit is used as a lock, relying on fact it is MAXALIGN-ed.
+ * In 64 bit mode lock protocol is more complex.
+ */
+typedef struct
+{
+#if WAL_LINK_64
+	pg_atomic_uint64 CurrPos;
+	pg_atomic_uint64 PrevPos;
+#else
+	pg_atomic_uint32 CurrPosId;
+	uint32		CurrPosHigh;
+	pg_atomic_uint32 PrevSize;
+	uint32		pad;			/* to align to 16 bytes */
+#endif
+} WALPrevPosLink;
+
+StaticAssertDecl(sizeof(WALPrevPosLink) == 16, "WALPrevPosLink should be 16 bytes");
+
+#define PREV_LINKS_HASH_CAPA (NUM_XLOGINSERT_LOCKS * 2)
+StaticAssertDecl(!(PREV_LINKS_HASH_CAPA & (PREV_LINKS_HASH_CAPA - 1)),
+				 "PREV_LINKS_HASH_CAPA should be power of two");
+StaticAssertDecl(PREV_LINKS_HASH_CAPA < UINT16_MAX,
+				 "PREV_LINKS_HASH_CAPA is too large");
+
+#define PREV_LINKS_LOOKUPS 4
+struct WALPrevLinksLookups
+{
+	uint16		pos[PREV_LINKS_LOOKUPS];
+};
+#define PREV_LINKS_SAME_CACHE_LINE 0
+
+#define SWAP_ONCE_IN 128
+
 /*
  * Session status of running backup, used for sanity checks in SQL-callable
  * functions to start and stop backups.
@@ -395,26 +469,18 @@ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
  */
 typedef struct XLogCtlInsert
 {
-	slock_t		insertpos_lck;	/* protects CurrBytePos and PrevBytePos */
-
 	/*
 	 * CurrBytePos is the end of reserved WAL. The next record will be
-	 * inserted at that position. PrevBytePos is the start position of the
-	 * previously inserted (or rather, reserved) record - it is copied to the
-	 * prev-link of the next record. These are stored as "usable byte
-	 * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).
+	 * inserted at that position.
+	 *
+	 * The start position of the previously inserted (or rather, reserved)
+	 * record (it is copied to the prev-link of the next record) will be
+	 * stored in PrevLinksHash.
+	 *
+	 * These are stored as "usable byte positions" rather than XLogRecPtrs
+	 * (see XLogBytePosToRecPtr()).
 	 */
-	uint64		CurrBytePos;
-	uint64		PrevBytePos;
-
-	/*
-	 * Make sure the above heavily-contended spinlock and byte positions are
-	 * on their own cache line. In particular, the RedoRecPtr and full page
-	 * write variables below should be on a different cache line. They are
-	 * read on every WAL insertion, but updated rarely, and we don't want
-	 * those reads to steal the cache line containing Curr/PrevBytePos.
-	 */
-	char		pad[PG_CACHE_LINE_SIZE];
+	pg_atomic_uint64 CurrBytePos pg_attribute_aligned(PG_CACHE_LINE_SIZE);
 
 	/*
 	 * fullPageWrites is the authoritative value used by all backends to
@@ -442,6 +508,20 @@ typedef struct XLogCtlInsert
 	 * WAL insertion locks.
 	 */
 	WALInsertLockPadded *WALInsertLocks;
+
+	/*
+	 * PrevLinksHash is a lock-free hash table based on Cuckoo algorith. It is
+	 * mostly 4 way: for every element computed two positions h1, h2, and
+	 * neighbour h1+1 and h2+2 are used as well. This way even on collision we
+	 * have 3 distinct position, which provide us ~75% fill rate without
+	 * unsolvable cycles (due to Cuckoo's theory).
+	 *
+	 * Certainly, we rely on the fact we will delete elements with same speed
+	 * as we add them, and even unsolvable cycles will be destroyed soon by
+	 * concurrent deletions.
+	 */
+	WALPrevPosLink *PrevLinksHash;
+
 } XLogCtlInsert;
 
 /*
@@ -568,6 +648,9 @@ static XLogCtlData *XLogCtl = NULL;
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
 static WALInsertLockPadded *WALInsertLocks = NULL;
 
+/* same for XLogCtl->Insert.PrevLinksHash */
+static WALPrevPosLink *PrevLinksHash = NULL;
+
 /*
  * We maintain an image of pg_control in shared memory.
  */
@@ -700,6 +783,19 @@ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
 								XLogRecData *rdata,
 								XLogRecPtr StartPos, XLogRecPtr EndPos,
 								TimeLineID tli);
+
+static void WALPrevPosLinkValCompose(WALPrevPosLinkVal *val, XLogRecPtr StartPos, XLogRecPtr PrevPos);
+static void WALPrevPosLinkValGetPrev(WALPrevPosLinkVal val, XLogRecPtr *PrevPos);
+static void CalcCuckooPositions(WALPrevPosLinkVal linkval, struct WALPrevLinksLookups *pos);
+
+static bool WALPrevPosLinkInsert(WALPrevPosLink *link, WALPrevPosLinkVal val);
+static bool WALPrevPosLinkConsume(WALPrevPosLink *link, WALPrevPosLinkVal *val);
+static bool WALPrevPosLinkSwap(WALPrevPosLink *link, WALPrevPosLinkVal *val);
+static void LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos,
+							   XLogRecPtr *PrevPtr);
+static void LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec);
+static XLogRecPtr ReadInsertCurrBytePos(void);
+
 static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
@@ -1086,6 +1182,327 @@ XLogInsertRecord(XLogRecData *rdata,
 	return EndPos;
 }
 
+static pg_attribute_always_inline void
+WALPrevPosLinkValCompose(WALPrevPosLinkVal *val, XLogRecPtr StartPos, XLogRecPtr PrevPos)
+{
+#if WAL_LINK_64
+	val->CurrPos = StartPos;
+	val->PrevPos = PrevPos;
+#else
+	val->CurrPosHigh = StartPos >> 32;
+	val->CurrPosId = StartPos ^ val->CurrPosHigh;
+	val->PrevSize = StartPos - PrevPos;
+#endif
+}
+
+static pg_attribute_always_inline void
+WALPrevPosLinkValGetPrev(WALPrevPosLinkVal val, XLogRecPtr *PrevPos)
+{
+#if WAL_LINK_64
+	*PrevPos = val.PrevPos;
+#else
+	XLogRecPtr	StartPos = val.CurrPosHigh;
+
+	StartPos ^= (StartPos << 32) | val.CurrPosId;
+	*PrevPos = StartPos - val.PrevSize;
+#endif
+}
+
+static pg_attribute_always_inline void
+CalcCuckooPositions(WALPrevPosLinkVal linkval, struct WALPrevLinksLookups *pos)
+{
+	uint32		hash;
+
+	StaticAssertStmt(PREV_LINKS_LOOKUPS == 4, "CalcCuckooPositions assumes PREV_LINKS_LOOKUPS == 4");
+
+#if WAL_LINK_64
+	hash = murmurhash32(linkval.CurrPos ^ (linkval.CurrPos >> 32));
+#else
+	hash = murmurhash32(linkval.CurrPosId);
+#endif
+
+#if !PREV_LINKS_SAME_CACHE_LINE
+	pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos->pos[1] = pos->pos[0] + 1;
+	pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
+	pos->pos[3] = pos->pos[2] + 2;
+#else
+	pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos->pos[1] = pos->pos[0] ^ 1;
+	pos->pos[2] = pos->pos[0] ^ 2;
+	pos->pos[3] = pos->pos[0] ^ 3;
+#endif
+}
+
+/*
+ * Attempt to write into empty link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkInsert(WALPrevPosLink *link, WALPrevPosLinkVal val)
+{
+#if WAL_LINK_64
+	uint64		empty = WAL_PREV_EMPTY;
+
+	if (pg_atomic_read_u64(&link->PrevPos) != WAL_PREV_EMPTY)
+		return false;
+	if (!pg_atomic_compare_exchange_u64(&link->PrevPos, &empty, val.PrevPos))
+		return false;
+	/* we could ignore concurrent lock of CurrPos */
+	pg_atomic_write_u64(&link->CurrPos, val.CurrPos);
+	return true;
+#else
+	uint32		empty = 0;
+
+	/* first check it read-only */
+	if (pg_atomic_read_u32(&link->PrevSize) != 0)
+		return false;
+	if (!pg_atomic_compare_exchange_u32(&link->PrevSize, &empty, 1))
+		/* someone else occupied the entry */
+		return false;
+
+	pg_atomic_write_u32(&link->CurrPosId, val.CurrPosId);
+	link->CurrPosHigh = val.CurrPosHigh;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, val.PrevSize);
+	return true;
+#endif
+}
+
+/*
+ * Attempt to consume matched link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkConsume(WALPrevPosLink *link, WALPrevPosLinkVal *val)
+{
+#if WAL_LINK_64
+	uint64		oldCurr;
+
+	if (pg_atomic_read_u64(&link->CurrPos) != val->CurrPos)
+		return false;
+	/* lock against concurrent swapper */
+	oldCurr = pg_atomic_fetch_or_u64(&link->CurrPos, 1);
+	if (oldCurr & 1)
+		return false;			/* lock failed */
+	if (oldCurr != val->CurrPos)
+	{
+		/* link was swapped */
+		pg_atomic_write_u64(&link->CurrPos, oldCurr);
+		return false;
+	}
+	val->PrevPos = pg_atomic_read_u64(&link->PrevPos);
+	pg_atomic_write_u64(&link->PrevPos, WAL_PREV_EMPTY);
+
+	/*
+	 * concurrent inserter may already reuse this link, so we don't check
+	 * result of compare_exchange
+	 */
+	oldCurr |= 1;
+	pg_atomic_compare_exchange_u64(&link->CurrPos, &oldCurr, 0);
+	return true;
+#else
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId)
+		return false;
+
+	/* Try lock */
+	val->PrevSize = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (val->PrevSize & 1)
+		/* Lock failed */
+		return false;
+
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId ||
+		link->CurrPosHigh != val->CurrPosHigh)
+	{
+		/* unlock with old value */
+		pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+		return false;
+	}
+
+	pg_atomic_write_u32(&link->CurrPosId, 0);
+	link->CurrPosHigh = 0;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, 0);
+	return true;
+#endif
+}
+
+/*
+ * Attempt to swap entry: remember existing link and write our.
+ * It could happen we consume empty entry. Caller will detect it by checking
+ * remembered value.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkSwap(WALPrevPosLink *link, WALPrevPosLinkVal *val)
+{
+#if WAL_LINK_64
+	uint64		oldCurr;
+	uint64		oldPrev;
+
+	/* lock against concurrent swapper or consumer */
+	oldCurr = pg_atomic_fetch_or_u64(&link->CurrPos, 1);
+	if (oldCurr & 1)
+		return false;			/* lock failed */
+	if (oldCurr == 0)
+	{
+		/* link was empty */
+		oldPrev = WAL_PREV_EMPTY;
+		/* but concurrent inserter may concurrently insert */
+		if (!pg_atomic_compare_exchange_u64(&link->PrevPos, &oldPrev, val->PrevPos))
+			return false;		/* concurrent inserter won. It will overwrite
+								 * CurrPos */
+		/* this write acts as unlock */
+		pg_atomic_write_u64(&link->CurrPos, val->CurrPos);
+		val->CurrPos = 0;
+		val->PrevPos = WAL_PREV_EMPTY;
+		return true;
+	}
+	oldPrev = pg_atomic_read_u64(&link->PrevPos);
+	pg_atomic_write_u64(&link->PrevPos, val->PrevPos);
+	pg_write_barrier();
+	/* write acts as unlock */
+	pg_atomic_write_u64(&link->CurrPos, val->CurrPos);
+	val->CurrPos = oldCurr;
+	val->PrevPos = oldPrev;
+	return true;
+#else
+	uint32		oldPrev;
+	uint32		oldCurId;
+	uint32		oldCurHigh;
+
+	/* Attempt to lock entry against concurrent consumer or swapper */
+	oldPrev = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (oldPrev & 1)
+		/* Lock failed */
+		return false;
+
+	oldCurId = pg_atomic_read_u32(&link->CurrPosId);
+	oldCurHigh = link->CurrPosHigh;
+	pg_atomic_write_u32(&link->CurrPosId, val->CurrPosId);
+	link->CurrPosHigh = val->CurrPosHigh;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+
+	val->CurrPosId = oldCurId;
+	val->CurrPosHigh = oldCurHigh;
+	val->PrevSize = oldPrev;
+	return true;
+#endif
+}
+
+/*
+ * Write new link (EndPos, StartPos) and find PrevPtr for StartPos.
+ *
+ * Links are stored in lock-free Cuckoo based hash-table.
+ * We use mostly-4 way Cuckoo hashing which provides high fill rate without
+ * hard cycle collisions. Also we rely on concurrent consumers of existing
+ * entry, so cycles will be broken in mean time.
+ *
+ * Cuckoo hashing relies on re-insertion for balancing, so we occasionally
+ * swaps entry and try to insert swapped instead of our.
+ */
+static void
+LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos, XLogRecPtr *PrevPtr)
+{
+	SpinDelayStatus spin_stat;
+	WALPrevPosLinkVal lookup;
+	WALPrevPosLinkVal insert;
+	struct WALPrevLinksLookups lookup_pos;
+	struct WALPrevLinksLookups insert_pos;
+	uint32		i;
+	uint32		rand = 0;
+	bool		inserted = false;
+	bool		found = false;
+
+	/* pass StartPos second time to set PrevSize = 0 */
+	WALPrevPosLinkValCompose(&lookup, StartPos, StartPos);
+	WALPrevPosLinkValCompose(&insert, EndPos, StartPos);
+
+	CalcCuckooPositions(lookup, &lookup_pos);
+	CalcCuckooPositions(insert, &insert_pos);
+
+	init_local_spin_delay(&spin_stat);
+
+	while (!inserted || !found)
+	{
+		for (i = 0; !found && i < PREV_LINKS_LOOKUPS; i++)
+			found = WALPrevPosLinkConsume(&PrevLinksHash[lookup_pos.pos[i]], &lookup);
+
+		if (inserted)
+		{
+			/*
+			 * we may sleep only after we inserted our value, since other
+			 * backend waits for it
+			 */
+			perform_spin_delay(&spin_stat);
+			goto next;
+		}
+
+		for (i = 0; !inserted && i < PREV_LINKS_LOOKUPS; i++)
+			inserted = WALPrevPosLinkInsert(&PrevLinksHash[insert_pos.pos[i]], insert);
+
+		if (inserted)
+			goto next;
+
+		rand = pg_prng_uint32(&pg_global_prng_state);
+		if (rand % SWAP_ONCE_IN != 0)
+			goto next;
+
+		i = rand / SWAP_ONCE_IN % PREV_LINKS_LOOKUPS;
+		if (!WALPrevPosLinkSwap(&PrevLinksHash[insert_pos.pos[i]], &insert))
+			goto next;
+
+		if (WALLinkEmpty(insert))
+			/* Lucky case: entry become empty and we inserted into */
+			inserted = true;
+		else if (WALLinkSamePos(lookup, insert))
+		{
+			/*
+			 * We occasionally replaced entry we looked for. No need to insert
+			 * it again.
+			 */
+			inserted = true;
+			Assert(!found);
+			found = true;
+			WALLinkCopyPrev(lookup, insert);
+			break;
+		}
+		else
+			CalcCuckooPositions(insert, &insert_pos);
+
+next:
+		pg_spin_delay();
+		pg_read_barrier();
+	}
+
+	WALPrevPosLinkValGetPrev(lookup, PrevPtr);
+}
+
+static pg_attribute_always_inline void
+LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec)
+{
+	WALPrevPosLinkVal insert;
+	struct WALPrevLinksLookups insert_pos;
+
+	WALPrevPosLinkValCompose(&insert, EndOfLog, LastRec);
+	CalcCuckooPositions(insert, &insert_pos);
+#if WAL_LINK_64
+	pg_atomic_write_u64(&PrevLinksHash[insert_pos.pos[0]].CurrPos, insert.CurrPos);
+	pg_atomic_write_u64(&PrevLinksHash[insert_pos.pos[0]].PrevPos, insert.PrevPos);
+#else
+	pg_atomic_write_u32(&PrevLinksHash[insert_pos.pos[0]].CurrPosId, insert.CurrPosId);
+	PrevLinksHash[insert_pos.pos[0]].CurrPosHigh = insert.CurrPosHigh;
+	pg_atomic_write_u32(&PrevLinksHash[insert_pos.pos[0]].PrevSize, insert.PrevSize);
+#endif
+}
+
+static pg_attribute_always_inline XLogRecPtr
+ReadInsertCurrBytePos(void)
+{
+	return pg_atomic_read_u64(&XLogCtl->Insert.CurrBytePos);
+}
+
 /*
  * Reserves the right amount of space for a record of given size from the WAL.
  * *StartPos is set to the beginning of the reserved section, *EndPos to
@@ -1118,25 +1535,9 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	/* All (non xlog-switch) records should contain data. */
 	Assert(size > SizeOfXLogRecord);
 
-	/*
-	 * The duration the spinlock needs to be held is minimized by minimizing
-	 * the calculations that have to be done while holding the lock. The
-	 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
-	 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
-	 * page headers. The mapping between "usable" byte positions and physical
-	 * positions (XLogRecPtrs) can be done outside the locked region, and
-	 * because the usable byte position doesn't include any headers, reserving
-	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
-	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
-
-	startbytepos = Insert->CurrBytePos;
+	startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
-
-	SpinLockRelease(&Insert->insertpos_lck);
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
@@ -1172,26 +1573,24 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 	uint32		segleft;
 
 	/*
-	 * These calculations are a bit heavy-weight to be done while holding a
-	 * spinlock, but since we're holding all the WAL insertion locks, there
-	 * are no other inserters competing for it. GetXLogInsertRecPtr() does
-	 * compete for it, but that's not called very frequently.
+	 * Currently ReserveXLogInsertLocation is protected with exclusive
+	 * insertion lock, so there is no contention against CurrBytePos, But we
+	 * still do CAS loop for being uniform.
+	 *
+	 * Probably we'll get rid of exclusive lock in a future.
 	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
 
-	startbytepos = Insert->CurrBytePos;
+repeat:
+	startbytepos = pg_atomic_read_u64(&Insert->CurrBytePos);
 
 	ptr = XLogBytePosToEndRecPtr(startbytepos);
 	if (XLogSegmentOffset(ptr, wal_segment_size) == 0)
 	{
-		SpinLockRelease(&Insert->insertpos_lck);
 		*EndPos = *StartPos = ptr;
 		return false;
 	}
 
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
 
@@ -1202,10 +1601,19 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 		*EndPos += segleft;
 		endbytepos = XLogRecPtrToBytePos(*EndPos);
 	}
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
 
-	SpinLockRelease(&Insert->insertpos_lck);
+	if (!pg_atomic_compare_exchange_u64(&Insert->CurrBytePos,
+										&startbytepos,
+										endbytepos))
+	{
+		/*
+		 * Don't use spin delay here: perform_spin_delay primary case is for
+		 * solving single core contention. But on single core we will succeed
+		 * on the next attempt.
+		 */
+		goto repeat;
+	}
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
 
@@ -1507,7 +1915,6 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	XLogRecPtr	inserted;
 	XLogRecPtr	reservedUpto;
 	XLogRecPtr	finishedUpto;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	int			i;
 
 	if (MyProc == NULL)
@@ -1522,9 +1929,7 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 		return inserted;
 
 	/* Read the current insert position */
-	SpinLockAcquire(&Insert->insertpos_lck);
-	bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
+	bytepos = ReadInsertCurrBytePos();
 	reservedUpto = XLogBytePosToEndRecPtr(bytepos);
 
 	/*
@@ -4898,6 +5303,8 @@ XLOGShmemSize(void)
 
 	/* WAL insertion locks, plus alignment */
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
+	/* prevlinkshash, abuses alignment of WAL insertion locks. */
+	size = add_size(size, mul_size(sizeof(WALPrevPosLink), PREV_LINKS_HASH_CAPA));
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(pg_atomic_uint64), XLOGbuffers));
 	/* extra alignment padding for XLOG I/O buffers */
@@ -4999,6 +5406,9 @@ XLOGShmemInit(void)
 		WALInsertLocks[i].l.lastImportantAt = InvalidXLogRecPtr;
 	}
 
+	PrevLinksHash = XLogCtl->Insert.PrevLinksHash = (WALPrevPosLink *) allocptr;
+	allocptr += sizeof(WALPrevPosLink) * PREV_LINKS_HASH_CAPA;
+
 	/*
 	 * Align the start of the page buffers to a full xlog block size boundary.
 	 * This simplifies some calculations in XLOG insertion. It is also
@@ -5017,12 +5427,24 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
-	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
 	pg_atomic_init_u64(&XLogCtl->logInsertResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
+
+	pg_atomic_init_u64(&XLogCtl->Insert.CurrBytePos, 0);
+
+	for (i = 0; i < PREV_LINKS_HASH_CAPA; i++)
+	{
+#if WAL_LINK_64
+		pg_atomic_init_u64(&PrevLinksHash[i].CurrPos, 0);
+		pg_atomic_init_u64(&PrevLinksHash[i].PrevPos, WAL_PREV_EMPTY);
+#else
+		pg_atomic_init_u32(&PrevLinksHash[i].CurrPosId, 0);
+		pg_atomic_init_u32(&PrevLinksHash[i].PrevSize, 0);
+#endif
+	}
 }
 
 /*
@@ -6018,8 +6440,13 @@ StartupXLOG(void)
 	 * previous incarnation.
 	 */
 	Insert = &XLogCtl->Insert;
-	Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
-	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+	{
+		XLogRecPtr	endOfLog = XLogRecPtrToBytePos(EndOfLog);
+		XLogRecPtr	lastRec = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
+
+		pg_atomic_write_u64(&Insert->CurrBytePos, endOfLog);
+		LinkStartPrevPos(endOfLog, lastRec);
+	}
 
 	/*
 	 * Tricky point here: lastPage contains the *last* block that the LastRec
@@ -7005,7 +7432,7 @@ CreateCheckPoint(int flags)
 
 	if (shutdown)
 	{
-		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 
 		/*
 		 * Compute new REDO record ptr = location of next XLOG record.
@@ -9434,14 +9861,7 @@ register_persistent_abort_backup_handler(void)
 XLogRecPtr
 GetXLogInsertRecPtr(void)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	uint64		current_bytepos;
-
-	SpinLockAcquire(&Insert->insertpos_lck);
-	current_bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
-
-	return XLogBytePosToRecPtr(current_bytepos);
+	return XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 }
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 668bddbfcd7..28001598130 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3122,6 +3122,8 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALPrevPosLink
+WALPrevPosLinkVal
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.43.0

#20Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Yura Sokolov (#19)
Re: [RFC] Lock-free XLog Reservation from WAL

On 1/19/2025 10:56 PM, Yura Sokolov wrote:

17.01.2025 17:00, Zhou, Zhiguo wrote:

On 1/16/2025 10:00 PM, Yura Sokolov wrote:

Good day, Zhiguo.

Excuse me, I feel sneaky a bit, but I've started another thread just
about increase of NUM_XLOGINSERT_LOCK, because I can measure its
effect even on my working notebook (it is another one: Ryzen 5825U
limited to @2GHz).

/messages/by-id/flat/3b11fdc2-9793-403d-
b3d4-67ff9a00d447%40postgrespro.ru

-----

regards
Yura Sokolov aka funny-falcon

Good day, Yura!

Thank you for keeping me informed. I appreciate your proactive
approach and understand the importance of exploring different angles
for optimization. Your patch is indeed fundamental to our ongoing work
on the lock-free xlog reservation, and I'm eager to see how it can
further enhance our efforts.

I will proceed to test the performance impact of your latest patch
when combined with the lock-free xlog reservation patch. This will
help us determine if there's potential for additional optimization.
Concurrently, with your permission, I'll try to refine the hash-table-
based implementation for your further review. WDYT?

Good day, Zhiguo

Here's version of "hash-table reservation" with both 32bit and 64bit
operations (depending on PG_HAVE_ATOMIC_U64_SIMULATION, or may be
switched by hand).

64bit version uses other protocol with a bit lesser atomic operations. I
suppose it could be a bit faster. But I can't prove it now.

btw, you wrote:

Major issue:
     - `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read on
       platforms where MAXALIGN != 8 or without native 64-bit load/store.
       The branch with `memcpy` is rather obvious, but even pointer
       de-referencing in the "lucky case" is not safe either.

       I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the
prev-link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev-link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What
do you think of this as a viable solution?

There is a way to order operations:
- since SetPrevRecPtr stores start of record as LSN, its lower 32bits
are certainly non-zero (record could not start at the beginning of a page).
- so SetPrevRecPtr should write high 32bits, issue write barrier, and
then write lower 32bits,
- and then GetPrevRecPtr should first read lower 32bits, and if it is
not zero, then issue read barrier and read upper 32bits.

This way you will always read correct prev-rec-ptr on platform without
64bit atomics. (because MAXALIGN >= 4 and PostgreSQL requires 4 byte
atomicity for several years).

------
regards
Yura Sokolov aka funny-falcon

Good day, Yura.

Thank you for your patch! It has been incredibly helpful and serves as a
great guide for my revisions. I particularly appreciate your insight
into writing the prev-rec-ptr atomically. It's a brilliant approach, and
I will definitely try implementing it in my development work. Besides,
please take some well-deserved rest. Thanks!

Regards,
Zhiguo

#21Japin Li
japinli@hotmail.com
In reply to: Yura Sokolov (#19)
Re: [RFC] Lock-free XLog Reservation from WAL

On Sun, 19 Jan 2025 at 17:56, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

17.01.2025 17:00, Zhou, Zhiguo пишет:

On 1/16/2025 10:00 PM, Yura Sokolov wrote:

Good day, Zhiguo.

Excuse me, I feel sneaky a bit, but I've started another thread
just about increase of NUM_XLOGINSERT_LOCK, because I can measure
its effect even on my working notebook (it is another one: Ryzen
5825U limited to @2GHz).

/messages/by-id/flat/3b11fdc2-9793-403d-
b3d4-67ff9a00d447%40postgrespro.ru

-----

regards
Yura Sokolov aka funny-falcon

Good day, Yura!
Thank you for keeping me informed. I appreciate your proactive
approach and understand the importance of exploring different angles
for optimization. Your patch is indeed fundamental to our ongoing
work on the lock-free xlog reservation, and I'm eager to see how it
can further enhance our efforts.
I will proceed to test the performance impact of your latest patch
when combined with the lock-free xlog reservation patch. This will
help us determine if there's potential for additional
optimization. Concurrently, with your permission, I'll try to refine
the hash-table- based implementation for your further review. WDYT?

Good day, Zhiguo

Here's version of "hash-table reservation" with both 32bit and 64bit
operations (depending on PG_HAVE_ATOMIC_U64_SIMULATION, or may be
switched by hand).

64bit version uses other protocol with a bit lesser atomic
operations. I suppose it could be a bit faster. But I can't prove it
now.

btw, you wrote:

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read on
  platforms where MAXALIGN != 8 or without native 64-bit load/store.
  The branch with `memcpy` is rather obvious, but even pointer
  de-referencing in the "lucky case" is not safe either.

  I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the
prev-link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev-link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What
do you think of this as a viable solution?

There is a way to order operations:
- since SetPrevRecPtr stores start of record as LSN, its lower 32bits
are certainly non-zero (record could not start at the beginning of a
page).
- so SetPrevRecPtr should write high 32bits, issue write barrier, and
then write lower 32bits,
- and then GetPrevRecPtr should first read lower 32bits, and if it is
not zero, then issue read barrier and read upper 32bits.

This way you will always read correct prev-rec-ptr on platform without
64bit atomics. (because MAXALIGN >= 4 and PostgreSQL requires 4 byte
atomicity for several years).

Hi, Yura Sokolov

Thanks for updating the patch.
I tested the v2 patch using BenchmarkSQL with 1000 warehouses, and here is
the tpmC result:

case                | min        | avg        | max
--------------------+------------+------------+--------------
master (patched)    | 988,461.89 | 994,916.50 | 1,000,362.40
master (44b61efb79) | 857,028.07 | 863,174.59 | 873,856.92

The patch provides a significant improvement.

I just looked through the patch; here are some comments.

1.
The v2 patch can't be applied cleanly.

Applying: Lock-free XLog Reservation using lock-free hash-table
.git/rebase-apply/patch:33: trailing whitespace.

.git/rebase-apply/patch:37: space before tab in indent.
{
.git/rebase-apply/patch:38: space before tab in indent.
int i;
.git/rebase-apply/patch:39: trailing whitespace.

.git/rebase-apply/patch:46: space before tab in indent.
LWLockReleaseClearVar(&WALInsertLocks[i].l.lock,
warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.

2.
And there is a typo:

+     * PrevLinksHash is a lock-free hash table based on Cuckoo algorith. It is
+     * mostly 4 way: for every element computed two positions h1, h2, and

s/algorith/algorithm/g

--
Regards,
Japin Li

#22Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Japin Li (#21)
Re: [RFC] Lock-free XLog Reservation from WAL

22.01.2025 09:09, Japin Li wrote:

On Sun, 19 Jan 2025 at 17:56, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

17.01.2025 17:00, Zhou, Zhiguo wrote:

On 1/16/2025 10:00 PM, Yura Sokolov wrote:

Good day, Zhiguo.

Excuse me, I feel sneaky a bit, but I've started another thread
just about increase of NUM_XLOGINSERT_LOCK, because I can measure
its effect even on my working notebook (it is another one: Ryzen
5825U limited to @2GHz).

/messages/by-id/flat/3b11fdc2-9793-403d-
b3d4-67ff9a00d447%40postgrespro.ru

-----

regards
Yura Sokolov aka funny-falcon

Good day, Yura!
Thank you for keeping me informed. I appreciate your proactive
approach and understand the importance of exploring different angles
for optimization. Your patch is indeed fundamental to our ongoing
work on the lock-free xlog reservation, and I'm eager to see how it
can further enhance our efforts.
I will proceed to test the performance impact of your latest patch
when combined with the lock-free xlog reservation patch. This will
help us determine if there's potential for additional
optimization. Concurrently, with your permission, I'll try to refine
the hash-table- based implementation for your further review. WDYT?

Good day, Zhiguo

Here's version of "hash-table reservation" with both 32bit and 64bit
operations (depending on PG_HAVE_ATOMIC_U64_SIMULATION, or may be
switched by hand).

64bit version uses other protocol with a bit lesser atomic
operations. I suppose it could be a bit faster. But I can't prove it
now.

btw, you wrote:

Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do non-atomic write/read on
  platforms where MAXALIGN != 8 or without native 64-bit load/store.
  The branch with `memcpy` is rather obvious, but even pointer
  de-referencing in the "lucky case" is not safe either.

  I have no idea how to fix it at the moment.

Indeed, non-atomic write/read operations can lead to safety issues in
some situations. My initial thought is to define a bit near the
prev-link to flag the completion of the update. In this way, we could
allow non-atomic or even discontinuous write/read operations on the
prev-link, while simultaneously guaranteeing its atomicity through
atomic operations (as well as memory barriers) on the flag bit. What
do you think of this as a viable solution?

There is a way to order operations:
- since SetPrevRecPtr stores start of record as LSN, its lower 32bits
are certainly non-zero (record could not start at the beginning of a
page).
- so SetPrevRecPtr should write high 32bits, issue write barrier, and
then write lower 32bits,
- and then GetPrevRecPtr should first read lower 32bits, and if it is
not zero, then issue read barrier and read upper 32bits.

This way you will always read correct prev-rec-ptr on platform without
64bit atomics. (because MAXALIGN >= 4 and PostgreSQL requires 4 byte
atomicity for several years).

Hi, Yura Sokolov

Thanks for updating the patch.
I test the v2 patch using BenchmarkSQL 1000 warehouse, and here is the tpmC
result:

case                | min        | avg        | max
--------------------+------------+------------+--------------
master (patched)    | 988,461.89 | 994,916.50 | 1,000,362.40
master (44b61efb79) | 857,028.07 | 863,174.59 | 873,856.92

The patch provides a significant improvement.

I just looked through the patch, here are some comments.

1.
The v2 patch can't be applied cleanly.

Applying: Lock-free XLog Reservation using lock-free hash-table
.git/rebase-apply/patch:33: trailing whitespace.

.git/rebase-apply/patch:37: space before tab in indent.
{
.git/rebase-apply/patch:38: space before tab in indent.
int i;
.git/rebase-apply/patch:39: trailing whitespace.

.git/rebase-apply/patch:46: space before tab in indent.
LWLockReleaseClearVar(&WALInsertLocks[i].l.lock,
warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.

2.
And there is a typo:

+     * PrevLinksHash is a lock-free hash table based on Cuckoo algorith. It is
+     * mostly 4 way: for every element computed two positions h1, h2, and

s/algorith/algorithm/g

Hi, Japin

Thank you a lot for the measurements and comments.

May I ask you to compare not only against master, but also against a
straight increase of NUM_XLOGINSERT_LOCKS to 128?
That way the benefit of the added complexity will be more obvious: does it
pay for itself or not?

-------

regards
Yura

#23Japin Li
japinli@hotmail.com
In reply to: Yura Sokolov (#22)
Re: [RFC] Lock-free XLog Reservation from WAL

On Wed, 22 Jan 2025 at 10:25, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

22.01.2025 09:09, Japin Li wrote:

Hi, Yura Sokolov
Thanks for updating the patch.
I test the v2 patch using BenchmarkSQL 1000 warehouse, and here is the tpmC
result:
case               | min        | avg        | max
--------------------+------------+------------+--------------
master (patched)    | 988,461.89 | 994,916.50 | 1,000,362.40
master (44b61efb79) | 857,028.07 | 863,174.59 | 873,856.92
The patch provides a significant improvement.
I just looked through the patch, here are some comments.
1.
The v2 patch can't be applied cleanly.
Applying: Lock-free XLog Reservation using lock-free hash-table
.git/rebase-apply/patch:33: trailing whitespace.
.git/rebase-apply/patch:37: space before tab in indent.
{
.git/rebase-apply/patch:38: space before tab in indent.
int                     i;
.git/rebase-apply/patch:39: trailing whitespace.
.git/rebase-apply/patch:46: space before tab in indent.
LWLockReleaseClearVar(&WALInsertLocks[i].l.lock,
warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.
2.
And there is a typo:
+     * PrevLinksHash is a lock-free hash table based on Cuckoo
algorith. It is
+     * mostly 4 way: for every element computed two positions h1, h2, and
s/algorith/algorithm/g

Hi, Japin

Thank you a lot for measuring and comments.

May I ask you to compare not only against master, but against straight
increase of NUM_XLOGINSERT_LOCKS to 128 as well?
This way the profit from added complexity will be more obvious: does
it pay for self or not.

The above test already increased NUM_XLOGINSERT_LOCKS to 64; I will try 128
and update the results later.

--
Regards,
Japin Li

#24Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Japin Li (#23)
Re: [RFC] Lock-free XLog Reservation from WAL

22.01.2025 10:54, Japin Li wrote:

On Wed, 22 Jan 2025 at 10:25, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

22.01.2025 09:09, Japin Li wrote:

Hi, Yura Sokolov
Thanks for updating the patch.
I test the v2 patch using BenchmarkSQL 1000 warehouse, and here is the tpmC
result:
case               | min        | avg        | max
--------------------+------------+------------+--------------
master (patched)    | 988,461.89 | 994,916.50 | 1,000,362.40
master (44b61efb79) | 857,028.07 | 863,174.59 | 873,856.92
The patch provides a significant improvement.
I just looked through the patch, here are some comments.
1.
The v2 patch can't be applied cleanly.
Applying: Lock-free XLog Reservation using lock-free hash-table
.git/rebase-apply/patch:33: trailing whitespace.
.git/rebase-apply/patch:37: space before tab in indent.
{
.git/rebase-apply/patch:38: space before tab in indent.
int                     i;
.git/rebase-apply/patch:39: trailing whitespace.
.git/rebase-apply/patch:46: space before tab in indent.
LWLockReleaseClearVar(&WALInsertLocks[i].l.lock,
warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.
2.
And there is a typo:
+     * PrevLinksHash is a lock-free hash table based on Cuckoo
algorith. It is
+     * mostly 4 way: for every element computed two positions h1, h2, and
s/algorith/algorithm/g

Hi, Japin

Thank you a lot for measuring and comments.

May I ask you to compare not only against master, but against straight
increase of NUM_XLOGINSERT_LOCKS to 128 as well?
This way the profit from added complexity will be more obvious: does
it pay for self or not.

The above test already increases NUM_XLOGINSERT_LOCKS to 64;

Ok, that is good.
Did you just increase the number of locks, or did you apply the
"several-attempts-to-lock" change
from [1] as well? It will be interesting to see how it affects performance
in this case. And it is orthogonal to "lock-free reservation", so the two
could be applied simultaneously.

I will try 128 and update the result later.

Thank you.

[1]: /messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

------
regards
Yura

#25Japin Li
japinli@hotmail.com
In reply to: Yura Sokolov (#24)
Re: [RFC] Lock-free XLog Reservation from WAL

On Wed, 22 Jan 2025 at 11:22, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

22.01.2025 10:54, Japin Li wrote:

On Wed, 22 Jan 2025 at 10:25, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

22.01.2025 09:09, Japin Li wrote:

Hi, Yura Sokolov
Thanks for updating the patch.
I test the v2 patch using BenchmarkSQL 1000 warehouse, and here is the tpmC
result:
case               | min        | avg        | max
--------------------+------------+------------+--------------
master (patched)    | 988,461.89 | 994,916.50 | 1,000,362.40
master (44b61efb79) | 857,028.07 | 863,174.59 | 873,856.92
The patch provides a significant improvement.
I just looked through the patch, here are some comments.
1.
The v2 patch can't be applied cleanly.
Applying: Lock-free XLog Reservation using lock-free hash-table
.git/rebase-apply/patch:33: trailing whitespace.
.git/rebase-apply/patch:37: space before tab in indent.
{
.git/rebase-apply/patch:38: space before tab in indent.
int                     i;
.git/rebase-apply/patch:39: trailing whitespace.
.git/rebase-apply/patch:46: space before tab in indent.
LWLockReleaseClearVar(&WALInsertLocks[i].l.lock,
warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.
2.
And there is a typo:
+     * PrevLinksHash is a lock-free hash table based on Cuckoo
algorith. It is
+     * mostly 4 way: for every element computed two positions h1, h2, and
s/algorith/algorithm/g

Hi, Japin

Thank you a lot for measuring and comments.

May I ask you to compare not only against master, but against straight
increase of NUM_XLOGINSERT_LOCKS to 128 as well?
This way the profit from added complexity will be more obvious: does
it pay for self or not.

The above test already increases NUM_XLOGINSERT_LOCKS to 64;

Ok, that is good.
Did you just increased number of locks, or applied
"several-attempts-to-lock"
from [1] as well? It will be interesting how it affects performance in this
case. And it is orthogonal to "lock-free reservation", so they could
applied simultaneously.

I applied the following two patches:

1. Lock-free XLog Reservation using lock-free hash-table
2. Increase NUM_XLOGINSERT_LOCKS to 64

I noticed the patch from [1]. However, I haven't tested it independently.

--
Regards,
Japin Li

#26Japin Li
japinli@hotmail.com
In reply to: Japin Li (#25)
Re: [RFC] Lock-free XLog Reservation from WAL

On Wed, 22 Jan 2025 at 16:49, Japin Li <japinli@hotmail.com> wrote:

On Wed, 22 Jan 2025 at 11:22, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

22.01.2025 10:54, Japin Li wrote:

On Wed, 22 Jan 2025 at 10:25, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

22.01.2025 09:09, Japin Li wrote:

Hi, Yura Sokolov
Thanks for updating the patch.
I test the v2 patch using BenchmarkSQL 1000 warehouse, and here is the tpmC
result:
case               | min        | avg        | max
--------------------+------------+------------+--------------
master (patched)    | 988,461.89 | 994,916.50 | 1,000,362.40
master (44b61efb79) | 857,028.07 | 863,174.59 | 873,856.92
The patch provides a significant improvement.
I just looked through the patch, here are some comments.
1.
The v2 patch can't be applied cleanly.
Applying: Lock-free XLog Reservation using lock-free hash-table
.git/rebase-apply/patch:33: trailing whitespace.
.git/rebase-apply/patch:37: space before tab in indent.
{
.git/rebase-apply/patch:38: space before tab in indent.
int                     i;
.git/rebase-apply/patch:39: trailing whitespace.
.git/rebase-apply/patch:46: space before tab in indent.
LWLockReleaseClearVar(&WALInsertLocks[i].l.lock,
warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.
2.
And there is a typo:
+     * PrevLinksHash is a lock-free hash table based on Cuckoo
algorith. It is
+     * mostly 4 way: for every element computed two positions h1, h2, and
s/algorith/algorithm/g

Hi, Japin

Thank you a lot for measuring and comments.

May I ask you to compare not only against master, but against straight
increase of NUM_XLOGINSERT_LOCKS to 128 as well?
This way the profit from added complexity will be more obvious: does
it pay for self or not.

The above test already increases NUM_XLOGINSERT_LOCKS to 64;

Ok, that is good.
Did you just increased number of locks, or applied
"several-attempts-to-lock"
from [1] as well? It will be interesting how it affects performance in this
case. And it is orthogonal to "lock-free reservation", so they could
applied simultaneously.

I apply the following two patches:

1. Lock-free XLog Reservation using lock-free hash-table

Hi, Yura Sokolov

When I tried to test the performance with only the Lock-free XLog
Reservation patch applied, I got an error:

2025-01-22 20:06:49.976 CST [1271602] PANIC: stuck spinlock detected at LinkAndFindPrevPos, /home/postgres/postgres/build/../src/backend/access/transam/xlog.c:1425
2025-01-22 20:06:49.976 CST [1271602] STATEMENT: UPDATE bmsql_customer SET c_balance = c_balance - $1, c_ytd_payment = c_ytd_payment + $2, c_payment_cnt = c_payment_cnt + 1 WHERE c_w_id = $3 AND c_d_id = $4 AND c_id = $5
2025-01-22 20:06:50.078 CST [1271748] PANIC: stuck spinlock detected at LinkAndFindPrevPos, /home/postgres/postgres/build/../src/backend/access/transam/xlog.c:1425

However, it does not always occur.

--
Regards,
Japin Li

#27Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Japin Li (#26)
1 attachment(s)
Re: [RFC] Lock-free XLog Reservation from WAL

22.01.2025 15:37, Japin Li wrote:

On Wed, 22 Jan 2025 at 16:49, Japin Li <japinli@hotmail.com> wrote:

On Wed, 22 Jan 2025 at 11:22, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

22.01.2025 10:54, Japin Li wrote:

On Wed, 22 Jan 2025 at 10:25, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

22.01.2025 09:09, Japin Li wrote:

Hi, Yura Sokolov
Thanks for updating the patch.
I test the v2 patch using BenchmarkSQL 1000 warehouse, and here is the tpmC
result:
case               | min        | avg        | max
--------------------+------------+------------+--------------
master (patched)    | 988,461.89 | 994,916.50 | 1,000,362.40
master (44b61efb79) | 857,028.07 | 863,174.59 | 873,856.92
The patch provides a significant improvement.
I just looked through the patch, here are some comments.
1.
The v2 patch can't be applied cleanly.
Applying: Lock-free XLog Reservation using lock-free hash-table
.git/rebase-apply/patch:33: trailing whitespace.
.git/rebase-apply/patch:37: space before tab in indent.
{
.git/rebase-apply/patch:38: space before tab in indent.
int                     i;
.git/rebase-apply/patch:39: trailing whitespace.
.git/rebase-apply/patch:46: space before tab in indent.
LWLockReleaseClearVar(&WALInsertLocks[i].l.lock,
warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.
2.
And there is a typo:
+     * PrevLinksHash is a lock-free hash table based on Cuckoo
algorith. It is
+     * mostly 4 way: for every element computed two positions h1, h2, and
s/algorith/algorithm/g

Hi, Japin

Thank you a lot for measuring and comments.

May I ask you to compare not only against master, but against straight
increase of NUM_XLOGINSERT_LOCKS to 128 as well?
This way the profit from added complexity will be more obvious: does
it pay for self or not.

The above test already increases NUM_XLOGINSERT_LOCKS to 64;

Ok, that is good.
Did you just increased number of locks, or applied
"several-attempts-to-lock"
from [1] as well? It will be interesting how it affects performance in this
case. And it is orthogonal to "lock-free reservation", so they could
applied simultaneously.

I apply the following two patches:

1. Lock-free XLog Reservation using lock-free hash-table

Hi, Yura Sokolov

When I try to test the performance by only applying the Lock-free XLog
Reservation patch, there is an error:

2025-01-22 20:06:49.976 CST [1271602] PANIC: stuck spinlock detected at LinkAndFindPrevPos, /home/postgres/postgres/build/../src/backend/access/transam/xlog.c:1425
2025-01-22 20:06:49.976 CST [1271602] STATEMENT: UPDATE bmsql_customer SET c_balance = c_balance - $1, c_ytd_payment = c_ytd_payment + $2, c_payment_cnt = c_payment_cnt + 1 WHERE c_w_id = $3 AND c_d_id = $4 AND c_id = $5
2025-01-22 20:06:50.078 CST [1271748] PANIC: stuck spinlock detected at LinkAndFindPrevPos, /home/postgres/postgres/build/../src/backend/access/transam/xlog.c:1425

However, it does not always occur.

Oh, thank you!

I believe I know why it happens: I was in a hurry making v2 by
cherry-picking from an internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to add the modulo by
PREV_LINKS_HASH_CAPA.

Here's the fix:

         pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-       pos->pos[1] = pos->pos[0] + 1;
+       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
         pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-       pos->pos[3] = pos->pos[2] + 2;
+       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;

Anyway, here's v3:
- the excess file "v0-0001-Increase..." is removed; I believe it was the
source of the whitespace apply warnings.
- this mistake is fixed.
- clearer slot strategies, plus an "8 positions in two cache lines"
strategy.

You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and see
whether it affects performance measurably.

-------
regards
Yura Sokolov aka funny-falcon

Attachments:

v3-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchtext/x-patch; charset=UTF-8; name=v3-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchDownload
From bbc415b50a39d130ee003ce37a485cbd95e238e3 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Sun, 19 Jan 2025 17:40:28 +0300
Subject: [PATCH v3] Lock-free XLog Reservation using lock-free hash-table

Removed PrevBytePos to eliminate lock contention, allowing atomic updates
to CurrBytePos. Use lock-free hash-table based on 4-way Cuckoo Hashing
to store link to PrevBytePos.
---
 src/backend/access/transam/xlog.c | 587 ++++++++++++++++++++++++++----
 src/tools/pgindent/typedefs.list  |   2 +
 2 files changed, 526 insertions(+), 63 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..2664fc5706f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -68,6 +68,8 @@
 #include "catalog/pg_database.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
+#include "common/hashfn.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -384,6 +386,94 @@ typedef union WALInsertLockPadded
 	char		pad[PG_CACHE_LINE_SIZE];
 } WALInsertLockPadded;
 
+/* #define WAL_LINK_64 0 */
+#ifndef WAL_LINK_64
+#ifdef PG_HAVE_ATOMIC_U64_SIMULATION
+#define WAL_LINK_64 0
+#else
+#define WAL_LINK_64 1
+#endif
+#endif
+
+/*
+ * It links current position with previous one.
+ * - CurrPosId is (CurrBytePos ^ (CurrBytePos>>32))
+ *   Since CurrBytePos grows monotonically and it is aligned to MAXALIGN,
+ *   CurrPosId correctly identifies CurrBytePos for at least 4*2^32 = 32GB of
+ *   WAL logs.
+ * - CurrPosHigh is (CurrBytePos>>32), it is stored for strong uniqueness check.
+ * - PrevSize is difference between CurrBytePos and PrevBytePos
+ */
+typedef struct
+{
+#if WAL_LINK_64
+	uint64		CurrPos;
+	uint64		PrevPos;
+#define WAL_PREV_EMPTY (~((uint64)0))
+#define WALLinkEmpty(l) ((l).PrevPos == WAL_PREV_EMPTY)
+#define WALLinkSamePos(a, b) ((a).CurrPos == (b).CurrPos)
+#define WALLinkCopyPrev(a, b) do {(a).PrevPos = (b).PrevPos;} while(0)
+#else
+	uint32		CurrPosId;
+	uint32		CurrPosHigh;
+	uint32		PrevSize;
+#define WALLinkEmpty(l) ((l).PrevSize == 0)
+#define WALLinkSamePos(a, b) ((a).CurrPosId == (b).CurrPosId && (a).CurrPosHigh == (b).CurrPosHigh)
+#define WALLinkCopyPrev(a, b) do {(a).PrevSize = (b).PrevSize;} while(0)
+#endif
+} WALPrevPosLinkVal;
+
+/*
+ * This is an element of lock-free hash-table.
+ * In 32 bit mode PrevSize's lowest bit is used as a lock, relying on fact it is MAXALIGN-ed.
+ * In 64 bit mode lock protocol is more complex.
+ */
+typedef struct
+{
+#if WAL_LINK_64
+	pg_atomic_uint64 CurrPos;
+	pg_atomic_uint64 PrevPos;
+#else
+	pg_atomic_uint32 CurrPosId;
+	uint32		CurrPosHigh;
+	pg_atomic_uint32 PrevSize;
+	uint32		pad;			/* to align to 16 bytes */
+#endif
+} WALPrevPosLink;
+
+StaticAssertDecl(sizeof(WALPrevPosLink) == 16, "WALPrevPosLink should be 16 bytes");
+
+#define PREV_LINKS_HASH_CAPA (NUM_XLOGINSERT_LOCKS * 2)
+StaticAssertDecl(!(PREV_LINKS_HASH_CAPA & (PREV_LINKS_HASH_CAPA - 1)),
+				 "PREV_LINKS_HASH_CAPA should be power of two");
+StaticAssertDecl(PREV_LINKS_HASH_CAPA < UINT16_MAX,
+				 "PREV_LINKS_HASH_CAPA is too large");
+
+/*-----------
+ * PREV_LINKS_HASH_STRATEGY - the way slots are chosen in hash table
+ *   1 - 4 positions h1,h1+1,h2,h2+2 - it guarantees at least 3 distinct points,
+ *     but may spread at 4 cache lines.
+ *   2 - 4 positions h,h^1,h^2,h^3 - 4 points in single cache line.
+ *   3 - 8 positions h1,h1^1,h1^2,h1^4,h2,h2^1,h2^2,h2^3 - 8 distinct points in
+ *     in two cache lines.
+ */
+#ifndef PREV_LINKS_HASH_STRATEGY
+#define PREV_LINKS_HASH_STRATEGY 1
+#endif
+
+#if PREV_LINKS_HASH_STRATEGY <= 2
+#define PREV_LINKS_LOOKUPS 4
+#else
+#define PREV_LINKS_LOOKUPS 8
+#endif
+
+struct WALPrevLinksLookups
+{
+	uint16		pos[PREV_LINKS_LOOKUPS];
+};
+
+#define SWAP_ONCE_IN 128
+
 /*
  * Session status of running backup, used for sanity checks in SQL-callable
  * functions to start and stop backups.
@@ -395,26 +485,18 @@ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
  */
 typedef struct XLogCtlInsert
 {
-	slock_t		insertpos_lck;	/* protects CurrBytePos and PrevBytePos */
-
 	/*
 	 * CurrBytePos is the end of reserved WAL. The next record will be
-	 * inserted at that position. PrevBytePos is the start position of the
-	 * previously inserted (or rather, reserved) record - it is copied to the
-	 * prev-link of the next record. These are stored as "usable byte
-	 * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).
+	 * inserted at that position.
+	 *
+	 * The start position of the previously inserted (or rather, reserved)
+	 * record (it is copied to the prev-link of the next record) will be
+	 * stored in PrevLinksHash.
+	 *
+	 * These are stored as "usable byte positions" rather than XLogRecPtrs
+	 * (see XLogBytePosToRecPtr()).
 	 */
-	uint64		CurrBytePos;
-	uint64		PrevBytePos;
-
-	/*
-	 * Make sure the above heavily-contended spinlock and byte positions are
-	 * on their own cache line. In particular, the RedoRecPtr and full page
-	 * write variables below should be on a different cache line. They are
-	 * read on every WAL insertion, but updated rarely, and we don't want
-	 * those reads to steal the cache line containing Curr/PrevBytePos.
-	 */
-	char		pad[PG_CACHE_LINE_SIZE];
+	pg_atomic_uint64 CurrBytePos pg_attribute_aligned(PG_CACHE_LINE_SIZE);
 
 	/*
 	 * fullPageWrites is the authoritative value used by all backends to
@@ -442,6 +524,31 @@ typedef struct XLogCtlInsert
 	 * WAL insertion locks.
 	 */
 	WALInsertLockPadded *WALInsertLocks;
+
+	/*
+	 * PrevLinksHash is a lock-free hash table based on Cuckoo algorithm.
+	 *
+	 * With default PREV_LINKS_HASH_STRATEGY == 1 it is mostly 4 way: for
+	 * every element computed two positions h1, h2, and neighbour h1+1 and
+	 * h2+2 are used as well. This way even on collision we have 3 distinct
+	 * position, which provide us ~75% fill rate without unsolvable cycles
+	 * (due to Cuckoo's theory). But chosen slots may be in 4 distinct
+	 * cache-lines.
+	 *
+	 * With PREV_LINKS_HASH_STRATEGY == 3 it takes two buckets 4 elements each
+	 * - 8 positions in total, but guaranteed to be in two cache lines. It
+	 * provides very high fill rate - upto 90%.
+	 *
+	 * With PREV_LINKS_HASH_STRATEGY == 2 it takes only one bucket with 4
+	 * elements. Strictly speaking it is not Cuckoo-hashing, but should work
+	 * for our case.
+	 *
+	 * Certainly, we rely on the fact we will delete elements with same speed
+	 * as we add them, and even unsolvable cycles will be destroyed soon by
+	 * concurrent deletions.
+	 */
+	WALPrevPosLink *PrevLinksHash;
+
 } XLogCtlInsert;
 
 /*
@@ -568,6 +675,9 @@ static XLogCtlData *XLogCtl = NULL;
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
 static WALInsertLockPadded *WALInsertLocks = NULL;
 
+/* same for XLogCtl->Insert.PrevLinksHash */
+static WALPrevPosLink *PrevLinksHash = NULL;
+
 /*
  * We maintain an image of pg_control in shared memory.
  */
@@ -700,6 +810,19 @@ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
 								XLogRecData *rdata,
 								XLogRecPtr StartPos, XLogRecPtr EndPos,
 								TimeLineID tli);
+
+static void WALPrevPosLinkValCompose(WALPrevPosLinkVal *val, XLogRecPtr StartPos, XLogRecPtr PrevPos);
+static void WALPrevPosLinkValGetPrev(WALPrevPosLinkVal val, XLogRecPtr *PrevPos);
+static void CalcCuckooPositions(WALPrevPosLinkVal linkval, struct WALPrevLinksLookups *pos);
+
+static bool WALPrevPosLinkInsert(WALPrevPosLink *link, WALPrevPosLinkVal val);
+static bool WALPrevPosLinkConsume(WALPrevPosLink *link, WALPrevPosLinkVal *val);
+static bool WALPrevPosLinkSwap(WALPrevPosLink *link, WALPrevPosLinkVal *val);
+static void LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos,
+							   XLogRecPtr *PrevPtr);
+static void LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec);
+static XLogRecPtr ReadInsertCurrBytePos(void);
+
 static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
@@ -1086,6 +1209,341 @@ XLogInsertRecord(XLogRecData *rdata,
 	return EndPos;
 }
 
+static pg_attribute_always_inline void
+WALPrevPosLinkValCompose(WALPrevPosLinkVal *val, XLogRecPtr StartPos, XLogRecPtr PrevPos)
+{
+#if WAL_LINK_64
+	val->CurrPos = StartPos;
+	val->PrevPos = PrevPos;
+#else
+	val->CurrPosHigh = StartPos >> 32;
+	val->CurrPosId = StartPos ^ val->CurrPosHigh;
+	val->PrevSize = StartPos - PrevPos;
+#endif
+}
+
+static pg_attribute_always_inline void
+WALPrevPosLinkValGetPrev(WALPrevPosLinkVal val, XLogRecPtr *PrevPos)
+{
+#if WAL_LINK_64
+	*PrevPos = val.PrevPos;
+#else
+	XLogRecPtr	StartPos = val.CurrPosHigh;
+
+	StartPos ^= (StartPos << 32) | val.CurrPosId;
+	*PrevPos = StartPos - val.PrevSize;
+#endif
+}
+
+static pg_attribute_always_inline void
+CalcCuckooPositions(WALPrevPosLinkVal linkval, struct WALPrevLinksLookups *pos)
+{
+	uint32		hash;
+#if PREV_LINKS_HASH_STRATEGY == 3
+	uint32		offset;
+#endif
+
+
+#if WAL_LINK_64
+	hash = murmurhash32(linkval.CurrPos ^ (linkval.CurrPos >> 32));
+#else
+	hash = murmurhash32(linkval.CurrPosId);
+#endif
+
+#if PREV_LINKS_HASH_STRATEGY == 1
+	pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
+	pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
+	pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
+#else
+	pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos->pos[1] = pos->pos[0] ^ 1;
+	pos->pos[2] = pos->pos[0] ^ 2;
+	pos->pos[3] = pos->pos[0] ^ 3;
+#if PREV_LINKS_HASH_STRATEGY == 3
+	/* use multiplication compute 0 <= offset < PREV_LINKS_HASH_CAPA-4 */
+	offset = (hash / PREV_LINKS_HASH_CAPA) * (PREV_LINKS_HASH_CAPA - 4);
+	offset /= UINT32_MAX / PREV_LINKS_HASH_CAPA + 1;
+	/* add start of next bucket */
+	offset += (pos->pos[0] | 3) + 1;
+	/* get position in strictly other bucket */
+	pos->pos[4] = offset % PREV_LINKS_HASH_CAPA;
+	pos->pos[5] = pos->pos[4] ^ 1;
+	pos->pos[6] = pos->pos[4] ^ 2;
+	pos->pos[7] = pos->pos[4] ^ 3;
+#endif
+#endif
+}
+
+/*
+ * Attempt to write into empty link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkInsert(WALPrevPosLink *link, WALPrevPosLinkVal val)
+{
+#if WAL_LINK_64
+	uint64		empty = WAL_PREV_EMPTY;
+
+	if (pg_atomic_read_u64(&link->PrevPos) != WAL_PREV_EMPTY)
+		return false;
+	if (!pg_atomic_compare_exchange_u64(&link->PrevPos, &empty, val.PrevPos))
+		return false;
+	/* we could ignore concurrent lock of CurrPos */
+	pg_atomic_write_u64(&link->CurrPos, val.CurrPos);
+	return true;
+#else
+	uint32		empty = 0;
+
+	/* first check it read-only */
+	if (pg_atomic_read_u32(&link->PrevSize) != 0)
+		return false;
+	if (!pg_atomic_compare_exchange_u32(&link->PrevSize, &empty, 1))
+		/* someone else occupied the entry */
+		return false;
+
+	pg_atomic_write_u32(&link->CurrPosId, val.CurrPosId);
+	link->CurrPosHigh = val.CurrPosHigh;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, val.PrevSize);
+	return true;
+#endif
+}
+
+/*
+ * Attempt to consume matched link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkConsume(WALPrevPosLink *link, WALPrevPosLinkVal *val)
+{
+#if WAL_LINK_64
+	uint64		oldCurr;
+
+	if (pg_atomic_read_u64(&link->CurrPos) != val->CurrPos)
+		return false;
+	/* lock against concurrent swapper */
+	oldCurr = pg_atomic_fetch_or_u64(&link->CurrPos, 1);
+	if (oldCurr & 1)
+		return false;			/* lock failed */
+	if (oldCurr != val->CurrPos)
+	{
+		/* link was swapped */
+		pg_atomic_write_u64(&link->CurrPos, oldCurr);
+		return false;
+	}
+	val->PrevPos = pg_atomic_read_u64(&link->PrevPos);
+	pg_atomic_write_u64(&link->PrevPos, WAL_PREV_EMPTY);
+
+	/*
+	 * concurrent inserter may already reuse this link, so we don't check
+	 * result of compare_exchange
+	 */
+	oldCurr |= 1;
+	pg_atomic_compare_exchange_u64(&link->CurrPos, &oldCurr, 0);
+	return true;
+#else
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId)
+		return false;
+
+	/* Try lock */
+	val->PrevSize = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (val->PrevSize & 1)
+		/* Lock failed */
+		return false;
+
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId ||
+		link->CurrPosHigh != val->CurrPosHigh)
+	{
+		/* unlock with old value */
+		pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+		return false;
+	}
+
+	pg_atomic_write_u32(&link->CurrPosId, 0);
+	link->CurrPosHigh = 0;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, 0);
+	return true;
+#endif
+}
+
+/*
+ * Attempt to swap the entry: remember the existing link and write ours.
+ * It can happen that we consume an empty entry. The caller will detect it
+ * by checking the remembered value.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkSwap(WALPrevPosLink *link, WALPrevPosLinkVal *val)
+{
+#if WAL_LINK_64
+	uint64		oldCurr;
+	uint64		oldPrev;
+
+	/* lock against concurrent swapper or consumer */
+	oldCurr = pg_atomic_fetch_or_u64(&link->CurrPos, 1);
+	if (oldCurr & 1)
+		return false;			/* lock failed */
+	if (oldCurr == 0)
+	{
+		/* link was empty */
+		oldPrev = WAL_PREV_EMPTY;
+		/* but concurrent inserter may concurrently insert */
+		if (!pg_atomic_compare_exchange_u64(&link->PrevPos, &oldPrev, val->PrevPos))
+			return false;		/* concurrent inserter won. It will overwrite
+								 * CurrPos */
+		/* this write acts as unlock */
+		pg_atomic_write_u64(&link->CurrPos, val->CurrPos);
+		val->CurrPos = 0;
+		val->PrevPos = WAL_PREV_EMPTY;
+		return true;
+	}
+	oldPrev = pg_atomic_read_u64(&link->PrevPos);
+	pg_atomic_write_u64(&link->PrevPos, val->PrevPos);
+	pg_write_barrier();
+	/* write acts as unlock */
+	pg_atomic_write_u64(&link->CurrPos, val->CurrPos);
+	val->CurrPos = oldCurr;
+	val->PrevPos = oldPrev;
+	return true;
+#else
+	uint32		oldPrev;
+	uint32		oldCurId;
+	uint32		oldCurHigh;
+
+	/* Attempt to lock entry against concurrent consumer or swapper */
+	oldPrev = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (oldPrev & 1)
+		/* Lock failed */
+		return false;
+
+	oldCurId = pg_atomic_read_u32(&link->CurrPosId);
+	oldCurHigh = link->CurrPosHigh;
+	pg_atomic_write_u32(&link->CurrPosId, val->CurrPosId);
+	link->CurrPosHigh = val->CurrPosHigh;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+
+	val->CurrPosId = oldCurId;
+	val->CurrPosHigh = oldCurHigh;
+	val->PrevSize = oldPrev;
+	return true;
+#endif
+}
+
+/*
+ * Write new link (EndPos, StartPos) and find PrevPtr for StartPos.
+ *
+ * Links are stored in a lock-free Cuckoo-based hash table.
+ * We use mostly-4-way Cuckoo hashing, which provides a high fill rate
+ * without hard cycle collisions. We also rely on concurrent consumers of
+ * existing entries, so cycles will be broken in the meantime.
+ *
+ * Cuckoo hashing relies on re-insertion for balancing, so we occasionally
+ * swap an entry and try to insert the swapped one instead of ours.
+ */
+static void
+LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos, XLogRecPtr *PrevPtr)
+{
+	SpinDelayStatus spin_stat;
+	WALPrevPosLinkVal lookup;
+	WALPrevPosLinkVal insert;
+	struct WALPrevLinksLookups lookup_pos;
+	struct WALPrevLinksLookups insert_pos;
+	uint32		i;
+	uint32		rand = 0;
+	bool		inserted = false;
+	bool		found = false;
+
+	/* pass StartPos second time to set PrevSize = 0 */
+	WALPrevPosLinkValCompose(&lookup, StartPos, StartPos);
+	WALPrevPosLinkValCompose(&insert, EndPos, StartPos);
+
+	CalcCuckooPositions(lookup, &lookup_pos);
+	CalcCuckooPositions(insert, &insert_pos);
+
+	init_local_spin_delay(&spin_stat);
+
+	while (!inserted || !found)
+	{
+		for (i = 0; !found && i < PREV_LINKS_LOOKUPS; i++)
+			found = WALPrevPosLinkConsume(&PrevLinksHash[lookup_pos.pos[i]], &lookup);
+
+		if (inserted)
+		{
+			/*
+			 * we may sleep only after we inserted our value, since other
+			 * backend waits for it
+			 */
+			perform_spin_delay(&spin_stat);
+			goto next;
+		}
+
+		for (i = 0; !inserted && i < PREV_LINKS_LOOKUPS; i++)
+			inserted = WALPrevPosLinkInsert(&PrevLinksHash[insert_pos.pos[i]], insert);
+
+		if (inserted)
+			goto next;
+
+		rand = pg_prng_uint32(&pg_global_prng_state);
+		if (rand % SWAP_ONCE_IN != 0)
+			goto next;
+
+		i = rand / SWAP_ONCE_IN % PREV_LINKS_LOOKUPS;
+		if (!WALPrevPosLinkSwap(&PrevLinksHash[insert_pos.pos[i]], &insert))
+			goto next;
+
+		if (WALLinkEmpty(insert))
+			/* Lucky case: entry become empty and we inserted into */
+			inserted = true;
+		else if (WALLinkSamePos(lookup, insert))
+		{
+			/*
+			 * We occasionally replaced entry we looked for. No need to insert
+			 * it again.
+			 */
+			inserted = true;
+			Assert(!found);
+			found = true;
+			WALLinkCopyPrev(lookup, insert);
+			break;
+		}
+		else
+			CalcCuckooPositions(insert, &insert_pos);
+
+next:
+		pg_spin_delay();
+		pg_read_barrier();
+	}
+
+	WALPrevPosLinkValGetPrev(lookup, PrevPtr);
+}
+
+static pg_attribute_always_inline void
+LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec)
+{
+	WALPrevPosLinkVal insert;
+	struct WALPrevLinksLookups insert_pos;
+
+	WALPrevPosLinkValCompose(&insert, EndOfLog, LastRec);
+	CalcCuckooPositions(insert, &insert_pos);
+#if WAL_LINK_64
+	pg_atomic_write_u64(&PrevLinksHash[insert_pos.pos[0]].CurrPos, insert.CurrPos);
+	pg_atomic_write_u64(&PrevLinksHash[insert_pos.pos[0]].PrevPos, insert.PrevPos);
+#else
+	pg_atomic_write_u32(&PrevLinksHash[insert_pos.pos[0]].CurrPosId, insert.CurrPosId);
+	PrevLinksHash[insert_pos.pos[0]].CurrPosHigh = insert.CurrPosHigh;
+	pg_atomic_write_u32(&PrevLinksHash[insert_pos.pos[0]].PrevSize, insert.PrevSize);
+#endif
+}
+
+static pg_attribute_always_inline XLogRecPtr
+ReadInsertCurrBytePos(void)
+{
+	return pg_atomic_read_u64(&XLogCtl->Insert.CurrBytePos);
+}
+
 /*
  * Reserves the right amount of space for a record of given size from the WAL.
  * *StartPos is set to the beginning of the reserved section, *EndPos to
@@ -1118,25 +1576,9 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	/* All (non xlog-switch) records should contain data. */
 	Assert(size > SizeOfXLogRecord);
 
-	/*
-	 * The duration the spinlock needs to be held is minimized by minimizing
-	 * the calculations that have to be done while holding the lock. The
-	 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
-	 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
-	 * page headers. The mapping between "usable" byte positions and physical
-	 * positions (XLogRecPtrs) can be done outside the locked region, and
-	 * because the usable byte position doesn't include any headers, reserving
-	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
-	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
-
-	startbytepos = Insert->CurrBytePos;
+	startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
-
-	SpinLockRelease(&Insert->insertpos_lck);
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
@@ -1172,26 +1614,24 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 	uint32		segleft;
 
 	/*
-	 * These calculations are a bit heavy-weight to be done while holding a
-	 * spinlock, but since we're holding all the WAL insertion locks, there
-	 * are no other inserters competing for it. GetXLogInsertRecPtr() does
-	 * compete for it, but that's not called very frequently.
+	 * Currently ReserveXLogInsertLocation is protected with exclusive
+	 * insertion lock, so there is no contention against CurrBytePos, But we
+	 * still do CAS loop for being uniform.
+	 *
+	 * Probably we'll get rid of exclusive lock in a future.
 	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
 
-	startbytepos = Insert->CurrBytePos;
+repeat:
+	startbytepos = pg_atomic_read_u64(&Insert->CurrBytePos);
 
 	ptr = XLogBytePosToEndRecPtr(startbytepos);
 	if (XLogSegmentOffset(ptr, wal_segment_size) == 0)
 	{
-		SpinLockRelease(&Insert->insertpos_lck);
 		*EndPos = *StartPos = ptr;
 		return false;
 	}
 
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
 
@@ -1202,10 +1642,19 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 		*EndPos += segleft;
 		endbytepos = XLogRecPtrToBytePos(*EndPos);
 	}
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
 
-	SpinLockRelease(&Insert->insertpos_lck);
+	if (!pg_atomic_compare_exchange_u64(&Insert->CurrBytePos,
+										&startbytepos,
+										endbytepos))
+	{
+		/*
+		 * Don't use spin delay here: perform_spin_delay primary case is for
+		 * solving single core contention. But on single core we will succeed
+		 * on the next attempt.
+		 */
+		goto repeat;
+	}
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
 
@@ -1507,7 +1956,6 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	XLogRecPtr	inserted;
 	XLogRecPtr	reservedUpto;
 	XLogRecPtr	finishedUpto;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	int			i;
 
 	if (MyProc == NULL)
@@ -1522,9 +1970,7 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 		return inserted;
 
 	/* Read the current insert position */
-	SpinLockAcquire(&Insert->insertpos_lck);
-	bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
+	bytepos = ReadInsertCurrBytePos();
 	reservedUpto = XLogBytePosToEndRecPtr(bytepos);
 
 	/*
@@ -4898,6 +5344,8 @@ XLOGShmemSize(void)
 
 	/* WAL insertion locks, plus alignment */
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
+	/* prevlinkshash, abuses alignment of WAL insertion locks. */
+	size = add_size(size, mul_size(sizeof(WALPrevPosLink), PREV_LINKS_HASH_CAPA));
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(pg_atomic_uint64), XLOGbuffers));
 	/* extra alignment padding for XLOG I/O buffers */
@@ -4999,6 +5447,9 @@ XLOGShmemInit(void)
 		WALInsertLocks[i].l.lastImportantAt = InvalidXLogRecPtr;
 	}
 
+	PrevLinksHash = XLogCtl->Insert.PrevLinksHash = (WALPrevPosLink *) allocptr;
+	allocptr += sizeof(WALPrevPosLink) * PREV_LINKS_HASH_CAPA;
+
 	/*
 	 * Align the start of the page buffers to a full xlog block size boundary.
 	 * This simplifies some calculations in XLOG insertion. It is also
@@ -5017,12 +5468,24 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
-	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
 	pg_atomic_init_u64(&XLogCtl->logInsertResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
+
+	pg_atomic_init_u64(&XLogCtl->Insert.CurrBytePos, 0);
+
+	for (i = 0; i < PREV_LINKS_HASH_CAPA; i++)
+	{
+#if WAL_LINK_64
+		pg_atomic_init_u64(&PrevLinksHash[i].CurrPos, 0);
+		pg_atomic_init_u64(&PrevLinksHash[i].PrevPos, WAL_PREV_EMPTY);
+#else
+		pg_atomic_init_u32(&PrevLinksHash[i].CurrPosId, 0);
+		pg_atomic_init_u32(&PrevLinksHash[i].PrevSize, 0);
+#endif
+	}
 }
 
 /*
@@ -6018,8 +6481,13 @@ StartupXLOG(void)
 	 * previous incarnation.
 	 */
 	Insert = &XLogCtl->Insert;
-	Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
-	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+	{
+		XLogRecPtr	endOfLog = XLogRecPtrToBytePos(EndOfLog);
+		XLogRecPtr	lastRec = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
+
+		pg_atomic_write_u64(&Insert->CurrBytePos, endOfLog);
+		LinkStartPrevPos(endOfLog, lastRec);
+	}
 
 	/*
 	 * Tricky point here: lastPage contains the *last* block that the LastRec
@@ -7005,7 +7473,7 @@ CreateCheckPoint(int flags)
 
 	if (shutdown)
 	{
-		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 
 		/*
 		 * Compute new REDO record ptr = location of next XLOG record.
@@ -9434,14 +9902,7 @@ register_persistent_abort_backup_handler(void)
 XLogRecPtr
 GetXLogInsertRecPtr(void)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	uint64		current_bytepos;
-
-	SpinLockAcquire(&Insert->insertpos_lck);
-	current_bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
-
-	return XLogBytePosToRecPtr(current_bytepos);
+	return XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 }
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 668bddbfcd7..28001598130 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3122,6 +3122,8 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALPrevPosLink
+WALPrevPosLinkVal
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.43.0

#28Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Japin Li (#23)
Re: [RFC] Lock-free XLog Reservation from WAL

22.01.2025 10:54, Japin Li wrote:

On Wed, 22 Jan 2025 at 10:25, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

22.01.2025 09:09, Japin Li wrote:

Hi, Yura Sokolov
Thanks for updating the patch.
I tested the v2 patch using BenchmarkSQL with 1000 warehouses, and here is the
tpmC result:
case               | min        | avg        | max
--------------------+------------+------------+--------------
master (patched)    | 988,461.89 | 994,916.50 | 1,000,362.40
master (44b61efb79) | 857,028.07 | 863,174.59 | 873,856.92
The patch provides a significant improvement.
I just looked through the patch, here are some comments.
1.
The v2 patch can't be applied cleanly.
Applying: Lock-free XLog Reservation using lock-free hash-table
.git/rebase-apply/patch:33: trailing whitespace.
.git/rebase-apply/patch:37: space before tab in indent.
{
.git/rebase-apply/patch:38: space before tab in indent.
int                     i;
.git/rebase-apply/patch:39: trailing whitespace.
.git/rebase-apply/patch:46: space before tab in indent.
LWLockReleaseClearVar(&WALInsertLocks[i].l.lock,
warning: squelched 4 whitespace errors
warning: 9 lines add whitespace errors.
2.
And there is a typo:
+     * PrevLinksHash is a lock-free hash table based on Cuckoo
algorith. It is
+     * mostly 4 way: for every element computed two positions h1, h2, and
s/algorith/algorithm/g

Hi, Japin

Thank you a lot for the measurements and comments.

May I ask you to compare not only against master, but also against a straight
increase of NUM_XLOGINSERT_LOCKS to 128?
This way the profit from the added complexity will be more obvious: does
it pay for itself or not?

The above test already increases NUM_XLOGINSERT_LOCKS to 64; I will try 128
and update the result later.

Oh, I see: I forgot that I removed the increase of NUM_XLOGINSERT_LOCKS from
the v2 patch.

#29Japin Li
japinli@hotmail.com
In reply to: Yura Sokolov (#27)
Re: [RFC] Lock-free XLog Reservation from WAL

On Wed, 22 Jan 2025 at 17:02, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I believe, I know why it happens: I was in hurry making v2 by
cherry-picking internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to add modulo
PREV_LINKS_HASH_CAPA.

Here's the fix:

pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-       pos->pos[1] = pos->pos[0] + 1;
+       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-       pos->pos[3] = pos->pos[2] + 2;
+       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;

Any way, here's v3:
- excess file "v0-0001-Increase..." removed. I believe it was source
of white-space apply warnings.
- this mistake fixed
- more clear slots strategies + "8 positions in two cache-lines" strategy.

You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and see
if it affects measurably.

Thanks for your quick fix. I will retest it tomorrow.

--
Regards,
Japin Li

#30Japin Li
japinli@hotmail.com
In reply to: Japin Li (#29)
Re: [RFC] Lock-free XLog Reservation from WAL

On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli@hotmail.com> wrote:

On Wed, 22 Jan 2025 at 17:02, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I believe, I know why it happens: I was in hurry making v2 by
cherry-picking internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to add modulo
PREV_LINKS_HASH_CAPA.

Here's the fix:

pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-       pos->pos[1] = pos->pos[0] + 1;
+       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-       pos->pos[3] = pos->pos[2] + 2;
+       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;

Any way, here's v3:
- excess file "v0-0001-Increase..." removed. I believe it was source
of white-space apply warnings.
- this mistake fixed
- more clear slots strategies + "8 positions in two cache-lines" strategy.

You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and see
if it affects measurably.

Thanks for your quick fixing. I will retest it tomorrow.

Hi, Yura Sokolov

Here is my test result of the v3 patch:

| case                          | min        | avg        | max        |
|-------------------------------+------------+------------+------------|
| master (44b61efb79)           | 865,743.55 | 871,237.40 | 874,492.59 |
| v3                            | 857,020.58 | 860,180.11 | 864,355.00 |
| v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
| v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |

It seems there are some performance decreases :( or something I missed?

--
Regards,
Japin Li

#31Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Japin Li (#30)
Re: [RFC] Lock-free XLog Reservation from WAL

23.01.2025 11:46, Japin Li wrote:

On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli@hotmail.com> wrote:

On Wed, 22 Jan 2025 at 17:02, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I believe, I know why it happens: I was in hurry making v2 by
cherry-picking internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to add modulo
PREV_LINKS_HASH_CAPA.

Here's the fix:

pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-       pos->pos[1] = pos->pos[0] + 1;
+       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-       pos->pos[3] = pos->pos[2] + 2;
+       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;

Any way, here's v3:
- excess file "v0-0001-Increase..." removed. I believe it was source
of white-space apply warnings.
- this mistake fixed
- more clear slots strategies + "8 positions in two cache-lines" strategy.

You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and see
if it affects measurably.

Thanks for your quick fixing. I will retest it tomorrow.

Hi, Yura Sokolov

Here is my test result of the v3 patch:

| case | min | avg | max |
|-------------------------------+------------+------------+------------|
| master (44b61efb79) | 865,743.55 | 871,237.40 | 874,492.59 |
| v3 | 857,020.58 | 860,180.11 | 864,355.00 |
| v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
| v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |

It seems there are some performance decreases :( or something I missed?

Hi, Japin.
(Excuse me for duplicating the message; I found I sent it only to you the
first time.)

v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
With only 8 in-progress inserters, a spin-lock is certainly better than any
more complex solution.

You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
"master + NUM_XLOGINSERT_LOCKS=64 + v3".

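For reference, the "NUM_XLOGINSERT_LOCKS=64" build is just master with this one
constant bumped in src/backend/access/transam/xlog.c; a minimal sketch of the
change (not an exact patch from this thread):

-#define NUM_XLOGINSERT_LOCKS	8
+#define NUM_XLOGINSERT_LOCKS	64
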
And even this way I don't claim "Lock-free reservation" gives any profit.

That is why your benchmarking is very valuable! It could answer whether we
need such a not-small patch, or whether there is no real problem at all.

----
regards
Yura Sokolov aka funny-falcon

#32Japin Li
japinli@hotmail.com
In reply to: Yura Sokolov (#31)
Re: [RFC] Lock-free XLog Reservation from WAL

On Thu, 23 Jan 2025 at 15:03, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

23.01.2025 11:46, Japin Li wrote:

On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli@hotmail.com> wrote:

On Wed, 22 Jan 2025 at 17:02, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I believe, I know why it happens: I was in hurry making v2 by
cherry-picking internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to add modulo
PREV_LINKS_HASH_CAPA.

Here's the fix:

pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-       pos->pos[1] = pos->pos[0] + 1;
+       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-       pos->pos[3] = pos->pos[2] + 2;
+       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;

Any way, here's v3:
- excess file "v0-0001-Increase..." removed. I believe it was source
of white-space apply warnings.
- this mistake fixed
- more clear slots strategies + "8 positions in two cache-lines" strategy.

You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and see
if it affects measurably.

Thanks for your quick fixing. I will retest it tomorrow.

Hi, Yura Sokolov
Here is my test result of the v3 patch:
| case | min | avg | max |
|-------------------------------+------------+------------+------------|
| master (44b61efb79) | 865,743.55 | 871,237.40 | 874,492.59 |
| v3 | 857,020.58 | 860,180.11 | 864,355.00 |
| v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
| v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |
It seems there are some performance decreases :( or something I
missed?

Hi, Japin.
(Excuse me for duplicating message, I found I sent it only to you
first time).

Never mind!

v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
With only 8 in-progress inserters spin-lock is certainly better than any
more complex solution.

You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
"master + NUM_XLOGINSERT_LOCKS=64 + v3".

And even this way I don't claim "Lock-free reservation" gives any profit.

That is why your benchmarking is very valuable! It could answer, does
we need such not-small patch, or there is no real problem at all?

Thanks for your explanation. I will test it based on [1].

[1]: /messages/by-id/ME0P300MB0445471ABC855D0FA6FF0CA5B6E02@ME0P300MB0445.AUSP300.PROD.OUTLOOK.COM

--
Regards,
Japin Li

#33Japin Li
japinli@hotmail.com
In reply to: Japin Li (#32)
Re: [RFC] Lock-free XLog Reservation from WAL

On Thu, 23 Jan 2025 at 21:44, Japin Li <japinli@hotmail.com> wrote:

On Thu, 23 Jan 2025 at 15:03, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

23.01.2025 11:46, Japin Li wrote:

On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli@hotmail.com> wrote:

On Wed, 22 Jan 2025 at 17:02, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I believe, I know why it happens: I was in hurry making v2 by
cherry-picking internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to add modulo
PREV_LINKS_HASH_CAPA.

Here's the fix:

pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-       pos->pos[1] = pos->pos[0] + 1;
+       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-       pos->pos[3] = pos->pos[2] + 2;
+       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;

Any way, here's v3:
- excess file "v0-0001-Increase..." removed. I believe it was source
of white-space apply warnings.
- this mistake fixed
- more clear slots strategies + "8 positions in two cache-lines" strategy.

You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and see
if it affects measurably.

Thanks for your quick fixing. I will retest it tomorrow.

Hi, Yura Sokolov
Here is my test result of the v3 patch:
| case | min | avg | max |
|-------------------------------+------------+------------+------------|
| master (44b61efb79) | 865,743.55 | 871,237.40 | 874,492.59 |
| v3 | 857,020.58 | 860,180.11 | 864,355.00 |
| v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
| v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |
It seems there are some performance decreases :( or something I
missed?

Hi, Japin.
(Excuse me for duplicating message, I found I sent it only to you
first time).

Never mind!

v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
With only 8 in-progress inserters spin-lock is certainly better than any
more complex solution.

You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
"master + NUM_XLOGINSERT_LOCKS=64 + v3".

And even this way I don't claim "Lock-free reservation" gives any profit.

That is why your benchmarking is very valuable! It could answer, does
we need such not-small patch, or there is no real problem at all?

Hi, Yura Sokolov

Here are the test results comparing increased NUM_XLOGINSERT_LOCKS settings with and without the v3 patch.

| case                  | min          | avg          | max          | rate% |
|-----------------------+--------------+--------------+--------------+-------|
| master (4108440)      |   891,225.77 |   904,868.75 |   913,708.17 |       |
| lock 64               | 1,007,716.95 | 1,012,013.22 | 1,018,674.00 | 11.84 |
| lock 64 attempt 1     | 1,016,716.07 | 1,017,735.55 | 1,019,328.36 | 12.47 |
| lock 64 attempt 2     | 1,015,328.31 | 1,018,147.74 | 1,021,513.14 | 12.52 |
| lock 128              | 1,010,147.38 | 1,014,128.11 | 1,018,672.01 | 12.07 |
| lock 128 attempt 1    | 1,018,154.79 | 1,023,348.35 | 1,031,365.42 | 13.09 |
| lock 128 attempt 2    | 1,013,245.56 | 1,018,984.78 | 1,023,696.00 | 12.61 |
| lock 64 v3            | 1,010,893.30 | 1,022,787.25 | 1,029,200.26 | 13.03 |
| lock 64 attempt 1 v3  | 1,014,961.21 | 1,019,745.09 | 1,025,511.62 | 12.70 |
| lock 64 attempt 2 v3  | 1,015,690.73 | 1,018,365.46 | 1,020,200.57 | 12.54 |
| lock 128 v3           | 1,012,653.14 | 1,013,637.09 | 1,014,358.69 | 12.02 |
| lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 | 1,024,597.15 | 12.38 |
| lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 | 1,027,855.90 | 13.24 |

--
Regards,
Japin Li

#34Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Japin Li (#33)
1 attachment(s)
Re: [RFC] Lock-free XLog Reservation from WAL

24.01.2025 12:07, Japin Li пишет:

On Thu, 23 Jan 2025 at 21:44, Japin Li <japinli@hotmail.com> wrote:

On Thu, 23 Jan 2025 at 15:03, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

23.01.2025 11:46, Japin Li wrote:

On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli@hotmail.com> wrote:

On Wed, 22 Jan 2025 at 17:02, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I believe, I know why it happens: I was in hurry making v2 by
cherry-picking internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to add modulo
PREV_LINKS_HASH_CAPA.

Here's the fix:

pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-       pos->pos[1] = pos->pos[0] + 1;
+       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-       pos->pos[3] = pos->pos[2] + 2;
+       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;

Any way, here's v3:
- excess file "v0-0001-Increase..." removed. I believe it was source
of white-space apply warnings.
- this mistake fixed
- more clear slots strategies + "8 positions in two cache-lines" strategy.

You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and see
if it affects measurably.

Thanks for your quick fixing. I will retest it tomorrow.

Hi, Yura Sokolov
Here is my test result of the v3 patch:
| case | min | avg | max |
|-------------------------------+------------+------------+------------|
| master (44b61efb79) | 865,743.55 | 871,237.40 | 874,492.59 |
| v3 | 857,020.58 | 860,180.11 | 864,355.00 |
| v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
| v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |
It seems there are some performance decreases :( or something I
missed?

Hi, Japin.
(Excuse me for duplicating message, I found I sent it only to you
first time).

Never mind!

v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
With only 8 in-progress inserters spin-lock is certainly better than any
more complex solution.

You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
"master + NUM_XLOGINSERT_LOCKS=64 + v3".

And even this way I don't claim "Lock-free reservation" gives any profit.

That is why your benchmarking is very valuable! It could answer, does
we need such not-small patch, or there is no real problem at all?

Hi, Yura Sokolov

Here is the test result compared with NUM_XLOGINSERT_LOCKS and the v3 patch.

| case | min | avg | max | rate% |
|-----------------------+--------------+--------------+--------------+-------|
| master (4108440) | 891,225.77 | 904,868.75 | 913,708.17 | |
| lock 64 | 1,007,716.95 | 1,012,013.22 | 1,018,674.00 | 11.84 |
| lock 64 attempt 1 | 1,016,716.07 | 1,017,735.55 | 1,019,328.36 | 12.47 |
| lock 64 attempt 2 | 1,015,328.31 | 1,018,147.74 | 1,021,513.14 | 12.52 |
| lock 128 | 1,010,147.38 | 1,014,128.11 | 1,018,672.01 | 12.07 |
| lock 128 attempt 1 | 1,018,154.79 | 1,023,348.35 | 1,031,365.42 | 13.09 |
| lock 128 attempt 2 | 1,013,245.56 | 1,018,984.78 | 1,023,696.00 | 12.61 |
| lock 64 v3 | 1,010,893.30 | 1,022,787.25 | 1,029,200.26 | 13.03 |
| lock 64 attempt 1 v3 | 1,014,961.21 | 1,019,745.09 | 1,025,511.62 | 12.70 |
| lock 64 attempt 2 v3 | 1,015,690.73 | 1,018,365.46 | 1,020,200.57 | 12.54 |
| lock 128 v3 | 1,012,653.14 | 1,013,637.09 | 1,014,358.69 | 12.02 |
| lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 | 1,024,597.15 | 12.38 |
| lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 | 1,027,855.90 | 13.24 |

Sorry for the pause; it was my birthday, so I was on a short vacation.

So, in total:
- increasing NUM_XLOGINSERT_LOCKS to 64 certainly helps
- additional lock attempts seem to help a bit in this benchmark,
but they help more in another (rather synthetic) benchmark [1]
- my version of lock-free reservation seems to help a bit when
applied alone, but looks strange in conjunction with additional
lock attempts.

I don't see that the small improvement from my version of lock-free reservation
(1.1% = 1023/1012, i.e. the "lock 64 v3" vs "lock 64" average tpmC in thousands)
pays for its complexity at the moment.

Probably, when other places are optimized/improved, it will pay
off more.

Or probably Zhiguo Zhou's version will perform better.

I think we could measure the theoretical benefit by completely ignoring
the fill of xl_prev. I've attached the patch "Dumb-lock-free..." so you can
measure that. It passes almost all "recovery" tests, though it fails two
that strictly depend on xl_prev.

[1]: /messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447@postgrespro.ru

------

regards
Yura

Attachments:

Dumb-lock-free-XLog-Reservation-without-xl_prev.patchtext/x-patch; charset=UTF-8; name=Dumb-lock-free-XLog-Reservation-without-xl_prev.patchDownload
From d8b1e82bab1ee51b416956b241e824f0b1d125e8 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Sun, 19 Jan 2025 17:40:28 +0300
Subject: [PATCH] Dumb lock-free XLog Reservation without xl_prev

---
 src/backend/access/transam/xlog.c       | 106 ++++++++++--------------
 src/backend/access/transam/xlogreader.c |   2 +-
 2 files changed, 47 insertions(+), 61 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..d45e6408268 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -395,17 +395,18 @@ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
  */
 typedef struct XLogCtlInsert
 {
-	slock_t		insertpos_lck;	/* protects CurrBytePos and PrevBytePos */
-
 	/*
 	 * CurrBytePos is the end of reserved WAL. The next record will be
-	 * inserted at that position. PrevBytePos is the start position of the
-	 * previously inserted (or rather, reserved) record - it is copied to the
-	 * prev-link of the next record. These are stored as "usable byte
-	 * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).
+	 * inserted at that position.
+	 *
+	 * The start position of the previously inserted (or rather, reserved)
+	 * record (it is copied to the prev-link of the next record) will be
+	 * stored in PrevLinksHash.
+	 *
+	 * These are stored as "usable byte positions" rather than XLogRecPtrs
+	 * (see XLogBytePosToRecPtr()).
 	 */
-	uint64		CurrBytePos;
-	uint64		PrevBytePos;
+	pg_atomic_uint64 CurrBytePos;
 
 	/*
 	 * Make sure the above heavily-contended spinlock and byte positions are
@@ -715,6 +716,12 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static inline XLogRecPtr
+ReadInsertCurrBytePos(void)
+{
+	return pg_atomic_read_u64(&XLogCtl->Insert.CurrBytePos);
+}
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -1111,36 +1118,18 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	uint64		startbytepos;
 	uint64		endbytepos;
-	uint64		prevbytepos;
 
 	size = MAXALIGN(size);
 
 	/* All (non xlog-switch) records should contain data. */
 	Assert(size > SizeOfXLogRecord);
 
-	/*
-	 * The duration the spinlock needs to be held is minimized by minimizing
-	 * the calculations that have to be done while holding the lock. The
-	 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
-	 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
-	 * page headers. The mapping between "usable" byte positions and physical
-	 * positions (XLogRecPtrs) can be done outside the locked region, and
-	 * because the usable byte position doesn't include any headers, reserving
-	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
-	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
-
-	startbytepos = Insert->CurrBytePos;
+	startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
-
-	SpinLockRelease(&Insert->insertpos_lck);
 
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
-	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
+	*PrevPtr = *StartPos - 1;
 
 	/*
 	 * Check that the conversions between "usable byte positions" and
@@ -1148,7 +1137,6 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	 */
 	Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
 	Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
-	Assert(XLogRecPtrToBytePos(*PrevPtr) == prevbytepos);
 }
 
 /*
@@ -1166,32 +1154,29 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	uint64		startbytepos;
 	uint64		endbytepos;
-	uint64		prevbytepos;
 	uint32		size = MAXALIGN(SizeOfXLogRecord);
 	XLogRecPtr	ptr;
 	uint32		segleft;
 
 	/*
-	 * These calculations are a bit heavy-weight to be done while holding a
-	 * spinlock, but since we're holding all the WAL insertion locks, there
-	 * are no other inserters competing for it. GetXLogInsertRecPtr() does
-	 * compete for it, but that's not called very frequently.
+	 * Currently ReserveXLogInsertLocation is protected with exclusive
+	 * insertion lock, so there is no contention against CurrBytePos, But we
+	 * still do CAS loop for being uniform.
+	 *
+	 * Probably we'll get rid of exclusive lock in a future.
 	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
 
-	startbytepos = Insert->CurrBytePos;
+repeat:
+	startbytepos = pg_atomic_read_u64(&Insert->CurrBytePos);
 
 	ptr = XLogBytePosToEndRecPtr(startbytepos);
 	if (XLogSegmentOffset(ptr, wal_segment_size) == 0)
 	{
-		SpinLockRelease(&Insert->insertpos_lck);
 		*EndPos = *StartPos = ptr;
 		return false;
 	}
 
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
 
@@ -1202,17 +1187,24 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 		*EndPos += segleft;
 		endbytepos = XLogRecPtrToBytePos(*EndPos);
 	}
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
 
-	SpinLockRelease(&Insert->insertpos_lck);
+	if (!pg_atomic_compare_exchange_u64(&Insert->CurrBytePos,
+										&startbytepos,
+										endbytepos))
+	{
+		/*
+		 * Don't use spin delay here: perform_spin_delay primary case is for
+		 * solving single core contention. But on single core we will succeed
+		 * on the next attempt.
+		 */
+		goto repeat;
+	}
 
-	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
+	*PrevPtr = *StartPos - 1;
 
 	Assert(XLogSegmentOffset(*EndPos, wal_segment_size) == 0);
 	Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
 	Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
-	Assert(XLogRecPtrToBytePos(*PrevPtr) == prevbytepos);
 
 	return true;
 }
@@ -1507,7 +1499,6 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	XLogRecPtr	inserted;
 	XLogRecPtr	reservedUpto;
 	XLogRecPtr	finishedUpto;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	int			i;
 
 	if (MyProc == NULL)
@@ -1522,9 +1513,7 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 		return inserted;
 
 	/* Read the current insert position */
-	SpinLockAcquire(&Insert->insertpos_lck);
-	bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
+	bytepos = ReadInsertCurrBytePos();
 	reservedUpto = XLogBytePosToEndRecPtr(bytepos);
 
 	/*
@@ -5017,12 +5006,13 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
-	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
 	pg_atomic_init_u64(&XLogCtl->logInsertResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
+
+	pg_atomic_init_u64(&XLogCtl->Insert.CurrBytePos, 0);
 }
 
 /*
@@ -6018,8 +6008,11 @@ StartupXLOG(void)
 	 * previous incarnation.
 	 */
 	Insert = &XLogCtl->Insert;
-	Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
-	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+	{
+		XLogRecPtr	endOfLog = XLogRecPtrToBytePos(EndOfLog);
+
+		pg_atomic_write_u64(&Insert->CurrBytePos, endOfLog);
+	}
 
 	/*
 	 * Tricky point here: lastPage contains the *last* block that the LastRec
@@ -7005,7 +6998,7 @@ CreateCheckPoint(int flags)
 
 	if (shutdown)
 	{
-		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 
 		/*
 		 * Compute new REDO record ptr = location of next XLOG record.
@@ -9434,14 +9427,7 @@ register_persistent_abort_backup_handler(void)
 XLogRecPtr
 GetXLogInsertRecPtr(void)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	uint64		current_bytepos;
-
-	SpinLockAcquire(&Insert->insertpos_lck);
-	current_bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
-
-	return XLogBytePosToRecPtr(current_bytepos);
+	return XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 }
 
 /*
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3596af06172..0851b62af93 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1165,7 +1165,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 		 * check guards against torn WAL pages where a stale but valid-looking
 		 * WAL record starts on a sector boundary.
 		 */
-		if (record->xl_prev != PrevRecPtr)
+		if (false && record->xl_prev != PrevRecPtr)
 		{
 			report_invalid_record(state,
 								  "record with incorrect prev-link %X/%X at %X/%X",
-- 
2.43.0

#35Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Japin Li (#33)
1 attachment(s)
Re: [RFC] Lock-free XLog Reservation from WAL

24.01.2025 12:07, Japin Li wrote:

On Thu, 23 Jan 2025 at 21:44, Japin Li <japinli@hotmail.com> wrote:

On Thu, 23 Jan 2025 at 15:03, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

23.01.2025 11:46, Japin Li wrote:

On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli@hotmail.com> wrote:

On Wed, 22 Jan 2025 at 17:02, Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I believe, I know why it happens: I was in hurry making v2 by
cherry-picking internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to add modulo
PREV_LINKS_HASH_CAPA.

Here's the fix:

pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-       pos->pos[1] = pos->pos[0] + 1;
+       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-       pos->pos[3] = pos->pos[2] + 2;
+       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;

Any way, here's v3:
- excess file "v0-0001-Increase..." removed. I believe it was source
of white-space apply warnings.
- this mistake fixed
- more clear slots strategies + "8 positions in two cache-lines" strategy.

You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and see
if it affects measurably.

Thanks for your quick fixing. I will retest it tomorrow.

Hi, Yura Sokolov
Here is my test result of the v3 patch:
| case | min | avg | max |
|-------------------------------+------------+------------+------------|
| master (44b61efb79) | 865,743.55 | 871,237.40 | 874,492.59 |
| v3 | 857,020.58 | 860,180.11 | 864,355.00 |
| v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
| v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |
It seems there are some performance decreases :( or something I
missed?

Hi, Japin.
(Excuse me for duplicating message, I found I sent it only to you
first time).

Never mind!

v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
With only 8 in-progress inserters spin-lock is certainly better than any
more complex solution.

You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
"master + NUM_XLOGINSERT_LOCKS=64 + v3".

And even this way I don't claim "Lock-free reservation" gives any profit.

That is why your benchmarking is very valuable! It could answer, does
we need such not-small patch, or there is no real problem at all?

Hi, Yura Sokolov

Here is the test result compared with NUM_XLOGINSERT_LOCKS and the v3 patch.

| case | min | avg | max | rate% |
|-----------------------+--------------+--------------+--------------+-------|
| master (4108440) | 891,225.77 | 904,868.75 | 913,708.17 | |
| lock 64 | 1,007,716.95 | 1,012,013.22 | 1,018,674.00 | 11.84 |
| lock 64 attempt 1 | 1,016,716.07 | 1,017,735.55 | 1,019,328.36 | 12.47 |
| lock 64 attempt 2 | 1,015,328.31 | 1,018,147.74 | 1,021,513.14 | 12.52 |
| lock 128 | 1,010,147.38 | 1,014,128.11 | 1,018,672.01 | 12.07 |
| lock 128 attempt 1 | 1,018,154.79 | 1,023,348.35 | 1,031,365.42 | 13.09 |
| lock 128 attempt 2 | 1,013,245.56 | 1,018,984.78 | 1,023,696.00 | 12.61 |
| lock 64 v3 | 1,010,893.30 | 1,022,787.25 | 1,029,200.26 | 13.03 |
| lock 64 attempt 1 v3 | 1,014,961.21 | 1,019,745.09 | 1,025,511.62 | 12.70 |
| lock 64 attempt 2 v3 | 1,015,690.73 | 1,018,365.46 | 1,020,200.57 | 12.54 |
| lock 128 v3 | 1,012,653.14 | 1,013,637.09 | 1,014,358.69 | 12.02 |
| lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 | 1,024,597.15 | 12.38 |
| lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 | 1,027,855.90 | 13.24 |

By the way, I think I made a mistake by removing the "pad" field in
XLogCtlInsert, and it could affect the results in a bad way.

So I've attached v4 with these changes:
- the "pad" field was restored to separate CurrBytePos from the following
fields, and a static assert was added (a sketch of the idea is below)
- the default PREV_LINKS_HASH_STRATEGY was changed to 3, as it shows less
regression with the unmodified NUM_XLOGINSERT_LOCKS=8
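
To make the first point concrete, here is a tiny self-contained illustration of
the padding idea; the names are only for this demo, while the real v4 code keeps
the existing pad field and uses pg_atomic_uint64 plus StaticAssertDecl (see the
attached patch):

#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 128  /* stand-in for PostgreSQL's PG_CACHE_LINE_SIZE */

typedef struct
{
    unsigned long long CurrBytePos;            /* heavily contended: fetch_add per record */
    char               pad[CACHE_LINE_SIZE];   /* pushes the read-mostly field to another line */
    unsigned long long RedoRecPtr;             /* read on every insertion, updated rarely */
} DemoInsert;

/* fail the build if the hot counter shares a cache line with the read-mostly field */
static_assert(offsetof(DemoInsert, RedoRecPtr) / CACHE_LINE_SIZE !=
              offsetof(DemoInsert, CurrBytePos) / CACHE_LINE_SIZE,
              "CurrBytePos and RedoRecPtr must not share a cache line");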

Though I'd ask you to test the "Dumb-lock-free..." patch from the previous
letter first, and only if it shows some promising results, spend time on
v4.

------

regards
Yura

Attachments:

v4-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchtext/x-patch; charset=UTF-8; name=v4-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchDownload
From 189981c2e7dfe94ff86c4f9406e740b368aad79d Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Sun, 19 Jan 2025 17:40:28 +0300
Subject: [PATCH v4] Lock-free XLog Reservation using lock-free hash-table

Removed PrevBytePos to eliminate lock contention, allowing atomic updates
to CurrBytePos. Use lock-free hash-table based on 4-way Cuckoo Hashing
to store link to PrevBytePos.
---
 src/backend/access/transam/xlog.c | 582 +++++++++++++++++++++++++++---
 src/tools/pgindent/typedefs.list  |   2 +
 2 files changed, 530 insertions(+), 54 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..de9bbbfbbb6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -68,6 +68,8 @@
 #include "catalog/pg_database.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
+#include "common/hashfn.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -384,6 +386,94 @@ typedef union WALInsertLockPadded
 	char		pad[PG_CACHE_LINE_SIZE];
 } WALInsertLockPadded;
 
+/* #define WAL_LINK_64 0 */
+#ifndef WAL_LINK_64
+#ifdef PG_HAVE_ATOMIC_U64_SIMULATION
+#define WAL_LINK_64 0
+#else
+#define WAL_LINK_64 1
+#endif
+#endif
+
+/*
+ * It links current position with previous one.
+ * - CurrPosId is (CurrBytePos ^ (CurrBytePos>>32))
+ *   Since CurrBytePos grows monotonically and it is aligned to MAXALIGN,
+ *   CurrPosId correctly identifies CurrBytePos for at least 4*2^32 = 32GB of
+ *   WAL logs.
+ * - CurrPosHigh is (CurrBytePos>>32), it is stored for strong uniqueness check.
+ * - PrevSize is difference between CurrBytePos and PrevBytePos
+ */
+typedef struct
+{
+#if WAL_LINK_64
+	uint64		CurrPos;
+	uint64		PrevPos;
+#define WAL_PREV_EMPTY (~((uint64)0))
+#define WALLinkEmpty(l) ((l).PrevPos == WAL_PREV_EMPTY)
+#define WALLinkSamePos(a, b) ((a).CurrPos == (b).CurrPos)
+#define WALLinkCopyPrev(a, b) do {(a).PrevPos = (b).PrevPos;} while(0)
+#else
+	uint32		CurrPosId;
+	uint32		CurrPosHigh;
+	uint32		PrevSize;
+#define WALLinkEmpty(l) ((l).PrevSize == 0)
+#define WALLinkSamePos(a, b) ((a).CurrPosId == (b).CurrPosId && (a).CurrPosHigh == (b).CurrPosHigh)
+#define WALLinkCopyPrev(a, b) do {(a).PrevSize = (b).PrevSize;} while(0)
+#endif
+} WALPrevPosLinkVal;
+
+/*
+ * This is an element of lock-free hash-table.
+ * In 32 bit mode PrevSize's lowest bit is used as a lock, relying on fact it is MAXALIGN-ed.
+ * In 64 bit mode lock protocol is more complex.
+ */
+typedef struct
+{
+#if WAL_LINK_64
+	pg_atomic_uint64 CurrPos;
+	pg_atomic_uint64 PrevPos;
+#else
+	pg_atomic_uint32 CurrPosId;
+	uint32		CurrPosHigh;
+	pg_atomic_uint32 PrevSize;
+	uint32		pad;			/* to align to 16 bytes */
+#endif
+} WALPrevPosLink;
+
+StaticAssertDecl(sizeof(WALPrevPosLink) == 16, "WALPrevPosLink should be 16 bytes");
+
+#define PREV_LINKS_HASH_CAPA (NUM_XLOGINSERT_LOCKS * 2)
+StaticAssertDecl(!(PREV_LINKS_HASH_CAPA & (PREV_LINKS_HASH_CAPA - 1)),
+				 "PREV_LINKS_HASH_CAPA should be power of two");
+StaticAssertDecl(PREV_LINKS_HASH_CAPA < UINT16_MAX,
+				 "PREV_LINKS_HASH_CAPA is too large");
+
+/*-----------
+ * PREV_LINKS_HASH_STRATEGY - the way slots are chosen in hash table
+ *   1 - 4 positions h1,h1+1,h2,h2+2 - it guarantees at least 3 distinct points,
+ *     but they may spread over 4 cache lines.
+ *   2 - 4 positions h,h^1,h^2,h^3 - 4 points in a single cache line.
+ *   3 - 8 positions h1,h1^1,h1^2,h1^3,h2,h2^1,h2^2,h2^3 - 8 distinct points
+ *     in two cache lines.
+ */
+#ifndef PREV_LINKS_HASH_STRATEGY
+#define PREV_LINKS_HASH_STRATEGY 3
+#endif
+
+#if PREV_LINKS_HASH_STRATEGY <= 2
+#define PREV_LINKS_LOOKUPS 4
+#else
+#define PREV_LINKS_LOOKUPS 8
+#endif
+
+struct WALPrevLinksLookups
+{
+	uint16		pos[PREV_LINKS_LOOKUPS];
+};
+
+#define SWAP_ONCE_IN 128
+
 /*
  * Session status of running backup, used for sanity checks in SQL-callable
  * functions to start and stop backups.
@@ -395,17 +485,18 @@ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
  */
 typedef struct XLogCtlInsert
 {
-	slock_t		insertpos_lck;	/* protects CurrBytePos and PrevBytePos */
-
 	/*
 	 * CurrBytePos is the end of reserved WAL. The next record will be
-	 * inserted at that position. PrevBytePos is the start position of the
-	 * previously inserted (or rather, reserved) record - it is copied to the
-	 * prev-link of the next record. These are stored as "usable byte
-	 * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).
+	 * inserted at that position.
+	 *
+	 * The start position of the previously inserted (or rather, reserved)
+	 * record (it is copied to the prev-link of the next record) will be
+	 * stored in PrevLinksHash.
+	 *
+	 * These are stored as "usable byte positions" rather than XLogRecPtrs
+	 * (see XLogBytePosToRecPtr()).
 	 */
-	uint64		CurrBytePos;
-	uint64		PrevBytePos;
+	pg_atomic_uint64 CurrBytePos;
 
 	/*
 	 * Make sure the above heavily-contended spinlock and byte positions are
@@ -442,8 +533,37 @@ typedef struct XLogCtlInsert
 	 * WAL insertion locks.
 	 */
 	WALInsertLockPadded *WALInsertLocks;
+
+	/*
+	 * PrevLinksHash is a lock-free hash table based on Cuckoo algorithm.
+	 *
+	 * With the default PREV_LINKS_HASH_STRATEGY == 1 it is mostly 4-way:
+	 * for every element two positions h1 and h2 are computed, and the
+	 * neighbours h1+1 and h2+2 are used as well. This way, even on a
+	 * collision we have 3 distinct positions, which provides a ~75% fill
+	 * rate without unsolvable cycles (per Cuckoo hashing theory). But the
+	 * chosen slots may be in 4 distinct cache-lines.
+	 *
+	 * With PREV_LINKS_HASH_STRATEGY == 3 it takes two buckets of 4 elements
+	 * each - 8 positions in total, but guaranteed to be in two cache lines.
+	 * It provides a very high fill rate - up to 90%.
+	 *
+	 * With PREV_LINKS_HASH_STRATEGY == 2 it takes only one bucket with 4
+	 * elements. Strictly speaking it is not Cuckoo hashing, but it should
+	 * work for our case.
+	 *
+	 * Certainly, we rely on the fact that we delete elements at the same
+	 * speed as we add them, so even unsolvable cycles will soon be
+	 * destroyed by concurrent deletions.
+	 */
+	WALPrevPosLink *PrevLinksHash;
+
 } XLogCtlInsert;
 
+StaticAssertDecl(offsetof(XLogCtlInsert, RedoRecPtr) / PG_CACHE_LINE_SIZE !=
+				 offsetof(XLogCtlInsert, CurrBytePos) / PG_CACHE_LINE_SIZE,
+				 "offset ok");
+
 /*
  * Total shared-memory state for XLOG.
  */
@@ -568,6 +688,9 @@ static XLogCtlData *XLogCtl = NULL;
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
 static WALInsertLockPadded *WALInsertLocks = NULL;
 
+/* same for XLogCtl->Insert.PrevLinksHash */
+static WALPrevPosLink *PrevLinksHash = NULL;
+
 /*
  * We maintain an image of pg_control in shared memory.
  */
@@ -700,6 +823,19 @@ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
 								XLogRecData *rdata,
 								XLogRecPtr StartPos, XLogRecPtr EndPos,
 								TimeLineID tli);
+
+static void WALPrevPosLinkValCompose(WALPrevPosLinkVal *val, XLogRecPtr StartPos, XLogRecPtr PrevPos);
+static void WALPrevPosLinkValGetPrev(WALPrevPosLinkVal val, XLogRecPtr *PrevPos);
+static void CalcCuckooPositions(WALPrevPosLinkVal linkval, struct WALPrevLinksLookups *pos);
+
+static bool WALPrevPosLinkInsert(WALPrevPosLink *link, WALPrevPosLinkVal val);
+static bool WALPrevPosLinkConsume(WALPrevPosLink *link, WALPrevPosLinkVal *val);
+static bool WALPrevPosLinkSwap(WALPrevPosLink *link, WALPrevPosLinkVal *val);
+static void LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos,
+							   XLogRecPtr *PrevPtr);
+static void LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec);
+static XLogRecPtr ReadInsertCurrBytePos(void);
+
 static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
@@ -1086,6 +1222,341 @@ XLogInsertRecord(XLogRecData *rdata,
 	return EndPos;
 }
 
+static pg_attribute_always_inline void
+WALPrevPosLinkValCompose(WALPrevPosLinkVal *val, XLogRecPtr StartPos, XLogRecPtr PrevPos)
+{
+#if WAL_LINK_64
+	val->CurrPos = StartPos;
+	val->PrevPos = PrevPos;
+#else
+	val->CurrPosHigh = StartPos >> 32;
+	val->CurrPosId = StartPos ^ val->CurrPosHigh;
+	val->PrevSize = StartPos - PrevPos;
+#endif
+}
+
+static pg_attribute_always_inline void
+WALPrevPosLinkValGetPrev(WALPrevPosLinkVal val, XLogRecPtr *PrevPos)
+{
+#if WAL_LINK_64
+	*PrevPos = val.PrevPos;
+#else
+	XLogRecPtr	StartPos = val.CurrPosHigh;
+
+	StartPos ^= (StartPos << 32) | val.CurrPosId;
+	*PrevPos = StartPos - val.PrevSize;
+#endif
+}
+
+static pg_attribute_always_inline void
+CalcCuckooPositions(WALPrevPosLinkVal linkval, struct WALPrevLinksLookups *pos)
+{
+	uint32		hash;
+#if PREV_LINKS_HASH_STRATEGY == 3
+	uint32		offset;
+#endif
+
+
+#if WAL_LINK_64
+	hash = murmurhash32(linkval.CurrPos ^ (linkval.CurrPos >> 32));
+#else
+	hash = murmurhash32(linkval.CurrPosId);
+#endif
+
+#if PREV_LINKS_HASH_STRATEGY == 1
+	pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
+	pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
+	pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
+#else
+	pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos->pos[1] = pos->pos[0] ^ 1;
+	pos->pos[2] = pos->pos[0] ^ 2;
+	pos->pos[3] = pos->pos[0] ^ 3;
+#if PREV_LINKS_HASH_STRATEGY == 3
+	/* use multiplication to compute 0 <= offset < PREV_LINKS_HASH_CAPA-4 */
+	offset = (hash / PREV_LINKS_HASH_CAPA) * (PREV_LINKS_HASH_CAPA - 4);
+	offset /= UINT32_MAX / PREV_LINKS_HASH_CAPA + 1;
+	/* add start of next bucket */
+	offset += (pos->pos[0] | 3) + 1;
+	/* get position in strictly other bucket */
+	pos->pos[4] = offset % PREV_LINKS_HASH_CAPA;
+	pos->pos[5] = pos->pos[4] ^ 1;
+	pos->pos[6] = pos->pos[4] ^ 2;
+	pos->pos[7] = pos->pos[4] ^ 3;
+#endif
+#endif
+}
+
+/*
+ * Attempt to write into empty link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkInsert(WALPrevPosLink *link, WALPrevPosLinkVal val)
+{
+#if WAL_LINK_64
+	uint64		empty = WAL_PREV_EMPTY;
+
+	if (pg_atomic_read_u64(&link->PrevPos) != WAL_PREV_EMPTY)
+		return false;
+	if (!pg_atomic_compare_exchange_u64(&link->PrevPos, &empty, val.PrevPos))
+		return false;
+	/* we could ignore concurrent lock of CurrPos */
+	pg_atomic_write_u64(&link->CurrPos, val.CurrPos);
+	return true;
+#else
+	uint32		empty = 0;
+
+	/* first do a cheap read-only check */
+	if (pg_atomic_read_u32(&link->PrevSize) != 0)
+		return false;
+	if (!pg_atomic_compare_exchange_u32(&link->PrevSize, &empty, 1))
+		/* someone else occupied the entry */
+		return false;
+
+	pg_atomic_write_u32(&link->CurrPosId, val.CurrPosId);
+	link->CurrPosHigh = val.CurrPosHigh;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, val.PrevSize);
+	return true;
+#endif
+}
+
+/*
+ * Attempt to consume matched link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkConsume(WALPrevPosLink *link, WALPrevPosLinkVal *val)
+{
+#if WAL_LINK_64
+	uint64		oldCurr;
+
+	if (pg_atomic_read_u64(&link->CurrPos) != val->CurrPos)
+		return false;
+	/* lock against concurrent swapper */
+	oldCurr = pg_atomic_fetch_or_u64(&link->CurrPos, 1);
+	if (oldCurr & 1)
+		return false;			/* lock failed */
+	if (oldCurr != val->CurrPos)
+	{
+		/* link was swapped */
+		pg_atomic_write_u64(&link->CurrPos, oldCurr);
+		return false;
+	}
+	val->PrevPos = pg_atomic_read_u64(&link->PrevPos);
+	pg_atomic_write_u64(&link->PrevPos, WAL_PREV_EMPTY);
+
+	/*
+	 * A concurrent inserter may have already reused this link, so we don't
+	 * check the result of the compare_exchange.
+	 */
+	oldCurr |= 1;
+	pg_atomic_compare_exchange_u64(&link->CurrPos, &oldCurr, 0);
+	return true;
+#else
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId)
+		return false;
+
+	/* Try lock */
+	val->PrevSize = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (val->PrevSize & 1)
+		/* Lock failed */
+		return false;
+
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId ||
+		link->CurrPosHigh != val->CurrPosHigh)
+	{
+		/* unlock with old value */
+		pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+		return false;
+	}
+
+	pg_atomic_write_u32(&link->CurrPosId, 0);
+	link->CurrPosHigh = 0;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, 0);
+	return true;
+#endif
+}
+
+/*
+ * Attempt to swap the entry: remember the existing link and write ours.
+ * We may happen to consume an empty entry; the caller detects that by
+ * checking the remembered value.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkSwap(WALPrevPosLink *link, WALPrevPosLinkVal *val)
+{
+#if WAL_LINK_64
+	uint64		oldCurr;
+	uint64		oldPrev;
+
+	/* lock against concurrent swapper or consumer */
+	oldCurr = pg_atomic_fetch_or_u64(&link->CurrPos, 1);
+	if (oldCurr & 1)
+		return false;			/* lock failed */
+	if (oldCurr == 0)
+	{
+		/* link was empty */
+		oldPrev = WAL_PREV_EMPTY;
+		/* but a concurrent inserter may still slip in */
+		if (!pg_atomic_compare_exchange_u64(&link->PrevPos, &oldPrev, val->PrevPos))
+			return false;		/* concurrent inserter won. It will overwrite
+								 * CurrPos */
+		/* this write acts as unlock */
+		pg_atomic_write_u64(&link->CurrPos, val->CurrPos);
+		val->CurrPos = 0;
+		val->PrevPos = WAL_PREV_EMPTY;
+		return true;
+	}
+	oldPrev = pg_atomic_read_u64(&link->PrevPos);
+	pg_atomic_write_u64(&link->PrevPos, val->PrevPos);
+	pg_write_barrier();
+	/* write acts as unlock */
+	pg_atomic_write_u64(&link->CurrPos, val->CurrPos);
+	val->CurrPos = oldCurr;
+	val->PrevPos = oldPrev;
+	return true;
+#else
+	uint32		oldPrev;
+	uint32		oldCurId;
+	uint32		oldCurHigh;
+
+	/* Attempt to lock entry against concurrent consumer or swapper */
+	oldPrev = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (oldPrev & 1)
+		/* Lock failed */
+		return false;
+
+	oldCurId = pg_atomic_read_u32(&link->CurrPosId);
+	oldCurHigh = link->CurrPosHigh;
+	pg_atomic_write_u32(&link->CurrPosId, val->CurrPosId);
+	link->CurrPosHigh = val->CurrPosHigh;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+
+	val->CurrPosId = oldCurId;
+	val->CurrPosHigh = oldCurHigh;
+	val->PrevSize = oldPrev;
+	return true;
+#endif
+}
+
+/*
+ * Write new link (EndPos, StartPos) and find PrevPtr for StartPos.
+ *
+ * Links are stored in a lock-free Cuckoo-based hash table.
+ * We use mostly-4-way Cuckoo hashing, which provides a high fill rate
+ * without hard cycle collisions. We also rely on concurrent consumers of
+ * existing entries, so any cycles are broken in the meantime.
+ *
+ * Cuckoo hashing relies on re-insertion for balancing, so we occasionally
+ * swap an entry and try to insert the swapped one instead of ours.
+ */
+static void
+LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos, XLogRecPtr *PrevPtr)
+{
+	SpinDelayStatus spin_stat;
+	WALPrevPosLinkVal lookup;
+	WALPrevPosLinkVal insert;
+	struct WALPrevLinksLookups lookup_pos;
+	struct WALPrevLinksLookups insert_pos;
+	uint32		i;
+	uint32		rand = 0;
+	bool		inserted = false;
+	bool		found = false;
+
+	/* pass StartPos second time to set PrevSize = 0 */
+	WALPrevPosLinkValCompose(&lookup, StartPos, StartPos);
+	WALPrevPosLinkValCompose(&insert, EndPos, StartPos);
+
+	CalcCuckooPositions(lookup, &lookup_pos);
+	CalcCuckooPositions(insert, &insert_pos);
+
+	init_local_spin_delay(&spin_stat);
+
+	while (!inserted || !found)
+	{
+		for (i = 0; !found && i < PREV_LINKS_LOOKUPS; i++)
+			found = WALPrevPosLinkConsume(&PrevLinksHash[lookup_pos.pos[i]], &lookup);
+
+		if (inserted)
+		{
+			/*
+			 * We may sleep only after we have inserted our value, since
+			 * another backend may be waiting for it.
+			 */
+			perform_spin_delay(&spin_stat);
+			goto next;
+		}
+
+		for (i = 0; !inserted && i < PREV_LINKS_LOOKUPS; i++)
+			inserted = WALPrevPosLinkInsert(&PrevLinksHash[insert_pos.pos[i]], insert);
+
+		if (inserted)
+			goto next;
+
+		rand = pg_prng_uint32(&pg_global_prng_state);
+		if (rand % SWAP_ONCE_IN != 0)
+			goto next;
+
+		i = rand / SWAP_ONCE_IN % PREV_LINKS_LOOKUPS;
+		if (!WALPrevPosLinkSwap(&PrevLinksHash[insert_pos.pos[i]], &insert))
+			goto next;
+
+		if (WALLinkEmpty(insert))
+			/* Lucky case: the entry became empty and we inserted into it */
+			inserted = true;
+		else if (WALLinkSamePos(lookup, insert))
+		{
+			/*
+			 * We happened to swap out the very entry we were looking for.
+			 * No need to insert it again.
+			 */
+			inserted = true;
+			Assert(!found);
+			found = true;
+			WALLinkCopyPrev(lookup, insert);
+			break;
+		}
+		else
+			CalcCuckooPositions(insert, &insert_pos);
+
+next:
+		pg_spin_delay();
+		pg_read_barrier();
+	}
+
+	WALPrevPosLinkValGetPrev(lookup, PrevPtr);
+}
+
+static pg_attribute_always_inline void
+LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec)
+{
+	WALPrevPosLinkVal insert;
+	struct WALPrevLinksLookups insert_pos;
+
+	WALPrevPosLinkValCompose(&insert, EndOfLog, LastRec);
+	CalcCuckooPositions(insert, &insert_pos);
+#if WAL_LINK_64
+	pg_atomic_write_u64(&PrevLinksHash[insert_pos.pos[0]].CurrPos, insert.CurrPos);
+	pg_atomic_write_u64(&PrevLinksHash[insert_pos.pos[0]].PrevPos, insert.PrevPos);
+#else
+	pg_atomic_write_u32(&PrevLinksHash[insert_pos.pos[0]].CurrPosId, insert.CurrPosId);
+	PrevLinksHash[insert_pos.pos[0]].CurrPosHigh = insert.CurrPosHigh;
+	pg_atomic_write_u32(&PrevLinksHash[insert_pos.pos[0]].PrevSize, insert.PrevSize);
+#endif
+}
+
+static pg_attribute_always_inline XLogRecPtr
+ReadInsertCurrBytePos(void)
+{
+	return pg_atomic_read_u64(&XLogCtl->Insert.CurrBytePos);
+}
+
 /*
  * Reserves the right amount of space for a record of given size from the WAL.
  * *StartPos is set to the beginning of the reserved section, *EndPos to
@@ -1118,25 +1589,9 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	/* All (non xlog-switch) records should contain data. */
 	Assert(size > SizeOfXLogRecord);
 
-	/*
-	 * The duration the spinlock needs to be held is minimized by minimizing
-	 * the calculations that have to be done while holding the lock. The
-	 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
-	 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
-	 * page headers. The mapping between "usable" byte positions and physical
-	 * positions (XLogRecPtrs) can be done outside the locked region, and
-	 * because the usable byte position doesn't include any headers, reserving
-	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
-	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
-
-	startbytepos = Insert->CurrBytePos;
+	startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
-
-	SpinLockRelease(&Insert->insertpos_lck);
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
@@ -1172,26 +1627,24 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 	uint32		segleft;
 
 	/*
-	 * These calculations are a bit heavy-weight to be done while holding a
-	 * spinlock, but since we're holding all the WAL insertion locks, there
-	 * are no other inserters competing for it. GetXLogInsertRecPtr() does
-	 * compete for it, but that's not called very frequently.
+	 * Currently this function runs while all WAL insertion locks are held
+	 * exclusively, so there is no contention on CurrBytePos, but we still
+	 * use a CAS loop for uniformity.
+	 *
+	 * We will probably get rid of the exclusive locking in the future.
 	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
 
-	startbytepos = Insert->CurrBytePos;
+repeat:
+	startbytepos = pg_atomic_read_u64(&Insert->CurrBytePos);
 
 	ptr = XLogBytePosToEndRecPtr(startbytepos);
 	if (XLogSegmentOffset(ptr, wal_segment_size) == 0)
 	{
-		SpinLockRelease(&Insert->insertpos_lck);
 		*EndPos = *StartPos = ptr;
 		return false;
 	}
 
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
 
@@ -1202,10 +1655,19 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 		*EndPos += segleft;
 		endbytepos = XLogRecPtrToBytePos(*EndPos);
 	}
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
 
-	SpinLockRelease(&Insert->insertpos_lck);
+	if (!pg_atomic_compare_exchange_u64(&Insert->CurrBytePos,
+										&startbytepos,
+										endbytepos))
+	{
+		/*
+		 * Don't use spin delay here: perform_spin_delay is primarily meant
+		 * to resolve single-core contention, but on a single core we will
+		 * succeed on the next attempt anyway.
+		 */
+		goto repeat;
+	}
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
 
@@ -1507,7 +1969,6 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	XLogRecPtr	inserted;
 	XLogRecPtr	reservedUpto;
 	XLogRecPtr	finishedUpto;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	int			i;
 
 	if (MyProc == NULL)
@@ -1522,9 +1983,7 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 		return inserted;
 
 	/* Read the current insert position */
-	SpinLockAcquire(&Insert->insertpos_lck);
-	bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
+	bytepos = ReadInsertCurrBytePos();
 	reservedUpto = XLogBytePosToEndRecPtr(bytepos);
 
 	/*
@@ -4898,6 +5357,8 @@ XLOGShmemSize(void)
 
 	/* WAL insertion locks, plus alignment */
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
+	/* PrevLinksHash; it piggybacks on the WAL insertion locks' alignment. */
+	size = add_size(size, mul_size(sizeof(WALPrevPosLink), PREV_LINKS_HASH_CAPA));
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(pg_atomic_uint64), XLOGbuffers));
 	/* extra alignment padding for XLOG I/O buffers */
@@ -4999,6 +5460,9 @@ XLOGShmemInit(void)
 		WALInsertLocks[i].l.lastImportantAt = InvalidXLogRecPtr;
 	}
 
+	PrevLinksHash = XLogCtl->Insert.PrevLinksHash = (WALPrevPosLink *) allocptr;
+	allocptr += sizeof(WALPrevPosLink) * PREV_LINKS_HASH_CAPA;
+
 	/*
 	 * Align the start of the page buffers to a full xlog block size boundary.
 	 * This simplifies some calculations in XLOG insertion. It is also
@@ -5017,12 +5481,24 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
-	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
 	pg_atomic_init_u64(&XLogCtl->logInsertResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
+
+	pg_atomic_init_u64(&XLogCtl->Insert.CurrBytePos, 0);
+
+	for (i = 0; i < PREV_LINKS_HASH_CAPA; i++)
+	{
+#if WAL_LINK_64
+		pg_atomic_init_u64(&PrevLinksHash[i].CurrPos, 0);
+		pg_atomic_init_u64(&PrevLinksHash[i].PrevPos, WAL_PREV_EMPTY);
+#else
+		pg_atomic_init_u32(&PrevLinksHash[i].CurrPosId, 0);
+		pg_atomic_init_u32(&PrevLinksHash[i].PrevSize, 0);
+#endif
+	}
 }
 
 /*
@@ -6018,8 +6494,13 @@ StartupXLOG(void)
 	 * previous incarnation.
 	 */
 	Insert = &XLogCtl->Insert;
-	Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
-	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+	{
+		XLogRecPtr	endOfLog = XLogRecPtrToBytePos(EndOfLog);
+		XLogRecPtr	lastRec = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
+
+		pg_atomic_write_u64(&Insert->CurrBytePos, endOfLog);
+		LinkStartPrevPos(endOfLog, lastRec);
+	}
 
 	/*
 	 * Tricky point here: lastPage contains the *last* block that the LastRec
@@ -7005,7 +7486,7 @@ CreateCheckPoint(int flags)
 
 	if (shutdown)
 	{
-		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 
 		/*
 		 * Compute new REDO record ptr = location of next XLOG record.
@@ -9434,14 +9915,7 @@ register_persistent_abort_backup_handler(void)
 XLogRecPtr
 GetXLogInsertRecPtr(void)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	uint64		current_bytepos;
-
-	SpinLockAcquire(&Insert->insertpos_lck);
-	current_bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
-
-	return XLogBytePosToRecPtr(current_bytepos);
+	return XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 }
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index d5aa5c295ae..118aa487adf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3123,6 +3123,8 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALPrevPosLink
+WALPrevPosLinkVal
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.43.0

#36Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Yura Sokolov (#34)
Re: [RFC] Lock-free XLog Reservation from WAL

On 1/26/2025 10:59 PM, Yura Sokolov wrote:

On 24.01.2025 12:07, Japin Li wrote:

On Thu, 23 Jan 2025 at 21:44, Japin Li <japinli@hotmail.com> wrote:

On Thu, 23 Jan 2025 at 15:03, Yura Sokolov <y.sokolov@postgrespro.ru>
wrote:

On 23.01.2025 11:46, Japin Li wrote:

On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli@hotmail.com> wrote:

On Wed, 22 Jan 2025 at 17:02, Yura Sokolov
<y.sokolov@postgrespro.ru> wrote:

I believe, I know why it happens: I was in hurry making v2 by
cherry-picking internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to add modulo
PREV_LINKS_HASH_CAPA.

Here's the fix:

          pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-       pos->pos[1] = pos->pos[0] + 1;
+       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
          pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-       pos->pos[3] = pos->pos[2] + 2;
+       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;

Any way, here's v3:
- excess file "v0-0001-Increase..." removed. I believe it was source
    of white-space apply warnings.
- this mistake fixed
- more clear slots strategies + "8 positions in two cache-lines"
strategy.

You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3
and see
if it affects measurably.

Thanks for your quick fixing.  I will retest it tomorrow.

Hi, Yura Sokolov
Here is my test result of the v3 patch:
| case                          | min        | avg        | max        |
|-------------------------------+------------+------------+------------|
| master (44b61efb79)           | 865,743.55 | 871,237.40 | 874,492.59 |
| v3                            | 857,020.58 | 860,180.11 | 864,355.00 |
| v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
| v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |
It seems there are some performance decreases :( or something I
missed?

Hi, Japin.
(Excuse me for duplicating message, I found I sent it only to you
first time).

Never mind!

v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
With only 8 in-progress inserters spin-lock is certainly better than
any
more complex solution.

You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
"master + NUM_XLOGINSERT_LOCKS=64 + v3".

And even this way I don't claim "Lock-free reservation" gives any
profit.

That is why your benchmarking is very valuable! It could answer whether
we need such a not-small patch, or whether there is no real problem at all.

Hi, Yura Sokolov

Here is the test result compared with NUM_XLOGINSERT_LOCKS and the v3
patch.

| case                  | min          | avg          | max          | rate% |
|-----------------------+--------------+--------------+--------------+-------|
| master (4108440)      | 891,225.77   | 904,868.75   | 913,708.17   |       |
| lock 64               | 1,007,716.95 | 1,012,013.22 | 1,018,674.00 | 11.84 |
| lock 64 attempt 1     | 1,016,716.07 | 1,017,735.55 | 1,019,328.36 | 12.47 |
| lock 64 attempt 2     | 1,015,328.31 | 1,018,147.74 | 1,021,513.14 | 12.52 |
| lock 128              | 1,010,147.38 | 1,014,128.11 | 1,018,672.01 | 12.07 |
| lock 128 attempt 1    | 1,018,154.79 | 1,023,348.35 | 1,031,365.42 | 13.09 |
| lock 128 attempt 2    | 1,013,245.56 | 1,018,984.78 | 1,023,696.00 | 12.61 |
| lock 64 v3            | 1,010,893.30 | 1,022,787.25 | 1,029,200.26 | 13.03 |
| lock 64 attempt 1 v3  | 1,014,961.21 | 1,019,745.09 | 1,025,511.62 | 12.70 |
| lock 64 attempt 2 v3  | 1,015,690.73 | 1,018,365.46 | 1,020,200.57 | 12.54 |
| lock 128 v3           | 1,012,653.14 | 1,013,637.09 | 1,014,358.69 | 12.02 |
| lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 | 1,024,597.15 | 12.38 |
| lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 | 1,027,855.90 | 13.24 |

The data looks really interesting and I recognize the need for further
investigation. I'm not very familiar with BenchmarkSQL but we've done
similar tests with HammerDB/TPCC by solely increasing
NUM_XLOGINSERT_LOCKS from 8 to 128, and we observed a significant
performance drop of ~50% and the cycle ratio of spinlock acquisition
(s_lock) rose to over 60% of the total, which is basically consistent
with the previous findings in [1].

Could you please share the details of your test environment, including
the device, configuration, and test approach, so we can collaborate on
understanding the differences?

Sorry for pause, it was my birthday, so I was on short vacation.

So, in total:
- increasing NUM_XLOGINSERT_LOCKS to 64 certainly helps
- additional lock attempts seem to help a bit in this benchmark,
  but they help more in another (rather synthetic) benchmark [1]
- my version of lock-free reservation looks to help a bit when
  applied alone, but looks strange in conjunction with additional
  lock attempts.

I don't see that the small improvement from my version of Lock-Free
reservation (1.1% = 1023/1012) pays for its complexity at the moment.

Due to limited hardware resources, I only had the opportunity to measure
the performance impact of your v1 patch of the lock-free hash table with
64 NUM_XLOGINSERT_LOCKS and the two lock attempt patch. I observed an
improvement of *76.4%* (RSD: 4.1%) when combining them together on the
SPR with 480 vCPUs. I understand that your test devices may not have as
many cores, which might be why this optimization brings an unnoticeable
impact. However, I don't think this is an unreal problem. In fact, this
issue was raised by our customer who is trying to deploy Postgres on
devices with hundreds of cores, and I believe the resolution of this
performance issue would result in real impacts.

Probably, when other places will be optimized/improved, it will pay
more.

Or probably Zhiguo Zhou's version will perform better.

Our primary difference lies in the approach to handling the prev-link,
either via the hash table or directly within the XLog buffer. During my
analysis, I didn't identify significant hotspots in the hash table
functions, leading me to believe that both implementations should
achieve comparable performance improvements.
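
To make the comparison concrete for readers following the thread, here is a
small, purely sequential C sketch of the bookkeeping behind the hash-table
approach (toy names only, no PostgreSQL code, and none of the real locking
or cuckoo probing): each reservation publishes (end, start) keyed by the
position at which the next record will start, and then looks up its own
prev-link under its own start position.

/* toy_prevlink.c -- sequential illustration only; all names are made up */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CAPA 16

typedef struct { uint64_t key; uint64_t prev; } Slot; /* key = start of next record */

static Slot     table[CAPA];
static uint64_t curr = 0;               /* stand-in for CurrBytePos */

static Slot *slot_for(uint64_t key) { return &table[(key / 8) % CAPA]; }

/* Reserve 'size' bytes; return our start and (through *prev) our prev-link. */
static uint64_t reserve(uint64_t size, uint64_t *prev)
{
    uint64_t start = curr;              /* an atomic fetch_add in the real code */
    uint64_t end = start + size;
    curr = end;

    /* pick up our own prev-link, left earlier under our start position */
    Slot *s = slot_for(start);
    *prev = (s->key == start) ? s->prev : UINT64_MAX;  /* UINT64_MAX = "missing" */
    memset(s, 0, sizeof(*s));           /* consume: free the slot for reuse */

    /* publish the prev-link that the *next* record will need */
    *slot_for(end) = (Slot){ .key = end, .prev = start };
    return start;
}

int main(void)
{
    table[0] = (Slot){ .key = 0, .prev = 0 };   /* seed, like LinkStartPrevPos() */
    for (int i = 0; i < 5; i++)
    {
        uint64_t prev;
        uint64_t start = reserve(40 + 8 * i, &prev);
        printf("record at %3llu, prev-link -> %3llu\n",
               (unsigned long long) start, (unsigned long long) prev);
    }
    return 0;
}

The real patch does the same bookkeeping with atomics and the multi-slot
cuckoo probing described in its comments, so that concurrent inserters and
consumers can make progress without a spinlock.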

Following your advice, I revised my implementation to update the
prev-link atomically and resolved the known TAP tests. However, I
encountered the last failure in the recovery/t/027_stream_regress.pl
test. Addressing this issue might require a redesign of the underlying
writing convention of XLog, which I believe is not necessary, especially
since your implementation already achieves the desired performance
improvements without suffering from the test failures. I think we may
need to focus on your implementation in the next phase.

I think, we could measure theoretical benefit by completely ignoring
fill of xl_prev. I've attached patch "Dumb-lock-free..." so you could
measure. It passes almost all "recovery" tests, though fails two
strictly dependent on xl_prev.

I currently don't have access to the high-core-count device, but I plan
to measure the performance impact of your latest patch and the
"Dumb-lock-free..." patch once I regain access.

[1] /messages/by-id/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru

------

regards
Yura

Hi Yura and Japin,

Thanks so much for your recent patch work and discussions, which have
inspired me a lot! I agree with you that we need to:
- Align the test approach and environment
- Address the motivation and necessity of this optimization
- Further identify the optimization opportunities after applying Yura's
patch

WDYT?

[1]: /messages/by-id/6ykez6chr5wfiveuv2iby236mb7ab6fqwpxghppdi5ugb4kdyt@lkrn4maox2wj

Regards,
Zhiguo

#37Japin Li
japinli@hotmail.com
In reply to: Zhou, Zhiguo (#36)
Re: [RFC] Lock-free XLog Reservation from WAL

On Mon, 27 Jan 2025 at 17:30, "Zhou, Zhiguo" <zhiguo.zhou@intel.com> wrote:

[...]

Could you please share the details of your test environment, including
the device, configuration, and test approach, so we can collaborate on
understanding the differences?

Sorry for the late reply. I'm on my vacation.

I use a Hygon C86 7490 (64 cores); it has 8 NUMA nodes and 1.5T of memory.
I use NUMA nodes 0-3 to run the database and nodes 4-7 to run BenchmarkSQL.

Here are my database settings:

listen_addresses = '*'
max_connections = '1050'
shared_buffers = '100GB'
work_mem = '64MB'
maintenance_work_mem = '512MB'
max_wal_size = '50GB'
min_wal_size = '10GB'
random_page_cost = '1.1'
wal_buffers = '1GB'
wal_level = 'minimal'
max_wal_senders = '0'
wal_sync_method = 'open_datasync'
wal_compression = 'lz4'
track_activities = 'off'
checkpoint_timeout = '1d'
checkpoint_completion_target = '0.95'
effective_cache_size = '300GB'
effective_io_concurrency = '32'
update_process_title = 'off'
password_encryption = 'md5'
huge_pages = 'on'


--
Regards,
Japin Li

#38Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Japin Li (#37)
Re: [RFC] Lock-free XLog Reservation from WAL

On 2/5/2025 4:32 PM, Japin Li wrote:

[...]

Sorry for the late reply. I'm on my vacation.

I use a Hygon C86 7490 (64 cores); it has 8 NUMA nodes and 1.5T of memory.
I use NUMA nodes 0-3 to run the database and nodes 4-7 to run BenchmarkSQL.

[...]

Hi Japin,

Apologies for the delay in responding—I've just returned from vacation.
To move things forward, I'll be running the BenchmarkSQL workload on my
end shortly.

In the meantime, could you run the HammerDB/TPCC workload on your
device? We've observed significant performance improvements with this
test, and it might help clarify whether the discrepancies we're seeing
stem from the workload itself. Thanks!

Regards,
Zhiguo

#39Japin Li
japinli@hotmail.com
In reply to: Zhou, Zhiguo (#38)
Re: [RFC] Lock-free XLog Reservation from WAL

On Mon, 10 Feb 2025 at 22:12, "Zhou, Zhiguo" <zhiguo.zhou@intel.com> wrote:

[...]

Hi Japin,

Apologies for the delay in responding—I've just returned from
vacation. To move things forward, I'll be running the BenchmarkSQL
workload on my end shortly.

In the meantime, could you run the HammerDB/TPCC workload on your
device? We've observed significant performance improvements with this
test, and it might help clarify whether the discrepancies we're seeing
stem from the workload itself. Thanks!

Sorry, I currently don't have access to the test device; I will try to test
it if I can regain access.

--
Regards,
Japin Li

#40Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Japin Li (#39)
1 attachment(s)
Re: [RFC] Lock-free XLog Reservation from WAL

On 2/11/2025 9:25 AM, Japin Li wrote:

On Mon, 10 Feb 2025 at 22:12, "Zhou, Zhiguo" <zhiguo.zhou@intel.com> wrote:

On 2/5/2025 4:32 PM, Japin Li wrote:

On Mon, 27 Jan 2025 at 17:30, "Zhou, Zhiguo" <zhiguo.zhou@intel.com> wrote:

On 1/26/2025 10:59 PM, Yura Sokolov wrote:

24.01.2025 12:07, Japin Li пишет:

On Thu, 23 Jan 2025 at 21:44, Japin Li <japinli@hotmail.com> wrote:

On Thu, 23 Jan 2025 at 15:03, Yura Sokolov
<y.sokolov@postgrespro.ru> wrote:

23.01.2025 11:46, Japin Li пишет:

On Wed, 22 Jan 2025 at 22:44, Japin Li <japinli@hotmail.com> wrote:

On Wed, 22 Jan 2025 at 17:02, Yura Sokolov
<y.sokolov@postgrespro.ru> wrote:

I believe, I know why it happens: I was in hurry making v2 by
cherry-picking internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to add modulo
PREV_LINKS_HASH_CAPA.

Here's the fix:

          pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-       pos->pos[1] = pos->pos[0] + 1;
+       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
          pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-       pos->pos[3] = pos->pos[2] + 2;
+       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;

Any way, here's v3:
- excess file "v0-0001-Increase..." removed. I believe it was source
    of white-space apply warnings.
- this mistake fixed
- more clear slots strategies + "8 positions in two
cache-lines" strategy.

You may play with switching PREV_LINKS_HASH_STRATEGY to 2 or 3
and see
if it affects measurably.

Thanks for your quick fixing.  I will retest it tomorrow.

Hi, Yura Sokolov
Here is my test result of the v3 patch:
| case                          | min        | avg        | max
|
|-------------------------------+------------+------------
+------------|
| master (44b61efb79)           | 865,743.55 | 871,237.40 |
874,492.59 |
| v3                            | 857,020.58 | 860,180.11 |
864,355.00 |
| v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 |
858,436.44 |
| v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 |
865,396.42 |
It seems there are some performance decreases :( or something I
missed?

Hi, Japin.
(Excuse me for duplicating message, I found I sent it only to you
first time).

Never mind!

v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
With only 8 in-progress inserters spin-lock is certainly better
than any
more complex solution.

You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
"master + NUM_XLOGINSERT_LOCKS=64 + v3".

And even this way I don't claim "Lock-free reservation" gives any
profit.

That is why your benchmarking is very valuable! It could answer, does
we need such not-small patch, or there is no real problem at all?

Hi, Yura Sokolov

Here is the test result compared with NUM_XLOGINSERT_LOCKS and the
v3 patch.

| case                  | min          | avg          |
max          | rate% |
|-----------------------+--------------+--------------+--------------
+-------|
| master (4108440)      | 891,225.77   | 904,868.75   |
913,708.17   |        |
| lock 64               | 1,007,716.95 | 1,012,013.22 |
1,018,674.00 | 11.84 |
| lock 64 attempt 1     | 1,016,716.07 | 1,017,735.55 |
1,019,328.36 | 12.47 |
| lock 64 attempt 2     | 1,015,328.31 | 1,018,147.74 |
1,021,513.14 | 12.52 |
| lock 128              | 1,010,147.38 | 1,014,128.11 |
1,018,672.01 | 12.07 |
| lock 128 attempt 1    | 1,018,154.79 | 1,023,348.35 |
1,031,365.42 | 13.09 |
| lock 128 attempt 2    | 1,013,245.56 | 1,018,984.78 |
1,023,696.00 | 12.61 |
| lock 64 v3            | 1,010,893.30 | 1,022,787.25 |
1,029,200.26 | 13.03 |
| lock 64 attempt 1 v3  | 1,014,961.21 | 1,019,745.09 |
1,025,511.62 | 12.70 |
| lock 64 attempt 2 v3  | 1,015,690.73 | 1,018,365.46 |
1,020,200.57 | 12.54 |
| lock 128 v3           | 1,012,653.14 | 1,013,637.09 |
1,014,358.69 | 12.02 |
| lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 |
1,024,597.15 | 12.38 |
| lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 |
1,027,855.90 | 13.24 |

The data looks really interesting and I recognize the need for further
investigation. I'm not very familiar with BenchmarkSQL but we've done
similar tests with HammerDB/TPCC by solely increasing
NUM_XLOGINSERT_LOCKS from 8 to 128, and we observed a significant
performance drop of ~50% and the cycle ratio of spinlock acquisition
(s_lock) rose to over 60% of the total, which is basically consistent
with the previous findings in [1].

Could you please share the details of your test environment, including
the device, configuration, and test approach, so we can collaborate on
understanding the differences?

Sorry for the late reply. I'm on my vacation.
I use Hygon C86 7490 64-core, it has 8 NUMA nodes with 1.5T memory,
and
I use 0-3 run the database, and 4-7 run the BenchmarkSQL.
Here is my database settings:
listen_addresses = '*'
max_connections = '1050'
shared_buffers = '100GB'
work_mem = '64MB'
maintenance_work_mem = '512MB'
max_wal_size = '50GB'
min_wal_size = '10GB'
random_page_cost = '1.1'
wal_buffers = '1GB'
wal_level = 'minimal'
max_wal_senders = '0'
wal_sync_method = 'open_datasync'
wal_compression = 'lz4'
track_activities = 'off'
checkpoint_timeout = '1d'
checkpoint_completion_target = '0.95'
effective_cache_size = '300GB'
effective_io_concurrency = '32'
update_process_title = 'off'
password_encryption = 'md5'
huge_pages = 'on'

Sorry for the pause, it was my birthday, so I was on a short vacation.
So, in total:
- increasing NUM_XLOGINSERT_LOCKS to 64 certainly helps
- additional lock attempts seem to help a bit in this benchmark,
  but they help more in another (rather synthetic) benchmark [1]
- my version of lock-free reservation looks to help a bit when
  applied alone, but looks strange in conjunction with additional
  lock attempts.
I don't think the small improvement from my version of lock-free
reservation (1.1% = 1023/1012) pays for its complexity at the moment.

Due to limited hardware resources, I only had the opportunity to
measure the performance impact of your v1 lock-free hash table patch
with 64 NUM_XLOGINSERT_LOCKS plus the two-lock-attempt patch. I
observed an improvement of *76.4%* (RSD: 4.1%) when combining them
on the SPR with 480 vCPUs. I understand that your test
devices may not have as many cores, which might be why this
optimization shows little impact for you. However, I don't think
this is an unreal problem. In fact, this issue was raised by our
customer, who is trying to deploy Postgres on devices with hundreds of
cores, and I believe resolving this performance issue would have a
real impact.

Probably, when other places are optimized/improved, it will pay off
more. Or perhaps Zhiguo Zhou's version will perform better.

Our primary difference lies in the approach to handling the prev-link,
either via the hash table or directly within the XLog buffer. During
my analysis, I didn't identify significant hotspots in the hash table
functions, leading me to believe that both implementations should
achieve comparable performance improvements.

Following your advice, I revised my implementation to update the
prev-link atomically and resolved the known TAP test failures. However, I
still hit one last failure in the recovery/t/027_stream_regress.pl
test. Addressing this issue might require a redesign of the underlying
writing convention of XLog, which I believe is not necessary,
especially since your implementation already achieves the desired
performance improvements without suffering from the test failures. I
think we may need to focus on your implementation in the next phase.

I think, we could measure theoretical benefit by completely ignoring
fill of xl_prev. I've attached patch "Dumb-lock-free..." so you could
measure. It passes almost all "recovery" tests, though fails two
strictly dependent on xl_prev.

I currently don't have access to the high-core-count device, but I
plan to measure the performance impact of your latest patch and the
"Dump-lock-free..." patch once I regain access.

[1] /messages/by-id/3b11fdc2-9793-403d-
b3d4-67ff9a00d447%40postgrespro.ru
------
regards
Yura

Hi Yura and Japin,

Thanks so much for your recent patch work and discussions, which
inspired me a lot! I agree with you that we need to:
- Align the test approach and environment
- Address the motivation and necessity of this optimization
- Further identify the optimization opportunities after applying
Yura's patch

WDYT?

[1]
/messages/by-id/6ykez6chr5wfiveuv2iby236mb7ab6fqwpxghppdi5ugb4kdyt@lkrn4maox2wj

Regards,
Zhiguo

Hi Japin,

Apologies for the delay in responding—I've just returned from
vacation. To move things forward, I'll be running the BenchmarkSQL
workload on my end shortly.

In the meantime, could you run the HammerDB/TPCC workload on your
device? We've observed significant performance improvements with this
test, and it might help clarify whether the discrepancies we're seeing
stem from the workload itself. Thanks!

Sorry, I currently don't have access to the test device, I will try to test
it if I can regain access.

Good day, Yura and Japin!

I recently acquired the SUT device again and had the opportunity to
conduct performance experiments using the TPC-C benchmark (pg_count_ware
757, vu 256) with HammerDB on an Intel CPU with 480 vCPUs. Below are the
results and key observations:

+----------------+-------------+------------+-------------+------------+
| Version        | NOPM        | NOPM Gain% | TPM         | TPM(Gain%) |
+----------------+-------------+------------+-------------+------------+
| master(b4a07f5)|  1,681,233  | 0.0%       |  3,874,491  | 0.0%       |
| 64-lock        |  643,853    | -61.7%     |  1,479,647  | -61.8%     |
| 64-lock-v4     |  2,423,972  | 44.2%      |  5,577,580  | 44.0%      |
| 128-lock       |  462,993    | -72.5%     |  1,064,733  | -72.5%     |
| 128-lock-v4    |  2,468,034  | 46.8%      |  5,673,349  | 46.4%      |
+----------------+-------------+------------+-------------+------------+

- Though the baseline (b4a07f5) has improved compared to when we started
this thread, we still achieve a 44% improvement with this optimization.
- Increasing NUM_XLOGINSERT_LOCKS solely to 64/128 leads to severe
performance regression due to intensified lock contention.
- Increasing NUM_XLOGINSERT_LOCKS and applying the lock-free xlog
insertion optimization jointly improve overall performance.
- 64 locks seems the sweet spot for achieving the most performance
improvement.

I also executed the same benchmark, TPCC, with BenchmarkSQL (I'm not
sure whether the differences in their TPC-C implementations would lead
to some performance gap). I observed that:

- The performance indicator (NOPM) differs by several orders of
magnitude from Japin's results.
- NOPM/TPM seems insensitive to code changes (lock count increase,
lock-free algorithm), which is quite strange.
- Possible reasons may include: 1) the scaling parameters [1] are not
aligned, 2) the test configuration did not reach the pain point of
XLog insertions.

And I noticed that a 64-core device (32 cores for the server) was used in
Japin's test. In our previous core-scaling test (attached), 32/64 cores
may not be enough to show the impact of the optimization; I think that
would be one of the reasons why Japin observed minimal impact from the
lock-free optimization.

In summary, I think:
- The TPC-C benchmark (pg_count_ware 757, vu 256) with HammerDB
effectively reflects performance in XLog insertions.
- This test on a device with hundreds of cores reflects a real user
scenario, making it a significant consideration.
- The lock-free algorithm with the lock count increased to 64 can bring
significant performance improvements.

So I propose to continue the code review process for this optimization
patch. WDYT?

Regards,
Zhiguo

[1]: https://github.com/pgsql-io/benchmarksql/blob/master/docs/PROPERTIES.md#scaling-parameters

Attachments:

pg-tpcc-core-scaling-lock-free.pngimage/png; name=pg-tpcc-core-scaling-lock-free.pngDownload
#41Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Zhou, Zhiguo (#40)
Re: [RFC] Lock-free XLog Reservation from WAL

Good day.

I'll just repeat my answer from personal mail:

I'm impressed with the results. I really didn't expect it to be so
important for huge servers.

The main problem will be to prove this patch doesn't harm performance on
smaller servers, or to make things configurable so that smaller servers
still use the simpler code.

-------
regards
Yura Sokolov aka funny-falcon

#42Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Yura Sokolov (#41)
1 attachment(s)
Re: [RFC] Lock-free XLog Reservation from WAL

Good day, Yura!

First, I apologize for the delayed response, due to internal urgent
matters and the time required to set up the tests on a new device.

Regarding your concerns about the potential negative impact on
performance, I have conducted further evaluations. To assess the
performance impact of the patch on a smaller device, I located another
device with significantly fewer processors. Using the same database and
test configurations (TPC-C benchmark: pg_count_ware 757, vu 256) and
code bases (b4a07f5 as "base" and v4 patch with 64 locks as "opt"), I
performed core scaling tests ranging from 8 to 64 physical cores in
steps of 8. The results (attached) indicate that the optimization does
not lead to performance regression within this low core count range.

Please kindly let me know if more data is required to move the process
forward.

I look forward to your insights.

Regards,
Zhiguo

Attachments:

pg-tpcc-core-scaling-lock-free-lcc.pngimage/png; name=pg-tpcc-core-scaling-lock-free-lcc.pngDownload
#43Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Zhou, Zhiguo (#42)
Re: [RFC] Lock-free XLog Reservation from WAL

Good day, Zhiguo.

Thank you a lot for testing!
I will validate on the servers I have access to (and on my notebook).

To be honest, I didn't benchmark v4, and it fixes a cache-line sharing
issue I mistakenly introduced in the previous version. So it probably
really doesn't affect performance the way v3 did.
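
For readers unfamiliar with the term, "cache-line sharing" (false sharing)
is when two independently updated fields land on the same cache line, so
concurrent writers keep invalidating each other's copy. Below is a minimal,
self-contained C illustration of the padding idiom used to avoid it; it is
generic and not taken from the v3/v4 patches:

    /* compile with: cc -O2 -pthread false_sharing.c */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE_SIZE 64      /* assumed typical x86-64 line size */

    /*
     * Each worker updates only its own counter.  The pad keeps the two
     * counters in separate cache lines; remove it and they share one line,
     * so concurrent updates become much slower under contention.
     */
    typedef struct
    {
        _Atomic unsigned long value;
        char        pad[CACHE_LINE_SIZE - sizeof(_Atomic unsigned long)];
    } PaddedCounter;

    static PaddedCounter counters[2];

    static void *
    worker(void *arg)
    {
        PaddedCounter *c = arg;

        for (unsigned long i = 0; i < 10000000UL; i++)
            atomic_fetch_add(&c->value, 1);
        return NULL;
    }

    int
    main(void)
    {
        pthread_t   t[2];

        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(&t[i], NULL);
        printf("%lu %lu\n",
               (unsigned long) atomic_load(&counters[0].value),
               (unsigned long) atomic_load(&counters[1].value));
        return 0;
    }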

-------
regards
Yura Sokolov aka funny-falcon

#44Andres Freund
andres@anarazel.de
In reply to: Yura Sokolov (#43)
Re: [RFC] Lock-free XLog Reservation from WAL

Reliably fails tests on Windows, due to what looks to be a null pointer dereference.

E.g. https://cirrus-ci.com/task/6178371937239040

That's likely related to EXEC_BACKEND.

The new status of this patch is: Waiting on Author

#45Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Andres Freund (#44)
1 attachment(s)
Re: [RFC] Lock-free XLog Reservation from WAL

Good day, Andres.

Thank you very much for pointing that out!
Yes, I missed copying from XLogCtl, as is done for WALInsertLocks.
Fixed.
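
The kind of fix being described, as a sketch only (not quoted from the v5
patch): under EXEC_BACKEND each backend re-attaches to shared memory, so
the backend-local convenience pointer has to be re-initialized from
XLogCtl during shared-memory initialization, the same way WALInsertLocks
already is. PrevLinksHash refers to the field this patch adds:

    /* e.g. in XLOGShmemInit(), next to the existing WALInsertLocks copy */
    WALInsertLocks = XLogCtl->Insert.WALInsertLocks;
    PrevLinksHash = XLogCtl->Insert.PrevLinksHash;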

--
regards
Yura Sokolov aka funny-falcon

Attachments:

v5-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchtext/x-patch; charset=UTF-8; name=v5-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchDownload
From 8cc3f9c7d629ce7fedd15f224df7566fd723cb06 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Sun, 19 Jan 2025 17:40:28 +0300
Subject: [PATCH v5] Lock-free XLog Reservation using lock-free hash-table

Removed PrevBytePos to eliminate lock contention, allowing atomic updates
to CurrBytePos. Use lock-free hash-table based on 4-way Cuckoo Hashing
to store link to PrevBytePos.
---
 src/backend/access/transam/xlog.c | 585 +++++++++++++++++++++++++++---
 src/tools/pgindent/typedefs.list  |   2 +
 2 files changed, 532 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4b6c694a3f7..82d6dc0732c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -68,6 +68,8 @@
 #include "catalog/pg_database.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
+#include "common/hashfn.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -384,6 +386,94 @@ typedef union WALInsertLockPadded
 	char		pad[PG_CACHE_LINE_SIZE];
 } WALInsertLockPadded;
 
+/* #define WAL_LINK_64 0 */
+#ifndef WAL_LINK_64
+#ifdef PG_HAVE_ATOMIC_U64_SIMULATION
+#define WAL_LINK_64 0
+#else
+#define WAL_LINK_64 1
+#endif
+#endif
+
+/*
+ * It links current position with previous one.
+ * - CurrPosId is (CurrBytePos ^ (CurrBytePos>>32))
+ *   Since CurrBytePos grows monotonically and it is aligned to MAXALIGN,
+ *   CurrPosId correctly identifies CurrBytePos for at least 4*2^32 = 32GB of
+ *   WAL logs.
+ * - CurrPosHigh is (CurrBytePos>>32), it is stored for strong uniqueness check.
+ * - PrevSize is difference between CurrBytePos and PrevBytePos
+ */
+typedef struct
+{
+#if WAL_LINK_64
+	uint64		CurrPos;
+	uint64		PrevPos;
+#define WAL_PREV_EMPTY (~((uint64)0))
+#define WALLinkEmpty(l) ((l).PrevPos == WAL_PREV_EMPTY)
+#define WALLinkSamePos(a, b) ((a).CurrPos == (b).CurrPos)
+#define WALLinkCopyPrev(a, b) do {(a).PrevPos = (b).PrevPos;} while(0)
+#else
+	uint32		CurrPosId;
+	uint32		CurrPosHigh;
+	uint32		PrevSize;
+#define WALLinkEmpty(l) ((l).PrevSize == 0)
+#define WALLinkSamePos(a, b) ((a).CurrPosId == (b).CurrPosId && (a).CurrPosHigh == (b).CurrPosHigh)
+#define WALLinkCopyPrev(a, b) do {(a).PrevSize = (b).PrevSize;} while(0)
+#endif
+} WALPrevPosLinkVal;
+
+/*
+ * This is an element of lock-free hash-table.
+ * In 32 bit mode PrevSize's lowest bit is used as a lock, relying on fact it is MAXALIGN-ed.
+ * In 64 bit mode lock protocol is more complex.
+ */
+typedef struct
+{
+#if WAL_LINK_64
+	pg_atomic_uint64 CurrPos;
+	pg_atomic_uint64 PrevPos;
+#else
+	pg_atomic_uint32 CurrPosId;
+	uint32		CurrPosHigh;
+	pg_atomic_uint32 PrevSize;
+	uint32		pad;			/* to align to 16 bytes */
+#endif
+} WALPrevPosLink;
+
+StaticAssertDecl(sizeof(WALPrevPosLink) == 16, "WALPrevPosLink should be 16 bytes");
+
+#define PREV_LINKS_HASH_CAPA (NUM_XLOGINSERT_LOCKS * 2)
+StaticAssertDecl(!(PREV_LINKS_HASH_CAPA & (PREV_LINKS_HASH_CAPA - 1)),
+				 "PREV_LINKS_HASH_CAPA should be power of two");
+StaticAssertDecl(PREV_LINKS_HASH_CAPA < UINT16_MAX,
+				 "PREV_LINKS_HASH_CAPA is too large");
+
+/*-----------
+ * PREV_LINKS_HASH_STRATEGY - the way slots are chosen in hash table
+ *   1 - 4 positions h1,h1+1,h2,h2+2 - it guarantees at least 3 distinct points,
+ *     but may spread at 4 cache lines.
+ *   2 - 4 positions h,h^1,h^2,h^3 - 4 points in single cache line.
+ *   3 - 8 positions h1,h1^1,h1^2,h1^4,h2,h2^1,h2^2,h2^3 - 8 distinct points in
+ *     in two cache lines.
+ */
+#ifndef PREV_LINKS_HASH_STRATEGY
+#define PREV_LINKS_HASH_STRATEGY 3
+#endif
+
+#if PREV_LINKS_HASH_STRATEGY <= 2
+#define PREV_LINKS_LOOKUPS 4
+#else
+#define PREV_LINKS_LOOKUPS 8
+#endif
+
+struct WALPrevLinksLookups
+{
+	uint16		pos[PREV_LINKS_LOOKUPS];
+};
+
+#define SWAP_ONCE_IN 128
+
 /*
  * Session status of running backup, used for sanity checks in SQL-callable
  * functions to start and stop backups.
@@ -395,17 +485,18 @@ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
  */
 typedef struct XLogCtlInsert
 {
-	slock_t		insertpos_lck;	/* protects CurrBytePos and PrevBytePos */
-
 	/*
 	 * CurrBytePos is the end of reserved WAL. The next record will be
-	 * inserted at that position. PrevBytePos is the start position of the
-	 * previously inserted (or rather, reserved) record - it is copied to the
-	 * prev-link of the next record. These are stored as "usable byte
-	 * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).
+	 * inserted at that position.
+	 *
+	 * The start position of the previously inserted (or rather, reserved)
+	 * record (it is copied to the prev-link of the next record) will be
+	 * stored in PrevLinksHash.
+	 *
+	 * These are stored as "usable byte positions" rather than XLogRecPtrs
+	 * (see XLogBytePosToRecPtr()).
 	 */
-	uint64		CurrBytePos;
-	uint64		PrevBytePos;
+	pg_atomic_uint64 CurrBytePos;
 
 	/*
 	 * Make sure the above heavily-contended spinlock and byte positions are
@@ -442,8 +533,37 @@ typedef struct XLogCtlInsert
 	 * WAL insertion locks.
 	 */
 	WALInsertLockPadded *WALInsertLocks;
+
+	/*
+	 * PrevLinksHash is a lock-free hash table based on Cuckoo algorithm.
+	 *
+	 * With default PREV_LINKS_HASH_STRATEGY == 1 it is mostly 4 way: for
+	 * every element computed two positions h1, h2, and neighbour h1+1 and
+	 * h2+2 are used as well. This way even on collision we have 3 distinct
+	 * position, which provide us ~75% fill rate without unsolvable cycles
+	 * (due to Cuckoo's theory). But chosen slots may be in 4 distinct
+	 * cache-lines.
+	 *
+	 * With PREV_LINKS_HASH_STRATEGY == 3 it takes two buckets 4 elements each
+	 * - 8 positions in total, but guaranteed to be in two cache lines. It
+	 * provides very high fill rate - upto 90%.
+	 *
+	 * With PREV_LINKS_HASH_STRATEGY == 2 it takes only one bucket with 4
+	 * elements. Strictly speaking it is not Cuckoo-hashing, but should work
+	 * for our case.
+	 *
+	 * Certainly, we rely on the fact we will delete elements with same speed
+	 * as we add them, and even unsolvable cycles will be destroyed soon by
+	 * concurrent deletions.
+	 */
+	WALPrevPosLink *PrevLinksHash;
+
 } XLogCtlInsert;
 
+StaticAssertDecl(offsetof(XLogCtlInsert, RedoRecPtr) / PG_CACHE_LINE_SIZE !=
+				 offsetof(XLogCtlInsert, CurrBytePos) / PG_CACHE_LINE_SIZE,
+				 "offset ok");
+
 /*
  * Total shared-memory state for XLOG.
  */
@@ -568,6 +688,9 @@ static XLogCtlData *XLogCtl = NULL;
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
 static WALInsertLockPadded *WALInsertLocks = NULL;
 
+/* same for XLogCtl->Insert.PrevLinksHash */
+static WALPrevPosLink *PrevLinksHash = NULL;
+
 /*
  * We maintain an image of pg_control in shared memory.
  */
@@ -700,6 +823,19 @@ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
 								XLogRecData *rdata,
 								XLogRecPtr StartPos, XLogRecPtr EndPos,
 								TimeLineID tli);
+
+static void WALPrevPosLinkValCompose(WALPrevPosLinkVal *val, XLogRecPtr StartPos, XLogRecPtr PrevPos);
+static void WALPrevPosLinkValGetPrev(WALPrevPosLinkVal val, XLogRecPtr *PrevPos);
+static void CalcCuckooPositions(WALPrevPosLinkVal linkval, struct WALPrevLinksLookups *pos);
+
+static bool WALPrevPosLinkInsert(WALPrevPosLink *link, WALPrevPosLinkVal val);
+static bool WALPrevPosLinkConsume(WALPrevPosLink *link, WALPrevPosLinkVal *val);
+static bool WALPrevPosLinkSwap(WALPrevPosLink *link, WALPrevPosLinkVal *val);
+static void LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos,
+							   XLogRecPtr *PrevPtr);
+static void LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec);
+static XLogRecPtr ReadInsertCurrBytePos(void);
+
 static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
@@ -1086,6 +1222,341 @@ XLogInsertRecord(XLogRecData *rdata,
 	return EndPos;
 }
 
+static pg_attribute_always_inline void
+WALPrevPosLinkValCompose(WALPrevPosLinkVal *val, XLogRecPtr StartPos, XLogRecPtr PrevPos)
+{
+#if WAL_LINK_64
+	val->CurrPos = StartPos;
+	val->PrevPos = PrevPos;
+#else
+	val->CurrPosHigh = StartPos >> 32;
+	val->CurrPosId = StartPos ^ val->CurrPosHigh;
+	val->PrevSize = StartPos - PrevPos;
+#endif
+}
+
+static pg_attribute_always_inline void
+WALPrevPosLinkValGetPrev(WALPrevPosLinkVal val, XLogRecPtr *PrevPos)
+{
+#if WAL_LINK_64
+	*PrevPos = val.PrevPos;
+#else
+	XLogRecPtr	StartPos = val.CurrPosHigh;
+
+	StartPos ^= (StartPos << 32) | val.CurrPosId;
+	*PrevPos = StartPos - val.PrevSize;
+#endif
+}
+
+static pg_attribute_always_inline void
+CalcCuckooPositions(WALPrevPosLinkVal linkval, struct WALPrevLinksLookups *pos)
+{
+	uint32		hash;
+#if PREV_LINKS_HASH_STRATEGY == 3
+	uint32		offset;
+#endif
+
+
+#if WAL_LINK_64
+	hash = murmurhash32(linkval.CurrPos ^ (linkval.CurrPos >> 32));
+#else
+	hash = murmurhash32(linkval.CurrPosId);
+#endif
+
+#if PREV_LINKS_HASH_STRATEGY == 1
+	pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
+	pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
+	pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
+#else
+	pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos->pos[1] = pos->pos[0] ^ 1;
+	pos->pos[2] = pos->pos[0] ^ 2;
+	pos->pos[3] = pos->pos[0] ^ 3;
+#if PREV_LINKS_HASH_STRATEGY == 3
+	/* use multiplication to compute 0 <= offset < PREV_LINKS_HASH_CAPA-4 */
+	offset = (hash / PREV_LINKS_HASH_CAPA) * (PREV_LINKS_HASH_CAPA - 4);
+	offset /= UINT32_MAX / PREV_LINKS_HASH_CAPA + 1;
+	/* add start of next bucket */
+	offset += (pos->pos[0] | 3) + 1;
+	/* get a position strictly in the other bucket */
+	pos->pos[4] = offset % PREV_LINKS_HASH_CAPA;
+	pos->pos[5] = pos->pos[4] ^ 1;
+	pos->pos[6] = pos->pos[4] ^ 2;
+	pos->pos[7] = pos->pos[4] ^ 3;
+#endif
+#endif
+}
+
+/*
+ * Attempt to write into empty link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkInsert(WALPrevPosLink *link, WALPrevPosLinkVal val)
+{
+#if WAL_LINK_64
+	uint64		empty = WAL_PREV_EMPTY;
+
+	if (pg_atomic_read_u64(&link->PrevPos) != WAL_PREV_EMPTY)
+		return false;
+	if (!pg_atomic_compare_exchange_u64(&link->PrevPos, &empty, val.PrevPos))
+		return false;
+	/* we could ignore concurrent lock of CurrPos */
+	pg_atomic_write_u64(&link->CurrPos, val.CurrPos);
+	return true;
+#else
+	uint32		empty = 0;
+
+	/* first do a cheap read-only check */
+	if (pg_atomic_read_u32(&link->PrevSize) != 0)
+		return false;
+	if (!pg_atomic_compare_exchange_u32(&link->PrevSize, &empty, 1))
+		/* someone else occupied the entry */
+		return false;
+
+	pg_atomic_write_u32(&link->CurrPosId, val.CurrPosId);
+	link->CurrPosHigh = val.CurrPosHigh;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, val.PrevSize);
+	return true;
+#endif
+}
+
+/*
+ * Attempt to consume matched link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkConsume(WALPrevPosLink *link, WALPrevPosLinkVal *val)
+{
+#if WAL_LINK_64
+	uint64		oldCurr;
+
+	if (pg_atomic_read_u64(&link->CurrPos) != val->CurrPos)
+		return false;
+	/* lock against concurrent swapper */
+	oldCurr = pg_atomic_fetch_or_u64(&link->CurrPos, 1);
+	if (oldCurr & 1)
+		return false;			/* lock failed */
+	if (oldCurr != val->CurrPos)
+	{
+		/* link was swapped */
+		pg_atomic_write_u64(&link->CurrPos, oldCurr);
+		return false;
+	}
+	val->PrevPos = pg_atomic_read_u64(&link->PrevPos);
+	pg_atomic_write_u64(&link->PrevPos, WAL_PREV_EMPTY);
+
+	/*
+	 * A concurrent inserter may have already reused this link, so we don't
+	 * check the result of compare_exchange.
+	 */
+	oldCurr |= 1;
+	pg_atomic_compare_exchange_u64(&link->CurrPos, &oldCurr, 0);
+	return true;
+#else
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId)
+		return false;
+
+	/* Try lock */
+	val->PrevSize = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (val->PrevSize & 1)
+		/* Lock failed */
+		return false;
+
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId ||
+		link->CurrPosHigh != val->CurrPosHigh)
+	{
+		/* unlock with old value */
+		pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+		return false;
+	}
+
+	pg_atomic_write_u32(&link->CurrPosId, 0);
+	link->CurrPosHigh = 0;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, 0);
+	return true;
+#endif
+}
+
+/*
+ * Attempt to swap the entry: remember the existing link and write ours.
+ * We may happen to consume an empty entry; the caller detects that by
+ * checking the remembered value.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkSwap(WALPrevPosLink *link, WALPrevPosLinkVal *val)
+{
+#if WAL_LINK_64
+	uint64		oldCurr;
+	uint64		oldPrev;
+
+	/* lock against concurrent swapper or consumer */
+	oldCurr = pg_atomic_fetch_or_u64(&link->CurrPos, 1);
+	if (oldCurr & 1)
+		return false;			/* lock failed */
+	if (oldCurr == 0)
+	{
+		/* link was empty */
+		oldPrev = WAL_PREV_EMPTY;
+		/* but a concurrent inserter may have raced us */
+		if (!pg_atomic_compare_exchange_u64(&link->PrevPos, &oldPrev, val->PrevPos))
+			return false;		/* concurrent inserter won. It will overwrite
+								 * CurrPos */
+		/* this write acts as unlock */
+		pg_atomic_write_u64(&link->CurrPos, val->CurrPos);
+		val->CurrPos = 0;
+		val->PrevPos = WAL_PREV_EMPTY;
+		return true;
+	}
+	oldPrev = pg_atomic_read_u64(&link->PrevPos);
+	pg_atomic_write_u64(&link->PrevPos, val->PrevPos);
+	pg_write_barrier();
+	/* write acts as unlock */
+	pg_atomic_write_u64(&link->CurrPos, val->CurrPos);
+	val->CurrPos = oldCurr;
+	val->PrevPos = oldPrev;
+	return true;
+#else
+	uint32		oldPrev;
+	uint32		oldCurId;
+	uint32		oldCurHigh;
+
+	/* Attempt to lock entry against concurrent consumer or swapper */
+	oldPrev = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (oldPrev & 1)
+		/* Lock failed */
+		return false;
+
+	oldCurId = pg_atomic_read_u32(&link->CurrPosId);
+	oldCurHigh = link->CurrPosHigh;
+	pg_atomic_write_u32(&link->CurrPosId, val->CurrPosId);
+	link->CurrPosHigh = val->CurrPosHigh;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+
+	val->CurrPosId = oldCurId;
+	val->CurrPosHigh = oldCurHigh;
+	val->PrevSize = oldPrev;
+	return true;
+#endif
+}
+
+/*
+ * Write new link (EndPos, StartPos) and find PrevPtr for StartPos.
+ *
+ * Links are stored in a lock-free Cuckoo-based hash table.
+ * We use mostly-4-way Cuckoo hashing, which provides a high fill rate without
+ * hard cycle collisions. We also rely on concurrent consumers of existing
+ * entries, so cycles are broken in the meantime.
+ *
+ * Cuckoo hashing relies on re-insertion for balancing, so we occasionally
+ * swap an entry out and try to insert the swapped-out entry instead of ours.
+ */
+static void
+LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos, XLogRecPtr *PrevPtr)
+{
+	SpinDelayStatus spin_stat;
+	WALPrevPosLinkVal lookup;
+	WALPrevPosLinkVal insert;
+	struct WALPrevLinksLookups lookup_pos;
+	struct WALPrevLinksLookups insert_pos;
+	uint32		i;
+	uint32		rand = 0;
+	bool		inserted = false;
+	bool		found = false;
+
+	/* pass StartPos second time to set PrevSize = 0 */
+	WALPrevPosLinkValCompose(&lookup, StartPos, StartPos);
+	WALPrevPosLinkValCompose(&insert, EndPos, StartPos);
+
+	CalcCuckooPositions(lookup, &lookup_pos);
+	CalcCuckooPositions(insert, &insert_pos);
+
+	init_local_spin_delay(&spin_stat);
+
+	while (!inserted || !found)
+	{
+		for (i = 0; !found && i < PREV_LINKS_LOOKUPS; i++)
+			found = WALPrevPosLinkConsume(&PrevLinksHash[lookup_pos.pos[i]], &lookup);
+
+		if (inserted)
+		{
+			/*
+			 * We may sleep only after we have inserted our value, since
+			 * another backend may be waiting for it.
+			 */
+			perform_spin_delay(&spin_stat);
+			goto next;
+		}
+
+		for (i = 0; !inserted && i < PREV_LINKS_LOOKUPS; i++)
+			inserted = WALPrevPosLinkInsert(&PrevLinksHash[insert_pos.pos[i]], insert);
+
+		if (inserted)
+			goto next;
+
+		rand = pg_prng_uint32(&pg_global_prng_state);
+		if (rand % SWAP_ONCE_IN != 0)
+			goto next;
+
+		i = rand / SWAP_ONCE_IN % PREV_LINKS_LOOKUPS;
+		if (!WALPrevPosLinkSwap(&PrevLinksHash[insert_pos.pos[i]], &insert))
+			goto next;
+
+		if (WALLinkEmpty(insert))
+			/* Lucky case: the entry became empty and we inserted into it */
+			inserted = true;
+		else if (WALLinkSamePos(lookup, insert))
+		{
+			/*
+			 * We happened to swap out the very entry we were looking for: the
+			 * swap already inserted ours, and the old value is our lookup.
+			 */
+			inserted = true;
+			Assert(!found);
+			found = true;
+			WALLinkCopyPrev(lookup, insert);
+			break;
+		}
+		else
+			CalcCuckooPositions(insert, &insert_pos);
+
+next:
+		pg_spin_delay();
+		pg_read_barrier();
+	}
+
+	WALPrevPosLinkValGetPrev(lookup, PrevPtr);
+}
+
+static pg_attribute_always_inline void
+LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec)
+{
+	WALPrevPosLinkVal insert;
+	struct WALPrevLinksLookups insert_pos;
+
+	WALPrevPosLinkValCompose(&insert, EndOfLog, LastRec);
+	CalcCuckooPositions(insert, &insert_pos);
+#if WAL_LINK_64
+	pg_atomic_write_u64(&PrevLinksHash[insert_pos.pos[0]].CurrPos, insert.CurrPos);
+	pg_atomic_write_u64(&PrevLinksHash[insert_pos.pos[0]].PrevPos, insert.PrevPos);
+#else
+	pg_atomic_write_u32(&PrevLinksHash[insert_pos.pos[0]].CurrPosId, insert.CurrPosId);
+	PrevLinksHash[insert_pos.pos[0]].CurrPosHigh = insert.CurrPosHigh;
+	pg_atomic_write_u32(&PrevLinksHash[insert_pos.pos[0]].PrevSize, insert.PrevSize);
+#endif
+}
+
+static pg_attribute_always_inline XLogRecPtr
+ReadInsertCurrBytePos(void)
+{
+	return pg_atomic_read_u64(&XLogCtl->Insert.CurrBytePos);
+}
+
 /*
  * Reserves the right amount of space for a record of given size from the WAL.
  * *StartPos is set to the beginning of the reserved section, *EndPos to
@@ -1118,25 +1589,9 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	/* All (non xlog-switch) records should contain data. */
 	Assert(size > SizeOfXLogRecord);
 
-	/*
-	 * The duration the spinlock needs to be held is minimized by minimizing
-	 * the calculations that have to be done while holding the lock. The
-	 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
-	 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
-	 * page headers. The mapping between "usable" byte positions and physical
-	 * positions (XLogRecPtrs) can be done outside the locked region, and
-	 * because the usable byte position doesn't include any headers, reserving
-	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
-	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
-
-	startbytepos = Insert->CurrBytePos;
+	startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
-
-	SpinLockRelease(&Insert->insertpos_lck);
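+	/*
+	 * LinkAndFindPrevPos publishes (endbytepos -> startbytepos) for whoever
+	 * reserves the next range, and waits to consume the link left for us by
+	 * the backend that reserved the previous range, yielding prevbytepos.
+	 */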
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
@@ -1172,26 +1627,24 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 	uint32		segleft;
 
 	/*
-	 * These calculations are a bit heavy-weight to be done while holding a
-	 * spinlock, but since we're holding all the WAL insertion locks, there
-	 * are no other inserters competing for it. GetXLogInsertRecPtr() does
-	 * compete for it, but that's not called very frequently.
+	 * Currently ReserveXLogSwitch runs while holding all the WAL insertion
+	 * locks exclusively, so there is no contention on CurrBytePos.  But we
+	 * still use a CAS loop here to keep the code uniform.
+	 *
+	 * We may get rid of the exclusive locking in the future.
 	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
 
-	startbytepos = Insert->CurrBytePos;
+repeat:
+	startbytepos = pg_atomic_read_u64(&Insert->CurrBytePos);
 
 	ptr = XLogBytePosToEndRecPtr(startbytepos);
 	if (XLogSegmentOffset(ptr, wal_segment_size) == 0)
 	{
-		SpinLockRelease(&Insert->insertpos_lck);
 		*EndPos = *StartPos = ptr;
 		return false;
 	}
 
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
 
@@ -1202,10 +1655,19 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 		*EndPos += segleft;
 		endbytepos = XLogRecPtrToBytePos(*EndPos);
 	}
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
 
-	SpinLockRelease(&Insert->insertpos_lck);
+	if (!pg_atomic_compare_exchange_u64(&Insert->CurrBytePos,
+										&startbytepos,
+										endbytepos))
+	{
+		/*
+		 * Don't use a spin delay here: perform_spin_delay is primarily meant
+		 * to resolve single-core contention, but on a single core we will
+		 * succeed on the next attempt anyway.
+		 */
+		goto repeat;
+	}
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
 
@@ -1507,7 +1969,6 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	XLogRecPtr	inserted;
 	XLogRecPtr	reservedUpto;
 	XLogRecPtr	finishedUpto;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	int			i;
 
 	if (MyProc == NULL)
@@ -1522,9 +1983,7 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 		return inserted;
 
 	/* Read the current insert position */
-	SpinLockAcquire(&Insert->insertpos_lck);
-	bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
+	bytepos = ReadInsertCurrBytePos();
 	reservedUpto = XLogBytePosToEndRecPtr(bytepos);
 
 	/*
@@ -4944,6 +5403,8 @@ XLOGShmemSize(void)
 
 	/* WAL insertion locks, plus alignment */
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
+	/* PrevLinksHash; piggybacks on the alignment of the WAL insertion locks */
+	size = add_size(size, mul_size(sizeof(WALPrevPosLink), PREV_LINKS_HASH_CAPA));
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(pg_atomic_uint64), XLOGbuffers));
 	/* extra alignment padding for XLOG I/O buffers */
@@ -4998,8 +5459,9 @@ XLOGShmemInit(void)
 		/* both should be present or neither */
 		Assert(foundCFile && foundXLog);
 
-		/* Initialize local copy of WALInsertLocks */
+		/* Initialize local copy of WALInsertLocks and PrevLinksHash */
 		WALInsertLocks = XLogCtl->Insert.WALInsertLocks;
+		PrevLinksHash = XLogCtl->Insert.PrevLinksHash;
 
 		if (localControlFile)
 			pfree(localControlFile);
@@ -5045,6 +5507,9 @@ XLOGShmemInit(void)
 		WALInsertLocks[i].l.lastImportantAt = InvalidXLogRecPtr;
 	}
 
+	PrevLinksHash = XLogCtl->Insert.PrevLinksHash = (WALPrevPosLink *) allocptr;
+	allocptr += sizeof(WALPrevPosLink) * PREV_LINKS_HASH_CAPA;
+
 	/*
 	 * Align the start of the page buffers to a full xlog block size boundary.
 	 * This simplifies some calculations in XLOG insertion. It is also
@@ -5063,12 +5528,24 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
-	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
 	pg_atomic_init_u64(&XLogCtl->logInsertResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
+
+	pg_atomic_init_u64(&XLogCtl->Insert.CurrBytePos, 0);
+
+	for (i = 0; i < PREV_LINKS_HASH_CAPA; i++)
+	{
+#if WAL_LINK_64
+		pg_atomic_init_u64(&PrevLinksHash[i].CurrPos, 0);
+		pg_atomic_init_u64(&PrevLinksHash[i].PrevPos, WAL_PREV_EMPTY);
+#else
+		pg_atomic_init_u32(&PrevLinksHash[i].CurrPosId, 0);
+		pg_atomic_init_u32(&PrevLinksHash[i].PrevSize, 0);
+#endif
+	}
 }
 
 /*
@@ -6064,8 +6541,13 @@ StartupXLOG(void)
 	 * previous incarnation.
 	 */
 	Insert = &XLogCtl->Insert;
-	Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
-	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+	{
+		XLogRecPtr	endOfLog = XLogRecPtrToBytePos(EndOfLog);
+		XLogRecPtr	lastRec = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
+
+		pg_atomic_write_u64(&Insert->CurrBytePos, endOfLog);
+		LinkStartPrevPos(endOfLog, lastRec);
+	}
 
 	/*
 	 * Tricky point here: lastPage contains the *last* block that the LastRec
@@ -7051,7 +7533,7 @@ CreateCheckPoint(int flags)
 
 	if (shutdown)
 	{
-		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 
 		/*
 		 * Compute new REDO record ptr = location of next XLOG record.
@@ -9469,14 +9951,7 @@ register_persistent_abort_backup_handler(void)
 XLogRecPtr
 GetXLogInsertRecPtr(void)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	uint64		current_bytepos;
-
-	SpinLockAcquire(&Insert->insertpos_lck);
-	current_bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
-
-	return XLogBytePosToRecPtr(current_bytepos);
+	return XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 }
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index bfa276d2d35..54f8a1e0d16 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3180,6 +3180,8 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALPrevPosLink
+WALPrevPosLinkVal
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.43.0

#46Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#45)
1 attachment(s)
Re: [RFC] Lock-free XLog Reservation from WAL

Just rebase

--
regards
Yura Sokolov aka funny-falcon

Attachments:

v6-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchtext/x-patch; charset=UTF-8; name=v6-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchDownload
From 4ea25d6feb655a072d1e9f40a547dc6aeab762ac Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.sokolov@postgrespro.ru>
Date: Sun, 19 Jan 2025 17:40:28 +0300
Subject: [PATCH v6] Lock-free XLog Reservation using lock-free hash-table

Removed PrevBytePos to eliminate lock contention, allowing atomic updates
to CurrBytePos. Use lock-free hash-table based on 4-way Cuckoo Hashing
to store link to PrevBytePos.
---
 src/backend/access/transam/xlog.c | 585 +++++++++++++++++++++++++++---
 src/tools/pgindent/typedefs.list  |   2 +
 2 files changed, 532 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2d4c346473b..0dff9addfe1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -68,6 +68,8 @@
 #include "catalog/pg_database.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
+#include "common/hashfn.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -379,6 +381,94 @@ typedef union WALInsertLockPadded
 	char		pad[PG_CACHE_LINE_SIZE];
 } WALInsertLockPadded;
 
+/* #define WAL_LINK_64 0 */
+#ifndef WAL_LINK_64
+#ifdef PG_HAVE_ATOMIC_U64_SIMULATION
+#define WAL_LINK_64 0
+#else
+#define WAL_LINK_64 1
+#endif
+#endif
+
+/*
+ * It links the current position with the previous one.
+ * - CurrPosId is (CurrBytePos ^ (CurrBytePos>>32)).
+ *   Since CurrBytePos grows monotonically and is aligned to MAXALIGN,
+ *   CurrPosId correctly identifies CurrBytePos for at least 4*2^32 = 16GB
+ *   of WAL.
+ * - CurrPosHigh is (CurrBytePos>>32); stored for a stronger uniqueness check.
+ * - PrevSize is the difference between CurrBytePos and PrevBytePos.
+ */
+typedef struct
+{
+#if WAL_LINK_64
+	uint64		CurrPos;
+	uint64		PrevPos;
+#define WAL_PREV_EMPTY (~((uint64)0))
+#define WALLinkEmpty(l) ((l).PrevPos == WAL_PREV_EMPTY)
+#define WALLinkSamePos(a, b) ((a).CurrPos == (b).CurrPos)
+#define WALLinkCopyPrev(a, b) do {(a).PrevPos = (b).PrevPos;} while(0)
+#else
+	uint32		CurrPosId;
+	uint32		CurrPosHigh;
+	uint32		PrevSize;
+#define WALLinkEmpty(l) ((l).PrevSize == 0)
+#define WALLinkSamePos(a, b) ((a).CurrPosId == (b).CurrPosId && (a).CurrPosHigh == (b).CurrPosHigh)
+#define WALLinkCopyPrev(a, b) do {(a).PrevSize = (b).PrevSize;} while(0)
+#endif
+} WALPrevPosLinkVal;
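+
+/*
+ * Purely illustrative example of the 32-bit-mode encoding above (made-up
+ * values): for CurrBytePos = 0x123456780 and PrevBytePos = 0x123456700 we
+ * get CurrPosHigh = 0x1, CurrPosId = 0x23456780 ^ 0x1 = 0x23456781 and
+ * PrevSize = 0x80; the reader recovers PrevBytePos as
+ * ((CurrPosHigh << 32) | (CurrPosId ^ CurrPosHigh)) - PrevSize.
+ */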
+
+/*
+ * This is an element of the lock-free hash table.
+ * In 32-bit mode the lowest bit of PrevSize is used as a lock, relying on the
+ * fact that it is MAXALIGN-ed. In 64-bit mode the lock protocol is more complex.
+ */
+typedef struct
+{
+#if WAL_LINK_64
+	pg_atomic_uint64 CurrPos;
+	pg_atomic_uint64 PrevPos;
+#else
+	pg_atomic_uint32 CurrPosId;
+	uint32		CurrPosHigh;
+	pg_atomic_uint32 PrevSize;
+	uint32		pad;			/* to align to 16 bytes */
+#endif
+} WALPrevPosLink;
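+
+/*
+ * Summary of the 64-bit lock protocol (see the functions below): a slot is
+ * claimed by a compare-and-swap of PrevPos away from WAL_PREV_EMPTY, while
+ * consumers and swappers lock a slot by setting the lowest bit of CurrPos
+ * (byte positions are MAXALIGN-ed, so that bit is otherwise always zero).
+ */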
+
+StaticAssertDecl(sizeof(WALPrevPosLink) == 16, "WALPrevPosLink should be 16 bytes");
+
+#define PREV_LINKS_HASH_CAPA (NUM_XLOGINSERT_LOCKS * 2)
+StaticAssertDecl(!(PREV_LINKS_HASH_CAPA & (PREV_LINKS_HASH_CAPA - 1)),
+				 "PREV_LINKS_HASH_CAPA should be power of two");
+StaticAssertDecl(PREV_LINKS_HASH_CAPA < UINT16_MAX,
+				 "PREV_LINKS_HASH_CAPA is too large");
+
+/*-----------
+ * PREV_LINKS_HASH_STRATEGY - the way slots are chosen in the hash table
+ *   1 - 4 positions h1,h1+1,h2,h2+2 - guarantees at least 3 distinct points,
+ *     but may spread across 4 cache lines.
+ *   2 - 4 positions h,h^1,h^2,h^3 - 4 points in a single cache line.
+ *   3 - 8 positions h1,h1^1,h1^2,h1^3,h2,h2^1,h2^2,h2^3 - 8 distinct points
+ *     in two cache lines.
+ */
+#ifndef PREV_LINKS_HASH_STRATEGY
+#define PREV_LINKS_HASH_STRATEGY 3
+#endif
+
+#if PREV_LINKS_HASH_STRATEGY <= 2
+#define PREV_LINKS_LOOKUPS 4
+#else
+#define PREV_LINKS_LOOKUPS 8
+#endif
+
+struct WALPrevLinksLookups
+{
+	uint16		pos[PREV_LINKS_LOOKUPS];
+};
+
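+/*
+ * When an insert attempt keeps failing, we try a Cuckoo swap roughly once
+ * per SWAP_ONCE_IN loop iterations (see LinkAndFindPrevPos).
+ */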
+#define SWAP_ONCE_IN 128
+
 /*
  * Session status of running backup, used for sanity checks in SQL-callable
  * functions to start and stop backups.
@@ -390,17 +480,18 @@ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
  */
 typedef struct XLogCtlInsert
 {
-	slock_t		insertpos_lck;	/* protects CurrBytePos and PrevBytePos */
-
 	/*
 	 * CurrBytePos is the end of reserved WAL. The next record will be
-	 * inserted at that position. PrevBytePos is the start position of the
-	 * previously inserted (or rather, reserved) record - it is copied to the
-	 * prev-link of the next record. These are stored as "usable byte
-	 * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).
+	 * inserted at that position.
+	 *
+	 * The start position of the previously inserted (or rather, reserved)
+	 * record (it is copied to the prev-link of the next record) will be
+	 * stored in PrevLinksHash.
+	 *
+	 * These are stored as "usable byte positions" rather than XLogRecPtrs
+	 * (see XLogBytePosToRecPtr()).
 	 */
-	uint64		CurrBytePos;
-	uint64		PrevBytePos;
+	pg_atomic_uint64 CurrBytePos;
 
 	/*
 	 * Make sure the above heavily-contended spinlock and byte positions are
@@ -437,8 +528,37 @@ typedef struct XLogCtlInsert
 	 * WAL insertion locks.
 	 */
 	WALInsertLockPadded *WALInsertLocks;
+
+	/*
+	 * PrevLinksHash is a lock-free hash table based on the Cuckoo algorithm.
+	 *
+	 * With PREV_LINKS_HASH_STRATEGY == 1 it is mostly 4-way: for every
+	 * element two positions h1 and h2 are computed, and the neighbours h1+1
+	 * and h2+2 are used as well. This way, even on collision we have at
+	 * least 3 distinct positions, which gives us a ~75% fill rate without
+	 * unsolvable cycles (per Cuckoo hashing theory). But the chosen slots
+	 * may land in 4 distinct cache lines.
+	 *
+	 * With PREV_LINKS_HASH_STRATEGY == 3 it takes two buckets of 4 elements
+	 * each - 8 positions in total, but guaranteed to be in two cache lines.
+	 * It provides a very high fill rate - up to 90%.
+	 *
+	 * With PREV_LINKS_HASH_STRATEGY == 2 it takes only one bucket with 4
+	 * elements. Strictly speaking this is not Cuckoo hashing, but it should
+	 * work for our case.
+	 *
+	 * Crucially, we rely on the fact that elements are deleted at the same
+	 * rate as they are added, so even unsolvable cycles are soon broken by
+	 * concurrent deletions.
+	 */
+	WALPrevPosLink *PrevLinksHash;
+
 } XLogCtlInsert;
 
+StaticAssertDecl(offsetof(XLogCtlInsert, RedoRecPtr) / PG_CACHE_LINE_SIZE !=
+				 offsetof(XLogCtlInsert, CurrBytePos) / PG_CACHE_LINE_SIZE,
+				 "offset ok");
+
 /*
  * Total shared-memory state for XLOG.
  */
@@ -579,6 +699,9 @@ static XLogCtlData *XLogCtl = NULL;
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
 static WALInsertLockPadded *WALInsertLocks = NULL;
 
+/* same for XLogCtl->Insert.PrevLinksHash */
+static WALPrevPosLink *PrevLinksHash = NULL;
+
 /*
  * We maintain an image of pg_control in shared memory.
  */
@@ -711,6 +834,19 @@ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
 								XLogRecData *rdata,
 								XLogRecPtr StartPos, XLogRecPtr EndPos,
 								TimeLineID tli);
+
+static void WALPrevPosLinkValCompose(WALPrevPosLinkVal *val, XLogRecPtr StartPos, XLogRecPtr PrevPos);
+static void WALPrevPosLinkValGetPrev(WALPrevPosLinkVal val, XLogRecPtr *PrevPos);
+static void CalcCuckooPositions(WALPrevPosLinkVal linkval, struct WALPrevLinksLookups *pos);
+
+static bool WALPrevPosLinkInsert(WALPrevPosLink *link, WALPrevPosLinkVal val);
+static bool WALPrevPosLinkConsume(WALPrevPosLink *link, WALPrevPosLinkVal *val);
+static bool WALPrevPosLinkSwap(WALPrevPosLink *link, WALPrevPosLinkVal *val);
+static void LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos,
+							   XLogRecPtr *PrevPtr);
+static void LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec);
+static XLogRecPtr ReadInsertCurrBytePos(void);
+
 static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
@@ -1097,6 +1233,341 @@ XLogInsertRecord(XLogRecData *rdata,
 	return EndPos;
 }
 
+static pg_attribute_always_inline void
+WALPrevPosLinkValCompose(WALPrevPosLinkVal *val, XLogRecPtr StartPos, XLogRecPtr PrevPos)
+{
+#if WAL_LINK_64
+	val->CurrPos = StartPos;
+	val->PrevPos = PrevPos;
+#else
+	val->CurrPosHigh = StartPos >> 32;
+	val->CurrPosId = StartPos ^ val->CurrPosHigh;
+	val->PrevSize = StartPos - PrevPos;
+#endif
+}
+
+static pg_attribute_always_inline void
+WALPrevPosLinkValGetPrev(WALPrevPosLinkVal val, XLogRecPtr *PrevPos)
+{
+#if WAL_LINK_64
+	*PrevPos = val.PrevPos;
+#else
+	XLogRecPtr	StartPos = val.CurrPosHigh;
+
+	StartPos ^= (StartPos << 32) | val.CurrPosId;
+	*PrevPos = StartPos - val.PrevSize;
+#endif
+}
+
+static pg_attribute_always_inline void
+CalcCuckooPositions(WALPrevPosLinkVal linkval, struct WALPrevLinksLookups *pos)
+{
+	uint32		hash;
+#if PREV_LINKS_HASH_STRATEGY == 3
+	uint32		offset;
+#endif
+
+
+#if WAL_LINK_64
+	hash = murmurhash32(linkval.CurrPos ^ (linkval.CurrPos >> 32));
+#else
+	hash = murmurhash32(linkval.CurrPosId);
+#endif
+
+#if PREV_LINKS_HASH_STRATEGY == 1
+	pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
+	pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
+	pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
+#else
+	pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos->pos[1] = pos->pos[0] ^ 1;
+	pos->pos[2] = pos->pos[0] ^ 2;
+	pos->pos[3] = pos->pos[0] ^ 3;
+#if PREV_LINKS_HASH_STRATEGY == 3
+	/* use multiplication to compute 0 <= offset < PREV_LINKS_HASH_CAPA-4 */
+	offset = (hash / PREV_LINKS_HASH_CAPA) * (PREV_LINKS_HASH_CAPA - 4);
+	offset /= UINT32_MAX / PREV_LINKS_HASH_CAPA + 1;
+	/* add start of next bucket */
+	offset += (pos->pos[0] | 3) + 1;
+	/* get a position strictly in the other bucket */
+	pos->pos[4] = offset % PREV_LINKS_HASH_CAPA;
+	pos->pos[5] = pos->pos[4] ^ 1;
+	pos->pos[6] = pos->pos[4] ^ 2;
+	pos->pos[7] = pos->pos[4] ^ 3;
+#endif
+#endif
+}
+
+/*
+ * Attempt to write into empty link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkInsert(WALPrevPosLink *link, WALPrevPosLinkVal val)
+{
+#if WAL_LINK_64
+	uint64		empty = WAL_PREV_EMPTY;
+
+	if (pg_atomic_read_u64(&link->PrevPos) != WAL_PREV_EMPTY)
+		return false;
+	if (!pg_atomic_compare_exchange_u64(&link->PrevPos, &empty, val.PrevPos))
+		return false;
+	/* we could ignore concurrent lock of CurrPos */
+	pg_atomic_write_u64(&link->CurrPos, val.CurrPos);
+	return true;
+#else
+	uint32		empty = 0;
+
+	/* first do a cheap read-only check */
+	if (pg_atomic_read_u32(&link->PrevSize) != 0)
+		return false;
+	if (!pg_atomic_compare_exchange_u32(&link->PrevSize, &empty, 1))
+		/* someone else occupied the entry */
+		return false;
+
+	pg_atomic_write_u32(&link->CurrPosId, val.CurrPosId);
+	link->CurrPosHigh = val.CurrPosHigh;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, val.PrevSize);
+	return true;
+#endif
+}
+
+/*
+ * Attempt to consume matched link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkConsume(WALPrevPosLink *link, WALPrevPosLinkVal *val)
+{
+#if WAL_LINK_64
+	uint64		oldCurr;
+
+	if (pg_atomic_read_u64(&link->CurrPos) != val->CurrPos)
+		return false;
+	/* lock against concurrent swapper */
+	oldCurr = pg_atomic_fetch_or_u64(&link->CurrPos, 1);
+	if (oldCurr & 1)
+		return false;			/* lock failed */
+	if (oldCurr != val->CurrPos)
+	{
+		/* link was swapped */
+		pg_atomic_write_u64(&link->CurrPos, oldCurr);
+		return false;
+	}
+	val->PrevPos = pg_atomic_read_u64(&link->PrevPos);
+	pg_atomic_write_u64(&link->PrevPos, WAL_PREV_EMPTY);
+
+	/*
+	 * A concurrent inserter may have already reused this link, so we don't
+	 * check the result of compare_exchange.
+	 */
+	oldCurr |= 1;
+	pg_atomic_compare_exchange_u64(&link->CurrPos, &oldCurr, 0);
+	return true;
+#else
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId)
+		return false;
+
+	/* Try lock */
+	val->PrevSize = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (val->PrevSize & 1)
+		/* Lock failed */
+		return false;
+
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId ||
+		link->CurrPosHigh != val->CurrPosHigh)
+	{
+		/* unlock with old value */
+		pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+		return false;
+	}
+
+	pg_atomic_write_u32(&link->CurrPosId, 0);
+	link->CurrPosHigh = 0;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, 0);
+	return true;
+#endif
+}
+
+/*
+ * Attempt to swap the entry: remember the existing link and write ours.
+ * We may happen to consume an empty entry; the caller detects that by
+ * checking the remembered value.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkSwap(WALPrevPosLink *link, WALPrevPosLinkVal *val)
+{
+#if WAL_LINK_64
+	uint64		oldCurr;
+	uint64		oldPrev;
+
+	/* lock against concurrent swapper or consumer */
+	oldCurr = pg_atomic_fetch_or_u64(&link->CurrPos, 1);
+	if (oldCurr & 1)
+		return false;			/* lock failed */
+	if (oldCurr == 0)
+	{
+		/* link was empty */
+		oldPrev = WAL_PREV_EMPTY;
+		/* but a concurrent inserter may have raced us */
+		if (!pg_atomic_compare_exchange_u64(&link->PrevPos, &oldPrev, val->PrevPos))
+			return false;		/* concurrent inserter won. It will overwrite
+								 * CurrPos */
+		/* this write acts as unlock */
+		pg_atomic_write_u64(&link->CurrPos, val->CurrPos);
+		val->CurrPos = 0;
+		val->PrevPos = WAL_PREV_EMPTY;
+		return true;
+	}
+	oldPrev = pg_atomic_read_u64(&link->PrevPos);
+	pg_atomic_write_u64(&link->PrevPos, val->PrevPos);
+	pg_write_barrier();
+	/* write acts as unlock */
+	pg_atomic_write_u64(&link->CurrPos, val->CurrPos);
+	val->CurrPos = oldCurr;
+	val->PrevPos = oldPrev;
+	return true;
+#else
+	uint32		oldPrev;
+	uint32		oldCurId;
+	uint32		oldCurHigh;
+
+	/* Attempt to lock entry against concurrent consumer or swapper */
+	oldPrev = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (oldPrev & 1)
+		/* Lock failed */
+		return false;
+
+	oldCurId = pg_atomic_read_u32(&link->CurrPosId);
+	oldCurHigh = link->CurrPosHigh;
+	pg_atomic_write_u32(&link->CurrPosId, val->CurrPosId);
+	link->CurrPosHigh = val->CurrPosHigh;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+
+	val->CurrPosId = oldCurId;
+	val->CurrPosHigh = oldCurHigh;
+	val->PrevSize = oldPrev;
+	return true;
+#endif
+}
+
+/*
+ * Write new link (EndPos, StartPos) and find PrevPtr for StartPos.
+ *
+ * Links are stored in a lock-free Cuckoo-based hash table.
+ * We use mostly-4-way Cuckoo hashing, which provides a high fill rate without
+ * hard cycle collisions. We also rely on concurrent consumers of existing
+ * entries, so cycles are broken in the meantime.
+ *
+ * Cuckoo hashing relies on re-insertion for balancing, so we occasionally
+ * swap an entry out and try to insert the swapped-out entry instead of ours.
+ */
+static void
+LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos, XLogRecPtr *PrevPtr)
+{
+	SpinDelayStatus spin_stat;
+	WALPrevPosLinkVal lookup;
+	WALPrevPosLinkVal insert;
+	struct WALPrevLinksLookups lookup_pos;
+	struct WALPrevLinksLookups insert_pos;
+	uint32		i;
+	uint32		rand = 0;
+	bool		inserted = false;
+	bool		found = false;
+
+	/* pass StartPos second time to set PrevSize = 0 */
+	WALPrevPosLinkValCompose(&lookup, StartPos, StartPos);
+	WALPrevPosLinkValCompose(&insert, EndPos, StartPos);
+
+	CalcCuckooPositions(lookup, &lookup_pos);
+	CalcCuckooPositions(insert, &insert_pos);
+
+	init_local_spin_delay(&spin_stat);
+
+	while (!inserted || !found)
+	{
+		for (i = 0; !found && i < PREV_LINKS_LOOKUPS; i++)
+			found = WALPrevPosLinkConsume(&PrevLinksHash[lookup_pos.pos[i]], &lookup);
+
+		if (inserted)
+		{
+			/*
+			 * We may sleep only after we have inserted our value, since
+			 * another backend may be waiting for it.
+			 */
+			perform_spin_delay(&spin_stat);
+			goto next;
+		}
+
+		for (i = 0; !inserted && i < PREV_LINKS_LOOKUPS; i++)
+			inserted = WALPrevPosLinkInsert(&PrevLinksHash[insert_pos.pos[i]], insert);
+
+		if (inserted)
+			goto next;
+
+		rand = pg_prng_uint32(&pg_global_prng_state);
+		if (rand % SWAP_ONCE_IN != 0)
+			goto next;
+
+		i = rand / SWAP_ONCE_IN % PREV_LINKS_LOOKUPS;
+		if (!WALPrevPosLinkSwap(&PrevLinksHash[insert_pos.pos[i]], &insert))
+			goto next;
+
+		if (WALLinkEmpty(insert))
+			/* Lucky case: the entry became empty and we inserted into it */
+			inserted = true;
+		else if (WALLinkSamePos(lookup, insert))
+		{
+			/*
+			 * We happened to swap out the very entry we were looking for: the
+			 * swap already inserted ours, and the old value is our lookup.
+			 */
+			inserted = true;
+			Assert(!found);
+			found = true;
+			WALLinkCopyPrev(lookup, insert);
+			break;
+		}
+		else
+			CalcCuckooPositions(insert, &insert_pos);
+
+next:
+		pg_spin_delay();
+		pg_read_barrier();
+	}
+
+	WALPrevPosLinkValGetPrev(lookup, PrevPtr);
+}
+
+static pg_attribute_always_inline void
+LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec)
+{
+	WALPrevPosLinkVal insert;
+	struct WALPrevLinksLookups insert_pos;
+
+	WALPrevPosLinkValCompose(&insert, EndOfLog, LastRec);
+	CalcCuckooPositions(insert, &insert_pos);
+#if WAL_LINK_64
+	pg_atomic_write_u64(&PrevLinksHash[insert_pos.pos[0]].CurrPos, insert.CurrPos);
+	pg_atomic_write_u64(&PrevLinksHash[insert_pos.pos[0]].PrevPos, insert.PrevPos);
+#else
+	pg_atomic_write_u32(&PrevLinksHash[insert_pos.pos[0]].CurrPosId, insert.CurrPosId);
+	PrevLinksHash[insert_pos.pos[0]].CurrPosHigh = insert.CurrPosHigh;
+	pg_atomic_write_u32(&PrevLinksHash[insert_pos.pos[0]].PrevSize, insert.PrevSize);
+#endif
+}
+
+static pg_attribute_always_inline XLogRecPtr
+ReadInsertCurrBytePos(void)
+{
+	return pg_atomic_read_u64(&XLogCtl->Insert.CurrBytePos);
+}
+
 /*
  * Reserves the right amount of space for a record of given size from the WAL.
  * *StartPos is set to the beginning of the reserved section, *EndPos to
@@ -1129,25 +1600,9 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	/* All (non xlog-switch) records should contain data. */
 	Assert(size > SizeOfXLogRecord);
 
-	/*
-	 * The duration the spinlock needs to be held is minimized by minimizing
-	 * the calculations that have to be done while holding the lock. The
-	 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
-	 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
-	 * page headers. The mapping between "usable" byte positions and physical
-	 * positions (XLogRecPtrs) can be done outside the locked region, and
-	 * because the usable byte position doesn't include any headers, reserving
-	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
-	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
-
-	startbytepos = Insert->CurrBytePos;
+	startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
-
-	SpinLockRelease(&Insert->insertpos_lck);
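+	/*
+	 * LinkAndFindPrevPos publishes (endbytepos -> startbytepos) for whoever
+	 * reserves the next range, and waits to consume the link left for us by
+	 * the backend that reserved the previous range, yielding prevbytepos.
+	 */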
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
@@ -1183,26 +1638,24 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 	uint32		segleft;
 
 	/*
-	 * These calculations are a bit heavy-weight to be done while holding a
-	 * spinlock, but since we're holding all the WAL insertion locks, there
-	 * are no other inserters competing for it. GetXLogInsertRecPtr() does
-	 * compete for it, but that's not called very frequently.
+	 * Currently ReserveXLogSwitch runs while holding all the WAL insertion
+	 * locks exclusively, so there is no contention on CurrBytePos.  But we
+	 * still use a CAS loop here to keep the code uniform.
+	 *
+	 * We may get rid of the exclusive locking in the future.
 	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
 
-	startbytepos = Insert->CurrBytePos;
+repeat:
+	startbytepos = pg_atomic_read_u64(&Insert->CurrBytePos);
 
 	ptr = XLogBytePosToEndRecPtr(startbytepos);
 	if (XLogSegmentOffset(ptr, wal_segment_size) == 0)
 	{
-		SpinLockRelease(&Insert->insertpos_lck);
 		*EndPos = *StartPos = ptr;
 		return false;
 	}
 
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
 
@@ -1213,10 +1666,19 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 		*EndPos += segleft;
 		endbytepos = XLogRecPtrToBytePos(*EndPos);
 	}
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
 
-	SpinLockRelease(&Insert->insertpos_lck);
+	if (!pg_atomic_compare_exchange_u64(&Insert->CurrBytePos,
+										&startbytepos,
+										endbytepos))
+	{
+		/*
+		 * Don't use spin delay here: perform_spin_delay primary case is for
+		 * solving single core contention. But on single core we will succeed
+		 * on the next attempt.
+		 */
+		goto repeat;
+	}
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
 
@@ -1518,7 +1980,6 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	XLogRecPtr	inserted;
 	XLogRecPtr	reservedUpto;
 	XLogRecPtr	finishedUpto;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	int			i;
 
 	if (MyProc == NULL)
@@ -1533,9 +1994,7 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 		return inserted;
 
 	/* Read the current insert position */
-	SpinLockAcquire(&Insert->insertpos_lck);
-	bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
+	bytepos = ReadInsertCurrBytePos();
 	reservedUpto = XLogBytePosToEndRecPtr(bytepos);
 
 	/*
@@ -5079,6 +5538,8 @@ XLOGShmemSize(void)
 
 	/* WAL insertion locks, plus alignment */
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
+	/* PrevLinksHash; piggybacks on the alignment of the WAL insertion locks */
+	size = add_size(size, mul_size(sizeof(WALPrevPosLink), PREV_LINKS_HASH_CAPA));
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(pg_atomic_uint64), XLOGbuffers));
 	/* extra alignment padding for XLOG I/O buffers */
@@ -5133,8 +5594,9 @@ XLOGShmemInit(void)
 		/* both should be present or neither */
 		Assert(foundCFile && foundXLog);
 
-		/* Initialize local copy of WALInsertLocks */
+		/* Initialize local copy of WALInsertLocks and PrevLinksHash */
 		WALInsertLocks = XLogCtl->Insert.WALInsertLocks;
+		PrevLinksHash = XLogCtl->Insert.PrevLinksHash;
 
 		if (localControlFile)
 			pfree(localControlFile);
@@ -5180,6 +5642,9 @@ XLOGShmemInit(void)
 		WALInsertLocks[i].l.lastImportantAt = InvalidXLogRecPtr;
 	}
 
+	PrevLinksHash = XLogCtl->Insert.PrevLinksHash = (WALPrevPosLink *) allocptr;
+	allocptr += sizeof(WALPrevPosLink) * PREV_LINKS_HASH_CAPA;
+
 	/*
 	 * Align the start of the page buffers to a full xlog block size boundary.
 	 * This simplifies some calculations in XLOG insertion. It is also
@@ -5198,7 +5663,6 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
-	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
 	pg_atomic_init_u64(&XLogCtl->logInsertResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
@@ -5208,6 +5672,19 @@ XLOGShmemInit(void)
 	pg_atomic_init_u64(&XLogCtl->InitializeReserved, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->InitializedUpTo, InvalidXLogRecPtr);
 	ConditionVariableInit(&XLogCtl->InitializedUpToCondVar);
+
+	pg_atomic_init_u64(&XLogCtl->Insert.CurrBytePos, 0);
+
+	for (i = 0; i < PREV_LINKS_HASH_CAPA; i++)
+	{
+#if WAL_LINK_64
+		pg_atomic_init_u64(&PrevLinksHash[i].CurrPos, 0);
+		pg_atomic_init_u64(&PrevLinksHash[i].PrevPos, WAL_PREV_EMPTY);
+#else
+		pg_atomic_init_u32(&PrevLinksHash[i].CurrPosId, 0);
+		pg_atomic_init_u32(&PrevLinksHash[i].PrevSize, 0);
+#endif
+	}
 }
 
 /*
@@ -6203,8 +6680,13 @@ StartupXLOG(void)
 	 * previous incarnation.
 	 */
 	Insert = &XLogCtl->Insert;
-	Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
-	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+	{
+		XLogRecPtr	endOfLog = XLogRecPtrToBytePos(EndOfLog);
+		XLogRecPtr	lastRec = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
+
+		pg_atomic_write_u64(&Insert->CurrBytePos, endOfLog);
+		LinkStartPrevPos(endOfLog, lastRec);
+	}
 
 	/*
 	 * Tricky point here: lastPage contains the *last* block that the LastRec
@@ -7193,7 +7675,7 @@ CreateCheckPoint(int flags)
 
 	if (shutdown)
 	{
-		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 
 		/*
 		 * Compute new REDO record ptr = location of next XLOG record.
@@ -9611,14 +10093,7 @@ register_persistent_abort_backup_handler(void)
 XLogRecPtr
 GetXLogInsertRecPtr(void)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	uint64		current_bytepos;
-
-	SpinLockAcquire(&Insert->insertpos_lck);
-	current_bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
-
-	return XLogBytePosToRecPtr(current_bytepos);
+	return XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 }
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e5879e00dff..f0bfb4762c3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3211,6 +3211,8 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALPrevPosLink
+WALPrevPosLinkVal
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.43.0

#47Zhou, Zhiguo
zhiguo.zhou@intel.com
In reply to: Yura Sokolov (#46)
1 attachment(s)
Re: [RFC] Lock-free XLog Reservation from WAL

Rebase again.

Regards,
Zhiguo


On 4/30/2025 10:55 PM, Yura Sokolov wrote:

Just rebase

Attachments:

v7-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchtext/plain; charset=UTF-8; name=v7-0001-Lock-free-XLog-Reservation-using-lock-free-hash-t.patchDownload
From 4e3fbbd66382c6a6dfd4abae802eda8e79d8d892 Mon Sep 17 00:00:00 2001
From: Zhiguo Zhou <zhiguo.zhou@intel.com>
Date: Mon, 27 Oct 2025 17:11:30 +0800
Subject: [PATCH] Lock-free XLog Reservation using lock-free hash-table

Removed PrevBytePos to eliminate lock contention, allowing atomic updates
to CurrBytePos. Use lock-free hash-table based on 4-way Cuckoo Hashing
to store link to PrevBytePos.
---
 src/backend/access/transam/xlog.c | 587 +++++++++++++++++++++++++++---
 src/tools/pgindent/typedefs.list  |   2 +
 2 files changed, 532 insertions(+), 57 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..1c9830f89af 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -68,6 +68,8 @@
 #include "catalog/pg_database.h"
 #include "common/controldata_utils.h"
 #include "common/file_utils.h"
+#include "common/hashfn.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -385,6 +387,94 @@ typedef union WALInsertLockPadded
 	char		pad[PG_CACHE_LINE_SIZE];
 } WALInsertLockPadded;
 
+/* #define WAL_LINK_64 0 */
+#ifndef WAL_LINK_64
+#ifdef PG_HAVE_ATOMIC_U64_SIMULATION
+#define WAL_LINK_64 0
+#else
+#define WAL_LINK_64 1
+#endif
+#endif
+
+/*
+ * It links the current position with the previous one.
+ * - CurrPosId is (CurrBytePos ^ (CurrBytePos>>32)).
+ *   Since CurrBytePos grows monotonically and is aligned to MAXALIGN,
+ *   CurrPosId correctly identifies CurrBytePos for at least 4*2^32 = 16GB
+ *   of WAL.
+ * - CurrPosHigh is (CurrBytePos>>32); stored for a stronger uniqueness check.
+ * - PrevSize is the difference between CurrBytePos and PrevBytePos.
+ */
+typedef struct
+{
+#if WAL_LINK_64
+	uint64		CurrPos;
+	uint64		PrevPos;
+#define WAL_PREV_EMPTY (~((uint64)0))
+#define WALLinkEmpty(l) ((l).PrevPos == WAL_PREV_EMPTY)
+#define WALLinkSamePos(a, b) ((a).CurrPos == (b).CurrPos)
+#define WALLinkCopyPrev(a, b) do {(a).PrevPos = (b).PrevPos;} while(0)
+#else
+	uint32		CurrPosId;
+	uint32		CurrPosHigh;
+	uint32		PrevSize;
+#define WALLinkEmpty(l) ((l).PrevSize == 0)
+#define WALLinkSamePos(a, b) ((a).CurrPosId == (b).CurrPosId && (a).CurrPosHigh == (b).CurrPosHigh)
+#define WALLinkCopyPrev(a, b) do {(a).PrevSize = (b).PrevSize;} while(0)
+#endif
+} WALPrevPosLinkVal;
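+
+/*
+ * Purely illustrative example of the 32-bit-mode encoding above (made-up
+ * values): for CurrBytePos = 0x123456780 and PrevBytePos = 0x123456700 we
+ * get CurrPosHigh = 0x1, CurrPosId = 0x23456780 ^ 0x1 = 0x23456781 and
+ * PrevSize = 0x80; the reader recovers PrevBytePos as
+ * ((CurrPosHigh << 32) | (CurrPosId ^ CurrPosHigh)) - PrevSize.
+ */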
+
+/*
+ * This is an element of the lock-free hash table.
+ * In 32-bit mode the lowest bit of PrevSize is used as a lock, relying on the
+ * fact that it is MAXALIGN-ed. In 64-bit mode the lock protocol is more complex.
+ */
+typedef struct
+{
+#if WAL_LINK_64
+	pg_atomic_uint64 CurrPos;
+	pg_atomic_uint64 PrevPos;
+#else
+	pg_atomic_uint32 CurrPosId;
+	uint32		CurrPosHigh;
+	pg_atomic_uint32 PrevSize;
+	uint32		pad;			/* to align to 16 bytes */
+#endif
+} WALPrevPosLink;
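+
+/*
+ * Summary of the 64-bit lock protocol (see the functions below): a slot is
+ * claimed by a compare-and-swap of PrevPos away from WAL_PREV_EMPTY, while
+ * consumers and swappers lock a slot by setting the lowest bit of CurrPos
+ * (byte positions are MAXALIGN-ed, so that bit is otherwise always zero).
+ */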
+
+StaticAssertDecl(sizeof(WALPrevPosLink) == 16, "WALPrevPosLink should be 16 bytes");
+
+#define PREV_LINKS_HASH_CAPA (NUM_XLOGINSERT_LOCKS * 2)
+StaticAssertDecl(!(PREV_LINKS_HASH_CAPA & (PREV_LINKS_HASH_CAPA - 1)),
+				 "PREV_LINKS_HASH_CAPA should be power of two");
+StaticAssertDecl(PREV_LINKS_HASH_CAPA < UINT16_MAX,
+				 "PREV_LINKS_HASH_CAPA is too large");
+
+/*-----------
+ * PREV_LINKS_HASH_STRATEGY - the way slots are chosen in the hash table
+ *   1 - 4 positions h1,h1+1,h2,h2+2 - guarantees at least 3 distinct points,
+ *     but may spread across 4 cache lines.
+ *   2 - 4 positions h,h^1,h^2,h^3 - 4 points in a single cache line.
+ *   3 - 8 positions h1,h1^1,h1^2,h1^3,h2,h2^1,h2^2,h2^3 - 8 distinct points
+ *     in two cache lines.
+ */
+#ifndef PREV_LINKS_HASH_STRATEGY
+#define PREV_LINKS_HASH_STRATEGY 3
+#endif
+
+#if PREV_LINKS_HASH_STRATEGY <= 2
+#define PREV_LINKS_LOOKUPS 4
+#else
+#define PREV_LINKS_LOOKUPS 8
+#endif
+
+struct WALPrevLinksLookups
+{
+	uint16		pos[PREV_LINKS_LOOKUPS];
+};
+
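+/*
+ * When an insert attempt keeps failing, we try a Cuckoo swap roughly once
+ * per SWAP_ONCE_IN loop iterations (see LinkAndFindPrevPos).
+ */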
+#define SWAP_ONCE_IN 128
+
 /*
  * Session status of running backup, used for sanity checks in SQL-callable
  * functions to start and stop backups.
@@ -396,17 +486,18 @@ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
  */
 typedef struct XLogCtlInsert
 {
-	slock_t		insertpos_lck;	/* protects CurrBytePos and PrevBytePos */
-
 	/*
 	 * CurrBytePos is the end of reserved WAL. The next record will be
-	 * inserted at that position. PrevBytePos is the start position of the
-	 * previously inserted (or rather, reserved) record - it is copied to the
-	 * prev-link of the next record. These are stored as "usable byte
-	 * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).
+	 * inserted at that position.
+	 *
+	 * The start position of the previously inserted (or rather, reserved)
+	 * record (it is copied to the prev-link of the next record) will be
+	 * stored in PrevLinksHash.
+	 *
+	 * These are stored as "usable byte positions" rather than XLogRecPtrs
+	 * (see XLogBytePosToRecPtr()).
 	 */
-	uint64		CurrBytePos;
-	uint64		PrevBytePos;
+	pg_atomic_uint64 CurrBytePos;
 
 	/*
 	 * Make sure the above heavily-contended spinlock and byte positions are
@@ -443,8 +534,37 @@ typedef struct XLogCtlInsert
 	 * WAL insertion locks.
 	 */
 	WALInsertLockPadded *WALInsertLocks;
+
+	/*
+	 * PrevLinksHash is a lock-free hash table based on the Cuckoo algorithm.
+	 *
+	 * With PREV_LINKS_HASH_STRATEGY == 1 it is mostly 4-way: for every
+	 * element two positions h1 and h2 are computed, and the neighbours h1+1
+	 * and h2+2 are used as well. This way, even on collision we have at
+	 * least 3 distinct positions, which gives us a ~75% fill rate without
+	 * unsolvable cycles (per Cuckoo hashing theory). But the chosen slots
+	 * may land in 4 distinct cache lines.
+	 *
+	 * With PREV_LINKS_HASH_STRATEGY == 3 it takes two buckets of 4 elements
+	 * each - 8 positions in total, but guaranteed to be in two cache lines.
+	 * It provides a very high fill rate - up to 90%.
+	 *
+	 * With PREV_LINKS_HASH_STRATEGY == 2 it takes only one bucket with 4
+	 * elements. Strictly speaking this is not Cuckoo hashing, but it should
+	 * work for our case.
+	 *
+	 * Crucially, we rely on the fact that elements are deleted at the same
+	 * rate as they are added, so even unsolvable cycles are soon broken by
+	 * concurrent deletions.
+	 */
+	WALPrevPosLink *PrevLinksHash;
+
 } XLogCtlInsert;
 
+StaticAssertDecl(offsetof(XLogCtlInsert, RedoRecPtr) / PG_CACHE_LINE_SIZE !=
+				 offsetof(XLogCtlInsert, CurrBytePos) / PG_CACHE_LINE_SIZE,
+				 "offset ok");
+
 /*
  * Total shared-memory state for XLOG.
  */
@@ -568,6 +688,9 @@ static XLogCtlData *XLogCtl = NULL;
 /* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
 static WALInsertLockPadded *WALInsertLocks = NULL;
 
+/* same for XLogCtl->Insert.PrevLinksHash */
+static WALPrevPosLink *PrevLinksHash = NULL;
+
 /*
  * We maintain an image of pg_control in shared memory.
  */
@@ -700,6 +823,19 @@ static void CopyXLogRecordToWAL(int write_len, bool isLogSwitch,
 								XLogRecData *rdata,
 								XLogRecPtr StartPos, XLogRecPtr EndPos,
 								TimeLineID tli);
+
+static void WALPrevPosLinkValCompose(WALPrevPosLinkVal *val, XLogRecPtr StartPos, XLogRecPtr PrevPos);
+static void WALPrevPosLinkValGetPrev(WALPrevPosLinkVal val, XLogRecPtr *PrevPos);
+static void CalcCuckooPositions(WALPrevPosLinkVal linkval, struct WALPrevLinksLookups *pos);
+
+static bool WALPrevPosLinkInsert(WALPrevPosLink *link, WALPrevPosLinkVal val);
+static bool WALPrevPosLinkConsume(WALPrevPosLink *link, WALPrevPosLinkVal *val);
+static bool WALPrevPosLinkSwap(WALPrevPosLink *link, WALPrevPosLinkVal *val);
+static void LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos,
+							   XLogRecPtr *PrevPtr);
+static void LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec);
+static XLogRecPtr ReadInsertCurrBytePos(void);
+
 static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
@@ -1089,6 +1225,341 @@ XLogInsertRecord(XLogRecData *rdata,
 	return EndPos;
 }
 
+static pg_attribute_always_inline void
+WALPrevPosLinkValCompose(WALPrevPosLinkVal *val, XLogRecPtr StartPos, XLogRecPtr PrevPos)
+{
+#if WAL_LINK_64
+	val->CurrPos = StartPos;
+	val->PrevPos = PrevPos;
+#else
+	val->CurrPosHigh = StartPos >> 32;
+	val->CurrPosId = StartPos ^ val->CurrPosHigh;
+	val->PrevSize = StartPos - PrevPos;
+#endif
+}
+
+static pg_attribute_always_inline void
+WALPrevPosLinkValGetPrev(WALPrevPosLinkVal val, XLogRecPtr *PrevPos)
+{
+#if WAL_LINK_64
+	*PrevPos = val.PrevPos;
+#else
+	XLogRecPtr	StartPos = val.CurrPosHigh;
+
+	StartPos ^= (StartPos << 32) | val.CurrPosId;
+	*PrevPos = StartPos - val.PrevSize;
+#endif
+}
+
+static pg_attribute_always_inline void
+CalcCuckooPositions(WALPrevPosLinkVal linkval, struct WALPrevLinksLookups *pos)
+{
+	uint32		hash;
+#if PREV_LINKS_HASH_STRATEGY == 3
+	uint32		offset;
+#endif
+
+
+#if WAL_LINK_64
+	hash = murmurhash32(linkval.CurrPos ^ (linkval.CurrPos >> 32));
+#else
+	hash = murmurhash32(linkval.CurrPosId);
+#endif
+
+#if PREV_LINKS_HASH_STRATEGY == 1
+	pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
+	pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
+	pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
+#else
+	pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
+	pos->pos[1] = pos->pos[0] ^ 1;
+	pos->pos[2] = pos->pos[0] ^ 2;
+	pos->pos[3] = pos->pos[0] ^ 3;
+#if PREV_LINKS_HASH_STRATEGY == 3
+	/* use multiplication to compute 0 <= offset < PREV_LINKS_HASH_CAPA-4 */
+	offset = (hash / PREV_LINKS_HASH_CAPA) * (PREV_LINKS_HASH_CAPA - 4);
+	offset /= UINT32_MAX / PREV_LINKS_HASH_CAPA + 1;
+	/* add start of next bucket */
+	offset += (pos->pos[0] | 3) + 1;
+	/* get a position strictly in the other bucket */
+	pos->pos[4] = offset % PREV_LINKS_HASH_CAPA;
+	pos->pos[5] = pos->pos[4] ^ 1;
+	pos->pos[6] = pos->pos[4] ^ 2;
+	pos->pos[7] = pos->pos[4] ^ 3;
+#endif
+#endif
+}
+
+/*
+ * Attempt to write into empty link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkInsert(WALPrevPosLink *link, WALPrevPosLinkVal val)
+{
+#if WAL_LINK_64
+	uint64		empty = WAL_PREV_EMPTY;
+
+	if (pg_atomic_read_u64(&link->PrevPos) != WAL_PREV_EMPTY)
+		return false;
+	if (!pg_atomic_compare_exchange_u64(&link->PrevPos, &empty, val.PrevPos))
+		return false;
+	/* we could ignore concurrent lock of CurrPos */
+	pg_atomic_write_u64(&link->CurrPos, val.CurrPos);
+	return true;
+#else
+	uint32		empty = 0;
+
+	/* first do a cheap read-only check */
+	if (pg_atomic_read_u32(&link->PrevSize) != 0)
+		return false;
+	if (!pg_atomic_compare_exchange_u32(&link->PrevSize, &empty, 1))
+		/* someone else occupied the entry */
+		return false;
+
+	pg_atomic_write_u32(&link->CurrPosId, val.CurrPosId);
+	link->CurrPosHigh = val.CurrPosHigh;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, val.PrevSize);
+	return true;
+#endif
+}
+
+/*
+ * Attempt to consume matched link.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkConsume(WALPrevPosLink *link, WALPrevPosLinkVal *val)
+{
+#if WAL_LINK_64
+	uint64		oldCurr;
+
+	if (pg_atomic_read_u64(&link->CurrPos) != val->CurrPos)
+		return false;
+	/* lock against concurrent swapper */
+	oldCurr = pg_atomic_fetch_or_u64(&link->CurrPos, 1);
+	if (oldCurr & 1)
+		return false;			/* lock failed */
+	if (oldCurr != val->CurrPos)
+	{
+		/* link was swapped */
+		pg_atomic_write_u64(&link->CurrPos, oldCurr);
+		return false;
+	}
+	val->PrevPos = pg_atomic_read_u64(&link->PrevPos);
+	pg_atomic_write_u64(&link->PrevPos, WAL_PREV_EMPTY);
+
+	/*
+	 * A concurrent inserter may have already reused this link, so we don't
+	 * check the result of compare_exchange.
+	 */
+	oldCurr |= 1;
+	pg_atomic_compare_exchange_u64(&link->CurrPos, &oldCurr, 0);
+	return true;
+#else
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId)
+		return false;
+
+	/* Try lock */
+	val->PrevSize = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (val->PrevSize & 1)
+		/* Lock failed */
+		return false;
+
+	if (pg_atomic_read_u32(&link->CurrPosId) != val->CurrPosId ||
+		link->CurrPosHigh != val->CurrPosHigh)
+	{
+		/* unlock with old value */
+		pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+		return false;
+	}
+
+	pg_atomic_write_u32(&link->CurrPosId, 0);
+	link->CurrPosHigh = 0;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, 0);
+	return true;
+#endif
+}
+
+/*
+ * Attempt to swap the entry: remember the existing link and write ours.
+ * We may happen to consume an empty entry; the caller detects that by
+ * checking the remembered value.
+ */
+static pg_attribute_always_inline bool
+WALPrevPosLinkSwap(WALPrevPosLink *link, WALPrevPosLinkVal *val)
+{
+#if WAL_LINK_64
+	uint64		oldCurr;
+	uint64		oldPrev;
+
+	/* lock against concurrent swapper or consumer */
+	oldCurr = pg_atomic_fetch_or_u64(&link->CurrPos, 1);
+	if (oldCurr & 1)
+		return false;			/* lock failed */
+	if (oldCurr == 0)
+	{
+		/* link was empty */
+		oldPrev = WAL_PREV_EMPTY;
+		/* but a concurrent inserter may have raced us */
+		if (!pg_atomic_compare_exchange_u64(&link->PrevPos, &oldPrev, val->PrevPos))
+			return false;		/* concurrent inserter won. It will overwrite
+								 * CurrPos */
+		/* this write acts as unlock */
+		pg_atomic_write_u64(&link->CurrPos, val->CurrPos);
+		val->CurrPos = 0;
+		val->PrevPos = WAL_PREV_EMPTY;
+		return true;
+	}
+	oldPrev = pg_atomic_read_u64(&link->PrevPos);
+	pg_atomic_write_u64(&link->PrevPos, val->PrevPos);
+	pg_write_barrier();
+	/* write acts as unlock */
+	pg_atomic_write_u64(&link->CurrPos, val->CurrPos);
+	val->CurrPos = oldCurr;
+	val->PrevPos = oldPrev;
+	return true;
+#else
+	uint32		oldPrev;
+	uint32		oldCurId;
+	uint32		oldCurHigh;
+
+	/* Attempt to lock entry against concurrent consumer or swapper */
+	oldPrev = pg_atomic_fetch_or_u32(&link->PrevSize, 1);
+	if (oldPrev & 1)
+		/* Lock failed */
+		return false;
+
+	oldCurId = pg_atomic_read_u32(&link->CurrPosId);
+	oldCurHigh = link->CurrPosHigh;
+	pg_atomic_write_u32(&link->CurrPosId, val->CurrPosId);
+	link->CurrPosHigh = val->CurrPosHigh;
+	pg_write_barrier();
+	/* This write acts as unlock as well. */
+	pg_atomic_write_u32(&link->PrevSize, val->PrevSize);
+
+	val->CurrPosId = oldCurId;
+	val->CurrPosHigh = oldCurHigh;
+	val->PrevSize = oldPrev;
+	return true;
+#endif
+}
+
+/*
+ * Write new link (EndPos, StartPos) and find PrevPtr for StartPos.
+ *
+ * Links are stored in a lock-free Cuckoo-based hash table.
+ * We use mostly-4-way Cuckoo hashing, which provides a high fill rate without
+ * hard cycle collisions. We also rely on concurrent consumers of existing
+ * entries, so cycles are broken in the meantime.
+ *
+ * Cuckoo hashing relies on re-insertion for balancing, so we occasionally
+ * swap an entry out and try to insert the swapped-out entry instead of ours.
+ */
+static void
+LinkAndFindPrevPos(XLogRecPtr StartPos, XLogRecPtr EndPos, XLogRecPtr *PrevPtr)
+{
+	SpinDelayStatus spin_stat;
+	WALPrevPosLinkVal lookup;
+	WALPrevPosLinkVal insert;
+	struct WALPrevLinksLookups lookup_pos;
+	struct WALPrevLinksLookups insert_pos;
+	uint32		i;
+	uint32		rand = 0;
+	bool		inserted = false;
+	bool		found = false;
+
+	/* pass StartPos a second time to set PrevSize = 0 */
+	WALPrevPosLinkValCompose(&lookup, StartPos, StartPos);
+	WALPrevPosLinkValCompose(&insert, EndPos, StartPos);
+
+	CalcCuckooPositions(lookup, &lookup_pos);
+	CalcCuckooPositions(insert, &insert_pos);
+
+	init_local_spin_delay(&spin_stat);
+
+	while (!inserted || !found)
+	{
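+		/*
+		 * First look for the entry whose CurrPos matches our StartPos: it
+		 * carries the prev-link left for us by the previous inserter.
+		 */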
+		for (i = 0; !found && i < PREV_LINKS_LOOKUPS; i++)
+			found = WALPrevPosLinkConsume(&PrevLinksHash[lookup_pos.pos[i]], &lookup);
+
+		if (inserted)
+		{
+			/*
+			 * We may sleep only after we have inserted our value, since
+			 * another backend may be waiting for it.
+			 */
+			perform_spin_delay(&spin_stat);
+			goto next;
+		}
+
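+		/*
+		 * Publish our own (EndPos, StartPos) link so that the next inserter
+		 * can find its prev-link.
+		 */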
+		for (i = 0; !inserted && i < PREV_LINKS_LOOKUPS; i++)
+			inserted = WALPrevPosLinkInsert(&PrevLinksHash[insert_pos.pos[i]], insert);
+
+		if (inserted)
+			goto next;
+
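+		/*
+		 * Neither found nor inserted yet: occasionally displace an occupied
+		 * slot, cuckoo-style, and retry with the swapped-out entry.
+		 */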
+		rand = pg_prng_uint32(&pg_global_prng_state);
+		if (rand % SWAP_ONCE_IN != 0)
+			goto next;
+
+		i = rand / SWAP_ONCE_IN % PREV_LINKS_LOOKUPS;
+		if (!WALPrevPosLinkSwap(&PrevLinksHash[insert_pos.pos[i]], &insert))
+			goto next;
+
+		if (WALLinkEmpty(insert))
+			/* Lucky case: the entry turned out to be empty and we inserted into it */
+			inserted = true;
+		else if (WALLinkSamePos(lookup, insert))
+		{
+			/*
+			 * We happened to swap out the very entry we were looking for.
+			 * No need to insert it again.
+			 */
+			inserted = true;
+			Assert(!found);
+			found = true;
+			WALLinkCopyPrev(lookup, insert);
+			break;
+		}
+		else
+			CalcCuckooPositions(insert, &insert_pos);
+
+next:
+		pg_spin_delay();
+		pg_read_barrier();
+	}
+
+	WALPrevPosLinkValGetPrev(lookup, PrevPtr);
+}
+
+static pg_attribute_always_inline void
+LinkStartPrevPos(XLogRecPtr EndOfLog, XLogRecPtr LastRec)
+{
+	WALPrevPosLinkVal insert;
+	struct WALPrevLinksLookups insert_pos;
+
+	WALPrevPosLinkValCompose(&insert, EndOfLog, LastRec);
+	CalcCuckooPositions(insert, &insert_pos);
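+	/*
+	 * There should be no concurrent inserters when this is called (we are
+	 * still starting up), so plain writes into the first candidate slot
+	 * suffice.
+	 */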
+#if WAL_LINK_64
+	pg_atomic_write_u64(&PrevLinksHash[insert_pos.pos[0]].CurrPos, insert.CurrPos);
+	pg_atomic_write_u64(&PrevLinksHash[insert_pos.pos[0]].PrevPos, insert.PrevPos);
+#else
+	pg_atomic_write_u32(&PrevLinksHash[insert_pos.pos[0]].CurrPosId, insert.CurrPosId);
+	PrevLinksHash[insert_pos.pos[0]].CurrPosHigh = insert.CurrPosHigh;
+	pg_atomic_write_u32(&PrevLinksHash[insert_pos.pos[0]].PrevSize, insert.PrevSize);
+#endif
+}
+
+static pg_attribute_always_inline XLogRecPtr
+ReadInsertCurrBytePos(void)
+{
+	return pg_atomic_read_u64(&XLogCtl->Insert.CurrBytePos);
+}
+
 /*
  * Reserves the right amount of space for a record of given size from the WAL.
  * *StartPos is set to the beginning of the reserved section, *EndPos to
@@ -1121,25 +1592,9 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	/* All (non xlog-switch) records should contain data. */
 	Assert(size > SizeOfXLogRecord);
 
-	/*
-	 * The duration the spinlock needs to be held is minimized by minimizing
-	 * the calculations that have to be done while holding the lock. The
-	 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
-	 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
-	 * page headers. The mapping between "usable" byte positions and physical
-	 * positions (XLogRecPtrs) can be done outside the locked region, and
-	 * because the usable byte position doesn't include any headers, reserving
-	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
-	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
-
-	startbytepos = Insert->CurrBytePos;
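+	/*
+	 * CurrBytePos counts only "usable" bytes (it excludes page headers), so
+	 * the reservation itself is a single atomic fetch-add.  The prev-link is
+	 * then obtained from the lock-free hash table rather than from a shared
+	 * PrevBytePos field, and the mapping to physical XLogRecPtrs happens
+	 * below, outside any lock.
+	 */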
+	startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
-
-	SpinLockRelease(&Insert->insertpos_lck);
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
@@ -1175,26 +1630,23 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 	uint32		segleft;
 
 	/*
-	 * These calculations are a bit heavy-weight to be done while holding a
-	 * spinlock, but since we're holding all the WAL insertion locks, there
-	 * are no other inserters competing for it. GetXLogInsertRecPtr() does
-	 * compete for it, but that's not called very frequently.
+	 * This function runs while holding all WAL insertion locks exclusively,
+	 * so there is no contention on CurrBytePos, but we still use a CAS loop
+	 * for uniformity.
+	 *
+	 * We may get rid of the exclusive locking in the future.
 	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
-
-	startbytepos = Insert->CurrBytePos;
+repeat:
+	startbytepos = pg_atomic_read_u64(&Insert->CurrBytePos);
 
 	ptr = XLogBytePosToEndRecPtr(startbytepos);
 	if (XLogSegmentOffset(ptr, wal_segment_size) == 0)
 	{
-		SpinLockRelease(&Insert->insertpos_lck);
 		*EndPos = *StartPos = ptr;
 		return false;
 	}
 
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
 
@@ -1205,10 +1657,18 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 		*EndPos += segleft;
 		endbytepos = XLogRecPtrToBytePos(*EndPos);
 	}
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
-
-	SpinLockRelease(&Insert->insertpos_lck);
+	if (!pg_atomic_compare_exchange_u64(&Insert->CurrBytePos,
+										&startbytepos,
+										endbytepos))
+	{
+		/*
+		 * Don't use a spin delay here: perform_spin_delay mainly targets
+		 * single-core contention, and on a single core we will succeed on
+		 * the next attempt anyway.
+		 */
+		goto repeat;
+	}
+	LinkAndFindPrevPos(startbytepos, endbytepos, &prevbytepos);
 
 	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
 
@@ -1510,7 +1970,6 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	XLogRecPtr	inserted;
 	XLogRecPtr	reservedUpto;
 	XLogRecPtr	finishedUpto;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	int			i;
 
 	if (MyProc == NULL)
@@ -1525,9 +1984,7 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 		return inserted;
 
 	/* Read the current insert position */
-	SpinLockAcquire(&Insert->insertpos_lck);
-	bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
+	bytepos = ReadInsertCurrBytePos();
 	reservedUpto = XLogBytePosToEndRecPtr(bytepos);
 
 	/*
@@ -4940,6 +5397,8 @@ XLOGShmemSize(void)
 
 	/* WAL insertion locks, plus alignment */
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
+	/* PrevLinksHash; reuses the alignment of the WAL insertion locks. */
+	size = add_size(size, mul_size(sizeof(WALPrevPosLink), PREV_LINKS_HASH_CAPA));
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(pg_atomic_uint64), XLOGbuffers));
 	/* extra alignment padding for XLOG I/O buffers */
@@ -4994,8 +5453,9 @@ XLOGShmemInit(void)
 		/* both should be present or neither */
 		Assert(foundCFile && foundXLog);
 
-		/* Initialize local copy of WALInsertLocks */
+		/* Initialize local copies of WALInsertLocks and PrevLinksHash */
 		WALInsertLocks = XLogCtl->Insert.WALInsertLocks;
+		PrevLinksHash = XLogCtl->Insert.PrevLinksHash;
 
 		if (localControlFile)
 			pfree(localControlFile);
@@ -5041,6 +5501,9 @@ XLOGShmemInit(void)
 		WALInsertLocks[i].l.lastImportantAt = InvalidXLogRecPtr;
 	}
 
+	PrevLinksHash = XLogCtl->Insert.PrevLinksHash = (WALPrevPosLink *) allocptr;
+	allocptr += sizeof(WALPrevPosLink) * PREV_LINKS_HASH_CAPA;
+
 	/*
 	 * Align the start of the page buffers to a full xlog block size boundary.
 	 * This simplifies some calculations in XLOG insertion. It is also
@@ -5059,12 +5522,24 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
-	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
 	pg_atomic_init_u64(&XLogCtl->logInsertResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
+
+	pg_atomic_init_u64(&XLogCtl->Insert.CurrBytePos, 0);
+
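+	/* Mark every slot of the prev-link hash table as empty. */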
+	for (i = 0; i < PREV_LINKS_HASH_CAPA; i++)
+	{
+#if WAL_LINK_64
+		pg_atomic_init_u64(&PrevLinksHash[i].CurrPos, 0);
+		pg_atomic_init_u64(&PrevLinksHash[i].PrevPos, WAL_PREV_EMPTY);
+#else
+		pg_atomic_init_u32(&PrevLinksHash[i].CurrPosId, 0);
+		pg_atomic_init_u32(&PrevLinksHash[i].PrevSize, 0);
+#endif
+	}
 }
 
 /*
@@ -6064,8 +6539,13 @@ StartupXLOG(void)
 	 * previous incarnation.
 	 */
 	Insert = &XLogCtl->Insert;
-	Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
-	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+	{
+		XLogRecPtr	endOfLog = XLogRecPtrToBytePos(EndOfLog);
+		XLogRecPtr	lastRec = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
+
+		pg_atomic_write_u64(&Insert->CurrBytePos, endOfLog);
+		LinkStartPrevPos(endOfLog, lastRec);
+	}
 
 	/*
 	 * Tricky point here: lastPage contains the *last* block that the LastRec
@@ -7057,7 +7537,7 @@ CreateCheckPoint(int flags)
 
 	if (shutdown)
 	{
-		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 
 		/*
 		 * Compute new REDO record ptr = location of next XLOG record.
@@ -9478,14 +9958,7 @@ register_persistent_abort_backup_handler(void)
 XLogRecPtr
 GetXLogInsertRecPtr(void)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	uint64		current_bytepos;
-
-	SpinLockAcquire(&Insert->insertpos_lck);
-	current_bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
-
-	return XLogBytePosToRecPtr(current_bytepos);
+	return XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 }
 
 /*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 43fe3bcd593..8dda440df50 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3240,6 +3240,8 @@ WALAvailability
 WALInsertLock
 WALInsertLockPadded
 WALOpenSegment
+WALPrevPosLink
+WALPrevPosLinkVal
 WALReadError
 WALSegmentCloseCB
 WALSegmentContext
-- 
2.43.0