An attempt to reduce WALWriteLock contention

Started by Kuntal Ghosh · about 9 years ago · 9 messages
#1 Kuntal Ghosh
kuntalghosh.2007@gmail.com
1 attachment(s)

Hello all,

In a recent post by Robert, "[HACKERS] pgbench vs. wait events" [1], wait
events for different LWLocks were analyzed. The results clearly indicate a
significant lock contention overhead on WAL write locks. To get an idea of
this overhead, we did the following two tests.

1. Hacked the code to comment out the WAL write and flush calls to see the
overhead of WAL writing. The TPS for read-write pgbench tests at scale
factor 300 with 64 clients increased from 27871 to 45068.

2. Hacked the code to comment out the WAL flush calls to see the overhead of
WAL flushing (fsync=off). The TPS for read-write pgbench tests at scale
factor 300 with 64 clients increased from 27871 to 41835.

All the tests were performed for 15 minutes with the following Postgres
configuration:
max_wal_size: 40GB
checkpoint_timeout: 15mins
maintenance_work_mem: 4GB
checkpoint_completion_target: 0.9
shared_buffers: 8GB
(Other settings have default values)

From the above experiments, it is clear that flushing is the main cost in
WAL writing. This is no surprise, but the data above quantifies the exact
overhead of the flush. Robert and Amit suggested (in offline discussions)
using a separate WALFlushLock to flush the WAL data. The idea is to take the
WAL flush calls out of WAL Write Lock and introduce a new lock (WAL Flush
Lock) to flush the data. This should allow simultaneous OS writes while an
fsync is in progress. LWLockAcquireOrWait is used for the newly introduced
WAL Flush Lock to accumulate flush calls. We did pgbench read/write (s.f.
300) tests with the above configuration for various client counts. But we
didn't see any performance improvement; rather, it decreased by 10%-12%.
Hence, to measure the wait events, we performed a 30-minute run with 64
clients.

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid();
\watch 0.5
HEAD
------------------------
48642 LWLockNamed | WALWriteLock

With Patch
----------------------------------
31889 LWLockNamed | WALFlushLock
25212 LWLockNamed | WALWriteLock

The contention on WAL Write Lock was reduced, but together with WAL Flush
Lock, the total contention increased. We also measured the number of times
fsync() and write() were called for a 10-minute pgbench read/write test with
16 clients. We noticed a huge increase in write() system calls, and this
happens precisely because we've reduced the contention on WAL Write Lock.

Due to the reduced contention on WAL Write Lock, a lot of backends issue
small OS writes, sometimes to the same 8KB page, i.e., write calls are not
properly accumulated. For example,
backend 1 - 1 KB write() - 15-20 micro secs
backend 2 - 1 KB write() - 15-20 micro secs
backend 3 - 1 KB write() - 15-20 micro secs
backend 4 - 1 KB write() - 15-20 micro secs
But if we accumulate these 4 requests, 4KB can be written in 50-60 micro
secs. Apart from that, we are also paying for lock acquire and lock release
for each OS write and lseek(). For the same reason, while an fsync is in
progress, we are not able to accumulate sufficient data for the next fsync.
This also increases the contention on WAL Flush Lock. So, we tried adding a
delay (pg_usleep) before the flush/write to accumulate data. But this
severely increased the contention on WAL Flush Lock.

To reduce the contention on WAL Write Lock further, Amit suggested the
following change on top of the existing patch:
Backend as Write Leader:
Except for one proc, all other procs wait for their write location to be
written into the OS buffer. Each proc advertises its write location and
waits on its semaphore until its write has been completed. Only the leader
competes for WALWriteLock. After the data is written, the leader wakes all
the procs for which it has written WAL, and once done with waking them it
releases the WALWriteLock. Ashutosh and Amit have helped a lot with the
implementation of the above idea. Even with this idea, we didn't see any
noticeable performance improvement in synchronous_commit=on mode; however,
there was no regression. Again, to measure the wait events, we performed a
30-minute run with 64 clients (pgbench r/w test with s.f. 300).

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid();
\watch 0.5
HEAD
------------------------
48642 LWLockNamed | WALWriteLock

With Patch
----------------------------------
38952 LWLockNamed | WALFlushLock
1679 LWLockNamed | WALWriteLock

We reduced the contention on WAL Write Lock. The reason is that only the
group leader competes for the write lock on behalf of a group of procs.
Still, the number of small write requests is not reduced.

Finally, we performed some tests with synchronous_commit=off and data that
doesn't fit in shared buffers. This should accumulate the data properly for
writes without waiting on locks or semaphores. Besides, write and fsync can
be done simultaneously. The next results are for various scale factors and
shared buffer sizes (please see below for the system configuration):

./pgbench -c $threads -j $threads -T 900 -M prepared postgres
non default param:
Scale Factor=1000
shared_buffers=10GB
max_connections=200

threads HEAD PATCH %diff
48 18585 18618 +0.1
64 19631 19735 +0.5
128 19332 20556 +6.3

./pgbench -c $threads -j $threads -T 900 -M prepared postgres
non default param:
Scale Factor=1000
shared_buffers=14GB
max_connections=200

threads HEAD PATCH %diff
48 42156 47398 +12.4
64 41737 45288 +8.36
128 37983 47443 +24.9

./pgbench -c $threads -j $threads -T 900 -M prepared postgres
non default param:
Scale Factor=300
shared_buffers=4GB
max_connections=200

threads HEAD PATCH %diff
48 48151 48665 +1.06
64 52651 52789 +0.2
128 56656 60691 +7.1

We noticed some good improvement when most of the data fits in shared
buffers. Apart from that, the improvements are not significant. This may be
due to high I/O for buffer evictions with smaller shared buffers.

In conclusion, we tried to take the flush calls out of WAL Write Lock so
that we can allow simultaneous OS writes while an fsync is in progress. For
synchronous_commit=off, we improved the performance significantly. In other
cases, the reason may be that we are not accumulating write calls properly
and are thus issuing a lot of small write requests. Another possibility
could be the overhead of adding an extra lock.

Thanks to Amit Kapila, Ashutosh Sharma and Robert Haas for helping me
throughout the process with their valuable inputs.

I've attached the prototype patch (PFA). Any suggestions or comments would
be really helpful.

System Configuration:

Model name: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU(s): 56
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Kernel: 3.10.0-327.36.1.el7.x86_64
pg_wal on /mnt/ssd type ext4 (rw,relatime,data=ordered)

[1]: [HACKERS] pgbench vs. wait events
/messages/by-id/CAMkU=1yVzuJA7sTW-3Ddyn9QYqAbAYG61OWwsfP-WU8f6KUGvg@mail.gmail.com

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

Attachments:

reduce_walwritelock_contention.patch (application/x-download)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c1b9a97..fc79415 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -73,6 +73,7 @@
 #include "utils/snapmgr.h"
 #include "utils/timestamp.h"
 #include "pg_trace.h"
+#include "access/rmgr.h"
 
 extern uint32 bootstrap_data_checksum_version;
 
@@ -808,7 +809,8 @@ static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
 static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
-static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
+static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible, PGPROC_LIST *wake_pendingWriteWALElem);
+static void XLogFsync(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, XLogSegNo max_segno,
 					   bool use_lock);
@@ -1917,7 +1919,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 
 				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 
-				LogwrtResult = XLogCtl->LogwrtResult;
+				LogwrtResult.Write = XLogCtl->LogwrtResult.Write;
 				if (LogwrtResult.Write >= OldPageRqstPtr)
 				{
 					/* OK, someone wrote it already */
@@ -1929,8 +1931,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
 					WriteRqst.Write = OldPageRqstPtr;
 					WriteRqst.Flush = 0;
-					XLogWrite(WriteRqst, false);
-					LWLockRelease(WALWriteLock);
+					XLogWrite(WriteRqst, false, NULL);
 					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 				}
 				/* Re-acquire WALBufMappingLock and retry */
@@ -2152,7 +2153,7 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
  * write.
  */
 static void
-XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
+XLogWrite(XLogwrtRqst WriteRqst, bool flexible, PGPROC_LIST *wake_pendingWriteWALElem)
 {
 	bool		ispartialpage;
 	bool		last_iteration;
@@ -2162,6 +2163,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	int			npages;
 	int			startidx;
 	uint32		startoffset;
+	long long	start;
+	bool		moveToNextBuf;
+	PGPROC      *proc_to_clear;
+
 
 	/* We should always be inside a critical section here */
 	Assert(CritSectionCount > 0);
@@ -2169,7 +2174,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	/*
 	 * Update local LogwrtResult (caller probably did this already, but...)
 	 */
-	LogwrtResult = XLogCtl->LogwrtResult;
+	LogwrtResult.Write = XLogCtl->LogwrtResult.Write;
 
 	/*
 	 * Since successive pages in the xlog cache are consecutively allocated,
@@ -2183,6 +2188,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	npages = 0;
 	startidx = 0;
 	startoffset = 0;
+	moveToNextBuf = true;
 
 	/*
 	 * Within the loop, curridx is the cache block index of the page to
@@ -2193,18 +2199,35 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 
 	while (LogwrtResult.Write < WriteRqst.Write)
 	{
+
+		XLogRecPtr  EndPtr;
+
+		if (!moveToNextBuf)
+		{
+			/*
+			 * Within the loop, curridx is the cache block index of the page to
+			 * consider writing.  Begin at the buffer containing the next unwritten
+			 * page, or last partially written page.
+			 */
+			curridx = XLogRecPtrToBufIdx(LogwrtResult.Write);
+			moveToNextBuf = true;
+		}
+
 		/*
 		 * Make sure we're not ahead of the insert process.  This could happen
 		 * if we're passed a bogus WriteRqst.Write that is past the end of the
 		 * last page that's been initialized by AdvanceXLInsertBuffer.
 		 */
-		XLogRecPtr	EndPtr = XLogCtl->xlblocks[curridx];
+		EndPtr = XLogCtl->xlblocks[curridx];
 
 		if (LogwrtResult.Write >= EndPtr)
-			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
+		{
+			elog(LOG, "xlog write request %X/%X is past end of log %X/%X",
 				 (uint32) (LogwrtResult.Write >> 32),
 				 (uint32) LogwrtResult.Write,
 				 (uint32) (EndPtr >> 32), (uint32) EndPtr);
+			break;
+		}
 
 		/* Advance LogwrtResult.Write to end of current buffer page */
 		LogwrtResult.Write = EndPtr;
@@ -2318,13 +2341,47 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			 */
 			if (finishing_seg)
 			{
+				/*
+				 * Update shared memory status to indicate write
+				 * progress as we are releasing WALWriteLock.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->LogwrtResult.Write = LogwrtResult.Write;
+				if (XLogCtl->LogwrtRqst.Write < LogwrtResult.Write)
+					XLogCtl->LogwrtRqst.Write = LogwrtResult.Write;
+				SpinLockRelease(&XLogCtl->info_lck);
+
+				LWLockRelease(WALWriteLock);
+
+				while (wake_pendingWriteWALElem)
+				{
+					proc_to_clear = (PGPROC *) (((char *) wake_pendingWriteWALElem) -
+												offsetof(PGPROC, pendingWriteWALLinks));
+
+					wake_pendingWriteWALElem = wake_pendingWriteWALElem->next;
+
+					/* Mark that WAL has been written for this proc */
+					proc_to_clear->writeWAL = false;
+
+					pg_write_barrier();
+
+					if (proc_to_clear != MyProc)
+						PGSemaphoreUnlock(&proc_to_clear->sem);
+				}
+
+
+				LWLockAcquire(WALFlushLock, LW_EXCLUSIVE);
+
+				/*
+				 * we want to ensure that current segment is completely
+				 * flushed.
+				 */
 				issue_xlog_fsync(openLogFile, openLogSegNo);
+				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
 
 				/* signal that we need to wakeup walsenders later */
 				WalSndWakeupRequest();
 
-				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
-
 				if (XLogArchivingActive())
 					XLogArchiveNotifySeg(openLogSegNo);
 
@@ -2343,6 +2400,30 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 					if (XLogCheckpointNeeded(openLogSegNo))
 						RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
 				}
+				/*
+				 * Update shared memory status to indicate flush
+				 * progress as we are releasing WALFlushLock.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				if (XLogCtl->LogwrtResult.Flush < LogwrtResult.Flush)
+					XLogCtl->LogwrtResult.Flush = LogwrtResult.Flush;
+				if (XLogCtl->LogwrtRqst.Flush < LogwrtResult.Flush)
+					XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush;
+				LogwrtResult.Flush = XLogCtl->LogwrtResult.Flush;
+				SpinLockRelease(&XLogCtl->info_lck);
+
+				LWLockRelease(WALFlushLock);
+
+				/*
+				 * Reacquire the WALWriteLock and get the Write progress.
+				 * We don't need spinlocks since, we update write progress
+				 * only while holding WALWriteLock.
+				 */
+				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+				LogwrtResult.Write = XLogCtl->LogwrtResult.Write;
+
+				/* Next cache block may have been written already. */
+				moveToNextBuf = false;
 			}
 		}
 
@@ -2362,6 +2443,40 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	Assert(npages == 0);
 
 	/*
+	 * Update shared memory status to indicate write
+	 * progress as we are releasing WALWriteLock. Also,
+	 * update local copy of LogwrtResult since we got a chance.
+	 */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	if (XLogCtl->LogwrtResult.Write < LogwrtResult.Write)
+		XLogCtl->LogwrtResult.Write = LogwrtResult.Write;
+	if (XLogCtl->LogwrtRqst.Write < LogwrtResult.Write)
+		XLogCtl->LogwrtRqst.Write = LogwrtResult.Write;
+	LogwrtResult = XLogCtl->LogwrtResult;
+	SpinLockRelease(&XLogCtl->info_lck);
+	LWLockRelease(WALWriteLock);
+
+	/*
+	 * Now wake up all the procs waiting for their writes to be
+	 * completed.
+	 */
+	while (wake_pendingWriteWALElem)
+	{
+		proc_to_clear = (PGPROC *) (((char *) wake_pendingWriteWALElem) -
+									offsetof(PGPROC, pendingWriteWALLinks));
+
+		wake_pendingWriteWALElem = wake_pendingWriteWALElem->next;
+
+		/* Mark that WAL has been written for this proc */
+		proc_to_clear->writeWAL = false;
+
+		pg_write_barrier();
+
+		if (proc_to_clear != MyProc)
+			PGSemaphoreUnlock(&proc_to_clear->sem);
+	}
+
+	/*
 	 * If asked to flush, do so
 	 */
 	if (LogwrtResult.Flush < WriteRqst.Flush &&
@@ -2376,40 +2491,71 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		if (sync_method != SYNC_METHOD_OPEN &&
 			sync_method != SYNC_METHOD_OPEN_DSYNC)
 		{
-			if (openLogFile >= 0 &&
-				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
-				XLogFileClose();
-			if (openLogFile < 0)
+			for(;;)
 			{
-				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo);
-				openLogFile = XLogFileOpen(openLogSegNo);
-				openLogOff = 0;
-			}
+				SpinLockAcquire(&XLogCtl->info_lck);
+				LogwrtResult = XLogCtl->LogwrtResult;
+				SpinLockRelease(&XLogCtl->info_lck);
 
-			issue_xlog_fsync(openLogFile, openLogSegNo);
-		}
+				/* done already? */
+				if (WriteRqst.Flush <= LogwrtResult.Flush ||
+					LogwrtResult.Write <= LogwrtResult.Flush)
+					break;
 
-		/* signal that we need to wakeup walsenders later */
-		WalSndWakeupRequest();
+				if (!LWLockAcquireOrWait(WALFlushLock, LW_EXCLUSIVE))
+				{
+					/*
+					 * The lock is now free, but we didn't acquire it yet. Before we
+					 * do, loop back to check if someone else flushed the record for
+					 * us already.
+					 */
+					continue;
+				}
+				if (enableFsync &&
+					MinimumActiveBackends(CommitSiblings))
+				{
+					//pg_usleep(1);
+				}
+				/* Got the lock; recheck whether request is satisfied */
+				LogwrtResult.Flush = XLogCtl->LogwrtResult.Flush;
 
-		LogwrtResult.Flush = LogwrtResult.Write;
-	}
+				if (WriteRqst.Flush <= LogwrtResult.Flush ||
+					LogwrtResult.Write <= LogwrtResult.Flush)
+				{
+					LWLockRelease(WALFlushLock);
+					break;
+				}
+				if (openLogFile >= 0 &&
+					!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
+				{
+					XLogFileClose();
+				}
+				if (openLogFile < 0)
+				{
+					XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo);
+					openLogFile = XLogFileOpen(openLogSegNo);
+					openLogOff = 0;
+				}
 
-	/*
-	 * Update shared-memory status
-	 *
-	 * We make sure that the shared 'request' values do not fall behind the
-	 * 'result' values.  This is not absolutely essential, but it saves some
-	 * code in a couple of places.
-	 */
-	{
-		SpinLockAcquire(&XLogCtl->info_lck);
-		XLogCtl->LogwrtResult = LogwrtResult;
-		if (XLogCtl->LogwrtRqst.Write < LogwrtResult.Write)
-			XLogCtl->LogwrtRqst.Write = LogwrtResult.Write;
-		if (XLogCtl->LogwrtRqst.Flush < LogwrtResult.Flush)
-			XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush;
-		SpinLockRelease(&XLogCtl->info_lck);
+				issue_xlog_fsync(openLogFile, openLogSegNo);
+
+				LogwrtResult.Flush = LogwrtResult.Write;
+				/*
+				 * Update shared memory status to indicate flush
+				 * progress as we are releasing WALFlushLock.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->LogwrtResult.Flush = LogwrtResult.Flush;
+				if (XLogCtl->LogwrtRqst.Flush < LogwrtResult.Flush)
+					XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush;
+				SpinLockRelease(&XLogCtl->info_lck);
+
+				/* signal that we need to wakeup walsenders later */
+				WalSndWakeupRequest();
+				LWLockRelease(WALFlushLock);
+				break;
+			}
+		}
 	}
 }
 
@@ -2573,6 +2719,15 @@ XLogFlush(XLogRecPtr record)
 {
 	XLogRecPtr	WriteRqstPtr;
 	XLogwrtRqst WriteRqst;
+	XLogRecPtr	insertpos;
+	/* use volatile pointer to prevent code rearrangement */
+	volatile PROC_HDR *procglobal = ProcGlobal;
+	PGPROC_LIST	*nonempty_pending_write_wal_list = NULL;
+	PGPROC_LIST	*pendingWriteWALElem, *wake_pendingWriteWALElem;
+	PGPROC		*proc_to_clear;
+	int			extraWaits = 0;
+	XLogRecPtr	largestWALWritePos = 0;
+	PGPROC		*proc = MyProc;
 
 	/*
 	 * During REDO, we are reading not writing WAL.  Therefore, instead of
@@ -2616,90 +2771,180 @@ XLogFlush(XLogRecPtr record)
 	 * Now wait until we get the write lock, or someone else does the flush
 	 * for us.
 	 */
-	for (;;)
+
+	/* read LogwrtResult and update local state */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	if (WriteRqstPtr < XLogCtl->LogwrtRqst.Write)
+		WriteRqstPtr = XLogCtl->LogwrtRqst.Write;
+	LogwrtResult = XLogCtl->LogwrtResult;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/* done already? */
+	if (record <= LogwrtResult.Flush)
+		goto wal_is_written;
+
+	/*
+	 * Before actually performing the write, wait for all in-flight
+	 * insertions to the pages we're about to write to finish.
+	 */
+	insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr);
+
+	/*
+	 * Except one proc, all other proc's will wait for their write location
+	 * to be written. Each proc will advertise it's write location and add
+	 * itself to the pendingWriteWALList and the first backend that has seen
+	 * the list as empty, will proceed to write the WAL for all the procs
+	 * in pendingWriteWALList (lets call it Write leader); if it is not the
+	 * first one, then just wait on semaphorelock till the wal is written for
+	 * the proc.  The Write leader will acquire WALWriteLock and traverse the
+	 * pendingWriteWALList to find the largest write location upto which it
+	 * needs to write the wal and use the same to write the WAL.  After write,
+	 * it will release the WALWriteLock and wakes all the procs for which it
+	 * has written the WAL.
+	 */
+	proc->writeWAL = true;
+
+	/*
+	 * Add the proc to pending pendingWriteWALList list and advertise
+	 * the insertpos, upto which the WAL needs to be written for this
+	 * proc.
+	 */
+	proc->writePos = insertpos;
+	while (true)
 	{
-		XLogRecPtr	insertpos;
+		proc->pendingWriteWALLinks.next = procglobal->pendingWriteWALList;
+		pg_read_barrier();
+		nonempty_pending_write_wal_list = proc->pendingWriteWALLinks.next;
+		if (pg_atomic_compare_exchange_u64((volatile pg_atomic_uint64*) &procglobal->pendingWriteWALList,
+											(uint64*)&proc->pendingWriteWALLinks.next,
+											(uint64)&proc->pendingWriteWALLinks))
+			break;
+	}
 
-		/* read LogwrtResult and update local state */
-		SpinLockAcquire(&XLogCtl->info_lck);
-		if (WriteRqstPtr < XLogCtl->LogwrtRqst.Write)
-			WriteRqstPtr = XLogCtl->LogwrtRqst.Write;
-		LogwrtResult = XLogCtl->LogwrtResult;
-		SpinLockRelease(&XLogCtl->info_lck);
+	/*
+	 * Only the first process that saw the pending-write-WAL list as
+	 * empty will group-write WAL for the procs on the pending list;
+	 * all other processes wait for their WAL to be written.  Once
+	 * their WAL is written, they try to acquire WALFlushLock to
+	 * flush their data.  Here also, flush calls will be accumulated.
+	 */
+	if (nonempty_pending_write_wal_list)
+	{
+		for (;;)
+		{
+			PGSemaphoreLock(&proc->sem);
+			if (!proc->writeWAL)
+			{
+				/* logic similar to lwlock.c is used for any absorbed wakeups. */
+				while (extraWaits-- > 0)
+					PGSemaphoreUnlock(&proc->sem);
+				/*
+				 * Before exiting from here, read LogwrtResult and update
+				 * local state.  When the backend didn't write it's WAL, the
+				 * local state is not updated which we need to update to ensure
+				 * that WAL has been actually flushed, see the check in end of
+				 * this function.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				LogwrtResult = XLogCtl->LogwrtResult;
+				SpinLockRelease(&XLogCtl->info_lck);
+
+				if (record <= LogwrtResult.Flush)
+					goto wal_is_written;
+
+				while (true)
+				{
+					if (!LWLockAcquireOrWait(WALFlushLock, LW_EXCLUSIVE))
+					{
+						SpinLockAcquire(&XLogCtl->info_lck);
+						LogwrtResult = XLogCtl->LogwrtResult;
+						SpinLockRelease(&XLogCtl->info_lck);
+						if (record <= LogwrtResult.Flush)
+							goto wal_is_written;
+						continue;
+					}
+					if (record <= LogwrtResult.Flush)
+					{
+						LWLockRelease(WALFlushLock);
+						goto wal_is_written;
+					}
+					else
+					{
+						WriteRqst.Flush = record;
+						XLogFsync(WriteRqst, false);
+						goto wal_is_written;
+					}
+				}
+			}
+			extraWaits++;
+		}
+	}
 
-		/* done already? */
-		if (record <= LogwrtResult.Flush)
+	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+
+	while (true)
+	{
+		pendingWriteWALElem = procglobal->pendingWriteWALList;
+		if (pg_atomic_compare_exchange_u64((volatile pg_atomic_uint64*) &procglobal->pendingWriteWALList,
+											(uint64*)&pendingWriteWALElem,
+											(uint64)NULL))
 			break;
+	}
+
+	/* save the list of procs whose wal needs to be written to wake them up. */
+	wake_pendingWriteWALElem = pendingWriteWALElem;
+
+	while (pendingWriteWALElem)
+	{
+		proc_to_clear = (PGPROC *) (((char *) pendingWriteWALElem) -
+									offsetof(PGPROC, pendingWriteWALLinks));
 
 		/*
-		 * Before actually performing the write, wait for all in-flight
-		 * insertions to the pages we're about to write to finish.
+		 * remember the largest wal location upto which we need to
+		 * write.
 		 */
-		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr);
+		if (largestWALWritePos < proc_to_clear->writePos)
+			largestWALWritePos = proc_to_clear->writePos;
 
 		/*
-		 * Try to get the write lock. If we can't get it immediately, wait
-		 * until it's released, and recheck if we still need to do the flush
-		 * or if the backend that held the lock did it for us already. This
-		 * helps to maintain a good rate of group committing when the system
-		 * is bottlenecked by the speed of fsyncing.
+		 * move to next proc in list.
 		 */
-		if (!LWLockAcquireOrWait(WALWriteLock, LW_EXCLUSIVE))
-		{
-			/*
-			 * The lock is now free, but we didn't acquire it yet. Before we
-			 * do, loop back to check if someone else flushed the record for
-			 * us already.
-			 */
-			continue;
-		}
+		pendingWriteWALElem = pendingWriteWALElem->next;
+	}
 
-		/* Got the lock; recheck whether request is satisfied */
-		LogwrtResult = XLogCtl->LogwrtResult;
-		if (record <= LogwrtResult.Flush)
-		{
-			LWLockRelease(WALWriteLock);
-			break;
-		}
+	/* recheck whether request is satisfied, else write the wal. */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	LogwrtResult = XLogCtl->LogwrtResult;
+	SpinLockRelease(&XLogCtl->info_lck);
 
-		/*
-		 * Sleep before flush! By adding a delay here, we may give further
-		 * backends the opportunity to join the backlog of group commit
-		 * followers; this can significantly improve transaction throughput,
-		 * at the risk of increasing transaction latency.
-		 *
-		 * We do not sleep if enableFsync is not turned on, nor if there are
-		 * fewer than CommitSiblings other backends with active transactions.
-		 */
-		if (CommitDelay > 0 && enableFsync &&
-			MinimumActiveBackends(CommitSiblings))
+	if (largestWALWritePos <= LogwrtResult.Flush)
+	{
+		LWLockRelease(WALWriteLock);
+		while (wake_pendingWriteWALElem)
 		{
-			pg_usleep(CommitDelay);
+			proc_to_clear = (PGPROC *) (((char *) wake_pendingWriteWALElem) -
+										offsetof(PGPROC, pendingWriteWALLinks));
 
-			/*
-			 * Re-check how far we can now flush the WAL. It's generally not
-			 * safe to call WaitXLogInsertionsToFinish while holding
-			 * WALWriteLock, because an in-progress insertion might need to
-			 * also grab WALWriteLock to make progress. But we know that all
-			 * the insertions up to insertpos have already finished, because
-			 * that's what the earlier WaitXLogInsertionsToFinish() returned.
-			 * We're only calling it again to allow insertpos to be moved
-			 * further forward, not to actually wait for anyone.
-			 */
-			insertpos = WaitXLogInsertionsToFinish(insertpos);
-		}
+			wake_pendingWriteWALElem = wake_pendingWriteWALElem->next;
 
-		/* try to write/flush later additions to XLOG as well */
-		WriteRqst.Write = insertpos;
-		WriteRqst.Flush = insertpos;
+			/* Mark that WAL has been written for this proc */
+			proc_to_clear->writeWAL = false;
 
-		XLogWrite(WriteRqst, false);
+			pg_write_barrier();
 
-		LWLockRelease(WALWriteLock);
-		/* done */
-		break;
+			if (proc_to_clear != MyProc)
+				PGSemaphoreUnlock(&proc_to_clear->sem);
+		}
+	}
+	else
+	{
+		WriteRqst.Write = largestWALWritePos;
+		WriteRqst.Flush = largestWALWritePos;
+
+		XLogWrite(WriteRqst, false, wake_pendingWriteWALElem);
 	}
 
+wal_is_written:
 	END_CRIT_SECTION();
 
 	/* wake up walsenders now that we've released heavily contended locks */
@@ -2855,13 +3100,16 @@ XLogBackgroundFlush(void)
 	/* now wait for any in-progress insertions to finish and get write lock */
 	WaitXLogInsertionsToFinish(WriteRqst.Write);
 	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
+	SpinLockRelease(&XLogCtl->info_lck);
 	if (WriteRqst.Write > LogwrtResult.Write ||
 		WriteRqst.Flush > LogwrtResult.Flush)
 	{
-		XLogWrite(WriteRqst, flexible);
+		XLogWrite(WriteRqst, flexible, NULL);
 	}
-	LWLockRelease(WALWriteLock);
+	else
+		LWLockRelease(WALWriteLock);
 
 	END_CRIT_SECTION();
 
@@ -11719,3 +11967,78 @@ XLogRequestWalReceiverReply(void)
 {
 	doRequestWalReceiverReply = true;
 }
+
+/*
+ * Similar to XLogWrite but will just fsync the wal pages
+ * in kernel buffers.
+ */
+static void
+XLogFsync(XLogwrtRqst WriteRqst, bool flexible)
+{
+	SpinLockAcquire(&XLogCtl->info_lck);
+	if (XLogCtl->LogwrtResult.Write < LogwrtResult.Write)
+		XLogCtl->LogwrtResult.Write = LogwrtResult.Write;
+	if (XLogCtl->LogwrtRqst.Write < LogwrtResult.Write)
+		XLogCtl->LogwrtRqst.Write = LogwrtResult.Write;
+	LogwrtResult = XLogCtl->LogwrtResult;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	if (LogwrtResult.Flush < WriteRqst.Flush)
+	{
+		/*
+		 * Could get here without iterating above loop, in which case we might
+		 * have no open file or the wrong one.  However, we do not need to
+		 * fsync more than one file.
+		 */
+		if (sync_method != SYNC_METHOD_OPEN &&
+			sync_method != SYNC_METHOD_OPEN_DSYNC)
+		{
+			for(;;)
+			{
+				SpinLockAcquire(&XLogCtl->info_lck);
+				LogwrtResult = XLogCtl->LogwrtResult;
+				SpinLockRelease(&XLogCtl->info_lck);
+
+				/* done already? */
+				if (WriteRqst.Flush <= LogwrtResult.Flush)
+				{
+					LWLockRelease(WALFlushLock);
+					break;
+				}
+
+				if (openLogFile >= 0 &&
+					!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo))
+				{
+					XLogFileClose();
+				}
+
+				if (openLogFile < 0)
+				{
+					XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo);
+					openLogFile = XLogFileOpen(openLogSegNo);
+					openLogOff = 0;
+				}
+
+				issue_xlog_fsync(openLogFile, openLogSegNo);
+
+				LogwrtResult.Flush = WriteRqst.Flush;
+				/*
+				 * Update shared memory status to indicate flush
+				 * progress as we are releasing WALFlushLock.
+				 */
+				SpinLockAcquire(&XLogCtl->info_lck);
+				XLogCtl->LogwrtResult.Flush = LogwrtResult.Flush;
+				if (XLogCtl->LogwrtRqst.Flush < LogwrtResult.Flush)
+					XLogCtl->LogwrtRqst.Flush = LogwrtResult.Flush;
+				SpinLockRelease(&XLogCtl->info_lck);
+
+				/* signal that we need to wakeup walsenders later */
+				WalSndWakeupRequest();
+				LWLockRelease(WALFlushLock);
+				break;
+			}
+		}
+	}
+	else
+		LWLockRelease(WALFlushLock);
+}
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index f8996cd..82fec8a 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -47,3 +47,4 @@ CommitTsLock						39
 ReplicationOriginLock				40
 MultiXactTruncationLock				41
 OldSnapshotTimeMapLock				42
+WALFlushLock						43
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 33e7023..0203b90 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -266,6 +266,8 @@ InitProcGlobal(void)
 		dlist_init(&procs[i].lockGroupMembers);
 	}
 
+	ProcGlobal->pendingWriteWALList = NULL;
+
 	/*
 	 * Save pointers to the blocks of PGPROC structures reserved for auxiliary
 	 * processes and prepared transactions.
@@ -405,6 +407,11 @@ InitProcess(void)
 	/* Initialize wait event information. */
 	MyProc->wait_event_info = 0;
 
+	/* Initialize fields for writing WAL */
+	MyProc->writeWAL = false;
+	MyProc->writePos = 0;
+	MyProc->pendingWriteWALLinks.next = NULL;
+
 	/*
 	 * Acquire ownership of the PGPROC's latch, so that we can use WaitLatch
 	 * on it.  That allows us to repoint the process latch, which so far
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index f576f05..d69cc33 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -66,6 +66,17 @@ struct XidCache
 #define INVALID_PGPROCNO		PG_INT32_MAX
 
 /*
+ * List to link PGPROC structures for the cases when they
+ * can't be linked via SHM_QUEUE like when we want to operate
+ * the list using atomic operations.  Currently this is used
+ * to link PGPROCs waiting for their WAL to be written.
+ */
+typedef struct PGPROC_LIST
+{
+	struct PGPROC_LIST *next;
+} PGPROC_LIST;
+
+/*
  * Each backend has a PGPROC struct in shared memory.  There is also a list of
  * currently-unused PGPROC structs that will be reallocated to new backends.
  *
@@ -141,6 +152,9 @@ struct PGPROC
 	SHM_QUEUE	myProcLocks[NUM_LOCK_PARTITIONS];
 
 	struct XidCache subxids;	/* cache for subtransaction XIDs */
+	bool            writeWAL;
+	XLogRecPtr      writePos;
+	PGPROC_LIST     pendingWriteWALLinks;
 
 	/* Support for group XID clearing. */
 	/* true, if member of ProcArray group waiting for XID clear */
@@ -225,6 +239,8 @@ typedef struct PROC_HDR
 	PGPROC	   *autovacFreeProcs;
 	/* Head of list of bgworker free PGPROC structures */
 	PGPROC	   *bgworkerFreeProcs;
+	/* list of PGPROC structures that need to have their WAL written */
+	PGPROC_LIST *pendingWriteWALList;
 	/* First pgproc waiting for group XID clear */
 	pg_atomic_uint32 procArrayGroupFirst;
 	/* WALWriter process's latch */
#2Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Kuntal Ghosh (#1)
Re: An attempt to reduce WALWriteLock contention

On 12/22/2016 04:00 PM, Kuntal Ghosh wrote:

Hello all,

...

\t
select wait_event_type, wait_event from pg_stat_activity where pid !=
pg_backend_pid();
\watch 0.5
HEAD
------------------------
48642 LWLockNamed | WALWriteLock

With Patch
----------------------------------
31889 LWLockNamed | WALFlushLock
25212 LWLockNamed | WALWriteLock

How do these counts compare to the other wait events? For example
CLogControlLock, which is what Amit's patch [1] is about?

[1] /messages/by-id/84c22fbb-b9c4-a02f-384b-b4feb2c67193@2ndquadrant.com

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3jasrajd
jasrajd@microsoft.com
In reply to: Kuntal Ghosh (#1)
Re: An attempt to reduce WALWriteLock contention

We are also seeing contention on the walwritelock and repeated writes to the
same offset if we move the flush outside the lock in the Azure environment.
pgbench doesn't scale beyond ~8 cores without saturating the IOPs or
bandwidth. Is there more work being done in this area?

--
View this message in context: http://www.postgresql-archive.org/An-attempt-to-reduce-WALWriteLock-contention-tp5935907p5967786.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.

#4Michael Paquier
michael.paquier@gmail.com
In reply to: jasrajd (#3)
Re: An attempt to reduce WALWriteLock contention

On Wed, Jun 21, 2017 at 4:57 PM, jasrajd <jasrajd@microsoft.com> wrote:

We are also seeing contention on the walwritelock and repeated writes to the
same offset if we move the flush outside the lock in the Azure environment.
pgbench doesn't scale beyond ~8 cores without saturating the IOPs or
bandwidth. Is there more work being done in this area?

As of now, there is no patch in the development queue for Postgres 11
that is dedicated to this particular lock contention. There is a
patch for LWLocks in general on PowerPC, but that's all:
https://commitfest.postgresql.org/14/984/

Not sure if Kuntal has plans to resubmit this patch. It is actually a
bit sad not to see things moving forward with an approach to grouping
flushes.
--
Michael

#5Andres Freund
andres@anarazel.de
In reply to: jasrajd (#3)
Re: An attempt to reduce WALWriteLock contention

On 2017-06-21 00:57:32 -0700, jasrajd wrote:

We are also seeing contention on the walwritelock and repeated writes to the
same offset if we move the flush outside the lock in the Azure environment.
pgbench doesn't scale beyond ~8 cores without saturating the IOPs or
bandwidth. Is there more work being done in this area?

I kind of doubt that this scalability limit is directly related to this
patch - I've seen postgres scale further without that lock becoming the
prime issue. What exactly are you measuring / observing?

Andres Freund

#6Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Tomas Vondra (#2)
Re: An attempt to reduce WALWriteLock contention

On Thu, Dec 22, 2016 at 11:29 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:

How do these counts compare to the other wait events? For example
CLogControlLock, which is what Amit's patch [1] is about?

[1]
/messages/by-id/84c22fbb-b9c4-a02f-384b-b4feb2c67193@2ndquadrant.com

Hello Tomas,

I'm really sorry for this late reply; I somehow missed the thread.
Actually, I had seen some performance improvement when testing this
patch together with the CLogControlLock patch. But it then turned out
that all the improvement came from the CLogControlLock patch alone.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#7Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Michael Paquier (#4)
Re: An attempt to reduce WALWriteLock contention

On Thu, Jun 22, 2017 at 6:46 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

On Wed, Jun 21, 2017 at 4:57 PM, jasrajd <jasrajd@microsoft.com> wrote:

We are also seeing contention on the walwritelock and repeated writes to the
same offset if we move the flush outside the lock in the Azure environment.
pgbench doesn't scale beyond ~8 cores without saturating the IOPs or
bandwidth. Is there more work being done in this area?

As of now, there is no patch in the development queue for Postgres 11
that is dedicated to this particular lock contention. There is a
patch for LWLocks in general on PowerPC, but that's all:
https://commitfest.postgresql.org/14/984/

Not sure if Kuntal has plans to resubmit this patch. It is actually a
bit sad not to see things moving forward with an approach to grouping
flushes.

As of now, I've no plans to re-submit the patch. Actually, I'm not
sure what I should try next. I would love to get some advice/direction
regarding this.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

#8Sokolov Yura
y.sokolov@postgrespro.ru
In reply to: Michael Paquier (#4)
Re: An attempt to reduce WALWriteLock contention

On 2017-06-22 04:16, Michael Paquier wrote:

On Wed, Jun 21, 2017 at 4:57 PM, jasrajd <jasrajd@microsoft.com> wrote:

We are also seeing contention on the walwritelock and repeated writes to the
same offset if we move the flush outside the lock in the Azure environment.
pgbench doesn't scale beyond ~8 cores without saturating the IOPs or
bandwidth. Is there more work being done in this area?

As of now, there is no patch in the development queue for Postgres 11
that is dedicated to this particular lock contention. There is a
patch for LWLocks in general on PowerPC, but that's all:
https://commitfest.postgresql.org/14/984/

Not sure if Kuntal has plans to resubmit this patch. It is actually a
bit sad not to see things moving forward with an approach to grouping
flushes.
--
Michael

There is also a patch addressing LWLock degradation on NUMA:
https://commitfest.postgresql.org/14/1166/

But both patches are about the LWLock implementation itself, not its usage.

--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company

#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Andres Freund (#5)
Re: An attempt to reduce WALWriteLock contention

On Thu, Jun 22, 2017 at 6:54 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-06-21 00:57:32 -0700, jasrajd wrote:

We are also seeing contention on the walwritelock and repeated writes to the
same offset if we move the flush outside the lock in the Azure environment.
pgbench doesn't scale beyond ~8 cores without saturating the IOPs or
bandwidth. Is there more work being done in this area?

That should not happen if the writes from various backends are
combined in some way. However, it is not very clear what exactly you
have done as part of taking the flush calls out of WALWriteLock. Can
you share the patch or some details about how you have done it, and
how you have measured the contention you are seeing?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
