WalSndWakeup() and synchronous_commit=off

Started by Andres Freundover 13 years ago23 messages

andres@2ndquadrant.com

over 13 years ago

Hi all,

I noticed that when synchronous_commit=off were not waking up the wal sender
latch in xact.c:RecordTransactionCommit which leads to ugly delays of approx 7
seconds (1 + replication_timeout/10) with default settings.
Given that were flushing the wal to disk much sooner this appears to be a bad
idea - especially as this may happen even under load if we ever reach the
'coughtup' state.

I wonder why the WalSndWakeup isn't done like:

diff --git a/src/backend/access/transam/xlog.c 
b/src/backend/access/transam/xlog.c
index ecb71b6..7a3224b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1906,6 +1906,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool 
xlog_switch)
            xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
        SpinLockRelease(&xlogctl->info_lck);
    }
+
+   /* the walsender wasn't woken up in xact.c */
+   if(max_wal_senders > 1 && synchronous_commit == SYNCHRONOUS_COMMIT_OFF)
+       WalSndWakeup();
 }

Doing that for the synchronous_commit=off case can imo be considered a bugfix,
but I wonder why we ever wake the senders somewhere else?
The only argument I can see for doing it at places like StartTransactionCommit
is that thats the place after which the data will be visible on the client. I
think thats a non-argument though because if wal is flushed to disk outside of
a commit there normally is enough data to make it worthwile.

Doing the above results in a very noticeable reduction in lagginess and even a
noticeable reduction in cpu-usage spikes on a busy replication test setup.

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Simon Riggs

simon@2ndQuadrant.com

over 13 years ago

In reply to: Andres Freund (#1)

Re: WalSndWakeup() and synchronous_commit=off

On 10 May 2012 20:51, Andres Freund <andres@2ndquadrant.com> wrote:

I noticed that when synchronous_commit=off were not waking up the wal sender
latch in xact.c:RecordTransactionCommit which leads to ugly delays of approx 7
seconds (1 + replication_timeout/10) with default settings.
Given that were flushing the wal to disk much sooner this appears to be a bad
idea - especially as this may happen even under load if we ever reach the
'coughtup' state.

Sounds like a problem. I'll have a look.

I wonder why the WalSndWakeup isn't done like:
diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index ecb71b6..7a3224b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1906,6 +1906,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool
xlog_switch)
           xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
       SpinLockRelease(&xlogctl->info_lck);
   }
+
+   /* the walsender wasn't woken up in xact.c */
+   if(max_wal_senders > 1 && synchronous_commit == SYNCHRONOUS_COMMIT_OFF)
+       WalSndWakeup();
 }
Doing that for the synchronous_commit=off case can imo be considered a bugfix,
but I wonder why we ever wake the senders somewhere else?
The only argument I can see for doing it at places like StartTransactionCommit
is that thats the place after which the data will be visible on the client. I
think thats a non-argument though because if wal is flushed to disk outside of
a commit there normally is enough data to make it worthwile.

Doing the above results in a very noticeable reduction in lagginess and even a
noticeable reduction in cpu-usage spikes on a busy replication test setup.

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Fujii Masao

masao.fujii@gmail.com

over 13 years ago

In reply to: Andres Freund (#1)

Re: WalSndWakeup() and synchronous_commit=off

On Fri, May 11, 2012 at 4:51 AM, Andres Freund <andres@2ndquadrant.com> wrote:

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index ecb71b6..7a3224b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1906,6 +1906,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool
xlog_switch)
           xlogctl->LogwrtRqst.Flush = LogwrtResult.Flush;
       SpinLockRelease(&xlogctl->info_lck);
   }
+
+   /* the walsender wasn't woken up in xact.c */
+   if(max_wal_senders > 1 && synchronous_commit == SYNCHRONOUS_COMMIT_OFF)
+       WalSndWakeup();
 }

Calling WalSndWakeup() while WALWriteLock is being held might cause another
performance degradation. No?

Regards,

--
Fujii Masao

Robert Haas

robertmhaas@gmail.com

over 13 years ago

In reply to: Fujii Masao (#3)

Re: WalSndWakeup() and synchronous_commit=off

On Fri, May 11, 2012 at 1:09 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Calling WalSndWakeup() while WALWriteLock is being held might cause another
performance degradation. No?

That definitely doesn't seem ideal - a lot of things can pile up
behind WALWriteLock. I'm not sure how big a problem it would be in
practice, but we generally make a practice of avoiding sending signals
while holding LWLocks whenever possible...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Robert Haas (#4)

Re: WalSndWakeup() and synchronous_commit=off

On Friday, May 11, 2012 07:20:26 PM Robert Haas wrote:

On Fri, May 11, 2012 at 1:09 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Calling WalSndWakeup() while WALWriteLock is being held might cause
another performance degradation. No?

That definitely doesn't seem ideal - a lot of things can pile up
behind WALWriteLock. I'm not sure how big a problem it would be in
practice, but we generally make a practice of avoiding sending signals
while holding LWLocks whenever possible...

In my measurements on moderately powerful hardware I couldn't see any
degradation on the primary - in fact the contrary, but the improvements were
around 0.4% and I only tested 10min so its not exactly hard evidence. But I
aggree its not ideal.
Its the only place though which knows whether its actually sensible to wakeup
the walsender. We could make it return whether it wrote anything and do the
wakeup at the callers. I count 4 different callsites which would be an
annoying duplication but I don't really see anything better right now.

Better Ideas?

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Robert Haas (#4)

Re: WalSndWakeup() and synchronous_commit=off

Robert Haas <robertmhaas@gmail.com> writes:

That definitely doesn't seem ideal - a lot of things can pile up
behind WALWriteLock. I'm not sure how big a problem it would be in
practice, but we generally make a practice of avoiding sending signals
while holding LWLocks whenever possible...

There's a good reason for that, which is that the scheduler might well
decide to go run the wakened process instead of you. Admittedly this
tends to not be a problem on machines with $bignum CPUs, but on
single-CPU machines I've seen it happen a lot.

Refactoring so that the signal is sent only after lock release seems
like a good idea to me.

regards, tom lane

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Tom Lane (#6)

Re: WalSndWakeup() and synchronous_commit=off

On Friday, May 11, 2012 08:36:24 PM Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

That definitely doesn't seem ideal - a lot of things can pile up
behind WALWriteLock. I'm not sure how big a problem it would be in
practice, but we generally make a practice of avoiding sending signals
while holding LWLocks whenever possible...

There's a good reason for that, which is that the scheduler might well
decide to go run the wakened process instead of you. Admittedly this
tends to not be a problem on machines with $bignum CPUs, but on
single-CPU machines I've seen it happen a lot.

Refactoring so that the signal is sent only after lock release seems
like a good idea to me.

Will send a patch lateron, duplication seems to be manageable.

Andres

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Andres Freund (#5)

Re: WalSndWakeup() and synchronous_commit=off

Andres Freund <andres@2ndquadrant.com> writes:

Its the only place though which knows whether its actually sensible to wakeup
the walsender. We could make it return whether it wrote anything and do the
wakeup at the callers. I count 4 different callsites which would be an
annoying duplication but I don't really see anything better right now.

Another point here is that XLogWrite is not only normally called with
the lock held, but inside a critical section. I see no reason to take
the risk of doing signal sending inside critical sections.

BTW, a depressingly large fraction of the existing calls to WalSndWakeup
are also inside critical sections, generally for no good reason that I
can see. For example, in EndPrepare(), why was the call placed where
it is and not down beside SyncRepWaitForLSN?

regards, tom lane

Simon Riggs

simon@2ndQuadrant.com

over 13 years ago

In reply to: Tom Lane (#8)

Re: WalSndWakeup() and synchronous_commit=off

On 11 May 2012 19:45, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andres Freund <andres@2ndquadrant.com> writes:

Its the only place though which knows whether its actually sensible to wakeup
the walsender. We could make it return whether it wrote anything and do the
wakeup at the callers. I count 4 different callsites which would be an
annoying duplication but I don't really see anything better right now.

Another point here is that XLogWrite is not only normally called with
the lock held, but inside a critical section. I see no reason to take
the risk of doing signal sending inside critical sections.

BTW, a depressingly large fraction of the existing calls to WalSndWakeup
are also inside critical sections, generally for no good reason that I
can see. For example, in EndPrepare(), why was the call placed where
it is and not down beside SyncRepWaitForLSN?

I think because nobody thought of that.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#10

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Tom Lane (#8)

Re: WalSndWakeup() and synchronous_commit=off

On Friday, May 11, 2012 08:45:23 PM Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

Its the only place though which knows whether its actually sensible to
wakeup the walsender. We could make it return whether it wrote anything
and do the wakeup at the callers. I count 4 different callsites which
would be an annoying duplication but I don't really see anything better
right now.

Another point here is that XLogWrite is not only normally called with
the lock held, but inside a critical section. I see no reason to take
the risk of doing signal sending inside critical sections.

BTW, a depressingly large fraction of the existing calls to WalSndWakeup
are also inside critical sections, generally for no good reason that I
can see. For example, in EndPrepare(), why was the call placed where
it is and not down beside SyncRepWaitForLSN?

Hm. While I see no real problem moving it out of the lock I don't really see a
way to cleanly outside critical sections everywhere. The impact of doing so
seems to be rather big to me. The only externally visible place where it
actually is known whether we write out data and thus do the wakeup is
XLogInsert, XLogFlush and XLogBackgroundFlush. The first two of those are
routinely called inside a critical section.

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#11

Fujii Masao

masao.fujii@gmail.com

over 13 years ago

In reply to: Andres Freund (#10)

1 attachment(s)

Re: WalSndWakeup() and synchronous_commit=off

On Mon, May 14, 2012 at 6:32 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On Friday, May 11, 2012 08:45:23 PM Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

Its the only place though which knows whether its actually sensible to
wakeup the walsender. We could make it return whether it wrote anything
and do the wakeup at the callers. I count 4 different callsites which
would be an annoying duplication but I don't really see anything better
right now.

Another point here is that XLogWrite is not only normally called with
the lock held, but inside a critical section. I see no reason to take
the risk of doing signal sending inside critical sections.

BTW, a depressingly large fraction of the existing calls to WalSndWakeup
are also inside critical sections, generally for no good reason that I
can see. For example, in EndPrepare(), why was the call placed where
it is and not down beside SyncRepWaitForLSN?

Hm. While I see no real problem moving it out of the lock I don't really see a
way to cleanly outside critical sections everywhere. The impact of doing so
seems to be rather big to me. The only externally visible place where it
actually is known whether we write out data and thus do the wakeup is
XLogInsert, XLogFlush and XLogBackgroundFlush. The first two of those are
routinely called inside a critical section.

So what about moving the existing calls of WalSndWakeup() out of a critical
section and adding new call of WalSndWakeup() into XLogBackgroundFlush()?
Then all WalSndWakeup()s are called outside a critical section and after
releasing WALWriteLock. I attached the patch.

Regards,

--
Fujii Masao

Attachments:

move_walsndwakeup_v1.patchapplication/octet-stream; name=move_walsndwakeup_v1.patchDownload

*** a/src/backend/access/transam/twophase.c
--- b/src/backend/access/transam/twophase.c
***************
*** 1041,1053 **** EndPrepare(GlobalTransaction gxact)
  
  	/* If we crash now, we have prepared: WAL replay will fix things */
  
- 	/*
- 	 * Wake up all walsenders to send WAL up to the PREPARE record immediately
- 	 * if replication is enabled
- 	 */
- 	if (max_wal_senders > 0)
- 		WalSndWakeup();
- 
  	/* write correct CRC and close file */
  	if ((write(fd, &statefile_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
  	{
--- 1041,1046 ----
***************
*** 1086,1091 **** EndPrepare(GlobalTransaction gxact)
--- 1079,1091 ----
  	END_CRIT_SECTION();
  
  	/*
+ 	 * Wake up all walsenders to send WAL up to the PREPARE record immediately
+ 	 * outside a critical section if replication is enabled
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WalSndWakeup();
+ 
+ 	/*
  	 * Wait for synchronous replication, if required.
  	 *
  	 * Note that at this stage we have marked the prepare, but still show as
***************
*** 2048,2060 **** RecordTransactionCommitPrepared(TransactionId xid,
  	/* Flush XLOG to disk */
  	XLogFlush(recptr);
  
- 	/*
- 	 * Wake up all walsenders to send WAL up to the COMMIT PREPARED record
- 	 * immediately if replication is enabled
- 	 */
- 	if (max_wal_senders > 0)
- 		WalSndWakeup();
- 
  	/* Mark the transaction committed in pg_clog */
  	TransactionIdCommitTree(xid, nchildren, children);
  
--- 2048,2053 ----
***************
*** 2064,2069 **** RecordTransactionCommitPrepared(TransactionId xid,
--- 2057,2069 ----
  	END_CRIT_SECTION();
  
  	/*
+ 	 * Wake up all walsenders to send WAL up to the COMMIT PREPARED record
+ 	 * immediately outside a critical section if replication is enabled
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WalSndWakeup();
+ 
+ 	/*
  	 * Wait for synchronous replication, if required.
  	 *
  	 * Note that at this stage we have marked clog, but still show as running
***************
*** 2136,2148 **** RecordTransactionAbortPrepared(TransactionId xid,
  	XLogFlush(recptr);
  
  	/*
- 	 * Wake up all walsenders to send WAL up to the ABORT PREPARED record
- 	 * immediately if replication is enabled
- 	 */
- 	if (max_wal_senders > 0)
- 		WalSndWakeup();
- 
- 	/*
  	 * Mark the transaction aborted in clog.  This is not absolutely necessary
  	 * but we may as well do it while we are here.
  	 */
--- 2136,2141 ----
***************
*** 2151,2156 **** RecordTransactionAbortPrepared(TransactionId xid,
--- 2144,2156 ----
  	END_CRIT_SECTION();
  
  	/*
+ 	 * Wake up all walsenders to send WAL up to the ABORT PREPARED record
+ 	 * immediately outside a critical section if replication is enabled
+ 	 */
+ 	if (max_wal_senders > 0)
+ 		WalSndWakeup();
+ 
+ 	/*
  	 * Wait for synchronous replication, if required.
  	 *
  	 * Note that at this stage we have marked clog, but still show as running
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 942,947 **** RecordTransactionCommit(void)
--- 942,948 ----
  	SharedInvalidationMessage *invalMessages = NULL;
  	bool		RelcacheInitFileInval = false;
  	bool		wrote_xlog;
+ 	bool		wakeup = false;
  
  	/* Get data needed for commit record */
  	nrels = smgrGetPendingDeletes(true, &rels);
***************
*** 1135,1147 **** RecordTransactionCommit(void)
  			pg_usleep(CommitDelay);
  
  		XLogFlush(XactLastRecEnd);
! 
! 		/*
! 		 * Wake up all walsenders to send WAL up to the COMMIT record
! 		 * immediately if replication is enabled
! 		 */
! 		if (max_wal_senders > 0)
! 			WalSndWakeup();
  
  		/*
  		 * Now we may update the CLOG, if we wrote a COMMIT record above
--- 1136,1142 ----
  			pg_usleep(CommitDelay);
  
  		XLogFlush(XactLastRecEnd);
! 		wakeup = true;
  
  		/*
  		 * Now we may update the CLOG, if we wrote a COMMIT record above
***************
*** 1183,1188 **** RecordTransactionCommit(void)
--- 1178,1190 ----
  		END_CRIT_SECTION();
  	}
  
+ 	/*
+ 	 * Wake up all walsenders to send WAL up to the COMMIT record
+ 	 * immediately outside a critical section if replication is enabled
+ 	 */
+ 	if (max_wal_senders > 0 && wakeup)
+ 		WalSndWakeup();
+ 
  	/* Compute latestXid while we have the child XIDs handy */
  	latestXid = TransactionIdLatest(xid, nchildren, children);
  
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 2271,2276 **** XLogBackgroundFlush(void)
--- 2271,2283 ----
  
  	END_CRIT_SECTION();
  
+ 	/*
+ 	 * Wake up all walsenders to send WAL up to the async commit record
+ 	 * immediately outside a critical section is replication is enabled
+ 	 */
+ 	if (max_wal_senders > 0 && !flexible && wrote_something)
+ 		WalSndWakeup();
+ 
  	return wrote_something;
  }

#12

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Fujii Masao (#11)

Re: WalSndWakeup() and synchronous_commit=off

On Monday, May 14, 2012 07:55:32 PM Fujii Masao wrote:

On Mon, May 14, 2012 at 6:32 PM, Andres Freund <andres@2ndquadrant.com>

wrote:

On Friday, May 11, 2012 08:45:23 PM Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

Its the only place though which knows whether its actually sensible to
wakeup the walsender. We could make it return whether it wrote
anything and do the wakeup at the callers. I count 4 different
callsites which would be an annoying duplication but I don't really
see anything better right now.

Another point here is that XLogWrite is not only normally called with
the lock held, but inside a critical section. I see no reason to take
the risk of doing signal sending inside critical sections.

BTW, a depressingly large fraction of the existing calls to WalSndWakeup
are also inside critical sections, generally for no good reason that I
can see. For example, in EndPrepare(), why was the call placed where
it is and not down beside SyncRepWaitForLSN?

Hm. While I see no real problem moving it out of the lock I don't really
see a way to cleanly outside critical sections everywhere. The impact of
doing so seems to be rather big to me. The only externally visible place
where it actually is known whether we write out data and thus do the
wakeup is XLogInsert, XLogFlush and XLogBackgroundFlush. The first two
of those are routinely called inside a critical section.

So what about moving the existing calls of WalSndWakeup() out of a critical
section and adding new call of WalSndWakeup() into XLogBackgroundFlush()?
Then all WalSndWakeup()s are called outside a critical section and after
releasing WALWriteLock. I attached the patch.

Imo its simply not sensible to call WalSndWakeup at *any* of the current
locations. They simply don't have the necessary information. They will wakeup
too often (because with concurrency commits often won't require additional wal
writes) and too seldom (because a flush caused by XLogInsert wont cause a
wakeup).

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#13

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Andres Freund (#12)

Re: WalSndWakeup() and synchronous_commit=off

On Tuesday, May 15, 2012 05:30:27 PM Andres Freund wrote:

On Monday, May 14, 2012 07:55:32 PM Fujii Masao wrote:

On Mon, May 14, 2012 at 6:32 PM, Andres Freund <andres@2ndquadrant.com>

wrote:

On Friday, May 11, 2012 08:45:23 PM Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

Its the only place though which knows whether its actually sensible
to wakeup the walsender. We could make it return whether it wrote
anything and do the wakeup at the callers. I count 4 different
callsites which would be an annoying duplication but I don't really
see anything better right now.

Another point here is that XLogWrite is not only normally called with
the lock held, but inside a critical section. I see no reason to take
the risk of doing signal sending inside critical sections.

BTW, a depressingly large fraction of the existing calls to
WalSndWakeup are also inside critical sections, generally for no good
reason that I can see. For example, in EndPrepare(), why was the
call placed where it is and not down beside SyncRepWaitForLSN?

Hm. While I see no real problem moving it out of the lock I don't
really see a way to cleanly outside critical sections everywhere. The
impact of doing so seems to be rather big to me. The only externally
visible place where it actually is known whether we write out data and
thus do the wakeup is XLogInsert, XLogFlush and XLogBackgroundFlush.
The first two of those are routinely called inside a critical section.

So what about moving the existing calls of WalSndWakeup() out of a
critical section and adding new call of WalSndWakeup() into
XLogBackgroundFlush()? Then all WalSndWakeup()s are called outside a
critical section and after releasing WALWriteLock. I attached the patch.

Imo its simply not sensible to call WalSndWakeup at *any* of the current
locations. They simply don't have the necessary information. They will
wakeup too often (because with concurrency commits often won't require
additional wal writes) and too seldom (because a flush caused by
XLogInsert wont cause a wakeup).

Does anybody have a better idea than to either call WalSndWakeup() at
essentially the wrong places or calling it inside a critical section?

Tom, what danger do you see from calling it in a critical section?

Greetings,

Andres

#14

Tom Lane

tgl@sss.pgh.pa.us

over 13 years ago

In reply to: Andres Freund (#13)

Re: WalSndWakeup() and synchronous_commit=off

Andres Freund <andres@2ndquadrant.com> writes:

Does anybody have a better idea than to either call WalSndWakeup() at
essentially the wrong places or calling it inside a critical section?

Tom, what danger do you see from calling it in a critical section?

My concern was basically that it might throw an error. Looking at the
current implementation of SetLatch, it seems that's not possible, but
I wonder whether we want to lock ourselves into that assumption.

Still, if the alternatives are worse, maybe that's the best answer.
If we do that, though, let's add comments to WalSndWakeup and SetLatch
mentioning that they mustn't throw error.

regards, tom lane

#15

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Tom Lane (#14)

1 attachment(s)

Re: WalSndWakeup() and synchronous_commit=off

Hi,

On Monday, May 28, 2012 07:11:53 PM Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

Does anybody have a better idea than to either call WalSndWakeup() at
essentially the wrong places or calling it inside a critical section?

Tom, what danger do you see from calling it in a critical section?

My concern was basically that it might throw an error. Looking at the
current implementation of SetLatch, it seems that's not possible, but
I wonder whether we want to lock ourselves into that assumption.

The assumption is already made at several other places I think.
XLogSetAsyncXactLSN does a SetLatch and is called from critical sections;
several signal handlers call it without any attention to the context.

Requiring it to be called outside would make its usage considerably less
convenient and I don't really see what could change that would require to
throw non-panic errors.

Still, if the alternatives are worse, maybe that's the best answer.
If we do that, though, let's add comments to WalSndWakeup and SetLatch
mentioning that they mustn't throw error.

Patch attached.

Greetings,

Andres

PS: Sorry for dropping the CC list before...

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Fix-walsender-wakeup-handling.patchtext/x-patch; charset=UTF-8; name=0001-Fix-walsender-wakeup-handling.patchDownload

From c4d6badac0ec07a3d4b188eab2078cffdcf57716 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 29 May 2012 20:00:16 +0200
Subject: [PATCH] Fix walsender wakeup handling

The previous coding could miss xlog writeouts at several places. E.g. when wal
was written out by the background writer or even after a commit if
synchronous_commit=off.
This could lead to delays in sending data to the standby of up to 7 seconds.

To fix this move the responsibility of notification to the layer where the
neccessary information is actually present. We take some care not to do the
notification while we hold conteded locks like WALInsertLock or WalWriteLock
locks.

Document the preexisting fact that we rely on SetLatch to be safe from within
signal handlers and critical sections.
---
 src/backend/access/transam/twophase.c |   21 -----------------
 src/backend/access/transam/xact.c     |    7 ------
 src/backend/access/transam/xlog.c     |   18 ++++++++++++++
 src/backend/port/unix_latch.c         |    3 +++
 src/backend/port/win32_latch.c        |    4 ++++
 src/backend/replication/walsender.c   |   42 ++++++++++++++++++++++++++++++++-
 src/include/replication/walsender.h   |    2 ++
 7 files changed, 68 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 6db46c0..edbbdf1 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1041,13 +1041,6 @@ EndPrepare(GlobalTransaction gxact)
 
 	/* If we crash now, we have prepared: WAL replay will fix things */
 
-	/*
-	 * Wake up all walsenders to send WAL up to the PREPARE record immediately
-	 * if replication is enabled
-	 */
-	if (max_wal_senders > 0)
-		WalSndWakeup();
-
 	/* write correct CRC and close file */
 	if ((write(fd, &statefile_crc, sizeof(pg_crc32))) != sizeof(pg_crc32))
 	{
@@ -2048,13 +2041,6 @@ RecordTransactionCommitPrepared(TransactionId xid,
 	/* Flush XLOG to disk */
 	XLogFlush(recptr);
 
-	/*
-	 * Wake up all walsenders to send WAL up to the COMMIT PREPARED record
-	 * immediately if replication is enabled
-	 */
-	if (max_wal_senders > 0)
-		WalSndWakeup();
-
 	/* Mark the transaction committed in pg_clog */
 	TransactionIdCommitTree(xid, nchildren, children);
 
@@ -2136,13 +2122,6 @@ RecordTransactionAbortPrepared(TransactionId xid,
 	XLogFlush(recptr);
 
 	/*
-	 * Wake up all walsenders to send WAL up to the ABORT PREPARED record
-	 * immediately if replication is enabled
-	 */
-	if (max_wal_senders > 0)
-		WalSndWakeup();
-
-	/*
 	 * Mark the transaction aborted in clog.  This is not absolutely necessary
 	 * but we may as well do it while we are here.
 	 */
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c71a10e..d697ab8 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1137,13 +1137,6 @@ RecordTransactionCommit(void)
 		XLogFlush(XactLastRecEnd);
 
 		/*
-		 * Wake up all walsenders to send WAL up to the COMMIT record
-		 * immediately if replication is enabled
-		 */
-		if (max_wal_senders > 0)
-			WalSndWakeup();
-
-		/*
 		 * Now we may update the CLOG, if we wrote a COMMIT record above
 		 */
 		if (markXidCommitted)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d3650bd..ef64f33 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1014,6 +1014,8 @@ begin:;
 
 		END_CRIT_SECTION();
 
+		/* wakeup the WalSnd now that we released the WALWriteLock */
+		WalSndWakeupProcess();
 		return RecPtr;
 	}
 
@@ -1215,6 +1217,9 @@ begin:;
 
 	END_CRIT_SECTION();
 
+	/* wakeup the WalSnd now that we outside contented locks */
+	WalSndWakeupProcess();
+
 	return RecPtr;
 }
 
@@ -1819,6 +1824,10 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 			if (finishing_seg || (xlog_switch && last_iteration))
 			{
 				issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
+
+				/* signal that we need to wakeup WalSnd later */
+				WalSndWakeupRequest();
+
 				LogwrtResult.Flush = LogwrtResult.Write;		/* end of page */
 
 				if (XLogArchivingActive())
@@ -1883,6 +1892,9 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible, bool xlog_switch)
 				openLogOff = 0;
 			}
 			issue_xlog_fsync(openLogFile, openLogId, openLogSeg);
+
+			/* signal that we need to wakeup WalSnd later */
+			WalSndWakeupRequest();
 		}
 		LogwrtResult.Flush = LogwrtResult.Write;
 	}
@@ -2146,6 +2158,9 @@ XLogFlush(XLogRecPtr record)
 
 	END_CRIT_SECTION();
 
+	/* wakeup the WalSnd now that we released the WALWriteLock */
+	WalSndWakeupProcess();
+
 	/*
 	 * If we still haven't flushed to the request point then we have a
 	 * problem; most likely, the requested flush point is past end of XLOG.
@@ -2271,6 +2286,9 @@ XLogBackgroundFlush(void)
 
 	END_CRIT_SECTION();
 
+	/* wakeup the WalSnd now that we released the WALWriteLock */
+	WalSndWakeupProcess();
+
 	return wrote_something;
 }
 
diff --git a/src/backend/port/unix_latch.c b/src/backend/port/unix_latch.c
index e64282c..50731ac 100644
--- a/src/backend/port/unix_latch.c
+++ b/src/backend/port/unix_latch.c
@@ -416,6 +416,9 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
  * NB: when calling this in a signal handler, be sure to save and restore
  * errno around it.  (That's standard practice in most signal handlers, of
  * course, but we used to omit it in handlers that only set a flag.)
+ *
+ * NB: this function is called from critical sections and signal handlers so
+ * throwing an error is not a good idea.
  */
 void
 SetLatch(volatile Latch *latch)
diff --git a/src/backend/port/win32_latch.c b/src/backend/port/win32_latch.c
index 05b3426..988da53 100644
--- a/src/backend/port/win32_latch.c
+++ b/src/backend/port/win32_latch.c
@@ -246,6 +246,10 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	return result;
 }
 
+/*
+ * The comments above the unix implementation (unix_latch.c) of this function
+ * apply here as well.
+ */
 void
 SetLatch(volatile Latch *latch)
 {
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 5f93812..459127b 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -106,6 +106,12 @@ static StringInfoData reply_message;
  */
 static TimestampTz last_reply_timestamp;
 
+
+/*
+ * State for WalSndWakeupRequest
+ */
+static bool wroteNewXlogData = false;
+
 /* Flags set by signal handlers for later service in main loop */
 static volatile sig_atomic_t got_SIGHUP = false;
 volatile sig_atomic_t walsender_shutdown_requested = false;
@@ -1423,7 +1429,12 @@ WalSndShmemInit(void)
 	}
 }
 
-/* Wake up all walsenders */
+/*
+ * Wake up all walsenders
+ *
+ * This will be called inside critical sections, so throwing an error is not
+ * adviseable.
+ */
 void
 WalSndWakeup(void)
 {
@@ -1433,6 +1444,35 @@ WalSndWakeup(void)
 		SetLatch(&WalSndCtl->walsnds[i].latch);
 }
 
+/*
+ * Remember that we want to wakeup walsenders later
+ *
+ * This is separated from doing the actual wakeup because the writeout is done
+ * while holding contended locks.
+ */
+void
+WalSndWakeupRequest(void)
+{
+	wroteNewXlogData = true;
+}
+
+/*
+ * wakeup walsenders if there is work to be done
+ */
+void
+WalSndWakeupProcess(void)
+{
+	if(wroteNewXlogData){
+		wroteNewXlogData = false;
+		/*
+		 * Wake up all walsenders to send WAL up to the point where its flushed
+		 * safely to disk.
+		 */
+		if (max_wal_senders > 0)
+			WalSndWakeup();
+	}
+}
+
 /* Set state for current walsender (only called in walsender) */
 void
 WalSndSetState(WalSndState state)
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 128d2db..38191e7 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -31,6 +31,8 @@ extern void WalSndSignals(void);
 extern Size WalSndShmemSize(void);
 extern void WalSndShmemInit(void);
 extern void WalSndWakeup(void);
+extern void WalSndWakeupRequest(void);
+extern void WalSndWakeupProcess(void);
 extern void WalSndRqstFileReload(void);
 
 extern Datum pg_stat_get_wal_senders(PG_FUNCTION_ARGS);
-- 
1.7.10.rc3.3.g19a6c.dirty

#16

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Andres Freund (#15)

Re: WalSndWakeup() and synchronous_commit=off

On Tuesday, May 29, 2012 08:42:43 PM Andres Freund wrote:

Patch attached.

Imo this patch should be backported to 9.1, 9.0 doesn't use latches and does
not do explicit wakeup of the sender so its not applicable there.

I can prepare a patch for 9.1 if people agree, there has been some amount of
change that won't make it apply cleanly.

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#17

Fujii Masao

masao.fujii@gmail.com

over 13 years ago

In reply to: Andres Freund (#16)

Re: WalSndWakeup() and synchronous_commit=off

On Wed, May 30, 2012 at 9:46 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On Tuesday, May 29, 2012 08:42:43 PM Andres Freund wrote:

Patch attached.

Imo this patch should be backported to 9.1, 9.0 doesn't use latches and does
not do explicit wakeup of the sender so its not applicable there.

I can prepare a patch for 9.1 if people agree, there has been some amount of
change that won't make it apply cleanly.

The patch wakes up walsender more frequently than now. Which leads
walsender to send smaller WAL data packet more frequently, and furthermore
which leads walreceiver to issue fsync more frequently. So I'm afraid that
the patch makes the disk more busy and slows down the read-only query
in the standby. I'm also afraid that frequent fsync calls degrade the
performance
of sync replication. So it's better to do benchmark to address the concerns.

Regards,

--
Fujii Masao

#18

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Fujii Masao (#17)

Re: WalSndWakeup() and synchronous_commit=off

On Thursday, May 31, 2012 01:33:33 AM Fujii Masao wrote:

On Wed, May 30, 2012 at 9:46 PM, Andres Freund <andres@2ndquadrant.com>

wrote:

On Tuesday, May 29, 2012 08:42:43 PM Andres Freund wrote:

Patch attached.

Imo this patch should be backported to 9.1, 9.0 doesn't use latches and
does not do explicit wakeup of the sender so its not applicable there.

I can prepare a patch for 9.1 if people agree, there has been some amount
of change that won't make it apply cleanly.

The patch wakes up walsender more frequently than now. Which leads
walsender to send smaller WAL data packet more frequently, and furthermore
which leads walreceiver to issue fsync more frequently. So I'm afraid that
the patch makes the disk more busy and slows down the read-only query
in the standby. I'm also afraid that frequent fsync calls degrade the
performance
of sync replication. So it's better to do benchmark to address the
concerns.

I couldn't measure any significant difference in #fsyncs or replication speed
in a busy pgbench workload. If anything there were less, but the difference
was small. Both with synchronous_commit=on/off. I did *not* test sync repl.

Why would you expect a change? Walsender is only signalled if XLogWrite
actually flushed data to disk. Thats the point of the exercise. Thats normally
only done if the wal buffers are full or a commit record (+some other stuff)
requires the xlog to be flushed up to some point.

Greetings,

Andres

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#19

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Andres Freund (#15)

Re: WalSndWakeup() and synchronous_commit=off

On Tuesday, May 29, 2012 08:42:43 PM Andres Freund wrote:

Hi,

On Monday, May 28, 2012 07:11:53 PM Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

Does anybody have a better idea than to either call WalSndWakeup() at
essentially the wrong places or calling it inside a critical section?

Tom, what danger do you see from calling it in a critical section?

My concern was basically that it might throw an error. Looking at the
current implementation of SetLatch, it seems that's not possible, but
I wonder whether we want to lock ourselves into that assumption.

The assumption is already made at several other places I think.
XLogSetAsyncXactLSN does a SetLatch and is called from critical sections;
several signal handlers call it without any attention to the context.

Requiring it to be called outside would make its usage considerably less
convenient and I don't really see what could change that would require to
throw non-panic errors.

Still, if the alternatives are worse, maybe that's the best answer.
If we do that, though, let's add comments to WalSndWakeup and SetLatch
mentioning that they mustn't throw error.

Patch attached.

I would like to invite some more review (+commit...) here ;). Imo this is an
annoying bug which should be fixed before next point release or beta/rc comes
out...

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#20

Simon Riggs

simon@2ndQuadrant.com

over 13 years ago

In reply to: Andres Freund (#19)

Re: WalSndWakeup() and synchronous_commit=off

On 6 June 2012 20:11, Andres Freund <andres@2ndquadrant.com> wrote:

On Tuesday, May 29, 2012 08:42:43 PM Andres Freund wrote:

Hi,

On Monday, May 28, 2012 07:11:53 PM Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

Does anybody have a better idea than to either call WalSndWakeup() at
essentially the wrong places or calling it inside a critical section?

Tom, what danger do you see from calling it in a critical section?

My concern was basically that it might throw an error. Looking at the
current implementation of SetLatch, it seems that's not possible, but
I wonder whether we want to lock ourselves into that assumption.

The assumption is already made at several other places I think.
XLogSetAsyncXactLSN does a SetLatch and is called from critical sections;
several signal handlers call it without any attention to the context.

Requiring it to be called outside would make its usage considerably less
convenient and I don't really see what could change that would require to
throw non-panic errors.

Still, if the alternatives are worse, maybe that's the best answer.
If we do that, though, let's add comments to WalSndWakeup and SetLatch
mentioning that they mustn't throw error.

Patch attached.

I would like to invite some more review (+commit...) here ;). Imo this is an
annoying bug which should be fixed before next point release or beta/rc comes
out...

Moved the wakeup to a logical place outside a critical section.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#21

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Simon Riggs (#20)

Re: WalSndWakeup() and synchronous_commit=off

On Thursday, June 07, 2012 08:41:23 PM Simon Riggs wrote:

On 6 June 2012 20:11, Andres Freund <andres@2ndquadrant.com> wrote:

On Tuesday, May 29, 2012 08:42:43 PM Andres Freund wrote:

Hi,

On Monday, May 28, 2012 07:11:53 PM Tom Lane wrote:

Andres Freund <andres@2ndquadrant.com> writes:

Does anybody have a better idea than to either call WalSndWakeup()
at essentially the wrong places or calling it inside a critical
section?

Tom, what danger do you see from calling it in a critical section?

My concern was basically that it might throw an error. Looking at the
current implementation of SetLatch, it seems that's not possible, but
I wonder whether we want to lock ourselves into that assumption.

The assumption is already made at several other places I think.
XLogSetAsyncXactLSN does a SetLatch and is called from critical
sections; several signal handlers call it without any attention to the
context.

Requiring it to be called outside would make its usage considerably less
convenient and I don't really see what could change that would require
to throw non-panic errors.

Still, if the alternatives are worse, maybe that's the best answer.
If we do that, though, let's add comments to WalSndWakeup and SetLatch
mentioning that they mustn't throw error.

Patch attached.

I would like to invite some more review (+commit...) here ;). Imo this is
an annoying bug which should be fixed before next point release or
beta/rc comes out...

Moved the wakeup to a logical place outside a critical section.

Hm. I don't really like the way you implemented that. While it reduces the
likelihood quite a bit it will still miss wakeups if an XLogInsert pushes out
the data because of missing space or if any place does an XLogFlush(lsn).
The knowledge is really only available in XLogWrite...

Andres

#22

Simon Riggs

simon@2ndQuadrant.com

over 13 years ago

In reply to: Andres Freund (#21)

Re: WalSndWakeup() and synchronous_commit=off

On 7 June 2012 21:08, Andres Freund <andres@2ndquadrant.com> wrote:

Moved the wakeup to a logical place outside a critical section.

Hm. I don't really like the way you implemented that. While it reduces the
likelihood quite a bit it will still miss wakeups if an XLogInsert pushes out
the data because of missing space or if any place does an XLogFlush(lsn).
The knowledge is really only available in XLogWrite...

Right, but the placement inside the critical section was objected to.

This way, any caller of XLogFlush() will be swept up at least once per
wal_writer_delay, so missing a few calls doesn't mean we have spikes
in replication delay.

Doing it more frequently was also an objection from Fujii, to which we
must listen.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#23

Andres Freund

andres@2ndquadrant.com

over 13 years ago

In reply to: Simon Riggs (#22)

Re: WalSndWakeup() and synchronous_commit=off

Hi,

On Friday, June 08, 2012 01:42:22 AM Simon Riggs wrote:

On 7 June 2012 21:08, Andres Freund <andres@2ndquadrant.com> wrote:

Moved the wakeup to a logical place outside a critical section.

Hm. I don't really like the way you implemented that. While it reduces
the likelihood quite a bit it will still miss wakeups if an XLogInsert
pushes out the data because of missing space or if any place does an
XLogFlush(lsn). The knowledge is really only available in XLogWrite...

Right, but the placement inside the critical section was objected to.

And that objection was later somewhat eased by Tom. There currently still are
several WalSndWakeup calls in critical sections and several other uses of
latches in critial sections and several in signal handlers (which may be
during a critical section). Thats why I added comments to SetLatch documenting
that they need to be signal/concurrency safe. (Which they are atm)

This way, any caller of XLogFlush() will be swept up at least once per
wal_writer_delay, so missing a few calls doesn't mean we have spikes
in replication delay.

Unfortunately it does. At least I measure it here ;) (obviously less than the
prev. 7 seconds). Its not that surprising though: Especially during high or
even more so during bursty wal activity a backend might decide to write out
wal itself. In that case the background writer doesn't necessarily have
anything to write out so it won't wake the walsender.
Under high load the number of wakeups is twice or thrice as high *before* my
patch than afterwards (with synchronous_commit=on obviously). As you most
definitely know (group commit work et al) in a concurrent situation many of
the wakeups from the current location (RecordTransactionCommit) are useless
because the xlog was already flushed by another backend/background writer
before we get to do it.

Doing it more frequently was also an objection from Fujii, to which we
must listen.

I had hoped that I argued successfully against that, but obviously not ;)

Greetings,

Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services