Sync Rep and shutdown Re: Sync Rep v19

Started by Fujii Masao · almost 15 years ago · 42 messages
#1 Fujii Masao
masao.fujii@gmail.com
1 attachment(s)

On Wed, Mar 9, 2011 at 2:14 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

On Tue, Mar 8, 2011 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The fast shutdown handling seems fine, but why not just handle smart
shutdown the same way?

Currently, smart shutdown means no new connections; wait until
existing ones close normally. For consistency, it should behave the
same for sync rep.

Agreed. I think that a user who requests smart shutdown expects all
the existing connections to basically be closed normally by the client. So it
doesn't seem to be a good idea to forcibly close the connection and prevent
the COMMIT from being returned in the smart shutdown case. But I'm all ears
for better suggestions.

Anyway, we reached consensus about how fast shutdown should work with
sync rep, so I created the patch. Please feel free to comment on and
commit the patch first ;)

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

fast_shutdown_syncrep_v1.patch (application/octet-stream)
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 164,183 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
  
  			case SYNC_REP_WAITING:
  				/*
- 				 * Check for conditions that would cause us to leave the
- 				 * wait state before the LSN has been reached.
- 				 */
- 				if (!PostmasterIsAlive(true))
- 				{
- 					LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
- 					SHMQueueDelete(&(MyProc->syncRepLinks));
- 					LWLockRelease(SyncRepLock);
- 
- 					MyProc->syncRepState = SYNC_REP_MUST_DISCONNECT;
- 					return;
- 				}
- 
- 				/*
  				 * We don't receive SIGHUPs at this point, so resetting
  				 * synchronous_standby_names has no effect on waiters.
  				 */
--- 164,169 ----
***************
*** 191,216 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
  				 * WalSender has checked our LSN and has removed us from
  				 * queue. Cleanup local state and leave.
  				 */
- 				Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
- 
  				MyProc->syncRepState = SYNC_REP_NOT_WAITING;
! 				MyProc->waitLSN.xlogid = 0;
! 				MyProc->waitLSN.xrecoff = 0;
! 
! 				if (new_status)
! 				{
! 					/* Reset ps display */
! 					set_ps_display(new_status, false);
! 					pfree(new_status);
! 				}
! 
! 				return;
  
  			case SYNC_REP_MUST_DISCONNECT:
  				return;
  
  			default:
! 				elog(FATAL, "invalid syncRepState");
  		}
  
  		/*
--- 177,212 ----
  				 * WalSender has checked our LSN and has removed us from
  				 * queue. Cleanup local state and leave.
  				 */
  				MyProc->syncRepState = SYNC_REP_NOT_WAITING;
! 				goto cleanup;
  
  			case SYNC_REP_MUST_DISCONNECT:
  				return;
  
  			default:
! 				/*
! 				 * Just close the connection without returning anything
! 				 * to the client instead of throwing ERROR/FATAL error
! 				 * since it's not allowed while in COMMIT state.
! 				 */
! 				elog(WARNING, "invalid syncRepState");
! 				goto disconnect;
! 		}
! 
! 		/*
! 		 * Check for conditions that would cause us to leave the
! 		 * wait state before the LSN has been reached.
! 		 */
! 		if (!PostmasterIsAlive(true) || ProcDiePending)
! 		{
! 			if (ProcDiePending)
! 				ereport(LOG,
! 						(errcode(ERRCODE_ADMIN_SHUTDOWN),
! 						 errmsg("canceling the wait for replication and terminating "
! 								"connection due to administrator command"),
! 						 errdetail("The transaction has already been committed locally "
! 								   "but might have not been replicated to the standby.")));
! 			goto disconnect;
  		}
  
  		/*
***************
*** 220,225 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
--- 216,241 ----
  		 */
  		WaitLatch(&MyProc->waitLatch, 60000000L);
  	}
+ 
+ disconnect:
+ 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ 	SHMQueueDelete(&(MyProc->syncRepLinks));
+ 	LWLockRelease(SyncRepLock);
+ 
+ 	MyProc->syncRepState = SYNC_REP_MUST_DISCONNECT;
+ 
+ cleanup:
+ 	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+ 
+ 	MyProc->waitLSN.xlogid = 0;
+ 	MyProc->waitLSN.xrecoff = 0;
+ 
+ 	if (new_status)
+ 	{
+ 		/* Reset ps display */
+ 		set_ps_display(new_status, false);
+ 		pfree(new_status);
+ 	}
  }
  
  /*
*** a/src/backend/tcop/dest.c
--- b/src/backend/tcop/dest.c
***************
*** 36,41 ****
--- 36,45 ----
  #include "executor/tstoreReceiver.h"
  #include "libpq/libpq.h"
  #include "libpq/pqformat.h"
+ #include "miscadmin.h"
+ #include "replication/syncrep.h"
+ #include "storage/ipc.h"
+ #include "tcop/tcopprot.h"
  #include "utils/portal.h"
  
  
***************
*** 144,149 **** EndCommand(const char *commandTag, CommandDest dest)
--- 148,171 ----
  		case DestRemoteExecute:
  
  			/*
+ 			 * If we receive SIGTERM while we are waiting for synchronous
+ 			 * replication, we close the connection before returning the
+ 			 * "success" indication to the client. This prevents the client
+ 			 * from thinking that the transaction which might have not been
+ 			 * replicated yet has been committed and visible in the standby.
+ 			 */
+ 			if (MyProc->syncRepState == SYNC_REP_MUST_DISCONNECT)
+ 			{
+ 				/*
+ 				 * Reset whereToSendOutput to prevent ereport from
+ 				 * attempting to send any more messages to client.
+ 				 */
+ 				whereToSendOutput = DestNone;
+ 
+ 				proc_exit(0);
+ 			}
+ 
+ 			/*
  			 * We assume the commandTag is plain ASCII and therefore requires
  			 * no encoding conversion.
  			 */
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***************
*** 2643,2648 **** die(SIGNAL_ARGS)
--- 2643,2651 ----
  			InterruptHoldoffCount--;
  			ProcessInterrupts();
  		}
+ 
+ 		/* Wake up myself waiting on the latch */
+ 		SetLatch(&(MyProc->waitLatch));
  	}
  
  	errno = save_errno;
#2 Yeb Havinga
yebhavinga@gmail.com
In reply to: Fujii Masao (#1)
Re: Sync Rep and shutdown Re: Sync Rep v19

On 2011-03-09 08:38, Fujii Masao wrote:

On Wed, Mar 9, 2011 at 2:14 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

On Tue, Mar 8, 2011 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The fast shutdown handling seems fine, but why not just handle smart
shutdown the same way?

Currently, smart shutdown means no new connections; wait until
existing ones close normally. For consistency, it should behave the
same for sync rep.

Agreed. I think that a user who requests smart shutdown expects all
the existing connections to basically be closed normally by the client. So it
doesn't seem to be a good idea to forcibly close the connection and prevent
the COMMIT from being returned in the smart shutdown case. But I'm all ears
for better suggestions.

For me, smart has always been synonymous with no forced disconnects/exits,
or put differently, the 'clean' solution, as opposed to the fast and
unclean shutdown.

An alternative for a clean solution might be to forbid smart shutdown
if none of the sync standbys is connected. This would prevent the master
from entering a state in which a standby cannot connect anymore.
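
A minimal sketch of what such a guard might look like, in the server's
own idiom (illustrative only: SyncStandbyIsConnected() is an assumed
helper, and where such a test could safely run is itself an open
question, since the postmaster must not touch shared memory):

/*
 * Hypothetical sketch, not actual PostgreSQL code: refuse a smart
 * shutdown request while synchronous replication is configured but
 * no synchronous standby is connected, so that the master cannot
 * enter a state in which a standby can no longer connect.
 */
static bool
SmartShutdownAllowed(void)
{
	if (SyncRepRequested() && !SyncStandbyIsConnected()) /* assumed helper */
	{
		ereport(LOG,
				(errmsg("refusing smart shutdown: no synchronous standby is connected")));
		return false;
	}
	return true;
}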

regards,
Yeb Havinga

#3 Magnus Hagander
magnus@hagander.net
In reply to: Fujii Masao (#1)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Wed, Mar 9, 2011 at 08:38, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Mar 9, 2011 at 2:14 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

On Tue, Mar 8, 2011 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The fast shutdown handling seems fine, but why not just handle smart
shutdown the same way?

Currently, smart shutdown means no new connections; wait until
existing ones close normally. For consistency, it should behave the
same for sync rep.

Agreed. I think that a user who requests smart shutdown expects all
the existing connections to basically be closed normally by the client. So it
doesn't seem to be a good idea to forcibly close the connection and prevent
the COMMIT from being returned in the smart shutdown case. But I'm all ears
for better suggestions.

"don't use smart shutdowns"? ;)

Anyway, for those that *do* use smart intentionally, I agree that
doing any kind of forced close at all is just plain wrong.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#4 Simon Riggs
simon@2ndQuadrant.com
In reply to: Fujii Masao (#1)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Wed, 2011-03-09 at 16:38 +0900, Fujii Masao wrote:

On Wed, Mar 9, 2011 at 2:14 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

On Tue, Mar 8, 2011 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The fast shutdown handling seems fine, but why not just handle smart
shutdown the same way?

Currently, smart shutdown means no new connections; wait until
existing ones close normally. For consistency, it should behave the
same for sync rep.

Agreed. I think that a user who requests smart shutdown expects all
the existing connections to basically be closed normally by the client. So it
doesn't seem to be a good idea to forcibly close the connection and prevent
the COMMIT from being returned in the smart shutdown case. But I'm all ears
for better suggestions.

Anyway, we reached consensus about how fast shutdown should work with
sync rep, so I created the patch. Please feel free to comment on and
commit the patch first ;)

We're just about to publish Alpha4 with this feature in.

If we release waiters too early, we will cause effective data loss; that
part is agreed. We've also accepted that there are a few ways to release
the waiters.

I want to release the first version as "safe" and then relax from there
after feedback. What I don't want to hear is lots of complaints or
arguments about data loss from the first version. We can relax later,
after some time and thought.

So for now, I don't want to apply this patch or other similar ones that
seek to release waiters in various circumstances. That isn't a rejection
of you, it's just a wish to play it safe and slowly.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

#5 Yeb Havinga
yebhavinga@gmail.com
In reply to: Simon Riggs (#4)
Re: Sync Rep and shutdown Re: Sync Rep v19

On 2011-03-09 15:10, Simon Riggs wrote:

On Wed, 2011-03-09 at 16:38 +0900, Fujii Masao wrote:

On Wed, Mar 9, 2011 at 2:14 PM, Jaime Casanova <jaime@2ndquadrant.com> wrote:

On Tue, Mar 8, 2011 at 11:58 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The fast shutdown handling seems fine, but why not just handle smart
shutdown the same way?

Currently, smart shutdown means no new connections; wait until
existing ones close normally. For consistency, it should behave the
same for sync rep.

Agreed. I think that a user who requests smart shutdown expects all
the existing connections to basically be closed normally by the client. So it
doesn't seem to be a good idea to forcibly close the connection and prevent
the COMMIT from being returned in the smart shutdown case. But I'm all ears
for better suggestions.

Anyway, we reached consensus about how fast shutdown should work with
sync rep, so I created the patch. Please feel free to comment on and
commit the patch first ;)

We're just about to publish Alpha4 with this feature in.

If we release waiters too early, we will cause effective data loss; that
part is agreed. We've also accepted that there are a few ways to release
the waiters.

I want to release the first version as "safe" and then relax from there
after feedback.

This is not safe, and yet it is possible in the first version:

1) issue stop on master when no sync standby is connected:
mgrid@mg73:~$ pg_ctl -D /data stop
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down

2) start the standby that failed
mgrid@mg72:~$ pg_ctl -D /data start
pg_ctl: another server might be running; trying to start server anyway
LOG: 00000: database system was interrupted while in recovery at log
time 2011-03-09 15:22:31 CET
HINT: If this has occurred more than once some data might be corrupted
and you might need to choose an earlier recovery target.
LOG: 00000: entering standby mode
LOG: 00000: redo starts at 57/1A000078
LOG: 00000: consistent recovery state reached at 57/1A0000A0
FATAL: XX000: could not connect to the primary server: FATAL: the
database system is shutting down

LOCATION: libpqrcv_connect, libpqwalreceiver.c:102
server starting
mgrid@mg72:~$ FATAL: XX000: could not connect to the primary server:
FATAL: the database system is shutting down

A safe solution would be to prevent smart shutdown on the master if it
is in sync mode and there are no sync standbys connected.

The current situation is definitely unsafe because it forces people who
are in this state to do a fast shutdown... but that fails as well, so
they are only left with immediate.

mgrid@mg73:~$ pg_ctl -D /data stop -m fast
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
mgrid@mg73:~$ pg_ctl -D /data stop -m immediate
waiting for server to shut down.... done
server stopped

regards,
Yeb Havinga

#6 Simon Riggs
simon@2ndQuadrant.com
In reply to: Yeb Havinga (#5)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Wed, 2011-03-09 at 15:37 +0100, Yeb Havinga wrote:

The current situation is definitely unsafe because it forces people
who are in this state to do a fast shutdown... but that fails as well,
so they are only left with immediate.

All the more reason not to change anything, since we disagree.

The idea is that you're supposed to wait for the standby to come back up
or do failover. If you shut down the master, it's because you are choosing
to fail over.

Shutting down the master and restarting without failover leads to a
situation where some sync rep commits are not on both master and
standby. Making it easier to shut down encourages that, which I don't
wish to do at this stage.

We may decide that this is the right approach but I don't wish to rush
into that decision. I want to have clear agreement about all the changes
we want and what we call them if we do them. Zero data loss is
ultimately about users having confidence in us, not about specific
features. Our disagreements on this patch risk damaging that confidence,
whoever is right/wrong.

Further changes can be made over the course of the next few weeks, based
upon feedback from a wider pool of potential users.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

#7 Fujii Masao
masao.fujii@gmail.com
In reply to: Simon Riggs (#6)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Thu, Mar 10, 2011 at 12:03 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, 2011-03-09 at 15:37 +0100, Yeb Havinga wrote:

The current situation is definitely unsafe because it forces people
who are in this state to do a fast shutdown... but that fails as well,
so they are only left with immediate.

I agree with Yeb.

All the more reason not to change anything, since we disagree.

The idea is that you're supposed to wait for the standby to come back up
or do failover. If you shut down the master, it's because you are choosing
to fail over.

Shutting down the master and restarting without failover leads to a
situation where some sync rep commits are not on both master and
standby. Making it easier to shut down encourages that, which I don't
wish to do at this stage.

I'm not sure I follow you. The proposed fast shutdown prevents the
backends which have not yet received the ACK from the standby
from returning "success" to the client. So even after restarting
the server, there is no data loss from the client's point of view. If this is
really unsafe, we *must* forbid immediate shutdown while a backend
is waiting for sync rep, because immediate shutdown creates the
same situation.

What scenario are you concerned about?

We may decide that this is the right approach but I don't wish to rush
into that decision. I want to have clear agreement about all the changes
we want and what we call them if we do them. Zero data loss is
ultimately about users having confidence in us, not about specific
features. Our disagreements on this patch risk damaging that confidence,
whoever is right/wrong.

Same as above. I think that it's more problematic to leave the code
as it is, because smart/fast shutdown can make the server get stuck
until immediate shutdown is requested.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#8 Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#7)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Wed, Mar 9, 2011 at 11:11 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Same as above. I think that it's more problematic to leave the code
as it is, because smart/fast shutdown can make the server get stuck
until immediate shutdown is requested.

I agree that the current state of affairs is a problem. However,
after looking through the code somewhat carefully, it looks a bit
difficult to fix. Suppose that backend A is waiting for sync rep. A
fast shutdown is performed. Right now, backend A shrugs its shoulders
and does nothing. Not good. But suppose we change it so that backend
A closes the connection and exits without either confirming the commit
or throwing ERROR/FATAL. That seems like correct behavior, since, if
we weren't using sync rep, the client would have to interpret that as
indicating that the connection died in mid-COMMIT, and mustn't
assume anything about the state of the transaction. So far so good.

The problem is that there may be another backend B waiting on a lock
held by A. If backend A exits cleanly (without a PANIC), it will
remove itself from the ProcArray and release locks. That wakes up A,
which can now go do its thing. If the operating system is a bit on
the slow side delivering the signal to B, then the client to which B
is connected might manage to see a database state that shows the
transaction previous running in A as committed, even though that
transaction wasn't committed. That would stink, because the whole
point of having A hold onto locks until the standby ack'd the commit
was that no other transaction would see it as committed until it was
replicated.

This is a pretty unlikely race condition in practice but people who
are running sync rep are intending precisely to guard against unlikely
failure scenarios.

The only idea I have for allowing fast shutdown to still be fast, even
when sync rep is involved, is to shut down the system in two phases.
The postmaster would need to stop accepting new connections, and first
kill off all the backends that aren't waiting for sync rep. Then,
once all remaining backends are waiting for sync rep, we can have them
proceed as above: close the connection without acking the commit or
throwing ERROR/FATAL, and exit. That's pretty complicated, especially
given the rule that the postmaster mustn't touch shared memory, but I
don't see any alternative. We could just not allow fast shutdown, as
now, but I think that's worse.
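
As a rough sketch, the two-phase idea might look something like this
(purely illustrative, not actual PostgreSQL code: BackendIsSyncRepWaiting()
is an assumed helper, and as noted the postmaster must not touch shared
memory, which is exactly what such a test would need):

static void
FastShutdownTwoPhase(void)
{
	Dlelem	   *curr;
	int			nonwaiters = 0;

	/* Phase 1: SIGTERM every backend that is NOT waiting for sync rep. */
	for (curr = DLGetHead(BackendList); curr; curr = DLGetSucc(curr))
	{
		Backend    *bp = (Backend *) DLE_VAL(curr);

		if (BackendIsSyncRepWaiting(bp->pid))	/* assumed helper */
			continue;			/* leave sync rep waiters alone for now */
		signal_child(bp->pid, SIGTERM);
		nonwaiters++;
	}

	/*
	 * Phase 2: once only sync rep waiters remain, SIGTERM those too; they
	 * then close their connections without acking the commit and without
	 * raising ERROR/FATAL, as described above.
	 */
	if (nonwaiters == 0)
		SignalChildren(SIGTERM);
}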

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Fujii Masao
masao.fujii@gmail.com
In reply to: Robert Haas (#8)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Wed, Mar 16, 2011 at 11:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The problem is that there may be another backend B waiting on a lock
held by A.  If backend A exits cleanly (without a PANIC), it will
remove itself from the ProcArray and release locks.  That wakes up B,
which can now go do its thing.  If the operating system is a bit on
the slow side delivering the signal to B, then the client to which B
is connected might manage to see a database state that shows the
transaction previously running in A as committed, even though that
transaction wasn't committed.  That would stink, because the whole
point of having A hold onto locks until the standby ack'd the commit
was that no other transaction would see it as committed until it was
replicated.

The lock can also be released when the transaction running in A is
rolled back. So I don't understand why the client would always wrongly
see the transaction as committed even though it's not committed.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#10 Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#8)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Tue, 2011-03-15 at 22:07 -0400, Robert Haas wrote:

On Wed, Mar 9, 2011 at 11:11 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Same as above. I think that it's more problematic to leave the code
as it is, because smart/fast shutdown can make the server get stuck
until immediate shutdown is requested.

I agree that the current state of affairs is a problem. However,
after looking through the code somewhat carefully, it looks a bit
difficult to fix. Suppose that backend A is waiting for sync rep. A
fast shutdown is performed. Right now, backend A shrugs its shoulders
and does nothing. Not good. But suppose we change it so that backend
A closes the connection and exits without either confirming the commit
or throwing ERROR/FATAL. That seems like correct behavior, since, if
we weren't using sync rep, the client would have to interpret that as
indicating that the connection died in mid-COMMIT, and mustn't
assume anything about the state of the transaction. So far so good.

The problem is that there may be another backend B waiting on a lock
held by A. If backend A exits cleanly (without a PANIC), it will
remove itself from the ProcArray and release locks. That wakes up B,
which can now go do its thing. If the operating system is a bit on
the slow side delivering the signal to B, then the client to which B
is connected might manage to see a database state that shows the
transaction previously running in A as committed, even though that
transaction wasn't committed. That would stink, because the whole
point of having A hold onto locks until the standby ack'd the commit
was that no other transaction would see it as committed until it was
replicated.

This is a pretty unlikely race condition in practice but people who
are running sync rep are intending precisely to guard against unlikely
failure scenarios.

The only idea I have for allowing fast shutdown to still be fast, even
when sync rep is involved, is to shut down the system in two phases.
The postmaster would need to stop accepting new connections, and first
kill off all the backends that aren't waiting for sync rep. Then,
once all remaining backends are waiting for sync rep, we can have them
proceed as above: close the connection without acking the commit or
throwing ERROR/FATAL, and exit. That's pretty complicated, especially
given the rule that the postmaster mustn't touch shared memory, but I
don't see any alternative.

We could just not allow fast shutdown, as
now, but I think that's worse.

Please explain why not allowing fast shutdown makes it worse?

For me, I'd rather not support a whole bunch of dubious code, just to
allow you to type -m fast when you can already type -m immediate.

What extra capability are we actually delivering by doing that??
The risk of introducing a bug and thereby losing data far outweighs the
rather dubious benefit.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

#11 Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#9)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Wed, Mar 16, 2011 at 1:43 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Mar 16, 2011 at 11:07 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The problem is that there may be another backend B waiting on a lock
held by A.  If backend A exits cleanly (without a PANIC), it will
remove itself from the ProcArray and release locks.  That wakes up A,
which can now go do its thing.  If the operating system is a bit on
the slow side delivering the signal to B, then the client to which B
is connected might manage to see a database state that shows the
transaction previous running in A as committed, even though that
transaction wasn't committed.  That would stink, because the whole
point of having A hold onto locks until the standby ack'd the commit
was that no other transaction would see it as committed until it was
replicated.

The lock can also be released when the transaction running in A is
rolled back. So I don't understand why the client would always wrongly
see the transaction as committed even though it's not committed.

The transaction IS committed, but only locally.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#12 Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#10)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Wed, Mar 16, 2011 at 4:51 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, 2011-03-15 at 22:07 -0400, Robert Haas wrote:

On Wed, Mar 9, 2011 at 11:11 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Same as above. I think that it's more problematic to leave the code
as it is, because smart/fast shutdown can make the server get stuck
until immediate shutdown is requested.

I agree that the current state of affairs is a problem.  However,
after looking through the code somewhat carefully, it looks a bit
difficult to fix.  Suppose that backend A is waiting for sync rep.  A
fast shutdown is performed.  Right now, backend A shrugs its shoulders
and does nothing.  Not good.  But suppose we change it so that backend
A closes the connection and exits without either confirming the commit
or throwing ERROR/FATAL.  That seems like correct behavior, since, if
we weren't using sync rep, the client would have to interpret that as
indicating that the connection died in mid-COMMIT, and mustn't
assume anything about the state of the transaction.  So far so good.

The problem is that there may be another backend B waiting on a lock
held by A.  If backend A exits cleanly (without a PANIC), it will
remove itself from the ProcArray and release locks.  That wakes up B,
which can now go do its thing.  If the operating system is a bit on
the slow side delivering the signal to B, then the client to which B
is connected might manage to see a database state that shows the
transaction previously running in A as committed, even though that
transaction wasn't committed.  That would stink, because the whole
point of having A hold onto locks until the standby ack'd the commit
was that no other transaction would see it as committed until it was
replicated.

This is a pretty unlikely race condition in practice but people who
are running sync rep are intending precisely to guard against unlikely
failure scenarios.

The only idea I have for allowing fast shutdown to still be fast, even
when sync rep is involved, is to shut down the system in two phases.
The postmaster would need to stop accepting new connections, and first
kill off all the backends that aren't waiting for sync rep.  Then,
once all remaining backends are waiting for sync rep, we can have them
proceed as above: close the connection without acking the commit or
throwing ERROR/FATAL, and exit.  That's pretty complicated, especially
given the rule that the postmaster mustn't touch shared memory, but I
don't see any alternative.

We could just not allow fast shutdown, as
now, but I think that's worse.

Please explain why not allowing fast shutdown makes it worse?

For me, I'd rather not support a whole bunch of dubious code, just to
allow you to type -m fast when you can already type -m immediate.

What extra capability are we actually delivering by doing that??
The risk of introducing a bug and thereby losing data far outweighs the
rather dubious benefit.

Well, my belief is that when users ask the database to shut down, they
want it to work. If I'm the only one who thinks that, then whatever.
But I firmly believe we'll get bug reports about this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13 Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#12)
1 attachment(s)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Wed, Mar 16, 2011 at 7:39 AM, Robert Haas <robertmhaas@gmail.com> wrote:

The only idea I have for allowing fast shutdown to still be fast, even
when sync rep is involved, is to shut down the system in two phases.
The postmaster would need to stop accepting new connections, and first
kill off all the backends that aren't waiting for sync rep.  Then,
once all remaining backends are waiting for sync rep, we can have them
proceed as above: close the connection without acking the commit or
throwing ERROR/FATAL, and exit.  That's pretty complicated, especially
given the rule that the postmaster mustn't touch shared memory, but I
don't see any alternative.

What extra capability are we actually delivering by doing that??
The risk of introducing a bug and thereby losing data far outweighs the
rather dubious benefit.

Well, my belief is that when users ask the database to shut down, they
want it to work.  If I'm the only one who thinks that, then whatever.
But I firmly believe we'll get bug reports about this.

On further review, the approach proposed above doesn't really work,
because a backend can get a SIGTERM either because the system is doing
a fast shutdown or because a user has issued
pg_terminate_backend(PID); and in the latter case we have to continue
letting in connections.

As of right now, synchronous replication continues to wait even when:

- someone tries to perform a fast shutdown
- someone tries to kill the backend using pg_terminate_backend()
- someone attempts to cancel the query using pg_cancel_backend() or by
pressing control-C in, for example, psql
- someone attempts to shut off synchronous replication by changing
synchronous_standby_names in postgresql.conf and issuing pg_ctl reload

We've worked pretty hard to ensure that things like query cancel and
shutdown work quickly and reliably, and I don't think we want to make
synchronous replication the one part of the system that departs from
that general principle.
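
For reference, the terminate and cancel cases above correspond to these
calls (the PID is illustrative):

SELECT pg_terminate_backend(12345);  -- sends SIGTERM, the "die" interrupt
SELECT pg_cancel_backend(12345);     -- sends SIGINT, the query cancel interrupt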

So, patch attached. This patch arranges to do the following things:

1. If a die interrupt is received (pg_terminate_backend or fast
shutdown), then terminate the sync rep wait and arrange for the
connection to be closed without acknowledging the commit (but do send
a warning message back). The commit still happened, though, so other
transactions will see its effects. This is unavoidable unless we're
willing to either ignore attempts to terminate a backend waiting for
sync rep, or panic the system when it happens, and I don't think
either of those is appropriate. (A hypothetical client-side view of
this case is sketched just after this list.)

2. If a query cancel interrupt is received (pg_cancel_backend or ^C),
then cancel the sync rep wait and issue a warning before acknowledging
the commit. Again, the alternative is to either ignore the cancel or
panic, neither of which I believe to be what users will want.

3. If synchronous_standby_names is changed to '' by editing
postgresql.conf and issuing pg_ctl reload, then cancel all waits in
progress and wake everybody up. As I mentioned before, reloading the
config file from within the waiting backend (which can't safely throw
an error) seems risky, so what I did instead was to make WAL writer
responsible for handling this. Nobody's allowed to wait for sync rep
unless a global shared memory flag is set, and the WAL writer process
is responsible for setting and clearing this flag when the config file
is reloaded. This has basically no performance cost; WAL writer only
ever does any extra work at all with this code when it receives a
SIGHUP, and even then the work is trivial except in the case where
synchronous_standby_names has changed from empty to non-empty or vice
versa. The advantage of putting this in WAL writer rather than, say,
bgwriter is that WAL writer doesn't have nearly as many jobs to do and
they don't involve nearly as much I/O, so the chances of a long delay
due to the process being busy are much less.

4. Remove the SYNC_REP_MUST_DISCONNECT state, which actually does
absolutely nothing right now, despite what the name would seem to
imply. In particular, it doesn't arrange for any sort of disconnect.
This patch does arrange for that, but not using this mechanism.

5. The existing code relies on being able to read MyProc->syncRepState
without holding the lock, even while a WAL sender must be updating it
in another process. I'm not 100% sure this is safe on a
multi-processor machine with weak memory ordering. In practice, the
chances of something going wrong here seem extremely small. You'd
need something like this: a WAL sender updates MyProc->syncRepState
just after the wait timeout expires and before the latch is reset, but
the regular backend fails to see the state change due to
memory-ordering effects and drops through the loop, waiting another 60
s, and then finally wakes up and completes the wait (but a minute
later than expected). That seems vanishingly unlikely but it's also
simple to protect against, so I did.
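
For point 1 above, a hypothetical client-side view (the WARNING and DETAIL
strings are the ones from the patch below; the surrounding psql output is
illustrative):

postgres=# COMMIT;
WARNING:  canceling the wait for replication and terminating connection due to administrator command
DETAIL:  The transaction has already been committed locally but might have not been replicated to the standby.
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.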

Review appreciated.

Thanks,

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

sync-rep-wait-fixes.patch (application/octet-stream)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 2098,2105 **** SET ENABLE_SEQSCAN TO OFF;
          If a standby is removed from the list of servers then it will stop
          being the synchronous standby, allowing another to take its place.
          If the list is empty, synchronous replication will not be
!         possible, whatever the setting of <varname>synchronous_replication</>,
!         however, already waiting commits will continue to wait.
          Standbys may also be added to the list without restarting the server.
         </para>
        </listitem>
--- 2098,2104 ----
          If a standby is removed from the list of servers then it will stop
          being the synchronous standby, allowing another to take its place.
          If the list is empty, synchronous replication will not be
!         possible, whatever the setting of <varname>synchronous_replication</>.
          Standbys may also be added to the list without restarting the server.
         </para>
        </listitem>
*** a/src/backend/postmaster/walwriter.c
--- b/src/backend/postmaster/walwriter.c
***************
*** 49,54 ****
--- 49,55 ----
  #include "libpq/pqsignal.h"
  #include "miscadmin.h"
  #include "postmaster/walwriter.h"
+ #include "replication/syncrep.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
***************
*** 216,221 **** WalWriterMain(void)
--- 217,225 ----
  	 */
  	PG_SETMASK(&UnBlockSig);
  
+ 	/* Do this once before starting the loop, then just at SIGHUP time. */
+ 	SyncRepUpdateSyncStandbysDefined();
+ 
  	/*
  	 * Loop forever
  	 */
***************
*** 237,242 **** WalWriterMain(void)
--- 241,248 ----
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
+ 			/* update global shmem state for sync rep */
+ 			SyncRepUpdateSyncStandbysDefined();
  		}
  		if (shutdown_requested)
  		{
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 55,60 ****
--- 55,61 ----
  #include "storage/ipc.h"
  #include "storage/pmsignal.h"
  #include "storage/proc.h"
+ #include "tcop/tcopprot.h"
  #include "utils/builtins.h"
  #include "utils/guc.h"
  #include "utils/guc_tables.h"
***************
*** 65,71 ****
  bool	synchronous_replication = false;		/* Only set in user backends */
  char 	*SyncRepStandbyNames;
  
! static bool sync_standbys_defined = false;	/* Is there at least one name? */
  static bool announce_next_takeover = true;
  
  static void SyncRepQueueInsert(void);
--- 66,74 ----
  bool	synchronous_replication = false;		/* Only set in user backends */
  char 	*SyncRepStandbyNames;
  
! #define SyncStandbysDefined() \
! 	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
! 
  static bool announce_next_takeover = true;
  
  static void SyncRepQueueInsert(void);
***************
*** 83,88 **** static bool SyncRepQueueIsOrderedByLSN(void);
--- 86,98 ----
  
  /*
   * Wait for synchronous replication, if requested by user.
+  *
+  * Initially backends start in state SYNC_REP_NOT_WAITING and then
+  * change that state to SYNC_REP_WAITING before adding ourselves
+  * to the wait queue. During SyncRepWakeQueue() a WALSender changes
+  * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
+  * This backend then resets its state to SYNC_REP_NOT_WAITING when
+  * we exit normally, or SYNC_REP_MUST_DISCONNECT in abnormal cases.
   */
  void
  SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
***************
*** 95,104 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
  	 * there are no sync replication standby names defined.
  	 * Note that those standbys don't need to be connected.
  	 */
! 	if (!SyncRepRequested() || !sync_standbys_defined)
  		return;
  
  	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
  
  	/*
  	 * Wait for specified LSN to be confirmed.
--- 105,153 ----
  	 * there are no sync replication standby names defined.
  	 * Note that those standbys don't need to be connected.
  	 */
! 	if (!SyncRepRequested() || !SyncStandbysDefined())
  		return;
  
  	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+ 	Assert(WalSndCtl != NULL);
+ 
+ 	/* Reset the latch before adding ourselves to the queue. */
+ 	ResetLatch(&MyProc->waitLatch);
+ 
+ 	/*
+ 	 * Set our waitLSN so WALSender will know when to wake us, and add
+ 	 * ourselves to the queue.
+ 	 */
+ 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ 	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
+ 	if (!WalSndCtl->sync_standbys_defined)
+ 	{
+ 		/*
+ 		 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is
+ 		 * not set.  See SyncRepUpdateSyncStandbysDefined.
+ 		 */
+ 		LWLockRelease(SyncRepLock);
+ 		return;
+ 	}
+ 	MyProc->waitLSN = XactCommitLSN;
+ 	MyProc->syncRepState = SYNC_REP_WAITING;
+ 	SyncRepQueueInsert();
+ 	Assert(SyncRepQueueIsOrderedByLSN());
+ 	LWLockRelease(SyncRepLock);
+ 
+ 	/* Alter ps display to show waiting for sync rep. */
+ 	if (update_process_title)
+ 	{
+ 		int			len;
+ 
+ 		old_status = get_ps_display(&len);
+ 		new_status = (char *) palloc(len + 32 + 1);
+ 		memcpy(new_status, old_status, len);
+ 		sprintf(new_status + len, " waiting for %X/%X",
+ 				XactCommitLSN.xlogid, XactCommitLSN.xrecoff);
+ 		set_ps_display(new_status, false);
+ 		new_status[len] = '\0'; /* truncate off " waiting ..." */
+ 	}
  
  	/*
  	 * Wait for specified LSN to be confirmed.
***************
*** 108,224 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
  	 */
  	for (;;)
  	{
  		ResetLatch(&MyProc->waitLatch);
  
  		/*
! 		 * Synchronous Replication state machine within user backend
! 		 *
! 		 * Initially backends start in state SYNC_REP_NOT_WAITING and then
! 		 * change that state to SYNC_REP_WAITING before adding ourselves
! 		 * to the wait queue. During SyncRepWakeQueue() a WALSender changes
! 		 * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
! 		 * This backend then resets its state to SYNC_REP_NOT_WAITING when
! 		 * we exit normally, or SYNC_REP_MUST_DISCONNECT in abnormal cases.
! 		 *
! 		 * We read MyProc->syncRepState without SyncRepLock, which
! 		 * assumes that read access is atomic.
  		 */
! 		switch (MyProc->syncRepState)
  		{
! 			case SYNC_REP_NOT_WAITING:
! 				/*
! 				 * Set our waitLSN so WALSender will know when to wake us.
! 				 * We set this before we add ourselves to queue, so that
! 				 * any proc on the queue can be examined freely without
! 				 * taking a lock on each process in the queue, as long as
! 				 * they hold SyncRepLock.
! 				 */
! 				MyProc->waitLSN = XactCommitLSN;
! 				MyProc->syncRepState = SYNC_REP_WAITING;
! 
! 				/*
! 				 * Add to queue while holding lock.
! 				 */
! 				LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! 				SyncRepQueueInsert();
! 				Assert(SyncRepQueueIsOrderedByLSN());
! 				LWLockRelease(SyncRepLock);
! 
! 				/*
! 				 * Alter ps display to show waiting for sync rep.
! 				 */
! 				if (update_process_title)
! 				{
! 					int			len;
! 
! 					old_status = get_ps_display(&len);
! 					new_status = (char *) palloc(len + 32 + 1);
! 					memcpy(new_status, old_status, len);
! 					sprintf(new_status + len, " waiting for %X/%X",
! 						 XactCommitLSN.xlogid, XactCommitLSN.xrecoff);
! 					set_ps_display(new_status, false);
! 					new_status[len] = '\0'; /* truncate off " waiting ..." */
! 				}
! 
! 				break;
! 
! 			case SYNC_REP_WAITING:
! 				/*
! 				 * Check for conditions that would cause us to leave the
! 				 * wait state before the LSN has been reached.
! 				 */
! 				if (!PostmasterIsAlive(true))
! 				{
! 					LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! 					SHMQueueDelete(&(MyProc->syncRepLinks));
! 					LWLockRelease(SyncRepLock);
! 
! 					MyProc->syncRepState = SYNC_REP_MUST_DISCONNECT;
! 					return;
! 				}
! 
! 				/*
! 				 * We don't receive SIGHUPs at this point, so resetting
! 				 * synchronous_standby_names has no effect on waiters.
! 				 */
! 
! 				/* Continue waiting */
! 
! 				break;
! 
! 			case SYNC_REP_WAIT_COMPLETE:
! 				/*
! 				 * WalSender has checked our LSN and has removed us from
! 				 * queue. Cleanup local state and leave.
! 				 */
! 				Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
! 
! 				MyProc->syncRepState = SYNC_REP_NOT_WAITING;
! 				MyProc->waitLSN.xlogid = 0;
! 				MyProc->waitLSN.xrecoff = 0;
! 
! 				if (new_status)
! 				{
! 					/* Reset ps display */
! 					set_ps_display(new_status, false);
! 					pfree(new_status);
! 				}
! 
! 				return;
! 
! 			case SYNC_REP_MUST_DISCONNECT:
! 				return;
! 
! 			default:
! 				elog(FATAL, "invalid syncRepState");
  		}
  
  		/*
! 		 * Wait on latch for up to 60 seconds. This allows us to
! 		 * check for postmaster death regularly while waiting.
! 		 * Note that timeout here does not necessarily release from loop.
  		 */
! 		WaitLatch(&MyProc->waitLatch, 60000000L);
  	}
  }
  
--- 157,265 ----
  	 */
  	for (;;)
  	{
+ 		int syncRepState;
+ 
+ 		/*
+ 		 * Wait on latch for up to 60 seconds. This allows us to
+ 		 * check for postmaster death regularly while waiting.
+ 		 * Note that timeout here does not necessarily release from loop.
+ 		 */
+ 		WaitLatch(&MyProc->waitLatch, 60000000L);
+ 
+ 		/* Must reset the latch before testing state. */
  		ResetLatch(&MyProc->waitLatch);
  
  		/*
! 		 * Try checking the state without the lock first.  There's no guarantee
! 		 * that we'll read the most up-to-date value, so if it looks like we're
! 		 * still waiting, recheck while holding the lock.  But if it looks like
! 		 * we're done, we must really be done, because once walsender changes
! 		 * the state to SYNC_REP_WAIT_COMPLETE, it will never update it again,
! 		 * so we can't be seeing a stale value in that case.
  		 */
! 		syncRepState = MyProc->syncRepState;
! 		if (syncRepState == SYNC_REP_WAITING)
  		{
! 			LWLockAcquire(SyncRepLock, LW_SHARED);
! 			syncRepState = MyProc->syncRepState;
! 			LWLockRelease(SyncRepLock);
  		}
+ 		if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+ 			break;
  
  		/*
! 		 * If a wait for synchronous replication is pending, we can neither
! 		 * acknowledge the commit nor raise ERROR or FATAL.  The latter
! 		 * would lead the client to believe that that the transaction
! 		 * aborted, which is not true: it's already committed locally.
! 		 * The former is no good either: the client has requested
! 		 * synchronous replication, and is entitled to assume that an
! 		 * acknowledged commit is also replicated, which may not be true.
! 		 * So in this case we issue a NOTICE (which some clients may
! 		 * be able to interpret) and shut off further output.  We do NOT
! 		 * reset ProcDiePending, so that the process will die after the
! 		 * commit is cleaned up.
  		 */
! 		if (ProcDiePending)
! 		{
! 			ereport(WARNING,
! 					(errcode(ERRCODE_ADMIN_SHUTDOWN),
! 					 errmsg("canceling the wait for replication and terminating connection due to administrator command"),
! 					 errdetail("The transaction has already been committed locally but might have not been replicated to the standby.")));
! 			whereToSendOutput = DestNone;
! 			LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! 			MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
! 			SHMQueueDelete(&(MyProc->syncRepLinks));
! 			LWLockRelease(SyncRepLock);
! 			break;
! 		}
! 
! 		/*
! 		 * It's unclear what to do if a query cancel interrupt arrives.  We
! 		 * can't actually abort at this point, but ignoring the interrupt
! 		 * altogether is not helpful, so we just terminate the wait with
! 		 * a suitable warning.
! 		 */
! 		if (QueryCancelPending)
! 		{
! 			QueryCancelPending = false;
! 			ereport(WARNING,
! 					(errmsg("canceling wait for synchronous replication due to user request"),
! 					 errdetail("The transaction has committed locally, but may not have replicated to the standby.")));
! 			LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
! 			MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
! 			SHMQueueDelete(&(MyProc->syncRepLinks));
! 			LWLockRelease(SyncRepLock);
! 			break;
! 		}
! 
! 		/*
! 		 * If the postmaster dies, we'll probably never get an acknowledgement,
! 		 * because all the wal sender processes will exit.  So just bail out.
! 		 */
! 		if (!PostmasterIsAlive(true))
! 		{
! 			whereToSendOutput = DestNone;
! 			proc_exit(1);
! 		}
! 	}
! 
! 	/*
! 	 * WalSender has checked our LSN and has removed us from queue. Clean up
! 	 * state and leave.  It's OK to reset these shared memory fields without
! 	 * holding SyncRepLock, because any walsenders will ignore us anyway when
! 	 * we're not on the queue.
! 	 */
! 	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
! 	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
! 	MyProc->waitLSN.xlogid = 0;
! 	MyProc->waitLSN.xrecoff = 0;
! 
! 	if (new_status)
! 	{
! 		/* Reset ps display */
! 		set_ps_display(new_status, false);
! 		pfree(new_status);
  	}
  }
  
***************
*** 506,511 **** SyncRepWakeQueue(bool all)
--- 547,589 ----
  	return numprocs;
  }
  
+ /*
+  * WAL writer calls this as needed to update the shared sync_standbys_needed
+  * flag, so that backends don't remain permanently wedged if
+  * synchronous_standby_names is unset.  It's safe to check the current value
+  * without the lock, because it's only ever updated by one process.  But we
+  * must take the lock to change it.
+  */
+ void
+ SyncRepUpdateSyncStandbysDefined(void)
+ {
+ 	bool		sync_standbys_defined = SyncStandbysDefined();
+ 
+ 	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
+ 	{
+ 		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+ 
+ 		/*
+ 		 * If synchronous_standby_names has been reset to empty, it's futile
+ 		 * for backends to continue to waiting.  Since the user no longer
+ 		 * wants synchronous replication, we'd better wake them up.
+ 		 */
+ 		if (!sync_standbys_defined)
+ 			SyncRepWakeQueue(true);
+ 
+ 		/*
+ 		 * Only allow people to join the queue when there are synchronous
+ 		 * standbys defined.  Without this interlock, there's a race
+ 		 * condition: we might wake up all the current waiters; then, some
+ 		 * backend that hasn't yet reloaded its config might go to sleep on
+ 		 * the queue (and never wake up).  This prevents that.
+ 		 */
+ 		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
+ 
+ 		LWLockRelease(SyncRepLock);
+ 	}
+ }
+ 
  #ifdef USE_ASSERT_CHECKING
  static bool
  SyncRepQueueIsOrderedByLSN(void)
***************
*** 568,580 **** assign_synchronous_standby_names(const char *newval, bool doit, GucSource source
  	}
  
  	/*
- 	 * Is there at least one sync standby? If so cache this knowledge to
- 	 * improve performance of SyncRepWaitForLSN() for all-async configs.
- 	 */
- 	if (doit && list_length(elemlist) > 0)
- 		sync_standbys_defined = true;
- 
- 	/*
  	 * Any additional validation of standby names should go here.
  	 *
  	 * Don't attempt to set WALSender priority because this is executed by
--- 646,651 ----
*** a/src/backend/tcop/postgres.c
--- b/src/backend/tcop/postgres.c
***************
*** 2643,2648 **** die(SIGNAL_ARGS)
--- 2643,2651 ----
  			InterruptHoldoffCount--;
  			ProcessInterrupts();
  		}
+ 
+ 		/* Interrupt any sync rep wait which is currently in progress. */
+ 		SetLatch(&(MyProc->waitLatch));
  	}
  
  	errno = save_errno;
***************
*** 2681,2686 **** StatementCancelHandler(SIGNAL_ARGS)
--- 2684,2692 ----
  			InterruptHoldoffCount--;
  			ProcessInterrupts();
  		}
+ 
+ 		/* Interrupt any sync rep wait which is currently in progress. */
+ 		SetLatch(&(MyProc->waitLatch));
  	}
  
  	errno = save_errno;
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 26,32 ****
  #define SYNC_REP_NOT_WAITING		0
  #define SYNC_REP_WAITING			1
  #define SYNC_REP_WAIT_COMPLETE		2
- #define SYNC_REP_MUST_DISCONNECT	3
  
  /* user-settable parameters for synchronous replication */
  extern bool synchronous_replication;
--- 26,31 ----
***************
*** 42,47 **** extern void SyncRepCleanupAtProcExit(int code, Datum arg);
--- 41,49 ----
  extern void SyncRepInitConfig(void);
  extern void SyncRepReleaseWaiters(void);
  
+ /* called by wal writer */
+ extern void SyncRepUpdateSyncStandbysDefined(void);
+ 
  /* called by various procs */
  extern int SyncRepWakeQueue(bool all);
  extern const char *assign_synchronous_standby_names(const char *newval, bool doit, GucSource source);
*** a/src/include/replication/walsender.h
--- b/src/include/replication/walsender.h
***************
*** 78,83 **** typedef struct
--- 78,90 ----
  	 */
  	XLogRecPtr	lsn;
  
+ 	/*
+ 	 * Are any sync standbys defined?  Waiting backends can't reload the
+ 	 * config file safely, so WAL writer updates this value as needed.
+ 	 * Protected by SyncRepLock.
+ 	 */
+ 	bool		sync_standbys_defined;
+ 
  	WalSnd		walsnds[1];		/* VARIABLE LENGTH ARRAY */
  } WalSndCtlData;
  
#14 Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Robert Haas (#13)
Re: Sync Rep and shutdown Re: Sync Rep v19

Robert Haas <robertmhaas@gmail.com> writes:

1. If a die interrupt is received (pg_terminate_backend or fast
shutdown), then terminate the sync rep wait and arrange for the
connection to be closed without acknowledging the commit (but do send
a warning message back). The commit still happened, though, so other
transactions will see its effects. This is unavoidable unless we're
willing to either ignore attempts to terminate a backend waiting for
sync rep, or panic the system when it happens, and I don't think
either of those is appropriate.

Is it possible to force the standby out here, so that logs show that
there was something going on wrt replication?

2. If a query cancel interrupt is received (pg_cancel_backend or ^C),
then cancel the sync rep wait and issue a warning before acknowledging
the commit. Again, the alternative is to either ignore the cancel or
panic, neither of which I believe to be what users will want.

Or force the standby to disconnect.

In both those cases what we have is a situation where either we can't
satisfy the user request or we can't continue to offer sync rep. You're
saying that we have to satisfy the user's query, so I say kick off sync
rep or it does not make any sense.

3. If synchronous_standby_names is changed to '' by editing
postgresql.conf and issuing pg_ctl reload, then cancel all waits in
progress and wake everybody up. As I mentioned before, reloading the

Ok.

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#15 Robert Haas
robertmhaas@gmail.com
In reply to: Dimitri Fontaine (#14)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Wed, Mar 16, 2011 at 6:23 PM, Dimitri Fontaine
<dimitri@2ndquadrant.fr> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

1. If a die interrupt is received (pg_terminate_backend or fast
shutdown), then terminate the sync rep wait and arrange for the
connection to be closed without acknowledging the commit (but do send
a warning message back).  The commit still happened, though, so other
transactions will see its effects.  This is unavoidable unless we're
willing to either ignore attempts to terminate a backend waiting for
sync rep, or panic the system when it happens, and I don't think
either of those is appropriate.

Is it possible to force the standby out here, so that logs show that
there was something going on wrt replication?

That's an interesting idea, but I think it might be too much spooky
action at a distance. I think we should look at getting Fujii Masao's
replication_timeout patch committed; that seems like the right way to
kick out unresponsive standbys. Another problem with doing it here is
that any ERROR will turn into a PANIC, which rules out doing anything
very complicated. Also note that we can (and do) log a WARNING, which
I think answers your concern about having something in the logs wrt
replication.

A further point is that even if we could kick out the standby, it'd
presumably reconnect after the usual 2 s interval, so it doesn't seem
like it really accomplishes much. We can't just unilaterally decide
that it is no longer allowed to be a sync standby ever again; that's
controlled by postgresql.conf.

I think the most important part of all this is that it is logged.
Anyone who is running synchronous replication should also be doing
careful monitoring; if not, shame on them, because if your data is
important enough that you need synchronous replication, it's surely
important enough to watch the logs. If you don't, all sorts of bad
things can happen to your data (either related to sync rep, or
otherwise) and you'll have no idea until it's far too late.

2. If a query cancel interrupt is received (pg_cancel_backend or ^C),
then cancel the sync rep wait and issue a warning before acknowledging
the commit.  Again, the alternative is to either ignore the cancel or
panic, neither of which I believe to be what users will want.

Or force the standby to disconnect.

In both those cases what we have is a situation where either we can't
satisfy the user request or we can't continue to offer sync rep.  You're
saying that we have to satisfy the user's query, so I say kick off sync
rep or it does not make any sense.

Same considerations here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#16 Aidan Van Dyk
aidan@highrise.ca
In reply to: Robert Haas (#15)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Wed, Mar 16, 2011 at 8:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I think the most important part of all this is that it is logged.
Anyone who is running synchronous replication should also be doing
careful monitoring; if not, shame on them, because if your data is
important enough that you need synchronous replication, it's surely
important enough to watch the logs.  If you don't, all sorts of bad
things can happen to your data (either related to sync rep, or
otherwise) and you'll have no idea until it's far too late.

+++++

If your data is that important, your logs/monitoring are *equally*
important, because they are what give you confidence your data is as
safe as you think it is...

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

#17 Fujii Masao
masao.fujii@gmail.com
In reply to: Robert Haas (#13)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Thu, Mar 17, 2011 at 2:35 AM, Robert Haas <robertmhaas@gmail.com> wrote:

1. If a die interrupt is received (pg_terminate_backend or fast
shutdown), then terminate the sync rep wait and arrange for the
connection to be closed without acknowledging the commit (but do send
a warning message back).  The commit still happened, though, so other
transactions will see its effects.  This is unavoidable unless we're
willing to either ignore attempts to terminate a backend waiting for
sync rep, or panic the system when it happens, and I don't think
either of those is appropriate.

OK.

2. If a query cancel interrupt is received (pg_cancel_backend or ^C),
then cancel the sync rep wait and issue a warning before acknowledging
the commit.  Again, the alternative is to either ignore the cancel or
panic, neither of which I believe to be what users will want.

OK.

3. If synchronous_standby_names is changed to '' by editing
postgresql.conf and issuing pg_ctl reload, then cancel all waits in
progress and wake everybody up.  As I mentioned before, reloading the
config file from within the waiting backend (which can't safely throw
an error) seems risky,

AFAIR, ProcessConfigFile() doesn't throw an error, so I don't think
that's so risky. But, as you said in another thread, reading the config
file at that point is inconsistent, I agree. And it seems better to
leave it to a background process to wake up the backends.

so what I did instead is made WAL writer
responsible for handling this.  Nobody's allowed to wait for sync rep
unless a global shared memory flag is set, and the WAL writer process
is responsible for setting and clearing this flag when the config file
is reloaded.  This has basically no performance cost; WAL writer only
ever does any extra work at all with this code when it receives a
SIGHUP, and even then the work is trivial except in the case where
synchronous_standby_names has changed from empty to non-empty or vice
versa.  The advantage of putting this in WAL writer rather than, say,
bgwriter is that WAL writer doesn't have nearly as many jobs to do and
they don't involve nearly as much I/O, so the chances of a long delay
due to the process being busy are much less.

This occurs to me: shouldn't we ensure that, in the shutdown case,
walwriter exits only after all the backends have gone? I'm not sure
whether the case is worth worrying about, but what if
synchronous_standby_names is unset and the config file is reloaded after
a smart shutdown has been requested? In that case, the reload cannot
wake up the waiting backends, because walwriter has already exited. That
behavior looks a bit inconsistent.

4. Remove the SYNC_REP_MUST_DISCONNECT state, which actually does
absolutely nothing right now, despite what the name would seem to
imply.  In particular, it doesn't arrange for any sort of disconnect.
This patch does arrange for that, but not using this mechanism.

OK.

5. The existing code relies on being able to read MyProc->syncRepState
without holding the lock, even while a WAL sender must be updating it
in another process.  I'm not 100% sure this is safe on a
multi-processor machine with weak memory ordering.  In practice, the
chances of something going wrong here seem extremely small.  You'd
need something like this: a WAL sender updates MyProc->syncRepState
just after the wait timeout expires and before the latch is reset, but
the regular backend fails to see the state change due to
memory-ordering effects and drops through the loop, waiting another 60
s, and then finally wakes up and completes the wait (but a minute
later than expected).  That seems vanishingly unlikely but it's also
simple to protect against, so I did.

In the patch, in order to read the latest value, you take a light-weight lock.
But I wonder how taking a lock ensures that the value is up-to-date.

Review appreciated.

Thanks! Here are some comments:

+ * WAL writer calls this as needed to update the shared sync_standbys_needed

Typo: s/sync_standbys_needed/sync_standbys_defined

+ * we exit normally, or SYNC_REP_MUST_DISCONNECT in abnormal cases.

Typo: the reference to SYNC_REP_MUST_DISCONNECT is not required.

+ * So in this case we issue a NOTICE (which some clients may

Typo: s/NOTICE/WARNING

+		if (ProcDiePending)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("canceling the wait for replication and terminating connection due to administrator command"),
+					 errdetail("The transaction has already been committed locally but might have not been replicated to the standby.")));
+			whereToSendOutput = DestNone;
+			LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+			MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
+			SHMQueueDelete(&(MyProc->syncRepLinks));

SHMQueueIsDetached() should be checked before calling SHMQueueDelete().
Walsender can already delete the backend from the queue before reaching here.
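
For illustration, the guarded delete being asked for looks like this; the
v2 patch later in the thread wraps exactly this sequence in a helper,
SyncRepCancelWait():

	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
	if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
		SHMQueueDelete(&(MyProc->syncRepLinks));
	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
	LWLockRelease(SyncRepLock);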

+		if (QueryCancelPending)
+		{
+			QueryCancelPending = false;
+			ereport(WARNING,
+					(errmsg("canceling wait for synchronous replication due to user request"),
+					 errdetail("The transaction has committed locally, but may not have replicated to the standby.")));
+			LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+			MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
+			SHMQueueDelete(&(MyProc->syncRepLinks));
+			LWLockRelease(SyncRepLock);

Same as above.

+		if (!PostmasterIsAlive(true))
+		{
+			whereToSendOutput = DestNone;
+			proc_exit(1);

proc_exit() should not be called at that point because it leads to a PANIC.

I think that it's better to check ProcDiePending, QueryCancelPending
and PostmasterIsAlive *before* waiting on the latch, not after, because
those events can occur before we reach that point, and it's not worth
waiting for 60 seconds to detect them.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#13)
Re: Sync Rep and shutdown Re: Sync Rep v19

On 16.03.2011 19:35, Robert Haas wrote:

3. If synchronous_standby_names is changed to '' by editing
postgresql.conf and issuing pg_ctl reload, then cancel all waits in
progress and wake everybody up. As I mentioned before, reloading the
config file from within the waiting backend (which can't safely throw
an error) seems risky, so what I did instead is made WAL writer
responsible for handling this. Nobody's allowed to wait for sync rep
unless a global shared memory flag is set, and the WAL writer process
is responsible for setting and clearing this flag when the config file
is reloaded. This has basically no performance cost; WAL writer only
ever does any extra work at all with this code when it receives a
SIGHUP, and even then the work is trivial except in the case where
synchronous_standby_names has changed from empty to non-empty or vice
versa. The advantage of putting this in WAL writer rather than, say,
bgwriter is that WAL writer doesn't have nearly as many jobs to do and
they don't involve nearly as much I/O, so the chances of a long delay
due to the process being busy are much less.

Hmm, so setting synchronous_standby_names to '' takes effect
immediately, but other changes to it don't apply to already-blocked
commits. That seems a bit inconsistent. Perhaps walwriter should store
the parsed list of standby-names in shared memory, not just a boolean.
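
For illustration only, a hypothetical sketch of that idea; the bound and
field names below are invented and appear in no posted patch:

	#define MAX_SYNC_STANDBYS 16	/* invented bound, for illustration */

	typedef struct
	{
		/* ... existing WalSndCtlData fields elided ... */
		int		num_sync_standby_names;	/* 0 means sync rep disabled */
		char	sync_standby_names[MAX_SYNC_STANDBYS][NAMEDATALEN];
	} WalSndCtlData;	/* both fields protected by SyncRepLock */

Already-blocked commits could then be rechecked against the new list on
reload, rather than only being released when the list becomes empty.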

+1 otherwise.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#17)
1 attachment(s)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Thu, Mar 17, 2011 at 2:08 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

This occurs to me: shouldn't we ensure that, in the shutdown case,
walwriter exits only after all the backends have gone? I'm not sure
whether the case is worth worrying about, but what if
synchronous_standby_names is unset and the config file is reloaded after
a smart shutdown has been requested? In that case, the reload cannot
wake up the waiting backends, because walwriter has already exited. That
behavior looks a bit inconsistent.

I agree we need to fix smart shutdown. I think that's going to have
to be a separate patch, however; it's got more problems than this
patch can fix without expanding into a monster.

In the patch, in order to read the latest value, you take a light-weight lock.
But I wonder how taking a lock ensures that the value is up-to-date.

A lock acquisition acts as a memory barrier: rereading the value after
acquiring SyncRepLock is guaranteed to see any update the walsender made
before releasing the lock.
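
Condensed from the wait loop in the patch attached below, the
double-checked read in question is:

	int		syncRepState;

	syncRepState = MyProc->syncRepState;	/* unlocked read; may be stale */
	if (syncRepState == SYNC_REP_WAITING)
	{
		/*
		 * The lock acquisition is a barrier, so this reread is
		 * guaranteed to see an update made by a walsender while it
		 * held SyncRepLock.
		 */
		LWLockAcquire(SyncRepLock, LW_SHARED);
		syncRepState = MyProc->syncRepState;
		LWLockRelease(SyncRepLock);
	}
	if (syncRepState == SYNC_REP_WAIT_COMPLETE)
		break;			/* the wait is over */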

+ * WAL writer calls this as needed to update the shared sync_standbys_needed

Typo: s/sync_standbys_needed/sync_standbys_defined

Fixed.

+ * we exit normally, or SYNC_REP_MUST_DISCONNECT in abnormal cases.

Typo: the reference to SYNC_REP_MUST_DISCONNECT is not required.

Fixed.

+                * So in this case we issue a NOTICE (which some clients may

Typo: s/NOTICE/WARNING

Fixed.

+               if (ProcDiePending)
+               {
+                       ereport(WARNING,
+                                       (errcode(ERRCODE_ADMIN_SHUTDOWN),
+                                        errmsg("canceling the wait for replication and terminating connection due to administrator command"),
+                                        errdetail("The transaction has already been committed locally but might have not been replicated to the standby.")));
+                       whereToSendOutput = DestNone;
+                       LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+                       MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
+                       SHMQueueDelete(&(MyProc->syncRepLinks));

SHMQueueIsDetached() should be checked before calling SHMQueueDelete().
Walsender can already delete the backend from the queue before reaching here.

Fixed. But come to think of it, doesn't this mean
SyncRepCleanupAtProcExit() needs to repeat the test after acquiring
the lock?

+               if (QueryCancelPending)
+               {
+                       QueryCancelPending = false;
+                       ereport(WARNING,
+                                       (errmsg("canceling wait for synchronous replication due to user request"),
+                                        errdetail("The transaction has committed locally, but may not have replicated to the standby.")));
+                       LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+                       MyProc->syncRepState = SYNC_REP_WAIT_COMPLETE;
+                       SHMQueueDelete(&(MyProc->syncRepLinks));
+                       LWLockRelease(SyncRepLock);

Same as above.

Fixed.

+               if (!PostmasterIsAlive(true))
+               {
+                       whereToSendOutput = DestNone;
+                       proc_exit(1);

proc_exit() should not be called at that point because it leads to a PANIC.

Fixed, although I'm not sure it matters.

I think that it's better to check ProcDiePending, QueryCancelPending
and PostmasterIsAlive *before* waiting on the latch, not after, because
those events can occur before we reach that point, and it's not worth
waiting for 60 seconds to detect them.

Not necessary. Inspired by one of your earlier patches, I made die()
and StatementCancelHandler() set the latch. We could still wait up to
60 s to detect postmaster death, but that's a very rare situation and
it's not worth slowing down the common case for it, especially since
there is a race condition no matter what.
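
Condensed, the ordering that closes the race: the handlers set the
pending flag before setting the latch, while the waiter resets the latch
before testing the flags, so a signal arriving at any point leaves
either the flag or the latch visible.

	/* in the backend's wait loop (state test elided, as in the patch) */
	for (;;)
	{
		WaitLatch(&MyProc->waitLatch, 60000000L);	/* returns at once if set */
		ResetLatch(&MyProc->waitLatch);

		if (ProcDiePending || QueryCancelPending)
			break;		/* flag was set before the latch */
	}

	/* in die() and StatementCancelHandler(), after the usual flag setting */
	SetLatch(&(MyProc->waitLatch));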

Thanks for the review!

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

sync-rep-wait-fixes-v2.patchapplication/octet-stream; name=sync-rep-wait-fixes-v2.patchDownload
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 212f426..e16c0ab 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2098,8 +2098,7 @@ SET ENABLE_SEQSCAN TO OFF;
         If a standby is removed from the list of servers then it will stop
         being the synchronous standby, allowing another to take its place.
         If the list is empty, synchronous replication will not be
-        possible, whatever the setting of <varname>synchronous_replication</>,
-        however, already waiting commits will continue to wait.
+        possible, whatever the setting of <varname>synchronous_replication</>.
         Standbys may also be added to the list without restarting the server.
        </para>
       </listitem>
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index d0d7c9b..e97ac63 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -49,6 +49,7 @@
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
 #include "postmaster/walwriter.h"
+#include "replication/syncrep.h"
 #include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
@@ -216,6 +217,9 @@ WalWriterMain(void)
 	 */
 	PG_SETMASK(&UnBlockSig);
 
+	/* Do this once before starting the loop, then just at SIGHUP time. */
+	SyncRepUpdateSyncStandbysDefined();
+
 	/*
 	 * Loop forever
 	 */
@@ -237,6 +241,8 @@ WalWriterMain(void)
 		{
 			got_SIGHUP = false;
 			ProcessConfigFile(PGC_SIGHUP);
+			/* update global shmem state for sync rep */
+			SyncRepUpdateSyncStandbysDefined();
 		}
 		if (shutdown_requested)
 		{
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 3ef9cdd..b70de39 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -55,6 +55,7 @@
 #include "storage/ipc.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
+#include "tcop/tcopprot.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/guc_tables.h"
@@ -65,10 +66,13 @@
 bool	synchronous_replication = false;		/* Only set in user backends */
 char 	*SyncRepStandbyNames;
 
-static bool sync_standbys_defined = false;	/* Is there at least one name? */
+#define SyncStandbysDefined() \
+	(SyncRepStandbyNames != NULL && SyncRepStandbyNames[0] != '\0')
+
 static bool announce_next_takeover = true;
 
 static void SyncRepQueueInsert(void);
+static void SyncRepCancelWait(void);
 
 static int SyncRepGetStandbyPriority(void);
 #ifdef USE_ASSERT_CHECKING
@@ -83,6 +87,12 @@ static bool SyncRepQueueIsOrderedByLSN(void);
 
 /*
  * Wait for synchronous replication, if requested by user.
+ *
+ * Initially backends start in state SYNC_REP_NOT_WAITING and then
+ * change that state to SYNC_REP_WAITING before adding ourselves
+ * to the wait queue. During SyncRepWakeQueue() a WALSender changes
+ * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
+ * This backend then resets its state to SYNC_REP_NOT_WAITING.
  */
 void
 SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
@@ -95,10 +105,49 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
 	 * there are no sync replication standby names defined.
 	 * Note that those standbys don't need to be connected.
 	 */
-	if (!SyncRepRequested() || !sync_standbys_defined)
+	if (!SyncRepRequested() || !SyncStandbysDefined())
 		return;
 
 	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+	Assert(WalSndCtl != NULL);
+
+	/* Reset the latch before adding ourselves to the queue. */
+	ResetLatch(&MyProc->waitLatch);
+
+	/*
+	 * Set our waitLSN so WALSender will know when to wake us, and add
+	 * ourselves to the queue.
+	 */
+	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
+	if (!WalSndCtl->sync_standbys_defined)
+	{
+		/*
+		 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is
+		 * not set.  See SyncRepUpdateSyncStandbysDefined.
+		 */
+		LWLockRelease(SyncRepLock);
+		return;
+	}
+	MyProc->waitLSN = XactCommitLSN;
+	MyProc->syncRepState = SYNC_REP_WAITING;
+	SyncRepQueueInsert();
+	Assert(SyncRepQueueIsOrderedByLSN());
+	LWLockRelease(SyncRepLock);
+
+	/* Alter ps display to show waiting for sync rep. */
+	if (update_process_title)
+	{
+		int			len;
+
+		old_status = get_ps_display(&len);
+		new_status = (char *) palloc(len + 32 + 1);
+		memcpy(new_status, old_status, len);
+		sprintf(new_status + len, " waiting for %X/%X",
+				XactCommitLSN.xlogid, XactCommitLSN.xrecoff);
+		set_ps_display(new_status, false);
+		new_status[len] = '\0'; /* truncate off " waiting ..." */
+	}
 
 	/*
 	 * Wait for specified LSN to be confirmed.
@@ -108,117 +157,105 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
 	 */
 	for (;;)
 	{
+		int syncRepState;
+
+		/*
+		 * Wait on latch for up to 60 seconds. This allows us to
+		 * check for postmaster death regularly while waiting.
+		 * Note that timeout here does not necessarily release from loop.
+		 */
+		WaitLatch(&MyProc->waitLatch, 60000000L);
+
+		/* Must reset the latch before testing state. */
 		ResetLatch(&MyProc->waitLatch);
 
 		/*
-		 * Synchronous Replication state machine within user backend
-		 *
-		 * Initially backends start in state SYNC_REP_NOT_WAITING and then
-		 * change that state to SYNC_REP_WAITING before adding ourselves
-		 * to the wait queue. During SyncRepWakeQueue() a WALSender changes
-		 * the state to SYNC_REP_WAIT_COMPLETE once replication is confirmed.
-		 * This backend then resets its state to SYNC_REP_NOT_WAITING when
-		 * we exit normally, or SYNC_REP_MUST_DISCONNECT in abnormal cases.
-		 *
-		 * We read MyProc->syncRepState without SyncRepLock, which
-		 * assumes that read access is atomic.
+		 * Try checking the state without the lock first.  There's no guarantee
+		 * that we'll read the most up-to-date value, so if it looks like we're
+		 * still waiting, recheck while holding the lock.  But if it looks like
+		 * we're done, we must really be done, because once walsender changes
+		 * the state to SYNC_REP_WAIT_COMPLETE, it will never update it again,
+		 * so we can't be seeing a stale value in that case.
 		 */
-		switch (MyProc->syncRepState)
+		syncRepState = MyProc->syncRepState;
+		if (syncRepState == SYNC_REP_WAITING)
 		{
-			case SYNC_REP_NOT_WAITING:
-				/*
-				 * Set our waitLSN so WALSender will know when to wake us.
-				 * We set this before we add ourselves to queue, so that
-				 * any proc on the queue can be examined freely without
-				 * taking a lock on each process in the queue, as long as
-				 * they hold SyncRepLock.
-				 */
-				MyProc->waitLSN = XactCommitLSN;
-				MyProc->syncRepState = SYNC_REP_WAITING;
-
-				/*
-				 * Add to queue while holding lock.
-				 */
-				LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
-				SyncRepQueueInsert();
-				Assert(SyncRepQueueIsOrderedByLSN());
-				LWLockRelease(SyncRepLock);
-
-				/*
-				 * Alter ps display to show waiting for sync rep.
-				 */
-				if (update_process_title)
-				{
-					int			len;
-
-					old_status = get_ps_display(&len);
-					new_status = (char *) palloc(len + 32 + 1);
-					memcpy(new_status, old_status, len);
-					sprintf(new_status + len, " waiting for %X/%X",
-						 XactCommitLSN.xlogid, XactCommitLSN.xrecoff);
-					set_ps_display(new_status, false);
-					new_status[len] = '\0'; /* truncate off " waiting ..." */
-				}
-
-				break;
-
-			case SYNC_REP_WAITING:
-				/*
-				 * Check for conditions that would cause us to leave the
-				 * wait state before the LSN has been reached.
-				 */
-				if (!PostmasterIsAlive(true))
-				{
-					LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
-					SHMQueueDelete(&(MyProc->syncRepLinks));
-					LWLockRelease(SyncRepLock);
-
-					MyProc->syncRepState = SYNC_REP_MUST_DISCONNECT;
-					return;
-				}
-
-				/*
-				 * We don't receive SIGHUPs at this point, so resetting
-				 * synchronous_standby_names has no effect on waiters.
-				 */
-
-				/* Continue waiting */
-
-				break;
-
-			case SYNC_REP_WAIT_COMPLETE:
-				/*
-				 * WalSender has checked our LSN and has removed us from
-				 * queue. Cleanup local state and leave.
-				 */
-				Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
-
-				MyProc->syncRepState = SYNC_REP_NOT_WAITING;
-				MyProc->waitLSN.xlogid = 0;
-				MyProc->waitLSN.xrecoff = 0;
-
-				if (new_status)
-				{
-					/* Reset ps display */
-					set_ps_display(new_status, false);
-					pfree(new_status);
-				}
-
-				return;
-
-			case SYNC_REP_MUST_DISCONNECT:
-				return;
-
-			default:
-				elog(FATAL, "invalid syncRepState");
+			LWLockAcquire(SyncRepLock, LW_SHARED);
+			syncRepState = MyProc->syncRepState;
+			LWLockRelease(SyncRepLock);
 		}
+		if (syncRepState == SYNC_REP_WAIT_COMPLETE)
+			break;
 
 		/*
-		 * Wait on latch for up to 60 seconds. This allows us to
-		 * check for postmaster death regularly while waiting.
-		 * Note that timeout here does not necessarily release from loop.
+		 * If a wait for synchronous replication is pending, we can neither
+		 * acknowledge the commit nor raise ERROR or FATAL.  The latter
+		 * would lead the client to believe that the transaction
+		 * aborted, which is not true: it's already committed locally.
+		 * The former is no good either: the client has requested
+		 * synchronous replication, and is entitled to assume that an
+		 * acknowledged commit is also replicated, which may not be true.
+		 * So in this case we issue a WARNING (which some clients may
+		 * be able to interpret) and shut off further output.  We do NOT
+		 * reset ProcDiePending, so that the process will die after the
+		 * commit is cleaned up.
 		 */
-		WaitLatch(&MyProc->waitLatch, 60000000L);
+		if (ProcDiePending)
+		{
+			ereport(WARNING,
+					(errcode(ERRCODE_ADMIN_SHUTDOWN),
+					 errmsg("canceling the wait for replication and terminating connection due to administrator command"),
+					 errdetail("The transaction has already been committed locally but might have not been replicated to the standby.")));
+			whereToSendOutput = DestNone;
+			SyncRepCancelWait();
+			break;
+		}
+
+		/*
+		 * It's unclear what to do if a query cancel interrupt arrives.  We
+		 * can't actually abort at this point, but ignoring the interrupt
+		 * altogether is not helpful, so we just terminate the wait with
+		 * a suitable warning.
+		 */
+		if (QueryCancelPending)
+		{
+			QueryCancelPending = false;
+			ereport(WARNING,
+					(errmsg("canceling wait for synchronous replication due to user request"),
+					 errdetail("The transaction has committed locally, but may not have replicated to the standby.")));
+			SyncRepCancelWait();
+			break;
+		}
+
+		/*
+		 * If the postmaster dies, we'll probably never get an acknowledgement,
+		 * because all the wal sender processes will exit.  So just bail out.
+		 */
+		if (!PostmasterIsAlive(true))
+		{
+			ProcDiePending = true;
+			whereToSendOutput = DestNone;
+			SyncRepCancelWait();
+			break;
+		}
+	}
+
+	/*
+	 * WalSender has checked our LSN and has removed us from queue. Clean up
+	 * state and leave.  It's OK to reset these shared memory fields without
+	 * holding SyncRepLock, because any walsenders will ignore us anyway when
+	 * we're not on the queue.
+	 */
+	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
+	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+	MyProc->waitLSN.xlogid = 0;
+	MyProc->waitLSN.xrecoff = 0;
+
+	if (new_status)
+	{
+		/* Reset ps display */
+		set_ps_display(new_status, false);
+		pfree(new_status);
 	}
 }
 
@@ -257,6 +294,19 @@ SyncRepQueueInsert(void)
 		SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue), &(MyProc->syncRepLinks));
 }
 
+/*
+ * Acquire SyncRepLock and cancel any wait currently in progress.
+ */
+static void
+SyncRepCancelWait(void)
+{
+	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+	if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
+		SHMQueueDelete(&(MyProc->syncRepLinks));
+	MyProc->syncRepState = SYNC_REP_NOT_WAITING;
+	LWLockRelease(SyncRepLock);
+}
+
 void
 SyncRepCleanupAtProcExit(int code, Datum arg)
 {
@@ -506,6 +556,43 @@ SyncRepWakeQueue(bool all)
 	return numprocs;
 }
 
+/*
+ * WAL writer calls this as needed to update the shared sync_standbys_defined
+ * flag, so that backends don't remain permanently wedged if
+ * synchronous_standby_names is unset.  It's safe to check the current value
+ * without the lock, because it's only ever updated by one process.  But we
+ * must take the lock to change it.
+ */
+void
+SyncRepUpdateSyncStandbysDefined(void)
+{
+	bool		sync_standbys_defined = SyncStandbysDefined();
+
+	if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
+	{
+		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
+
+		/*
+		 * for backends to continue waiting.  Since the user no longer
+		 * for backends to continue to waiting.  Since the user no longer
+		 * wants synchronous replication, we'd better wake them up.
+		 */
+		if (!sync_standbys_defined)
+			SyncRepWakeQueue(true);
+
+		/*
+		 * Only allow people to join the queue when there are synchronous
+		 * standbys defined.  Without this interlock, there's a race
+		 * condition: we might wake up all the current waiters; then, some
+		 * backend that hasn't yet reloaded its config might go to sleep on
+		 * the queue (and never wake up).  This prevents that.
+		 */
+		WalSndCtl->sync_standbys_defined = sync_standbys_defined;
+
+		LWLockRelease(SyncRepLock);
+	}
+}
+
 #ifdef USE_ASSERT_CHECKING
 static bool
 SyncRepQueueIsOrderedByLSN(void)
@@ -568,13 +655,6 @@ assign_synchronous_standby_names(const char *newval, bool doit, GucSource source
 	}
 
 	/*
-	 * Is there at least one sync standby? If so cache this knowledge to
-	 * improve performance of SyncRepWaitForLSN() for all-async configs.
-	 */
-	if (doit && list_length(elemlist) > 0)
-		sync_standbys_defined = true;
-
-	/*
 	 * Any additional validation of standby names should go here.
 	 *
 	 * Don't attempt to set WALSender priority because this is executed by
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 39b7b5b..79c0f68 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -2643,6 +2643,9 @@ die(SIGNAL_ARGS)
 			InterruptHoldoffCount--;
 			ProcessInterrupts();
 		}
+
+		/* Interrupt any sync rep wait which is currently in progress. */
+		SetLatch(&(MyProc->waitLatch));
 	}
 
 	errno = save_errno;
@@ -2681,6 +2684,9 @@ StatementCancelHandler(SIGNAL_ARGS)
 			InterruptHoldoffCount--;
 			ProcessInterrupts();
 		}
+
+		/* Interrupt any sync rep wait which is currently in progress. */
+		SetLatch(&(MyProc->waitLatch));
 	}
 
 	errno = save_errno;
diff --git a/src/include/replication/syncrep.h b/src/include/replication/syncrep.h
index 9171eb6..188ec65 100644
--- a/src/include/replication/syncrep.h
+++ b/src/include/replication/syncrep.h
@@ -26,7 +26,6 @@
 #define SYNC_REP_NOT_WAITING		0
 #define SYNC_REP_WAITING			1
 #define SYNC_REP_WAIT_COMPLETE		2
-#define SYNC_REP_MUST_DISCONNECT	3
 
 /* user-settable parameters for synchronous replication */
 extern bool synchronous_replication;
@@ -42,6 +41,9 @@ extern void SyncRepCleanupAtProcExit(int code, Datum arg);
 extern void SyncRepInitConfig(void);
 extern void SyncRepReleaseWaiters(void);
 
+/* called by wal writer */
+extern void SyncRepUpdateSyncStandbysDefined(void);
+
 /* called by various procs */
 extern int SyncRepWakeQueue(bool all);
 extern const char *assign_synchronous_standby_names(const char *newval, bool doit, GucSource source);
diff --git a/src/include/replication/walsender.h b/src/include/replication/walsender.h
index 2e5b209..150a71f 100644
--- a/src/include/replication/walsender.h
+++ b/src/include/replication/walsender.h
@@ -78,6 +78,13 @@ typedef struct
 	 */
 	XLogRecPtr	lsn;
 
+	/*
+	 * Are any sync standbys defined?  Waiting backends can't reload the
+	 * config file safely, so WAL writer updates this value as needed.
+	 * Protected by SyncRepLock.
+	 */
+	bool		sync_standbys_defined;
+
 	WalSnd		walsnds[1];		/* VARIABLE LENGTH ARRAY */
 } WalSndCtlData;
 
#20Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#18)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Thu, Mar 17, 2011 at 8:24 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, so setting synchronous_standby_names to '' takes effect immediately,
but other changes to it don't apply to already-blocked commits. That seems a
bit inconsistent. Perhaps walwriter should store the parsed list of
standby-names in shared memory, not just a boolean.

I don't think this is necessary. In general, the current or potential
WAL sender processes are responsible for working out among themselves
whose job it is to release waiters, and doing it. As long as
synchronous_standby_names is non-empty, then either (1) there are one
or more standbys connected that can take on the role of synchronous
standby, and whoever does will release waiters or (2) there are no
standbys connected that can take on the role of synchronous standbys,
in which case no waiters should be released until one connects. But
when synchronous_standby_names becomes completely empty, that doesn't
mean "wait until a standby connects whose application name is in the
empty set and make him the synchronous standby" but rather
"synchronous replication is administratively disabled, don't wait in
the first place". So we just need a Boolean flag.

+1 otherwise.

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#13)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Wed, 2011-03-16 at 13:35 -0400, Robert Haas wrote:

2. If a query cancel interrupt is received (pg_cancel_backend or ^C),
then cancel the sync rep wait and issue a warning before acknowledging
the commit.

When I saw this commit, I noticed that the WARNING doesn't have an
errcode(). It seems like it should -- this is the kind of thing that the
client is likely to care about, and may want to handle specially.

Regards,
Jeff Davis

#22Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#21)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Thu, Mar 17, 2011 at 6:00 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Wed, 2011-03-16 at 13:35 -0400, Robert Haas wrote:

2. If a query cancel interrupt is received (pg_cancel_backend or ^C),
then cancel the sync rep wait and issue a warning before acknowledging
the commit.

When I saw this commit, I noticed that the WARNING doesn't have an
errcode(). It seems like it should -- this is the kind of thing that the
client is likely to care about, and may want to handle specially.

Should I invent ERRCODE_WARNING_TRANSACTION_NOT_REPLICATED?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#23Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#22)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Fri, 2011-03-18 at 08:27 -0400, Robert Haas wrote:

On Thu, Mar 17, 2011 at 6:00 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Wed, 2011-03-16 at 13:35 -0400, Robert Haas wrote:

2. If a query cancel interrupt is received (pg_cancel_backend or ^C),
then cancel the sync rep wait and issue a warning before acknowledging
the commit.

When I saw this commit, I noticed that the WARNING doesn't have an
errcode(). It seems like it should -- this is the kind of thing that the
client is likely to care about, and may want to handle specially.

Should I invent ERRCODE_WARNING_TRANSACTION_NOT_REPLICATED?

I think it's reasonable to invent a new code here. Perhaps use the word
"synchronous" rather than "replicated", though?

Regards,
Jeff Davis

#24Robert Haas
robertmhaas@gmail.com
In reply to: Jeff Davis (#23)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Fri, Mar 18, 2011 at 10:17 AM, Jeff Davis <pgsql@j-davis.com> wrote:

On Fri, 2011-03-18 at 08:27 -0400, Robert Haas wrote:

On Thu, Mar 17, 2011 at 6:00 PM, Jeff Davis <pgsql@j-davis.com> wrote:

On Wed, 2011-03-16 at 13:35 -0400, Robert Haas wrote:

2. If a query cancel interrupt is received (pg_cancel_backend or ^C),
then cancel the sync rep wait and issue a warning before acknowledging
the commit.

When I saw this commit, I noticed that the WARNING doesn't have an
errcode(). It seems like it should -- this is the kind of thing that the
client is likely to care about, and may want to handle specially.

Should I invent ERRCODE_WARNING_TRANSACTION_NOT_REPLICATED?

I think it's reasonable to invent a new code here. Perhaps use the word
"synchronous" rather than "replicated", though?

I think we have to, because it's definitely not the same situation
that someone would expect after ERRCODE_QUERY_CANCELLED.

But ERRCODE_WARNING_TRANSACTION_NOT_SYNCHRONOUS, which is what I read
your reply as suggesting, seems pretty wonky.  I wouldn't know what
that meant. Another option might be:

ERRCODE_(WARNING_?)REPLICATION_WAIT_CANCELLED

...which might have something to recommend it.

Other thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#25Jeff Davis
pgsql@j-davis.com
In reply to: Robert Haas (#24)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Fri, 2011-03-18 at 10:27 -0400, Robert Haas wrote:

ERRCODE_(WARNING_?)REPLICATION_WAIT_CANCELLED

...which might have something to recommend it.

Works for me.

Regards,
Jeff Davis

#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jeff Davis (#25)
Re: Sync Rep and shutdown Re: Sync Rep v19

On 18.03.2011 17:38, Jeff Davis wrote:

On Fri, 2011-03-18 at 10:27 -0400, Robert Haas wrote:

ERRCODE_(WARNING_?)REPLICATION_WAIT_CANCELLED

...which might have something to recommend it.

Works for me.

Yes, sounds reasonable. Without "WARNING_", please.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#27Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#26)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Fri, Mar 18, 2011 at 11:58 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 18.03.2011 17:38, Jeff Davis wrote:

On Fri, 2011-03-18 at 10:27 -0400, Robert Haas wrote:

ERRCODE_(WARNING_?)REPLICATION_WAIT_CANCELLED

...which might have something to recommend it.

Works for me.

Yes, sounds reasonable. Without "WARNING_", please.

The reason I included WARNING is that warnings have their own
section in errcodes.txt, and each errcode is marked E for error or W
for warning. Since we CAN'T actually error out here, I thought it
might be more appropriate to make this a warning; and all of the
existing such codes contain WARNING.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#19)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Thu, 2011-03-17 at 09:33 -0400, Robert Haas wrote:

Thanks for the review!

Let's have a look here...

You've added a test inside the lock to see if there is a standby, which
I took out for performance reasons. Maybe there's another way, I know
that code is fiddly.

You've also added back in the lock acquisition at wakeup with very
little justification, which was a major performance hit.

Together that's a >20% hit in performance in Yeb's tests. I think
you should spend a little time thinking about how to retune that.

I see handling added for ProcDiePending and QueryCancelPending directly
into syncrep.c without any comments in postgres.c to indicate that you
bypass ProcessInterrupts() in some cases. That looks pretty hokey to me.

SyncRepUpdateSyncStandbysDefined() is added into walwriter, which means
waiters won't be released if we do a sighup during a fast shutdown,
since the walwriter gets killed as soon as that starts. I'm thinking
bgwriter should handle that now.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

#29Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#28)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Thu, 2011-03-17 at 09:33 -0400, Robert Haas wrote:

Thanks for the review!

Lets have a look here...

You've added a test inside the lock to see if there is a standby, which
I took out for performance reasons. Maybe there's another way, I know
that code is fiddly.

You've also added back in the lock acquisition at wakeup with very
little justification, which was a major performance hit.

Together that's a >20% hit in performance in Yeb's tests. I think
you should spend a little time thinking about how to retune that.

Ouch. Do you have a link that describes his testing methodology? I
will look at it.

I see handling added for ProcDiePending and QueryCancelPending directly
into syncrep.c without any comments in postgres.c to indicate that you
bypass ProcessInterrupts() in some cases. That looks pretty hokey to me.

I can add some comments. Unfortunately, it's not feasible to call
ProcessInterrupts() directly from this point in the code - it causes a
database panic.

SyncRepUpdateSyncStandbysDefined() is added into walwriter, which means
waiters won't be released if we do a sighup during a fast shutdown,
since the walwriter gets killed as soon as that starts. I'm thinking
bgwriter should handle that now.

Hmm. I was thinking that doing it in WAL writer would make it more
responsive, but since this is a fairly unlikely scenario, it's
probably not worth complicating the shutdown sequence to do it the way
I did. I'll move it to bgwriter.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#30Alvaro Herrera
alvherre@commandprompt.com
In reply to: Robert Haas (#29)
Re: Sync Rep and shutdown Re: Sync Rep v19

Excerpts from Robert Haas's message of Fri Mar 18 14:25:16 -0300 2011:

On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

SyncRepUpdateSyncStandbysDefined() is added into walwriter, which means
waiters won't be released if we do a sighup during a fast shutdown,
since the walwriter gets killed as soon as that starts. I'm thinking
bgwriter should handle that now.

Hmm. I was thinking that doing it in WAL writer would make it more
responsive, but since this is a fairly unlikely scenario, it's
probably not worth complicating the shutdown sequence to do it the way
I did. I'll move it to bgwriter.

Can't they both do it?

--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#31Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#30)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Fri, Mar 18, 2011 at 2:55 PM, Alvaro Herrera
<alvherre@commandprompt.com> wrote:

Excerpts from Robert Haas's message of Fri Mar 18 14:25:16 -0300 2011:

On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

SyncRepUpdateSyncStandbysDefined() is added into walwriter, which means
waiters won't be released if we do a sighup during a fast shutdown,
since the walwriter gets killed as soon as that starts. I'm thinking
bgwriter should handle that now.

Hmm.  I was thinking that doing it in WAL writer would make it more
responsive, but since this is a fairly unlikely scenario, it's
probably not worth complicating the shutdown sequence to do it the way
I did.  I'll move it to bgwriter.

Can't they both do it?

Yeah, but it seems fairly pointless. In retrospect, I probably should
have done it the way Simon is proposing to begin with.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#32Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#29)
1 attachment(s)
Re: Sync Rep and shutdown Re: Sync Rep v19

Responding to this again, somewhat out of order...

On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Together that's a >20% hit in performance in Yeb's tests. I think
you should spend a little time thinking about how to retune that.

I've spent some time playing around with pgbench, and so far I haven't
been able to reliably reproduce this. That's not to say I don't believe
the effect is real, but either I'm doing something completely wrong, or
it requires some specific setup to measure that doesn't match my
environment, or it's somewhat finicky to reproduce, or some combination
of the above.

You've added a test inside the lock to see if there is a standby, which
I took out for performance reasons. Maybe there's another way, I know
that code is fiddly.

It seems pretty easy to remove the branch from the test at the top of
the function by just rearranging things a bit. Patch attached; does
this help?

You've also added back in the lock acquisition at wakeup with very
little justification, which was a major performance hit.

I have a very difficult time believing this is a real problem. That
extra lock acquisition and release only happens if WaitLatchOrSocket()
returns but MyProc->syncRepState still appears to be SYNC_REP_WAITING.
That should only happen if the latch wait hits the timeout (which
takes 60 s!) or if the precise memory-ordering problem the locking was
put in to fix is occurring (in which case it should dramatically *improve*
performance, by avoiding an extra 60 s wait). I stuck in a call to
elog(LOG, "got here") and it didn't fire even once in a 5-minute
pgbench test (~45k transactions). So I have a hard time crediting
this for any performance problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

sync-standbys-defined-rearrangement.patchapplication/octet-stream; name=sync-standbys-defined-rearrangement.patchDownload
diff --git a/src/backend/replication/syncrep.c b/src/backend/replication/syncrep.c
index 8aef998..e1f6408 100644
--- a/src/backend/replication/syncrep.c
+++ b/src/backend/replication/syncrep.c
@@ -99,6 +99,7 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
 {
 	char 		*new_status = NULL;
 	const char *old_status;
+	bool		sync_standbys_defined;
 
 	/*
 	 * Fast exit if user has not requested sync replication, or
@@ -110,6 +111,7 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
 
 	Assert(SHMQueueIsDetached(&(MyProc->syncRepLinks)));
 	Assert(WalSndCtl != NULL);
+	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
 
 	/* Reset the latch before adding ourselves to the queue. */
 	ResetLatch(&MyProc->waitLatch);
@@ -119,22 +121,25 @@ SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
 	 * ourselves to the queue.
 	 */
 	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
-	Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
-	if (!WalSndCtl->sync_standbys_defined)
-	{
-		/*
-		 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is
-		 * not set.  See SyncRepUpdateSyncStandbysDefined.
-		 */
-		LWLockRelease(SyncRepLock);
-		return;
-	}
+	sync_standbys_defined = WalSndCtl->sync_standbys_defined;
 	MyProc->waitLSN = XactCommitLSN;
 	MyProc->syncRepState = SYNC_REP_WAITING;
 	SyncRepQueueInsert();
 	Assert(SyncRepQueueIsOrderedByLSN());
 	LWLockRelease(SyncRepLock);
 
+	/*
+	 * We don't wait for sync rep if WalSndCtl->sync_standbys_defined is
+	 * not set.  See SyncRepUpdateSyncStandbysDefined.  However, since this
+	 * is an unlikely scenario, we release the lock before doing the test,
+	 * so as not to slow down the common case.
+	 */
+	if (!sync_standbys_defined)
+	{
+		SyncRepCancelWait();
+		return;
+	}
+
 	/* Alter ps display to show waiting for sync rep. */
 	if (update_process_title)
 	{
#33Yeb Havinga
yebhavinga@gmail.com
In reply to: Robert Haas (#29)
Re: Sync Rep and shutdown Re: Sync Rep v19

On 2011-03-18 18:25, Robert Haas wrote:

On Fri, Mar 18, 2011 at 1:15 PM, Simon Riggs<simon@2ndquadrant.com> wrote:

On Thu, 2011-03-17 at 09:33 -0400, Robert Haas wrote:

Thanks for the review!

Lets have a look here...

You've added a test inside the lock to see if there is a standby, which
I took out for performance reasons. Maybe there's another way, I know
that code is fiddly.

You've also added back in the lock acquisition at wakeup with very
little justification, which was a major performance hit.

Together that's a >20% hit in performance in Yeb's tests. I think
you should spend a little time thinking about how to retune that.

Ouch. Do you have a link that describes his testing methodology? I
will look at it.

Testing 'methodology' sounds a bit heavy. I tested a number of patch
versions over time, with 30-second, hourly and nightly pgbench runs. The
nightly runs were more for durability/memory-leak testing than tps
numbers, since I gradually got the impression that pgbench tests on
syncrep setups somehow suffer less from big differences between runs.

The postgresql.conf and recovery.conf I used to test v17 are listed here
http://archives.postgresql.org/pgsql-hackers/2011-02/msg02364.php

After the tests on v17 I played a bit with small memory changes in
postgresql.conf to see if the tps would go up. It went up a little, but
not enough to mention on the lists. All tests after v17 were done with
the postgresql.conf that I've copy-pasted below.

I mentioned a performance regression in
http://archives.postgresql.org/pgsql-hackers/2011-03/msg00298.php

And performance improvement in
http://archives.postgresql.org/pgsql-hackers/2011-03/msg00464.php

All three servers (el cheapo consumer grade) the same: triple core
AMD's, 16GB ECC, raid 0 over 2 SATA disks, XFS, nobarrier, separated
data and xlog partitions. NB: there is no BBU controller in these
servers. They don't run production stuff, it's just for testing. 1Gbit
ethernet on non-blocking HP switch. No other load. ./configure
--enable-depend --with-ossp-uuid --with-libxml --prefix=/mgrid/postgres

regards,
Yeb Havinga

Here are the non-default postgresql.conf settings I used after each new
initdb. (synchronous_replication is off since it prevented me from adding
a replication user, so after an initial basebackup I needed to turn it on.)
#------------------------------------------------------------------------------
# CUSTOMIZED OPTIONS
#------------------------------------------------------------------------------

#custom_variable_classes = ''		# list of custom variable class names

#shared_preload_libraries = 'pg_stat_statements'
#custom_variable_classes = 'pg_stat_statements'
#pg_stat_statements.max = 100
#pg_stat_statements.track = all
########
syslog_ident = relay
autovacuum = off
#debug_print_parse = on
#debug_print_rewritten = on
#debug_print_plan = on
#debug_pretty_print = on
log_error_verbosity = verbose
log_min_messages = warning
log_min_error_statement = warning
listen_addresses = '*' # what IP address(es) to listen on;
search_path = '"$user", public, hl7'
archive_mode = on
archive_command = 'cd .'
checkpoint_completion_target = 0.9
checkpoint_segments = 16
default_statistics_target = 500
constraint_exclusion = on
max_connections = 100
maintenance_work_mem = 528MB
effective_cache_size = 5GB
work_mem = 144MB
wal_buffers = 8MB
shared_buffers = 528MB
wal_level = 'archive'
max_wal_senders = 10
wal_keep_segments = 100 # 1600MB (for production increase this)
synchronous_standby_names = 'standby1,standby2,standby3'
#synchronous_replication = on

#34Robert Haas
robertmhaas@gmail.com
In reply to: Yeb Havinga (#33)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Sat, Mar 19, 2011 at 10:32 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:

Testing 'methodology' sounds a bit heavy. I tested a number of patch
versions over time, with 30-second, hourly and nightly pgbench runs. The
nightly runs were more for durability/memory-leak testing than tps numbers,
since I gradually got the impression that pgbench tests on syncrep setups
somehow suffer less from big differences between runs.

postgres and recovery.conf I used to test v17 is listed here
http://archives.postgresql.org/pgsql-hackers/2011-02/msg02364.php

After the tests on v17 I played a bit with small memory changes in the
postgres.conf to see if the tps would go up. It went up a little but not
enough to mention on the lists. All tests after v17 were done with the
postgres.conf that I've copy pasted below.

Hmm, I'm not going to be able to reproduce this here, and my test
setup didn't show a clear regression. I can try beating on it some
more, but... Any chance you could rerun your test with the latest
master-branch code, and perhaps also with the patch I proposed
upthread to remove a branch from the section protected by
SyncRepLock? I can't really tell from reading the emails you linked
what was responsible for the slowdowns and speedups, and it is unclear
to me how much impact my recent changes actually had. I would have
expected the dominant cost to be waiting for the slave to complete its
fsync, with network overhead as runner-up. And indeed this appears to
be the case on my ext4-based system. I would not have expected
contention on SyncRepLock to be much of a factor.

It strikes me that if contention on SyncRepLock IS the dominating
factor, the whole approach to queue management is pretty well busted.
*Every* walsender takes SyncRepLock in exclusive mode on receipt of
*every* standby reply message. That seems rather inefficient. To
make matters worse, every time a walsender grabs SyncRepLock, it
redoes the whole computation of who the synchronous standby is from
scratch. It strikes me that it ought to be possible to rejigger
things so that when a walsender exits, it signals any other walsenders
that exist to recheck whether they need to take over the role of
synchronous standby; then, each walsender needs to take the lock and
recheck only when it first connects, and each time it's signalled
thereafter. When a walsender does decide that a change in the
synchronous standby is needed, it should store the ID of the new
walsender in shared memory, in a variable protected by SyncRepLock, so
that the current synchronous standby can notice the change with a
simple integer comparison.
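
A hypothetical sketch of that scheme; every name below (the shared
field, the per-walsender index, and the priority test) is invented for
illustration and appears in no posted patch:

	/* invented field in WalSndCtlData, protected by SyncRepLock */
	int		sync_walsnd_index;	/* which walsender is the sync standby */

	/* run at walsender startup and whenever another walsender signals us */
	static void
	SyncRepRecheckSyncStandby(void)
	{
		LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
		if (SyncRepIAmHighestPriority())	/* invented priority test */
			WalSndCtl->sync_walsnd_index = MyWalSndIndex;	/* invented */
		LWLockRelease(SyncRepLock);
	}

	/* then, on each standby reply, the common case is a plain comparison */
	if (WalSndCtl->sync_walsnd_index != MyWalSndIndex)
		return;			/* not the sync standby; no waiters to release */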

It also strikes me that we ought to be able to rejigger things so that
the backends being woken up never need to touch shared memory at all -
i.e. eliminate syncRepState - thus reducing cache line contention. If
WaitLatch() returns true, then the latch was set, presumably by
walsender. My recent patch added a couple of places where the latch
can be set by the waiting process itself in response to an interrupt,
but that case can be handled by adding a backend-local flag variable
indicating whether we set the latch ourselves. If we determine that
the latch is set and the did-it-myself flag is clear, we can conclude
that we were awakened by a walsender and call it good. If the latch
is set and the did-it-myself flag is also set, then we were
interrupted, and we MAY also have been awakened by walsender at around
the same time. We grab SyncRepLock to remove ourselves from the
queue, and if we find we're already removed, then we know we were
interrupted just as walsender awakened us; otherwise, it's a pure
interrupt.
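
Again purely illustrative; the flag name is invented:

	/* backend-local; the wakeup fast path touches no shared memory */
	static volatile sig_atomic_t set_latch_myself = false;

	/* in die() / StatementCancelHandler(), instead of a bare SetLatch() */
	set_latch_myself = true;
	SetLatch(&MyProc->waitLatch);

	/* in the wait loop, once WaitLatch() returns */
	if (!set_latch_myself)
		break;			/* woken by walsender: wait complete */
	set_latch_myself = false;	/* interrupted; walsender may also have fired */
	LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
	if (!SHMQueueIsDetached(&(MyProc->syncRepLinks)))
		SHMQueueDelete(&(MyProc->syncRepLinks));	/* pure interrupt */
	LWLockRelease(SyncRepLock);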

It'd be interesting to see the results of some testing with
LWLOCK_STATS defined, to see whether SyncRepLock actually is contended
and if so to what degree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#35Yeb Havinga
yebhavinga@gmail.com
In reply to: Robert Haas (#34)
Re: Sync Rep and shutdown Re: Sync Rep v19

On 2011-03-20 05:44, Robert Haas wrote:

Hmm, I'm not going to be able to reproduce this here, and my test
setup didn't show a clear regression. I can try beating on it some
more, but... Any chance you could rerun your test with the latest
master-branch code, and perhaps also with the patch I proposed
upthread to remove a branch from the section protected by
SyncRepLock? I can't really tell from reading the emails you linked
what was responsible for the slowdowns and speedups, and it is unclear
to me how much impact my recent changes actually had.

No problem. Could you tell me the name of the "remove a branch from the
section protected by SyncRepLock" patch, or perhaps a message link?
Upthread I see sync-standbys-defined-rearrangement.patch but also two
sync-rep-wait-fixes patches.

regards,
Yeb Havinga

#36Robert Haas
robertmhaas@gmail.com
In reply to: Yeb Havinga (#35)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Sun, Mar 20, 2011 at 11:03 AM, Yeb Havinga <yebhavinga@gmail.com> wrote:

On 2011-03-20 05:44, Robert Haas wrote:

Hmm, I'm not going to be able to reproduce this here, and my test
setup didn't show a clear regression.  I can try beating on it some
more, but...  Any chance you could rerun your test with the latest
master-branch code, and perhaps also with the patch I proposed
upthread to remove a branch from the section protected by
SyncRepLock?  I can't really tell from reading the emails you linked
what was responsible for the slowdowns and speedups, and it is unclear
to me how much impact my recent changes actually had.

No problem. Could you tell me the name of the "remove a branch from the
section protected by SyncRepLock" patch, or perhaps a message link?
Upthread I see sync-standbys-defined-rearrangement.patch but also two
sync-rep-wait-fixes patches.

Thanks! The things I'd like to see compared are:

- performance as of commit e148443ddd95cd29edf4cc1de6188eb9cee029c5
- performance as of current git master
- performance as of current git master with
sync-standbys-defined-rearrangement applied

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#37Yeb Havinga
yebhavinga@gmail.com
In reply to: Robert Haas (#36)
Re: Sync Rep and shutdown Re: Sync Rep v19

On 2011-03-21 02:05, Robert Haas wrote:

On Sun, Mar 20, 2011 at 11:03 AM, Yeb Havinga<yebhavinga@gmail.com> wrote:

On 2011-03-20 05:44, Robert Haas wrote:

Hmm, I'm not going to be able to reproduce this here, and my test
setup didn't show a clear regression. I can try beating on it some
more, but... Any chance you could rerun your test with the latest
master-branch code, and perhaps also with the patch I proposed
upthread to remove a branch from the section protected by
SyncRepLock? I can't really tell from reading the emails you linked
what was responsible for the slowdowns and speedups, and it is unclear
to me how much impact my recent changes actually had.

No problem. Could you tell me the name of the "remove a branch from the
section protected by SyncRepLock" patch, or perhaps a message link?
Upthread I see sync-standbys-defined-rearrangement.patch but also two
sync-rep-wait-fixes patches.

Thanks! The things I'd like to see compared are:

pgbench -i -s 50 test
Two runs of "pgbench -c 10 -M prepared -T 600 test" with 1 sync standby
- server configs etc were mailed upthread.

- performance as of commit e148443ddd95cd29edf4cc1de6188eb9cee029c5

1158 and 1306 (avg 1232)

- performance as of current git master

1181 and 1280 (avg 1230,5)

- performance as of current git master with
sync-standbys-defined-rearrangement applied

1152 and 1269 (avg 1210,5)

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

#38Robert Haas
robertmhaas@gmail.com
In reply to: Yeb Havinga (#37)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Mon, Mar 21, 2011 at 12:29 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:

pgbench -i -s 50 test
Two runs of "pgbench -c 10 -M prepared -T 600 test" with 1 sync standby -
server configs etc were mailed upthread.

- performance as of commit e148443ddd95cd29edf4cc1de6188eb9cee029c5

1158 and 1306 (avg 1232)

- performance as of current git master

1181 and 1280 (avg 1230,5)

- performance as of current git master with
sync-standbys-defined-rearrangement applied

1152 and 1269 (avg 1210,5)

Hmm, that doesn't appear to show the 20% regression Simon claimed
upthread. That's good... but I'm confused as to how you are getting
numbers this high at all without a BBU. If every commit has to wait
for two consecutive fsyncs, cranking out 1200+ commits per second is a
lot. Maybe it's just barely plausible if these are 15K drives and all
the commits are piggybacking on the fsyncs at top speed, but, man,
that's fast.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#39Yeb Havinga
yebhavinga@gmail.com
In reply to: Robert Haas (#38)
Re: Sync Rep and shutdown Re: Sync Rep v19

On 2011-03-21 18:04, Robert Haas wrote:

On Mon, Mar 21, 2011 at 12:29 PM, Yeb Havinga<yebhavinga@gmail.com> wrote:

pgbench -i -s 50 test
Two runs of "pgbench -c 10 -M prepared -T 600 test" with 1 sync standby -
server configs etc were mailed upthread.

- performance as of commit e148443ddd95cd29edf4cc1de6188eb9cee029c5

1158 and 1306 (avg 1232)

- performance as of current git master

1181 and 1280 (avg 1230,5)

- performance as of current git master with
sync-standbys-defined-rearrangement applied

1152 and 1269 (avg 1210,5)

I ran another pgbench with this last setup, which gives it a 1240,33
average:
tps = 1300.786386 (including connections establishing)
tps = 1300.844220 (excluding connections establishing)

IMO what these tests have shown is that there is no 20% performance
difference between the different versions. To determine whether there are
differences, n should be a lot higher, or perhaps a single run with a
very long duration.

Hmm, that doesn't appear to show the 20% regression Simon claimed
upthread. That's good... but I'm confused as to how you are getting
numbers this high at all without a BBU.

For the sake of testing syncrep, I put xfs in nobarrier mode on both
master and standby:

/dev/sdc1 on /xlog type xfs (rw,noatime,nodiratime,nobarrier)
/dev/md11 on /archive type xfs
(rw,noatime,nodiratime,nobarrier,logdev=/dev/sdc3)
/dev/md10 on /data type xfs
(rw,noatime,nodiratime,nobarrier,logdev=/dev/sdc2)

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

#40Yeb Havinga
yebhavinga@gmail.com
In reply to: Yeb Havinga (#39)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Mon, Mar 21, 2011 at 7:51 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:

On 2011-03-21 18:04, Robert Haas wrote:

On Mon, Mar 21, 2011 at 12:29 PM, Yeb Havinga<yebhavinga@gmail.com>
wrote:

pgbench -i -s 50 test
Two runs of "pgbench -c 10 -M prepared -T 600 test" with 1 sync standby -
server configs etc were mailed upthread.

- performance as of commit e148443ddd95cd29edf4cc1de6188eb9cee029c5

1158 and 1306 (avg 1232)

- performance as of current git master

1181 and 1280 (avg 1230,5)

- performance as of current git master with
sync-standbys-defined-rearrangement applied

1152 and 1269 (avg 1210,5)

IMO what these tests have shown is that there is no 20% performance
difference between the different versions. To determine whether there are
differences, n should be a lot higher, or perhaps a single run with a
very long duration.

pgbench -T 3600:

sync-standbys-defined-rearrangement 1270 tps
current git master 1306 tps

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

#41Yeb Havinga
yebhavinga@gmail.com
In reply to: Yeb Havinga (#40)
Re: Sync Rep and shutdown Re: Sync Rep v19

On 2011-03-21 23:58, Yeb Havinga wrote:

On Mon, Mar 21, 2011 at 7:51 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:

On 2011-03-21 18:04, Robert Haas wrote:

On Mon, Mar 21, 2011 at 12:29 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:

pgbench -i -s 50 test
Two runs of "pgbench -c 10 -M prepared -T 600 test" with 1 sync standby -
server configs etc. were mailed upthread.

- performance as of commit e148443ddd95cd29edf4cc1de6188eb9cee029c5

1158 and 1306 (avg 1232)

- performance as of current git master

1181 and 1280 (avg 1230,5)

- performance as of current git master with
sync-standbys-defined-rearrangement applied

1152 and 1269 (avg 1210,5)

IMO what these tests have shown is that there is no 20%
performance difference between the different versions. To
determine whether there are differences, n should be a lot
higher, or perhaps a single run with a very long duration.

pgbench -T 3600:

sync-standbys-defined-rearrangement 1270 tps
current git master 1306 tps

Result of pgbench -T 30000
sync-standbys-defined-rearrangement 1267 tps
current (or few days old) git master 1326 tps

So the patch eats 4,5% from git master's syncrep performance in my
setup. Don't know how to measure it better than that.

--
Yeb Havinga
http://www.mgrid.net/
Mastering Medical Data

#42Robert Haas
robertmhaas@gmail.com
In reply to: Yeb Havinga (#41)
Re: Sync Rep and shutdown Re: Sync Rep v19

On Tue, Mar 22, 2011 at 3:25 PM, Yeb Havinga <yebhavinga@gmail.com> wrote:

So the patch eats 4,5% from git master's syncrep performance in my setup.
Don't know how to measure it better than that.

That's quite surprising, but I guess the way forward is clear: don't
apply that patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company