New replication mode: write
Hi,
Previously I proposed the replication mode "recv" on the above thread, but it
has not been committed yet. I'd like to propose that mode again because it is
useful for reducing the overhead of synchronous replication. The attached patch
implements it.
If you choose that mode, a transaction waits for its WAL to be write()'d on
the standby; in other words, it waits until the standby has saved the WAL in
memory. This provides a lower level of durability than the current synchronous
replication (in which a transaction waits for its WAL to be flushed to disk).
However, it is a practically useful setting because it can decrease the
transaction's response time, and it causes no data loss unless both the master
and the standby crash and the master's database gets corrupted at the same time.
In the patch, you can choose that mode by setting synchronous_commit to write.
I renamed that mode to "write" from "recv" on the basis of its actual behavior.
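As a rough illustration of the difference between the two modes, here is a minimal C sketch. It is not code from the patch: the Lsn type, the StandbyReply struct, and commit_is_released() are hypothetical names, and LSNs are modeled as plain 64-bit integers instead of XLogRecPtr. It only shows that in "write" mode a commit is released once the standby's reported write position reaches the commit LSN, while in "on" mode it has to wait for the flush position.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplification: an LSN is a plain 64-bit byte position. */
typedef uint64_t Lsn;

/* The two wait modes, numbered as in the patch's syncrep.h. */
enum { WAIT_WRITE = 0, WAIT_FLUSH = 1, NUM_WAIT_MODES = 2 };

/* Positions last reported by the synchronous standby (hypothetical struct). */
typedef struct
{
    Lsn write;   /* WAL write()'d on the standby, possibly not yet on disk */
    Lsn flush;   /* WAL flushed to durable storage on the standby */
} StandbyReply;

/*
 * Return true if a commit whose record ends at commit_lsn may stop waiting
 * in the given mode, i.e. the relevant standby position has reached it.
 */
bool
commit_is_released(const StandbyReply *reply, int mode, Lsn commit_lsn)
{
    assert(mode >= 0 && mode < NUM_WAIT_MODES);

    Lsn reached = (mode == WAIT_WRITE) ? reply->write : reply->flush;
    return commit_lsn <= reached;
}
```

With a standby that has written up to position 2000 but flushed only up to 1000, a commit at 1500 is released in write mode and keeps waiting in flush mode.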
I measured how much "write" mode improves performance in synchronous
replication. Here are the results:
synchronous_commit = on
tps = 424.510843 (including connections establishing)
tps = 420.767883 (including connections establishing)
tps = 419.715658 (including connections establishing)
tps = 428.810001 (including connections establishing)
tps = 337.341445 (including connections establishing)
synchronous_commit = write
tps = 550.752712 (including connections establishing)
tps = 407.104036 (including connections establishing)
tps = 455.576190 (including connections establishing)
tps = 453.548672 (including connections establishing)
tps = 555.171325 (including connections establishing)
I used pgbench (scale factor = 100) as the benchmark and ran the following
command:
pgbench -c 8 -j 8 -T 60 -M prepared
I always ran CHECKPOINT on both the master and the standby before starting each
pgbench test, to prevent checkpoints from affecting the results.
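For a quick summary of the runs above, the following C sketch averages the five tps figures for each setting. The array and function names are made up for this illustration. The means come out at roughly 406 tps for "on" and 484 tps for "write", about 19% higher, though with only five runs per setting and a wide spread between runs this should be read as indicative only.

```c
#include <assert.h>
#include <stddef.h>

/* tps figures copied from the pgbench runs reported above. */
static const double TPS_ON[] = {
    424.510843, 420.767883, 419.715658, 428.810001, 337.341445
};
static const double TPS_WRITE[] = {
    550.752712, 407.104036, 455.576190, 453.548672, 555.171325
};

/* Arithmetic mean of n samples; n must be greater than zero. */
double
mean_tps(const double *samples, size_t n)
{
    assert(n > 0);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double) n;
}
```

Calling mean_tps(TPS_WRITE, 5) / mean_tps(TPS_ON, 5) gives the relative throughput of "write" over "on".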
Thoughts? Comments?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
new_replication_mode_write_v1.patch (text/x-diff; charset=US-ASCII)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1559,1565 **** SET ENABLE_SEQSCAN TO OFF;
<para>
Specifies whether transaction commit will wait for WAL records
to be written to disk before the command returns a <quote>success</>
! indication to the client. Valid values are <literal>on</>,
<literal>local</>, and <literal>off</>. The default, and safe, value
is <literal>on</>. When <literal>off</>, there can be a delay between
when success is reported to the client and when the transaction is
--- 1559,1565 ----
<para>
Specifies whether transaction commit will wait for WAL records
to be written to disk before the command returns a <quote>success</>
! indication to the client. Valid values are <literal>on</>, <literal>write</>,
<literal>local</>, and <literal>off</>. The default, and safe, value
is <literal>on</>. When <literal>off</>, there can be a delay between
when success is reported to the client and when the transaction is
***************
*** 1579,1589 **** SET ENABLE_SEQSCAN TO OFF;
If <xref linkend="guc-synchronous-standby-names"> is set, this
parameter also controls whether or not transaction commit will wait
for the transaction's WAL records to be flushed to disk and replicated
! to the standby server. The commit wait will last until a reply from
! the current synchronous standby indicates it has written the commit
! record of the transaction to durable storage. If synchronous
replication is in use, it will normally be sensible either to wait
! both for WAL records to reach both the local and remote disks, or
to allow the transaction to commit asynchronously. However, the
special value <literal>local</> is available for transactions that
wish to wait for local flush to disk, but not synchronous replication.
--- 1579,1597 ----
If <xref linkend="guc-synchronous-standby-names"> is set, this
parameter also controls whether or not transaction commit will wait
for the transaction's WAL records to be flushed to disk and replicated
! to the standby server. When <literal>write</>, the commit wait will
! last until a reply from the current synchronous standby indicates
! it has received the commit record of the transaction to memory.
! Normally this causes no data loss at the time of failover. However,
! if both primary and standby crash, and the database cluster of
! the primary gets corrupted, recent committed transactions might
! be lost. When <literal>on</>, the commit wait will last until a reply
! from the current synchronous standby indicates it has flushed
! the commit record of the transaction to durable storage. This will
! avoid any data loss unless the database cluster of both primary and
! standby gets corrupted simultaneously. If synchronous
replication is in use, it will normally be sensible either to wait
! for both local flush and replication of WAL records, or
to allow the transaction to commit asynchronously. However, the
special value <literal>local</> is available for transactions that
wish to wait for local flush to disk, but not synchronous replication.
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1020,1025 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 1020,1035 ----
</para>
<para>
+ Setting <varname>synchronous_commit</> to <literal>write</> will
+ cause each commit to wait for confirmation that the standby has received
+ the commit record to memory. This provides a lower level of durability than
+ <literal>on</> does. However, it is a practically useful setting because
+ it can decrease the response time for the transaction, and causes
+ no data loss unless both the primary and the standby crash and
+ the database of the primary gets corrupted at the same time.
+ </para>
+
+ <para>
Users will stop waiting if a fast shutdown is requested. However, as
when using asynchronous replication, the server will does not fully
shutdown until all outstanding WAL records are transferred to the currently
***************
*** 1074,1086 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
<para>
Commits made when <varname>synchronous_commit</> is set to <literal>on</>
! will wait until the sync standby responds. The response may never occur
! if the last, or only, standby should crash.
</para>
<para>
The best solution for avoiding data loss is to ensure you don't lose
! your last remaining sync standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
The first named standby will be used as the synchronous standby. Standbys
listed after this will take over the role of synchronous standby if the
--- 1084,1096 ----
<para>
Commits made when <varname>synchronous_commit</> is set to <literal>on</>
! or <literal>write</> will wait until the synchronous standby responds. The response
! may never occur if the last, or only, standby should crash.
</para>
<para>
The best solution for avoiding data loss is to ensure you don't lose
! your last remaining synchronous standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
The first named standby will be used as the synchronous standby. Standbys
listed after this will take over the role of synchronous standby if the
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 20,27 ****
* per-transaction state information.
*
* Replication is either synchronous or not synchronous (async). If it is
! * async, we just fastpath out of here. If it is sync, then in 9.1 we wait
! * for the flush location on the standby before releasing the waiting backend.
* Further complexity in that interaction is expected in later releases.
*
* The best performing way to manage the waiting backends is to have a
--- 20,27 ----
* per-transaction state information.
*
* Replication is either synchronous or not synchronous (async). If it is
! * async, we just fastpath out of here. If it is sync, then we wait for
! * the write or flush location on the standby before releasing the waiting backend.
* Further complexity in that interaction is expected in later releases.
*
* The best performing way to manage the waiting backends is to have a
***************
*** 67,79 **** char *SyncRepStandbyNames;
static bool announce_next_takeover = true;
! static void SyncRepQueueInsert(void);
static void SyncRepCancelWait(void);
static int SyncRepGetStandbyPriority(void);
#ifdef USE_ASSERT_CHECKING
! static bool SyncRepQueueIsOrderedByLSN(void);
#endif
/*
--- 67,81 ----
static bool announce_next_takeover = true;
! static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
!
! static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
static int SyncRepGetStandbyPriority(void);
#ifdef USE_ASSERT_CHECKING
! static bool SyncRepQueueIsOrderedByLSN(int mode);
#endif
/*
***************
*** 120,126 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* be a low cost check.
*/
if (!WalSndCtl->sync_standbys_defined ||
! XLByteLE(XactCommitLSN, WalSndCtl->lsn))
{
LWLockRelease(SyncRepLock);
return;
--- 122,128 ----
* be a low cost check.
*/
if (!WalSndCtl->sync_standbys_defined ||
! XLByteLE(XactCommitLSN, WalSndCtl->lsn[SyncRepWaitMode]))
{
LWLockRelease(SyncRepLock);
return;
***************
*** 132,139 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
*/
MyProc->waitLSN = XactCommitLSN;
MyProc->syncRepState = SYNC_REP_WAITING;
! SyncRepQueueInsert();
! Assert(SyncRepQueueIsOrderedByLSN());
LWLockRelease(SyncRepLock);
/* Alter ps display to show waiting for sync rep. */
--- 134,141 ----
*/
MyProc->waitLSN = XactCommitLSN;
MyProc->syncRepState = SYNC_REP_WAITING;
! SyncRepQueueInsert(SyncRepWaitMode);
! Assert(SyncRepQueueIsOrderedByLSN(SyncRepWaitMode));
LWLockRelease(SyncRepLock);
/* Alter ps display to show waiting for sync rep. */
***************
*** 267,284 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
}
/*
! * Insert MyProc into SyncRepQueue, maintaining sorted invariant.
*
* Usually we will go at tail of queue, though it's possible that we arrive
* here out of order, so start at tail and work back to insertion point.
*/
static void
! SyncRepQueueInsert(void)
{
PGPROC *proc;
! proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
! &(WalSndCtl->SyncRepQueue),
offsetof(PGPROC, syncRepLinks));
while (proc)
--- 269,287 ----
}
/*
! * Insert MyProc into the specified SyncRepQueue, maintaining sorted invariant.
*
* Usually we will go at tail of queue, though it's possible that we arrive
* here out of order, so start at tail and work back to insertion point.
*/
static void
! SyncRepQueueInsert(int mode)
{
PGPROC *proc;
! Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
! proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue[mode]),
! &(WalSndCtl->SyncRepQueue[mode]),
offsetof(PGPROC, syncRepLinks));
while (proc)
***************
*** 290,296 **** SyncRepQueueInsert(void)
if (XLByteLT(proc->waitLSN, MyProc->waitLSN))
break;
! proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
}
--- 293,299 ----
if (XLByteLT(proc->waitLSN, MyProc->waitLSN))
break;
! proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue[mode]),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
}
***************
*** 298,304 **** SyncRepQueueInsert(void)
if (proc)
SHMQueueInsertAfter(&(proc->syncRepLinks), &(MyProc->syncRepLinks));
else
! SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue), &(MyProc->syncRepLinks));
}
/*
--- 301,307 ----
if (proc)
SHMQueueInsertAfter(&(proc->syncRepLinks), &(MyProc->syncRepLinks));
else
! SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue[mode]), &(MyProc->syncRepLinks));
}
/*
***************
*** 368,374 **** SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
volatile WalSnd *syncWalSnd = NULL;
! int numprocs = 0;
int priority = 0;
int i;
--- 371,378 ----
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
volatile WalSnd *syncWalSnd = NULL;
! int numwrite = 0;
! int numflush = 0;
int priority = 0;
int i;
***************
*** 419,438 **** SyncRepReleaseWaiters(void)
return;
}
! if (XLByteLT(walsndctl->lsn, MyWalSnd->flush))
{
! /*
! * Set the lsn first so that when we wake backends they will release
! * up to this location.
! */
! walsndctl->lsn = MyWalSnd->flush;
! numprocs = SyncRepWakeQueue(false);
}
LWLockRelease(SyncRepLock);
! elog(DEBUG3, "released %d procs up to %X/%X",
! numprocs,
MyWalSnd->flush.xlogid,
MyWalSnd->flush.xrecoff);
--- 423,450 ----
return;
}
! /*
! * Set the lsn first so that when we wake backends they will release
! * up to this location.
! */
! if (XLByteLT(walsndctl->lsn[SYNC_REP_WAIT_WRITE], MyWalSnd->write))
{
! walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
! numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
! }
! if (XLByteLT(walsndctl->lsn[SYNC_REP_WAIT_FLUSH], MyWalSnd->flush))
! {
! walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
! numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
}
LWLockRelease(SyncRepLock);
! elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X",
! numwrite,
! MyWalSnd->write.xlogid,
! MyWalSnd->write.xrecoff,
! numflush,
MyWalSnd->flush.xlogid,
MyWalSnd->flush.xrecoff);
***************
*** 507,530 **** SyncRepGetStandbyPriority(void)
}
/*
! * Walk queue from head. Set the state of any backends that need to be woken,
! * remove them from the queue, and then wake them. Pass all = true to wake
! * whole queue; otherwise, just wake up to the walsender's LSN.
*
* Must hold SyncRepLock.
*/
int
! SyncRepWakeQueue(bool all)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc = NULL;
PGPROC *thisproc = NULL;
int numprocs = 0;
! Assert(SyncRepQueueIsOrderedByLSN());
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! &(WalSndCtl->SyncRepQueue),
offsetof(PGPROC, syncRepLinks));
while (proc)
--- 519,544 ----
}
/*
! * Walk the specified queue from head. Set the state of any backends that
! * need to be woken, remove them from the queue, and then wake them.
! * Pass all = true to wake whole queue; otherwise, just wake up to
! * the walsender's LSN.
*
* Must hold SyncRepLock.
*/
int
! SyncRepWakeQueue(bool all, int mode)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc = NULL;
PGPROC *thisproc = NULL;
int numprocs = 0;
! Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
! Assert(SyncRepQueueIsOrderedByLSN(mode));
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
! &(WalSndCtl->SyncRepQueue[mode]),
offsetof(PGPROC, syncRepLinks));
while (proc)
***************
*** 532,538 **** SyncRepWakeQueue(bool all)
/*
* Assume the queue is ordered by LSN
*/
! if (!all && XLByteLT(walsndctl->lsn, proc->waitLSN))
return numprocs;
/*
--- 546,552 ----
/*
* Assume the queue is ordered by LSN
*/
! if (!all && XLByteLT(walsndctl->lsn[mode], proc->waitLSN))
return numprocs;
/*
***************
*** 540,546 **** SyncRepWakeQueue(bool all)
* thisproc is valid, proc may be NULL after this.
*/
thisproc = proc;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
--- 554,560 ----
* thisproc is valid, proc may be NULL after this.
*/
thisproc = proc;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
***************
*** 588,594 **** SyncRepUpdateSyncStandbysDefined(void)
* wants synchronous replication, we'd better wake them up.
*/
if (!sync_standbys_defined)
! SyncRepWakeQueue(true);
/*
* Only allow people to join the queue when there are synchronous
--- 602,613 ----
* wants synchronous replication, we'd better wake them up.
*/
if (!sync_standbys_defined)
! {
! int i;
!
! for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
! SyncRepWakeQueue(true, i);
! }
/*
* Only allow people to join the queue when there are synchronous
***************
*** 605,620 **** SyncRepUpdateSyncStandbysDefined(void)
#ifdef USE_ASSERT_CHECKING
static bool
! SyncRepQueueIsOrderedByLSN(void)
{
PGPROC *proc = NULL;
XLogRecPtr lastLSN;
lastLSN.xlogid = 0;
lastLSN.xrecoff = 0;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! &(WalSndCtl->SyncRepQueue),
offsetof(PGPROC, syncRepLinks));
while (proc)
--- 624,641 ----
#ifdef USE_ASSERT_CHECKING
static bool
! SyncRepQueueIsOrderedByLSN(int mode)
{
PGPROC *proc = NULL;
XLogRecPtr lastLSN;
+ Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
+
lastLSN.xlogid = 0;
lastLSN.xrecoff = 0;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
! &(WalSndCtl->SyncRepQueue[mode]),
offsetof(PGPROC, syncRepLinks));
while (proc)
***************
*** 628,634 **** SyncRepQueueIsOrderedByLSN(void)
lastLSN = proc->waitLSN;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
}
--- 649,655 ----
lastLSN = proc->waitLSN;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
}
***************
*** 675,677 **** check_synchronous_standby_names(char **newval, void **extra, GucSource source)
--- 696,715 ----
return true;
}
+
+ void
+ assign_synchronous_commit(int newval, void *extra)
+ {
+ switch (newval)
+ {
+ case SYNCHRONOUS_COMMIT_REMOTE_WRITE:
+ SyncRepWaitMode = SYNC_REP_WAIT_WRITE;
+ break;
+ case SYNCHRONOUS_COMMIT_REMOTE_FLUSH:
+ SyncRepWaitMode = SYNC_REP_WAIT_FLUSH;
+ break;
+ default:
+ SyncRepWaitMode = SYNC_REP_NO_WAIT;
+ break;
+ }
+ }
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 1405,1411 **** WalSndShmemInit(void)
/* First time through, so initialize */
MemSet(WalSndCtl, 0, WalSndShmemSize());
! SHMQueueInit(&(WalSndCtl->SyncRepQueue));
for (i = 0; i < max_wal_senders; i++)
{
--- 1405,1412 ----
/* First time through, so initialize */
MemSet(WalSndCtl, 0, WalSndShmemSize());
! for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
! SHMQueueInit(&(WalSndCtl->SyncRepQueue[i]));
for (i = 0; i < max_wal_senders; i++)
{
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 370,380 **** static const struct config_enum_entry constraint_exclusion_options[] = {
};
/*
! * Although only "on", "off", and "local" are documented, we
* accept all the likely variants of "on" and "off".
*/
static const struct config_enum_entry synchronous_commit_options[] = {
{"local", SYNCHRONOUS_COMMIT_LOCAL_FLUSH, false},
{"on", SYNCHRONOUS_COMMIT_ON, false},
{"off", SYNCHRONOUS_COMMIT_OFF, false},
{"true", SYNCHRONOUS_COMMIT_ON, true},
--- 370,381 ----
};
/*
! * Although only "on", "off", "write", and "local" are documented, we
* accept all the likely variants of "on" and "off".
*/
static const struct config_enum_entry synchronous_commit_options[] = {
{"local", SYNCHRONOUS_COMMIT_LOCAL_FLUSH, false},
+ {"write", SYNCHRONOUS_COMMIT_REMOTE_WRITE, false},
{"on", SYNCHRONOUS_COMMIT_ON, false},
{"off", SYNCHRONOUS_COMMIT_OFF, false},
{"true", SYNCHRONOUS_COMMIT_ON, true},
***************
*** 3164,3170 **** static struct config_enum ConfigureNamesEnum[] =
},
&synchronous_commit,
SYNCHRONOUS_COMMIT_ON, synchronous_commit_options,
! NULL, NULL, NULL
},
{
--- 3165,3171 ----
},
&synchronous_commit,
SYNCHRONOUS_COMMIT_ON, synchronous_commit_options,
! NULL, assign_synchronous_commit, NULL
},
{
*** a/src/include/access/xact.h
--- b/src/include/access/xact.h
***************
*** 55,60 **** typedef enum
--- 55,61 ----
{
SYNCHRONOUS_COMMIT_OFF, /* asynchronous commit */
SYNCHRONOUS_COMMIT_LOCAL_FLUSH, /* wait for local flush only */
+ SYNCHRONOUS_COMMIT_REMOTE_WRITE, /* wait for local flush and remote write */
SYNCHRONOUS_COMMIT_REMOTE_FLUSH /* wait for local and remote flush */
} SyncCommitLevel;
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 15,20 ****
--- 15,30 ----
#include "utils/guc.h"
+ #define SyncRepRequested() \
+ (max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+
+ /* SyncRepWaitMode */
+ #define SYNC_REP_NO_WAIT -1
+ #define SYNC_REP_WAIT_WRITE 0
+ #define SYNC_REP_WAIT_FLUSH 1
+
+ #define NUM_SYNC_REP_WAIT_MODE 2
+
/* syncRepState */
#define SYNC_REP_NOT_WAITING 0
#define SYNC_REP_WAITING 1
***************
*** 37,44 **** extern void SyncRepReleaseWaiters(void);
extern void SyncRepUpdateSyncStandbysDefined(void);
/* called by various procs */
! extern int SyncRepWakeQueue(bool all);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
#endif /* _SYNCREP_H */
--- 47,55 ----
extern void SyncRepUpdateSyncStandbysDefined(void);
/* called by various procs */
! extern int SyncRepWakeQueue(bool all, int mode);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+ extern void assign_synchronous_commit(int newval, void *extra);
#endif /* _SYNCREP_H */
*** a/src/include/replication/walsender_private.h
--- b/src/include/replication/walsender_private.h
***************
*** 14,19 ****
--- 14,20 ----
#include "access/xlog.h"
#include "nodes/nodes.h"
+ #include "replication/syncrep.h"
#include "storage/latch.h"
#include "storage/shmem.h"
#include "storage/spin.h"
***************
*** 68,82 **** extern WalSnd *MyWalSnd;
typedef struct
{
/*
! * Synchronous replication queue. Protected by SyncRepLock.
*/
! SHM_QUEUE SyncRepQueue;
/*
* Current location of the head of the queue. All waiters should have a
* waitLSN that follows this value. Protected by SyncRepLock.
*/
! XLogRecPtr lsn;
/*
* Are any sync standbys defined? Waiting backends can't reload the
--- 69,84 ----
typedef struct
{
/*
! * Synchronous replication queue with one queue per request type.
! * Protected by SyncRepLock.
*/
! SHM_QUEUE SyncRepQueue[NUM_SYNC_REP_WAIT_MODE];
/*
* Current location of the head of the queue. All waiters should have a
* waitLSN that follows this value. Protected by SyncRepLock.
*/
! XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE];
/*
* Are any sync standbys defined? Waiting backends can't reload the
On Fri, Jan 13, 2012 at 7:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Jan 13, 2012 at 9:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Jan 13, 2012 at 7:41 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
Thought? Comments?
This is almost exactly the same as my patch series
"syncrep_queues.v[1,2].patch" earlier this year. Which I know because
I was updating that patch myself last night for 9.2. I'm about half
way through doing that, since you and I agreed in Ottawa I would do
this. Perhaps it is better if we work together?
I think this comment is mostly pointless. We don't have time to work
together and there's no real reason to. You know what you're doing, so
I'll leave you to do it.
Please add the Apply mode.
OK, will do.
In my patch, the reason I avoided doing WRITE mode (which we had
previously referred to as RECV) was that no fsync of the WAL contents
takes place. In that case we are applying changes using un-fsynced WAL
data and in case of crash this would cause a problem.
My patch has not changed the execution order of WAL flush and replay.
WAL records are always replayed after they are flushed by walreceiver,
so that problem doesn't occur.
However, this means that a transaction might need to wait for a WAL flush
caused by a previous transaction even if WRITE mode is chosen. This limits
the performance gain from WRITE mode and should be improved later, I think.
I was going to
make the WalWriter available during recovery to cater for that. Do you
not think that is no longer necessary?
That's still necessary to improve the performance in sync rep further, I think.
What I'd like to do (maybe in 9.3dev) after supporting WRITE mode is:
* Allow WAL records to be replayed before they are flushed to disk.
* Add a new GUC parameter specifying whether to allow the standby to defer
WAL flush. If the parameter is false, walreceiver flushes WAL whenever it
receives WAL (i.e., the same as the current behavior). If true, walreceiver
doesn't flush WAL at all. Instead, walwriter, a backend, or the startup process
does that. Walwriter periodically checks whether there is an unflushed WAL
file and flushes it if one exists. When a buffer page is written out, the
backend or startup process forces a WAL flush up to the buffer's LSN.
If the above GUC parameter is set to true (i.e., walreceiver doesn't flush
WAL at all) and WRITE mode is chosen, a transaction doesn't need to wait
for a WAL flush on the standby at all. WAL flushes on the standby would also
become less frequent, which significantly reduces I/O load.
As a result, performance in sync rep would improve considerably.
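To make the proposed behavior concrete, here is a small C sketch of the decision described above. Nothing in it exists in PostgreSQL or in the attached patch: defer_standby_flush stands in for the GUC parameter that has not been named yet, and the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; only the notion of write/flush modes is from the patch. */
typedef enum { MODE_WRITE, MODE_FLUSH } WaitMode;

/*
 * Does walreceiver itself flush WAL as soon as it receives it?
 * With the proposed parameter off this is the current behavior; with it on,
 * flushing is left to walwriter, backends, or the startup process.
 */
bool
walreceiver_flushes_on_receive(bool defer_standby_flush)
{
    return !defer_standby_flush;
}

/*
 * Can a commit end up waiting behind a WAL flush on the standby?
 * In flush mode it always waits for the flush. In write mode it can still
 * be delayed by a flush triggered for an earlier transaction, but only
 * while walreceiver flushes on receive.
 */
bool
commit_may_wait_for_standby_flush(WaitMode mode, bool defer_standby_flush)
{
    if (mode == MODE_FLUSH)
        return true;
    return walreceiver_flushes_on_receive(defer_standby_flush);
}
```

Only the combination of WRITE mode and deferred flushing takes the standby's flush entirely out of the commit path, which is the case the proposal is aiming for.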
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Jan 13, 2012 at 12:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
In my patch, the reason I avoided doing WRITE mode (which we had
previously referred to as RECV) was that no fsync of the WAL contents
takes place. In that case we are applying changes using un-fsynced WAL
data and in case of crash this would cause a problem.
My patch has not changed the execution order of WAL flush and replay.
WAL records are always replayed after they are flushed by walreceiver,
so that problem doesn't occur.
However, this means that a transaction might need to wait for a WAL flush
caused by a previous transaction even if WRITE mode is chosen. This limits
the performance gain from WRITE mode and should be improved later, I think.
If the WALreceiver still flushes that is OK.
The latency would be smoother and lower if the WALwriter were active.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jan 13, 2012 at 9:27 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Fri, Jan 13, 2012 at 7:30 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Jan 13, 2012 at 9:15 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Fri, Jan 13, 2012 at 7:41 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
Thought? Comments?
This is almost exactly the same as my patch series
"syncrep_queues.v[1,2].patch" earlier this year. Which I know because
I was updating that patch myself last night for 9.2. I'm about half
way through doing that, since you and I agreed in Ottawa I would do
this. Perhaps it is better if we work together?
I think this comment is mostly pointless. We don't have time to work
together and there's no real reason to. You know what you're doing, so
I'll leave you to do it.
Please add the Apply mode.
OK, will do.
Done. Attached is the updated version of the patch.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
new_replication_mode_v2.patch (text/x-diff; charset=US-ASCII)
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1559,1567 **** SET ENABLE_SEQSCAN TO OFF;
<para>
Specifies whether transaction commit will wait for WAL records
to be written to disk before the command returns a <quote>success</>
! indication to the client. Valid values are <literal>on</>,
! <literal>local</>, and <literal>off</>. The default, and safe, value
! is <literal>on</>. When <literal>off</>, there can be a delay between
when success is reported to the client and when the transaction is
really guaranteed to be safe against a server crash. (The maximum
delay is three times <xref linkend="guc-wal-writer-delay">.) Unlike
--- 1559,1567 ----
<para>
Specifies whether transaction commit will wait for WAL records
to be written to disk before the command returns a <quote>success</>
! indication to the client. Valid values are <literal>on</>, <literal>write</>,
! <literal>apply</>, <literal>local</>, and <literal>off</>. The default, and safe,
! value is <literal>on</>. When <literal>off</>, there can be a delay between
when success is reported to the client and when the transaction is
really guaranteed to be safe against a server crash. (The maximum
delay is three times <xref linkend="guc-wal-writer-delay">.) Unlike
***************
*** 1579,1589 **** SET ENABLE_SEQSCAN TO OFF;
If <xref linkend="guc-synchronous-standby-names"> is set, this
parameter also controls whether or not transaction commit will wait
for the transaction's WAL records to be flushed to disk and replicated
! to the standby server. The commit wait will last until a reply from
! the current synchronous standby indicates it has written the commit
! record of the transaction to durable storage. If synchronous
replication is in use, it will normally be sensible either to wait
! both for WAL records to reach both the local and remote disks, or
to allow the transaction to commit asynchronously. However, the
special value <literal>local</> is available for transactions that
wish to wait for local flush to disk, but not synchronous replication.
--- 1579,1600 ----
If <xref linkend="guc-synchronous-standby-names"> is set, this
parameter also controls whether or not transaction commit will wait
for the transaction's WAL records to be flushed to disk and replicated
! to the standby server. When <literal>on</>, the commit wait will last
! until a reply from the current synchronous standby indicates it has flushed
! the commit record of the transaction to durable storage. This will
! avoid any data loss unless the database cluster of both primary and
! standby gets corrupted simultaneously. When <literal>write</>,
! the commit wait will last until a reply from the current synchronous
! standby indicates it has received the commit record of the transaction
! to memory. Normally this causes no data loss at the time of failover.
! However, if both primary and standby crash, and the database cluster of
! the primary gets corrupted, recent committed transactions might
! be lost. When <literal>apply</>, the commit will wait until the current
! synchronous standby has replayed the committed changes successfully.
! This guarantees that any transactions are visible on the synchronous
! standby when they are committed. If synchronous
replication is in use, it will normally be sensible either to wait
! for both local flush and replication of WAL records, or
to allow the transaction to commit asynchronously. However, the
special value <literal>local</> is available for transactions that
wish to wait for local flush to disk, but not synchronous replication.
*** a/doc/src/sgml/high-availability.sgml
--- b/doc/src/sgml/high-availability.sgml
***************
*** 1011,1016 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
--- 1011,1039 ----
</para>
<para>
+ Setting <varname>synchronous_commit</> to <literal>write</> will
+ cause each commit to wait for confirmation that the standby has received
+ the commit record to memory. This provides a lower level of durability than
+ <literal>on</> does. However, it is a practically useful setting because
+ it can decrease the response time for the transaction, and causes
+ no data loss unless both the primary and the standby crash and
+ the database of the primary gets corrupted at the same time.
+ </para>
+
+ <para>
+ Setting <varname>synchronous_commit</> to <literal>apply</> will
+ cause each commit to wait for confirmation that the standby has flushed
+ the commit record to durable storage and replayed the committed changes
+ successfully. This provides the same level of durability as <literal>on</>
+ does. This guarantees that any transactions are visible on the standby
+ when they are committed. Note that this makes the transaction commit
+ wait longer time for replication than <literal>on</> or <literal>write</>
+ does because the confirmation about the apply position from the standby
+ is sent less frequently. To decrease the wait time, set
+ <varname>max_standby_streaming_delay</> to a low value.
+ </para>
+
+ <para>
Users will stop waiting if a fast shutdown is requested. However, as
when using asynchronous replication, the server will does not fully
shutdown until all outstanding WAL records are transferred to the currently
***************
*** 1064,1077 **** primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
<title>Planning for High Availability</title>
<para>
! Commits made when <varname>synchronous_commit</> is set to <literal>on</>
! will wait until the sync standby responds. The response may never occur
! if the last, or only, standby should crash.
</para>
<para>
The best solution for avoiding data loss is to ensure you don't lose
! your last remaining sync standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
The first named standby will be used as the synchronous standby. Standbys
listed after this will take over the role of synchronous standby if the
--- 1087,1100 ----
<title>Planning for High Availability</title>
<para>
! Commits made when <varname>synchronous_commit</> is set to <literal>on</>,
! <literal>write</> or <literal>apply</> will wait until the synchronous standby responds.
! The response may never occur if the last, or only, standby should crash.
</para>
<para>
The best solution for avoiding data loss is to ensure you don't lose
! your last remaining synchronous standby. This can be achieved by naming multiple
potential synchronous standbys using <varname>synchronous_standby_names</>.
The first named standby will be used as the synchronous standby. Standbys
listed after this will take over the role of synchronous standby if the
*** a/src/backend/replication/syncrep.c
--- b/src/backend/replication/syncrep.c
***************
*** 20,28 ****
* per-transaction state information.
*
* Replication is either synchronous or not synchronous (async). If it is
! * async, we just fastpath out of here. If it is sync, then in 9.1 we wait
! * for the flush location on the standby before releasing the waiting backend.
! * Further complexity in that interaction is expected in later releases.
*
* The best performing way to manage the waiting backends is to have a
* single ordered queue of waiting backends, so that we can avoid
--- 20,29 ----
* per-transaction state information.
*
* Replication is either synchronous or not synchronous (async). If it is
! * async, we just fastpath out of here. If it is sync, then we wait for
! * the write, flush or apply location on the standby before releasing
! * the waiting backend. Further complexity in that interaction is expected
! * in later releases.
*
* The best performing way to manage the waiting backends is to have a
* single ordered queue of waiting backends, so that we can avoid
***************
*** 67,79 **** char *SyncRepStandbyNames;
static bool announce_next_takeover = true;
! static void SyncRepQueueInsert(void);
static void SyncRepCancelWait(void);
static int SyncRepGetStandbyPriority(void);
#ifdef USE_ASSERT_CHECKING
! static bool SyncRepQueueIsOrderedByLSN(void);
#endif
/*
--- 68,82 ----
static bool announce_next_takeover = true;
! static int SyncRepWaitMode = SYNC_REP_NO_WAIT;
!
! static void SyncRepQueueInsert(int mode);
static void SyncRepCancelWait(void);
static int SyncRepGetStandbyPriority(void);
#ifdef USE_ASSERT_CHECKING
! static bool SyncRepQueueIsOrderedByLSN(int mode);
#endif
/*
***************
*** 120,126 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
* be a low cost check.
*/
if (!WalSndCtl->sync_standbys_defined ||
! XLByteLE(XactCommitLSN, WalSndCtl->lsn))
{
LWLockRelease(SyncRepLock);
return;
--- 123,129 ----
* be a low cost check.
*/
if (!WalSndCtl->sync_standbys_defined ||
! XLByteLE(XactCommitLSN, WalSndCtl->lsn[SyncRepWaitMode]))
{
LWLockRelease(SyncRepLock);
return;
***************
*** 132,139 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
*/
MyProc->waitLSN = XactCommitLSN;
MyProc->syncRepState = SYNC_REP_WAITING;
! SyncRepQueueInsert();
! Assert(SyncRepQueueIsOrderedByLSN());
LWLockRelease(SyncRepLock);
/* Alter ps display to show waiting for sync rep. */
--- 135,142 ----
*/
MyProc->waitLSN = XactCommitLSN;
MyProc->syncRepState = SYNC_REP_WAITING;
! SyncRepQueueInsert(SyncRepWaitMode);
! Assert(SyncRepQueueIsOrderedByLSN(SyncRepWaitMode));
LWLockRelease(SyncRepLock);
/* Alter ps display to show waiting for sync rep. */
***************
*** 267,284 **** SyncRepWaitForLSN(XLogRecPtr XactCommitLSN)
}
/*
! * Insert MyProc into SyncRepQueue, maintaining sorted invariant.
*
* Usually we will go at tail of queue, though it's possible that we arrive
* here out of order, so start at tail and work back to insertion point.
*/
static void
! SyncRepQueueInsert(void)
{
PGPROC *proc;
! proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
! &(WalSndCtl->SyncRepQueue),
offsetof(PGPROC, syncRepLinks));
while (proc)
--- 270,288 ----
}
/*
! * Insert MyProc into the specified SyncRepQueue, maintaining sorted invariant.
*
* Usually we will go at tail of queue, though it's possible that we arrive
* here out of order, so start at tail and work back to insertion point.
*/
static void
! SyncRepQueueInsert(int mode)
{
PGPROC *proc;
! Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
! proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue[mode]),
! &(WalSndCtl->SyncRepQueue[mode]),
offsetof(PGPROC, syncRepLinks));
while (proc)
***************
*** 290,296 **** SyncRepQueueInsert(void)
if (XLByteLT(proc->waitLSN, MyProc->waitLSN))
break;
! proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
}
--- 294,300 ----
if (XLByteLT(proc->waitLSN, MyProc->waitLSN))
break;
! proc = (PGPROC *) SHMQueuePrev(&(WalSndCtl->SyncRepQueue[mode]),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
}
***************
*** 298,304 **** SyncRepQueueInsert(void)
if (proc)
SHMQueueInsertAfter(&(proc->syncRepLinks), &(MyProc->syncRepLinks));
else
! SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue), &(MyProc->syncRepLinks));
}
/*
--- 302,308 ----
if (proc)
SHMQueueInsertAfter(&(proc->syncRepLinks), &(MyProc->syncRepLinks));
else
! SHMQueueInsertAfter(&(WalSndCtl->SyncRepQueue[mode]), &(MyProc->syncRepLinks));
}
/*
***************
*** 368,374 **** SyncRepReleaseWaiters(void)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
volatile WalSnd *syncWalSnd = NULL;
! int numprocs = 0;
int priority = 0;
int i;
--- 372,380 ----
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
volatile WalSnd *syncWalSnd = NULL;
! int numwrite = 0;
! int numflush = 0;
! int numapply = 0;
int priority = 0;
int i;
***************
*** 419,440 **** SyncRepReleaseWaiters(void)
return;
}
! if (XLByteLT(walsndctl->lsn, MyWalSnd->flush))
{
! /*
! * Set the lsn first so that when we wake backends they will release
! * up to this location.
! */
! walsndctl->lsn = MyWalSnd->flush;
! numprocs = SyncRepWakeQueue(false);
}
LWLockRelease(SyncRepLock);
! elog(DEBUG3, "released %d procs up to %X/%X",
! numprocs,
MyWalSnd->flush.xlogid,
! MyWalSnd->flush.xrecoff);
/*
* If we are managing the highest priority standby, though we weren't
--- 425,463 ----
return;
}
! /*
! * Set the lsn first so that when we wake backends they will release
! * up to this location.
! */
! if (XLByteLT(walsndctl->lsn[SYNC_REP_WAIT_WRITE], MyWalSnd->write))
{
! walsndctl->lsn[SYNC_REP_WAIT_WRITE] = MyWalSnd->write;
! numwrite = SyncRepWakeQueue(false, SYNC_REP_WAIT_WRITE);
! }
! if (XLByteLT(walsndctl->lsn[SYNC_REP_WAIT_FLUSH], MyWalSnd->flush))
! {
! walsndctl->lsn[SYNC_REP_WAIT_FLUSH] = MyWalSnd->flush;
! numflush = SyncRepWakeQueue(false, SYNC_REP_WAIT_FLUSH);
! }
! if (XLByteLT(walsndctl->lsn[SYNC_REP_WAIT_APPLY], MyWalSnd->apply))
! {
! walsndctl->lsn[SYNC_REP_WAIT_APPLY] = MyWalSnd->apply;
! numapply = SyncRepWakeQueue(false, SYNC_REP_WAIT_APPLY);
}
LWLockRelease(SyncRepLock);
! elog(DEBUG3, "released %d procs up to write %X/%X, %d procs up to flush %X/%X, "
! "%d procs up to apply %X/%X",
! numwrite,
! MyWalSnd->write.xlogid,
! MyWalSnd->write.xrecoff,
! numflush,
MyWalSnd->flush.xlogid,
! MyWalSnd->flush.xrecoff,
! numapply,
! MyWalSnd->apply.xlogid,
! MyWalSnd->apply.xrecoff);
/*
* If we are managing the highest priority standby, though we weren't
***************
*** 507,530 **** SyncRepGetStandbyPriority(void)
}
/*
! * Walk queue from head. Set the state of any backends that need to be woken,
! * remove them from the queue, and then wake them. Pass all = true to wake
! * whole queue; otherwise, just wake up to the walsender's LSN.
*
* Must hold SyncRepLock.
*/
int
! SyncRepWakeQueue(bool all)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc = NULL;
PGPROC *thisproc = NULL;
int numprocs = 0;
! Assert(SyncRepQueueIsOrderedByLSN());
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! &(WalSndCtl->SyncRepQueue),
offsetof(PGPROC, syncRepLinks));
while (proc)
--- 530,555 ----
}
/*
! * Walk the specified queue from head. Set the state of any backends that
! * need to be woken, remove them from the queue, and then wake them.
! * Pass all = true to wake whole queue; otherwise, just wake up to
! * the walsender's LSN.
*
* Must hold SyncRepLock.
*/
int
! SyncRepWakeQueue(bool all, int mode)
{
volatile WalSndCtlData *walsndctl = WalSndCtl;
PGPROC *proc = NULL;
PGPROC *thisproc = NULL;
int numprocs = 0;
! Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
! Assert(SyncRepQueueIsOrderedByLSN(mode));
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
! &(WalSndCtl->SyncRepQueue[mode]),
offsetof(PGPROC, syncRepLinks));
while (proc)
***************
*** 532,538 **** SyncRepWakeQueue(bool all)
/*
* Assume the queue is ordered by LSN
*/
! if (!all && XLByteLT(walsndctl->lsn, proc->waitLSN))
return numprocs;
/*
--- 557,563 ----
/*
* Assume the queue is ordered by LSN
*/
! if (!all && XLByteLT(walsndctl->lsn[mode], proc->waitLSN))
return numprocs;
/*
***************
*** 540,546 **** SyncRepWakeQueue(bool all)
* thisproc is valid, proc may be NULL after this.
*/
thisproc = proc;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
--- 565,571 ----
* thisproc is valid, proc may be NULL after this.
*/
thisproc = proc;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
***************
*** 588,594 **** SyncRepUpdateSyncStandbysDefined(void)
* wants synchronous replication, we'd better wake them up.
*/
if (!sync_standbys_defined)
! SyncRepWakeQueue(true);
/*
* Only allow people to join the queue when there are synchronous
--- 613,624 ----
* wants synchronous replication, we'd better wake them up.
*/
if (!sync_standbys_defined)
! {
! int i;
!
! for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
! SyncRepWakeQueue(true, i);
! }
/*
* Only allow people to join the queue when there are synchronous
***************
*** 605,620 **** SyncRepUpdateSyncStandbysDefined(void)
#ifdef USE_ASSERT_CHECKING
static bool
! SyncRepQueueIsOrderedByLSN(void)
{
PGPROC *proc = NULL;
XLogRecPtr lastLSN;
lastLSN.xlogid = 0;
lastLSN.xrecoff = 0;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
! &(WalSndCtl->SyncRepQueue),
offsetof(PGPROC, syncRepLinks));
while (proc)
--- 635,652 ----
#ifdef USE_ASSERT_CHECKING
static bool
! SyncRepQueueIsOrderedByLSN(int mode)
{
PGPROC *proc = NULL;
XLogRecPtr lastLSN;
+ Assert(mode >= 0 && mode < NUM_SYNC_REP_WAIT_MODE);
+
lastLSN.xlogid = 0;
lastLSN.xrecoff = 0;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
! &(WalSndCtl->SyncRepQueue[mode]),
offsetof(PGPROC, syncRepLinks));
while (proc)
***************
*** 628,634 **** SyncRepQueueIsOrderedByLSN(void)
lastLSN = proc->waitLSN;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
}
--- 660,666 ----
lastLSN = proc->waitLSN;
! proc = (PGPROC *) SHMQueueNext(&(WalSndCtl->SyncRepQueue[mode]),
&(proc->syncRepLinks),
offsetof(PGPROC, syncRepLinks));
}
***************
*** 675,677 **** check_synchronous_standby_names(char **newval, void **extra, GucSource source)
--- 707,729 ----
return true;
}
+
+ void
+ assign_synchronous_commit(int newval, void *extra)
+ {
+ switch (newval)
+ {
+ case SYNCHRONOUS_COMMIT_REMOTE_WRITE:
+ SyncRepWaitMode = SYNC_REP_WAIT_WRITE;
+ break;
+ case SYNCHRONOUS_COMMIT_REMOTE_FLUSH:
+ SyncRepWaitMode = SYNC_REP_WAIT_FLUSH;
+ break;
+ case SYNCHRONOUS_COMMIT_REMOTE_APPLY:
+ SyncRepWaitMode = SYNC_REP_WAIT_APPLY;
+ break;
+ default:
+ SyncRepWaitMode = SYNC_REP_NO_WAIT;
+ break;
+ }
+ }
*** a/src/backend/replication/walsender.c
--- b/src/backend/replication/walsender.c
***************
*** 1410,1416 **** WalSndShmemInit(void)
/* First time through, so initialize */
MemSet(WalSndCtl, 0, WalSndShmemSize());
! SHMQueueInit(&(WalSndCtl->SyncRepQueue));
for (i = 0; i < max_wal_senders; i++)
{
--- 1410,1417 ----
/* First time through, so initialize */
MemSet(WalSndCtl, 0, WalSndShmemSize());
! for (i = 0; i < NUM_SYNC_REP_WAIT_MODE; i++)
! SHMQueueInit(&(WalSndCtl->SyncRepQueue[i]));
for (i = 0; i < max_wal_senders; i++)
{
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 370,380 **** static const struct config_enum_entry constraint_exclusion_options[] = {
};
/*
! * Although only "on", "off", and "local" are documented, we
* accept all the likely variants of "on" and "off".
*/
static const struct config_enum_entry synchronous_commit_options[] = {
{"local", SYNCHRONOUS_COMMIT_LOCAL_FLUSH, false},
{"on", SYNCHRONOUS_COMMIT_ON, false},
{"off", SYNCHRONOUS_COMMIT_OFF, false},
{"true", SYNCHRONOUS_COMMIT_ON, true},
--- 370,382 ----
};
/*
! * Although only "on", "off", "write", "apply" and "local" are documented, we
* accept all the likely variants of "on" and "off".
*/
static const struct config_enum_entry synchronous_commit_options[] = {
{"local", SYNCHRONOUS_COMMIT_LOCAL_FLUSH, false},
+ {"write", SYNCHRONOUS_COMMIT_REMOTE_WRITE, false},
+ {"apply", SYNCHRONOUS_COMMIT_REMOTE_APPLY, false},
{"on", SYNCHRONOUS_COMMIT_ON, false},
{"off", SYNCHRONOUS_COMMIT_OFF, false},
{"true", SYNCHRONOUS_COMMIT_ON, true},
***************
*** 3164,3170 **** static struct config_enum ConfigureNamesEnum[] =
},
&synchronous_commit,
SYNCHRONOUS_COMMIT_ON, synchronous_commit_options,
! NULL, NULL, NULL
},
{
--- 3166,3172 ----
},
&synchronous_commit,
SYNCHRONOUS_COMMIT_ON, synchronous_commit_options,
! NULL, assign_synchronous_commit, NULL
},
{
*** a/src/include/access/xact.h
--- b/src/include/access/xact.h
***************
*** 55,61 **** typedef enum
{
SYNCHRONOUS_COMMIT_OFF, /* asynchronous commit */
SYNCHRONOUS_COMMIT_LOCAL_FLUSH, /* wait for local flush only */
! SYNCHRONOUS_COMMIT_REMOTE_FLUSH /* wait for local and remote flush */
} SyncCommitLevel;
/* Define the default setting for synchonous_commit */
--- 55,63 ----
{
SYNCHRONOUS_COMMIT_OFF, /* asynchronous commit */
SYNCHRONOUS_COMMIT_LOCAL_FLUSH, /* wait for local flush only */
! SYNCHRONOUS_COMMIT_REMOTE_WRITE, /* wait for local flush and remote write */
! SYNCHRONOUS_COMMIT_REMOTE_FLUSH, /* wait for local and remote flush */
! SYNCHRONOUS_COMMIT_REMOTE_APPLY /* wait for local flush and remote apply */
} SyncCommitLevel;
/* Define the default setting for synchonous_commit */
*** a/src/include/replication/syncrep.h
--- b/src/include/replication/syncrep.h
***************
*** 15,20 ****
--- 15,31 ----
#include "utils/guc.h"
+ #define SyncRepRequested() \
+ (max_wal_senders > 0 && synchronous_commit > SYNCHRONOUS_COMMIT_LOCAL_FLUSH)
+
+ /* SyncRepWaitMode */
+ #define SYNC_REP_NO_WAIT -1
+ #define SYNC_REP_WAIT_WRITE 0
+ #define SYNC_REP_WAIT_FLUSH 1
+ #define SYNC_REP_WAIT_APPLY 2
+
+ #define NUM_SYNC_REP_WAIT_MODE 3
+
/* syncRepState */
#define SYNC_REP_NOT_WAITING 0
#define SYNC_REP_WAITING 1
***************
*** 37,44 **** extern void SyncRepReleaseWaiters(void);
extern void SyncRepUpdateSyncStandbysDefined(void);
/* called by various procs */
! extern int SyncRepWakeQueue(bool all);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
#endif /* _SYNCREP_H */
--- 48,56 ----
extern void SyncRepUpdateSyncStandbysDefined(void);
/* called by various procs */
! extern int SyncRepWakeQueue(bool all, int mode);
extern bool check_synchronous_standby_names(char **newval, void **extra, GucSource source);
+ extern void assign_synchronous_commit(int newval, void *extra);
#endif /* _SYNCREP_H */
*** a/src/include/replication/walsender_private.h
--- b/src/include/replication/walsender_private.h
***************
*** 14,19 ****
--- 14,20 ----
#include "access/xlog.h"
#include "nodes/nodes.h"
+ #include "replication/syncrep.h"
#include "storage/latch.h"
#include "storage/shmem.h"
#include "storage/spin.h"
***************
*** 68,82 **** extern WalSnd *MyWalSnd;
typedef struct
{
/*
! * Synchronous replication queue. Protected by SyncRepLock.
*/
! SHM_QUEUE SyncRepQueue;
/*
* Current location of the head of the queue. All waiters should have a
* waitLSN that follows this value. Protected by SyncRepLock.
*/
! XLogRecPtr lsn;
/*
* Are any sync standbys defined? Waiting backends can't reload the
--- 69,84 ----
typedef struct
{
/*
! * Synchronous replication queue with one queue per request type.
! * Protected by SyncRepLock.
*/
! SHM_QUEUE SyncRepQueue[NUM_SYNC_REP_WAIT_MODE];
/*
* Current location of the head of the queue. All waiters should have a
* waitLSN that follows this value. Protected by SyncRepLock.
*/
! XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE];
/*
* Are any sync standbys defined? Waiting backends can't reload the
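To illustrate the per-mode bookkeeping that the patch above introduces, here is a minimal standalone sketch in C. It is not the PostgreSQL code: LSNs are flattened to a 64-bit integer, the shared-memory queue is replaced by a sorted array of wait LSNs, and the names Lsn, ReleaseState and release_waiters are hypothetical. It only shows the idea that each wait mode keeps its own released position, and that waiters are released only when that position advances.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: an LSN is modelled as a plain 64-bit integer. */
typedef uint64_t Lsn;

/* Mirrors SYNC_REP_WAIT_WRITE/FLUSH/APPLY from the patch. */
enum { WAIT_WRITE = 0, WAIT_FLUSH = 1, WAIT_APPLY = 2, NUM_WAIT_MODES = 3 };

/* Hypothetical stand-in for the lsn[] array in WalSndCtlData. */
typedef struct
{
    Lsn released[NUM_WAIT_MODES];
} ReleaseState;

/*
 * If the standby reported a position beyond what was already released for
 * this mode, advance the released position and return how many waiters
 * (sorted by ascending wait LSN) are now satisfied. Otherwise release none.
 */
size_t
release_waiters(ReleaseState *st, int mode, Lsn reported,
                const Lsn *waiters, size_t nwaiters)
{
    size_t n = 0;

    assert(mode >= 0 && mode < NUM_WAIT_MODES);

    if (!(st->released[mode] < reported))
        return 0;               /* nothing new for this mode */

    st->released[mode] = reported;
    while (n < nwaiters && waiters[n] <= st->released[mode])
        n++;
    return n;
}
```

With waiters at 10, 20 and 30, a write report of 25 satisfies the first two, while a flush report of 15 satisfies only the first and leaves the apply position untouched.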
On Mon, Jan 16, 2012 at 12:45 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Done. Attached is the updated version of the patch.
Thanks.
I'll review this first, but can't start immediately. Please expect
something back in 2 days.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 16, 2012 at 4:17 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Jan 16, 2012 at 12:45 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Done. Attached is the updated version of the patch.
Thanks.
I'll review this first, but can't start immediately. Please expect
something back in 2 days.
On initial review this looks fine.
I'll do a more thorough hands-on review now and commit if still OK.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 16, 2012 at 12:45 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Please add the Apply mode.
OK, will do.
Done. Attached is the updated version of the patch.
I notice that the Apply mode isn't fully implemented. I had in mind
that you would add the latch required to respond more quickly when
only the Apply pointer has changed.
Is there a reason not to use WaitLatchOrSocket() in WALReceiver? Or
was there another reason for not implementing that?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 23, 2012 at 4:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Jan 16, 2012 at 12:45 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Please add the Apply mode.
OK, will do.
Done. Attached is the updated version of the patch.
I notice that the Apply mode isn't fully implemented. I had in mind
that you would add the latch required to respond more quickly when
only the Apply pointer has changed.
Is there a reason not to use WaitLatchOrSocket() in WALReceiver? Or
was there another reason for not implementing that?
I agree that the feature you pointed out is useful for the Apply mode. But
I'm afraid that implementing that feature is not easy and would make
the patch big and complicated, so I didn't implement the Apply mode first.
To make the walreceiver call WaitLatchOrSocket(), we would need to
merge it and libpq_select() into one function. But the former is the backend
function and the latter is the frontend one. Now I have no good idea to
merge them cleanly.
If we send back the reply as soon as the Apply pointer is changed, I'm
afraid quite lots of reply messages are sent frequently, which might
cause performance problem. This is also one of the reasons why I didn't
implement the quick-response feature. To address this problem, we might
need to change the master so that it sends the Wait pointer to the standby,
and change the standby so that it replies whenever the Apply pointer
catches up with the Wait one. This can reduce the number of useless
reply from the standby about the Apply pointer.
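The Wait-pointer idea in the paragraph above can be sketched roughly as follows. This is only an illustration under stated assumptions: nothing like it exists in the patch, LSNs are modelled as 64-bit integers, and StandbyState and advance_apply are hypothetical names. The standby remembers the Wait pointer sent by the master and replies only once, when the Apply pointer first catches up with it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: an LSN is modelled as a plain 64-bit integer. */
typedef uint64_t Lsn;

/* Hypothetical standby-side state for the Wait-pointer scheme. */
typedef struct
{
    Lsn wait_lsn;       /* position the master says a backend is waiting on */
    Lsn apply_lsn;      /* position replayed so far on the standby */
    Lsn replied_lsn;    /* apply position carried by the last reply */
} StandbyState;

/*
 * Record replay progress. Returns true only when this step makes the Apply
 * pointer reach the Wait pointer for the first time, i.e. when a reply would
 * actually release somebody on the master.
 */
bool
advance_apply(StandbyState *st, Lsn new_apply)
{
    assert(st != NULL);

    st->apply_lsn = new_apply;
    if (st->replied_lsn < st->wait_lsn && st->apply_lsn >= st->wait_lsn)
    {
        st->replied_lsn = st->apply_lsn;
        return true;
    }
    return false;
}
```

In this model, replaying many small records below the Wait pointer produces no replies at all, which is the reduction in reply traffic the proposal is after.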
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Jan 23, 2012 at 9:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Jan 23, 2012 at 4:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Jan 16, 2012 at 12:45 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Please add the Apply mode.
OK, will do.
Done. Attached is the updated version of the patch.
I notice that the Apply mode isn't fully implemented. I had in mind
that you would add the latch required to respond more quickly when
only the Apply pointer has changed.
Is there a reason not to use WaitLatchOrSocket() in WALReceiver? Or
was there another reason for not implementing that?
I agree that the feature you pointed out is useful for the Apply mode. But
I'm afraid that implementing that feature is not easy and would make
the patch big and complicated, so I didn't implement the Apply mode first.
To make the walreceiver call WaitLatchOrSocket(), we would need to
merge it and libpq_select() into one function. But the former is the backend
function and the latter is the frontend one. Now I have no good idea to
merge them cleanly.
We can wait on the socket wherever it comes from. poll/select doesn't
care how we got the socket.
So we just need a common handler that calls either
walreceiver/libpqwalreceiver function as required to handle the
wakeup.
If we send back the reply as soon as the Apply pointer is changed, I'm
afraid quite lots of reply messages are sent frequently, which might
cause performance problem. This is also one of the reasons why I didn't
implement the quick-response feature. To address this problem, we might
need to change the master so that it sends the Wait pointer to the standby,
and change the standby so that it replies whenever the Apply pointer
catches up with the Wait one. This can reduce the number of useless
reply from the standby about the Apply pointer.
We send back one reply per incoming message. The incoming messages
don't know request state and checking that has a cost which I don't
think is an appropriate payment since we only need this info when the
link goes quiet.
When the link goes quiet we still need to send replies if we have
apply mode, but we only need to send apply messages if the lsn has
changed because of a commit. That will considerably reduce the
messages sent so I don't see a problem.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 23, 2012 at 6:28 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Jan 23, 2012 at 9:02 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Jan 23, 2012 at 4:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Jan 16, 2012 at 12:45 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
Please add the Apply mode.
OK, will do.
Done. Attached is the updated version of the patch.
I notice that the Apply mode isn't fully implemented. I had in mind
that you would add the latch required to respond more quickly when
only the Apply pointer has changed.
Is there a reason not to use WaitLatchOrSocket() in WALReceiver? Or
was there another reason for not implementing that?
I agree that the feature you pointed out is useful for the Apply mode. But
I'm afraid that implementing that feature is not easy and would make
the patch big and complicated, so I didn't implement the Apply mode first.
To make the walreceiver call WaitLatchOrSocket(), we would need to
merge it and libpq_select() into one function. But the former is the backend
function and the latter is the frontend one. Now I have no good idea to
merge them cleanly.
We can wait on the socket wherever it comes from. poll/select doesn't
care how we got the socket.
So we just need a common handler that calls either
walreceiver/libpqwalreceiver function as required to handle the
wakeup.
I'm afraid I could not understand your idea. Could you explain it in
more detail?
If we send back the reply as soon as the Apply pointer is changed, I'm
afraid quite lots of reply messages are sent frequently, which might
cause performance problem. This is also one of the reasons why I didn't
implement the quick-response feature. To address this problem, we might
need to change the master so that it sends the Wait pointer to the standby,
and change the standby so that it replies whenever the Apply pointer
catches up with the Wait one. This can reduce the number of useless
reply from the standby about the Apply pointer.
We send back one reply per incoming message. The incoming messages
don't know request state and checking that has a cost which I don't
think is an appropriate payment since we only need this info when the
link goes quiet.
When the link goes quiet we still need to send replies if we have
apply mode, but we only need to send apply messages if the lsn has
changed because of a commit. That will considerably reduce the
messages sent so I don't see a problem.
You mean to change the meaning of apply_location? Currently it indicates
the end + 1 of the last replayed WAL record, regardless of whether it's
a commit record or not. So too many replies can be sent per incoming
message because it might contain many WAL records. But you mean to
change apply_location only when a commit record is replayed?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Jan 23, 2012 at 10:03 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
To make the walreceiver call WaitLatchOrSocket(), we would need to
merge it and libpq_select() into one function. But the former is the backend
function and the latter is the frontend one. Now I have no good idea to
merge them cleanly.
We can wait on the socket wherever it comes from. poll/select doesn't
care how we got the socket.
So we just need a common handler that calls either
walreceiver/libpqwalreceiver function as required to handle the
wakeup.
I'm afraid I could not understand your idea. Could you explain it in
more detail?
We either tell libpqwalreceiver about the latch, or we tell
walreceiver about the socket used by libpqwalreceiver.
In either case we share a pointer from one module to another.
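The point that poll/select does not care where a descriptor came from can be shown with a small POSIX sketch. It is not PostgreSQL code and does not use WaitLatchOrSocket(): the latch is modelled here as the read end of a pipe (an assumption made for illustration), and wait_socket_or_latch and the WAKE_* constants are hypothetical names.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result flags; several may be OR'ed together. */
enum { WAKE_ERROR = -1, WAKE_TIMEOUT = 0, WAKE_SOCKET = 1, WAKE_LATCH = 2 };

/*
 * Wait until the socket is readable, the latch pipe has been written to,
 * or timeout_ms elapses (-1 waits forever). The caller only has to hand
 * over the two descriptors; how each one was obtained does not matter.
 */
int
wait_socket_or_latch(int sock_fd, int latch_fd, int timeout_ms)
{
    struct pollfd fds[2];
    int result = WAKE_TIMEOUT;
    int rc;

    assert(sock_fd >= 0 && latch_fd >= 0);

    fds[0].fd = sock_fd;
    fds[0].events = POLLIN;
    fds[0].revents = 0;
    fds[1].fd = latch_fd;
    fds[1].events = POLLIN;
    fds[1].revents = 0;

    rc = poll(fds, 2, timeout_ms);
    if (rc < 0)
        return WAKE_ERROR;
    if (rc == 0)
        return WAKE_TIMEOUT;

    if (fds[0].revents & (POLLIN | POLLHUP | POLLERR))
        result |= WAKE_SOCKET;
    if (fds[1].revents & POLLIN)
    {
        char buf[16];
        ssize_t n = read(latch_fd, buf, sizeof(buf)); /* reset the "latch" */

        (void) n;
        result |= WAKE_LATCH;
    }
    return result;
}
```

A caller would typically loop on this function, processing incoming WAL when WAKE_SOCKET is set and sending a reply when WAKE_LATCH is set. Whether libpqwalreceiver exposes its socket or is told about the latch is exactly the open question in the thread; the sketch assumes the former.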
If we send back the reply as soon as the Apply pointer is changed, I'm
afraid quite lots of reply messages are sent frequently, which might
cause performance problem. This is also one of the reasons why I didn't
implement the quick-response feature. To address this problem, we might
need to change the master so that it sends the Wait pointer to the standby,
and change the standby so that it replies whenever the Apply pointer
catches up with the Wait one. This can reduce the number of useless
reply from the standby about the Apply pointer.
We send back one reply per incoming message. The incoming messages
don't know request state and checking that has a cost which I don't
think is an appropriate payment since we only need this info when the
link goes quiet.
When the link goes quiet we still need to send replies if we have
apply mode, but we only need to send apply messages if the lsn has
changed because of a commit. That will considerably reduce the
messages sent so I don't see a problem.
You mean to change the meaning of apply_location? Currently it indicates
the end + 1 of the last replayed WAL record, regardless of whether it's
a commit record or not. So too many replies can be sent per incoming
message because it might contain many WAL records. But you mean to
change apply_location only when a commit record is replayed?
There is no change to the meaning of apply_location. The only change
is that we send that message only when it has an updated value of
committed lsn.
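One way to read the rule just described is sketched below. This is an interpretation, not code from the patch: LSNs are modelled as 64-bit integers, and ApplyReplyState, note_replayed_record and should_reply_when_idle are hypothetical names. The apply position itself keeps its meaning; the sketch only decides whether an extra reply is worth sending once the link has gone quiet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: an LSN is modelled as a plain 64-bit integer. */
typedef uint64_t Lsn;

/* Hypothetical standby-side bookkeeping for idle-time apply replies. */
typedef struct
{
    Lsn last_commit_applied;    /* end of the latest replayed commit record */
    Lsn last_commit_reported;   /* value covered by the last idle-time reply */
} ApplyReplyState;

/* Called after each WAL record is replayed; only commits are remembered. */
void
note_replayed_record(ApplyReplyState *st, Lsn end_lsn, bool is_commit)
{
    assert(st != NULL);
    if (is_commit)
        st->last_commit_applied = end_lsn;
}

/* Called when the link is quiet: reply only if a new commit was replayed. */
bool
should_reply_when_idle(ApplyReplyState *st)
{
    assert(st != NULL);
    if (st->last_commit_applied == st->last_commit_reported)
        return false;
    st->last_commit_reported = st->last_commit_applied;
    return true;
}
```

Replaying non-commit records never triggers a reply in this model, so a message containing many WAL records leads to at most one extra reply.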
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jan 23, 2012 at 9:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Mon, Jan 23, 2012 at 10:03 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
To make the walreceiver call WaitLatchOrSocket(), we would need to
merge it and libpq_select() into one function. But the former is the backend
function and the latter is the frontend one. Now I have no good idea to
merge them cleanly.
We can wait on the socket wherever it comes from. poll/select doesn't
care how we got the socket.
So we just need a common handler that calls either
walreceiver/libpqwalreceiver function as required to handle the
wakeup.
I'm afraid I could not understand your idea. Could you explain it in
more detail?
We either tell libpqwalreceiver about the latch, or we tell
walreceiver about the socket used by libpqwalreceiver.
In either case we share a pointer from one module to another.
The former seems difficult because it's not easy to link libpqwalreceiver.so
to the latch. I will consider the latter.
If we send back the reply as soon as the Apply pointer is changed, I'm
afraid quite lots of reply messages are sent frequently, which might
cause performance problem. This is also one of the reasons why I didn't
implement the quick-response feature. To address this problem, we might
need to change the master so that it sends the Wait pointer to the standby,
and change the standby so that it replies whenever the Apply pointer
catches up with the Wait one. This can reduce the number of useless
reply from the standby about the Apply pointer.
We send back one reply per incoming message. The incoming messages
don't know request state and checking that has a cost which I don't
think is an appropriate payment since we only need this info when the
link goes quiet.
When the link goes quiet we still need to send replies if we have
apply mode, but we only need to send apply messages if the lsn has
changed because of a commit. That will considerably reduce the
messages sent so I don't see a problem.
You mean to change the meaning of apply_location? Currently it indicates
the end + 1 of the last replayed WAL record, regardless of whether it's
a commit record or not. So too many replies can be sent per incoming
message because it might contain many WAL records. But you mean to
change apply_location only when a commit record is replayed?
There is no change to the meaning of apply_location. The only change
is that we send that message only when it has an updated value of
committed lsn.
This means that apply_location might return a different location from
pg_last_xlog_replay_location() on the standby, though in 9.1 they return
the same value. That might confuse users, no?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Tue, Jan 24, 2012 at 10:47 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
I'm afraid I could not understand your idea. Could you explain it in
more detail?
We either tell libpqwalreceiver about the latch, or we tell
walreceiver about the socket used by libpqwalreceiver.
In either case we share a pointer from one module to another.
The former seems difficult because it's not easy to link libpqwalreceiver.so
to the latch. I will consider the latter.
Yes, it might be too hard, but let's look.
You mean to change the meaning of apply_location? Currently it indicates
the end + 1 of the last replayed WAL record, regardless of whether it's
a commit record or not. So too many replies can be sent per incoming
message because it might contain many WAL records. But you mean to
change apply_location only when a commit record is replayed?
There is no change to the meaning of apply_location. The only change
is that we send that message only when it has an updated value of
committed lsn.
This means that apply_location might return a different location from
pg_last_xlog_replay_location() on the standby, though in 9.1 they return
the same value. That might confuse users, no?
The two values only match on a quiet system anyway, since both are
moving forwards.
They will still match on a quiet system.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 24, 2012 at 11:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Yes, it might be too hard, but let's look.
Your committer has timed out.... ;-)
committed write mode only
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jan 25, 2012 at 5:28 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On Tue, Jan 24, 2012 at 11:00 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
Yes, it might be too hard, but let's look.
Your committer has timed out.... ;-)
committed write mode only
Thanks for the commit!
The apply mode is attractive, but I need more time to implement it completely.
I might not be able to complete it within this CF, so I think committing only
the write mode is the right decision. If I have time after all of the patches
I'm interested in have been committed, I will try the apply mode again, but
maybe for 9.3dev.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center