Latch for the WAL writer - further reducing idle wake-ups.
Attached patch latches up the WAL Writer, reducing wake-ups and thus
saving electricity in a way that is more-or-less analogous to my work
on the BGWriter:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6d90eaaa89a007e0d365f49d6436f35d2392cfeb
I am hoping this gets into 9.2. I am conscious of the fact that this
is quite late, but the patch addresses an open item, the concluding
part of a much wider feature. In any case, it is a useful patch that
ought to be committed at some point. I should point out:
1. This functionality was covered by the group commit patch that I
worked on back in January, which was submitted in advance of the
commitfest deadline. However, an alternative implementation was
ultimately committed that did not consider WAL Writer wake-ups.
2. The WAL writer is the most important auxiliary process to latch-up.
Though it is tied with the BGWriter at 5 wake-ups per second by
default, I consider the WAL Writer to be more important than the
BGWriter because I find it much more plausible that the WAL Writer
really won't need to be around for much of the time, as with a
read-mostly workload. "Cloud" type deployments often have read-mostly
workloads, so we can still save some power even if the DB is actually
servicing lots of read queries. That being the case, it would be a
shame if we didn't get this last one in, as it adds a lot more value
than any of the other patches.
3. This is a fairly simple patch; as I've said, it works in a way that
is quite analogous to the BGWriter patch, applying lessons learned
there.
With this patch, my instrumentation shows that wake-ups when Postgres
reaches a fully idle state are just 2.7 per second for the entire
postgres process group, quite an improvement on the 7.6 per second in
HEAD. This is exactly what you'd expect from a reduction of 5 wake-ups
per second to 0.1 per second on average for the WAL Writer.
I have determined this with PowerTOP 1.13 on my Fedora 16 laptop. Here
is an example session, begun after the cluster reached a fully idle
state, with this patch applied (if, alternatively, I want to see
things at per-process granularity, I can get that from PowerTOP 1.98
beta 1, which is available from my system's package manager):
[peter@peterlaptop powertop-1.13]$ sudo ./powertop -d --time=300
[sudo] password for peter:
PowerTOP 1.13 (C) 2007 - 2010 Intel Corporation
Collecting data for 300 seconds
Cn Avg residency
C0 (cpu running) ( 2.8%)
polling 0.0ms ( 0.0%)
C1 mwait 0.5ms ( 1.0%)
C2 mwait 0.9ms ( 0.6%)
C3 mwait 1.4ms ( 0.1%)
C4 mwait 6.7ms (95.4%)
P-states (frequencies)
2.61 Ghz 5.7%
1.80 Ghz 0.1%
1200 Mhz 0.1%
1000 Mhz 0.2%
800 Mhz 93.5%
Wakeups-from-idle per second : 171.3 interval: 300.0s
no ACPI power usage estimate available
Top causes for wakeups:
23.0% (134.5) chrome
***SNIP***
0.5% ( 2.7) postgres
***SNIP***
This is a rather low number that will make us really competitive with
other RDBMSs in this area. Recall that we started from 11.5 wake-ups
for an idle Postgres cluster with a default configuration.
To put the 2.7 number in context, I measured MySQL's wake-ups at 2.2
last year (mysql-server version 5.1.56, Fedora 14), though I
subsequently saw much higher numbers (over 20 per second) for version
5.5.19 on Fedora 16, so you should probably take that with a grain of
salt - I don't know enough about MySQL to be sure that comparing that
number with the number of wake-ups Postgres has with a stock
postgresql.conf is an objective comparison.
I've employed the same trick used when a buffer is dirtied for the
BGWriter - most of the time, the SetLatch() calls will check a single
flag, and find it already set. We are careful to only "arm" the latch
with a call to ResetLatch() when it is really needed. Rather than
waiting for the clocksweep to be lapped, we wait for a set number of
iterations of consistent inactivity.
I've made the WAL Writer use its process latch, rather than the latch
that was previously within XLogCtl. This seems much more idiomatic, as
in doing so we reserve the right to register generic signal handlers.
With a non-process latch, we'd have to worry about signal invalidation
issues on an ongoing basis, since the handler wouldn't be calling
SetLatch() against the latch we waited on. I have also added a comment
in latch.h generally advising against ad-hoc shared latches where the
process latch can be used instead.
I took initial steps to quantify the performance hit from this patch.
A simple "insert.sql" pgbench-tools benchmark on my laptop, with a
generic configuration, showed no problems, though I do not assume that
this conclusively proves the case. Results:
http://walwriterlatch.staticloud.com/
My choice of XLogInsert() as an additional site at which to call
SetLatch() was one that wasn't taken easily, and frankly I'm not
entirely confident that I couldn't have been just as effective while
placing the SetLatch() call in a less hot, perhaps higher-level
codepath. That said, MarkBufferDirty() is also a very hot code path,
and it's where one of the SetLatch() calls goes in the earlier
BGWriter patch, besides which I haven't been able to quantify any
performance hit as yet.
Thoughts?
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
walwriter_latch_v1_2012_05_02.patch (application/octet-stream)
diff src/backend/access/transam/xlog.c
index 8d0aabf..e5adfab
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** typedef struct XLogCtlData
*** 434,444 ****
Latch recoveryWakeupLatch;
/*
- * WALWriterLatch is used to wake up the WALWriter to write some WAL.
- */
- Latch WALWriterLatch;
-
- /*
* During recovery, we keep a copy of the latest checkpoint record here.
* Used by the background writer when it wants to create a restartpoint.
*
--- 434,439 ----
*************** begin:;
*** 1213,1218 ****
--- 1208,1220 ----
END_CRIT_SECTION();
+ /*
+ * Now that we've gone through with inserting WAL, nudge the WALWriter,
+ * provoking a set number of iterations.
+ */
+ if (ProcGlobal->walwriterLatch)
+ SetLatch(ProcGlobal->walwriterLatch);
+
return RecPtr;
}
*************** XLogSetAsyncXactLSN(XLogRecPtr asyncXact
*** 1935,1941 ****
/*
* Nudge the WALWriter if we have a full page of WAL to write.
*/
! SetLatch(&XLogCtl->WALWriterLatch);
}
/*
--- 1937,1944 ----
/*
* Nudge the WALWriter if we have a full page of WAL to write.
*/
! if (ProcGlobal->walwriterLatch)
! SetLatch(ProcGlobal->walwriterLatch);
}
/*
*************** XLogFlush(XLogRecPtr record)
*** 2173,2188 ****
* case for async commits.)
*
* This routine is invoked periodically by the background walwriter process.
*/
! void
XLogBackgroundFlush(void)
{
XLogRecPtr WriteRqstPtr;
bool flexible = true;
/* XLOG doesn't need flushing during recovery */
if (RecoveryInProgress())
! return;
/* read LogwrtResult and update local state */
{
--- 2176,2194 ----
* case for async commits.)
*
* This routine is invoked periodically by the background walwriter process.
+ *
+ * Returns a value indicating if useful work was performed.
*/
! bool
XLogBackgroundFlush(void)
{
XLogRecPtr WriteRqstPtr;
bool flexible = true;
+ bool worked = false;
/* XLOG doesn't need flushing during recovery */
if (RecoveryInProgress())
! return false;
/* read LogwrtResult and update local state */
{
*************** XLogBackgroundFlush(void)
*** 2224,2230 ****
XLogFileClose();
}
}
! return;
}
#ifdef WAL_DEBUG
--- 2230,2236 ----
XLogFileClose();
}
}
! return false;
}
#ifdef WAL_DEBUG
*************** XLogBackgroundFlush(void)
*** 2247,2256 ****
--- 2253,2265 ----
WriteRqst.Write = WriteRqstPtr;
WriteRqst.Flush = WriteRqstPtr;
XLogWrite(WriteRqst, flexible, false);
+ worked = true;
}
LWLockRelease(WALWriteLock);
END_CRIT_SECTION();
+
+ return worked;
}
/*
*************** XLOGShmemInit(void)
*** 5101,5107 ****
XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
SpinLockInit(&XLogCtl->info_lck);
InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
- InitSharedLatch(&XLogCtl->WALWriterLatch);
/*
* If we are not in bootstrap mode, pg_control should already exist. Read
--- 5110,5115 ----
*************** WakeupRecovery(void)
*** 10479,10490 ****
{
SetLatch(&XLogCtl->recoveryWakeupLatch);
}
-
- /*
- * Manage the WALWriterLatch
- */
- Latch *
- WALWriterLatch(void)
- {
- return &XLogCtl->WALWriterLatch;
- }
--- 10487,10489 ----
diff src/backend/postmaster/walwriter.c
index 08ef946..eb09359
*** a/src/backend/postmaster/walwriter.c
--- b/src/backend/postmaster/walwriter.c
***************
*** 54,59 ****
--- 54,60 ----
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/pmsignal.h"
+ #include "storage/proc.h"
#include "storage/smgr.h"
#include "utils/guc.h"
#include "utils/hsearch.h"
***************
*** 67,81 ****
--- 68,93 ----
int WalWriterDelay = 200;
/*
+ * Number of loops before each nap is extended to a hibernation, and time to
+ * sleep between walwriter rounds when hibernating.
+ */
+ #define LOOPS_UNTIL_HIBERNATE 100
+ #define HIBERNATE_MS 10000
+
+ /*
* Flags set by interrupt handlers for later service in the main loop.
*/
static volatile sig_atomic_t got_SIGHUP = false;
static volatile sig_atomic_t shutdown_requested = false;
+ /* Prototypes for private functions */
+ static bool WalWriterNap(bool hibernating);
+
/* Signal handlers */
static void wal_quickdie(SIGNAL_ARGS);
static void WalSigHupHandler(SIGNAL_ARGS);
static void WalShutdownHandler(SIGNAL_ARGS);
+ static void walwriter_sigusr1_handler(SIGNAL_ARGS);
/*
* Main entry point for walwriter process
*************** WalWriterMain(void)
*** 88,95 ****
{
sigjmp_buf local_sigjmp_buf;
MemoryContext walwriter_context;
!
! InitLatch(WALWriterLatch()); /* initialize latch used in main loop */
/*
* If possible, make this process a group leader, so that the postmaster
--- 100,107 ----
{
sigjmp_buf local_sigjmp_buf;
MemoryContext walwriter_context;
! int left_till_hibernate = LOOPS_UNTIL_HIBERNATE;
! bool hibernating;
/*
* If possible, make this process a group leader, so that the postmaster
*************** WalWriterMain(void)
*** 114,120 ****
pqsignal(SIGQUIT, wal_quickdie); /* hard crash time */
pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
! pqsignal(SIGUSR1, SIG_IGN); /* reserve for ProcSignal */
pqsignal(SIGUSR2, SIG_IGN); /* not used */
/*
--- 126,132 ----
pqsignal(SIGQUIT, wal_quickdie); /* hard crash time */
pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
! pqsignal(SIGUSR1, walwriter_sigusr1_handler);
pqsignal(SIGUSR2, SIG_IGN); /* not used */
/*
*************** WalWriterMain(void)
*** 218,229 ****
PG_SETMASK(&UnBlockSig);
/*
* Loop forever
*/
for (;;)
{
- ResetLatch(WALWriterLatch());
-
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
--- 230,246 ----
PG_SETMASK(&UnBlockSig);
/*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->walwriterLatch = &MyProc->procLatch;
+
+ /*
* Loop forever
*/
+ hibernating = false;
for (;;)
{
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
*************** WalWriterMain(void)
*** 248,261 ****
/*
* Do what we're here for...
*/
! XLogBackgroundFlush();
! (void) WaitLatch(WALWriterLatch(),
! WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
! WalWriterDelay /* ms */);
}
}
/* --------------------------------
* signal handler routines
--- 265,394 ----
/*
* Do what we're here for...
*/
! if (hibernating)
! ResetLatch(&MyProc->procLatch);
! /*
! * If XLogBackgroundFlush() found useful work to do, reset iterations
! * left until hibernation.
! */
! if (XLogBackgroundFlush())
! left_till_hibernate = LOOPS_UNTIL_HIBERNATE;
! else if (left_till_hibernate > 0)
! left_till_hibernate--;
!
! if (left_till_hibernate == 0 && !hibernating)
! {
! /*
! * XLogBackgroundFlush did nothing for a sufficient number of
! * iterations. Since there doesn't seem to be any work for the
! * walwriter to do, go into slower-paced "hibernation" mode, where
! * we sleep for much longer times than bgwriter_delay says. Fewer
! * wakeups saves electricity. If a backend inserts WAL, it will
! * wake us up by setting our latch.
! *
! * The latch is kept set during productive cycles where WAL is
! * written and/or synced, and only reset before going into a longer
! * sleep. That ensures that when there's a constant trickle of
! * activity, the SetLatch() calls that backends have to do will see
! * the latch as already set, and are not slowed down by having to
! * actually set the latch and signal us.
! */
! hibernating = true;
!
! /*
! * Take one more short nap and perform one more walwriter cycle -
! * control in another process might have reached XLogInsert() just
! * after we finished the previous walwriter cycle, while the latch
! * was still set. If we still find nothing to do after this cycle,
! * the next sleep will be longer.
! */
! WalWriterNap(false);
! continue;
! }
! else if (left_till_hibernate > 0 && hibernating)
! {
! /*
! * Woke up from hibernation, and may have found work to do. Set the
! * latch just in case it's not set yet (usually we wake up from
! * hibernation because a backend already set the latch, but not
! * necessarily).
! */
! SetLatch(&MyProc->procLatch);
! hibernating = false;
! }
!
! /*
! * If the latch is set, don't immediately go back to hibernation if no
! * work is performed in the first iteration.
! */
! if (WalWriterNap(hibernating))
! left_till_hibernate = LOOPS_UNTIL_HIBERNATE;
}
}
+ /*
+ * WalWriterNap -- Nap for the configured time or until a signal is received.
+ *
+ * If 'hibernating' is false, sleeps for wal_writer_delay milliseconds.
+ * Otherwise sleeps longer, but also wakes up if the process latch is set.
+ *
+ * Returns a value indicating if return was due to latch being set.
+ */
+ static bool
+ WalWriterNap(bool hibernating)
+ {
+ long udelay;
+ bool latch_set = false;
+
+ /*
+ * Nap for the configured time, or sleep for 10 seconds if there is no
+ * walwriter activity required.
+ *
+ * If there was no work to do in the previous LOOPS_UNTIL_HIBERNATE cycles,
+ * take a longer nap.
+ */
+ if (hibernating)
+ {
+ int res = WaitLatch(&MyProc->procLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ HIBERNATE_MS);
+
+ /*
+ * Only do a quick return if timeout was reached (or postmaster died)
+ * to ensure that no less than WalWriterDelay ms has passed between
+ * XLogBackgroundFlush calls - WaitLatch() might have returned
+ * instantaneously.
+ */
+ if (res & (WL_TIMEOUT | WL_POSTMASTER_DEATH))
+ return false;
+
+ /* WL_LATCH_SET is documented as reliable */
+ latch_set = (res & WL_LATCH_SET) != 0;
+ }
+
+ /*
+ * Nap for the configured time.
+ *
+ * On some platforms, signals won't interrupt the sleep. To ensure we
+ * respond reasonably promptly when someone signals us, break down the
+ * sleep into 1-second increments, and check for interrupts after each
+ * nap.
+ */
+ udelay = WalWriterDelay * 1000L;
+
+ while (udelay > 999999L)
+ {
+ if (got_SIGHUP || shutdown_requested)
+ break;
+ pg_usleep(1000000L);
+ udelay -= 1000000L;
+ }
+ if (!(got_SIGHUP || shutdown_requested))
+ pg_usleep(udelay);
+
+ return latch_set;
+ }
/* --------------------------------
* signal handler routines
*************** wal_quickdie(SIGNAL_ARGS)
*** 298,311 ****
static void
WalSigHupHandler(SIGNAL_ARGS)
{
got_SIGHUP = true;
! SetLatch(WALWriterLatch());
}
/* SIGTERM: set flag to exit normally */
static void
WalShutdownHandler(SIGNAL_ARGS)
{
shutdown_requested = true;
! SetLatch(WALWriterLatch());
}
--- 431,465 ----
static void
WalSigHupHandler(SIGNAL_ARGS)
{
+ int save_errno = errno;
+
got_SIGHUP = true;
! if (MyProc)
! SetLatch(&MyProc->procLatch);
!
! errno = save_errno;
}
/* SIGTERM: set flag to exit normally */
static void
WalShutdownHandler(SIGNAL_ARGS)
{
+ int save_errno = errno;
+
shutdown_requested = true;
! if (MyProc)
! SetLatch(&MyProc->procLatch);
!
! errno = save_errno;
! }
!
! /* SIGUSR1: used for latch wakeups */
! static void
! walwriter_sigusr1_handler(SIGNAL_ARGS)
! {
! int save_errno = errno;
!
! latch_sigusr1_handler();
!
! errno = save_errno;
}
diff src/backend/storage/lmgr/proc.c
index d1bf264..43c4846
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
*************** InitProcGlobal(void)
*** 187,192 ****
--- 187,193 ----
ProcGlobal->startupProcPid = 0;
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->bgwriterLatch = NULL;
+ ProcGlobal->walwriterLatch = NULL;
/*
* Create and initialize all the PGPROC structures we'll need (except for
diff src/include/access/xlog.h
index f8aecef..3c1a509
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
*************** extern CheckpointStatsData CheckpointSta
*** 266,272 ****
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
extern void XLogFlush(XLogRecPtr RecPtr);
! extern void XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
extern int XLogFileInit(uint32 log, uint32 seg,
bool *use_existent, bool use_lock);
--- 266,272 ----
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
extern void XLogFlush(XLogRecPtr RecPtr);
! extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
extern int XLogFileInit(uint32 log, uint32 seg,
bool *use_existent, bool use_lock);
*************** extern TimeLineID GetRecoveryTargetTLI(v
*** 317,323 ****
extern bool CheckPromoteSignal(void);
extern void WakeupRecovery(void);
- extern Latch *WALWriterLatch(void);
/*
* Starting/stopping a base backup
--- 317,322 ----
diff src/include/storage/latch.h
index f97fedf..12f500a
*** a/src/include/storage/latch.h
--- b/src/include/storage/latch.h
***************
*** 25,31 ****
* and must be initialized at postmaster startup by InitSharedLatch. Before
* a shared latch can be waited on, it must be associated with a process
* with OwnLatch. Only the process owning the latch can wait on it, but any
! * process can set it.
*
* There are three basic operations on a latch:
*
--- 25,37 ----
* and must be initialized at postmaster startup by InitSharedLatch. Before
* a shared latch can be waited on, it must be associated with a process
* with OwnLatch. Only the process owning the latch can wait on it, but any
! * process can set it. Note that the use of the process latch (a field in
! * PGPROC) is generally preferred over an ad-hoc latch for auxiliary processes.
! * This is because generic signal handlers will call SetLatch on the process
! * latch only. Signals have the potential to invalidate the latch timeout on
! * certain platforms, resulting in a denial-of-service, so even if the process
! * latch cannot be used it is important to verify that all signal handlers
! * within the process call SetLatch().
*
* There are three basic operations on a latch:
*
diff src/include/storage/proc.h
index a66c484..478577f
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
*************** typedef struct PROC_HDR
*** 190,195 ****
--- 190,197 ----
PGPROC *autovacFreeProcs;
/* BGWriter process latch */
Latch *bgwriterLatch;
+ /* WALWriter process latch */
+ Latch *walwriterLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* The proc of the Startup process, since not in ProcArray */
Peter Geoghegan <peter@2ndquadrant.com> writes:
Attached patch latches up the WAL Writer, reducing wake-ups and thus
saving electricity in a way that is more-or-less analogous to my work
on the BGWriter:
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=6d90eaaa89a007e0d365f49d6436f35d2392cfeb
I am hoping this gets into 9.2. I am conscious of the fact that this
is quite late, but the patch addresses an open item, the concluding
part of a much wider feature.
It is getting a bit late to be considering such changes for 9.2, but
I'm willing to review and commit this if there's not anybody who feels
strongly that it's too late. Personally I think it's in the nature of
cleanup and so fair game as long as we haven't formally started beta.
However I will confess to some bias about wanting to get the server's
idle wake-up rate down, because Fedora people have been bugging me
about that for a long time now. So I'm probably not the best person to
objectively evaluate whether we should hold this for 9.3. Comments?
Schedule questions aside, I'm disturbed by this bit:
My choice of XLogInsert() as an additional site at which to call
SetLatch() was one that wasn't taken easily, and frankly I'm not
entirely confident that I couldn't have been just as effective while
placing the SetLatch() call in a less hot, perhaps higher-level
codepath.
Adding any contention at all to XLogInsert doesn't seem like a smart
idea, even if you failed to measure any problem in the specific tests
you made. I wonder whether we could not improve matters by adding
an additional bool "wal_writer_needs_wakening" in the state that's
considered to be protected by WALInsertLock. XLogInsert would check
this while still holding the lock, and only consider that it needs to do
a SetLatch if the flag was set, whereupon it would clear it before
releasing the lock. In the normal case this would add one uncontended
fetch followed by boolean-test-and-jump to the work done while holding
the lock, which should be pretty negligible. Then, the WAL writer would
need to take WALInsertLock to set that flag, but presumably it should
only be doing that when there is no contention for the lock. (In fact,
we could have it do a ConditionalLockAcquire on WALInsertLock for the
purpose, and consider that failure means it shouldn't go to sleep after
all.)
Now this might sound pretty much equivalent to testing the latch's
is_set flag; perhaps it is and I'm worrying over nothing. But I'm
thinking that the wal_writer_needs_wakening flag would be in a cache
line that an acquirer of WALInsertLock would have to get ownership of
anyway, if it is adjacent to variables that XLogInsert has to manipulate
anyway. On the other hand, the WAL writer's process latch would be in
some other cache line that would also need to get passed around a lot,
if it's touched during every XLogInsert.
regards, tom lane
On Wed, May 2, 2012 at 7:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It is getting a bit late to be considering such changes for 9.2, but
I'm willing to review and commit this if there's not anybody who feels
strongly that it's too late. Personally I think it's in the nature of
cleanup and so fair game as long as we haven't formally started beta.
However I will confess to some bias about wanting to get the server's
idle wake-up rate down, because Fedora people have been bugging me
about that for a long time now. So I'm probably not the best person to
objectively evaluate whether we should hold this for 9.3. Comments?
Well, I feel that one of the weaknesses of our CommitFest process is
that changes like this (which are really pretty small) end up having
the same deadline as patches that are large (command triggers,
checksums, etc.); in fact, they sometimes end up having an earlier
deadline, because the people doing the big stuff end up continuing to
hack on it for another couple months while the door is shut to smaller
improvements. So I'm not going to object if you feel like slipping
this one in. I looked it over myself and I think it's broadly
reasonable, although I'm not too sure about the particular criteria
chosen for sending the WAL writer to sleep and waking it up again.
And like you I'd like to see some more improvement in this area.
Adding any contention at all to XLogInsert doesn't seem like a smart
idea, even if you failed to measure any problem in the specific tests
you made. I wonder whether we could not improve matters by adding
an additional bool "wal_writer_needs_wakening" in the state that's
considered to be protected by WALInsertLock.
I am skeptical about this, although it could be right. It could also
be better the way Peter did it; a fetch of an uncontended cache line
is pretty cheap. Another approach - which I think might be better
still - is to not bother kicking the WAL writer and let it wake up
when it wakes up. Maybe have it hibernate for 3 seconds instead of
10, or something like that. It seems unlikely to cause any real
problem if WAL writer takes a couple seconds to get with the program
after a long period of inactivity; note that an async commit will kick
it anyway, and a sync commit will probably have to flush WAL whether
the WAL writer wakes up or not.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
... It seems unlikely to cause any real
problem if WAL writer takes a couple seconds to get with the program
after a long period of inactivity; note that an async commit will kick
it anyway, and a sync commit will probably have to flush WAL whether
the WAL writer wakes up or not.
That's a good point. What about only kicking the WAL writer in code
paths where a backend found itself having to write/flush WAL for itself?
The added overhead is very surely negligible in such a situation.
regards, tom lane
On Wed, May 2, 2012 at 11:42 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... It seems unlikely to cause any real
problem if WAL writer takes a couple seconds to get with the program
after a long period of inactivity; note that an async commit will kick
it anyway, and a sync commit will probably have to flush WAL whether
the WAL writer wakes up or not.
That's a good point. What about only kicking the WAL writer in code
paths where a backend found itself having to write/flush WAL for itself?
The added overhead is very surely negligible in such a situation.
Yeah, I think that would make sense, though I'd probably still argue
for a hibernation period not quite so long as ten seconds. Actually,
what I'd really like is for this to be adaptive: if we find that
there's no WAL to write, increase the time until the next wakeup by 10
ms until we hit the maximum of, say, 3 seconds. If we find that there
is WAL to write, cut the time until the next wakeup in half until we
hit a minimum of, say, 20ms. And, if we're forced to write/flush WAL
ourselves, or we async commit, kick the WAL writer in the pants and
wake him up right away. That way we're willing to get
super-aggressive when needed, but we don't stay there very long once
the pounding ends. Also, we avoid having a hard "cut" between regular
sleeps and deep hibernation; instead, we kind of gradually drift off.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 03.05.2012 03:41, Robert Haas wrote:
On Wed, May 2, 2012 at 7:21 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
Adding any contention at all to XLogInsert doesn't seem like a smart
idea, even if you failed to measure any problem in the specific tests
you made. I wonder whether we could not improve matters by adding
an additional bool "wal_writer_needs_wakening" in the state that's
considered to be protected by WALInsertLock.
I am skeptical about this, although it could be right. It could also
be better the way Peter did it; a fetch of an uncontended cache line
is pretty cheap.
I'm very wary of adding any extra shared memory accesses to XLogInsert.
I spent a lot of time trying to eliminate them in my XLogInsert scaling
patch. It might be ok if the flag is usually not modified, and we don't
add any extra barrier instructions in there, but it would be better to
avoid it.
One simple idea would be to only try to set the latch every 100
XLogInsert calls in the backend. That would cut whatever contention it
might cause by a factor of 100, making it negligible.
Another approach - which I think might be better
still - is to not bother kicking the WAL writer and let it wake up
when it wakes up. Maybe have it hibernate for 3 seconds instead of
10, or something like that. It seems unlikely to cause any real
problem if WAL writer takes a couple seconds to get with the program
after a long period of inactivity; note that an async commit will kick
it anyway, and a sync commit will probably have to flush WAL whether
the WAL writer wakes up or not.
Yeah, that'd be even simpler.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, May 3, 2012 at 2:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, May 2, 2012 at 7:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
It is getting a bit late to be considering such changes for 9.2, but
I'm willing to review and commit this if there's not anybody who feels
strongly that it's too late. Personally I think it's in the nature of
cleanup and so fair game as long as we haven't formally started beta.
However I will confess to some bias about wanting to get the server's
idle wake-up rate down, because Fedora people have been bugging me
about that for a long time now. So I'm probably not the best person to
objectively evaluate whether we should hold this for 9.3. Comments?Well, I feel that one of the weaknesses of our CommitFest process is
that changes like this (which are really pretty small) end up having
the same deadline as patches that are large (command triggers,
checksums, etc.); in fact, they sometimes end up having an earlier
deadline, because the people doing the big stuff end up continuing to
hack on it for another couple months while the door is shut to smaller
improvements. So I'm not going to object if you feel like slipping
this one in. I looked it over myself and I think it's broadly
reasonable, although I'm not too sure about the particular criteria
chosen for sending the WAL writer to sleep and waking it up again.
And like you I'd like to see some more improvement in this area.
I agree that it's ok to slip it in given that it's "finishing off a
patch from earlier". I think it's reasonable to hold it to a little
bit higher review standards since it's that late in the cycle though,
such as two people reviewing it before it goes in (or 1 reviewer + 1
committer - and of course, unless it's a truly trivial patch). Which
it seems you both are doing now, so that makes it ok ;)
--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/
On 3 May 2012 10:56, Magnus Hagander <magnus@hagander.net> wrote:
I agree that it's ok to slip it in given that it's "finishing off a
patch from earlier". I think it's reasonable to hold it to a little
bit higher review standards since it's that late in the cycle though,
such as two people reviewing it before it goes in (or 1 reviewer + 1
committer - and of course, unless it's a truly trivial patch). Which
it seems you both are doing now, so that makes it ok ;)
Right. It's a simple, largely mechanical patch that doesn't have any
behavioural changes, and is of strategic importance, so I thought it
was worthy of special consideration, without actually expecting it.
Attached patch removes the questionable SetLatch() call, under the
assumption that it's okay if the WALWriter, having entered hibernation
due to sustained inactivity (10 seconds) subsequently takes up to 5
seconds (2.5 on average) to notice that it has work to do. These
values may need to be tweaked. I have not bothered with making the
sleep time adaptive, because it's probably too late to do that.
This latest revision also covers the checkpointer. The code for that
is far simpler than that for the WAL Writer, so it doesn't
particularly feel like I'm pushing my luck by slipping that into
something to be slipped in. I should not have excluded it before,
since it accounts for another 2 wake-ups per second. All told, this
new revision sees Postgres wake-ups stabilise at 0.9 per second. With
the checkpointer code included, we roundly beat MySQL in this area,
which will be a nice advocacy message for 9.2, though I probably
shouldn't be quoted on that until I get the opportunity to go back and
make absolutely sure that I've been fair.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Attachments:
walwriter_latch_v2_2012_05_04.patch
diff src/backend/access/transam/xlog.c
index b584cb0..20e7b6c
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** typedef struct XLogCtlData
*** 434,444 ****
Latch recoveryWakeupLatch;
/*
- * WALWriterLatch is used to wake up the WALWriter to write some WAL.
- */
- Latch WALWriterLatch;
-
- /*
* During recovery, we keep a copy of the latest checkpoint record here.
* Used by the background writer when it wants to create a restartpoint.
*
--- 434,439 ----
*************** XLogSetAsyncXactLSN(XLogRecPtr asyncXact
*** 1935,1941 ****
/*
* Nudge the WALWriter if we have a full page of WAL to write.
*/
! SetLatch(&XLogCtl->WALWriterLatch);
}
/*
--- 1930,1937 ----
/*
* Nudge the WALWriter if we have a full page of WAL to write.
*/
! if (ProcGlobal->walwriterLatch)
! SetLatch(ProcGlobal->walwriterLatch);
}
/*
*************** XLogFlush(XLogRecPtr record)
*** 2173,2188 ****
* case for async commits.)
*
* This routine is invoked periodically by the background walwriter process.
*/
! void
XLogBackgroundFlush(void)
{
XLogRecPtr WriteRqstPtr;
bool flexible = true;
/* XLOG doesn't need flushing during recovery */
if (RecoveryInProgress())
! return;
/* read LogwrtResult and update local state */
{
--- 2169,2187 ----
* case for async commits.)
*
* This routine is invoked periodically by the background walwriter process.
+ *
+ * Returns a value indicating if useful work was performed.
*/
! bool
XLogBackgroundFlush(void)
{
XLogRecPtr WriteRqstPtr;
bool flexible = true;
+ bool worked = false;
/* XLOG doesn't need flushing during recovery */
if (RecoveryInProgress())
! return false;
/* read LogwrtResult and update local state */
{
*************** XLogBackgroundFlush(void)
*** 2224,2230 ****
XLogFileClose();
}
}
! return;
}
#ifdef WAL_DEBUG
--- 2223,2229 ----
XLogFileClose();
}
}
! return false;
}
#ifdef WAL_DEBUG
*************** XLogBackgroundFlush(void)
*** 2247,2256 ****
--- 2246,2258 ----
WriteRqst.Write = WriteRqstPtr;
WriteRqst.Flush = WriteRqstPtr;
XLogWrite(WriteRqst, flexible, false);
+ worked = true;
}
LWLockRelease(WALWriteLock);
END_CRIT_SECTION();
+
+ return worked;
}
/*
*************** XLOGShmemInit(void)
*** 5101,5107 ****
XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
SpinLockInit(&XLogCtl->info_lck);
InitSharedLatch(&XLogCtl->recoveryWakeupLatch);
- InitSharedLatch(&XLogCtl->WALWriterLatch);
/*
* If we are not in bootstrap mode, pg_control should already exist. Read
--- 5103,5108 ----
*************** WakeupRecovery(void)
*** 10478,10489 ****
{
SetLatch(&XLogCtl->recoveryWakeupLatch);
}
-
- /*
- * Manage the WALWriterLatch
- */
- Latch *
- WALWriterLatch(void)
- {
- return &XLogCtl->WALWriterLatch;
- }
--- 10479,10481 ----
diff src/backend/postmaster/checkpointer.c
index 2329b1a..806de44
*** a/src/backend/postmaster/checkpointer.c
--- b/src/backend/postmaster/checkpointer.c
***************
*** 51,56 ****
--- 51,57 ----
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/pmsignal.h"
+ #include "storage/proc.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
#include "storage/spin.h"
*************** int CheckPointWarning = 30;
*** 144,149 ****
--- 145,155 ----
double CheckPointCompletionTarget = 0.5;
/*
+ * Time to sleep between checkpointer rounds when hibernating, in milliseconds.
+ */
+ #define HIBERNATE_MS 15000
+
+ /*
* Flags set by interrupt handlers for later service in the main loop.
*/
static volatile sig_atomic_t got_SIGHUP = false;
*************** static void UpdateSharedMemoryConfig(voi
*** 178,183 ****
--- 184,190 ----
static void chkpt_quickdie(SIGNAL_ARGS);
static void ChkptSigHupHandler(SIGNAL_ARGS);
static void ReqCheckpointHandler(SIGNAL_ARGS);
+ static void chkpt_sigusr1_handler(SIGNAL_ARGS);
static void ReqShutdownHandler(SIGNAL_ARGS);
*************** CheckpointerMain(void)
*** 224,230 ****
pqsignal(SIGQUIT, chkpt_quickdie); /* hard crash time */
pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
! pqsignal(SIGUSR1, SIG_IGN); /* reserve for ProcSignal */
pqsignal(SIGUSR2, ReqShutdownHandler); /* request shutdown */
/*
--- 231,237 ----
pqsignal(SIGQUIT, chkpt_quickdie); /* hard crash time */
pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
! pqsignal(SIGUSR1, chkpt_sigusr1_handler);
pqsignal(SIGUSR2, ReqShutdownHandler); /* request shutdown */
/*
*************** CheckpointerMain(void)
*** 369,374 ****
--- 376,383 ----
pg_time_t now;
int elapsed_secs;
+ ResetLatch(&MyProc->procLatch);
+
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
*************** CheckpointerMain(void)
*** 548,562 ****
*/
pgstat_send_bgwriter();
- /*
- * Nap for a while and then loop again. Later patches will replace
- * this with a latch loop. Keep it simple now for clarity.
- * Relatively long sleep because the bgwriter does cleanup now.
- */
- pg_usleep(500000L);
-
/* Check for archive_timeout and switch xlog files if necessary. */
CheckArchiveTimeout();
}
}
--- 557,568 ----
*/
pgstat_send_bgwriter();
/* Check for archive_timeout and switch xlog files if necessary. */
CheckArchiveTimeout();
+
+ (void) WaitLatch(&MyProc->procLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ HIBERNATE_MS);
}
}
*************** chkpt_quickdie(SIGNAL_ARGS)
*** 814,834 ****
--- 820,875 ----
static void
ChkptSigHupHandler(SIGNAL_ARGS)
{
+ int save_errno = errno;
+
got_SIGHUP = true;
+ if (MyProc)
+ SetLatch(&MyProc->procLatch);
+
+ errno = save_errno;
}
/* SIGINT: set flag to run a normal checkpoint right away */
static void
ReqCheckpointHandler(SIGNAL_ARGS)
{
+ int save_errno = errno;
+
checkpoint_requested = true;
+ if (MyProc)
+ SetLatch(&MyProc->procLatch);
+
+ errno = save_errno;
+ }
+
+ /*
+ * SIGUSR1: used for latch wakeups.
+ *
+ * ReqCheckpointHandler() is actually responsible for servicing checkpoint
+ * requests. This handler is included only for consistency with other
+ * auxiliary processes.
+ */
+ static void
+ chkpt_sigusr1_handler(SIGNAL_ARGS)
+ {
+ int save_errno = errno;
+
+ latch_sigusr1_handler();
+
+ errno = save_errno;
}
/* SIGUSR2: set flag to run a shutdown checkpoint and exit */
static void
ReqShutdownHandler(SIGNAL_ARGS)
{
+ int save_errno = errno;
+
shutdown_requested = true;
+ if (MyProc)
+ SetLatch(&MyProc->procLatch);
+
+ errno = save_errno;
}
diff src/backend/postmaster/walwriter.c
index 08ef946..6721a97
*** a/src/backend/postmaster/walwriter.c
--- b/src/backend/postmaster/walwriter.c
***************
*** 54,59 ****
--- 54,60 ----
#include "storage/ipc.h"
#include "storage/lwlock.h"
#include "storage/pmsignal.h"
+ #include "storage/proc.h"
#include "storage/smgr.h"
#include "utils/guc.h"
#include "utils/hsearch.h"
***************
*** 67,81 ****
--- 68,93 ----
int WalWriterDelay = 200;
/*
+ * Number of loops before each nap is extended to a full hibernation, and time
+ * to sleep between walwriter rounds when hibernating.
+ */
+ #define LOOPS_UNTIL_HIBERNATE 50
+ #define HIBERNATE_MS 5000
+
+ /*
* Flags set by interrupt handlers for later service in the main loop.
*/
static volatile sig_atomic_t got_SIGHUP = false;
static volatile sig_atomic_t shutdown_requested = false;
+ /* Prototypes for private functions */
+ static bool WalWriterNap(bool hibernating);
+
/* Signal handlers */
static void wal_quickdie(SIGNAL_ARGS);
static void WalSigHupHandler(SIGNAL_ARGS);
static void WalShutdownHandler(SIGNAL_ARGS);
+ static void walwriter_sigusr1_handler(SIGNAL_ARGS);
/*
* Main entry point for walwriter process
*************** WalWriterMain(void)
*** 88,95 ****
{
sigjmp_buf local_sigjmp_buf;
MemoryContext walwriter_context;
!
! InitLatch(WALWriterLatch()); /* initialize latch used in main loop */
/*
* If possible, make this process a group leader, so that the postmaster
--- 100,107 ----
{
sigjmp_buf local_sigjmp_buf;
MemoryContext walwriter_context;
! int left_till_hibernate = LOOPS_UNTIL_HIBERNATE;
! bool hibernating;
/*
* If possible, make this process a group leader, so that the postmaster
*************** WalWriterMain(void)
*** 114,120 ****
pqsignal(SIGQUIT, wal_quickdie); /* hard crash time */
pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
! pqsignal(SIGUSR1, SIG_IGN); /* reserve for ProcSignal */
pqsignal(SIGUSR2, SIG_IGN); /* not used */
/*
--- 126,132 ----
pqsignal(SIGQUIT, wal_quickdie); /* hard crash time */
pqsignal(SIGALRM, SIG_IGN);
pqsignal(SIGPIPE, SIG_IGN);
! pqsignal(SIGUSR1, walwriter_sigusr1_handler);
pqsignal(SIGUSR2, SIG_IGN); /* not used */
/*
*************** WalWriterMain(void)
*** 218,229 ****
PG_SETMASK(&UnBlockSig);
/*
* Loop forever
*/
for (;;)
{
- ResetLatch(WALWriterLatch());
-
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
--- 230,246 ----
PG_SETMASK(&UnBlockSig);
/*
+ * Advertise our latch that backends can use to wake us up while we're
+ * sleeping.
+ */
+ ProcGlobal->walwriterLatch = &MyProc->procLatch;
+
+ /*
* Loop forever
*/
+ hibernating = false;
for (;;)
{
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
*************** WalWriterMain(void)
*** 248,261 ****
/*
* Do what we're here for...
*/
! XLogBackgroundFlush();
! (void) WaitLatch(WALWriterLatch(),
! WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
! WalWriterDelay /* ms */);
}
}
/* --------------------------------
* signal handler routines
--- 265,394 ----
/*
* Do what we're here for...
*/
! if (hibernating)
! ResetLatch(&MyProc->procLatch);
! /*
! * If XLogBackgroundFlush() found useful work to do, reset iterations
! * left until hibernation.
! */
! if (XLogBackgroundFlush())
! left_till_hibernate = LOOPS_UNTIL_HIBERNATE;
! else if (left_till_hibernate > 0)
! left_till_hibernate--;
!
! if (left_till_hibernate == 0 && !hibernating)
! {
! /*
! * XLogBackgroundFlush did nothing for a sufficient number of
! * iterations. Since there doesn't seem to be any work for the
! * walwriter to do, go into slower-paced "hibernation" mode, where
! * we sleep for much longer times than wal_writer_delay says. Fewer
! * wakeups saves electricity. If a backend inserts WAL, it will
! * wake us up by setting our latch.
! *
! * The latch is kept set during productive cycles where WAL is
! * written and/or synced, and only reset before going into a longer
! * sleep. That ensures that when there's a constant trickle of
! * activity, the SetLatch() calls that backends have to do will see
! * the latch as already set, and are not slowed down by having to
! * actually set the latch and signal us.
! */
! hibernating = true;
!
! /*
! * Take one more short nap and perform one more walwriter cycle -
! * control in another process might have reached XLogInsert() just
! * after we finished the previous walwriter cycle, while the latch
! * was still set. If we still find nothing to do after this cycle,
! * the next sleep will be longer.
! */
! WalWriterNap(false);
! continue;
! }
! else if (left_till_hibernate > 0 && hibernating)
! {
! /*
! * Woke up from hibernation, and may have found work to do. Set the
! * latch just in case it's not set yet (usually we wake up from
! * hibernation because a backend already set the latch, but not
! * necessarily).
! */
! SetLatch(&MyProc->procLatch);
! hibernating = false;
! }
!
! /*
! * If the latch is set, don't immediately go back to hibernation if no
! * work is performed in the first iteration.
! */
! if (WalWriterNap(hibernating))
! left_till_hibernate = LOOPS_UNTIL_HIBERNATE;
}
}
+ /*
+ * WalWriterNap -- Nap for the configured time or until a signal is received.
+ *
+ * If 'hibernating' is false, sleeps for wal_writer_delay milliseconds.
+ * Otherwise sleeps longer, but also wakes up if the process latch is set.
+ *
+ * Returns a value indicating if return was due to latch being set.
+ */
+ static bool
+ WalWriterNap(bool hibernating)
+ {
+ long udelay;
+ bool latch_set = false;
+
+ /*
+ * Nap for the configured time, or sleep for HIBERNATE_MS (5 seconds) if
+ * there is no walwriter activity required.
+ *
+ * If there was no work to do in the previous LOOPS_UNTIL_HIBERNATE cycles,
+ * take a longer nap.
+ */
+ if (hibernating)
+ {
+ int res = WaitLatch(&MyProc->procLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ HIBERNATE_MS);
+
+ /*
+ * Only do a quick return if timeout was reached (or postmaster died)
+ * to ensure that no less than WalWriterDelay ms has passed between
+ * XLogBackgroundFlush calls - WaitLatch() might have returned
+ * instantaneously.
+ */
+ if (res & (WL_TIMEOUT | WL_POSTMASTER_DEATH))
+ return false;
+
+ /* WL_LATCH_SET is documented as reliable */
+ latch_set = (res & WL_LATCH_SET) != 0;
+ }
+
+ /*
+ * Nap for the configured time.
+ *
+ * On some platforms, signals won't interrupt the sleep. To ensure we
+ * respond reasonably promptly when someone signals us, break down the
+ * sleep into 1-second increments, and check for interrupts after each
+ * nap.
+ */
+ udelay = WalWriterDelay * 1000L;
+
+ while (udelay > 999999L)
+ {
+ if (got_SIGHUP || shutdown_requested)
+ break;
+ pg_usleep(1000000L);
+ udelay -= 1000000L;
+ }
+ if (!(got_SIGHUP || shutdown_requested))
+ pg_usleep(udelay);
+
+ return latch_set;
+ }
/* --------------------------------
* signal handler routines
*************** wal_quickdie(SIGNAL_ARGS)
*** 298,311 ****
static void
WalSigHupHandler(SIGNAL_ARGS)
{
got_SIGHUP = true;
! SetLatch(WALWriterLatch());
}
/* SIGTERM: set flag to exit normally */
static void
WalShutdownHandler(SIGNAL_ARGS)
{
shutdown_requested = true;
! SetLatch(WALWriterLatch());
}
--- 431,465 ----
static void
WalSigHupHandler(SIGNAL_ARGS)
{
+ int save_errno = errno;
+
got_SIGHUP = true;
! if (MyProc)
! SetLatch(&MyProc->procLatch);
!
! errno = save_errno;
}
/* SIGTERM: set flag to exit normally */
static void
WalShutdownHandler(SIGNAL_ARGS)
{
+ int save_errno = errno;
+
shutdown_requested = true;
! if (MyProc)
! SetLatch(&MyProc->procLatch);
!
! errno = save_errno;
! }
!
! /* SIGUSR1: used for latch wakeups */
! static void
! walwriter_sigusr1_handler(SIGNAL_ARGS)
! {
! int save_errno = errno;
!
! latch_sigusr1_handler();
!
! errno = save_errno;
}
diff src/backend/storage/lmgr/proc.c
index d1bf264..43c4846
*** a/src/backend/storage/lmgr/proc.c
--- b/src/backend/storage/lmgr/proc.c
*************** InitProcGlobal(void)
*** 187,192 ****
--- 187,193 ----
ProcGlobal->startupProcPid = 0;
ProcGlobal->startupBufferPinWaitBufId = -1;
ProcGlobal->bgwriterLatch = NULL;
+ ProcGlobal->walwriterLatch = NULL;
/*
* Create and initialize all the PGPROC structures we'll need (except for
diff src/include/access/xlog.h
index f8aecef..3c1a509
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
*************** extern CheckpointStatsData CheckpointSta
*** 266,272 ****
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
extern void XLogFlush(XLogRecPtr RecPtr);
! extern void XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
extern int XLogFileInit(uint32 log, uint32 seg,
bool *use_existent, bool use_lock);
--- 266,272 ----
extern XLogRecPtr XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata);
extern void XLogFlush(XLogRecPtr RecPtr);
! extern bool XLogBackgroundFlush(void);
extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
extern int XLogFileInit(uint32 log, uint32 seg,
bool *use_existent, bool use_lock);
*************** extern TimeLineID GetRecoveryTargetTLI(v
*** 317,323 ****
extern bool CheckPromoteSignal(void);
extern void WakeupRecovery(void);
- extern Latch *WALWriterLatch(void);
/*
* Starting/stopping a base backup
--- 317,322 ----
diff src/include/storage/latch.h
index f97fedf..419644a
*** a/src/include/storage/latch.h
--- b/src/include/storage/latch.h
***************
*** 25,31 ****
* and must be initialized at postmaster startup by InitSharedLatch. Before
* a shared latch can be waited on, it must be associated with a process
* with OwnLatch. Only the process owning the latch can wait on it, but any
! * process can set it.
*
* There are three basic operations on a latch:
*
--- 25,38 ----
* and must be initialized at postmaster startup by InitSharedLatch. Before
* a shared latch can be waited on, it must be associated with a process
* with OwnLatch. Only the process owning the latch can wait on it, but any
! * process can set it. Note that the use of the process latch (a field in
! * PGPROC) is generally preferred over an ad-hoc shared latch for auxiliary
! * processes. This is because generic signal handlers will call SetLatch on
! * the process latch only, so using any latch other than the process latch
! * effectively precludes ever registering a generic handler. Since signals have
! * the potential to invalidate the latch timeout on certain platforms,
! * resulting in a denial-of-service, it is important to verify that all signal
! * handlers within all WaitLatch() calling processes call SetLatch().
*
* There are three basic operations on a latch:
*
diff src/include/storage/proc.h
index 987bc08..46755e5
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
*************** typedef struct PROC_HDR
*** 190,195 ****
--- 190,197 ----
PGPROC *autovacFreeProcs;
/* BGWriter process latch */
Latch *bgwriterLatch;
+ /* WALWriter process latch */
+ Latch *walwriterLatch;
/* Current shared estimate of appropriate spins_per_delay value */
int spins_per_delay;
/* The proc of the Startup process, since not in ProcArray */
Peter Geoghegan <peter@2ndquadrant.com> writes:
This latest revision also covers the checkpointer. The code for that
is far simpler than that for the WAL Writer, so it doesn't
particularly feel like I'm pushing my luck by slipping that into
something to be slipped in.
Well ... maybe, or maybe not, or maybe you are just poking at a sore
spot that was already created by the patch to make a separate
checkpointer process. What bothers me in looking at this is that the
main loop of the checkpointer includes an AbsorbFsyncRequests() call,
which is now the only wakeup condition that isn't covered by latch
logic or a predictable time delay. A long sleep period could easily
result in overflow of the fsync request queue, which is not good for
performance. I'm inclined to think that we'd better add logic to
ForwardFsyncRequest() to set the latch once the queue is, say, more
than half full.
I also notice that the separate-checkpointer patch failed to rename
assorted things like BgWriterCommLock, BgWriterRequest,
BgWriterShmemStruct, which are all 100% inappropriately named now.
And it still contains various obsolete comments referring to itself
as the background writer. Will see about cleaning that up.
regards, tom lane
On 7 May 2012 18:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Peter Geoghegan <peter@2ndquadrant.com> writes:
This latest revision also covers the checkpointer. The code for that
is far simpler than that for the WAL Writer, so it doesn't
particularly feel like I'm pushing my luck by slipping that into
something to be slipped in.
Well ... maybe, or maybe not, or maybe you are just poking at a sore
spot that was already created by the patch to make a separate
checkpointer process. What bothers me in looking at this is that the
main loop of the checkpointer includes an AbsorbFsyncRequests() call,
which is now the only wakeup condition that isn't covered by latch
logic or a predictable time delay. A long sleep period could easily
result in overflow of the fsync request queue, which is not good for
performance. I'm inclined to think that we'd better add logic to
ForwardFsyncRequest() to set the latch once the queue is, say, more
than half full.
OK
I also notice that the separate-checkpointer patch failed to rename
assorted things like BgWriterCommLock, BgWriterRequest,
BgWriterShmemStruct, which are all 100% inappropriately named now.
And it still contains various obsolete comments referring to itself
as the background writer. Will see about cleaning that up.
For want of a better name, keeping them the same seemed best.
If you have a suggested name change, I'd be happy to oblige.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes:
On 7 May 2012 18:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I also notice that the separate-checkpointer patch failed to rename
assorted things like BgWriterCommLock, BgWriterRequest,
BgWriterShmemStruct, which are all 100% inappropriately named now.
And it still contains various obsolete comments referring to itself
as the background writer. Will see about cleaning that up.
For want of a better name, keeping them the same seemed best.
I was just thinking s/BgWriter/Checkpointer/, do you think that's too
long?
regards, tom lane
On 7 May 2012 19:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
On 7 May 2012 18:09, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I also notice that the separate-checkpointer patch failed to rename
assorted things like BgWriterCommLock, BgWriterRequest,
BgWriterShmemStruct, which are all 100% inappropriately named now.
And it still contains various obsolete comments referring to itself
as the background writer. Will see about cleaning that up.
For want of a better name, keeping them the same seemed best.
I was just thinking s/BgWriter/Checkpointer/, do you think that's too
long?
CheckpointerCommLock
CheckpointerShmemStruct
work OK
CheckpointerRequest
sounds a little vague, but can be tweaked
It also leaves the situation that we have a catalog view called
pg_stat_bgwriter that would be accessing "checkpointer" things. That's
really the thorny one that I wasn't sure how to handle. Good example
of why we shouldn't expose internals too much.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes:
It also leaves the situation that we have a catalog view called
pg_stat_bgwriter that would be accessing "checkpointer" things. That's
really the thorny one that I wasn't sure how to handle. Good example
of why we shouldn't expose internals too much.
Yeah, that's a bit unfortunate but changing it doesn't seem like a good
idea. The names I intended to change are all internal.
regards, tom lane
On 7 May 2012 20:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
It also leaves the situation that we have a catalog view called
pg_stat_bgwriter that would be accessing "checkpointer" things. That's
really the thorny one that I wasn't sure how to handle. Good example
of why we shouldn't expose internals too much.
Yeah, that's a bit unfortunate but changing it doesn't seem like a good
idea. The names I intended to change are all internal.
OK, I'll change just the internal names.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Peter Geoghegan <peter@2ndquadrant.com> writes:
Attached patch removes the questionable SetLatch() call, under the
assumption that it's okay if the WALWriter, having entered hibernation
due to sustained inactivity (10 seconds) subsequently takes up to 5
seconds (2.5 on average) to notice that it has work to do. These
values may need to be tweaked. I have not bothered with making the
sleep time adaptive, because it's probably too late to do that.
Now that I've actually read the patch, rather than just responding to
your description of it, I find myself entirely unhappy with the proposed
changes in the walwriter's sleep logic. You have introduced race
conditions (it is NOT okay to reset the latch somewhere below the top of
the loop) and degraded the walwriter's signal response time in normal
non-hibernation state, to solve a problem not in evidence; to wit that
backends spend too much time signaling the walwriter. Given the
location of the only existing SetLatch call for the purpose, I find that
proposition more than a bit doubtful. I see little or no value in
trying to keep the walwriter's procLatch set when it's not hibernating,
and I definitely don't think it is worth the risk of introducing bugs
when we're about to start beta. I'm intending to rip all that out and
go back to the plain "ResetLatch at the top of the loop, WaitLatch at
the bottom" design, with the hibernation logic just controlling the
timeout on the WaitLatch call.
regards, tom lane
On 8 May 2012 22:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Now that I've actually read the patch, rather than just responding to
your description of it, I find myself entirely unhappy with the proposed
changes in the walwriter's sleep logic. You have introduced race
conditions (it is NOT okay to reset the latch somewhere below the top of
the loop)
Yes, there is some checking of flags before the potential ResetLatch()
call, which may be acted on. The code there is almost identical to
that of the extant bgwriter code. I was under the impression that this
did not amount to a race, though it's rather late now, and I'm feeling
under the weather, so I have not taken steps to verify that I have it
right. Arguably, you'd want somebody's SetLatch call to be ignored if
and degraded the walwriter's signal response time in normal
non-hibernation state, to solve a problem not in evidence; to wit that
backends spend too much time signaling the walwriter. Given the
location of the only existing SetLatch call for the purpose, I find that
proposition more than a bit doubtful.
I do too. The elaborate logic to reduce that overhead was nothing more
than a vestige of the first version. I apologise for making such a
basic error.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
Peter Geoghegan <peter@2ndquadrant.com> writes:
On 8 May 2012 22:35, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Now that I've actually read the patch, rather than just responding to
your description of it, I find myself entirely unhappy with the proposed
changes in the walwriter's sleep logic. You have introduced race
conditions (it is NOT okay to reset the latch somewhere below the top of
the loop)
Yes, there is some checking of flags before the potential ResetLatch()
call, which may be acted on. The code there is almost identical to
that of the extant bgwriter code.
Um, yes, I noticed that shortly after sending my previous message.
I'm pretty unhappy about the current state of the bgwriter loop, too.
I rather wonder whether that coding explains the "postmaster failed to
shut down" errors that we've been seeing lately in the buildfarm.
Not noticing a shutdown signal promptly would go a long way towards
explaining that.
regards, tom lane
On 9 May 2012 00:21, Peter Geoghegan <peter@2ndquadrant.com> wrote:
Yes, there is some checking of flags before the potential ResetLatch()
call, which may be acted on. The code there is almost identical to
that of the extant bgwriter code. I was under the impression that this
did not amount to a race, though it's rather late now, and I'm feeling
under the weather, so I have not taken steps to verify that I have it
right. Arguably, you'd want somebody's SetLatch call to be ignored if
Sent too early. That should be: Arguably, you'd want somebody's
SetLatch call to be ignored under the circumstances that that could
happen in both the bgwriter, and the WALWriter within my recent patch.
--
Peter Geoghegan http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services
I've applied the walwriter/checkpointer patch, with the mentioned
re-simplification of the logic. While measuring that, I noticed that
the stats collector was now the biggest repeated-wakeup culprit, so
I took the time to latch-ify it as well. AFAICS we no longer have
anything that wakes up oftener than once every five seconds when the
system is idle, so life is looking pretty good in powertop land.
regards, tom lane
On further reflection I've realized that there's a really unpleasant
consequence of the walwriter change as-committed: it breaks the former
guarantee that async commits would reach disk within at most 3 times
the WalWriterDelay setting. They will still get written within at most
3 walwriter cycles, but a lone async commit occurring when the writer
is hibernating could see a delay much longer than before. This seems to
me to be unacceptable. Probably nobody cares that much about the exact
multiplier of 3, but if the delay could be an order of magnitude or two
more than that, that's going to make users of async commits unhappy.
So what we need here is for XLogSetAsyncXactLSN() to be able to boot the
walwriter out of hibernate mode. I still don't care in the least for
the original hack of using the state of the procLatch to indicate
whether the walwriter is hibernating, but we can add a separate flag
instead so as to avoid having every trip through XLogSetAsyncXactLSN
do a SetLatch call (which would be bad anyway since it would prevent
the walwriter from sleeping normally). Since XLogSetAsyncXactLSN
has to take the info_lck anyway, we might as well make this new flag
be protected by info_lck. The walwriter won't need to change the
flag's state very often, by definition, so that end of it isn't going
to cost anything noticeable.
regards, tom lane
Peter Geoghegan <peter@2ndquadrant.com> writes:
On 9 May 2012 00:21, Peter Geoghegan <peter@2ndquadrant.com> wrote:
Yes, there is some checking of flags before the potential ResetLatch()
call, which may be acted on. The code there is almost identical to
that of the extant bgwriter code. I was under the impression that this
did not amount to a race, though it's rather late now, and I'm feeling
under the weather, so I have not taken steps to verify that I have it
right. Arguably, you'd want somebody's SetLatch call to be ignored if
Sent too early. That should be: Arguably, you'd want somebody's
SetLatch call to be ignored under the circumstances that that could
happen in both the bgwriter, and the WALWriter within my recent patch.
I don't believe that for a moment, and even if it accidentally manages
to not fail in today's code, it's *way* too fragile for my taste.
I believe the bgwriter code needs to be fixed similarly to the way
I changed the walwriter patch, namely that there needs to be a separate
flag (not the latch's isSet state) advertising whether the bgwriter is
currently in hibernation mode. Now, unlike the walwriter, there isn't
any convenient place to keep such a flag where it can be inspected
inside an already-existing spinlock or LWLock guard. However, unlike
the walwriter there is not a correctness requirement for the bgwriter
to wake up promptly. So I think we could just put a bool
"bgwriter_sleeping" in ProcGlobal and accept the possibility of other
processes sometimes seeing stale values. Thoughts?
regards, tom lane
I wrote:
I believe the bgwriter code needs to be fixed similarly to the way
I changed the walwriter patch, namely that there needs to be a separate
flag (not the latch's isSet state) advertising whether the bgwriter is
currently in hibernation mode.
After further study of the bgwriter hibernation patch (commit
6d90eaaa89a007e0d365f49d6436f35d2392cfeb), I think that my worries about
race conditions in the use of the bgwriter's latch are really the least
of its problems. BgBufferSync embodies a rather carefully tuned
feedback control loop, and I think these changes broke it. In the first
place, that feedback loop is built on the assumption that BgBufferSync
is executed at fixed intervals, which isn't true anymore; and in the
second place, the loop is not driven so much by the rate of buffers
being dirtied as it is by the rate of buffers being allocated. To be
concrete, if there is a constant but slow rate of buffer consumption,
the bgwriter will eventually lap the strategy scan and then stay there,
resulting in BgBufferSync returning true every time even though actually
the system is doing things. This results in bgwriter.c deciding it's in
hibernation mode, whereupon we have a scenario where backends will be
signaling it all the time. The way BgWriterNap is coded, that means
BgBufferSync is not executed at a fixed BgWriterDelay interval, but at
variable intervals from BgWriterDelay up to BGWRITER_HIBERNATE_MS, which
pretty much destroys the predictability of the feedback loop.
My proposal for fixing this is that
(1) BgBufferSync should return true (OK to hibernate) only if it's
lapped the strategy scan *and* recent_alloc is zero, meaning no new
buffers were allocated anywhere since last time.
(2) We should remove the bgwriter wakening calls from MarkBufferDirty
and SetBufferCommitInfoNeedsSave, and instead place one in buffer
allocation.
(3) The bottom-of-loop logic in bgwriter should be along the lines of
    rc = WaitLatch(..., BgWriterDelay);
    if (rc == WL_TIMEOUT && can_hibernate)
    {
        set global flag to tell backends to kick bgwriter
            if they allocate a buffer;
        WaitLatch(..., BGWRITER_HIBERNATE_MS);
        clear global flag;
    }
In comparison to the existing code, this method guarantees immediate
response to any signal (latch-setting event), and it ensures that
if we extend the normal sleep time for the bgwriter, the extra sleep
covers only an interval in which no new buffer allocations happened.
That provision seems to me to justify pretending that that interval
simply didn't exist for the purposes of the feedback control loop,
which allows us to not have to rethink how that loop works.
Comments?
regards, tom lane
On 10.05.2012 00:34, Tom Lane wrote:
After further study of the bgwriter hibernation patch (commit
6d90eaaa89a007e0d365f49d6436f35d2392cfeb), I think that my worries about
race conditions in the use of the bgwriter's latch are really the least
of its problems. BgBufferSync embodies a rather carefully tuned
feedback control loop, and I think these changes broke it. In the first
place, that feedback loop is built on the assumption that BgBufferSync
is executed at fixed intervals, which isn't true anymore; and in the
second place, the loop is not driven so much by the rate of buffers
being dirtied as it is by the rate of buffers being allocated. To be
concrete, if there is a constant but slow rate of buffer consumption,
the bgwriter will eventually lap the strategy scan and then stay there,
resulting in BgBufferSync returning true every time even though actually
the system is doing things. This results in bgwriter.c deciding it's in
hibernation mode, whereupon we have a scenario where backends will be
signaling it all the time. The way BgWriterNap is coded, that means
BgBufferSync is not executed at a fixed BgWriterDelay interval, but at
variable intervals from BgWriterDelay up to BGWRITER_HIBERNATE_MS, which
pretty much destroys the predictability of the feedback loop.
My proposal for fixing this is that
(1) BgBufferSync should return true (OK to hibernate) only if it's
lapped the strategy scan *and* recent_alloc is zero, meaning no new
buffers were allocated anywhere since last time.
(2) We should remove the bgwriter wakening calls from MarkBufferDirty
and SetBufferCommitInfoNeedsSave, and instead place one in buffer
allocation.
Hmm, that means that if you don't dirty any pages, bgwriter will wake up
even though it has no real work to do; it only wakes up to advance its
scan. I guess that's OK: if the system is busy doing read-only things
anyway, a few extra wakeups don't matter.
Waking the bgwriter at buffer allocation also makes sense because the
buffer allocation stats will be updated more promptly that way. At the
moment, if the bgwriter hibernates, it only updates the stats every 10
seconds.
(3) The bottom-of-loop logic in bgwriter should be along the lines of
    rc = WaitLatch(..., BgWriterDelay);
    if (rc == WL_TIMEOUT && can_hibernate)
    {
        set global flag to tell backends to kick bgwriter
            if they allocate a buffer;
        WaitLatch(..., BGWRITER_HIBERNATE_MS);
        clear global flag;
    }
In comparison to the existing code, this method guarantees immediate
response to any signal (latch-setting event), and it ensures that
if we extend the normal sleep time for the bgwriter, the extra sleep
covers only an interval in which no new buffer allocations happened.
That provision seems to me to justify pretending that that interval
simply didn't exist for the purposes of the feedback control loop,
which allows us to not have to rethink how that loop works.
Comments?
Seems reasonable. Would you like me to write a patch, or are you already
on it?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 10.05.2012 00:34, Tom Lane wrote:
After further study of the bgwriter hibernation patch (commit
6d90eaaa89a007e0d365f49d6436f35d2392cfeb), I think that my worries about
race conditions in the use of the bgwriter's latch are really the least
of its problems.
Seems reasonable. Would you like me to write a patch, or are you already
on it?
Done already, but please review the commit.
regards, tom lane