clog_redo causing very long recovery time
I'm working with a client that uses Postgres on what amounts to an
appliance.
The database is therefore subject to occasional torture such as, in this
particular case, running out of disk space while performing a million
plus queries (of mixed varieties, many using plpgsql with exception
handling -- more on that later), and eventually being power-cycled. Upon
restart, clog_redo was called approx 885000 times (CLOG_ZEROPAGE) during
recovery, which took almost 2 hours on their hardware. I should note
that this is on Postgres 8.3.x.
After studying the source, I can only see one possible way that this
could have occurred:
In varsup.c:GetNewTransactionId(), ExtendCLOG() needs to succeed on a
freshly zeroed clog page, and ExtendSUBTRANS() has to fail. Both of
these calls can lead to a page being pushed out of shared buffers and to
disk, so given a lack of disk space, sufficient clog buffers, but a
shortage of subtrans buffers, this could happen. At that point the
transaction id does not get advanced, so clog zeroes the same page,
ExtendSUBTRANS() fails again, rinse and repeat.
I believe in the case above, subtrans buffers were exhausted due to the
extensive use of plpgsql with exception handling.
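To make the failure sequence concrete, here is a rough sketch of the
relevant portion of GetNewTransactionId() (heavily abridged from 8.3's
varsup.c; locking, wraparound checks, and error handling omitted):

    xid = ShmemVariableCache->nextXid;

    /*
     * If xid is the first xid on a new clog page, ExtendCLOG() zeroes
     * that page and emits a CLOG_ZEROPAGE XLOG record for it.
     */
    ExtendCLOG(xid);

    /*
     * If ExtendSUBTRANS() then throws an ERROR (e.g. no disk space left
     * for the pg_subtrans segment), control never reaches the advance
     * below ...
     */
    ExtendSUBTRANS(xid);

    /*
     * ... so nextXid stays put.  The next transaction gets the same xid,
     * ExtendCLOG() zeroes and WAL-logs the same page again, and so on,
     * one CLOG_ZEROPAGE record per failed attempt.
     */
    TransactionIdAdvance(ShmemVariableCache->nextXid);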
I can simulate this failure with the attached debug-clog patch which
makes use of two pre-existing debug GUCs to selectively interject an
ERROR in between calls to ExtendCLOG() and ExtendSUBTRANS(). If you want
to test this yourself, apply this patch and use the function in
test_clog.sql to generate a million or so transactions. Within the first
32K transactions or so (whenever clog first gets an opportunity to zero a
new page) you should start seeing messages about injected ERRORs. Let a
few hundred thousand ERRORs go by, then kill postgres. Recovery will
take ages, because clog_redo is calling fsync hundreds of thousands of
times in order to zero the same page over and over.
The attached fix-clogredo diff is my proposal for a fix for this.
Thoughts/alternatives, etc appreciated.
Thanks,
Joe
--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
Attachments:
debug-clog-r02.diff (text/plain)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 52224b1..317bc2e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -36,6 +36,7 @@
#include "access/slru.h"
#include "access/transam.h"
#include "postmaster/bgwriter.h"
+#include "utils/guc.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -355,6 +356,9 @@ ExtendCLOG(TransactionId newestXact)
/* Zero the page and make an XLOG entry about it */
ZeroCLOGPage(pageno, true);
+ /* steal this variable for test -- means we've been here */
+ Debug_print_rewritten = true;
+
LWLockRelease(CLogControlLock);
}
diff --git a/src/backend/access/transam/varsup.c b/src/backend/access/transam/varsup.c
index 8838d42..e55a67b 100644
--- a/src/backend/access/transam/varsup.c
+++ b/src/backend/access/transam/varsup.c
@@ -21,6 +21,7 @@
#include "storage/pmsignal.h"
#include "storage/proc.h"
#include "utils/builtins.h"
+#include "utils/guc.h"
/* Number of OIDs to prefetch (preallocate) per XLOG write */
@@ -107,6 +108,11 @@ GetNewTransactionId(bool isSubXact)
* Extend pg_subtrans too.
*/
ExtendCLOG(xid);
+ if (Debug_print_rewritten && Debug_pretty_print)
+ {
+ Debug_print_rewritten = false;
+ elog(ERROR,"injected ERROR");
+ }
ExtendSUBTRANS(xid);
/*
fix-clogredo-pg-r03.diff (text/plain)
diff -cNr postgresql-8.3.13.orig/src/backend/access/transam/clog.c postgresql-8.3.13/src/backend/access/transam/clog.c
*** postgresql-8.3.13.orig/src/backend/access/transam/clog.c Tue Dec 14 03:51:20 2010
--- postgresql-8.3.13/src/backend/access/transam/clog.c Thu Apr 28 12:04:52 2011
***************
*** 74,79 ****
--- 75,81 ----
#define ClogCtl (&ClogCtlData)
+ static int last_pageno = -1;
static int ZeroCLOGPage(int pageno, bool writeXlog);
static bool CLOGPagePrecedes(int page1, int page2);
***************
*** 471,476 ****
--- 476,488 ----
memcpy(&pageno, XLogRecGetData(record), sizeof(int));
+ /* avoid repeatedly zeroing the same page */
+ if (InRecovery && pageno == last_pageno)
+ return;
+
+ /* save state */
+ last_pageno = pageno;
+
LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);
slotno = ZeroCLOGPage(pageno, false);
Joseph Conway <mail@joeconway.com> writes:
> I'm working with a client that uses Postgres on what amounts to an
> appliance.
> The database is therefore subject to occasional torture such as, in this
> particular case, running out of disk space while performing a million
> plus queries (of mixed varieties, many using plpgsql with exception
> handling -- more on that later), and eventually being power-cycled. Upon
> restart, clog_redo was called approx 885000 times (CLOG_ZEROPAGE) during
> recovery, which took almost 2 hours on their hardware. I should note
> that this is on Postgres 8.3.x.
> After studying the source, I can only see one possible way that this
> could have occurred:
> In varsup.c:GetNewTransactionId(), ExtendCLOG() needs to succeed on a
> freshly zeroed clog page, and ExtendSUBTRANS() has to fail. Both of
> these calls can lead to a page being pushed out of shared buffers and to
> disk, so given a lack of disk space, sufficient clog buffers, but a
> shortage of subtrans buffers, this could happen. At that point the
> transaction id does not get advanced, so clog zeroes the same page,
> ExtendSUBTRANS() fails again, rinse and repeat.
> I believe in the case above, subtrans buffers were exhausted due to the
> extensive use of plpgsql with exception handling.
Hmm, interesting. I believe that it's not really a question of buffer
space or lack of it, but whether the OS will give us disk space when we
try to add a page to the current pg_subtrans file. In any case, the
point is that a failure between ExtendCLOG and incrementing nextXid
is bad news.
> The attached fix-clogredo diff is my proposal for a fix for this.
That seems pretty grotty :-(
I think a more elegant fix might be to just swap the order of the
ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId. The
reason that would help is that pg_subtrans isn't WAL-logged, so if
we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
won't have written any XLOG entry, and thus repeated failures will not
result in repeated XLOG entries. I seem to recall having considered
exactly that point when the clog WAL support was first done, but the
scenario evidently wasn't considered when subtransactions were stuck
in :-(.
It would probably also help to put in a comment admonishing people
to not add stuff right there. I see the SSI guys have fallen into
the same trap.
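Roughly speaking (an untested sketch only, with that warning comment
folded in), the swapped calls would look like:

    /*
     * Extend pg_subtrans and pg_clog.  NB: ExtendSUBTRANS() is called
     * first on purpose: pg_subtrans is not WAL-logged, so a failure
     * there leaves no XLOG record behind.  ExtendCLOG() may emit a
     * CLOG_ZEROPAGE record, so do not add anything that can fail
     * between ExtendCLOG() and the advance of nextXid below, or a
     * repeated failure will emit the same record over and over.
     */
    ExtendSUBTRANS(xid);
    ExtendCLOG(xid);

    ...

    TransactionIdAdvance(ShmemVariableCache->nextXid);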
regards, tom lane
Excerpts from Tom Lane's message of Fri May 06 00:22:43 -0300 2011:
> I think a more elegant fix might be to just swap the order of the
> ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId. The
> reason that would help is that pg_subtrans isn't WAL-logged, so if
> we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
> won't have written any XLOG entry, and thus repeated failures will not
> result in repeated XLOG entries. I seem to recall having considered
> exactly that point when the clog WAL support was first done, but the
> scenario evidently wasn't considered when subtransactions were stuck
> in :-(.
I'm pretty sure it would have never occured to me to consider such a
scenario.
--
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 05/05/2011 08:22 PM, Tom Lane wrote:
> Joseph Conway <mail@joeconway.com> writes:
>> The attached fix-clogredo diff is my proposal for a fix for this.
> That seems pretty grotty :-(
> I think a more elegant fix might be to just swap the order of the
> ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId. The
> reason that would help is that pg_subtrans isn't WAL-logged, so if
> we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
> won't have written any XLOG entry, and thus repeated failures will not
> result in repeated XLOG entries. I seem to recall having considered
> exactly that point when the clog WAL support was first done, but the
> scenario evidently wasn't considered when subtransactions were stuck
> in :-(.
Yes, that does seem much nicer :-)
> It would probably also help to put in a comment admonishing people
> to not add stuff right there. I see the SSI guys have fallen into
> the same trap.
Right -- I think another similar problem exists in GetNewMultiXactId
where ExtendMultiXactOffset could succeed and write an XLOG entry and
then ExtendMultiXactMember could fail before advancing nextMXact. The
problem in this case is that they both write XLOG entries, so a simple
reversal doesn't help.
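For reference, the sequence there looks roughly like this (an abridged,
untested sketch of GetNewMultiXactId(); variable names approximate):

    /*
     * Both extension routines below can zero a fresh SLRU page and emit
     * an XLOG record for it, so merely swapping their order does not
     * help: whichever runs first may already have logged its zero-page
     * record when the second one fails, and nextMXact has not been
     * advanced yet, so the next attempt repeats the same record.
     */
    ExtendMultiXactOffset(result);
    ExtendMultiXactMember(nextOffset, nxids);

    /* nextMXact and nextOffset are only advanced after both succeed */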
Joe
--
Joe Conway
credativ LLC: http://www.credativ.us
Linux, PostgreSQL, and general Open Source
Training, Service, Consulting, & 24x7 Support
Joe Conway <mail@joeconway.com> writes:
> Right -- I think another similar problem exists in GetNewMultiXactId
> where ExtendMultiXactOffset could succeed and write an XLOG entry and
> then ExtendMultiXactMember could fail before advancing nextMXact. The
> problem in this case is that they both write XLOG entries, so a simple
> reversal doesn't help.
Hmm. Maybe we need a real fix then. I was just sitting here
speculating about whether we'd ever decide we need to WAL-log
pg_subtrans --- because if we did, my solution would fail.
I still think that the right fix is to avoid emitting redundant
XLOG records in the first place, rather than hacking recovery
to not process them. Possibly we could modify slru.c so that
it could be determined whether zeroing of the current page had
already happened. In a quick look, it looks like noting whether
latest_page_number had already been advanced to that page might
do the trick.
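For illustration only, an untested sketch of how such a check might look
in ExtendCLOG():

    LWLockAcquire(CLogControlLock, LW_EXCLUSIVE);

    /*
     * Untested sketch: if a previous (possibly failed) call already
     * zeroed this page, SimpleLruZeroPage() will have advanced
     * latest_page_number to it, so we can skip the redundant zeroing
     * and, more importantly, the redundant XLOG entry.
     */
    if (ClogCtl->shared->latest_page_number == pageno)
    {
        LWLockRelease(CLogControlLock);
        return;
    }

    /* Zero the page and make an XLOG entry about it */
    ZeroCLOGPage(pageno, true);

    LWLockRelease(CLogControlLock);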
regards, tom lane
On 05/05/2011 09:00 PM, Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> Right -- I think another similar problem exists in GetNewMultiXactId
>> where ExtendMultiXactOffset could succeed and write an XLOG entry and
>> then ExtendMultiXactMember could fail before advancing nextMXact. The
>> problem in this case is that they both write XLOG entries, so a simple
>> reversal doesn't help.
> Hmm. Maybe we need a real fix then. I was just sitting here
> speculating about whether we'd ever decide we need to WAL-log
> pg_subtrans --- because if we did, my solution would fail.
> I still think that the right fix is to avoid emitting redundant
> XLOG records in the first place, rather than hacking recovery
> to not process them. Possibly we could modify slru.c so that
> it could be determined whether zeroing of the current page had
> already happened. In a quick look, it looks like noting whether
> latest_page_number had already been advanced to that page might
> do the trick.
Thanks -- I'll test that out.
Joe
--
Joseph E Conway
President/CEO
credativ LLC
www.credativ.us
616 Burnham Street
El Cajon, CA 92019
Office: +1 619 270 8787
Mobile: +1 619 843 8340
On Fri, May 6, 2011 at 4:22 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The attached fix-clogredo diff is my proposal for a fix for this.
> That seems pretty grotty :-(
> I think a more elegant fix might be to just swap the order of the
> ExtendCLOG and ExtendSUBTRANS calls in GetNewTransactionId. The
> reason that would help is that pg_subtrans isn't WAL-logged, so if
> we succeed doing ExtendSUBTRANS and then fail in ExtendCLOG, we
> won't have written any XLOG entry, and thus repeated failures will not
> result in repeated XLOG entries. I seem to recall having considered
> exactly that point when the clog WAL support was first done, but the
> scenario evidently wasn't considered when subtransactions were stuck
> in :-(.
I agree with Tom about the need for a fix that prevents generation of
repeated WAL records.
OTOH, I also like Joe's fix in the recovery code to avoid responding
to repeated records.
Can we have both please?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes:
> I agree with Tom about the need for a fix that prevents generation of
> repeated WAL records.
> OTOH, I also like Joe's fix in the recovery code to avoid responding
> to repeated records.
> Can we have both please?
Why? The patch in the recovery code is seriously ugly, and it won't
do anything useful once we've fixed the other end. Please notice also
that we'd need several instances of that kluge if we want to cover all
the SLRU-based cases.
regards, tom lane