Lots of FSM-related fragility in transaction commit
Joseph Shraibman reported some rather unpleasant behavior that
seems to be due to trying to drop a table whose FSM file has got
no permissions:
http://archives.postgresql.org/pgsql-general/2011-12/msg00246.php
One question is how the file got that way, and whether Postgres did
anything wrong to cause it to not have permissions. But what I'm on the
warpath about now is the abysmally bad error response. What ought to
happen if we fail to delete a file during a DROP TABLE is a WARNING,
no more. Not a PANIC that recurs during crash recovery.
I tried to reproduce this, and found a different but equally bad
behavior:
regression=# create table foo as select * from tenk1;
SELECT 10000
regression=# delete from foo where unique1%10 = 0;
DELETE 1000
regression=# vacuum foo;
VACUUM
regression=# select 'foo'::regclass::oid;
oid
-------
52860
(1 row)
[ go and "chmod 000 52860_fsm" ]
regression=# drop table foo;
WARNING: AbortTransaction while in COMMIT state
ERROR: could not open file "base/27666/52860_fsm": Permission denied
PANIC: cannot abort transaction 7413, it was already committed
ERROR: could not open file "base/27666/52860_fsm": Permission denied
PANIC: cannot abort transaction 7413, it was already committed
The connection to the server was lost. Attempting reset: Failed.
I thought that removing the unreadable file would let me restart,
but after "rm 52860_fsm" and trying again to start the server,
there's a different problem:
LOG: database system was interrupted while in recovery at 2011-12-08 00:56:11 EST
HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
LOG: database system was not properly shut down; automatic recovery in progress
LOG: consistent recovery state reached at 0/5D71BA8
LOG: redo starts at 0/5100050
WARNING: page 18 of relation base/27666/52860 is uninitialized
CONTEXT: xlog redo visible: rel 1663/27666/52860; blk 18
PANIC: WAL contains references to invalid pages
CONTEXT: xlog redo visible: rel 1663/27666/52860; blk 18
LOG: startup process (PID 14507) was terminated by signal 6
LOG: aborting startup due to startup process failure
Note that this isn't even the same symptom Shraibman hit, since
apparently he was failing on replaying the commit record. How is it
that the main table file managed to have uninitialized pages?
I suspect that "redo visible" WAL processing is breaking one or more of
the fundamental WAL rules, perhaps by not generating a full-page image
when it needs to.
So this is really a whole lot worse than our behavior was in pre-FSM
days, and it needs to get fixed.
regards, tom lane
On 08.12.2011 08:20, Tom Lane wrote:
I thought that removing the unreadable file would let me restart,
but after "rm 52860_fsm" and trying again to start the server,
there's a different problem:LOG: database system was interrupted while in recovery at 2011-12-08 00:56:11 EST
HINT: This probably means that some data is corrupted and you will have to use the last backup for recovery.
LOG: database system was not properly shut down; automatic recovery in progress
LOG: consistent recovery state reached at 0/5D71BA8
LOG: redo starts at 0/5100050
WARNING: page 18 of relation base/27666/52860 is uninitialized
CONTEXT: xlog redo visible: rel 1663/27666/52860; blk 18
PANIC: WAL contains references to invalid pages
CONTEXT: xlog redo visible: rel 1663/27666/52860; blk 18
LOG: startup process (PID 14507) was terminated by signal 6
LOG: aborting startup due to startup process failureNote that this isn't even the same symptom Shraibman hit, since
apparently he was failing on replaying the commit record. How is it
that the main table file managed to have uninitialized pages?
I suspect that "redo visible" WAL processing is breaking one or more of
the fundamental WAL rules, perhaps by not generating a full-page image
when it needs to.So this is really a whole lot worse than our behavior was in pre-FSM
days, and it needs to get fixed.
This bug was actually introduced only recently. Notice how the log says
"consistent recovery state reached at 0/5D71BA8". This interacts badly
with Fujii's patch I committed last week:
commit 1e616f639156b2431725f7823c999486ca46c1ea
Author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Fri Dec 2 10:49:54 2011 +0200During recovery, if we reach consistent state and still have entries in the
invalid-page hash table, PANIC immediately. Immediate PANIC is much better
than waiting for end-of-recovery, which is what we did before, because the
end-of-recovery might not come until months later if this is a standby
server.
...
Recovery thinks it has reached consistency early on, so it PANICs as
soon as it sees a reference to a page that belongs to a table that was
later dropped.
The bug here is that we consider having immediately reached consistency
during crash recovery. That's a false notion: during crash recovery the
database isn't consistent until *all* WAL has been replayed. We should
not set reachedMinRecoveryPoint during crash recovery at all. And
perhaps the flag should be renamed to reachedConsistency, to make it
clear that during crash recovery it's not enough to reach the nominal
minRecoveryPoint.
That was harmless until last week, because reachedMinRecoveryPoint was
not used for anything unless you're doing archive recovery and hot
standby was enabled, but IMO the "consistent recovery state reached" log
message was misleading even then. I propose that we apply the attached
patch to master and backbranches.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
dont-set-reachedMinRecoveryPoint-in-crash-recovery-1.patchtext/x-diff; name=dont-set-reachedMinRecoveryPoint-in-crash-recovery-1.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9bec660..9d96044 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -562,7 +562,13 @@ static TimeLineID lastPageTLI = 0;
static XLogRecPtr minRecoveryPoint; /* local copy of
* ControlFile->minRecoveryPoint */
static bool updateMinRecoveryPoint = true;
-bool reachedMinRecoveryPoint = false;
+
+/*
+ * Have we reached a consistent database state? In crash recovery, we have
+ * to replay all the WAL, so reachedConsistency is never set. During archive
+ * recovery, the database is consistent once minRecoveryPoint is reached.
+ */
+bool reachedConsistency = false;
static bool InRedo = false;
@@ -6894,9 +6900,16 @@ static void
CheckRecoveryConsistency(void)
{
/*
+ * During crash recovery, we don't reach a consistent state until we've
+ * replayed all the WAL.
+ */
+ if (XLogRecPtrIsInvalid(minRecoveryPoint))
+ return;
+
+ /*
* Have we passed our safe starting point?
*/
- if (!reachedMinRecoveryPoint &&
+ if (!reachedConsistency &&
XLByteLE(minRecoveryPoint, EndRecPtr) &&
XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
{
@@ -6906,7 +6919,7 @@ CheckRecoveryConsistency(void)
*/
XLogCheckInvalidPages();
- reachedMinRecoveryPoint = true;
+ reachedConsistency = true;
ereport(LOG,
(errmsg("consistent recovery state reached at %X/%X",
EndRecPtr.xlogid, EndRecPtr.xrecoff)));
@@ -6919,7 +6932,7 @@ CheckRecoveryConsistency(void)
*/
if (standbyState == STANDBY_SNAPSHOT_READY &&
!LocalHotStandbyActive &&
- reachedMinRecoveryPoint &&
+ reachedConsistency &&
IsUnderPostmaster)
{
/* use volatile pointer to prevent code rearrangement */
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 350d434..b280434 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -85,7 +85,7 @@ log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
* linger in the hash table until the end of recovery and PANIC there,
* which might come only much later if this is a standby server.
*/
- if (reachedMinRecoveryPoint)
+ if (reachedConsistency)
{
report_invalid_page(WARNING, node, forkno, blkno, present);
elog(PANIC, "WAL contains references to invalid pages");
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1fb56cd..1b31414 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -190,7 +190,7 @@ typedef enum
extern XLogRecPtr XactLastRecEnd;
-extern bool reachedMinRecoveryPoint;
+extern bool reachedConsistency;
/* these variables are GUC parameters related to XLOG */
extern int CheckPointSegments;
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 08.12.2011 08:20, Tom Lane wrote:
So this is really a whole lot worse than our behavior was in pre-FSM
days, and it needs to get fixed.
This bug was actually introduced only recently. Notice how the log says
"consistent recovery state reached at 0/5D71BA8". This interacts badly
with Fujii's patch I committed last week:
You're right, I was testing on HEAD not 9.1.x.
That was harmless until last week, because reachedMinRecoveryPoint was
not used for anything unless you're doing archive recovery and hot
standby was enabled, but IMO the "consistent recovery state reached" log
message was misleading even then. I propose that we apply the attached
patch to master and backbranches.
Looks sane to me, though I've not tested to see what effect it has on
the case I was testing.
regards, tom lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
On 08.12.2011 08:20, Tom Lane wrote:
I thought that removing the unreadable file would let me restart,
but after "rm 52860_fsm" and trying again to start the server,
there's a different problem:
LOG: consistent recovery state reached at 0/5D71BA8
LOG: redo starts at 0/5100050
WARNING: page 18 of relation base/27666/52860 is uninitialized
CONTEXT: xlog redo visible: rel 1663/27666/52860; blk 18
PANIC: WAL contains references to invalid pages
The bug here is that we consider having immediately reached consistency
during crash recovery.
That fixes only part of the problem I was on about, though: I don't
believe that this type of situation should occur in the first place.
We should not be throwing ERROR during post-commit attempts to unlink
the physical files of deleted relations.
After some investigation I believe that the problem was introduced
by the changes to support multiple "forks" in a relation. Specifically,
instead of just doing "smgrdounlink()" during post-commit, we now
do something like this:
for (fork = 0; fork <= MAX_FORKNUM; fork++)
{
if (smgrexists(srel, fork))
smgrdounlink(srel, fork, false);
}
and if you look into mdexists() you'll find that it throws ERROR for any
unexpected errno. This makes smgrdounlink's attempts to be robust
rather pointless.
I'm not entirely sure whether that behavior of mdexists is a good thing,
but it strikes me that the smgrexists call is pretty unnecessary here in
the first place. We should just get rid of it and do smgrdounlink
unconditionally. The only reason it's there is to keep smgrdounlink
from whinging about ENOENT on non-primary forks, which is simply stupid
for smgrdounlink to be doing anyway --- it is actually *more* willing to
complain about ENOENT on non-primary than primary forks, which has no
sense that I can detect.
Accordingly I propose the attached patch for HEAD, and similar changes
in all branches that have multiple forks. Note that this makes the
warning-report conditions identical for both code paths in mdunlink.
regards, tom lane