Re: Theory about XLogFlush startup failures
> So the failure-to-start-up problem can be blamed entirely on 7.1's
> failure to do anything with LSN fields in pg_log pages.  I was able to ...
No, the first reported problem can be blamed on RAM failures.
> So I am still dissatisfied with doing elog(STOP) for this condition,
> as I regard it as an overly strong reaction to corrupted data;
> moreover, it does nothing to fix the problem and indeed gets in
> the way of fixing the problem.
Totally agreed, but...
> I propose the attached patch.
> What do you think?
> ...
> +	if (XLByteLT(LogwrtResult.Flush, record))
> +		elog(InRecovery ? NOTICE : ERROR,
I suggest also setting a flag here if InRecovery, so that after all
data buffers are flushed we can elog(STOP) with something like:

	DATA FILE(S) CORRUPTED!
	RESTORE DATA FROM BACKUP OR
	RESET WAL TO DUMP/MANUALLY FIX ERRORS

- or something like that -:)

What's wrong with this? It's not OK to automatically restart
knowing that there are errors in the data.
Vadim
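
For concreteness, the deferred-STOP idea above might look roughly like
this (a sketch only: XLByteLT, LogwrtResult, InRecovery, and the elog
levels are real 7.1-era names from xlog.c, while recoveryFoundBadLSN
and the post-recovery hook are hypothetical, not part of the actual
patch):

    /* Hypothetical flag: remember that recovery saw a bad LSN. */
    static bool recoveryFoundBadLSN = false;

    void
    XLogFlush(XLogRecPtr record)
    {
        /* ... normal write/flush logic ... */

        if (XLByteLT(LogwrtResult.Flush, record))
        {
            if (InRecovery)
            {
                /* Remember the problem, but let recovery continue. */
                recoveryFoundBadLSN = true;
                elog(NOTICE, "XLogFlush: request %X/%X is not satisfied",
                     record.xlogid, record.xrecoff);
            }
            else
                elog(ERROR, "XLogFlush: request %X/%X is not satisfied",
                     record.xlogid, record.xrecoff);
        }
    }

    /* Hypothetical hook, run once all data buffers have been flushed. */
    void
    CheckDataConsistencyAfterRecovery(void)
    {
        if (recoveryFoundBadLSN)
            elog(STOP, "DATA FILE(S) CORRUPTED!\n"
                 "\tRESTORE DATA FROM BACKUP OR\n"
                 "\tRESET WAL TO DUMP/MANUALLY FIX ERRORS");
    }
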
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
>> So I am still dissatisfied with doing elog(STOP) for this condition,
>> as I regard it as an overly strong reaction to corrupted data;
>> moreover, it does nothing to fix the problem and indeed gets in
>> the way of fixing the problem.

> ... It's not OK to automatically restart
> knowing that there are errors in the data.
Actually, I disagree. If we come across clearly corrupt data values
(eg, bad length word for a varlena item, or even tuple-header errors
such as a bad XID), we do not try to force the admin to restore the
database from backup, do we? A bogus LSN is bad, certainly, but it
is not the end of the world and does not deserve a panic reaction.
At worst it tells us that one data page is corrupt. A robust system
should report that and keep plugging.
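
The kind of localized check being contrasted here might look like the
following; VARSIZE, VARHDRSZ, and BLCKSZ are real PostgreSQL macros,
but the function itself is purely illustrative:

    /* Illustrative sanity check for an on-disk (untoasted) varlena
     * length word.  An ERROR aborts one transaction; it does not take
     * down the whole postmaster the way elog(STOP) would. */
    static void
    check_varlena_length(struct varlena *datum)
    {
        if (VARSIZE(datum) < VARHDRSZ || VARSIZE(datum) > BLCKSZ)
            elog(ERROR, "corrupted varlena length %u",
                 (unsigned) VARSIZE(datum));
    }
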
What would be actually useful here is to report which page contains
the bad LSN, so that the admin could look at it and decide what to do.
xlog.c doesn't know that, unfortunately. I'd be more interested in
expending work to make that happen than in expending work to make
a dbadmin's life more difficult --- and I rank forced stops in the
latter category.
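
One way to get that report would be to make the check in the buffer
manager, which (unlike xlog.c) knows which page it is flushing.  A
sketch, assuming a hypothetical wrapper and a hypothetical accessor for
the current WAL insert position; RelFileNode, BlockNumber, XLogRecPtr,
and XLByteLT are the real 7.1-era types and macros:

    /* Sketch: wrap the buffer manager's XLogFlush() call so the error
     * message names the page whose LSN is bogus. */
    static void
    FlushBufferXLog(XLogRecPtr pageLSN, RelFileNode rnode, BlockNumber blkno)
    {
        /* A page LSN beyond the current insert position is garbage.
         * GetCurrentInsertLSN() is a hypothetical accessor. */
        if (XLByteLT(GetCurrentInsertLSN(), pageLSN))
            elog(ERROR, "page %u of relation %u/%u has bogus LSN %X/%X",
                 blkno, rnode.tblNode, rnode.relNode,
                 pageLSN.xlogid, pageLSN.xrecoff);

        XLogFlush(pageLSN);
    }
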
regards, tom lane
>> ... It's not OK to automatically restart
>> knowing that there are errors in the data.

> ... At worst it tells us that one data page is corrupt.  A robust system
> should report that and keep plugging.
Hmm. I'm not sure that this needs an "either-or" resolution on the
general topic of error recovery. Back when I used Ingres, it had the
feature that corruption would mark the database as "readonly" (a mode
I'd like us to have -- even without errors -- to help support upgrades,
updates, and error handling). So an administrator could evaluate the
damage without causing further damage, while still allowing users to
rummage through the database at the same time.
I have a hard time believing that we should *always* allow the database
to keep writing in the face of *any* detected error. I'm sure that is
not what Tom is saying, but in this case could further damage be caused
by subsequent writing when we *already* know that there is some
corruption? If so, we should consider supporting some sort of error
state that prevents further damage. Vadim's solution uses the only
current mechanism available, which is to force the database to shut down
until it can be evaluated. But if we had some stronger mechanisms to
support limited operation, that would help in this case and would
certainly help in other situations too.
- Thomas
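
A minimal sketch of such an error state, assuming a corruption flag in
shared memory; nothing like this existed in 7.1, and every name below
is hypothetical:

    /* Hypothetical shared-memory flag: set when corruption is seen,
     * checked before any operation that writes data pages. */
    volatile bool *DatabaseReadOnly;

    /* Called where corruption is detected, instead of elog(STOP). */
    void
    MarkDatabaseReadOnly(const char *reason)
    {
        *DatabaseReadOnly = true;
        elog(NOTICE, "database forced into read-only mode: %s", reason);
    }

    /* Called at the top of any data-writing operation. */
    void
    CheckWriteAllowed(void)
    {
        if (*DatabaseReadOnly)
            elog(ERROR, "database is read-only pending repair");
    }
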
Thomas Lockhart <lockhart@fourpalms.org> writes:
> If so, we should consider supporting some sort of error
> state that prevents further damage.
This seems reasonable (though I'd still question whether a bad LSN is
sufficient reason to force the whole database into read-only mode).
> Vadim's solution uses the only
> current mechanism available, which is to force the database to shut down
> until it can be evaluated.
But one of the big problems with his solution is that it gets in the way
of evaluating the problem. A read-only mode seems like a better way.
regards, tom lane