bug of recovery?
Hi,
Currently, if a reference to an invalid page is found during recovery,
its information
is saved in hash table "invalid_page_tab". Then, if such a reference
is resolved,
its information is removed from the hash table. If there is unresolved
reference to
an invalid page in the hash table at the end of recovery, PANIC error occurs.
What I'm worried about is that the hash table is volatile. If a user restarts
the server before reaching end of recovery, any information in the
hash table is lost,
and we wrongly miss the PANIC error case because we cannot find any unresolved
reference. That is, even if database is corrupted at the end of recovery,
a user might not be able to notice that. This looks like a serious problem. No?
To prevent the above problem, we should write the contents of the hash table to
the disk for every restartpoints, I think. Then, when the server
starts recovery,
it should reload the hash table from the disk. Thought? Am I missing something?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Sep26, 2011, at 11:50 , Fujii Masao wrote:
Currently, if a reference to an invalid page is found during recovery,
its information
is saved in hash table "invalid_page_tab". Then, if such a reference
is resolved,
its information is removed from the hash table. If there is unresolved
reference to
an invalid page in the hash table at the end of recovery, PANIC error occurs.What I'm worried about is that the hash table is volatile. If a user restarts
the server before reaching end of recovery, any information in the
hash table is lost,
and we wrongly miss the PANIC error case because we cannot find any unresolved
reference. That is, even if database is corrupted at the end of recovery,
a user might not be able to notice that. This looks like a serious problem. No?To prevent the above problem, we should write the contents of the hash table to
the disk for every restartpoints, I think. Then, when the server
starts recovery,
it should reload the hash table from the disk. Thought? Am I missing something?
Shouldn't references to invalid pages only occur before we reach a consistent
state? If so, the right fix would be to check whether all invalid page references
have been resolved after we've reached a consistent state, and to skip creating
restart points while there're unresolved page references.
best regards,
Florian Pflug
On Mon, Sep 26, 2011 at 10:50 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
Currently, if a reference to an invalid page is found during recovery,
its information
is saved in hash table "invalid_page_tab". Then, if such a reference
is resolved,
its information is removed from the hash table. If there is unresolved
reference to
an invalid page in the hash table at the end of recovery, PANIC error occurs.What I'm worried about is that the hash table is volatile. If a user restarts
the server before reaching end of recovery, any information in the
hash table is lost,
and we wrongly miss the PANIC error case because we cannot find any unresolved
reference. That is, even if database is corrupted at the end of recovery,
a user might not be able to notice that. This looks like a serious problem. No?To prevent the above problem, we should write the contents of the hash table to
the disk for every restartpoints, I think. Then, when the server
starts recovery,
it should reload the hash table from the disk. Thought? Am I missing something?
That doesn't happen because the when we stop the server it will
restart from a valid restartpoint - one where there is no in-progress
multi-phase operation.
So when it replays it will always replay both parts of the operation.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndquadrant.com> writes:
On Mon, Sep 26, 2011 at 10:50 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
To prevent the above problem, we should write the contents of the hash table to
the disk for every restartpoints, I think. Then, when the server
starts recovery,
it should reload the hash table from the disk. Thought? Am I missing something?
That doesn't happen because the when we stop the server it will
restart from a valid restartpoint - one where there is no in-progress
multi-phase operation.
Not clear that that's true. The larger point though is that the
invalid-page table is only interesting during crash recovery --- once
you've reached a consistent state, it should be empty and remain so.
So I see no particular value in Fujii's proposal of logging the table to
disk during standby mode.
It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
we (think we) have reached consistency, rather than leaving it to be
done only when we exit recovery mode.
regards, tom lane
On Sep26, 2011, at 22:39 , Tom Lane wrote:
It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
we (think we) have reached consistency, rather than leaving it to be
done only when we exit recovery mode.
I believe we also need to prevent the creation of restart points before
we've reached consistency. If we're starting from an online backup,
and a checkpoint occurred between pg_start_backup() and pg_stop_backup(),
we currently create a restart point upon replaying that checkpoint's
xlog record. At that point, however, unresolved page references are
not an error, since a truncation that happened after the checkpoint
(but before pg_stop_backup()) might or might not be reflected in the
online backup.
best regards,
Florian Pflug
On Tue, Sep 27, 2011 at 7:28 AM, Florian Pflug <fgp@phlo.org> wrote:
On Sep26, 2011, at 22:39 , Tom Lane wrote:
It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
we (think we) have reached consistency, rather than leaving it to be
done only when we exit recovery mode.I believe we also need to prevent the creation of restart points before
we've reached consistency. If we're starting from an online backup,
and a checkpoint occurred between pg_start_backup() and pg_stop_backup(),
we currently create a restart point upon replaying that checkpoint's
xlog record. At that point, however, unresolved page references are
not an error, since a truncation that happened after the checkpoint
(but before pg_stop_backup()) might or might not be reflected in the
online backup.
Preventing the creation of restartpoints before reaching consistent point
sounds fragile to the case where the backup takes very long time. It might
also take very long time to reach consistent point when replaying from that
backup. Which prevents also the removal of WAL files (e.g., streamed from
the master server) for a long time, and then might cause disk full failure.
ISTM that writing an invalid-page table to the disk for every restartpoints is
better approach. If an invalid-page table is never updated after we've
reached consistency point, we probably should make restartpoints write
that table only after that point. And, if a reference to an invalid
page is found
after the consistent point, we should emit error and cancel a recovery.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
ISTM that writing an invalid-page table to the disk for every restartpoints is
better approach.
I still say that's uncalled-for overkill. The invalid-page table is not
necessary for recovery, it's only a debugging cross-check. You're more
likely to introduce bugs than fix any by adding a mechanism like that.
regards, tom lane
On Tue, Sep 27, 2011 at 1:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Fujii Masao <masao.fujii@gmail.com> writes:
ISTM that writing an invalid-page table to the disk for every restartpoints is
better approach.I still say that's uncalled-for overkill. The invalid-page table is not
necessary for recovery, it's only a debugging cross-check.
If so, there is no risk even if the invalid-page table is lost and the check
is skipped unexpectedly?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On 26.09.2011 21:06, Simon Riggs wrote:
On Mon, Sep 26, 2011 at 10:50 AM, Fujii Masao<masao.fujii@gmail.com> wrote:
Currently, if a reference to an invalid page is found during recovery,
its information
is saved in hash table "invalid_page_tab". Then, if such a reference
is resolved,
its information is removed from the hash table. If there is unresolved
reference to
an invalid page in the hash table at the end of recovery, PANIC error occurs.What I'm worried about is that the hash table is volatile. If a user restarts
the server before reaching end of recovery, any information in the
hash table is lost,
and we wrongly miss the PANIC error case because we cannot find any unresolved
reference. That is, even if database is corrupted at the end of recovery,
a user might not be able to notice that. This looks like a serious problem. No?To prevent the above problem, we should write the contents of the hash table to
the disk for every restartpoints, I think. Then, when the server
starts recovery,
it should reload the hash table from the disk. Thought? Am I missing something?That doesn't happen because the when we stop the server it will
restart from a valid restartpoint - one where there is no in-progress
multi-phase operation.So when it replays it will always replay both parts of the operation.
I think you're mixing this up with the multi-page page split operations
in b-tree. This is different from that. What the "invalid_page_tab" is
for is the situation where you for example, insert to a page on table X,
and later table X is dropped, and then you crash. On WAL replay, you
will see the insert record, but the file for the table doesn't exist,
because the table was dropped. In that case we skip the insert, note
what happened in invalid_page_tab, and move on with recovery. When we
see the later record to drop the table, we know it was OK that the file
was missing earlier. But if we don't see it before end of recovery, we
PANIC, because then the file should've been there.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On 27.09.2011 00:28, Florian Pflug wrote:
On Sep26, 2011, at 22:39 , Tom Lane wrote:
It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
we (think we) have reached consistency, rather than leaving it to be
done only when we exit recovery mode.I believe we also need to prevent the creation of restart points before
we've reached consistency.
Seems reasonable. We could still allow restartpoints when the hash table
is empty, though. And once we've reached consistency, we can throw an
error immediately in log_invalid_page(), instead of adding the entry in
the hash table.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, Sep 27, 2011 at 6:54 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
I think you're mixing this up with the multi-page page split operations in
b-tree. This is different from that. What the "invalid_page_tab" is for is
the situation where you for example, insert to a page on table X, and later
table X is dropped, and then you crash. On WAL replay, you will see the
insert record, but the file for the table doesn't exist, because the table
was dropped. In that case we skip the insert, note what happened in
invalid_page_tab, and move on with recovery. When we see the later record to
drop the table, we know it was OK that the file was missing earlier. But if
we don't see it before end of recovery, we PANIC, because then the file
should've been there.
OK, yes, I see. Thanks.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sep27, 2011, at 07:59 , Heikki Linnakangas wrote:
On 27.09.2011 00:28, Florian Pflug wrote:
On Sep26, 2011, at 22:39 , Tom Lane wrote:
It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
we (think we) have reached consistency, rather than leaving it to be
done only when we exit recovery mode.I believe we also need to prevent the creation of restart points before
we've reached consistency.Seems reasonable. We could still allow restartpoints when the hash table is empty, though. And once we've reached consistency, we can throw an error immediately in log_invalid_page(), instead of adding the entry in the hash table.
That mimics the way the rm_safe_restartpoint callbacks work, which is good.
Actually, why don't we use that machinery to implement this? There's currently no rm_safe_restartpoint callback for RM_XLOG_ID, so we'd just need to create one that checks whether invalid_page_tab is empty.
best regards,
Florian Pflug
On Tue, Sep 27, 2011 at 8:06 PM, Florian Pflug <fgp@phlo.org> wrote:
On Sep27, 2011, at 07:59 , Heikki Linnakangas wrote:
On 27.09.2011 00:28, Florian Pflug wrote:
On Sep26, 2011, at 22:39 , Tom Lane wrote:
It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
we (think we) have reached consistency, rather than leaving it to be
done only when we exit recovery mode.I believe we also need to prevent the creation of restart points before
we've reached consistency.Seems reasonable. We could still allow restartpoints when the hash table is empty, though. And once we've reached consistency, we can throw an error immediately in log_invalid_page(), instead of adding the entry in the hash table.
That mimics the way the rm_safe_restartpoint callbacks work, which is good.
Actually, why don't we use that machinery to implement this? There's currently no rm_safe_restartpoint callback for RM_XLOG_ID, so we'd just need to create one that checks whether invalid_page_tab is empty.
Okay, the attached patch prevents the creation of restartpoints by using
rm_safe_restartpoint callback if we've not reached a consistent state yet
and the invalid-page table is not empty. But the invalid-page table is not
tied to the specific resource manager, so using rm_safe_restartpoint for
that seems to slightly odd. Is this OK?
Also, according to other suggestions, the patch changes XLogCheckInvalidPages()
so that it's called as soon as we've reached a consistent state, and changes
log_invalid_page() so that it emits PANIC immediately if consistency is already
reached. These are very good changes, I think. Because they enable us to
notice serious problem which causes PANIC error immediately. Without these
changes, you unfortunately might notice that the standby database is corrupted
when failover happens. Though such a problem might rarely happen, I think it's
worth doing those changes.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
invalid_page_table_v1.patchtext/x-patch; charset=US-ASCII; name=invalid_page_table_v1.patchDownload
*** a/src/backend/access/transam/rmgr.c
--- b/src/backend/access/transam/rmgr.c
***************
*** 16,21 ****
--- 16,22 ----
#include "access/nbtree.h"
#include "access/xact.h"
#include "access/xlog_internal.h"
+ #include "access/xlogutils.h"
#include "catalog/storage.h"
#include "commands/dbcommands.h"
#include "commands/sequence.h"
***************
*** 25,31 ****
const RmgrData RmgrTable[RM_MAX_ID + 1] = {
! {"XLOG", xlog_redo, xlog_desc, NULL, NULL, NULL},
{"Transaction", xact_redo, xact_desc, NULL, NULL, NULL},
{"Storage", smgr_redo, smgr_desc, NULL, NULL, NULL},
{"CLOG", clog_redo, clog_desc, NULL, NULL, NULL},
--- 26,32 ----
const RmgrData RmgrTable[RM_MAX_ID + 1] = {
! {"XLOG", xlog_redo, xlog_desc, NULL, NULL, xlog_safe_restartpoint},
{"Transaction", xact_redo, xact_desc, NULL, NULL, NULL},
{"Storage", smgr_redo, smgr_desc, NULL, NULL, NULL},
{"CLOG", clog_redo, clog_desc, NULL, NULL, NULL},
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 557,563 **** static TimeLineID lastPageTLI = 0;
static XLogRecPtr minRecoveryPoint; /* local copy of
* ControlFile->minRecoveryPoint */
static bool updateMinRecoveryPoint = true;
! static bool reachedMinRecoveryPoint = false;
static bool InRedo = false;
--- 557,563 ----
static XLogRecPtr minRecoveryPoint; /* local copy of
* ControlFile->minRecoveryPoint */
static bool updateMinRecoveryPoint = true;
! bool reachedMinRecoveryPoint = false;
static bool InRedo = false;
***************
*** 6840,6851 **** StartupXLOG(void)
LocalXLogInsertAllowed = -1;
/*
- * Check to see if the XLOG sequence contained any unresolved
- * references to uninitialized pages.
- */
- XLogCheckInvalidPages();
-
- /*
* Perform a checkpoint to update all our recovery activity to disk.
*
* Note that we write a shutdown checkpoint rather than an on-line
--- 6840,6845 ----
***************
*** 6982,6987 **** CheckRecoveryConsistency(void)
--- 6976,6987 ----
XLByteLE(minRecoveryPoint, EndRecPtr) &&
XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
{
+ /*
+ * Check to see if the XLOG sequence contained any unresolved
+ * references to uninitialized pages.
+ */
+ XLogCheckInvalidPages();
+
reachedMinRecoveryPoint = true;
ereport(LOG,
(errmsg("consistent recovery state reached at %X/%X",
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
***************
*** 52,57 **** typedef struct xl_invalid_page
--- 52,73 ----
static HTAB *invalid_page_tab = NULL;
+ /* Report a reference to an invalid page */
+ static void
+ report_invalid_page(int elevel, RelFileNode node, ForkNumber forkno,
+ BlockNumber blkno, bool present)
+ {
+ char *path = relpathperm(node, forkno);
+
+ if (present)
+ elog(elevel, "page %u of relation %s is uninitialized",
+ blkno, path);
+ else
+ elog(elevel, "page %u of relation %s does not exist",
+ blkno, path);
+ pfree(path);
+ }
+
/* Log a reference to an invalid page */
static void
log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
***************
*** 62,83 **** log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
bool found;
/*
* Log references to invalid pages at DEBUG1 level. This allows some
* tracing of the cause (note the elog context mechanism will tell us
* something about the XLOG record that generated the reference).
*/
if (log_min_messages <= DEBUG1 || client_min_messages <= DEBUG1)
! {
! char *path = relpathperm(node, forkno);
!
! if (present)
! elog(DEBUG1, "page %u of relation %s is uninitialized",
! blkno, path);
! else
! elog(DEBUG1, "page %u of relation %s does not exist",
! blkno, path);
! pfree(path);
! }
if (invalid_page_tab == NULL)
{
--- 78,101 ----
bool found;
/*
+ * Once recovery has reached a consistent state, the invalid-page
+ * table should be empty and remain so. If a reference to an invalid
+ * page is found after consistency is reached, emit PANIC immediately
+ * instead of adding the entry in the invalid-page table.
+ */
+ if (reachedMinRecoveryPoint)
+ {
+ report_invalid_page(WARNING, node, forkno, blkno, present);
+ elog(PANIC, "WAL contains references to invalid pages");
+ }
+
+ /*
* Log references to invalid pages at DEBUG1 level. This allows some
* tracing of the cause (note the elog context mechanism will tell us
* something about the XLOG record that generated the reference).
*/
if (log_min_messages <= DEBUG1 || client_min_messages <= DEBUG1)
! report_invalid_page(DEBUG1, node, forkno, blkno, present);
if (invalid_page_tab == NULL)
{
***************
*** 200,214 **** XLogCheckInvalidPages(void)
*/
while ((hentry = (xl_invalid_page *) hash_seq_search(&status)) != NULL)
{
! char *path = relpathperm(hentry->key.node, hentry->key.forkno);
!
! if (hentry->present)
! elog(WARNING, "page %u of relation %s was uninitialized",
! hentry->key.blkno, path);
! else
! elog(WARNING, "page %u of relation %s did not exist",
! hentry->key.blkno, path);
! pfree(path);
foundone = true;
}
--- 218,225 ----
*/
while ((hentry = (xl_invalid_page *) hash_seq_search(&status)) != NULL)
{
! report_invalid_page(WARNING, hentry->key.node, hentry->key.forkno,
! hentry->key.blkno, hentry->present);
foundone = true;
}
***************
*** 449,451 **** XLogTruncateRelation(RelFileNode rnode, ForkNumber forkNum,
--- 460,471 ----
{
forget_invalid_pages(rnode, forkNum, nblocks);
}
+
+ bool
+ xlog_safe_restartpoint(void)
+ {
+ if (invalid_page_tab != NULL &&
+ hash_get_num_entries(invalid_page_tab) > 0)
+ return false;
+ return true;
+ }
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 189,194 **** typedef enum
--- 189,196 ----
extern XLogRecPtr XactLastRecEnd;
+ extern bool reachedMinRecoveryPoint;
+
/* these variables are GUC parameters related to XLOG */
extern int CheckPointSegments;
extern int wal_keep_segments;
*** a/src/include/access/xlogutils.h
--- b/src/include/access/xlogutils.h
***************
*** 28,31 **** extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
--- 28,33 ----
extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
extern void FreeFakeRelcacheEntry(Relation fakerel);
+ extern bool xlog_safe_restartpoint(void);
+
#endif
On Thu, Sep 29, 2011 at 12:31 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Tue, Sep 27, 2011 at 8:06 PM, Florian Pflug <fgp@phlo.org> wrote:
On Sep27, 2011, at 07:59 , Heikki Linnakangas wrote:
On 27.09.2011 00:28, Florian Pflug wrote:
On Sep26, 2011, at 22:39 , Tom Lane wrote:
It might be worthwhile to invoke XLogCheckInvalidPages() as soon as
we (think we) have reached consistency, rather than leaving it to be
done only when we exit recovery mode.I believe we also need to prevent the creation of restart points before
we've reached consistency.Seems reasonable. We could still allow restartpoints when the hash table is empty, though. And once we've reached consistency, we can throw an error immediately in log_invalid_page(), instead of adding the entry in the hash table.
That mimics the way the rm_safe_restartpoint callbacks work, which is good.
Actually, why don't we use that machinery to implement this? There's currently no rm_safe_restartpoint callback for RM_XLOG_ID, so we'd just need to create one that checks whether invalid_page_tab is empty.
Okay, the attached patch prevents the creation of restartpoints by using
rm_safe_restartpoint callback if we've not reached a consistent state yet
and the invalid-page table is not empty. But the invalid-page table is not
tied to the specific resource manager, so using rm_safe_restartpoint for
that seems to slightly odd. Is this OK?Also, according to other suggestions, the patch changes XLogCheckInvalidPages()
so that it's called as soon as we've reached a consistent state, and changes
log_invalid_page() so that it emits PANIC immediately if consistency is already
reached. These are very good changes, I think. Because they enable us to
notice serious problem which causes PANIC error immediately. Without these
changes, you unfortunately might notice that the standby database is corrupted
when failover happens. Though such a problem might rarely happen, I think it's
worth doing those changes.
Patch does everything we agreed it should.
Good suggestion from Florian.
This worries me slightly now though because the patch makes us PANIC
in a place we didn't used to and once we do that we cannot restart the
server at all. Are we sure we want that? It's certainly a great way to
shake down errors in other code...
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sep29, 2011, at 13:49 , Simon Riggs wrote:
This worries me slightly now though because the patch makes us PANIC
in a place we didn't used to and once we do that we cannot restart the
server at all. Are we sure we want that? It's certainly a great way to
shake down errors in other code...
The patch only introduces a new PANIC condition during archive recovery,
though. Crash recovery is unaffected, except that we no longer create
restart points before we reach consistency.
Also, if we hit an invalid page reference after reaching consistency,
the cause is probably either a bug in our recovery code, or (quite unlikely)
a corrupted WAL that passed the CRC check. In both cases, the likelyhood
of data-corruption seems high, so PANICing seems like the right thing to do.
best regards,
Florian Pflug
On Thu, Sep 29, 2011 at 11:12 PM, Florian Pflug <fgp@phlo.org> wrote:
On Sep29, 2011, at 13:49 , Simon Riggs wrote:
This worries me slightly now though because the patch makes us PANIC
in a place we didn't used to and once we do that we cannot restart the
server at all. Are we sure we want that? It's certainly a great way to
shake down errors in other code...The patch only introduces a new PANIC condition during archive recovery,
though. Crash recovery is unaffected, except that we no longer create
restart points before we reach consistency.Also, if we hit an invalid page reference after reaching consistency,
the cause is probably either a bug in our recovery code, or (quite unlikely)
a corrupted WAL that passed the CRC check. In both cases, the likelyhood
of data-corruption seems high, so PANICing seems like the right thing to do.
Fair enough.
We might be able to use FATAL or ERROR instead of PANIC because they
also cause all processes to exit when the startup process emits them.
For example, we now use FATAL to stop the server in recovery mode
when recovery is about to end before we've reached a consistent state.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Fri, Sep 30, 2011 at 2:09 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Thu, Sep 29, 2011 at 11:12 PM, Florian Pflug <fgp@phlo.org> wrote:
On Sep29, 2011, at 13:49 , Simon Riggs wrote:
This worries me slightly now though because the patch makes us PANIC
in a place we didn't used to and once we do that we cannot restart the
server at all. Are we sure we want that? It's certainly a great way to
shake down errors in other code...The patch only introduces a new PANIC condition during archive recovery,
though. Crash recovery is unaffected, except that we no longer create
restart points before we reach consistency.Also, if we hit an invalid page reference after reaching consistency,
the cause is probably either a bug in our recovery code, or (quite unlikely)
a corrupted WAL that passed the CRC check. In both cases, the likelyhood
of data-corruption seems high, so PANICing seems like the right thing to do.Fair enough.
We might be able to use FATAL or ERROR instead of PANIC because they
also cause all processes to exit when the startup process emits them.
For example, we now use FATAL to stop the server in recovery mode
when recovery is about to end before we've reached a consistent state.
I think we should issue PANIC if the source is a critical rmgr, or
just WARNING if from a non-critical rmgr, such as indexes.
Ideally, I think we should have a mechanism to allow indexes to be
marked corrupt. For example, a file that if present shows that the
index is corrupt and would be marked not valid. We can then create the
file and send a sinval message to force the index relcache to be
rebuilt showing valid set to false.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Sep 30, 2011 at 3:57 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
I think we should issue PANIC if the source is a critical rmgr, or
just WARNING if from a non-critical rmgr, such as indexes.Ideally, I think we should have a mechanism to allow indexes to be
marked corrupt. For example, a file that if present shows that the
index is corrupt and would be marked not valid. We can then create the
file and send a sinval message to force the index relcache to be
rebuilt showing valid set to false.
This seems not to be specific to the invalid-page table problem.
All error cases from a non-critical rmgr should be treated not-PANIC.
So I think that the idea should be implemented separately from
the patch I've posted.
Anyway what if read-only query accesses the index marked invalid?
Just emit ERROR?
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Mon, Oct 3, 2011 at 6:23 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
So I think that the idea should be implemented separately from
the patch I've posted.
Agreed. I'll do a final review and commit today.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 29.09.2011 14:31, Fujii Masao wrote:
On Tue, Sep 27, 2011 at 8:06 PM, Florian Pflug<fgp@phlo.org> wrote:
Actually, why don't we use that machinery to implement this? There's currently no rm_safe_restartpoint callback for RM_XLOG_ID, so we'd just need to create one that checks whether invalid_page_tab is empty.
Okay, the attached patch prevents the creation of restartpoints by using
rm_safe_restartpoint callback if we've not reached a consistent state yet
and the invalid-page table is not empty. But the invalid-page table is not
tied to the specific resource manager, so using rm_safe_restartpoint for
that seems to slightly odd. Is this OK?
I don't think this should use the rm_safe_restartpoint machinery. As you
said, it's not tied to any specific resource manager. And I've actually
been thinking that we will get rid of rm_safe_restartpoint altogether in
the future. The two things that still use it are the b-tree and gin, and
I'd like to change both of those to not require any post-recovery
cleanup step to finish multi-page operations, similar to what I did with
GiST in 9.1.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Mon, Oct 3, 2011 at 8:21 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 29.09.2011 14:31, Fujii Masao wrote:
On Tue, Sep 27, 2011 at 8:06 PM, Florian Pflug<fgp@phlo.org> wrote:
Actually, why don't we use that machinery to implement this? There's
currently no rm_safe_restartpoint callback for RM_XLOG_ID, so we'd just need
to create one that checks whether invalid_page_tab is empty.Okay, the attached patch prevents the creation of restartpoints by using
rm_safe_restartpoint callback if we've not reached a consistent state yet
and the invalid-page table is not empty. But the invalid-page table is not
tied to the specific resource manager, so using rm_safe_restartpoint for
that seems to slightly odd. Is this OK?I don't think this should use the rm_safe_restartpoint machinery. As you
said, it's not tied to any specific resource manager. And I've actually been
thinking that we will get rid of rm_safe_restartpoint altogether in the
future. The two things that still use it are the b-tree and gin, and I'd
like to change both of those to not require any post-recovery cleanup step
to finish multi-page operations, similar to what I did with GiST in 9.1.
I thought that was quite neat doing it that way, but there's no
specific reason to do it that way I guess. If you're happy to rewrite
the patch then I guess we're OK.
I certainly would like to get rid of rm_safe_restartpoint in the
longer term, hopefully sooner.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Oct 3, 2011 at 4:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
I don't think this should use the rm_safe_restartpoint machinery. As you
said, it's not tied to any specific resource manager. And I've actually been
thinking that we will get rid of rm_safe_restartpoint altogether in the
future. The two things that still use it are the b-tree and gin, and I'd
like to change both of those to not require any post-recovery cleanup step
to finish multi-page operations, similar to what I did with GiST in 9.1.I thought that was quite neat doing it that way, but there's no
specific reason to do it that way I guess. If you're happy to rewrite
the patch then I guess we're OK.I certainly would like to get rid of rm_safe_restartpoint in the
longer term, hopefully sooner.
Though Heikki might be already working on that,... anyway,
the attached patch is the version which doesn't use rm_safe_restartpoint
machinery.
Regards,
--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Attachments:
invalid_page_table_v2.patchtext/x-patch; charset=US-ASCII; name=invalid_page_table_v2.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 557,563 **** static TimeLineID lastPageTLI = 0;
static XLogRecPtr minRecoveryPoint; /* local copy of
* ControlFile->minRecoveryPoint */
static bool updateMinRecoveryPoint = true;
! static bool reachedMinRecoveryPoint = false;
static bool InRedo = false;
--- 557,563 ----
static XLogRecPtr minRecoveryPoint; /* local copy of
* ControlFile->minRecoveryPoint */
static bool updateMinRecoveryPoint = true;
! bool reachedMinRecoveryPoint = false;
static bool InRedo = false;
***************
*** 6841,6852 **** StartupXLOG(void)
LocalXLogInsertAllowed = -1;
/*
- * Check to see if the XLOG sequence contained any unresolved
- * references to uninitialized pages.
- */
- XLogCheckInvalidPages();
-
- /*
* Perform a checkpoint to update all our recovery activity to disk.
*
* Note that we write a shutdown checkpoint rather than an on-line
--- 6841,6846 ----
***************
*** 6983,6988 **** CheckRecoveryConsistency(void)
--- 6977,6988 ----
XLByteLE(minRecoveryPoint, EndRecPtr) &&
XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
{
+ /*
+ * Check to see if the XLOG sequence contained any unresolved
+ * references to uninitialized pages.
+ */
+ XLogCheckInvalidPages();
+
reachedMinRecoveryPoint = true;
ereport(LOG,
(errmsg("consistent recovery state reached at %X/%X",
***************
*** 7974,7980 **** RecoveryRestartPoint(const CheckPoint *checkPoint)
volatile XLogCtlData *xlogctl = XLogCtl;
/*
! * Is it safe to checkpoint? We must ask each of the resource managers
* whether they have any partial state information that might prevent a
* correct restart from this point. If so, we skip this opportunity, but
* return at the next checkpoint record for another try.
--- 7974,7980 ----
volatile XLogCtlData *xlogctl = XLogCtl;
/*
! * Is it safe to restartpoint? We must ask each of the resource managers
* whether they have any partial state information that might prevent a
* correct restart from this point. If so, we skip this opportunity, but
* return at the next checkpoint record for another try.
***************
*** 7994,7999 **** RecoveryRestartPoint(const CheckPoint *checkPoint)
--- 7994,8015 ----
}
/*
+ * Is it safe to restartpoint? We must check whether there are any
+ * unresolved references to invalid pages that might prevent
+ * a correct restart from this point. If so, we skip this opportunity,
+ * but return at the next checkpoint record for another try.
+ */
+ if (have_invalid_pages())
+ {
+ elog(trace_recovery(DEBUG2),
+ "could not record restart point at %X/%X because there "
+ "are unresolved references to invalid pages",
+ checkPoint->redo.xlogid,
+ checkPoint->redo.xrecoff);
+ return;
+ }
+
+ /*
* Copy the checkpoint record to shared memory, so that bgwriter can use
* it the next time it wants to perform a restartpoint.
*/
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
***************
*** 52,57 **** typedef struct xl_invalid_page
--- 52,73 ----
static HTAB *invalid_page_tab = NULL;
+ /* Report a reference to an invalid page */
+ static void
+ report_invalid_page(int elevel, RelFileNode node, ForkNumber forkno,
+ BlockNumber blkno, bool present)
+ {
+ char *path = relpathperm(node, forkno);
+
+ if (present)
+ elog(elevel, "page %u of relation %s is uninitialized",
+ blkno, path);
+ else
+ elog(elevel, "page %u of relation %s does not exist",
+ blkno, path);
+ pfree(path);
+ }
+
/* Log a reference to an invalid page */
static void
log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
***************
*** 62,83 **** log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
bool found;
/*
* Log references to invalid pages at DEBUG1 level. This allows some
* tracing of the cause (note the elog context mechanism will tell us
* something about the XLOG record that generated the reference).
*/
if (log_min_messages <= DEBUG1 || client_min_messages <= DEBUG1)
! {
! char *path = relpathperm(node, forkno);
!
! if (present)
! elog(DEBUG1, "page %u of relation %s is uninitialized",
! blkno, path);
! else
! elog(DEBUG1, "page %u of relation %s does not exist",
! blkno, path);
! pfree(path);
! }
if (invalid_page_tab == NULL)
{
--- 78,101 ----
bool found;
/*
+ * Once recovery has reached a consistent state, the invalid-page
+ * table should be empty and remain so. If a reference to an invalid
+ * page is found after consistency is reached, emit PANIC immediately
+ * instead of adding the entry in the invalid-page table.
+ */
+ if (reachedMinRecoveryPoint)
+ {
+ report_invalid_page(WARNING, node, forkno, blkno, present);
+ elog(PANIC, "WAL contains references to invalid pages");
+ }
+
+ /*
* Log references to invalid pages at DEBUG1 level. This allows some
* tracing of the cause (note the elog context mechanism will tell us
* something about the XLOG record that generated the reference).
*/
if (log_min_messages <= DEBUG1 || client_min_messages <= DEBUG1)
! report_invalid_page(DEBUG1, node, forkno, blkno, present);
if (invalid_page_tab == NULL)
{
***************
*** 181,186 **** forget_invalid_pages_db(Oid dbid)
--- 199,214 ----
}
}
+ /* Are there any unresolved references to invalid pages? */
+ bool
+ have_invalid_pages(void)
+ {
+ if (invalid_page_tab != NULL &&
+ hash_get_num_entries(invalid_page_tab) > 0)
+ return true;
+ return false;
+ }
+
/* Complain about any remaining invalid-page entries */
void
XLogCheckInvalidPages(void)
***************
*** 200,214 **** XLogCheckInvalidPages(void)
*/
while ((hentry = (xl_invalid_page *) hash_seq_search(&status)) != NULL)
{
! char *path = relpathperm(hentry->key.node, hentry->key.forkno);
!
! if (hentry->present)
! elog(WARNING, "page %u of relation %s was uninitialized",
! hentry->key.blkno, path);
! else
! elog(WARNING, "page %u of relation %s did not exist",
! hentry->key.blkno, path);
! pfree(path);
foundone = true;
}
--- 228,235 ----
*/
while ((hentry = (xl_invalid_page *) hash_seq_search(&status)) != NULL)
{
! report_invalid_page(WARNING, hentry->key.node, hentry->key.forkno,
! hentry->key.blkno, hentry->present);
foundone = true;
}
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 189,194 **** typedef enum
--- 189,196 ----
extern XLogRecPtr XactLastRecEnd;
+ extern bool reachedMinRecoveryPoint;
+
/* these variables are GUC parameters related to XLOG */
extern int CheckPointSegments;
extern int wal_keep_segments;
*** a/src/include/access/xlogutils.h
--- b/src/include/access/xlogutils.h
***************
*** 14,19 ****
--- 14,21 ----
#include "storage/bufmgr.h"
+ extern bool have_invalid_pages(void);
+
extern void XLogCheckInvalidPages(void);
extern void XLogDropRelation(RelFileNode rnode, ForkNumber forknum);
On Tue, Oct 4, 2011 at 7:43 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
On Mon, Oct 3, 2011 at 4:32 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
I don't think this should use the rm_safe_restartpoint machinery. As you
said, it's not tied to any specific resource manager. And I've actually been
thinking that we will get rid of rm_safe_restartpoint altogether in the
future. The two things that still use it are the b-tree and gin, and I'd
like to change both of those to not require any post-recovery cleanup step
to finish multi-page operations, similar to what I did with GiST in 9.1.I thought that was quite neat doing it that way, but there's no
specific reason to do it that way I guess. If you're happy to rewrite
the patch then I guess we're OK.I certainly would like to get rid of rm_safe_restartpoint in the
longer term, hopefully sooner.Though Heikki might be already working on that,... anyway,
the attached patch is the version which doesn't use rm_safe_restartpoint
machinery.
Heikki - I see you are down on the CF app to review this.
I'd been working on it as well, just forgot to let Greg know.
Did you start already? Should I stop?
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 04.10.2011 09:43, Fujii Masao wrote:
On Mon, Oct 3, 2011 at 4:32 PM, Simon Riggs<simon@2ndquadrant.com> wrote:
I don't think this should use the rm_safe_restartpoint machinery. As you
said, it's not tied to any specific resource manager. And I've actually been
thinking that we will get rid of rm_safe_restartpoint altogether in the
future. The two things that still use it are the b-tree and gin, and I'd
like to change both of those to not require any post-recovery cleanup step
to finish multi-page operations, similar to what I did with GiST in 9.1.I thought that was quite neat doing it that way, but there's no
specific reason to do it that way I guess. If you're happy to rewrite
the patch then I guess we're OK.I certainly would like to get rid of rm_safe_restartpoint in the
longer term, hopefully sooner.Though Heikki might be already working on that,...
Just haven't gotten around to it. It's a fair amount of work with little
user-visible benefit.
anyway,
the attached patch is the version which doesn't use rm_safe_restartpoint
machinery.
Thanks, committed.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com