WAL's single point of failure: latest CHECKPOINT record

Started by Tom Lane · almost 25 years ago · 4 messages
#1 Tom Lane
tgl@sss.pgh.pa.us

As the WAL stuff is currently constructed, the system will refuse to
start up unless the checkPoint field of pg_control points at a valid
checkpoint record in the WAL log.

Now I know we write and fsync the checkpoint record before we rewrite
pg_control, but this still leaves me feeling mighty uncomfortable.
See past discussions about how fsync order doesn't necessarily mean
anything if the disk drive chooses to reorder writes. Since loss of
the checkpoint record means complete loss of the database, I think we
need to work harder here.
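
For illustration, here is a minimal sketch of the write ordering in question, using plain POSIX calls rather than the actual PostgreSQL code; the function and parameter names are hypothetical:

#include <sys/types.h>
#include <unistd.h>

/*
 * Sketch of the checkpoint write ordering described above.  The
 * checkpoint record is written and fsync'd before pg_control is
 * rewritten, but a drive that caches or reorders writes may still
 * persist the pg_control update first, leaving it pointing at a
 * checkpoint record that never reached stable storage.
 */
static int
write_checkpoint_then_control(int wal_fd, int control_fd,
                              const void *ckpt_rec, size_t ckpt_len,
                              const void *ctl_buf, size_t ctl_len)
{
    /* Step 1: append the checkpoint record to the current WAL file. */
    if (write(wal_fd, ckpt_rec, ckpt_len) != (ssize_t) ckpt_len)
        return -1;
    if (fsync(wal_fd) != 0)     /* OS-level flush; the drive may still reorder */
        return -1;

    /* Step 2: overwrite pg_control with the new checkpoint pointer. */
    if (lseek(control_fd, 0, SEEK_SET) == (off_t) -1)
        return -1;
    if (write(control_fd, ctl_buf, ctl_len) != (ssize_t) ctl_len)
        return -1;
    return fsync(control_fd);
}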

What I'm thinking is that pg_control should have pointers to the last
two checkpoint records, not only the last one. If we fail to read the
most recent checkpoint, try the one before it (which, obviously, means
we must keep the log files long enough that we still have that one too).
We can run forward from there and redo the intervening WAL records the
same as we would do anyway.
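
A rough sketch of that fallback follows; the struct layout and helper functions are hypothetical illustrations, not the real pg_control format or recovery code:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical WAL location: log file id plus byte offset. */
typedef struct XLogLocation
{
    uint32_t logId;
    uint32_t offset;
} XLogLocation;

/*
 * Hypothetical pg_control contents carrying pointers to the last two
 * checkpoint records, per the proposal above.
 */
typedef struct ControlData
{
    XLogLocation checkPoint;       /* most recent checkpoint record */
    XLogLocation prevCheckPoint;   /* the one before it */
} ControlData;

/* Assumed helpers, defined elsewhere for the purpose of the sketch. */
extern bool checkpoint_record_is_valid(XLogLocation loc);
extern void redo_wal_from(XLogLocation loc);

/*
 * Startup: use the latest checkpoint if it reads back correctly,
 * otherwise fall back to the previous one and replay the intervening
 * WAL records exactly as an ordinary redo would.
 */
static bool
start_recovery(const ControlData *control)
{
    if (checkpoint_record_is_valid(control->checkPoint))
    {
        redo_wal_from(control->checkPoint);
        return true;
    }
    if (checkpoint_record_is_valid(control->prevCheckPoint))
    {
        redo_wal_from(control->prevCheckPoint);
        return true;
    }
    return false;   /* neither checkpoint readable: refuse to start up */
}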

This would mean an initdb to change the format of pg_control. However,
I already have a couple of other reasons in favor of an initdb: the
record-length bug I mentioned yesterday, and the bogus CRC algorithm.
I'm not finished reviewing the WAL code, either :-(

regards, tom lane

#2 Justin Clift
aa2@bigpond.net.au
In reply to: Tom Lane (#1)
Re: WAL's single point of failure: latest CHECKPOINT record

Hi all,

Out of curiosity, does anyone know of any projects that are presently
creating PostgreSQL database recovery tools?

For example, database corruption recovery, point-in-time restoration, and
such things?

It might be a good project for GreatBridge to look into if no-one else
is doing it already.

Regards and best wishes,

Justin Clift
Database Administrator


#3 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Justin Clift (#2)
Re: WAL's single point of failure: latest CHECKPOINT record

We really need point-in-time recovery, removal of the need to vacuum,
and more full-featured replication. Hopefully most can be addressed in
7.2.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#4 Ned Lilly
ned@greatbridge.com
In reply to: Bruce Momjian (#3)
7.2 tools (was: WAL's single point of failure: latest CHECKPOINT record)

Yes, there is backend functionality on tap for 7.2 (see TODO) that will need to
be in place before the tools Justin mentions can be properly built.

We're very interested in helping out with the tools, and will be talking to the
-hackers list more about our ideas once 7.1 is out the door.

Regards,
Ned
