Proposed WAL changes

Started by Tom Lane, 44 messages
#1Tom Lane
tgl@sss.pgh.pa.us

I have just sent to the pgsql-patches list a rather large set of
proposed diffs for the WAL code. These changes:

* Store two past checkpoint locations, not just one, in pg_control.
On startup, we fall back to the older checkpoint if the newer one
is unreadable. Also, a physical copy of the newest checkpoint record
is kept in pg_control for possible use in disaster recovery (ie,
complete loss of pg_xlog). Also add a version number for pg_control
itself. Remove archdir from pg_control; it ought to be a GUC
parameter, not a special case (not that it's implemented yet anyway).
(A rough sketch of the resulting pg_control layout follows this list.)

* Suppress successive checkpoint records when nothing has been entered
in the WAL log since the last one. This is not so much to avoid I/O
as to make it actually useful to keep track of the last two
checkpoints. If the things are right next to each other then there's
not a lot of redundancy gained...

* Change CRC scheme to a true 64-bit CRC, not a pair of 32-bit CRCs
on alternate bytes.

* Fix XLOG record length handling so that it will work at BLCKSZ = 32k.

* Change XID allocation to work more like OID allocation, so that we
can flush XID alloc info to the log before there is any chance an XID
will appear in heap files.

* Fix a number of minor bugs, such as off-by-one logic for XLOG file
wraparound at the 4 gig mark.

* Add documentation and clean up some coding infelicities; move file
format declarations out to include files where planned contrib
utilities can get at them.
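
A minimal sketch, in C, of what such a two-checkpoint pg_control layout
could look like; the type and field names are illustrative only, not the
actual definitions in the patch:

#include <stdint.h>

/*
 * Illustrative only -- not the actual ControlFileData layout.
 * The idea: remember the two most recent checkpoint locations, plus a
 * physical copy of the newest checkpoint record, so startup can fall back
 * to the prior checkpoint, or rebuild an empty WAL from the copy if
 * pg_xlog is lost entirely.
 */
typedef struct SketchRecPtr
{
    uint32_t xlogid;            /* log file number */
    uint32_t xrecoff;           /* byte offset within that file */
} SketchRecPtr;

typedef struct SketchCheckPoint
{
    SketchRecPtr redo;          /* where replay must begin */
    uint32_t     nextXid;       /* next free transaction id */
    uint32_t     nextOid;       /* next free object id */
} SketchCheckPoint;

typedef struct SketchControlFile
{
    uint32_t         version;        /* version number of pg_control itself */
    SketchRecPtr     checkPoint;     /* newest checkpoint location */
    SketchRecPtr     prevCheckPoint; /* next-older checkpoint location */
    SketchCheckPoint checkPointCopy; /* copy of the newest checkpoint record */
    uint64_t         crc;            /* CRC of the fields above */
} SketchControlFile;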

Before committing this stuff, I intend to prepare a contrib utility that
can be used to reset pg_control and pg_xlog. This is mainly for
disaster recovery purposes, but as a side benefit it will allow people
to update 7.1beta installations to this new code without doing initdb.
I need to update contrib/pg_controldata, too.

regards, tom lane

#2Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Tom Lane (#1)
RE: Proposed WAL changes

-----Original Message-----
From: Tom Lane

I have just sent to the pgsql-patches list a rather large set of
proposed diffs for the WAL code. These changes:

* Store two past checkpoint locations, not just one, in pg_control.
On startup, we fall back to the older checkpoint if the newer one
is unreadable. Also, a physical copy of the newest checkpoint record
is kept in pg_control for possible use in disaster recovery (ie,
complete loss of pg_xlog). Also add a version number for pg_control
itself. Remove archdir from pg_control; it ought to be a GUC
parameter, not a special case (not that it's implemented yet anyway).

Is archdir really a GUC parameter ?

Regards,
Hiroshi Inoue

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#2)
Re: Proposed WAL changes

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Remove archdir from pg_control; it ought to be a GUC
parameter, not a special case (not that it's implemented yet anyway).

Is archdir really a GUC parameter ?

Why shouldn't it be? I see nothing wrong with changing it on-the-fly.

regards, tom lane

#4Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Tom Lane (#3)
AW: Proposed WAL changes

Remove archdir from pg_control; it ought to be a GUC
parameter, not a special case (not that it's implemented yet anyway).

Is archdir really a GUC parameter ?

Why shouldn't it be? I see nothing wrong with changing it on-the-fly.

Yes, I think this is a good change, like all others except XID assignment :-)

Andreas

#5Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zeugswetter Andreas SB (#4)
RE: Proposed WAL changes

-----Original Message-----
From: Zeugswetter Andreas SB

Remove archdir from pg_control; it ought to be a GUC
parameter, not a special case (not that it's implemented yet anyway).

Is archdir really a GUC parameter ?

Why shouldn't it be? I see nothing wrong with changing it on-the-fly.

Yes, I think this is a good change, like all others except XID
assignment :-)

Could GUC parameters be changed permanently e.g. by SET command ?

For example,
1) start postmaster
2) set archdir to '....'
3) shutdown postmaster

Does PostgreSQL remember the archdir parameter ?

Regards,
Hiroshi Inoue

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#5)
Re: Proposed WAL changes

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Could GUC parameters be changed permanently e.g. by SET command ?

That's what postgresql.conf is for ...

regards, tom lane

#7Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Hiroshi Inoue (#5)
Re: Proposed WAL changes

Tom Lane wrote:

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:

Could GUC parameters be changed permanently e.g. by SET command ?

That's what postgresql.conf is for ...

Do I have to send SIGHUP after changing postgresql.conf ?

Regards,
Hiroshi Inoue

#8Zeugswetter Andreas SB
ZeugswetterA@wien.spardat.at
In reply to: Hiroshi Inoue (#7)
AW: Proposed WAL changes

Could GUC parameters be changed permanently e.g. by SET command ?

For example,
1) start postmaster
2) set archdir to '....'
3) shutdown postmaster

I thought the intended way to change a GUC parameter permanently was to
edit data/postgresql.conf . No ?

Andreas

#9Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Zeugswetter Andreas SB (#8)
Re: AW: Proposed WAL changes

Zeugswetter Andreas SB wrote:

Could GUC parameters be changed permanently e.g. by SET command ?

For example,
1) start postmaster
2) set archdir to '....'
3) shutdown postmaster

I thought the intended way to change a GUC parameter permanently was to
edit data/postgresql.conf . No ?

What I've thought is to implement a new command to
change archdir under WAL's control.
If it's different from Vadim's plan I don't object.

Regards,
Hiroshi Inoue

#10Vadim Mikheev
vmikheev@sectorbase.com
In reply to: Zeugswetter Andreas SB (#8)
Re: AW: Proposed WAL changes

I thought the intended way to change a GUC parameter permanently was to
edit data/postgresql.conf . No ?

What I've thought is to implement a new command to
change archdir under WAL's control.
If it's different from Vadim's plan I don't object.

Actually, I have no concrete plans for archdir yet - this
one is for the WAL-based BAR we should discuss in the future.
So I don't see why we should remove archdir from pg_control now.

Vadim

#11Vadim Mikheev
vmikheev@sectorbase.com
In reply to: Vadim Mikheev (#10)
Re: Proposed WAL changes

I have just sent to the pgsql-patches list a rather large set of

Please send it to me directly - the pgsql-patches archive only goes up to Feb -:(

proposed diffs for the WAL code. These changes:

* Store two past checkpoint locations, not just one, in pg_control.
On startup, we fall back to the older checkpoint if the newer one
is unreadable. Also, a physical copy of the newest checkpoint record

And what to do if older one is unreadable too?
(Isn't it like using 2 x CRC32 instead of CRC64 ? -:))
And what to do if pg_control was lost? (We already discussed that we
should read all logs from newest to oldest ones to find checkpoint).
And why to keep old log files with older checkpoint?

is kept in pg_control for possible use in disaster recovery (ie,
complete loss of pg_xlog). Also add a version number for pg_control

Mmmm, how is recovery possible if the log was lost? All that could be done
with the DB in the event of a corrupted/lost log is dumping data from tables
*as is*, without any guarantee about consistency. How could the checkpoint's
content be useful?

I feel that the fact that

WAL can't help in the event of disk errors

is often overlooked.

itself. Remove archdir from pg_control; it ought to be a GUC
parameter, not a special case (not that it's implemented yet anyway).

I would discuss WAL based BAR management before deciding how to
store/assign archdir. On the other hand it's easy to add archdir
to pg_control later -:)

* Change CRC scheme to a true 64-bit CRC, not a pair of 32-bit CRCs
on alternate bytes.

Great if you've found reliable CRC64 impl!
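
For illustration, a minimal bit-at-a-time 64-bit CRC in C. The ECMA-182
polynomial is shown purely as an example; the polynomial and the (presumably
table-driven) code in the patch may differ. The point versus the old scheme
is that every input bit affects the full 64-bit remainder, instead of two
independent 32-bit remainders over alternating bytes:

#include <stddef.h>
#include <stdint.h>

/* ECMA-182 polynomial, shown only as an example of a 64-bit generator. */
#define CRC64_POLY UINT64_C(0x42F0E1EBA9EA3693)

/* Non-reflected CRC-64 computed one bit at a time over a byte buffer. */
static uint64_t
crc64(const unsigned char *buf, size_t len)
{
    uint64_t crc = ~UINT64_C(0);

    for (size_t i = 0; i < len; i++)
    {
        crc ^= (uint64_t) buf[i] << 56;     /* fold the next byte into the top */
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & UINT64_C(0x8000000000000000))
                ? (crc << 1) ^ CRC64_POLY
                : (crc << 1);
    }
    return ~crc;
}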

* Fix XLOG record length handling so that it will work at BLCKSZ = 32k.

A case I've overlooked -:(
(Though I've always considered BLCKSZ > 8K a temporary hack -:))

* Change XID allocation to work more like OID allocation, so that we
can flush XID alloc info to the log before there is any chance an XID
will appear in heap files.

I didn't read your postings about this yet.

* Add documentation and clean up some coding infelicities; move file
format declarations out to include files where planned contrib
utilities can get at them.

Thanks for that!

Before committing this stuff, I intend to prepare a contrib utility that
can be used to reset pg_control and pg_xlog. This is mainly for
disaster recovery purposes, but as a side benefit it will allow people

Once again, I would call this "disaster *dump* purposes" -:)
After such an operation the DB shouldn't be used for anything but a dump!

to update 7.1beta installations to this new code without doing initdb.
I need to update contrib/pg_controldata, too.

Vadim

#12Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Vadim Mikheev (#10)
Re: AW: Proposed WAL changes

I thought the intended way to change a GUC parameter permanently was to
edit data/postgresql.conf . No ?

What I've thought is to implement a new command to
change archdir under WAL's control.
If it's different from Vadim's plan I don't object.

Actually, I have no concrete plans for archdir yet - this
one is for the WAL-based BAR we should discuss in the future.
So I don't see why we should remove archdir from pg_control now.

Maybe we can get BAR in 7.1.X so maybe we should have the option to add
it.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#7)
Re: Proposed WAL changes

Hiroshi Inoue <Inoue@tpf.co.jp> writes:

Do I have to send SIGHUP after changing postgresql.conf ?

In general, yes. I think that for the (still vaporware) archdir option,
you might not need to: archdir will only be looked at by the checkpoint
subprocess, and I think that a newly spawned backend will reread
postgresql.conf anyway. Peter, is that correct?

regards, tom lane

#14Tom Lane
tgl@sss.pgh.pa.us
In reply to: Vadim Mikheev (#10)
Re: AW: Proposed WAL changes

"Vadim Mikheev" <vmikheev@sectorbase.com> writes:

So, I don't see why to remove archdir from pg_control now.

I didn't like the space consumption. I think it's important that the
pg_control data fit in less than 512 bytes so that it doesn't cross
physical sectors on the disk. This reduces the odds of being left
with a corrupted pg_control due to partial write during power loss.
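
A sketch of that constraint in C: keep the control data under 512 bytes
(checked at compile time here) and write it as one full-sector write, so a
torn write across sector boundaries cannot happen. The struct contents and
file name are placeholders, not the actual implementation:

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Placeholder contents; the real control data is whatever the patch defines. */
typedef struct SketchControlData
{
    uint32_t version;
    uint64_t checkPoint;        /* newest checkpoint location */
    uint64_t prevCheckPoint;    /* prior checkpoint location */
    uint64_t crc;
} SketchControlData;

/* Fail to compile if the control data ever outgrows one 512-byte sector. */
typedef char control_fits_one_sector[(sizeof(SketchControlData) <= 512) ? 1 : -1];

static int
write_control_file(const char *path, const SketchControlData *data)
{
    char buf[512];
    int  fd;

    memset(buf, 0, sizeof(buf));            /* pad out to a full sector */
    memcpy(buf, data, sizeof(*data));

    fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    /* A single one-sector write cannot be torn across sector boundaries. */
    if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf) || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}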

That's a second-order consideration, possibly, but I can see no
redeeming social advantage whatever to having archdir in pg_control
rather than in postgresql.conf where all the other system parameters
live. Unless you've got one, it's coming out...

regards, tom lane

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Vadim Mikheev (#11)
Re: Proposed WAL changes

"Vadim Mikheev" <vmikheev@sectorbase.com> writes:

I have just sent to the pgsql-patches list a rather large set of

Please send it to me directly - the pgsql-patches archive only goes up to Feb -:(

Done under separate cover.

proposed diffs for the WAL code. These changes:

* Store two past checkpoint locations, not just one, in pg_control.
On startup, we fall back to the older checkpoint if the newer one
is unreadable. Also, a physical copy of the newest checkpoint record

And what to do if older one is unreadable too?
(Isn't it like using 2 x CRC32 instead of CRC64 ? -:))

Then you lose --- but two checkpoints gives you twice the chance of
recovery (probably more, actually, since it's much more likely that
the previous checkpoint will have reached disk safely).

And what to do if pg_control was lost? (We already discussed that we
should read all logs from newest to oldest ones to find checkpoint).

If you have valid WAL files and broken pg_control, then reading the WAL
files is a way to recover. If you have valid pg_control and broken WAL
files, you have a big problem, but using pg_control to generate a new
empty WAL will at least let you get at your heap files.
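
A rough sketch of that fallback order in C; all names here are hypothetical,
the real code works in terms of pg_control and the XLOG reader routines:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t SketchRecPtr;

/* Stub: in reality this would read the record at 'loc' and CRC-check it. */
static bool
checkpoint_is_readable(SketchRecPtr loc)
{
    return loc != 0;
}

/*
 * Prefer the newest checkpoint, fall back to the prior one, and only if
 * both are unreadable resort to the checkpoint copy kept in pg_control
 * (at which point the database is really only good for dumping).
 */
static SketchRecPtr
choose_recovery_start(SketchRecPtr newest, SketchRecPtr prior)
{
    if (checkpoint_is_readable(newest))
        return newest;

    fprintf(stderr, "newest checkpoint unreadable, falling back to prior one\n");
    if (checkpoint_is_readable(prior))
        return prior;

    fprintf(stderr, "no readable checkpoint; rebuild empty WAL from pg_control copy\n");
    return 0;
}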

And why to keep old log files with older checkpoint?

Not much point in remembering the older checkpoint location if the
associated WAL file is removed...

Mmmm, how is recovery possible if the log was lost? All that could be done
with the DB in the event of a corrupted/lost log is dumping data from tables
*as is*, without any guarantee about consistency.

Exactly. That is still better than not being able to dump the data at
all.

* Change XID allocation to work more like OID allocation, so that we
can flush XID alloc info to the log before there is any chance an XID
will appear in heap files.

I didn't read your postings about this yet.

See later discussion --- Andreas convinced me that flushing NEXTXID
records to disk isn't really needed after all. (I didn't take the flush
out of my patch yet, but will do so.) I still want to leave the NEXTXID
records in there, though, because I think that XID and OID assignment
ought to work as nearly alike as possible.

Before committing this stuff, I intend to prepare a contrib utility that
can be used to reset pg_control and pg_xlog. This is mainly for
disaster recovery purposes, but as a side benefit it will allow people

Once again, I would call this "disaster *dump* purposes" -:)
After such an operation the DB shouldn't be used for anything but a dump!

Fair enough. But we need it.

regards, tom lane

#16The Hermit Hacker
scrappy@hub.org
In reply to: Vadim Mikheev (#11)
Re: Proposed WAL changes

On Wed, 7 Mar 2001, Vadim Mikheev wrote:

I have just sent to the pgsql-patches list a rather large set of

Please send it to me directly - the pgsql-patches archive only goes up to Feb -:(

Huh?

http://www.postgresql.org/mhonarc/pgsql-patches/2001-03/index.html

#17Peter Eisentraut
peter_e@gmx.net
In reply to: Tom Lane (#13)
Re: Proposed WAL changes

Tom Lane writes:

Hiroshi Inoue <Inoue@tpf.co.jp> writes:

Do I have to send SIGHUP after changing postgresql.conf ?

In general, yes. I think that for the (still vaporware) archdir option,
you might not need to: archdir will only be looked at by the checkpoint
subprocess, and I think that a newly spawned backend will reread
postgresql.conf anyway. Peter, is that correct?

Nope. The configuration file is only read at postmaster start and after
SIGHUP. If any starting backend would read it automatically, the admin
could never be sure about his edits.

--
Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/

#18Ian Lance Taylor
In reply to: Vadim Mikheev (#11)
Re: Proposed WAL changes

"Vadim Mikheev" <vmikheev@sectorbase.com> writes:

I feel that the fact that

WAL can't help in the event of disk errors

is often overlooked.

This is true in general. But, nevertheless, WAL can be written to
protect against predictable disk errors, when possible. Failing to
write a couple of disk blocks when the system crashes is a reasonably
predictable disk error. WAL should ideally be written to work
correctly in that situation.

Ian

#19Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Ian Lance Taylor (#18)
RE: AW: Proposed WAL changes

What I've thought is to implement a new command to
change archdir under WAL's control.
If it's different from Vadim's plan I don't object.

Actually, I have no concrete plans for archdir yet - this
one is for the WAL-based BAR we should discuss in the future.
So I don't see why we should remove archdir from pg_control now.

Maybe we can get BAR in 7.1.X so maybe we should have the
option to add it.

So, it's better to leave archdir in pg_control for now - if we
decide that GUC is the better place then we'll just ignore archdir
in pg_control. But if we remove it now and it turns out to be better
to have it in pg_control, we won't be able to add it back there.

Vadim

#20Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#19)
RE: Proposed WAL changes

I have just sent to the pgsql-patches list a rather large set of

Please send it to me directly - the pgsql-patches archive only goes up to Feb -:(

Huh?

http://www.postgresql.org/mhonarc/pgsql-patches/2001-03/index.html

But there was nothing there yesterday -:)
(Seems the archive wasn't updated at that time.)

Vadim

#21Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#20)
RE: Proposed WAL changes

I feel that the fact that

WAL can't help in the event of disk errors

is often overlooked.

This is true in general. But, nevertheless, WAL can be written to
protect against predictable disk errors, when possible. Failing to
write a couple of disk blocks when the system crashes is a reasonably
predictable disk error. WAL should ideally be written to work
correctly in that situation.

But what can be done if fsync returns before pages flushed?

Vadim

#22Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#21)
RE: Proposed WAL changes

But what can be done if fsync returns before pages flushed?

No, it won't. When fsync returns, data is promised by the OS to be on
disk.

Seems you didn't follow discussions about this issue.

Vadim

#23Alex Pilosov
alex@pilosoft.com
In reply to: Mikheev, Vadim (#21)
RE: Proposed WAL changes

On Wed, 7 Mar 2001, Mikheev, Vadim wrote:

But what can be done if fsync returns before pages flushed?

No, it won't. When fsync returns, data is promised by the OS to be on
disk.

-alex

#24Nathan Myers
ncm@zembu.com
In reply to: Tom Lane (#15)
Re: Proposed WAL changes

On Wed, Mar 07, 2001 at 11:09:25AM -0500, Tom Lane wrote:

"Vadim Mikheev" <vmikheev@sectorbase.com> writes:

* Store two past checkpoint locations, not just one, in pg_control.
On startup, we fall back to the older checkpoint if the newer one
is unreadable. Also, a physical copy of the newest checkpoint record

And what to do if older one is unreadable too?
(Isn't it like using 2 x CRC32 instead of CRC64 ? -:))

Then you lose --- but two checkpoints gives you twice the chance of
recovery (probably more, actually, since it's much more likely that
the previous checkpoint will have reached disk safely).

Actually far more: if the checkpoints are minutes apart, even the
worst disk drive will certainly have flushed any blocks written for
the earlier checkpoint.

--
Nathan Myers
ncm@zembu.com

#25Nathan Myers
ncm@zembu.com
In reply to: Mikheev, Vadim (#21)
Re: Proposed WAL changes

On Wed, Mar 07, 2001 at 12:03:41PM -0800, Mikheev, Vadim wrote:

Ian wrote:

I feel that the fact that

WAL can't help in the event of disk errors

is often overlooked.

This is true in general. But, nevertheless, WAL can be written to
protect against predictable disk errors, when possible. Failing to
write a couple of disk blocks when the system crashes

or, more likely, when power drops; a system crash shouldn't keep the
disk from draining its buffers ...

is a reasonably predictable disk error. WAL should ideally be
written to work correctly in that situation.

But what can be done if fsync returns before pages flushed?

Just what Tom has done: preserve a little more history. If it's not
too expensive, then it doesn't hurt you when running on sound hardware,
but it offers a good chance of preventing embarrassments for (the
overwhelming fraction of) users on garbage hardware.

Nathan Myers
ncm@zembu.com

#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#19)
Re: AW: Proposed WAL changes

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

So, it's better to leave archdir in pg_control for now - if we
decide that GUC is the better place then we'll just ignore archdir
in pg_control. But if we remove it now and it turns out to be better
to have it in pg_control, we won't be able to add it back there.

But what possible reason is there for keeping it in pg_control?
AFAICS that would just mean that we'd need special code for setting it,
instead of making use of all of Peter's hard work on GUC.

We should be moving away from special-case parameter mechanisms, not
inventing new ones without darn good reasons for them.

regards, tom lane

#27Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Mikheev, Vadim (#19)
Re: AW: Proposed WAL changes

Tom Lane wrote:

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

So, it's better to leave archdir in pg_control for now - if we
decide that GUC is the better place then we'll just ignore archdir
in pg_control. But if we remove it now and it turns out to be better
to have it in pg_control, we won't be able to add it back there.

But what possible reason is there for keeping it in pg_control?
AFAICS that would just mean that we'd need special code for setting it,
instead of making use of all of Peter's hard work on GUC.

I don't think it's appropriate to edit archdir by hand.

Regards,
Hiroshi Inoue

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#27)
Re: AW: Proposed WAL changes

Hiroshi Inoue <Inoue@tpf.co.jp> writes:

But what possible reason is there for keeping it in pg_control?
AFAICS that would just mean that we'd need special code for setting it,
instead of making use of all of Peter's hard work on GUC.

I don't think it's appropriate to edit archdir by hand.

Why not? How is this a critical parameter (more critical than, say,
fsync enable)? I see no reason to forbid the administrator from
changing it ... indeed, I think an admin who found out he couldn't
change it on-the-fly would be justifiably unhappy. ("What do you
MEAN I can't change archdir? I'm about to run out of space in
/usr/logs/foobar!!!")

I agree that we don't want random users changing the value via SET and
then issuing a CHECKPOINT (which would use whatever they'd SET :-().
But that's easily managed by setting an appropriate protection level
on the GUC variable. Looks like SIGHUP level would be appropriate.

regards, tom lane

#29Ian Lance Taylor
In reply to: Mikheev, Vadim (#21)
Re: Proposed WAL changes

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

I feel that the fact that

WAL can't help in the event of disk errors

is often overlooked.

This is true in general. But, nevertheless, WAL can be written to
protect against predictable disk errors, when possible. Failing to
write a couple of disk blocks when the system crashes is a reasonably
predictable disk error. WAL should ideally be written to work
correctly in that situation.

But what can be done if fsync returns before pages flushed?

When you write out critical information, you keep earlier versions of
it. On startup, if the critical information is corrupt, you use the
earlier versions of it. This helps protect against the scenario I
mentioned: a few disk blocks may not have been written when the power
goes out.

My impression is that that is what Tom is doing with his patches.

Ian

#30Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Mikheev, Vadim (#19)
Re: AW: Proposed WAL changes

Tom Lane wrote:

Hiroshi Inoue <Inoue@tpf.co.jp> writes:

But what possible reason is there for keeping it in pg_control?
AFAICS that would just mean that we'd need special code for setting it,
instead of making use of all of Peter's hard work on GUC.

I don't think it's appropriate to edit archdir by hand.

Why not? How is this a critical parameter (more critical than, say,
fsync enable)?

I don't think 'fsync enable' is a critical parameter.
It's a dangerous parameter, and it's not appropriate
as a GUC parameter either. Does it have any meaning
other than for testing? IMHO a recovery system doesn't
allow any optimism, and archdir is also part of the
recovery system, though I'm not sure how critical
the parameter would be.

Regards,
Hiroshi Inoue

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hiroshi Inoue (#30)
Re: AW: Proposed WAL changes

Hiroshi Inoue <Inoue@tpf.co.jp> writes:

Tom Lane wrote:

Why not? How is this a critical parameter (more critical than, say,
fsync enable)?

I don't think 'fsync enable' is a critical parameter.
It's a dangerous parameter, and it's not appropriate
as a GUC parameter either.

That's also PGC_SIGHUP (recently fixed by me, it was set at a lower level
before).

Does it have any meaning other than for testing? IMHO a recovery system
doesn't allow any optimism, and archdir is also part of the recovery
system, though I'm not sure how critical the parameter would be.

I still don't see your point. The admin *can* change these parameters
if he wishes. Why should we make it more difficult to do so than is
reasonably necessary? There is certainly no technical reason why we
should (say) force an initdb to change archdir.

regards, tom lane

#32Hiroshi Inoue
Inoue@tpf.co.jp
In reply to: Mikheev, Vadim (#19)
Re: AW: Proposed WAL changes

Tom Lane wrote:

Hiroshi Inoue <Inoue@tpf.co.jp> writes:

Tom Lane wrote:

Why not? How is this a critical parameter (more critical than, say,
fsync enable)?

Does it have any meaning other than for testing? IMHO a recovery system
doesn't allow any optimism, and archdir is also part of the recovery
system, though I'm not sure how critical the parameter would be.

I still don't see your point. The admin *can* change these parameters
if he wishes. Why should we make it more difficult to do so than is
reasonably necessary? There is certainly no technical reason why we
should (say) force an initdb to change archdir.

I've never objected to changing archdir on the fly.
Though GUC is useful for general purposes, it
can't (and mustn't) be almighty. As for recovery,
we must rely on the DBA as little as possible.

Hiroshi Inoue

#33Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Hiroshi Inoue (#32)
RE: Proposed WAL changes

But what can be done if fsync returns before pages flushed?

When you write out critical information, you keep earlier versions of
it. On startup, if the critical information is corrupt, you use the
earlier versions of it. This helps protect against the scenario I
mentioned: a few disk blocks may not have been written when the power
goes out.

My impression is that that is what Tom is doing with his patches.

If fsync may return before data is *actually* flushed then you may have
unlogged data page changes, which breaks the WAL rule and means a corrupted
(inconsistent) database without ANY ABILITY TO RECOVER TO A CONSISTENT
STATE. Now please explain to me how saving the positions of two checkpoints
(what Tom is doing) can help here?

Vadim

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#33)
Re: Proposed WAL changes

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

If fsync may return before data is *actually* flushed then you may have
unlogged data page changes, which breaks the WAL rule and means a corrupted
(inconsistent) database without ANY ABILITY TO RECOVER TO A CONSISTENT
STATE. Now please explain to me how saving the positions of two checkpoints
(what Tom is doing) can help here?

If the WAL log is broken then there is (AFAICT) no way to *guarantee*
recovery to a consistent state. There might be changes in heap or index
pages that you cannot find a trace of in the readable part of the log,
and hence don't know you need to fix.

However, I think that Vadim is focusing too narrowly on what he can
guarantee, and neglecting the problem of providing reasonable behavior
when things are too broken to meet the guarantee. In particular,
complete refusal to start up when the WAL log is corrupted is neither
reasonable nor acceptable. We have to be able to recover from that
and provide access to as-close-to-accurate data as we can manage.
I believe that backing up to the prior checkpoint and replaying as much
of the log as we can read is a useful fallback capability if the most
recent checkpoint record is unreadable.

In any case, "guarantees" that are based on the assumption that WAL
writes hit the disk before data writes do are going to fail anyway
if we are using disk drives that reorder writes. This is the real
reason why we can't think only about guarantees, but also about how
we will respond to "guaranteed impossible" failures.

If you can improve on this recovery idea, by all means do so. But
don't tell me that we don't need to worry about it, because we do.

regards, tom lane

#35Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#34)
RE: Proposed WAL changes

* Store two past checkpoint locations, not just one, in pg_control.
On startup, we fall back to the older checkpoint if the newer one
is unreadable. Also, a physical copy of the newest
checkpoint record

And what to do if older one is unreadable too?
(Isn't it like using 2 x CRC32 instead of CRC64 ? -:))

Then you lose --- but two checkpoints gives you twice the chance of
recovery (probably more, actually, since it's much more likely that
the previous checkpoint will have reached disk safely).

This is not correct. If the log is corrupted somehow (a checkpoint wasn't
flushed as promised) then you have no chance to *recover*, because the
DB will (most likely) be in an inconsistent state (data pages flushed
before the corresponding log records, etc). So the second checkpoint gives us
twice the chance to *restart* in the normal way - read the checkpoint and
roll forward from the redo record - not to *recover*. But this approach
doubles the on-line log size requirements and doesn't help to
handle cases where pg_control was corrupted. Note, I agreed that
disaster *restart* must be implemented; I just think that the
"two checkpoints" approach is not the best way to follow.

From my POV, scanning logs is much better - it doesn't require
doubling the size of on-line logs, and allows a *restart* if pg_control
was lost/corrupted:

If there is no pg_control, or it's corrupted, or it points to a
nonexistent/corrupted checkpoint record, then scan the logs from
newest to oldest until we find the last valid checkpoint record
or the oldest valid log record, and then redo from there.
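
A sketch of the segment-scanning idea in C, with hypothetical helpers; the
only point is the direction of the search, newest segment first:

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t SketchRecPtr;

/* Stub: search one log segment backwards for its last valid checkpoint record. */
static bool
last_checkpoint_in_segment(uint32_t segno, SketchRecPtr *found)
{
    (void) segno;
    (void) found;
    return false;
}

/*
 * Walk the on-line log segments from newest down to oldest and return the
 * most recent checkpoint record that still reads back as valid.
 */
static bool
find_last_valid_checkpoint(uint32_t newest_seg, uint32_t oldest_seg,
                           SketchRecPtr *found)
{
    for (uint32_t seg = newest_seg;; seg--)
    {
        if (last_checkpoint_in_segment(seg, found))
            return true;
        if (seg == oldest_seg)
            break;      /* nothing valid: redo from the oldest readable record */
    }
    return false;
}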

See later discussion --- Andreas convinced me that flushing NEXTXID
records to disk isn't really needed after all. (I didn't
take the flush out of my patch yet, but will do so.) I still want
to leave the NEXTXID records in there, though, because I think that
XID and OID assignment ought to work as nearly alike as possible.

As I briefly explained already: with UNDO we'll be able to reuse
XIDs after restart - ie there will be no point in having NEXTXID
records at all. And there is no point in adding them now.
Does it fix anything? Isn't "fixing" all we should be doing in beta?

Vadim

#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#35)
Re: Proposed WAL changes

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

"two checkpoints" approach is not the best way to follow.
From my POV, scanning logs is much better - it doesn't require
doubling the size of on-line logs, and allows a *restart* if pg_control
was lost/corrupted:

If there is no pg_control, or it's corrupted, or it points to a
nonexistent/corrupted checkpoint record, then scan the logs from
newest to oldest until we find the last valid checkpoint record
or the oldest valid log record, and then redo from there.

And how well will that approach work if the last checkpoint record
got written near the start of a log segment file, and then the
checkpointer discarded all your prior log segments because "you don't
need those anymore"? If the checkpoint record gets corrupted,
you have no readable log at all.

Even if you prefer to scan the logs to find the latest checkpoint
(to which I have no objection), it's still useful for the system
to keep track of the checkpoint-before-latest and only delete log
segments that are older than that.

Of course checkpoint-before-latest is a somewhat arbitrary way of
deciding how far back we need to keep logs, but I think it's more
defensible than anything else we can implement with little work.

As I briefly explained already: with UNDO we'll be able to reuse
XIDs after restart - ie there will be no point in having NEXTXID
records at all. And there is no point in adding them now.
Does it fix anything? Isn't "fixing" all we should be doing in beta?

If you really insist, I'll take it out again, but I'm unconvinced
that that's necessary.

regards, tom lane

#37Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#36)
RE: Proposed WAL changes

If there is no pg_control, or it's corrupted, or it points to a
nonexistent/corrupted checkpoint record, then scan the logs from
newest to oldest until we find the last valid checkpoint record
or the oldest valid log record, and then redo from there.

And how well will that approach work if the last checkpoint record
got written near the start of a log segment file, and then the
checkpointer discarded all your prior log segments because "you don't
need those anymore"? If the checkpoint record gets corrupted,
you have no readable log at all.

The question is - why should we have it? It is assumed that data files
are flushed before the checkpoint appears in the log. If this assumption
is wrong due to a *bogus* fsync/disk/whatever, why should we increase
disk space requirements, which will affect *good* systems too?
What will we buy with extra logs? Just some data whose consistency
we can't guarantee anyway?
It seems that you want to guarantee more than me, Tom -:)

BTW, in some of my tests the size of the on-line logs was ~200MB with the
default checkpoint interval. So it's worth caring about on-line log size.

As I briefly explained already: with UNDO we'll be able to reuse
XIDs after restart - ie there will be no point in having NEXTXID
records at all. And there is no point in adding them now.
Does it fix anything? Isn't "fixing" all we should be doing in beta?

If you really insist, I'll take it out again, but I'm unconvinced
that that's necessary.

Please convince me that NEXTXID is necessary.
Why add anything that is not useful?

Vadim

#38Ian Lance Taylor
In reply to: Mikheev, Vadim (#33)
Re: Proposed WAL changes

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

If fsync may return before data is *actually* flushed then you may have
unlogged data page changes, which breaks the WAL rule and means a corrupted
(inconsistent) database without ANY ABILITY TO RECOVER TO A CONSISTENT
STATE. Now please explain to me how saving the positions of two checkpoints
(what Tom is doing) can help here?

Maybe it can't. I don't know.

I do know that a system which can only function if all disk writes
complete in order is doomed when run on a Unix system with modern IDE
disks. It won't work correctly in the case of a sudden power outage.

Since many people run Postgres in just that sort of environment, it's
not a good idea to completely fail to handle it. Clearly a sudden
power outage can cause disk data to be corrupted. People can accept
that. They need to know what to do in that case.

For example, for typical Unix filesystems, after a sudden power
outage, people run fsck. fsck puts the filesystem back in shape. In
the process, it may find some files it can not fix; they are put in
/lost+found. In the process, it may lose some files entirely; too
bad. The point is, after fsck is run, the filesystem is known to be
good, and in most cases all the data will still be present.

Imagine if every time the power went out, people were required to
restore their disks from the last backup.

So, I think that what Postgres needs to do when it sees corruption is
just what fsck does: warn the user, put the database in a usable
state, and provide whatever information can be provided about lost or
inconsistent data. Users who need full bore gold plated consistency
will restore from their last good backup, and invest in a larger UPS.
Many users will simply keep going.

Ian

#39Vadim Mikheev
vmikheev@sectorbase.com
In reply to: Mikheev, Vadim (#33)
Re: Proposed WAL changes

What are we debating?
I never said that we shouldn't worry about WAL's current inability to restart.
And WAL already works correctly in the situation of "failing to write a couple of disk
blocks when the system crashes".
My statement in the first place, that "WAL can't help in the event of disk errors",
was to remind us that we should think over how much we are going to guarantee,
and by what means, in the event our *base* requirements are not met
(guaranteed -:)). My POV is that two checkpoints increase disk space
requirements for *everyday usage* while buying nearly nothing, because data
consistency cannot be guaranteed anyway.
On the other hand this is the fastest way to implement WAL restart-ability
- *which is the real problem we have to fix*.

Vadim

#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#37)
Re: Proposed WAL changes

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

And how well will that approach work if the last checkpoint record
got written near the start of a log segment file, and then the
checkpointer discarded all your prior log segments because "you don't
need those anymore"? If the checkpoint record gets corrupted,
you have no readable log at all.

The question is - why should we have it? It is assumed that data files
are flushed before the checkpoint appears in the log. If this assumption
is wrong due to a *bogus* fsync/disk/whatever, why should we increase
disk space requirements, which will affect *good* systems too?
What will we buy with extra logs? Just some data whose consistency
we can't guarantee anyway?
It seems that you want to guarantee more than me, Tom -:)

No, but I want a system that's not brittle. You seem to be content to
design a system that is reliable as long as the WAL log is OK but loses
the entire database unrecoverably as soon as one bit goes bad in the
log. I'd like a slightly softer failure mode. WAL logs *will* go bad
(even without system crashes; what of unrecoverable disk read errors?)
and we ought to be able to deal with that with some degree of grace.
Yes, we lost our guarantee of consistency. That doesn't mean we should
not do the best we can with what we've got left.

BTW, in some of my tests the size of the on-line logs was ~200MB with the
default checkpoint interval. So it's worth caring about on-line log size.

Okay, but to me that suggests we need a smarter log management strategy,
not a management strategy that throws away data we might wish we still
had (for manual analysis if nothing else). Perhaps the checkpoint
creation rule should be "every M seconds *or* every N megabytes of log,
whichever comes first". It'd be fairly easy to signal the postmaster to
start up a new checkpoint process when XLogWrite rolls over to a new log
segment, if the last checkpoint was further back than some number of
segments. Comments?
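
A sketch of that trigger rule in C; the names checkpoint_timeout and
checkpoint_segments are placeholders for whatever settings would actually
carry the two limits:

#include <stdbool.h>
#include <time.h>

/* Hypothetical settings: a checkpoint every M seconds or every N log segments. */
static int checkpoint_timeout  = 300;   /* M, in seconds */
static int checkpoint_segments = 3;     /* N, segments written since last checkpoint */

/* "every M seconds *or* every N megabytes of log, whichever comes first" */
static bool
need_checkpoint(time_t last_checkpoint_time, int segments_since_checkpoint)
{
    if (time(NULL) - last_checkpoint_time >= checkpoint_timeout)
        return true;
    if (segments_since_checkpoint >= checkpoint_segments)
        return true;
    return false;
}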

Please convince me that NEXTXID is necessary.
Why add anything that is not useful?

I'm not convinced that it's not necessary. In particular, consider the
case where we are trying to recover from a crash using an on-line
checkpoint as our last readable WAL entry. In the pre-NEXTXID code,
this checkpoint would contain the current XID counter and an
advanced-beyond-current OID counter. I think both of those numbers
should be advanced beyond current, so that there's some safety margin
against reusing XIDs/OIDs that were allocated by now-lost XLOG entries.
The OID code is doing this right, but the XID code wasn't.
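
To make the advanced-beyond-current idea concrete, here is a prefetch-style
allocation sketch in C, modeled on the OID behavior described above; the
names and the prefetch amount are invented for illustration:

#include <stdint.h>

#define XID_PREFETCH 1024           /* invented safety margin, logged ahead of use */

static uint32_t next_xid = 0;       /* next XID to hand out */
static uint32_t logged_ceiling = 0; /* highest XID already covered by a WAL record */

/* Stub: emit (and flush, if desired) a NEXTXID-style record to the WAL. */
static void
log_nextxid(uint32_t ceiling)
{
    (void) ceiling;
}

/*
 * Hand out an XID only after a WAL record announcing a ceiling beyond it
 * exists.  A checkpoint can then store the ceiling rather than the exact
 * counter, so replay never re-issues an XID that a now-lost XLOG entry
 * might already have handed out.
 */
static uint32_t
allocate_xid(void)
{
    if (next_xid >= logged_ceiling)
    {
        logged_ceiling = next_xid + XID_PREFETCH;
        log_nextxid(logged_ceiling);
    }
    return next_xid++;
}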

Again, it's a question of brittleness. Yes, as long as everything
operates as designed and the WAL log never drops a bit, we don't need
it. But I want a safety margin for when things aren't perfect.

regards, tom lane

#41Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#40)
RE: Proposed WAL changes

It seems that you want to guarantee more than me, Tom -:)

No, but I want a system that's not brittle. You seem to be content to
design a system that is reliable as long as the WAL log is OK
but loses the entire database unrecoverably as soon as one bit goes bad
in the log.

I don't see how the absence of the old checkpoint forces losing the entire database.
You will probably get better consistency by re-applying modifications
which are supposed to be in the data files already, but it seems questionable
to me.

BTW, in some of my tests the size of the on-line logs was ~200MB with the
default checkpoint interval. So it's worth caring about on-line log size.

Okay, but to me that suggests we need a smarter log management strategy,
not a management strategy that throws away data we might wish we still
had (for manual analysis if nothing else).

This is what should be covered by archiving log files. Unimplemented -:(

Perhaps the checkpoint creation rule should be "every M seconds *or*
every N megabytes of log, whichever comes first".

I like this! Regardless of the usefulness of keeping the older checkpoint
(especially in the future, with log archiving), your rule is worthwhile in any case.
(Nevertheless, keeping two checkpoints still increases disk requirements -:)
But it seems I have to waive my objection if I didn't convince you - it's
really the simplest way to get WAL restart-ability, and I personally have no
ability to implement log scanning now.)

Please convince me that NEXTXID is necessary.
Why add anything that is not useful?

I'm not convinced that it's not necessary. In particular,
consider the case where we are trying to recover from a crash using
an on-line checkpoint as our last readable WAL entry. In the pre-NEXTXID
code, this checkpoint would contain the current XID counter and an
advanced-beyond-current OID counter. I think both of those numbers should
be advanced beyond current, so that there's some safety margin against
reusing XIDs/OIDs that were allocated by now-lost XLOG entries.
The OID code is doing this right, but the XID code wasn't.

Again, it's a question of brittleness. Yes, as long as everything
operates as designed and the WAL log never drops a bit, we don't need
it. But I want a safety margin for when things aren't perfect.

Once again - my point is that in the event of a lost log record one shouldn't
try to use the existing database but just dump it, etc - ie there is no reason
to keep info about allocated XIDs. But keeping NEXTXID costs nothing, at least -:)

Vadim

#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#41)
Re: Proposed WAL changes

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

No, but I want a system that's not brittle. You seem to be content to
design a system that is reliable as long as the WAL log is OK
but loses the entire database unrecoverably as soon as one bit goes bad
in the log.

I don't see how the absence of the old checkpoint forces losing the entire database.

As the code stood last week that's what would happen, because the system
would not restart unless pg_control pointed to a valid checkpoint
record. I addressed that in a way that seemed good to me.

Now, from what you've said in this conversation you would rather have
the system scan XLOG to decide where to replay from if it cannot read
the last checkpoint record. That would be okay with me, but even with
that approach I do not think it's safe to truncate the log to nothing
as soon as we've written a checkpoint record. I want to see a reasonable
amount of log data there at all times. I don't insist that "reasonable
amount" necessarily means "back to the prior checkpoint" --- but that's
a simple and easy-to-implement interpretation.

You will probably get better consistency by re-applying modifications
which are supposed to be in the data files already, but it seems questionable
to me.

It's not a guarantee, no, but it gives you a better probability of
recovering recent changes when things are hosed.

BTW, can we really trust checkpoint to mean that all data file changes
are down on disk? I see that the actual implementation of checkpoint is

    write out all dirty shmem buffers;
    sync();
    if (IsUnderPostmaster)
        sleep(2);
    sync();
    write checkpoint record to XLOG;
    fsync XLOG;

Now HP's man page for sync() says

The writing, although scheduled, is not necessarily complete upon
return from sync.

I can assure you that 2 seconds is nowhere near enough to ensure that a
sync is complete on my workstation... and I doubt that "scheduled" means
"guaranteed to complete before any subsequently-requested I/O is done".
I think it's entirely possible that the checkpoint record will hit the
disk before the last heap buffer does.

Therefore, even without considering disk drive write reordering, I do
not believe that a checkpoint guarantees very much, and so I think it's
pretty foolish to delete the preceding XLOG data immediately afterwards.

Perhaps the checkpoint creation rule should be "every M seconds *or*
every N megabytes of log, whichever comes first".

I like this! Regardless of the usefulness of keeping the older checkpoint
(especially in the future, with log archiving), your rule is worthwhile in any case.

Okay, I'll see if I can do something with this idea.

Other than what we've discussed, do you have any comments/objections to
my proposed patch? I've been holding off committing it so that you have
time to review it...

regards, tom lane

#43Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#42)
RE: Proposed WAL changes

Other than what we've discussed, do you have any
comments/objections to my proposed patch?
I've been holding off committing it so that you have
time to review it...

Sorry - I'm heading to meet Marc, Thomas & Geoff right now,
will try to comment asap.

Vadim

#44Christopher Masto
chris@netmonger.net
In reply to: Tom Lane (#1)
Re: Proposed WAL changes

Please forgive me if I'm misunderstanding something of these rather
complex issues. But I think this is an important question from the
perspective of a system administrator responsible for the safety and
integrity of his users' data.

If I understand correctly, it is possible, through power failure,
Postgres or OS crash, or disk failure, to end up with a situation
where Postgres cannot put the database in a consistent state. Rather
than failing to start at all, something will be put in place that
allows PG to partially recover and start up, so that you can get to
your data. I think that leaves DBAs wondering two things:

First, how will I know that my database is corrupted? I may not be
present to witness the power failure, unattended reboot, and automatic
restart/quasi-recovery. If the DB has become inconsistent, it is
critical that users do not continue to use it. I'm all for having
Postgres throw up its hands and refuse to start until I put it in
disaster-dump mode.

Secondly, since disaster-dump seems to be a good term for it, is there
some way to know the extent of the damage? I.e. in the typical case
of power failure, is the inconsistent part just the "recent" data,
and would it be possible to find out which records are part of that
damaged set?
--
Christopher Masto Senior Network Monkey NetMonger Communications
chris@netmonger.net info@netmonger.net http://www.netmonger.net

Free yourself, free your machine, free the daemon -- http://www.freebsd.org/