Point in Time Recovery

Started by Simon Riggs over 21 years ago, 183 messages
#1 Simon Riggs
simon@2ndquadrant.com

Taking advantage of the freeze bubble allowed us... there are some last
minute features to add.

Summarising earlier thoughts, with some detailed digging and design from
myself in the last few days - we're now in a position to add Point-in-Time
Recovery, on top of what's been achieved.

The target for the last record to recover to can be specified in 2 ways:
- by transactionId - not that useful, unless you have a means of
identifying what has happened from the log, then using that info to
specify how to recover - coming later - not in next few days :(
- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope....)

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?
If we did, would that be portable?
Suggestions welcome, because I know very little of the details of
various *nix systems and win* on that topic.

Only COMMIT and ABORT records have timestamps, allowing us to circumvent
any discussion about partial transaction recovery and nested
transactions.

When we do recover, stopping at the timestamp is just half the battle.
We need to leave the xlog in which we stop in a state from which we can
enter production smoothly and cleanly. To do this, we could:
- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.
I'm tempted by the first plan, because it is more straightforward and
stands much less chance of me introducing 50 weird bugs just before
close.

Also, I think it is straightforward to introduce control file duplexing,
with a second copy stored and maintained in the pg_xlog directory. This
would provide additional protection for pg_control, which takes on more
importance now that archive recovery is working. pg_xlog is a natural
home, since on busy systems it's on its own disk away from everything
else, ensuring that at least one copy survives. I can't see a downside
to that, but others might... We can introduce user specifiable
duplexing, in later releases.

For later, I envisage an off-line utility that can be used to inspect
xlog records. This could provide a number of features:
- validate archived xlogs, to check they are sound.
- produce summary reports, to allow identification of transactionIds and
the effects of particular transactions
- performance analysis to allow decisions to be made about whether group
commit features could be utilised to good effect
(Not now...)

Best regards, Simon Riggs

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#1)
Re: Point in Time Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?

Pretty much everybody supports gettimeofday() (time_t and separate
integer microseconds); you might as well use that. Note that the actual
resolution is not necessarily microseconds, and it'd still not be
certain that successive commits have distinct timestamps --- so maybe
this refinement would be pointless. You'll still have to design a user
interface that allows selection without the assumption of distinct
timestamps.
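
To illustrate the point about non-distinct timestamps, here is a minimal sketch in Python. The commit records, their values, and the helper name `shared_timestamps` are all made up for illustration; nothing here reflects the actual xlog record layout. It only shows that moving from second to microsecond resolution reduces, but does not remove, the chance of two commits carrying the same timestamp.

```python
# Hypothetical commit records: (xid, seconds, microseconds).
# The values are invented purely for illustration.
COMMITS = [
    (101, 1089064000, 120000),
    (102, 1089064000, 120000),  # same microsecond as xid 101
    (103, 1089064000, 950000),
    (104, 1089064001, 10000),
]


def shared_timestamps(records, use_microseconds):
    """Return {timestamp: [xids]} for timestamps carried by more than one commit."""
    groups = {}
    for xid, sec, usec in records:
        # With second resolution the microsecond part is simply dropped.
        key = (sec, usec) if use_microseconds else (sec, 0)
        groups.setdefault(key, []).append(xid)
    return {ts: xids for ts, xids in groups.items() if len(xids) > 1}


print(shared_timestamps(COMMITS, use_microseconds=False))
print(shared_timestamps(COMMITS, use_microseconds=True))
```

Even at the finer resolution, xids 101 and 102 still collide, so any user interface for choosing a stop point has to say what happens when several commits share the target timestamp.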

- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.

Go with plan B; it's best not to destroy data (what if you chose the
wrong restart point the first time)?

Actually this now reminds me of a discussion I had with Patrick
Macdonald some time ago. The DB2 practice in this connection is that
you *never* overwrite existing logfile data when recovering. Instead
you start a brand new xlog segment file, which is given a new "branch
number" so it can be distinguished from the future-time xlog segments
that you chose not to apply. I don't recall what the DB2 terminology
was exactly --- not "branch number" I don't think --- but anyway the
idea is that when you restart the database after an incomplete recovery,
you are now in a sort of parallel universe that has its own history
after the branch point (PITR stop point). You need to be able to
distinguish archived log segments of this parallel universe from those
of previous and subsequent incarnations. I'm not sure whether Vadim
intended our StartUpID to serve this purpose, but it could perhaps be
used that way, if we reflected it in the WAL file names.
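
A rough sketch of what "reflecting it in the WAL file names" could look like, in Python. The three-part name layout and the function `segment_name` are assumptions made for illustration; the message does not specify a format. The only point is that the same log position in two different branches gets two different file names, so neither overwrites the other in the archive.

```python
def segment_name(branch, log_id, seg):
    """Hypothetical layout: 8 hex digits each for branch, log id and segment."""
    return f"{branch:08X}{log_id:08X}{seg:08X}"


# The same log position, before and after an incomplete recovery.
original = segment_name(0, 0, 0xCE4)    # written by the original history
after_pitr = segment_name(1, 0, 0xCE4)  # written after the PITR stop point

print(original)
print(after_pitr)
```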

regards, tom lane

#3 Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Point in Time Recovery

On Mon, 2004-07-05 at 22:46, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?

Pretty much everybody supports gettimeofday() (time_t and separate
integer microseconds); you might as well use that. Note that the actual
resolution is not necessarily microseconds, and it'd still not be
certain that successive commits have distinct timestamps --- so maybe
this refinement would be pointless. You'll still have to design a user
interface that allows selection without the assumption of distinct
timestamps.

Well, I agree, though without the desired-for UI now, I think some finer
grained mechanism would be good. This means extending the xlog commit
record by a couple of bytes... OK, let's live a little.

- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.

Go with plan B; it's best not to destroy data (what if you chose the
wrong restart point the first time)?

eh? Which way round? The second plan was the one where I would destroy
data by overwriting it, that's exactly why I preferred the first.

Actually, the files are always copied from archive, so re-recovery is
always an available option in the design that's been implemented.

No matter...

Actually this now reminds me of a discussion I had with Patrick
Macdonald some time ago. The DB2 practice in this connection is that
you *never* overwrite existing logfile data when recovering. Instead
you start a brand new xlog segment file,

Now that's a much better plan... I suppose I just have to rack up the
recovery pointer to the first record on the first page of a new xlog
file, similar to first plan, but just fast-forwarding rather than
forwarding.

My only issue was to do with the secondary Checkpoint marker, which is
always reset to the place you just restored FROM, when you complete a
recovery. That could lead to a situation where you recover, then before
next checkpoint, fail and lose last checkpoint marker, then crash
recover from previous checkpoint (again), but this time replay the
records you were careful to avoid.

which is given a new "branch
number" so it can be distinguished from the future-time xlog segments
that you chose not to apply. I don't recall what the DB2 terminology
was exactly --- not "branch number" I don't think --- but anyway the
idea is that when you restart the database after an incomplete recovery,
you are now in a sort of parallel universe that has its own history
after the branch point (PITR stop point). You need to be able to
distinguish archived log segments of this parallel universe from those
of previous and subsequent incarnations.

That's a good idea, if only because you so easily screw your test data
during multiple recovery situations. But if it's good during testing, it
must be good in production too...since you may well perform
recovery...run for a while, then discover that you got it wrong first
time, then need to re-recover again. I already added that to my list of
gotchas and that would solve it.

I was going to say hats off to the Blue-hued ones, when I remembered
this little gem from last year
http://www.danskebank.com/link/ITreport20030403uk/$file/ITreport20030403uk.pdf

I'm not sure whether Vadim
intended our StartUpID to serve this purpose, but it could perhaps be
used that way, if we reflected it in the WAL file names.

Well, I'm not sure about StartUpId... but certainly the high 2 bytes of
LogId look pretty certain never to be anything but zeros. You have 2.4
x 10^14... which is 9,000 years at 1000 log files/sec.
We could use the scheme you describe:
add xFFFF to the logid every time you complete an archive recovery... so
the log files look like 0001000000000CE3 after you've recovered a load of
files that look like 0000000000000CE3
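
The arithmetic and the file-name example above can be checked with a short Python sketch. It assumes, from the 16-digit names shown, that a file name is an 8-hex-digit LogId followed by an 8-hex-digit segment number, and that "the high 2 bytes" are bumped by adding 0x10000 to the LogId, which is what the example names correspond to. The helper names are hypothetical. With those assumptions, the 12 hex digits left over give 2^48 names, which is about 2.8 x 10^14 (slightly more than the figure quoted) and is consistent with the estimate of roughly 9,000 years.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
HIGH_BYTES_STEP = 0x10000  # assumed: +1 in the high 2 bytes of a 32-bit LogId


def segment_name(log_id, seg):
    """16 hex digits: 8 for the LogId, 8 for the segment number."""
    return f"{log_id:08X}{seg:08X}"


def after_recovery(log_id):
    """Jump the LogId forward once an archive recovery completes."""
    return log_id + HIGH_BYTES_STEP


def years_to_exhaust(bits, files_per_sec):
    """How long 2**bits file names last at a given creation rate."""
    return 2 ** bits / files_per_sec / SECONDS_PER_YEAR


print(segment_name(0, 0xCE3))
print(segment_name(after_recovery(0), 0xCE3))
print(round(years_to_exhaust(48, 1000)))
```

Under the same assumption, the high 2 bytes allow 65536 such jumps, which matches the "65000 recovery situations" mentioned in the next paragraph.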

If you used StartUpID directly, you might just run out... but it's very
unlikely you would ever perform 65000 recovery situations - unless
you've run the <expletive> code as often as I have :(.

Doing that also means we don't have to work out how to do that with
StartUpID. Of course, altering the length and makeup of the xlog files
is possible too, but that will cause other stuff to stop working....

[We'll have to give this a no-SciFi name, unless we want to make
in-roads into the Dr.Who fanbase :) Don't get them started. Better
still, don't give it a name at all.]

I'll sleep on that lot.

Best regards, Simon Riggs

#4 Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Simon Riggs (#3)
Re: Point in Time Recovery

- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope....)

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?

Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.

- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.
I'm tempted by the first plan, because it is more straightforward and
stands much less chance of me introducing 50 weird bugs just before
close.

But what if you restore because after that PIT everything went haywire
including the log ? Did you mean to apply the remaining changes but not
commit those xids ? I think it is essential to only leave xlogs around
that allow a subsequent rollforward from the same old full backup.
Or is an instant new full backup required after restore ?

Andreas

#5 Noname
spock@mgnet.de
In reply to: Zeugswetter Andreas SB SD (#4)
Re: Point in Time Recovery

On Tue, 6 Jul 2004, Zeugswetter Andreas SB SD wrote:

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?

Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.

I'd also think that seconds are absolutely sufficient. From my daily
experience I can say that you're normally lucky to know the time
on one minute basis.
If you need to get closer, a log sniffer is unavoidable ...

Greetings, Klaus

--
Full Name : Klaus Naumann | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964 | E-Mail: (kn@mgnet.de)

#6 Richard Huxton
dev@archonet.com
In reply to: Simon Riggs (#3)
Re: Point in Time Recovery

Simon Riggs wrote:

On Mon, 2004-07-05 at 22:46, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?

Pretty much everybody supports gettimeofday() (time_t and separate
integer microseconds); you might as well use that. Note that the actual
resolution is not necessarily microseconds, and it'd still not be
certain that successive commits have distinct timestamps --- so maybe
this refinement would be pointless. You'll still have to design a user
interface that allows selection without the assumption of distinct
timestamps.

Well, I agree, though without the desired-for UI now, I think some finer
grained mechanism would be good. This means extending the xlog commit
record by a couple of bytes... OK, let's live a little.

At the risk of irritating people, I'll repeat what I suggested a few
weeks ago...

Add a table: pg_pitr_checkpt (pitr_id SERIAL, pitr_ts timestamptz,
pitr_comment text)
Let the user insert rows in transactions as desired. Let them stop the
restore when a specific (pitr_ts,pitr_comment) gets inserted (or on
pitr_id if they record it).

IMHO time is seldom relevant, event boundaries are.

If you want to add special syntax for this, fine. If not, an INSERT
statement is a convenient way to do this anyway.
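
The idea of stopping on an event marker rather than a time could be sketched as below, in Python. The log format, the record contents, and the function `replay_until_marker` are all hypothetical simplifications; real xlog records look nothing like this, and the sketch says nothing about whether row contents are easy to recognise during redo.

```python
# Hypothetical, simplified log: each record is (xid, kind, payload).
LOG = [
    (10, "insert", {"table": "orders", "row": 1}),
    (10, "commit", None),
    (11, "insert", {"table": "pg_pitr_checkpt",
                    "pitr_comment": "before batch load"}),
    (11, "commit", None),
    (12, "insert", {"table": "orders", "row": 2}),
    (12, "commit", None),
]


def replay_until_marker(log, comment):
    """Apply records up to and including the commit of the transaction
    that inserted the marker row with the given comment."""
    applied = []
    marker_xid = None
    for xid, kind, payload in log:
        applied.append((xid, kind))
        if (kind == "insert"
                and payload["table"] == "pg_pitr_checkpt"
                and payload.get("pitr_comment") == comment):
            marker_xid = xid
        if kind == "commit" and xid == marker_xid:
            break  # the marker is now durable: stop here
    return applied


print(replay_until_marker(LOG, "before batch load"))
```

If no marker with that comment is found, this sketch simply replays the whole log.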

--
Richard Huxton
Archonet Ltd

#7 Simon Riggs
simon@2ndquadrant.com
In reply to: Zeugswetter Andreas SB SD (#4)
Re: Point in Time Recovery

On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:

- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope....)

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?

Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.

OK, thanks. I'll just leave the time_t datatype just the way it is.

Best Regards, Simon Riggs

#8 Simon Riggs
simon@2ndquadrant.com
In reply to: Richard Huxton (#6)
Re: Point in Time Recovery

On Tue, 2004-07-06 at 20:00, Richard Huxton wrote:

Simon Riggs wrote:

On Mon, 2004-07-05 at 22:46, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?

Pretty much everybody supports gettimeofday() (time_t and separate
integer microseconds); you might as well use that. Note that the actual
resolution is not necessarily microseconds, and it'd still not be
certain that successive commits have distinct timestamps --- so maybe
this refinement would be pointless. You'll still have to design a user
interface that allows selection without the assumption of distinct
timestamps.

Well, I agree, though without the desired-for UI now, I think some finer
grained mechanism would be good. This means extending the xlog commit
record by a couple of bytes... OK, let's live a little.

At the risk of irritating people, I'll repeat what I suggested a few
weeks ago...

All feedback is good. Thanks.

Add a table: pg_pitr_checkpt (pitr_id SERIAL, pitr_ts timestamptz,
pitr_comment text)
Let the user insert rows in transactions as desired. Let them stop the
restore when a specific (pitr_ts,pitr_comment) gets inserted (or on
pitr_id if they record it).

It's a good plan, but the recovery is currently offline recovery and no
SQL is possible. So no way to insert, no way to access tables until
recovery completes. I like that plan and probably would have used it if
it was viable.

IMHO time is seldom relevant, event boundaries are.

Agreed, but time is the universally agreed way of describing two events
as being simultaneous. No other way to say "recover to the point when
the message queue went wild".

As of last post to Andreas, I've said I'll not bother changing the
granularity of the timestamp.

If you want to add special syntax for this, fine. If not, an INSERT
statement is a convenient way to do this anyway.

The special syntax isn't hugely important - I did suggest a kind of
SQL-like syntax previously, but that's gone now. Invoking recovery via a
command file IS important, so that we are able to tell the system it's not
in crash recovery AND that, once recovery has finished, it should respond
to crashes without re-entering archive recovery.

Thanks for your comments. I'm not making this more complex than needs
be; in fact much of the code is very simple - it's just the planning
that's complex.

Best regards, Simon Riggs

#9 Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#2)
Re: Point in Time Recovery

On Mon, 2004-07-05 at 22:46, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.

Go with plan B; it's best not to destroy data (what if you chose the
wrong restart point the first time)?

Actually this now reminds me of a discussion I had with Patrick
Macdonald some time ago. The DB2 practice in this connection is that
you *never* overwrite existing logfile data when recovering. Instead
you start a brand new xlog segment file, which is given a new "branch
number" so it can be distinguished from the future-time xlog segments
that you chose not to apply. I don't recall what the DB2 terminology
was exactly --- not "branch number" I don't think --- but anyway the
idea is that when you restart the database after an incomplete recovery,
you are now in a sort of parallel universe that has its own history
after the branch point (PITR stop point). You need to be able to
distinguish archived log segments of this parallel universe from those
of previous and subsequent incarnations. I'm not sure whether Vadim
intended our StartUpID to serve this purpose, but it could perhaps be
used that way, if we reflected it in the WAL file names.

Some more thoughts... focusing on what we do after we've finished
recovering. The objectives, as I see them, are to put the system into a
state that preserves these features:
1. we never overwrite files, in case we want to re-run recovery
2. we never write files that MIGHT have been written previously
3. we need to ensure that any xlog records skipped at the admin's request (in
PITR mode) are never in a position to be re-applied to this timeline.
4. ensure we can re-recover, if we need to, without further problems

Tom's concept above, I'm going to call timelines. A timeline is the
sequence of logs created by the execution of a server. If you recover
the database, you create a new timeline. [This is because, if you've
invoked PITR you absolutely definitely want log records written to, say,
xlog15 to be different to those that were written to xlog15 in a
previous timeline that you have chosen not to reapply.]

Objective (1) is complex.
When we are restoring, we always start with archived copies of the xlog,
to make sure we don't finish too soon. We roll forward until we either
reach the PITR stop point, or we hit the end of the archived logs. If we hit
the end of the logs on archive, we switch to a local copy if one exists that
is higher than those, and carry on rolling forward until either we reach the
PITR stop point or we hit the end of that log. (Hopefully, there isn't more
than one local xlog higher than the archive, but it's possible.)
If we are rolling forward on local copies, then they are our only
copies. We'd really like to archive them ASAP, but the archiver's not
running yet - we don't want to force that situation in case the archive
device (say a tape) is the one being used to recover right now. So we
write an archive_status of .ready for that file, ensuring that the
checkpoint won't remove it until it gets copied to archive, whenever
that starts working again. Objective (1) met.
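
The choice of where each xlog comes from during roll-forward might be sketched like this in Python. Segment numbers as plain integers, the function `recovery_plan`, and the boolean "mark .ready" flag are assumptions for illustration; the sketch ignores the PITR stop point and gaps in the sequence.

```python
def recovery_plan(archived, local):
    """Return a list of (segment, source, mark_ready) steps.

    archived, local: sets of segment numbers available in each place.
    Archived copies are always preferred; a local copy is used only if
    it is higher than anything in the archive, and is then flagged so
    that it is archived later rather than removed at checkpoint.
    """
    plan = [(seg, "archive", False) for seg in sorted(archived)]
    last_archived = max(archived) if archived else -1
    for seg in sorted(local):
        if seg > last_archived:
            plan.append((seg, "local", True))
    return plan


for step in recovery_plan({1, 2, 3}, {3, 4}):
    print(step)
```

Here segment 3 exists in both places and is read from the archive, while segment 4 exists only locally and is the one that would get an archive_status of .ready.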

When we have finished recovering we:
- create a new xlog at the start of a new ++timeline
- copy the last applied xlog record to it as the first record
- set the record pointer so that it matches
That way, when we come up and begin running, we never overwrite files
that might have been written previously. Objective (2) met.
We do the other stuff because recovery finishes up by pointing to the
last applied record...which is what was causing all of this extra work
in the first place.

At this point, we also reset the secondary checkpoint record, so that
should recovery be required again before next checkpoint AND the
shutdown checkpoint record written after recovery completes is
wrong/damaged, the recovery will not autorewind back past the PITR stop
point and attempt to recover the records we have just tried so hard to
reverse/ignore. Objective (3) met. (Clearly, that situation seems
unlikely, but I feel we must deal with it...a newly restored system is
actually very fragile, so a crash again within 3 minutes or so is very
commonplace, as far as these things go).

Should we need to re-recover, we can do so because the new timeline
xlogs are further forward than the old timeline, so never get seen by
any processes (all of which look backwards). Re-recovery is possible
without problems, if required. This means you're a lot safer from some
of the mistakes you might have made, such as deciding you need to go into
recovery, then realising it wasn't required (or some other painful
flapping as goes on in computer rooms at 3am).

How do we implement timelines?
The main presumption in the code is that xlogs are sequential. That has
two effects:
1. during recovery, we try to open the "next" xlog by adding one to the
numbers and then looking for that file
2. during checkpoint, we look for filenames less than the current
checkpoint marker
Creating a timeline by adding a larger number to LogId allows us to
prevent (1) from working, yet without breaking (2).
Well, Tom does seem to have something with regard to StartUpIds. I feel
it is easier to force a new timeline by adding a very large number to
the LogId IF, and only if, we have performed an archive recovery. That
way, we do not change at all the behaviour of the system for people that
choose not to implement archive_mode.
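
A small Python simulation of the two effects described above, under stated assumptions: file names are an 8-hex-digit LogId plus an 8-hex-digit segment number, the jump is 0x10000 (one step in the high 2 bytes), "next" is just segment + 1 with wrap-around ignored, and checkpoint cleanup is modelled as plain name comparison. The function names are hypothetical.

```python
LOG_ID_JUMP = 0x10000  # assumed size of the jump applied after recovery


def name(log_id, seg):
    return f"{log_id:08X}{seg:08X}"


def next_name(log_id, seg):
    # Simplification: ignore wrap-around of seg into log_id.
    return name(log_id, seg + 1)


def removable(existing, checkpoint_name):
    # Fixed-width hex names sort in log order, so string comparison works.
    return sorted(n for n in existing if n < checkpoint_name)


# Old timeline: recovery stopped in CE3; CE4 holds records we chose to skip.
old_timeline = {name(0, s) for s in (0xCE1, 0xCE2, 0xCE3, 0xCE4)}
new_start = name(0 + LOG_ID_JUMP, 0xCE3)
files = old_timeline | {new_start}

# (1) "add one and look for that file" from the new timeline never lands
#     on the skipped segment of the old timeline.
print(next_name(LOG_ID_JUMP, 0xCE3) in old_timeline)

# (2) "names less than the checkpoint marker" still selects the old files.
print(removable(files, new_start))
```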

Should we implement timelines?
Yes, I think we should. I've already hit the problems that timelines
solve in my testing and so that means they'll be hit when you don't need
the hassle.

Comments much appreciated, assuming you read this far...

Best regards, Simon Riggs

#10 Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Simon Riggs (#9)
Re: Point in Time Recovery

Well, Tom does seem to have something with regard to StartUpIds. I feel
it is easier to force a new timeline by adding a very large number to
the LogId IF, and only if, we have performed an archive recovery. That
way, we do not change at all the behaviour of the system for people that
choose not to implement archive_mode.

Imho you should take a close look at StartUpId, I think it is exactly this
"large number". Maybe you can add +2 to intentionally leave a hole.

Once you increment, I think it is very essential to checkpoint and double
check pg_control, cause otherwise a crashrecovery would read the wrong xlogs.

Should we implement timelines?

Yes :-)

Andreas

#11 Simon Riggs
simon@2ndquadrant.com
In reply to: Zeugswetter Andreas SB SD (#10)
Re: Point in Time Recovery

On Wed, 2004-07-07 at 14:17, Zeugswetter Andreas SB SD wrote:

Well, Tom does seem to have something with regard to StartUpIds. I feel
it is easier to force a new timeline by adding a very large number to
the LogId IF, and only if, we have performed an archive recovery. That
way, we do not change at all the behaviour of the system for people that
choose not to implement archive_mode.

Imho you should take a close look at StartUpId, I think it is exactly this
"large number". Maybe you can add +2 to intentionally leave a hole.

Once you increment, I think it is very essential to checkpoint and double
check pg_control, cause otherwise a crashrecovery would read the wrong xlogs.

Thanks for your thoughts - you have made me rethink this over some
hours. Trouble is, on this occasion, the other suggestion still seems
the best one, IMVHO.

If we number timelines based upon StartUpId, then there is still some
potential for conflict and this is what we're seeking to avoid.

Simply adding FFFF to the LogId puts the new timeline so far into the
previous timeline's future that there aren't any problems. We only
increment the timeline when we recover, so we're not eating up the
numbers quickly. Simply adding a number means that there isn't any
conflict with any normal operations. The timelines aren't numbered
separately, so I'm not introducing a new version of
StartUpID...technically there isn't a new timeline, just a chunk of
geological time between them.

We don't need to mention timelines in the docs, nor do we need to alter
pg_controldata to display it...just a comment in the code to explain why
we add a large number to the LogId after each recovery completes.

Best regards, Simon Riggs

#12 Noname
spock@mgnet.de
In reply to: Simon Riggs (#11)
Re: Point in Time Recovery

On Thu, 8 Jul 2004, Simon Riggs wrote:

We don't need to mention timelines in the docs, nor do we need to alter
pg_controldata to display it...just a comment in the code to explain why
we add a large number to the LogId after each recovery completes.

I'd disagree on that. Knowing what exactly happens when recovering the
database is a must. It makes people feel more safe about the process. This
is because the software doesn't do things you wouldn't expect.

On Oracle e.g. you create a new database incarnation when you recover a
database (incomplete). Which means your System Change Number and your Log
Sequence is reset to 0.
Knowing this is crucial because you otherwise would wonder "Why is it
doing that? Is this a bug or a feature?"

Just my 2ct :-))

Greetings, Klaus

--
Full Name : Klaus Naumann | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964 | E-Mail: (kn@mgnet.de)

#13 Simon Riggs
simon@2ndquadrant.com
In reply to: Noname (#12)
Re: Point in Time Recovery

On Thu, 2004-07-08 at 07:57, spock@mgnet.de wrote:

On Thu, 8 Jul 2004, Simon Riggs wrote:

We don't need to mention timelines in the docs, nor do we need to alter
pg_controldata to display it...just a comment in the code to explain why
we add a large number to the LogId after each recovery completes.

I'd disagree on that. Knowing what exactly happens when recovering the
database is a must. It makes people feel more safe about the process. This
is because the software doesn't do things you wouldn't expect.

On Oracle e.g. you create a new database incarnation when you recover a
database (incomplete). Which means your System Change Number and your Log
Sequence is reset to 0.
Knowing this is crucial because you otherwise would wonder "Why is it
doing that? Is this a bug or a feature?"

OK, will do.

Best regards, Simon Riggs

#14 Jan Wieck
JanWieck@Yahoo.com
In reply to: Simon Riggs (#7)
Re: Point in Time Recovery

On 7/6/2004 3:58 PM, Simon Riggs wrote:

On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:

- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope....)

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?

Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.

TransactionID and timestamp is only sufficient if the transactions are
selected by their commit order. Especially in read committed mode,
consider this execution:

xid-1: start
xid-2: start
xid-2: update field x
xid-2: commit
xid-1: update field y
xid-1: commit

In this case, the update done by xid-1 depends on the row created by
xid-2. So logically xid-2 precedes xid-1, because it made its changes
earlier.

So you have to apply the log until you find the commit record of the
transaction you want to apply last, and then stamp all transactions that
were in progress at that time as aborted.
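
The execution above can be replayed in a few lines of Python to show why the stop rule has to follow log order rather than xid order. The log format, the dictionary standing in for commit status, and the function `replay_to_xid` are hypothetical simplifications.

```python
# The interleaving from the example, in log order: (xid, action).
LOG = [
    (1, "start"),
    (2, "start"),
    (2, "update x"),
    (2, "commit"),
    (1, "update y"),
    (1, "commit"),
]


def replay_to_xid(log, target_xid):
    """Apply records until the commit/abort record of target_xid.

    Returns the final status of every transaction seen. Anything that
    had started but not yet committed at the stop point ends up aborted.
    """
    status = {}
    seen = set()
    for xid, action in log:
        seen.add(xid)
        if action in ("commit", "abort"):
            status[xid] = "committed" if action == "commit" else "aborted"
            if xid == target_xid:
                break
    return {xid: status.get(xid, "aborted") for xid in sorted(seen)}


print(replay_to_xid(LOG, 2))  # stop at the commit of xid 2
print(replay_to_xid(LOG, 1))  # stop at the commit of xid 1
```

Stopping at xid 2 leaves xid 1 aborted even though 1 is the lower number, and stopping at xid 1 necessarily includes xid 2, so a rule such as "apply every xid up to the target" would give the wrong result in both cases.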

Jan

OK, thanks. I'll just leave the time_t datatype just the way it is.

Best Regards, Simon Riggs

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#15 Simon Riggs
simon@2ndquadrant.com
In reply to: Jan Wieck (#14)
Re: Point in Time Recovery

On Sat, 2004-07-10 at 15:17, Jan Wieck wrote:

On 7/6/2004 3:58 PM, Simon Riggs wrote:

On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:

- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope....)

Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?

Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.

TransactionID and timestamp is only sufficient if the transactions are
selected by their commit order. Especially in read committed mode,
consider this execution:

xid-1: start
xid-2: start
xid-2: update field x
xid-2: commit
xid-1: update field y
xid-1: commit

In this case, the update done by xid-1 depends on the row created by
xid-2. So logically xid-2 precedes xid-1, because it made its changes
earlier.

So you have to apply the log until you find the commit record of the
transaction you want to apply last, and then stamp all transactions that
were in progress at that time as aborted.

Agreed.

I've implemented this exactly as you say....

This turns out to be very easy because:
- when looking where to stop we only ever stop at commit or aborts -
these are the only records that have timestamps anyway...
- any record that isn't specifically committed is not updated in the
clog and therefore not visible. The clog starts in indeterminate state,
0 and is then updated to either committed or aborted. Aborted and
indeterminate are handled similarly in the current code, to allow for
crash recovery - PITR doesn't change anything there.
So, PITR doesn't do anything that crash recovery doesn't already do.
Crash recovery makes no attempt to keep track of in-progress
transactions and doesn't make a special journey to the clog to
specifically mark them as aborted - they just are by default.

So, what we mean by "stop at a transactionId" is "stop applying redo at
the commit/abort record for that transactionId." It has to be an exact
match, not a "greater than", for exactly the reason you mention. That
means that although we stop at the commit record of transactionId X, we
may also have applied records for transactions with later transactionIds
e.g. X+1, X+2...etc (without limit or restriction).

(I'll even admit that at first, I did think we could get away with the
"less than" test that you are warning me about. Overall, I've spent more
time on theory/analysis than on coding, on the idea that you can improve
poor code, but wrong code just needs to be thrown away).

Timestamps are more vague...When time is used, there might easily be 10+
transactions whose commit/abort records have identical timestamp values.
So we either stop at the first or last record depending upon whether we
specified inclusive or exclusive on the recovery target value.
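
A small sketch of the inclusive/exclusive choice, again in Python with hypothetical names. It assumes the commit/abort timestamps are whole seconds and appear in non-decreasing order in the log, which is a simplification for illustration.

```python
def records_to_apply(commit_times, target, inclusive):
    """Count how many commit/abort records are applied before stopping.

    commit_times: second-resolution timestamps in log order.
    inclusive: if True, apply every record stamped exactly at the target;
               if False, stop before the first record stamped at the target.
    """
    count = 0
    for ts in commit_times:
        if ts > target or (ts == target and not inclusive):
            break
        count += 1
    return count


# Three commits share the same one-second timestamp.
times = [100, 101, 101, 101, 102]
inclusive_count = records_to_apply(times, target=101, inclusive=True)
exclusive_count = records_to_apply(times, target=101, inclusive=False)
```

With these values the inclusive setting applies four records and the exclusive setting applies one; there is no way to stop between the three commits that share timestamp 101.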

The hard bit, IMHO, is what we do with the part of the log that we have
chosen not to apply....which has been discussed on list in detail also.

Thanks for keeping an eye out for possible errors - this one is
something I'd thought through and catered for (there are comments in my
current latest published code to that effect, although I have not yet
finished coding the clean-up-after-stopping part).

This implies nothing with regard to other possible errors or oversights
and so I very much welcome any questioning of this nature - I am as
prone to error as the next man. It's important we get this right.

Best regards, Simon Riggs

#16Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#9)
Re: Point in Time Recovery

On Tue, 2004-07-06 at 22:40, Simon Riggs wrote:

On Mon, 2004-07-05 at 22:46, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.

Go with plan B; it's best not to destroy data (what if you chose the
wrong restart point the first time)?

Actually this now reminds me of a discussion I had with Patrick
Macdonald some time ago. The DB2 practice in this connection is that
you *never* overwrite existing logfile data when recovering. Instead
you start a brand new xlog segment file, which is given a new "branch
number" so it can be distinguished from the future-time xlog segments
that you chose not to apply. I don't recall what the DB2 terminology
was exactly --- not "branch number" I don't think --- but anyway the
idea is that when you restart the database after an incomplete recovery,
you are now in a sort of parallel universe that has its own history
after the branch point (PITR stop point). You need to be able to
distinguish archived log segments of this parallel universe from those
of previous and subsequent incarnations. I'm not sure whether Vadim
intended our StartUpID to serve this purpose, but it could perhaps be
used that way, if we reflected it in the WAL file names.

Some more thoughts...focusing on the what do we do after we've finished
recovering. The objectives, as I see them, are to put the system into a
state, that preserves these features:
1. we never overwrite files, in case we want to re-run recovery
2. we never write files that MIGHT have been written previously
3. we need to ensure that any xlog records skipped at admin's request (in
PITR mode) are never in a position to be re-applied to this timeline.
4. ensure we can re-recover, if we need to, without further problems

Tom's concept above, I'm going to call timelines. A timeline is the
sequence of logs created by the execution of a server. If you recover
the database, you create a new timeline. [This is because, if you've
invoked PITR you absolutely definitely want log records written to, say,
xlog15 to be different to those that were written to xlog15 in a
previous timeline that you have chosen not to reapply.]

Objective (1) is complex.
When we are restoring, we always start with archived copies of the xlog,
to make sure we don't finish too soon. We roll forward until we either
reach PITR stop point, or we hit end of archived logs. If we hit end of
logs on archive, then we switch to a local copy, if one exists that is
higher than those, we carry on rolling forward until either we reach
PITR stop point, or we hit end of that log. (Hopefully, there isn't more
than one local xlog higher than the archive, but it's possible).
If we are rolling forward on local copies, then they are our only
copies. We'd really like to archive them ASAP, but the archiver's not
running yet - we don't want to force that situation in case the archive
device (say a tape) is the one being used to recover right now. So we
write an archive_status of .ready for that file, ensuring that the
checkpoint won't remove it until it gets copied to archive, whenever
that starts working again. Objective (1) met.
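
A rough sketch of the .ready marker idea, in Python. The directory layout (an archive_status directory under the xlog directory, one empty marker file per segment) and the function names are assumptions made for illustration, not a description of the patch.

```python
import tempfile
from pathlib import Path


def mark_ready(xlog_dir, segment):
    """Leave a note that this segment still has to be copied to archive."""
    status_dir = Path(xlog_dir) / "archive_status"
    status_dir.mkdir(parents=True, exist_ok=True)
    (status_dir / f"{segment}.ready").touch()


def checkpoint_may_remove(xlog_dir, segment):
    """A checkpoint must keep any segment that is still waiting for archiving."""
    marker = Path(xlog_dir) / "archive_status" / f"{segment}.ready"
    return not marker.exists()


# Usage: a segment becomes protected once it has been marked.
with tempfile.TemporaryDirectory() as xlog_dir:
    before = checkpoint_may_remove(xlog_dir, "xlog15")
    mark_ready(xlog_dir, "xlog15")
    after = checkpoint_may_remove(xlog_dir, "xlog15")
```

The point is only that the marker is written during recovery, before the archiver is running, so the decision to keep the file does not depend on the archive device being available.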

When we have finished recovering we:
- create a new xlog at the start of a new ++timeline
- copy the last applied xlog record to it as the first record
- set the record pointer so that it matches
That way, when we come up and begin running, we never overwrite files
that might have been written previously. Objective (2) met.
We do the other stuff because recovery finishes up by pointing to the
last applied record...which is what was causing all of this extra work
in the first place.

At this point, we also reset the secondary checkpoint record, so that
should recovery be required again before next checkpoint AND the
shutdown checkpoint record written after recovery completes is
wrong/damaged, the recovery will not autorewind back past the PITR stop
point and attempt to recover the records we have just tried so hard to
reverse/ignore. Objective (3) met. (Clearly, that situation seems
unlikely, but I feel we must deal with it...a newly restored system is
actually very fragile, so a crash again within 3 minutes or so is very
commonplace, as far as these things go).

Should we need to re-recover, we can do so because the new timeline
xlogs are further forward than the old timeline, so never get seen by
any processes (all of which look backwards). Re-recovery is possible
without problems, if required. This means you're a lot safer from some
of the mistakes you might have made, such as deciding you need to go into
recovery, then realising it wasn't required (or some other painful
flapping as goes on in computer rooms at 3am).

How do we implement timelines?
The main presumption in the code is that xlogs are sequential. That has
two effects:
1. during recovery, we try to open the "next" xlog by adding one to the
numbers and then looking for that file
2. during checkpoint, we look for filenames less than the current
checkpoint marker
Creating a timeline by adding a larger number to LogId allows us to
prevent (1) from working, yet without breaking (2).
Well, Tom does seem to have something with regard to StartUpIds. I feel
it is easier to force a new timeline by adding a very large number to
the LogId IF, and only if, we have performed an archive recovery. That
way, we do not change at all the behaviour of the system for people that
choose not to implement archive_mode.
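
The LogId jump can be sketched as follows, in Python. The size of the jump, the number of segments per LogId and the (log_id, seg) tuple representation are all assumptions chosen for illustration.

```python
TIMELINE_JUMP = 0x10000  # hypothetical "very large number" added to LogId


def next_segment(log_id, seg, segs_per_log=255):
    """Recovery's notion of the "next" xlog: add one, carrying into LogId."""
    if seg + 1 >= segs_per_log:
        return log_id + 1, 0
    return log_id, seg + 1


def start_new_timeline(log_id):
    """After an archive recovery, begin writing far ahead of the old files."""
    return log_id + TIMELINE_JUMP, 0


def removable(existing, checkpoint_pos):
    """Checkpoint cleanup only looks backwards: segments below the marker."""
    return [s for s in existing if s < checkpoint_pos]


# Old timeline has segments 14..16 of LogId 0; recovery stopped inside 15.
old = [(0, 14), (0, 15), (0, 16)]
new_start = start_new_timeline(0)
```

Stepping forward from any old segment with next_segment never lands on new_start, so effect (1) no longer finds the new files, while the "less than" comparison in removable still treats every old segment as older than the new timeline, so effect (2) keeps working.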

Should we implement timelines?
Yes, I think we should. I've already hit the problems that timelines
solve in my testing and so that means they'll be hit when you don't need
the hassle.

I'm still wrestling with the cleanup-after-stopping-at-point-in-time
code and have some important conclusions.

Moving forward on a timeline is somewhat tricky for xlogs, as shown
above,...but...

My earlier treatment seems to have neglected to include the clog also.
If we stop before end of log, then we also have potentially many (though
presumably at least one) committed transactions that we do not want to
be told about ever again.

The starting a new timeline thought works for xlogs, but not for clogs.
No matter how far you go into the future, there is a small (yet
vanishing) possibility that there is a yet undiscovered committed
transaction in the future. (Because transactions are ordered in the clog
because xids are assigned sequentially at txn start, but not ordered in
the xlog where they are recorded in the order the txns complete).

Please tell me that we can ignore the state of the clog, but I think we
can't - if a new xid re-used a previous xid that had committed AND then
we crashed...we would have inconsistent data. Unless we physically write
zeros to clog for every begin transaction after a recovery...err, no...

The only recourse that I can see is to "truncate the future" of the
clog, which would mean:
- keeping track of the highest xid provided by any record from the xlog,
in xact.c, xact_redo
- using that xid to write zeros to the clog after this point until EOF
- drop any clog segment files past the new "high" segment
- no idea how that affects NT or not...

The timeline idea works for xlog because once we've applied the xlog
records and checkpointed, we can discard the xlog records. We can't do
that with clog records (unless we followed recovery with a vacuum full -
which is possible, but not hugely desirable) - though this doesn't solve
the issue that xlog records don't have any prescribed position in the
file, clog records do.

Right now, I don't know where to start with the clog code and the
opportunity for code-overlap with NT seems way high. These problems can
be conquered, given time and "given enough eyeballs".

I'm all ears for some bright ideas...but I'm getting pretty wary that we
may introduce some unintended features if we try to get this stabilised
within two weeks. My current conclusion is: let's commit archive recovery
in this release, then wait until next dot release for full recovery
target features. We've hit all the features which were a priority and
the fundamental architecture is there, so I think it is time to be happy
with what we've got, for now.

Comments, please....remembering that I'd love it if I've missed
something that simplifies the task. Fire away.

Best regards, Simon Riggs

#17Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Simon Riggs (#16)
Re: Point in Time Recovery

The starting a new timeline thought works for xlogs, but not for clogs.
No matter how far you go into the future, there is a small (yet
vanishing) possibility that there is a yet undiscovered committed
transaction in the future. (Because transactions are ordered in the clog
because xids are assigned sequentially at txn start, but not ordered in
the xlog where they are recorded in the order the txns complete).

Won't ExtendCLOG take care of this with ZeroCLOGPage ? Else the same problem
would arise at xid wraparound, no ?

Andreas

#18Simon Riggs
simon@2ndquadrant.com
In reply to: Zeugswetter Andreas SB SD (#17)
Re: Point in Time Recovery

On Tue, 2004-07-13 at 13:18, Zeugswetter Andreas SB SD wrote:

The starting a new timeline thought works for xlogs, but not for clogs.
No matter how far you go into the future, there is a small (yet
vanishing) possibility that there is a yet undiscovered committed
transaction in the future. (Because transactions are ordered in the clog
because xids are assigned sequentially at txn start, but not ordered in
the xlog where they are recorded in the order the txns complete).

Won't ExtendCLOG take care of this with ZeroCLOGPage ? Else the same problem
would arise at xid wraparound, no ?

Sounds like a good start...

When PITR ends, we need to stop mid-way through a file. Does that handle
that situation?

Simon Riggs

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#16)
Re: Point in Time Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

Please tell me that we can ignore the state of the clog,

We can.

The reason that keeping track of timelines is interesting for xlog is
simply to take pity on the poor DBA who needs to distinguish the various
archived xlog files he's got laying about, and so that we can detect
errors like supplying inconsistent sets of xlog segments during restore.

This does not apply to clog because it's not archived. It's no more
than a data file. If you think you have trouble recreating clog then
you have the same issues recreating data files.

regards, tom lane

#20Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#19)
Re: Point in Time Recovery

On Tue, 2004-07-13 at 15:29, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

Please tell me that we can ignore the state of the clog,

We can.

In general, you are of course correct.

The reason that keeping track of timelines is interesting for xlog is
simply to take pity on the poor DBA who needs to distinguish the various
archived xlog files he's got laying about, and so that we can detect
errors like supplying inconsistent sets of xlog segments during restore.

This does not apply to clog because it's not archived. It's no more
than a data file. If you think you have trouble recreating clog then
you have the same issues recreating data files.

I'm getting carried away with the improbable....but this is the rather
strange, but possible scenario I foresee:

A sequence of times...
1. We start archiving xlogs
2. We take a checkpoint
3. we commit an important transaction
4. We take a backup
5. We take a checkpoint

As stands currently, when we restore the backup, controlfile says that
last checkpoint was at 2, so we rollforward from 2 THRU 4 and continue
on past 5 until end of logs. Normally, end of logs isn't until after
4...

When we specify a recovery target, it is possible to specify the
rollforward to complete just before point 3. So we use the backup taken
at 4 to rollforward to a point in the past (from the backup's
perspective). The backup taken at 4 may now have data and clog records
written by bgwriter.

Given that time between checkpoints is likely to be longer than
previously was the case...this becomes a non-zero situation.

I was trying to solve this problem head on, but the best way is to make
sure we never allow ourselves such a muddled situation:

ISTM the way to avoid this is to insist that we always rollforward
through at least one checkpoint to guarantee that this will not occur.

...then I can forget I ever mentioned the ****** clog again.

I'm ignoring this issue for now....whether it exists or not!

Best Regards, Simon Riggs

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#20)
Re: Point in Time Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

I'm getting carried away with the improbable....but this is the rather
strange, but possible scenario I foresee:

A sequence of times...
1. We start archiving xlogs
2. We take a checkpoint
3. we commit an important transaction
4. We take a backup
5. We take a checkpoint

When we specify a recovery target, it is possible to specify the
rollforward to complete just before point 3.

No, it isn't possible. The recovery *must* proceed at least as far as
wherever the end of the log was at the time the backup was completed.
Otherwise everything is broken, not only clog, because you may have disk
blocks in your backup that postdate where you stopped log replay.

To have a consistent recovery at all, you must replay the log starting
from a checkpoint before the backup began and extending to the time that
the backup finished. You only get to decide where to stop after that
point.
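
The constraint can be written down as a simple check. This is a sketch in Python with hypothetical names; WAL positions are modelled as plain comparable integers, which is an assumption for illustration only.

```python
def replay_range(checkpoint_before_backup, backup_end, target):
    """Return the (start, stop) WAL positions for a consistent recovery.

    checkpoint_before_backup: position of a checkpoint taken before the
                              backup began; replay always starts here.
    backup_end:               end-of-log position when the backup finished.
    target:                   position at which the admin wants to stop.
    """
    if target < backup_end:
        # The backup may contain disk blocks newer than the stop point.
        raise ValueError("recovery target precedes the end of the backup")
    return checkpoint_before_backup, target
```

For example, replay_range(200, 400, 500) returns (200, 500), while replay_range(200, 400, 300) raises ValueError because the requested stop point falls inside the backup window.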

regards, tom lane

#22Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#21)
Re: Point in Time Recovery

On Tue, 2004-07-13 at 22:19, Tom Lane wrote:

To have a consistent recovery at all, you must replay the log starting
from a checkpoint before the backup began and extending to the time that
the backup finished. You only get to decide where to stop after that
point.

So the situation is:
- You must only stop recovery at a point in time (in the logs) after the
backup had completed.

No way to enforce that currently, apart from procedurally. Not exactly
frequent, so I think I just document that and move on, eh?

Thanks for your help,

Best regards, Simon Riggs

#23Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#22)
Re: Point in Time Recovery

Simon Riggs wrote:

On Tue, 2004-07-13 at 22:19, Tom Lane wrote:

To have a consistent recovery at all, you must replay the log starting
from a checkpoint before the backup began and extending to the time that
the backup finished. You only get to decide where to stop after that
point.

So the situation is:
- You must only stop recovery at a point in time (in the logs) after the
backup had completed.

No way to enforce that currently, apart from procedurally. Not exactly
frequent, so I think I just document that and move on, eh?

If it happens, could you use your previous full backup and the PITR logs
from before stop stopped logging, and then after? Is there a period
where they could not restore reliably?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#24Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#23)
Re: Point in Time Recovery

On Tue, 2004-07-13 at 23:42, Bruce Momjian wrote:

Simon Riggs wrote:

On Tue, 2004-07-13 at 22:19, Tom Lane wrote:

To have a consistent recovery at all, you must replay the log starting
from a checkpoint before the backup began and extending to the time that
the backup finished. You only get to decide where to stop after that
point.

So the situation is:
- You must only stop recovery at a point in time (in the logs) after the
backup had completed.

No way to enforce that currently, apart from procedurally. Not exactly
frequent, so I think I just document that and move on, eh?

If it happens, could you use your previous full backup and the PITR logs
from before stop stopped logging, and then after?

Yes.

Is there a period
where they could not restore reliably?

Good question. No is the answer.

The situation is that the backup isn't timestamped with respect to the
logs, so it's possible to attempt to use the wrong backup for recovery.

The solution is procedural - make sure you timestamp your backup files,
so you know which ones to recover with...

Best Regards, Simon Riggs

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#22)
Re: Point in Time Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

So the situation is:
- You must only stop recovery at a point in time (in the logs) after the
backup had completed.

Right.

No way to enforce that currently, apart from procedurally. Not exactly
frequent, so I think I just document that and move on, eh?

The procedure that generates a backup has got to be responsible for
recording both the start and stop times. If it does not do so then
it's fatally flawed. (Note also that you had better be careful to get
the time as seen on the server machine's clock ... this could be a nasty
gotcha if the backup is run on a different machine, such as an NFS
server.)

regards, tom lane

#26Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#25)
Re: Point in Time Recovery

Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

So the situation is:
- You must only stop recovery at a point in time (in the logs) after the
backup had completed.

Right.

No way to enforce that currently, apart from procedurally. Not exactly
frequent, so I think I just document that and move on, eh?

The procedure that generates a backup has got to be responsible for
recording both the start and stop times. If it does not do so then
it's fatally flawed. (Note also that you had better be careful to get
the time as seen on the server machine's clock ... this could be a nasty
gotcha if the backup is run on a different machine, such as an NFS
server.)

OK, but procedurally, how do you correlate the start/stop time of the
tar backup with the WAL numeric file names?

#27Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#23)
Re: [HACKERS] Point in Time Recovery

PITR Patch v5_1 just posted has Point in Time Recovery working....

Still some rough edges....but we really need some testers now to give
this a try and let me know what you think.

Klaus Naumann and Mark Wong are the only [non-committers] to have tried
to run the code (and let me know about it), so please have a look at
[PATCHES] and try it out.

Many thanks,

Simon Riggs

#28Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#26)
Re: Point in Time Recovery

On Wed, 2004-07-14 at 00:01, Bruce Momjian wrote:

Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

So the situation is:
- You must only stop recovery at a point in time (in the logs) after the
backup had completed.

Right.

No way to enforce that currently, apart from procedurally. Not exactly
frequent, so I think I just document that and move on, eh?

The procedure that generates a backup has got to be responsible for
recording both the start and stop times. If it does not do so then
it's fatally flawed. (Note also that you had better be careful to get
the time as seen on the server machine's clock ... this could be a nasty
gotcha if the backup is run on a different machine, such as an NFS
server.)

OK, but procedurally, how do you correlate the start/stop time of the
tar backup with the WAL numeric file names?

No need. You just correlate the recovery target with the backup file
times. Mostly, you'll only ever use your last backup and won't need to
fuss with the times.

Backup should begin with a CHECKPOINT...then wait for that to complete,
just to make the backup as current as possible.

If you want to start purging your archives of old archived xlogs, you
can use the filedate (assuming you preserved that on your copy to
archive - but even if not, they'll be fairly close).

Best regards, Simon Riggs

#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#26)
Re: Point in Time Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

OK, but procedurally, how do you correlate the start/stop time of the
tar backup with the WAL numeric file names?

Ideally the procedure for making a backup would go something like:

1. Inquire of the server its current time and the WAL position of the
most recent checkpoint record (which is what you really need).

2. Make the backup.

3. Inquire of the server its current time and the current end-of-WAL
position.

4. Record items 1 and 3 along with the backup itself.

I think the current theory was you could fake #1 by copying pg_control
before everything else, but this doesn't really help for step #3, so
it would probably be better to add some server functions to get this
info.
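
The four steps could be wrapped up roughly like this. It is a Python sketch only: the server functions mentioned above do not exist yet, so the three query callables and the label format here are hypothetical stand-ins supplied by the caller.

```python
def take_backup(server_time, last_checkpoint_pos, end_of_wal_pos, copy_files):
    """Run a backup and return the facts that must be stored alongside it.

    All four arguments are callables. The first three stand in for the
    proposed server functions; copy_files performs the actual copy.
    """
    label = {
        "start_time": server_time(),                # step 1
        "start_checkpoint": last_checkpoint_pos(),  # step 1
    }
    copy_files()                                    # step 2
    label["stop_time"] = server_time()              # step 3
    label["stop_wal"] = end_of_wal_pos()            # step 3
    return label                                    # step 4: keep with backup


# Usage with fake values in place of a real server.
clock = iter([1000, 1060])
label = take_backup(
    server_time=lambda: next(clock),
    last_checkpoint_pos=lambda: "0/A0",
    end_of_wal_pos=lambda: "0/F8",
    copy_files=lambda: None,
)
```

Both times come from the server side, which is the reason for asking the server rather than reading the clock of the machine that runs the copy.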

regards, tom lane

#30Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#29)
Re: Point in Time Recovery

On Wed, 2004-07-14 at 00:28, Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

OK, but procedurally, how do you correlate the start/stop time of the
tar backup with the WAL numeric file names?

Ideally the procedure for making a backup would go something like:

1. Inquire of the server its current time and the WAL position of the
most recent checkpoint record (which is what you really need).

2. Make the backup.

3. Inquire of the server its current time and the current end-of-WAL
position.

4. Record items 1 and 3 along with the backup itself.

I think the current theory was you could fake #1 by copying pg_control
before everything else, but this doesn't really help for step #3, so
it would probably be better to add some server functions to get this
info.

err...I think at this point we should review the PITR patch....

The recovery mechanism doesn't rely upon you knowing 1 or 3. The
recovery reads pg_control (from the backup) and then attempts to
de-archive the appropriate xlog segment file and then starts rollforward
from there. Effectively, restore assumes it has access to an infinite
timeline of logs....which clearly isn't the case, but it's up to *you* to
check that you have the logs that go with the backups. (Or put another
way, if this sounds hard, buy some software that administers the
procedure for you). That's the mechanism that allows "infinite
recovery".

In brief, the code path is as identical as possible to the current crash
recovery situation...archive recovery restores the files from archive
when they are needed, just as if they had always been in pg_xlog, in a
way that ensures pg_xlog never runs out of space.

Recovery ends when: it reaches the recovery target you specified, or it
runs out of xlogs (first it runs out of archived xlogs, then tries to
find a more recent local copy if there is one).
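
The order in which recovery consumes segments can be sketched as below, in Python. Segment numbers are plain integers and the names are hypothetical; the recovery-target stop condition is left out so that only the archive-then-local fallback is shown.

```python
def segments_to_replay(start, archived, local):
    """Yield (segment, source) pairs in the order recovery would read them.

    The archived copy is always preferred; a local copy is only used once
    the archive has nothing for that segment number.
    """
    seg = start
    while True:
        if seg in archived:
            yield seg, "archive"
        elif seg in local:
            yield seg, "local"
        else:
            return  # out of xlogs: recovery ends here
        seg += 1


# Segment 7 exists in both places; segment 8 was never archived.
plan = list(segments_to_replay(5, archived={5, 6, 7}, local={7, 8}))
```

Here plan comes out as segments 5, 6 and 7 from the archive followed by segment 8 from the local directory, after which recovery stops because segment 9 exists nowhere.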

I think the current theory was you could fake #1 by copying pg_control
before everything else, but this doesn't really help for step #3, so
it would probably be better to add some server functions to get this
info.

Not sure what you mean by "fake"....

Best Regards, Simon Riggs

#31Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Simon Riggs (#27)
Re: [HACKERS] Point in Time Recovery

Can you give us some suggestions of what kind of stuff to test? Is
there a way we can artificially kill the backend in all sorts of nasty
spots to see if recovery works? Does kill -9 simulate a 'power off'?

Chris

Simon Riggs wrote:

PITR Patch v5_1 just posted has Point in Time Recovery working....

Still some rough edges....but we really need some testers now to give
this a try and let me know what you think.

Klaus Naumann and Mark Wong are the only [non-committers] to have tried
to run the code (and let me know about it), so please have a look at
[PATCHES] and try it out.

Many thanks,

Simon Riggs

#32Simon Riggs
simon@2ndquadrant.com
In reply to: Christopher Kings-Lynne (#31)
Re: [HACKERS] Point in Time Recovery

On Wed, 2004-07-14 at 03:31, Christopher Kings-Lynne wrote:

Can you give us some suggestions of what kind of stuff to test? Is
there a way we can artificially kill the backend in all sorts of nasty
spots to see if recovery works? Does kill -9 simulate a 'power off'?

I was hoping some fiendish plans would be presented to me...

But please start with "this feels like typical usage" and we'll go from
there...the important thing is to try the first one.

I've not done power off tests, yet. They need to be done just to
check...actually you don't need to do this to test PITR...

We need exhaustive tests of...
- power off
- scp and cross network copies
- all the permuted recovery options
- archive_mode = off (i.e. current behaviour)
- deliberately incorrectly set options (idiot-proof testing)

I'd love some help assembling a test document with numbered tests...

Best regards, Simon Riggs

#33Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Simon Riggs (#32)
Re: Point in Time Recovery

The recovery mechanism doesn't rely upon you knowing 1 or 3. The
recovery reads pg_control (from the backup) and then attempts to
de-archive the appropriate xlog segment file and then starts
rollforward

Unfortunately this only works if pg_control was the first file to be
backed up (or by chance no checkpoint happened after backup start and
pg_control backup)

Other db's have commands for:
start/end external backup

Maybe we should add those two commands that would initially only do
the following:

start external backup:
- (checkpoint as an option)
- make a copy of pg_control
end external backup:
- record WAL position (helps choose an allowed minimum PIT)

Those commands would actually not be obligatory but recommended, and would
only help with the restore process.

Restore would then either take the existing pg_control backup, or ask
the dba where rollforward should start and create a corresponding pg_control.
A helper utility could list possible checkpoints in a given xlog.

Andreas

#34Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#32)
Re: [HACKERS] Point in Time Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

I've not done power off tests, yet. They need to be done just to
check...actually you don't need to do this to test PITR...

I agree, power off is not really the point here. What we need to check
into is (a) the mechanics of archiving WAL segments and (b) the
process of restoring given a backup and a bunch of WAL segments.

regards, tom lane

#35Noname
markw@osdl.org
In reply to: Simon Riggs (#27)
Re: [HACKERS] Point in Time Recovery

On 14 Jul, Simon Riggs wrote:

PITR Patch v5_1 just posted has Point in Time Recovery working....

Still some rough edges....but we really need some testers now to give
this a try and let me know what you think.

Klaus Naumann and Mark Wong are the only [non-committers] to have tried
to run the code (and let me know about it), so please have a look at
[PATCHES] and try it out.

Many thanks,

Simon Riggs

Simon,

I just tried applying the v5_1 patch against the cvs tip today and got a
couple of rejections. I'll copy the patch output here. Let me know if
you want to see the reject files or anything else:

$ patch -p0 < ../../../pitr-v5_1.diff
patching file backend/access/nbtree/nbtsort.c
Hunk #2 FAILED at 221.
1 out of 2 hunks FAILED -- saving rejects to file backend/access/nbtree/nbtsort.c.rej
patching file backend/access/transam/xlog.c
Hunk #11 FAILED at 1802.
Hunk #15 FAILED at 2152.
Hunk #16 FAILED at 2202.
Hunk #21 FAILED at 3450.
Hunk #23 FAILED at 3539.
Hunk #25 FAILED at 3582.
Hunk #26 FAILED at 3833.
Hunk #27 succeeded at 3883 with fuzz 2.
Hunk #28 FAILED at 4446.
Hunk #29 succeeded at 4470 with fuzz 2.
8 out of 29 hunks FAILED -- saving rejects to file backend/access/transam/xlog.c.rej
patching file backend/postmaster/Makefile
patching file backend/postmaster/postmaster.c
Hunk #3 succeeded at 1218 with fuzz 2 (offset 70 lines).
Hunk #4 succeeded at 1827 (offset 70 lines).
Hunk #5 succeeded at 1874 (offset 70 lines).
Hunk #6 succeeded at 1894 (offset 70 lines).
Hunk #7 FAILED at 1985.
Hunk #8 succeeded at 2039 (offset 70 lines).
Hunk #9 succeeded at 2236 (offset 70 lines).
Hunk #10 succeeded at 2996 with fuzz 2 (offset 70 lines).
1 out of 10 hunks FAILED -- saving rejects to file backend/postmaster/postmaster.c.rej
patching file backend/storage/smgr/md.c
Hunk #1 succeeded at 162 with fuzz 2.
patching file backend/utils/misc/guc.c
Hunk #1 succeeded at 342 (offset 9 lines).
Hunk #2 succeeded at 1387 (offset 9 lines).
patching file backend/utils/misc/postgresql.conf.sample
Hunk #1 succeeded at 113 (offset 10 lines).
patching file bin/initdb/initdb.c
patching file include/access/xlog.h
patching file include/storage/pmsignal.h

#36Simon Riggs
simon@2ndquadrant.com
In reply to: Noname (#35)
Re: [HACKERS] Point in Time Recovery

On Wed, 2004-07-14 at 16:55, markw@osdl.org wrote:

On 14 Jul, Simon Riggs wrote:

PITR Patch v5_1 just posted has Point in Time Recovery working....

Still some rough edges....but we really need some testers now to give
this a try and let me know what you think.

Klaus Naumann and Mark Wong are the only [non-committers] to have tried
to run the code (and let me know about it), so please have a look at
[PATCHES] and try it out.

I just tried applying the v5_1 patch against the cvs tip today and got a
couple of rejections. I'll copy the patch output here. Let me know if
you want to see the reject files or anything else:

I'm on it. Sorry 'bout that all - midnight fingers.

#37Simon Riggs
simon@2ndquadrant.com
In reply to: Zeugswetter Andreas SB SD (#33)
Re: Point in Time Recovery

On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote:

The recovery mechanism doesn't rely upon you knowing 1 or 3. The
recovery reads pg_control (from the backup) and then attempts to
de-archive the appropriate xlog segment file and then starts
rollforward

Unfortunately this only works if pg_control was the first file to be
backed up (or by chance no checkpoint happened after backup start and
pg_control backup)

Other db's have commands for:
start/end external backup

OK...this idea has come up a few times. Here's my take:

- OS and hardware facilities exist now to make instant copies of sets of
files. Some of these are open source, others not. If you use these, you
have no requirement for this functionality....but these alone are no
replacement for archive recovery.... I accept that some people may not
wish to go to the expense or effort to use those options, but in my mind
these are the people that will not be using archive_mode anyway.

- all we would really need to do is to stop the bgwriter from doing
anything during backup. pg_control is only updated at checkpoint. The
current xlog is updated constantly, but this need not be copied because
we are already archiving it as soon as it's full. That leaves the
bgwriter, which is now responsible for both lazy writing and
checkpoints.
So, put a switch into bgwriter to halt for a period, then turn it back
on at the end. Could be a SIGHUP GUC...or...

...and with my greatest respects....

- please could somebody else code that?... my time is limited
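
A rough sketch of the switch described above, purely illustrative. The names used here (backup_in_progress, backup_start, backup_end, bgwriter_cycle, BgWriterStats) are hypothetical and are not part of the patch or of PostgreSQL. It only shows the shape of the idea: a background-writer loop that does no lazy writes and no checkpoints while an external backup is running.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical flag: set when an external backup starts, cleared when it ends. */
static volatile sig_atomic_t backup_in_progress = 0;

void backup_start(void) { backup_in_progress = 1; }
void backup_end(void)   { backup_in_progress = 0; }

/* Counters standing in for the real work a background writer would do. */
typedef struct
{
    int lazy_writes;
    int checkpoints;
} BgWriterStats;

/*
 * One pass of a bgwriter-style loop.
 * Returns true if any work was done, false if the writer is halted.
 */
bool bgwriter_cycle(BgWriterStats *stats, bool checkpoint_due)
{
    if (backup_in_progress)
        return false;           /* halted: no data-file writes, no checkpoint */

    stats->lazy_writes++;       /* stand-in for lazy writing of dirty buffers */
    if (checkpoint_due)
        stats->checkpoints++;   /* stand-in for a checkpoint (updates pg_control) */
    return true;
}
```

The flag is a sig_atomic_t so that it could be flipped from a signal handler, in the spirit of the SIGHUP suggestion; how the flag would actually be set and cleared is left open here.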

Best regards, Simon Riggs

#38Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#36)
5 attachment(s)
Re: [HACKERS] Point in Time Recovery

On Wed, 2004-07-14 at 20:33, Simon Riggs wrote:

On Wed, 2004-07-14 at 16:55, markw@osdl.org wrote:

On 14 Jul, Simon Riggs wrote:

PITR Patch v5_1 just posted has Point in Time Recovery working....

Still some rough edges....but we really need some testers now to give
this a try and let me know what you think.

Klaus Naumann and Mark Wong are the only [non-committers] to have tried
to run the code (and let me know about it), so please have a look at
[PATCHES] and try it out.

I just tried applying the v5_1 patch against the cvs tip today and got a
couple of rejections. I'll copy the patch output here. Let me know if
you want to see the reject files or anything else:

I'm on it. Sorry 'bout that all - midnight fingers.

Latest version, pitr_v5_2.patch...

- Updated to cvs tip
- Additional tip changes located and patched
- Full re-test of both recover to point in time and recover to xid
- 2 additional bug fixes
- corrected recovery.conf sample
- Patch test
- Patch manually inspected
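
For testers trying recovery to a point in time: the patch expects recovery_target_time in the form '2004-07-13-16:09:07' and converts it to a time_t in local time. A minimal standalone sketch of that conversion follows. The helper name parse_target_time is made up for illustration, and it uses sscanf rather than the strptime call in the patch so that it needs only standard C.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Parse "YYYY-MM-DD-hh:mm:ss" into a time_t, interpreted as local time.
 * Returns false on a format error or if mktime cannot represent the time.
 * Only coarse range checks are done; mktime normalizes the rest.
 */
bool parse_target_time(const char *str, time_t *out)
{
    struct tm tm;
    int year, mon, day, hour, min, sec;
    char extra;

    /* Exactly six fields must convert; a seventh means trailing junk. */
    if (sscanf(str, "%4d-%2d-%2d-%2d:%2d:%2d%c",
               &year, &mon, &day, &hour, &min, &sec, &extra) != 6)
        return false;

    if (mon < 1 || mon > 12 || day < 1 || day > 31 ||
        hour < 0 || hour > 23 || min < 0 || min > 59 ||
        sec < 0 || sec > 60)
        return false;

    memset(&tm, 0, sizeof(tm));
    tm.tm_year = year - 1900;
    tm.tm_mon = mon - 1;
    tm.tm_mday = day;
    tm.tm_hour = hour;
    tm.tm_min = min;
    tm.tm_sec = sec;
    tm.tm_isdst = -1;           /* let mktime decide whether DST applies */

    *out = mktime(&tm);
    return *out != (time_t) -1;
}
```

Since the xlog commit timestamps are only accurate to the second, nothing finer than whole seconds is parsed.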

(pgarch.c, pgarch.h and README identical to previous post)

Go for it...

Best regards, Simon
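
For readers following the attached patch: xlog.c tells the archiver about completed segments through empty marker files in pg_xlog/archive_status, named after the segment with a .ready suffix and renamed to .done once archived. A small sketch of that naming convention, assuming hypothetical helper names (status_file_path, looks_like_segment) that do not appear in the patch:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Segment file names are 16 upper-case hex digits: 8 for log id, 8 for segment. */
#define SEGMENT_NAME_LEN 16

/* True if name has the length and character set of an xlog segment file. */
bool looks_like_segment(const char *name)
{
    return strlen(name) == SEGMENT_NAME_LEN &&
           strspn(name, "0123456789ABCDEF") == SEGMENT_NAME_LEN;
}

/*
 * Build "<status_dir>/<LOGID><SEGNO><suffix>", e.g. suffix ".ready" or ".done".
 * Returns 0 on success, -1 if the result would not fit in buf.
 */
int status_file_path(char *buf, size_t buflen, const char *status_dir,
                     uint32_t log, uint32_t seg, const char *suffix)
{
    int n = snprintf(buf, buflen, "%s/%08X%08X%s",
                     status_dir, (unsigned int) log, (unsigned int) seg, suffix);

    return (n >= 0 && (size_t) n < buflen) ? 0 : -1;
}
```

Checking the snprintf return value guards against silent truncation of the path, which the fixed-size buffers in this area make easy to overlook.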

Attachments:

pitr_v5_2.patch (text/x-patch)
Index: backend/access/nbtree/nbtsort.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/access/nbtree/nbtsort.c,v
retrieving revision 1.83
diff -c -r1.83 nbtsort.c
*** backend/access/nbtree/nbtsort.c	11 Jul 2004 18:01:45 -0000	1.83
--- backend/access/nbtree/nbtsort.c	14 Jul 2004 22:41:19 -0000
***************
*** 67,72 ****
--- 67,73 ----
  #include "miscadmin.h"
  #include "storage/smgr.h"
  #include "utils/tuplesort.h"
+ #include "access/xlog.h"
  
  
  /*
***************
*** 220,235 ****
  
  	wstate.index = btspool->index;
  	/*
! 	 * We need to log index creation in WAL iff WAL archiving is enabled
  	 * AND it's not a temp index.
- 	 *
- 	 * XXX when WAL archiving is actually supported, this test will likely
- 	 * need to change; and the hardwired extern is cruddy anyway ...
  	 */
  	{
! 		extern char XLOG_archive_dir[];
! 
! 		wstate.btws_use_wal = XLOG_archive_dir[0] && !wstate.index->rd_istemp;
  	}
  	/* reserve the metapage */
  	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
--- 221,231 ----
  
  	wstate.index = btspool->index;
  	/*
! 	 * We need to log index creation in WAL if WAL archiving is enabled
  	 * AND it's not a temp index.
  	 */
  	{
!  		wstate.btws_use_wal = XLogArchiveMode && !wstate.index->rd_istemp;
  	}
  	/* reserve the metapage */
  	wstate.btws_pages_alloced = BTREE_METAPAGE + 1;
Index: backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/access/transam/xlog.c,v
retrieving revision 1.147
diff -c -r1.147 xlog.c
*** backend/access/transam/xlog.c	1 Jul 2004 00:49:50 -0000	1.147
--- backend/access/transam/xlog.c	14 Jul 2004 22:41:24 -0000
***************
*** 36,47 ****
  #include "storage/proc.h"
  #include "storage/sinval.h"
  #include "storage/spin.h"
  #include "utils/builtins.h"
  #include "utils/guc.h"
  #include "utils/relcache.h"
  #include "miscadmin.h"
  
- 
  /*
   * This chunk of hackery attempts to determine which file sync methods
   * are available on the current platform, and to choose an appropriate
--- 36,47 ----
  #include "storage/proc.h"
  #include "storage/sinval.h"
  #include "storage/spin.h"
+ #include "storage/pmsignal.h"
  #include "utils/builtins.h"
  #include "utils/guc.h"
  #include "utils/relcache.h"
  #include "miscadmin.h"
  
  /*
   * This chunk of hackery attempts to determine which file sync methods
   * are available on the current platform, and to choose an appropriate
***************
*** 85,96 ****
  
  
  /* User-settable parameters */
  int			CheckPointSegments = 3;
  int			XLOGbuffers = 8;
  char	   *XLOG_sync_method = NULL;
  const char	XLOG_sync_method_default[] = DEFAULT_SYNC_METHOD_STR;
- char		XLOG_archive_dir[MAXPGPATH];		/* null string means
- 												 * delete 'em */
  
  #ifdef WAL_DEBUG
  bool		XLOG_DEBUG = false;
--- 85,98 ----
  
  
  /* User-settable parameters */
+ bool 			XLogArchiveMode = false;
+ bool 			XLogArchiveDEBUG = false;
+ char 			*XLogArchiveDest;
+ char 			*XLogArchiveProgram;
  int			CheckPointSegments = 3;
  int			XLOGbuffers = 8;
  char	   *XLOG_sync_method = NULL;
  const char	XLOG_sync_method_default[] = DEFAULT_SYNC_METHOD_STR;
  
  #ifdef WAL_DEBUG
  bool		XLOG_DEBUG = false;
***************
*** 127,132 ****
--- 129,157 ----
  
  /* Are we doing recovery by reading XLOG? */
  bool		InRecovery = false;
+ bool        InArchiveRecovery = false;
+ bool        restoreFromArchive = false;
+ bool        InRecoveryCleanup = false;
+ 
+ FILE	   	*archiveStatusLogFD;
+ 
+ static void readRecoveryCommandFile(void);
+ 
+ static 	char recoveryCommandFile[MAXPGPATH];
+ 
+ static  char recoveryRestoreProgram[MAXPGPATH];
+ static  char recoveryTargetTimeStr[MAXPGPATH];
+ static  char recoveryTargetXidStr[MAXPGPATH];
+ 
+ static  bool recoveryTarget = true;
+ static  bool recoveryTargetExact = false;
+ static  bool recoveryTargetInclusive = true;
+ 
+ static  bool recoveryDebugLog = false;
+ static  bool recoveryUseLocalTime = true;
+ 
+ static  TransactionId   recoveryTargetXid;
+ static  time_t          recoveryTargetTime;
  
  /*
   * MyLastRecPtr points to the start of the last XLOG record inserted by the
***************
*** 393,398 ****
--- 418,424 ----
  
  /* File path names */
  static char XLogDir[MAXPGPATH];
+ static char archiveStatusDir[MAXPGPATH];
  static char ControlFilePath[MAXPGPATH];
  
  /*
***************
*** 434,439 ****
--- 460,468 ----
  
  static bool InRedo = false;
  
+ static void XLogArchiveNotify(uint32 log, uint32 seg);
+ static bool XLogArchiveDone(char xlog[32]);
+ static void XLogArchiveCleanup(char xlog[32]);
  
  static bool AdvanceXLInsertBuffer(void);
  static bool WasteXLInsertBuffer(void);
***************
*** 444,449 ****
--- 473,479 ----
  					   bool find_free, int max_advance,
  					   bool use_lock);
  static int	XLogFileOpen(uint32 log, uint32 seg, bool econt);
+ static void RestoreRecoveryXlog(char *path, uint32 log, uint32 seg);
  static void PreallocXlogFiles(XLogRecPtr endptr);
  static void MoveOfflineLogs(uint32 log, uint32 seg, XLogRecPtr endptr);
  static XLogRecord *ReadRecord(XLogRecPtr *RecPtr, int emode, char *buffer);
***************
*** 455,464 ****
  static void ReadControlFile(void);
  static char *str_time(time_t tnow);
  static void issue_xlog_fsync(void);
- #ifdef WAL_DEBUG
  static void xlog_outrec(char *buf, XLogRecord *record);
- #endif
- 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
--- 485,491 ----
***************
*** 912,917 ****
--- 939,1076 ----
  }
  
  /*
+  * XLogArchiveNotify
+  *
+  * Writes an archive notification file to the archiveStatusDir
+  *
+  * The name of the notification file is the message that will be picked up
+  * by the archiver, e.g. we write archiveStatusDir/00000001000000C6.ready 
+  * and the archiver then knows to archive XLogDir/00000001000000C6,
+  * then when complete, rename it to archiveStatusDir/00000001000000C6.done
+  *
+  * Called only when in XLogArchiveMode by one backend process
+  * or by Archive Restore in various situations
+  */
+ static void 
+ XLogArchiveNotify(uint32 log, uint32 seg)
+ {
+ 	char		archiveStatusLog[32];
+ 	char		archiveStatusLogPath[MAXPGPATH];
+ 
+ /* insert an otherwise empty file called <XLOG>.ready */
+ 	sprintf(archiveStatusLog, "%08X%08X.ready", log, seg);
+ 	snprintf(archiveStatusLogPath, MAXPGPATH, "%s/%s", archiveStatusDir, archiveStatusLog);
+    	archiveStatusLogFD = NULL;
+ 	archiveStatusLogFD = AllocateFile(archiveStatusLogPath, "w");
+ 	if (archiveStatusLogFD == NULL) {
+ 		ereport(ERROR,
+ 				(errcode_for_file_access(),
+ 	   		     errmsg("could not write archive_status file \"%s\": %m",
+ 				    archiveStatusLogPath)));
+         return;
+     }
+ 	FreeFile(archiveStatusLogFD);
+ 
+ /* the existence of this file is the message to the archiver to identify
+  * which files require archiving
+  *
+  * if this file is written OK, we then signal the ARCHIVER to do its thang
+  */
+ 
+ 	if (XLogArchiveDEBUG) {
+     	sprintf(archiveStatusLog, "%08X%08X", log, seg);
+ 		ereport(LOG,
+     		(errmsg("transaction log file \"%s\" is now ready to be archived", archiveStatusLog)));
+     }
+ 
+     /*
+      * don't send a signal if we know that the archiver isn't there (yet)
+      * - the archiver will see the archive_status file as soon as it starts 
+      */
+     if (!InArchiveRecovery)
+         SendPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER);
+ 
+ 	return;
+ }
+ 
+ /*
+  * XLogArchiveDone
+  *
+  * Searches for an archive notification file in archiveStatusDir
+  * 
+  * Reads archiveStatusDir looking for a specific filename. If that filename ends with .done
+  * then we know that the filename refers to an xlog in XLogDir that is safe to
+  * recycle. If the filename ends .ready then that's OK, else we have an error.
+  * 
+  * Called only when in XLogArchiveMode by bgwriter (when performing checkpoint)
+  *
+  * XXX code is rehacked from an earlier version, so needs streamlining
+  */
+ static bool 
+ XLogArchiveDone(char xlog[32])
+ {
+ 	char		archiveStatusLogPath[MAXPGPATH];
+ 	struct stat tmpStatBuf;
+ 
+ /* If <XLOG>.done exists then return true
+  */
+ 	snprintf(archiveStatusLogPath, MAXPGPATH, "%s/%s.done", archiveStatusDir, xlog);
+ 	if (stat(archiveStatusLogPath, &tmpStatBuf) == 0) {
+ 		return true;
+ 	} 
+ 	else
+ 		{
+ /*
+  * else if <XLOG>.ready exists then return false and issue WARNING
+  * ...this indicates archiver is either not working at all or
+  * if it is, then it's just way too slow or incorrectly configured
+  */
+ 			snprintf(archiveStatusLogPath, MAXPGPATH, "%s/%s.ready", archiveStatusDir, xlog);
+         	if (stat(archiveStatusLogPath, &tmpStatBuf) == 0) {
+ 			    ereport(WARNING,
+     				(errcode(ERRCODE_WARNING),
+ 	       	 	    errmsg("archiving of log file \"%s\" has not yet started", 
+ 						archiveStatusLogPath),
+                     errhint("you should check whether your archive_program "
+                             "\"%s\" is still functioning correctly", XLogArchiveProgram)));
+ 			    return false;
+ 			}
+ 			else
+ 			{
+ /* else issue a WARNING.... a notification file should exist, except when you
+  * have just enabled archive_mode
+  * assume that error is simply a missing file...
+  */ 
+ 			    ereport(WARNING,
+     				(errcode_for_file_access(),
+     			 	errmsg("archive_status file \"%s\" is missing: %m",
+     						archiveStatusLogPath),
+                     errhint("if you have just enabled archive_mode you should: "
+                             "take a full physical backup, then remove \"%s\"",xlog)));
+ 			    return false;
+ 			}
+ 		}
+ }
+ 
+ /*
+  * XLogArchiveCleanup
+  *
+  * Cleanup an archive notification file for a particular xlog in XLogDir
+  * 
+  * Called only when in XLogArchiveMode by bgwriter (when performing checkpoint)
+  *
+  */
+ static void
+ XLogArchiveCleanup(char xlog[32])
+ {
+ 	char	archiveStatusLogPath[MAXPGPATH];
+ 
+ 	snprintf(archiveStatusLogPath, MAXPGPATH, "%s/%s.done", archiveStatusDir, xlog);
+ 	unlink(archiveStatusLogPath);
+ 
+ }
+ 
+ /*
   * Advance the Insert state to the next buffer page, writing out the next
   * buffer if it still contains unwritten data.
   *
***************
*** 1260,1265 ****
--- 1419,1430 ----
  		{
  			issue_xlog_fsync();
  			LogwrtResult.Flush = LogwrtResult.Write;	/* end of current page */
+ 
+             /* 
+              * write archive_status file to notify that xlog is ready to archive
+              */
+             if (XLogArchiveMode)
+                 XLogArchiveNotify(openLogId, openLogSeg);
  		}
  
  		if (ispartialpage)
***************
*** 1619,1625 ****
  					   bool use_lock)
  {
  	char		path[MAXPGPATH];
! 	struct stat stat_buf;
  
  	XLogFileName(path, log, seg);
  
--- 1784,1790 ----
  					   bool use_lock)
  {
  	char		path[MAXPGPATH];
! 	struct stat tmpStatBuf;
  
  	XLogFileName(path, log, seg);
  
***************
*** 1637,1643 ****
  	else
  	{
  		/* Find a free slot to put it in */
! 		while (stat(path, &stat_buf) == 0)
  		{
  			if (--max_advance < 0)
  			{
--- 1802,1808 ----
  	else
  	{
  		/* Find a free slot to put it in */
! 		while (stat(path, &tmpStatBuf) == 0)
  		{
  			if (--max_advance < 0)
  			{
***************
*** 1686,1692 ****
  	char		path[MAXPGPATH];
  	int			fd;
  
! 	XLogFileName(path, log, seg);
  
  	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | XLOG_SYNC_BIT,
  					   S_IRUSR | S_IWUSR);
--- 1851,1859 ----
  	char		path[MAXPGPATH];
  	int			fd;
  
!     XLogFileName(path, log, seg);
!  	if (restoreFromArchive)
!         RestoreRecoveryXlog(path, log, seg);
  
  	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | XLOG_SYNC_BIT,
  					   S_IRUSR | S_IWUSR);
***************
*** 1706,1715 ****
--- 1873,2104 ----
  				   path, log, seg)));
  	}
  
+     /*
+      * if we switched back to local xlogs after having been
+      * restoring from archive, we need to make sure that the
+      * local files don't get removed by end-of-recovery checkpoint
+      * in case we need to re-run the recovery
+      *
+      * we want to copy these away as soon as possible, so set
+      * the archive_status flag to .ready for them
+      * in case admin isn't cautious enough to have done this anyway
+      */
+  	if (InArchiveRecovery && !restoreFromArchive) {
+        	snprintf(path, MAXPGPATH, "%s/%08X%08X.ready", archiveStatusDir, log, seg);
+        	archiveStatusLogFD = NULL;
+         archiveStatusLogFD = AllocateFile(path, "w");
+ 	    if (archiveStatusLogFD == NULL)
+             ereport(ERROR,
+ 	   	       (errcode_for_file_access(),
+ 	           errmsg("could not write archive_status file \"%s\": %m",
+ 			        path)));
+ 	    FreeFile(archiveStatusLogFD);
+     }        
+  
  	return (fd);
  }
  
  /*
+  * Get next logfile segment to allow recovery
+  *
+  */
+ static void
+ RestoreRecoveryXlog(char *path, uint32 log, uint32 seg)
+ {
+     char tmpXlog[32];
+     char restoreXlog[32];
+     char tmppath[MAXPGPATH];
+     char xlogRestoreCmd[MAXPGPATH];
+     char recoveryXlog[MAXPGPATH];
+     char lastrecoXlog[MAXPGPATH];
+     int         rc;
+ 	struct stat tmpStatBuf;
+     uint32 prevlog, prevseg;
+ 
+     /* 
+      * If a RecoveryFile exists, then we know we are in media recovery
+      * in which case we choose to recover files from archive, even
+      * though a file of that name may already exist in XLogDir
+      *
+      * By doing this, we do not effect crash recovery code path
+      * when we are not in archive_mode
+      *
+      * We take the archived file because, at the point we took backup,
+      * the current xlog will most probably be only partially full, 
+      * so we MUST refer to the full version of this file and 
+      * NOT the version of the file that exists with the backup.
+      *
+      * We could try to optimize this slightly by checking the local
+      * copy lastchange timestamp against the archived copy, 
+      * but we have no API to do this, nor can we guarantee that the
+      * lastchange timestamp was preserved correctly when we copied
+      * to archive. Our aim is robustness, so we elect not to do this.
+      *
+      * Try to copy full xlog from archive to pg_xlog, if it is available
+      * If that succeeds, we pass the RecoveryXlog filepath back for opening
+      * If that fails, then we try to read a local file if one exists.
+      * This allows us to cater for situations where the current xlog
+      * is still available locally and hasn't yet made it to archive.
+      * This could happen if:
+      * - we decide to recover database to undo user data changes
+      * - we have XLogDir on a different disk and the main DataDir drive
+      *   fails, leaving us with just the XLogDir
+      *
+      * Notice that we don't actually overwrite any files when we copy back
+      * from archive because the recoveryRestoreProgram may inadvertently
+      * restore inappropriate xlogs, or they may be corrupt, so we may
+      * have to fallback to the segments remaining in current XLogDir later.
+      * The copy-from-archive xlog is always the same, ensuring that we
+      * don't run out of disk space on long recoveries.
+      *
+      * simon@2ndquadrant.com
+      */
+     
+         snprintf(recoveryXlog, MAXPGPATH, "%s/RECOVERYXLOG", XLogDir);
+     	snprintf(lastrecoXlog, MAXPGPATH, "%s/LASTRECOXLOG", XLogDir);
+ 
+         if (stat(recoveryXlog, &tmpStatBuf) == 0) {            
+             /*
+              * save a copy of the last xlog, before we try to restore
+              * if the restore fails, we will need it to become current xlog
+              */
+             rc = rename(recoveryXlog, lastrecoXlog);
+             if (rc !=0)
+         		ereport(WARNING,
+             		(errcode_for_file_access(),
+                     errmsg("rename failed \"%s\" \"%s\":%m",
+                         recoveryXlog, lastrecoXlog)));
+             /*
+              * if it fails, ignore it - we'll create one soon...
+              */
+         }
+ 
+         /*
+          * Copy xlog from archive_dest to XLogDir
+          */
+         sprintf(restoreXlog, "%08X%08X", log, seg);
+       	snprintf(xlogRestoreCmd, MAXPGPATH, recoveryRestoreProgram, 
+                    XLogArchiveDest, restoreXlog, recoveryXlog);
+         if (XLogArchiveDEBUG)
+ 			ereport(LOG,
+                 (errmsg("restore_program=\"%s\"",
+                     xlogRestoreCmd)));
+ 
+         rc = system(xlogRestoreCmd);
+         if (rc!=0) {
+             /*
+              * remember, we rollforward UNTIL the restore fails
+              * so failure here is just part of the process...
+              * that makes it difficult to determine whether the restore
+              * failed because there isn't an archive to restore, or
+              * because the administrator has specified the restore
+              * program incorrectly...
+              * we could try to restore the testfile that the archiver writes
+              * when it starts up, but the absence of that file isn't
+              * very reliable evidence that the restore itself is broken,
+              * so just trust that the administrator has it correctly,
+              * XXX enhance that later
+              */
+ 			ereport(LOG,
+                 (errmsg("could not restore \"%s\" from archive",
+                     restoreXlog)));
+             /*
+              * if an archived file is not available, there might just be 
+              * a partially full version of this file still in XLogDir
+              * so return this as the filename to open.
+              * In many recovery scenarios we expect this to fail also...
+              * ...note that once we return to local mode, we could
+              * find more than one xlog there that were not in the archive,
+              * if for example, the archiver had stopped working prior
+              * to the event which led to recovery
+              */
+             snprintf(recoveryXlog, MAXPGPATH, "%s/%s", XLogDir, restoreXlog);
+             restoreFromArchive = false;
+             if (stat(recoveryXlog, &tmpStatBuf) == 0) {
+     			ereport(LOG,
+                     (errmsg("recovery will continue using local copy of \"%s\"",
+                         restoreXlog)));
+             }
+             /*
+              * if this file isn't available, then we need to setup the previous
+              * restored xlog to be the last and current xlog, if it exists
+              * remember: we've been restoring from recoverXlog, which isn't
+              * named the same as the normal xlog chain...
+              * also remember to output a corresponding archive_status of .done
+              */
+             else if ((stat(lastrecoXlog, &tmpStatBuf) == 0) && log==0 && seg > 0) {
+                 prevlog = log;
+                 prevseg = seg;
+         	    PrevLogSeg(prevlog, prevseg);
+                 XLogFileName(tmppath, prevlog, prevseg);
+                 if (XLogArchiveDEBUG)
+         			ereport(LOG,
+                         (errmsg("moving last restored xlog to \"%s\"", tmppath)));
+                 rc = rename(lastrecoXlog, tmppath);
+                 if (rc!=0) {
+             		ereport(LOG,
+                 		(errcode_for_file_access(),
+                         errmsg("rename failed \"%s\" \"%s\":%m",
+                             lastrecoXlog, tmppath)));
+             	    ereport(PANIC,
+         		        (errcode_for_file_access(),
+             	        errmsg("could not open file \"%s\" (log file %u, segment %u): %m",
+                             tmppath, prevlog, prevseg)));
+                 }
+ 
+                 /* 
+                  * write out an archive_status file for previous xlog
+                  * to allow xlog to be recycled when recovered database
+                  * is all up and working again
+                  * ...looks wrong, but checkpointer is smart enough
+                  * not to archive the current xlog!
+                  */
+             	sprintf(tmpXlog, "%08X%08X", prevlog, prevseg);
+             	snprintf(tmppath, MAXPGPATH, "%s/%s.done", archiveStatusDir, tmpXlog);
+                	archiveStatusLogFD = NULL;
+             	archiveStatusLogFD = AllocateFile(tmppath, "w");
+ 	            if (archiveStatusLogFD == NULL)
+                     ereport(ERROR,
+ 	       			    (errcode_for_file_access(),
+ 	       		         errmsg("could not write archive_status file \"%s\": %m",
+ 	       			        tmppath)));
+ 	            FreeFile(archiveStatusLogFD);
+             }
+             /* 
+              * there is NO else here...we just return the filename
+              * knowing that it isn't there...which then throws the usual error,
+              * will end with a clear message as to why...but not a problem
+              */
+         }
+         else {
+         /* restore success */
+             /* 
+              * if backup restored an xlog, yet we didn't use the local copy
+              * because we used the xlog version of that name from the
+              * archive instead, we need to write out an archive_status for
+              * it to show it can be recycled later
+              */
+             XLogFileName(tmppath, log, seg);
+             if (stat(tmppath, &tmpStatBuf) == 0) {
+                	sprintf(tmpXlog, "%08X%08X", log, seg);
+                	snprintf(tmppath, MAXPGPATH, "%s/%s.done", archiveStatusDir, tmpXlog);
+                	archiveStatusLogFD = NULL;
+                	archiveStatusLogFD = AllocateFile(tmppath, "w");
+         	    if (archiveStatusLogFD == NULL)
+             		ereport(ERROR,
+         				(errcode_for_file_access(),
+         	   		     errmsg("could not write archive_status file \"%s\": %m",
+     	  			        tmppath)));
+     	        FreeFile(archiveStatusLogFD);
+             }
+    			ereport(LOG,
+                 (errmsg("restored log file \"%s\" from archive", restoreXlog)));
+         }
+         strcpy(path, recoveryXlog);                
+         return;
+ }
+ 
+ /*
   * Preallocate log files beyond the specified log endpoint, according to
   * the XLOGfile user parameter.
   */
***************
*** 1747,1752 ****
--- 2136,2142 ----
  	struct dirent *xlde;
  	char		lastoff[32];
  	char		path[MAXPGPATH];
+     bool        recycle=false;
  
  	XLByteToPrevSeg(endptr, endlogId, endlogSeg);
  
***************
*** 1762,1786 ****
  	errno = 0;
  	while ((xlde = readdir(xldir)) != NULL)
  	{
  		if (strlen(xlde->d_name) == 16 &&
  			strspn(xlde->d_name, "0123456789ABCDEF") == 16 &&
  			strcmp(xlde->d_name, lastoff) <= 0)
  		{
  			snprintf(path, MAXPGPATH, "%s/%s", XLogDir, xlde->d_name);
! 			if (XLOG_archive_dir[0])
! 			{
! 				ereport(LOG,
! 						(errmsg("archiving transaction log file \"%s\"",
! 								xlde->d_name)));
! 				elog(WARNING, "archiving log files is not implemented");
! 			}
! 			else
  			{
  				/*
  				 * Before deleting the file, see if it can be recycled as
  				 * a future log segment.  We allow recycling segments up
! 				 * to XLOGfileslop segments beyond the current XLOG
! 				 * location.
  				 */
  				if (InstallXLogFileSegment(endlogId, endlogSeg, path,
  										   true, XLOGfileslop,
--- 2152,2194 ----
  	errno = 0;
  	while ((xlde = readdir(xldir)) != NULL)
  	{
+ 		/* if correct length and alphanumeric makeup of file looks correct
+ 		 * use the alphanumeric sorting property of the filenames to decide
+ 		 * which ones are earlier than the lastoff transaction log
+ 		 * ...maybe should read lastwrite datetime of lastoff, then check that
+ 		 * only files last written earlier than this are removed/recycled
+ 		 */
  		if (strlen(xlde->d_name) == 16 &&
  			strspn(xlde->d_name, "0123456789ABCDEF") == 16 &&
  			strcmp(xlde->d_name, lastoff) <= 0)
  		{
  			snprintf(path, MAXPGPATH, "%s/%s", XLogDir, xlde->d_name);
! 			if (XLogArchiveMode) {
!                 if (InRecoveryCleanup)
!                     /*
!                      * this allows recycling of transaction logs
!                      * during the shutdown checkpoint at end of recovery
!                      * - we may have restored logs that were not used
!                      * in the recovery sequence, and so will not have
!                      * had an archive_status file written for them. 
!                      * - end-of-recovery doesn't clean up ALL xlogs,
!                      * which is why we also write archive_status files
!                      * as well as doing this
!                      */
!                     recycle=true;
!                 else
!                     recycle=XLogArchiveDone(xlde->d_name);
!             }
!             else
!                 recycle=false;
! 
! 			if ( recycle )
  			{
  				/*
  				 * Before deleting the file, see if it can be recycled as
  				 * a future log segment.  We allow recycling segments up
! 				 * until there are XLOGfileslop segments beyond the
! 				 * current XLOG location, otherwise they are removed.
  				 */
  				if (InstallXLogFileSegment(endlogId, endlogSeg, path,
  										   true, XLOGfileslop,
***************
*** 1794,1803 ****
  				{
  					/* No need for any more future segments... */
  					ereport(LOG,
! 						  (errmsg("removing transaction log file \"%s\"",
  								  xlde->d_name)));
  					unlink(path);
  				}
  			}
  		}
  		errno = 0;
--- 2202,2212 ----
  				{
  					/* No need for any more future segments... */
  					ereport(LOG,
! 						  (errmsg("too many transaction log files, removing \"%s\"",
  								  xlde->d_name)));
  					unlink(path);
  				}
+                 XLogArchiveCleanup(xlde->d_name);
  			}
  		}
  		errno = 0;
***************
*** 2255,2260 ****
--- 2664,2670 ----
  {
  	/* Init XLOG file paths */
  	snprintf(XLogDir, MAXPGPATH, "%s/pg_xlog", DataDir);
+ 	snprintf(archiveStatusDir, MAXPGPATH, "%s/archive_status", XLogDir);
  	snprintf(ControlFilePath, MAXPGPATH, "%s/global/pg_control", DataDir);
  }
  
***************
*** 2772,2777 ****
--- 3182,3345 ----
  }
  
  /*
+  * read in restore command from recovery.conf
+  *
+  * XXX longer term intention is to expand this to 
+  * cater for additional parameters and controls
+  * possibly using a bison grammar to control it
+  */
+ static void
+ readRecoveryCommandFile(void)
+ {
+     FILE     *fd;
+     char    *tok1 = NULL;
+     char    *tok2 = NULL;
+     char    cmdline[MAXPGPATH];
+     bool    syntaxError = false;
+     struct  tm tm;
+ 
+     fd = AllocateFile(recoveryCommandFile, "r");
+ 	if (fd == NULL) {
+     		ereport(FATAL,
+     			(errcode_for_file_access(),
+ 				errmsg("could not open recovery command file \"%s\"",recoveryCommandFile)));
+     		return;
+     }
+ 
+   	ereport(LOG,
+     		(errmsg("recovery command file found...")));
+ 
+     /*  
+      * expecting |restore_program = Qcommand stringQ|
+      * e.g.      |restore_program = 'cp %s/%s %s'|
+      * where | denote the beginning and end of the string
+      */
+     while (!syntaxError && fgets(cmdline, MAXPGPATH, fd) != NULL) {
+ 
+          /* ignore # comments ... */
+         if(strncmp(cmdline, "#", 1) == 0) continue;
+ 
+         tok1 = strtok(cmdline, "'");
+         tok2 = strtok(NULL, "'");
+     
+         if (tok1 != NULL && tok2 != NULL) {
+             tok1 = strtok(cmdline, " =");
+             if (tok1 == NULL) {
+                 syntaxError = true;
+                 continue;
+             }
+             if (strcmp(tok1,"restore_program") == 0) {
+                 strcpy(recoveryRestoreProgram, tok2);
+             	ereport(LOG,
+                     (errmsg("restore_program = %s", recoveryRestoreProgram)));
+                     recoveryTarget = false;
+             }
+             else if (strcmp(tok1,"recovery_target_xid") == 0) {
+                     /* 
+                      * if recovery_target_xid specified, then this overrides
+                      * recovery_target_time, if it is also set
+                      */
+                     strcpy(recoveryTargetXidStr, tok2);
+                     recoveryTargetXid = (TransactionId) strtoul(tok2, NULL, 10);
+                     if (errno == EINVAL || errno == ERANGE ) {
+                        ereport(FATAL,
+                             (errmsg("recovery_target_xid is not a valid xid \"%s\"",
+                                 recoveryTargetXidStr)));
+                     }
+                 	ereport(LOG,(errmsg("recovery_target_xid = %s", recoveryTargetXidStr)));
+                     recoveryTarget = true;
+                     recoveryTargetExact = true;
+                  }
+             else if (strcmp(tok1,"recovery_target_time") == 0) {
+                     if (recoveryTargetExact)
+                         continue;
+                     strcpy(recoveryTargetTimeStr, tok2);
+                     recoveryTarget = true;
+                     recoveryTargetExact = false;
+                     /* 
+                      * convert the time string given
+                      * by the user to the time_t format.
+                      * The input string format has to be like 
+                      * '2004-07-13-16:09:07'.
+                      */
+                     if (strptime(recoveryTargetTimeStr, 
+                             "%Y-%m-%d-%H:%M:%S", &tm) == NULL ) {
+                         ereport(FATAL,
+                             (errmsg("recovery_target_time has a format error \"%s\"",
+                                 recoveryTargetTimeStr),
+                             errhint("correct format is YYYY-MM-DD-hh:mm:ss")));
+                     }
+  	                recoveryTargetTime = mktime(&tm);
+                     if (recoveryTargetTime < 0) {
+                         ereport(FATAL,
+                             (errmsg("recovery_target_time is an invalid time \"%s\"",
+                                 recoveryTargetTimeStr),
+                             errhint("correct format is YYYY-MM-DD-hh:mm:ss")));
+                     }
+                    	ereport(LOG,(errmsg("recovery_target_time = %s", 
+                         recoveryTargetTimeStr)));
+                   }
+             else if (strcmp(tok1,"recovery_target_inclusive") == 0) {
+                     /* 
+                      * does nothing if a recovery_target is not also set
+                      */
+                     if (strcmp(tok2, "true") == 0)
+                         recoveryTargetInclusive = true;
+                     else
+                         recoveryTargetInclusive = false;
+                     if (recoveryTargetInclusive)
+                         strcpy(tok2, "true");
+                     else
+                         strcpy(tok2, "false");
+                 	ereport(LOG,(errmsg("recovery_target_inclusive = %s", tok2)));
+                  }
+             else if (strcmp(tok1,"recovery_debug_log") == 0) {
+                     if (strcmp(tok2, "true") == 0)
+                         recoveryDebugLog = true;
+                     else
+                         recoveryDebugLog = false;
+                     if (recoveryDebugLog)
+                         strcpy(tok2, "true");
+                     else
+                         strcpy(tok2, "false");
+                 	ereport(LOG,(errmsg("recovery_debug_log = %s", tok2)));
+                  }
+             else if (strcmp(tok1,"use_local_time") == 0) {
+                     if (strcmp(tok2, "true") == 0)
+                         recoveryUseLocalTime = true;
+                     else
+                         recoveryUseLocalTime = false;
+                     if (recoveryUseLocalTime)
+                         strcpy(tok2, "true");
+                     else
+                         strcpy(tok2, "false");
+                 	ereport(LOG,(errmsg("use_local_time = %s", tok2)));
+                  }
+             else {
+                	ereport(LOG,
+                     (errmsg("unrecognized parameter \"%s\"", tok1)));
+                 syntaxError = true; 
+             }
+         }
+         else {
+             ereport(LOG,
+                 (errmsg("incorrectly formatted parameter \"%s\"", cmdline),
+                  errhint("parameter = 'value'")));
+             syntaxError = true;        
+         }
+     }
+ 
+ 	FreeFile(fd);
+ 
+     if (syntaxError) {
+         ereport(FATAL,
+ 		  (errmsg("fatal error in \"%s\"", recoveryCommandFile)));
+     }
+     
+     return;
+ }
+ 
+ /*
   * This must be called ONCE during postmaster or standalone-backend startup
   */
  void
***************
*** 2787,2792 ****
--- 3355,3361 ----
  	XLogRecord *record;
  	char	   *buffer;
  	uint32		freespace;
+    	struct stat tmpStatBuf;
  
  	/* Use malloc() to ensure record buffer is MAXALIGNED */
  	buffer = (char *) malloc(_INTL_MAXLOGRECSZ);
***************
*** 2833,2838 ****
--- 3402,3437 ----
  		pg_usleep(60000000L);
  #endif
  
+     /*
+      * Check now for recovery.conf
+      *
+      * if this file exists, it demonstrates the intention of the administrator
+      * to recover this database using archived xlogs
+      *
+      * we do this now because the first xlog is about to be opened for the
+      * first time. We've read the checkpoint pointer from the control file
+      * and we are about to use that to open the xlog it points to, and
+      * will begin rollforward recovery from that point
+      */
+   	snprintf(recoveryCommandFile, MAXPGPATH, "%s/recovery.conf", DataDir);
+     if (stat(recoveryCommandFile, &tmpStatBuf) == 0) {
+ 
+      	readRecoveryCommandFile();
+         /*
+          * clearly indicate our state
+          */
+         InArchiveRecovery = true;
+         /*
+          * set initial state for checking transaction logs
+          * this may change if the archive runs dry while still InArchiveRecovery
+          */
+         restoreFromArchive = true;
+ 
+       	ereport(LOG,
+     		(errmsg("starting archive recovery")));
+ 
+     }
+ 
  	/*
  	 * Get the last valid checkpoint record.  If the latest one according
  	 * to pg_control is broken, try the next-to-last one.
***************
*** 2863,2874 ****
  	LastRec = RecPtr = checkPointLoc;
  	memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
  	wasShutdown = (record->xl_info == XLOG_CHECKPOINT_SHUTDOWN);
! 
  	ereport(LOG,
  			(errmsg("redo record is at %X/%X; undo record is at %X/%X; shutdown %s",
  					checkPoint.redo.xlogid, checkPoint.redo.xrecoff,
  					checkPoint.undo.xlogid, checkPoint.undo.xrecoff,
! 					wasShutdown ? "TRUE" : "FALSE")));
  	ereport(LOG,
  			(errmsg("next transaction ID: %u; next OID: %u",
  					checkPoint.nextXid, checkPoint.nextOid)));
--- 3462,3481 ----
  	LastRec = RecPtr = checkPointLoc;
  	memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
  	wasShutdown = (record->xl_info == XLOG_CHECKPOINT_SHUTDOWN);
!     /*
!      * we report the state of the control_file, not the checkpoint, why?
!      * wasShutdown refers to whether the last checkpoint was a 
!      * shutdown checkpoint, NOT whether the database was shutdown
!      * correctly according to control file. This distinction is only
!      * important InArchiveRecovery, since otherwise we could
!      * report that the database was shutdown, when the control file disagrees
!      */
  	ereport(LOG,
  			(errmsg("redo record is at %X/%X; undo record is at %X/%X; shutdown %s",
  					checkPoint.redo.xlogid, checkPoint.redo.xrecoff,
  					checkPoint.undo.xlogid, checkPoint.undo.xrecoff,
!                     (ControlFile->state == DB_SHUTDOWNED) ? "TRUE" : "FALSE")));
! 
  	ereport(LOG,
  			(errmsg("next transaction ID: %u; next OID: %u",
  					checkPoint.nextXid, checkPoint.nextOid)));
***************
*** 2915,2921 ****
  	/* REDO */
  	if (InRecovery)
  	{
! 		int			rmid;
  
  		ereport(LOG,
  				(errmsg("database system was not properly shut down; "
--- 3522,3537 ----
  	/* REDO */
  	if (InRecovery)
  	{
! 		int			      rmid;
!     	char		      recoveryLogPath[MAXPGPATH];
!         bool              recoveryContinue = true;
!         bool              recoveryApply = true;
!         int               recoveryLogFD = -1;
!         char              *recbuf = NULL;
!        	uint8	          record_info;
! 		xl_xact_commit    *recordXactCommitData;
! 		xl_xact_abort     *recordXactAbortData;
!         time_t            recordXtime;
  
  		ereport(LOG,
  				(errmsg("database system was not properly shut down; "
***************
*** 2948,2953 ****
--- 3564,3589 ----
  			ereport(LOG,
  					(errmsg("redo starts at %X/%X",
  							ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));
+ #ifdef WAL_DEBUG
+             if (XLOG_DEBUG)
+                recoveryDebugLog = true;
+ #endif
+             if (XLogArchiveDEBUG)            
+                recoveryDebugLog = true;
+ 
+             if (recoveryDebugLog) {
+            		recbuf = (char *) malloc(BLCKSZ);
+                 snprintf(recoveryLogPath, MAXPGPATH, "%s/recovery.log", DataDir);
+                 unlink(recoveryLogPath);
+                 recoveryLogFD = BasicOpenFile(recoveryLogPath, O_RDWR | O_CREAT | O_EXCL,
+ 					S_IRUSR | S_IWUSR);
+                 if (recoveryLogFD < 0)
+                     recoveryDebugLog = false;
+             }
+ 
+             /*
+              * main redo apply loop
+              */
  			do
  			{
  				/* nextXid must be beyond record's xid */
***************
*** 2958,2990 ****
  					TransactionIdAdvance(ShmemVariableCache->nextXid);
  				}
  
! #ifdef WAL_DEBUG
! 				if (XLOG_DEBUG)
  				{
! 					char		buf[8192];
! 
! 					sprintf(buf, "REDO @ %X/%X; LSN %X/%X: ",
  							ReadRecPtr.xlogid, ReadRecPtr.xrecoff,
  							EndRecPtr.xlogid, EndRecPtr.xrecoff);
! 					xlog_outrec(buf, record);
! 					strcat(buf, " - ");
! 					RmgrTable[record->xl_rmid].rm_desc(buf,
  								record->xl_info, XLogRecGetData(record));
! 					elog(LOG, "%s", buf);
  				}
- #endif
  
  				if (record->xl_info & XLR_BKP_BLOCK_MASK)
  					RestoreBkpBlocks(record, EndRecPtr);
  
- 				RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record);
- 				record = ReadRecord(NULL, LOG, buffer);
- 			} while (record != NULL);
  			ereport(LOG,
  					(errmsg("redo done at %X/%X",
  							ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));
  			LastRec = ReadRecPtr;
  			InRedo = false;
  		}
  		else
  			ereport(LOG,
--- 3594,3729 ----
  					TransactionIdAdvance(ShmemVariableCache->nextXid);
  				}
  
! 				if (recoveryDebugLog && record->xl_rmid == 1)
  				{
! 					sprintf(recbuf, "\nREDO %X/%X;LSN %X/%X: ",
  							ReadRecPtr.xlogid, ReadRecPtr.xrecoff,
  							EndRecPtr.xlogid, EndRecPtr.xrecoff);
! 					sprintf(recbuf + strlen(recbuf), "rm %u;info %u ",
! 							record->xl_rmid, record->xl_info);
! 					xlog_outrec(recbuf, record);
! 					strcat(recbuf, "--");
!                     RmgrTable[record->xl_rmid].rm_desc(recbuf,
  								record->xl_info, XLogRecGetData(record));
!                     write(recoveryLogFD, recbuf, strlen(recbuf));
  				}
  
  				if (record->xl_info & XLR_BKP_BLOCK_MASK)
  					RestoreBkpBlocks(record, EndRecPtr);
+                 /*
+                  * Have we reached our recovery target?
+                  * check whether this is the last record we should apply -
+                  * we can only stop at transaction end record types,
+                  * COMMIT or ABORT, which are only ever associated
+                  * with the XACT (transaction) resource manager (rmid 1)
+                  */
+                 if (recoveryTarget == true   &&
+                     record->xl_rmid == 1) {
+ 
+                    	record_info = record->xl_info & ~XLR_INFO_MASK;
+ 
+                 	if (record_info == XLOG_XACT_COMMIT ||
+                 	    record_info == XLOG_XACT_ABORT) {
+ 
+                         if (recoveryTargetExact) {
+                             /* 
+                              * there can be only one transaction end record
+                              * with this exact transactionid
+                              *
+                              * when testing for an xid, we MUST test for 
+                              * equality only, since transactions are numbered
+                              * in the order they start, not the order they
+                              * complete. A higher numbered xid will complete
+                              * before you about 50% of the time...
+                              */
+                             if (record->xl_xid == recoveryTargetXid) {
+                                 recoveryContinue = false;
+                                 recoveryApply = recoveryTargetInclusive;
+                             }
+                             if (!recoveryContinue) {
+                                 ereport(LOG,
+                                     (errmsg("rollforward stopping at commit/abort:"
+                                             " transactionId %u",
+                                         record->xl_xid)));
+                             }
+                         }
+                         else {
+                             /* 
+                              * there can be many transactions that
+                              * share the same commit time, so
+                              * we stop after the last one, if we are
+                              * inclusive, or stop at the first one
+                              * if we are exclusive
+                              *
+                              * timestamp is first part of record body
+                              */
+                            	if (record_info == XLOG_XACT_COMMIT) {
+                                 recordXactCommitData = 
+                                     (xl_xact_commit *) XLogRecGetData(record);
+                                 recordXtime = recordXactCommitData->xtime;
+                             }
+                             else {
+                                 recordXactAbortData = 
+                                     (xl_xact_abort *) XLogRecGetData(record);
+                                 recordXtime = recordXactAbortData->xtime;
+                             }
+  
+                             if (recoveryTargetInclusive) {
+                                 if ((unsigned int)(recordXtime) >
+                                     (unsigned int)(recoveryTargetTime))
+                                     recoveryContinue = false;
+                                     /*
+                                      * because we have already just included the
+                                      * records that match (==), and we now want
+                                      * to stop just beyond
+                                      */
+                                     recoveryApply = false;
+                              }
+                             else {
+                                if ((unsigned int)(recordXtime) >=
+                                     (unsigned int)(recoveryTargetTime))
+                                     recoveryContinue = false;
+                                     recoveryApply = false;
+                             }
+                             if (!recoveryContinue) {
+                                 struct tm  *tm = localtime(&recordXtime);
+                                 ereport(LOG,
+                                     (errmsg("rollforward stopping at commit/abort:"
+                                             " time %04u-%02u-%02u %02u:%02u:%02u",
+             				            tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday,
+ 			 	                        tm->tm_hour, tm->tm_min, tm->tm_sec)));
+                             }
+                         }
+                     }
+                 }
+ 
+                 if (recoveryContinue || recoveryApply) {
+     				RmgrTable[record->xl_rmid].rm_redo(EndRecPtr, record);
+        				record = ReadRecord(NULL, LOG, buffer);
+                 }
+ 
+ 			} while (record != NULL && recoveryContinue);
+             /*
+              * end of main redo apply
+              */
+ 
+             if (recoveryLogFD >= 0) {
+                 close(recoveryLogFD);
+                 if (recoveryDebugLog)
+                     free(recbuf);
+             }
  
  			ereport(LOG,
  					(errmsg("redo done at %X/%X",
  							ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));
  			LastRec = ReadRecPtr;
  			InRedo = false;
+             if (InArchiveRecovery) {
+                 InRecoveryCleanup = true;
+                 InArchiveRecovery = false;
+                 /* LastRec.xlogid = LastRec.xlogid + TIMELINE_INCREMENT */
+             }
+             Assert(restoreFromArchive == false);
  		}
  		else
  			ereport(LOG,
***************
*** 3106,3118 ****
  		 * Note that we write a shutdown checkpoint.  This is correct since
  		 * the records following it will use SUI one more than what is
  		 * shown in the checkpoint's ThisStartUpID.
! 		 *
  		 * In case we had to use the secondary checkpoint, make sure that it
  		 * will still be shown as the secondary checkpoint after this
  		 * CreateCheckPoint operation; we don't want the broken primary
! 		 * checkpoint to become prevCheckPoint...
  		 */
! 		ControlFile->checkPoint = checkPointLoc;
  		CreateCheckPoint(true, true);
  
  		/*
--- 3845,3864 ----
  		 * Note that we write a shutdown checkpoint.  This is correct since
  		 * the records following it will use SUI one more than what is
  		 * shown in the checkpoint's ThisStartUpID.
! 		 */
! 		ControlFile->checkPoint = checkPointLoc;
! 		/*
  		 * In case we had to use the secondary checkpoint, make sure that it
  		 * will still be shown as the secondary checkpoint after this
  		 * CreateCheckPoint operation; we don't want the broken primary
! 		 * checkpoint to become prevCheckPoint...unless we have just
!          * performed an archive recovery, in which case reset the
!          * secondary checkpoint as well...to avoid going back over
!          * old ground
  		 */
!         if (InRecoveryCleanup)
!     		ControlFile->prevCheckPoint = checkPointLoc;
! 
  		CreateCheckPoint(true, true);
  
  		/*
***************
*** 3149,3154 ****
--- 3895,3906 ----
  	 * Okay, we're officially UP.
  	 */
  	InRecovery = false;
+     if (InRecoveryCleanup) {
+         unlink(recoveryCommandFile);
+         InRecoveryCleanup = false;
+ 		ereport(LOG,
+ 			(errmsg("archive recovery complete")));
+     }
  
  	ControlFile->state = DB_IN_PRODUCTION;
  	ControlFile->time = time(NULL);
***************
*** 3706,3719 ****
  		strcat(buf, "UNKNOWN");
  }
  
- #ifdef WAL_DEBUG
  static void
  xlog_outrec(char *buf, XLogRecord *record)
  {
  	int			bkpb;
  	int			i;
  
! 	sprintf(buf + strlen(buf), "prev %X/%X; xprev %X/%X; xid %u",
  			record->xl_prev.xlogid, record->xl_prev.xrecoff,
  			record->xl_xact_prev.xlogid, record->xl_xact_prev.xrecoff,
  			record->xl_xid);
--- 4458,4470 ----
  		strcat(buf, "UNKNOWN");
  }
  
  static void
  xlog_outrec(char *buf, XLogRecord *record)
  {
  	int			bkpb;
  	int			i;
  
! 	sprintf(buf + strlen(buf), "prv %X/%X;xprv %X/%X;xid %u",
  			record->xl_prev.xlogid, record->xl_prev.xrecoff,
  			record->xl_xact_prev.xlogid, record->xl_xact_prev.xrecoff,
  			record->xl_xid);
***************
*** 3731,3738 ****
  	sprintf(buf + strlen(buf), ": %s",
  			RmgrTable[record->xl_rmid].rm_name);
  }
- #endif /* WAL_DEBUG */
- 
  
  /*
   * GUC support
--- 4482,4487 ----
Index: backend/commands/tablecmds.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/commands/tablecmds.c,v
retrieving revision 1.119
diff -c -r1.119 tablecmds.c
*** backend/commands/tablecmds.c	11 Jul 2004 23:13:53 -0000	1.119
--- backend/commands/tablecmds.c	14 Jul 2004 22:41:46 -0000
***************
*** 5398,5413 ****
  	FlushRelationBuffers(rel, 0);
  
  	/*
! 	 * We need to log the copied data in WAL iff WAL archiving is enabled
  	 * AND it's not a temp rel.
- 	 *
- 	 * XXX when WAL archiving is actually supported, this test will likely
- 	 * need to change; and the hardwired extern is cruddy anyway ...
  	 */
  	{
! 		extern char XLOG_archive_dir[];
! 
! 		use_wal = XLOG_archive_dir[0] && !rel->rd_istemp;
  	}
  
  	nblocks = RelationGetNumberOfBlocks(rel);
--- 5398,5408 ----
  	FlushRelationBuffers(rel, 0);
  
  	/*
! 	 * We need to log the copied data in WAL iff WAL archiving is enabled
  	 * AND it's not a temp rel.
  	 */
  	{
! 		use_wal = XLogArchiveMode && !rel->rd_istemp;
  	}
  
  	nblocks = RelationGetNumberOfBlocks(rel);
Index: backend/postmaster/Makefile
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/postmaster/Makefile,v
retrieving revision 1.15
diff -c -r1.15 Makefile
*** backend/postmaster/Makefile	29 May 2004 22:48:19 -0000	1.15
--- backend/postmaster/Makefile	14 Jul 2004 22:41:46 -0000
***************
*** 12,18 ****
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = postmaster.o bgwriter.o pgstat.o
  
  all: SUBSYS.o
  
--- 12,18 ----
  top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global
  
! OBJS = postmaster.o bgwriter.o pgstat.o pgarch.o
  
  all: SUBSYS.o
  
Index: backend/postmaster/postmaster.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/postmaster/postmaster.c,v
retrieving revision 1.411
diff -c -r1.411 postmaster.c
*** backend/postmaster/postmaster.c	12 Jul 2004 19:14:56 -0000	1.411
--- backend/postmaster/postmaster.c	14 Jul 2004 22:41:48 -0000
***************
*** 117,123 ****
  #include "utils/ps_status.h"
  #include "bootstrap/bootstrap.h"
  #include "pgstat.h"
! 
  
  /*
   * List of active backends (or child processes anyway; we don't actually
--- 117,123 ----
  #include "utils/ps_status.h"
  #include "bootstrap/bootstrap.h"
  #include "pgstat.h"
! #include "pgarch.h"
  
  /*
   * List of active backends (or child processes anyway; we don't actually
***************
*** 198,203 ****
--- 198,204 ----
  /* PIDs of special child processes; 0 when not running */
  static pid_t StartupPID = 0,
  			BgWriterPID = 0,
+             PgArchPID = 0,
  			PgStatPID = 0;
  
  /* Startup/shutdown state */
***************
*** 1217,1222 ****
--- 1218,1228 ----
  				kill(BgWriterPID, SIGUSR2);
  		}
  
+ 		/* If we have lost the archiver, try to start a new one */
+ 		if (XLogArchiveMode && PgArchPID == 0 && 
+             StartupPID == 0 && !FatalError && Shutdown == NoShutdown)
+ 			PgArchPID = pgarch_start();
+  
  		/* If we have lost the stats collector, try to start a new one */
  		if (PgStatPID == 0 &&
  			StartupPID == 0 && !FatalError && Shutdown == NoShutdown)
***************
*** 1821,1826 ****
--- 1827,1835 ----
  			/* Tell pgstat to shut down too; nothing left for it to do */
  			if (PgStatPID != 0)
  				kill(PgStatPID, SIGQUIT);
+ 			/* Tell pgarch to shut down too; nothing left for it to do */
+ 			if (PgArchPID != 0)
+ 				kill(PgArchPID, SIGQUIT);
  			break;
  
  		case SIGINT:
***************
*** 1865,1870 ****
--- 1874,1882 ----
  			/* Tell pgstat to shut down too; nothing left for it to do */
  			if (PgStatPID != 0)
  				kill(PgStatPID, SIGQUIT);
+ 			/* Tell pgarch to shut down too; nothing left for it to do */
+ 			if (PgArchPID != 0)
+ 				kill(PgArchPID, SIGQUIT);
  			break;
  
  		case SIGQUIT:
***************
*** 1882,1887 ****
--- 1894,1901 ----
  				kill(BgWriterPID, SIGQUIT);
  			if (PgStatPID != 0)
  				kill(PgStatPID, SIGQUIT);
+ 			if (PgArchPID != 0)
+ 				kill(PgArchPID, SIGQUIT);
  			if (DLGetHead(BackendList))
  				SignalChildren(SIGQUIT);
  			ExitPostmaster(0);
***************
*** 1971,1978 ****
  			 */
  			if (Shutdown > NoShutdown && BgWriterPID != 0)
  				kill(BgWriterPID, SIGUSR2);
! 			else if (PgStatPID == 0 && Shutdown == NoShutdown)
! 				PgStatPID = pgstat_start();
  
  			continue;
  		}
--- 1985,1996 ----
  			 */
  			if (Shutdown > NoShutdown && BgWriterPID != 0)
  				kill(BgWriterPID, SIGUSR2);
! 			else if (Shutdown == NoShutdown) {
!                     if (PgStatPID == 0)
!         				PgStatPID = pgstat_start();
!                     if (PgArchPID == 0)
!         				PgArchPID = pgarch_start();
!             }
  
  			continue;
  		}
***************
*** 2021,2026 ****
--- 2039,2060 ----
  		}
  
  		/*
+ 		 * Was it the archiver?  If so, just try to start a new
+ 		 * one; no need to force reset of the rest of the system.  (If fail,
+ 		 * we'll try again in future cycles of the main loop.)
+ 		 */
+ 		if (PgArchPID != 0 && pid == PgArchPID)
+ 		{
+ 			PgArchPID = 0;
+ 			if (exitstatus != 0)
+ 				LogChildExit(LOG, gettext("archiver process"),
+ 							 pid, exitstatus);
+ 			if (StartupPID == 0 && !FatalError && Shutdown == NoShutdown)
+ 				PgArchPID = pgarch_start();
+ 			continue;
+ 		}
+ 
+ 		/*
  		 * Else do standard backend child cleanup.
  		 */
  		CleanupProc(pid, exitstatus);
***************
*** 2202,2207 ****
--- 2236,2252 ----
  		kill(PgStatPID, SIGQUIT);
  	}
  
+ 	/* Force a power-cycle of the pgarch processes too */
+ 	/* (Shouldn't be necessary, but just for luck) */
+ 	if (PgArchPID != 0 && !FatalError)
+ 	{
+ 		ereport(DEBUG2,
+ 				(errmsg_internal("sending %s to process %d",
+ 								 "SIGQUIT",
+ 								 (int) PgArchPID)));
+ 		kill(PgArchPID, SIGQUIT);
+ 	}
+ 
  	FatalError = true;
  }
  
***************
*** 2951,2956 ****
--- 2996,3013 ----
  		if (Shutdown <= SmartShutdown)
  			SignalChildren(SIGUSR1);
  	}
+  
+  	if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER))
+  	{
+  		/*
+  		 * Send SIGUSR1 to ARCHIVER process, to wake it up and begin
+  		 * archiving next transaction log file. Backend should only
+          * send if in XLogArchiveMode...
+  		 */
+  		if (XLogArchiveMode && Shutdown == NoShutdown && PgArchPID != 0) {
+             kill(PgArchPID,SIGUSR1);
+ 		}
+     }
  
  	PG_SETMASK(&UnBlockSig);
  
Index: backend/utils/misc/guc.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/utils/misc/guc.c,v
retrieving revision 1.219
diff -c -r1.219 guc.c
*** backend/utils/misc/guc.c	12 Jul 2004 02:22:51 -0000	1.219
--- backend/utils/misc/guc.c	14 Jul 2004 22:41:52 -0000
***************
*** 342,347 ****
--- 342,363 ----
  
  static struct config_bool ConfigureNamesBool[] =
  {
+  	{
+  		{"archive_mode", PGC_POSTMASTER, WAL_SETTINGS,
+  			gettext_noop("Enable archiving of full transaction log files to a specified archival destination."),
+  			NULL
+  		},
+  		&XLogArchiveMode,
+  		false, NULL, NULL
+  	},
+  	{
+  		{"archive_debug", PGC_SIGHUP, WAL_SETTINGS,
+  			gettext_noop("Provide debug output for archive activities."),
+  			NULL
+  		},
+  		&XLogArchiveDEBUG,
+  		false, NULL, NULL
+  	},
  	{
  		{"enable_seqscan", PGC_USERSET, QUERY_TUNING_METHOD,
  			gettext_noop("Enables the planner's use of sequential-scan plans."),
***************
*** 1371,1376 ****
--- 1387,1410 ----
  
  static struct config_string ConfigureNamesString[] =
  {
+  	{
+  		{"archive_dest", PGC_POSTMASTER, WAL_SETTINGS,
+  			gettext_noop("Specifies where to archive WAL logs."),
+  			gettext_noop("A directory or specific location for archiving transaction log files from PostgreSQL")
+  		},
+  		&XLogArchiveDest,
+  		"", NULL, NULL
+  	},
+  
+  	{
+  		{"archive_program", PGC_POSTMASTER, WAL_SETTINGS,
+  			gettext_noop("Archive program"),
+  			gettext_noop("The external program that will be called to execute the archival process")
+  		},
+  		&XLogArchiveProgram,
+  		"", NULL, NULL
+  	},
+ 
  	{
  		{"client_encoding", PGC_USERSET, CLIENT_CONN_LOCALE,
  			gettext_noop("Sets the client's character set encoding."),
Index: backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.115
diff -c -r1.115 postgresql.conf.sample
*** backend/utils/misc/postgresql.conf.sample	11 Jul 2004 21:48:25 -0000	1.115
--- backend/utils/misc/postgresql.conf.sample	14 Jul 2004 22:41:52 -0000
***************
*** 113,118 ****
--- 113,129 ----
  
  
  #---------------------------------------------------------------------------
+ # ARCHIVING
+ #---------------------------------------------------------------------------
+ 
+ # - Settings -
+ 
+ #archive_mode = true		# enables archiving of full txn log files
+ #archive_dest = '/tmp'        # specifies destination of archive files
+ #archive_program = 'cp %s %s'   # external archiving program command line
+ 
+ 
+ #---------------------------------------------------------------------------
  # QUERY TUNING
  #---------------------------------------------------------------------------
  
Index: bin/initdb/initdb.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/bin/initdb/initdb.c,v
retrieving revision 1.43
diff -c -r1.43 initdb.c
*** bin/initdb/initdb.c	14 Jul 2004 17:55:10 -0000	1.43
--- bin/initdb/initdb.c	14 Jul 2004 22:41:55 -0000
***************
*** 2023,2029 ****
  	char	   *pgdenv;			/* PGDATA value got from sent to
  								 * environment */
  	char	   *subdirs[] =
! 	{"global", "pg_xlog", "pg_clog", "pg_subtrans", "base", "base/1", "pg_tblspc"};
  
  	progname = get_progname(argv[0]);
  	set_pglocale_pgservice(argv[0], "initdb");
--- 2023,2030 ----
  	char	   *pgdenv;			/* PGDATA value got from sent to
  								 * environment */
  	char	   *subdirs[] =
! 
! 	{"global", "pg_xlog", "pg_xlog/archive_status", "pg_clog", "pg_subtrans", "base", "base/1", "pg_tblspc"};
  
  	progname = get_progname(argv[0]);
  	set_pglocale_pgservice(argv[0], "initdb");
Index: include/access/xlog.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/include/access/xlog.h,v
retrieving revision 1.52
diff -c -r1.52 xlog.h
*** include/access/xlog.h	1 Jul 2004 00:51:38 -0000	1.52
--- include/access/xlog.h	14 Jul 2004 22:41:56 -0000
***************
*** 210,215 ****
--- 210,219 ----
  extern int	XLOGbuffers;
  extern char *XLOG_sync_method;
  extern const char XLOG_sync_method_default[];
+ extern bool 			XLogArchiveMode;
+ extern bool 			XLogArchiveDEBUG;
+ extern char 			*XLogArchiveDest;
+ extern char 			*XLogArchiveProgram;
  
  #ifdef WAL_DEBUG
  extern bool	XLOG_DEBUG;
Index: include/storage/pmsignal.h
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/include/storage/pmsignal.h,v
retrieving revision 1.8
diff -c -r1.8 pmsignal.h
*** include/storage/pmsignal.h	29 May 2004 22:48:23 -0000	1.8
--- include/storage/pmsignal.h	14 Jul 2004 22:41:56 -0000
***************
*** 24,29 ****
--- 24,30 ----
  {
  	PMSIGNAL_PASSWORD_CHANGE,	/* pg_pwd file has changed */
  	PMSIGNAL_WAKEN_CHILDREN,	/* send a SIGUSR1 signal to all backends */
+   	PMSIGNAL_WAKEN_ARCHIVER,	/* send a SIGUSR1 signal to the archiver */
  
  	NUM_PMSIGNALS				/* Must be last value of enum! */
  } PMSignalReason;
Attachments:
- recovery.conf.sample (text/plain)
- pgarch.c (text/x-c)
- pgarch.h (text/x-c-header)
- README (text/html)
#39Mark Kirkwood
markir@coretech.co.nz
In reply to: Simon Riggs (#36)
Re: [HACKERS] Point in Time Recovery

I noticed that compiling with the 5_1 patch applied fails because
XLOG_archive_dir has been removed from xlog.c, but
src/backend/commands/tablecmds.c still uses it.

I did the following to tablecmds.c:

5408c5408
< extern char XLOG_archive_dir[];
---
> extern char *XLogArchiveDest;
5410c5410
< use_wal = XLOG_archive_dir[0] && !rel->rd_istemp;
---
> use_wal = XLogArchiveDest[0] && !rel->rd_istemp;

Now I have to see if I have broken it with this change :-)

regards

Mark

Simon Riggs wrote:

> On Wed, 2004-07-14 at 16:55, markw@osdl.org wrote:
>> On 14 Jul, Simon Riggs wrote:
>>> PITR Patch v5_1 just posted has Point in Time Recovery working....
>>>
>>> Still some rough edges....but we really need some testers now to give
>>> this a try and let me know what you think.
>>>
>>> Klaus Naumann and Mark Wong are the only [non-committers] to have tried
>>> to run the code (and let me know about it), so please have a look at
>>> [PATCHES] and try it out.
>>
>> I just tried applying the v5_1 patch against the cvs tip today and got a
>> couple of rejections. I'll copy the patch output here. Let me know if
>> you want to see the reject files or anything else:
>
> I'm on it. Sorry 'bout that all - midnight fingers.

#40SAKATA Tetsuo
sakata.tetsuo@lab.ntt.co.jp
In reply to: Simon Riggs (#32)
Re: Point in Time Recovery

Hi, folks.

My colleagues and I are planning to test PITR after the 7.5 beta release.
We are now designing the test items, but some parts of the specification
are not clear enough to us.

For example, it is not clear to us which resource managers are required
to store log records:

- whether some access methods (such as B-tree) are required to log their
  data.
- whether create/drop actions on tablespaces are stored in the log.

We would be grateful if someone could clarify these points.

The test set we plan to run has the following items:

- PITR can recover the data of ordinary committed transactions.
  - the tuple data themselves
  - the index data associated with them
- PITR can recover the data of some special committed transactions.
  - DDL: create database, table, index and so on
  - maintenance commands (handling large amounts of data):
    truncate, vacuum, reindex and so on.

The items above are the 'data aspects' of the test. Other aspects are as follows:

- Location of the archived log's drive;
  PITR can recover a database from archived log data
  - stored on the same drive as xlog.
  - stored on a different drive on the same machine
    on which PostgreSQL runs.
  - stored on a different drive on a different machine.

- Duration between a checkpoint and recovery;
  PITR can recover a database long enough after a checkpoint.

- Time to recover to;
  - to the end of the log.
  - to some specified time.

- Type of failure;
  - system down --- kill the PostgreSQL process (as a simulation).
  - media lost --- delete database files (as a simulation).
  - These two cases will be tested in a simulated situation first,
    and we will try some 'real' failures afterwards
    (a real power-down of the test machine for the first case,
    and unplugging the disk drive for the second one).
    These actions could damage the test machine, which is why
    we plan them after the 'ordinary' test items.
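
As a rough illustration only, the archive location, recovery target and failure type listed above can be crossed into a case list. This is a minimal sketch: the labels are shorthand for the list items, and `build_test_matrix` is a hypothetical helper, not part of any existing test suite.

```python
from itertools import product

# Shorthand labels for the aspects listed above (hypothetical names).
ARCHIVE_LOCATIONS = [
    "same drive as xlog",
    "different drive, same machine",
    "different drive, different machine",
]
RECOVERY_TARGETS = ["end of log", "specified time"]
FAILURE_TYPES = ["system down (kill process)", "media lost (delete files)"]


def build_test_matrix():
    """Return every combination of archive location, recovery target and failure type."""
    return [
        {"archive": archive, "target": target, "failure": failure}
        for archive, target, failure in product(
            ARCHIVE_LOCATIONS, RECOVERY_TARGETS, FAILURE_TYPES
        )
    ]


if __name__ == "__main__":
    for number, case in enumerate(build_test_matrix(), start=1):
        print(number, case["archive"], "|", case["target"], "|", case["failure"])
```

Three locations, two targets and two failure types give twelve cases; the 'data aspects' would then be checked within each case.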

The test set is under construction and we'll test the 7.5 beta
for some weeks, and report the result of the test here.

Sincerely yours.
Tetsuo SAKATA.

--
sakata.tetsuo _at_ lab.ntt.co.jp
SAKATA, Tetsuo. Yokosuka JAPAN.

#41Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#37)
Re: Point in Time Recovery

I talked to Tom on the phone today and I think we have a procedure
for doing backup/restore in a fairly foolproof way.

As outlined below, we need to record the start/stop and checkpoint WAL
file names and offsets, and somehow pass those on to restore. I think
any system that requires users to link those values together is going
to cause confusion and be error-prone.

My idea is to do much of this automatically. First, create a
server-side function called pitr_backup_start() which creates a file in
the /data directory which contains the WAL filename/offsets for
last checkpoint and start. Then do the backup of the data directory.
Then call pitr_backup_stop() which adds the stop filename/offsets to the
file, and archive that file in the same place as the WAL files.

To restore, you untar the backup of /data. Then the recover backend
reads the file created by pitr_backup_start() to find the name of the
backup parameter file. It then recovers that file from the archive
location and uses the start/stop/checkpoint filename/offset information
to do the restore.

The advantage of this is that the tar backup contains everything needed
to find the proper parameter file for restore. Ideally we could get all
the parameters into the tar backup, but that isn't possible because we
can't push the stop counters into the backup after the backup has
completed.

I recommend the pitr_backup_start() file be named for the current WAL
filename/offset, perhaps 000000000000032c.3da390.backup or something
like that. The file would be a simple text file in
pg_xlog/archive_status:

# start 2004-07-14 21:35:22.324579-04
wal_checkpoint = 0000000000000319.021233
wal_start = 000000000000032c.92a9cb
...added after backup completes...
wal_stop = 000000000000034a.3db030
# stop 2004-07-14 21:32:22.0923213-04

The timestamps are for documentation only. These files give admins
looking in the archive directory information on backup times.
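
A minimal sketch of how a restore tool might read such a file, assuming the `key = value` layout shown above and a hexadecimal offset after the dot. The names `parse_backup_file` and `backup_is_complete` are hypothetical, and skipping any line without an `=` is an assumption made for illustration; nothing like this exists in the patch.

```python
SAMPLE = """\
# start 2004-07-14 21:35:22.324579-04
wal_checkpoint = 0000000000000319.021233
wal_start = 000000000000032c.92a9cb
wal_stop = 000000000000034a.3db030
# stop 2004-07-14 21:32:22.0923213-04
"""


def parse_backup_file(text):
    """Map each key to (WAL file name, offset); comments and other lines are skipped."""
    params = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Assumption: the value is "<wal file name>.<hex offset>".
        wal_file, offset = value.strip().split(".", 1)
        params[key.strip()] = (wal_file, int(offset, 16))
    return params


def backup_is_complete(params):
    """A backup is usable only once the stop location has been recorded."""
    return "wal_stop" in params


if __name__ == "__main__":
    parsed = parse_backup_file(SAMPLE)
    print(parsed["wal_start"], backup_is_complete(parsed))
```

A file that still lacks `wal_stop` would indicate a backup that never finished, which is one reason to append the stop values rather than write them up front.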

(As an idea, there is no need for the user to specify a recovery mode.
If the postmaster starts and sees the pitr_backup_start() file in /data,
it can go into recovery mode automatically. If the archiver can't find
the file in the archive location, it can assume that it is just being
started from power failure mode. However if it finds the file in the
archive location, it can assume it is to enter recovery mode. There is
a race condition that a crash during copy of the file to the archive
location would be a problem. The solution would be to create a special
flag file before copying the file to archive, and then archive it and
remove the flag file. If the postmaster starts up and sees the
pitr_backup_start() file in /data and in the archive location, and it
doesn't see the flag file, it then knows it is doing a restore because
the flag file would never appear in a backup. Anyway, this is just an
idea.)
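
The decision described in the parenthetical can be sketched as a small function. This is only a reading of the idea above, not code from the patch: the function name and the returned labels are hypothetical, and the "ordinary startup" branch (no backup file in /data at all) is an assumption, since that case is not spelled out.

```python
def startup_mode(backup_file_in_data, backup_file_in_archive, flag_file_present):
    """Decide how the postmaster should start, following the flag-file idea."""
    if not backup_file_in_data:
        # Assumption: without the backup file there is nothing special to do.
        return "ordinary startup"
    if backup_file_in_archive and not flag_file_present:
        # The flag file never appears in a backup, so this must be a restore.
        return "archive recovery"
    # Either the file was never archived, or the flag file shows that a
    # crash interrupted the copy to the archive: treat it as a crash restart.
    return "crash recovery"


if __name__ == "__main__":
    print(startup_mode(True, True, False))
    print(startup_mode(True, True, True))
```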

---------------------------------------------------------------------------

Simon Riggs wrote:

On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote:

The recovery mechanism doesn't rely upon you knowing 1 or 3. The
recovery reads pg_control (from the backup) and then attempts to
de-archive the appropriate xlog segment file and then starts
rollforward

Unfortunately this only works if pg_control was the first file to be
backed up (or by chance no checkpoint happened after backup start and
pg_control backup)

Other db's have commands for:
start/end external backup

OK...this idea has come up a few times. Here's my take:

- OS and hardware facilities exist now to make instant copies of sets of
files. Some of these are open source, others not. If you use these, you
have no requirement for this functionality....but these alone are no
replacement for archive recovery.... I accept that some people may not
wish to go to the expense or effort to use those options, but in my mind
these are the people that will not be using archive_mode anyway.

- all we would really need to do is to stop the bgwriter from doing
anything during backup. pg_control is only updated at checkpoint. The
current xlog is updated constantly, but this need not be copied because
we are already archiving it as soon as it's full. That leaves the
bgwriter, which is now responsible for both lazy writing and
checkpoints.
So, put a switch into bgwriter to halt for a period, then turn it back
on at the end. Could be a SIGHUP GUC...or...

...and with my greatest respects....

- please could somebody else code that?... my time is limited

Best regards, Simon Riggs

#42Simon Riggs
simon@2ndquadrant.com
In reply to: Mark Kirkwood (#39)
Re: [HACKERS] Point in Time Recovery

On Thu, 2004-07-15 at 02:43, Mark Kirkwood wrote:

I noticed that compiling with 5_1 patch applied fails due to
XLOG_archive_dir being removed from xlog.c , but
src/backend/commands/tablecmds.c still uses it.

I did the following to tablecmds.c :

5408c5408
< extern char XLOG_archive_dir[];
---
> extern char *XLogArchiveDest;
5410c5410
< use_wal = XLOG_archive_dir[0] && !rel->rd_istemp;
---
> use_wal = XLogArchiveDest[0] && !rel->rd_istemp;

Yes, I discovered that myself.

The fix is included in pitr_v5_2.patch...

Your patch follows the right thinking and looks like it would have
worked...
- XLogArchiveMode carries the main bool value for mode on/off
- XLogArchiveDest might also be used, though best to use the mode

Thanks for looking through the code...

Best Regards, Simon Riggs

#43Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#41)
Re: Point in Time Recovery

On Thu, 2004-07-15 at 03:02, Bruce Momjian wrote:

I talked to Tom on the phone today and I think we have a procedure
for doing backup/restore in a fairly foolproof way.

As outlined below, we need to record the start/stop and checkpoint WAL
file names and offsets, and somehow pass those on to restore. I think
any system that requires users to link those values together is going
to cause confusion and be error-prone.

Unfortunately, it seems clear that many of my posts have not been read,
nor has anyone here actually tried to use the patch. Everybody's views
on what constitutes error-prone might well differ then.

Speculation about additional requirements is just great, but please
don't assume that I have infinite resources to apply to these problems.
Documentation has still to be written.

For a long time now, I've been adding "one last feature" to what is
there, but we're still no nearer to anybody inspecting the patch or
committing it.

There is building consensus on other threads that PITR should not even
be included in the release (3 tentative votes). This latest request
feels more like the necessary excuse to take the decision to pull PITR.
I would much rather that we took the brave decision and pull it NOW,
rather than have me work like crazy to chase this release.

:(

Best Regards, Simon Riggs

#44Mark Kirkwood
markir@coretech.co.nz
In reply to: Simon Riggs (#42)
Re: Point in Time Recovery

I tried what I thought was a straightforward scenario, and seem to have
broken it :-(

Here is the little tale

1) initdb
2) set archive_mode and archive_dest in postgresql.conf
3) startup
4) create database called 'test'
5) connect to 'test' and type 'checkpoint'
6) backup PGDATA using 'tar -zcvf'
7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
8) shutdown and remove PGDATA
9) recover using 'tar -zxvf'
10) copy recovery.conf into PGDATA
11) startup

This is what I get :

LOG: database system was interrupted at 2004-07-15 21:24:04 NZST
LOG: recovery command file found...
LOG: restore_program = cp %s/%s %s
LOG: recovery_target_inclusive = true
LOG: recovery_debug_log = true
LOG: starting archive recovery
LOG: restored log file "0000000000000000" from archive
LOG: checkpoint record is at 0/A48054
LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 496; next OID: 25419
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/A48094
LOG: restored log file "0000000000000001" from archive
LOG: record with zero length at 0/1FFFFE0
LOG: redo done at 0/1FFFF30
LOG: restored log file "0000000000000001" from archive
LOG: restored log file "0000000000000001" from archive
PANIC: concurrent transaction log activity while database system is
shutting down
LOG: startup process (PID 13492) was terminated by signal 6
LOG: aborting startup due to startup process failure

The concurrent access is a bit of a puzzle, as this is my home machine
(i.e. I am *sure* no one else is connected!)

Mark

P.s : CVS HEAD from about 1 hour ago, PITR 5.2, FreeBSD 4.10 on x86

#45Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Mark Kirkwood (#44)
Re: Point in Time Recovery

Other db's have commands for:
start/end external backup

I see that the analogy to external backup was not good, since you are correct
that DBAs would expect that to stop all writes, so they can safely split
their mirror or some such. Usually the time from start until end of an
external backup is expected to be only seconds. I actually think we
do not need this functionality, since in pg you can safely split the mirror any
time you like.

My comment was meant to give DBAs a familiar tool. The effect of it
would only have been to create a separate backup of pg_control.
Might as well tell people to always back up pg_control first.
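
A minimal sketch of the "pg_control first" ordering, assuming the control file sits at `global/pg_control` relative to the data directory. `order_for_backup` is a hypothetical helper for a backup script; it only illustrates the ordering and does not by itself make a backup consistent.

```python
# Assumption: path of the control file relative to the data directory.
CONTROL_FILE = "global/pg_control"


def order_for_backup(relative_paths):
    """Return the paths sorted, with the control file moved to the front."""
    paths = sorted(relative_paths)
    first = [path for path in paths if path == CONTROL_FILE]
    rest = [path for path in paths if path != CONTROL_FILE]
    return first + rest


if __name__ == "__main__":
    example = ["base/1/1259", "PG_VERSION", "global/pg_control", "global/1262"]
    for path in order_for_backup(example):
        print(path)
```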

I originally thought you would require restore to specify an xlog id
from which recovery will start. You would search this log for the first
checkpoint record, create an appropriate pg_control, and start rollforward.

I still think this would be a nice feature, since then all that would be required
for restore is a system backup (that includes pg data files) and the xlogs.

Andreas

#46HISADAMasaki
hisada.masaki@lab.ntt.co.jp
In reply to: Simon Riggs (#27)
Re: Point in Time Recovery

Dear Simon,

I've just tested pitr_v5_2.patch and got an error message
during archiving process as follows.

-- begin
LOG: archive command="cp /usr/local/pgsql/data/pg_xlog/0000000000000000 /tmp",return code=-1
-- end

The command called in system(3) works, but system(3) returns -1.
system(3) cannot get the right exit code from its child process
when SIGCHLD is set to SIG_IGN.

So I made the following change to pgarch_Main() in pgarch.c

-- line 236 ---
- pgsignal(SIGCHLD, SIG_IGN);

-- line 236 ---
+ pgsignal(SIGCHLD, SIG_DFL);

After that,
the error message doesn't come out and it seems to be working properly.
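
The same effect can be sketched outside the server. This is a Unix-only illustration using Python's `os.system()`, which calls the C library's system(3); `run_archive_command` is a hypothetical wrapper, not the archiver's code. It restores the default SIGCHLD disposition around the call so that the child's exit status can be collected.

```python
import os
import signal


def run_archive_command(command):
    """Run command via system(3); return its exit code, or None if unavailable."""
    # With SIGCHLD ignored the child may be reaped automatically, leaving
    # system(3) nothing to wait for, so use the default disposition here.
    previous = signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    try:
        status = os.system(command)
    finally:
        if previous is not None:
            signal.signal(signal.SIGCHLD, previous)
    if status == -1 or not os.WIFEXITED(status):
        return None
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(run_archive_command("exit 0"))
    print(run_archive_command("exit 3"))
```

Whether an ignored SIGCHLD actually makes system(3) return -1 depends on the platform, so the wrapper reports a missing status as `None` rather than guessing.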

Regards,
Hisada, Masaki

On Wed, 14 Jul 2004 00:13:37 +0100
Simon Riggs <simon@2ndquadrant.com> wrote:

PITR Patch v5_1 just posted has Point in Time Recovery working....

Still some rough edges....but we really need some testers now to give
this a try and let me know what you think.

Klaus Naumann and Mark Wong are the only [non-committers] to have tried
to run the code (and let me know about it), so please have a look at
[PATCHES] and try it out.

Many thanks,

Simon Riggs

--
HISADA, Masaki <hisada.masaki@lab.ntt.co.jp>

#47Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: HISADAMasaki (#46)
Re: Point in Time Recovery

Sorry for the stupid question, but how do I get this patch if I do not
receive the patches mails ?

The web interface html'ifies it, thus making it unusable.

Thanks
Andreas

#48Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#43)
Re: Point in Time Recovery

Simon Riggs wrote:

On Thu, 2004-07-15 at 03:02, Bruce Momjian wrote:

I talked to Tom on the phone today and I think we have a procedure
for doing backup/restore in a fairly foolproof way.

As outlined below, we need to record the start/stop and checkpoint WAL
file names and offsets, and somehow pass those on to restore. I think
any system that requires users to link those values together is going
to cause confusion and be error-prone.

Unfortunately, it seems clear that many of my posts have not been read,
nor has anyone here actually tried to use the patch. Everybody's views
on what constitutes error-prone might well differ then.

Speculation about additional requirements is just great, but please
don't assume that I have infinite resources to apply to these problems.
Documentation has still to be written.

For a long time now, I've been adding "one last feature" to what is
there, but we're still no nearer to anybody inspecting the patch or
committing it.

I totally understand your feeling this, and I would be feeling the exact
same way (but would probably have complained much earlier :-) ).
Anyway, the problem is that Tom and I are serializing application of the
major features in the pipeline. We decided to focus on nested
transactions (NT) first (it is a larger patch), and that is why PITR has
gotten so little attention from us. However, there is no sense that
you had anything to do with it being placed behind NT in the queue, and
therefore there is no feeling on our part that PITR is less important or
deserves less time than NT. Certainly any system that made your patch less
likely to be applied would be unfair and something we will not do.

My explanation about the file format was an attempt to address the
method of passing the wal filename/offset to the recover process. If
that isn't needed, I am sorry.

There is building consensus on other threads that PITR should not even
be included in the release (3 tentative votes). This latest request
feels more like the necessary excuse to take the decision to pull PITR.
I would much rather that we took the brave decision and pull it NOW,
rather than have me work like crazy to chase this release.

Those three individuals are not representative of the group. Sorry it
might seem that there is a lack of enthusiasm for PITR, but it isn't true
from our end. You might have noticed that the patch queue has shrunk
dramatically, and now we are focused on NT and PITR almost exclusively.

We will get there --- it just seems dark at this time.

#49Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#37)
Re: Point in Time Recovery

Simon Riggs wrote:

On Wed, 2004-07-14 at 10:57, Zeugswetter Andreas SB SD wrote:

The recovery mechanism doesn't rely upon you knowing 1 or 3. The
recovery reads pg_control (from the backup) and then attempts to
de-archive the appropriate xlog segment file and then starts
rollforward

Unfortunately this only works if pg_control was the first file to be
backed up (or by chance no checkpoint happened after backup start and
pg_control backup)

Other db's have commands for:
start/end external backup

OK...this idea has come up a few times. Here's my take:

- OS and hardware facilities exist now to make instant copies of sets of
files. Some of these are open source, others not. If you use these, you
have no requirement for this functionality....but these alone are no
replacement for archive recovery.... I accept that some people may not
wish to go to the expense or effort to use those options, but in my mind
these are the people that will not be using archive_mode anyway.

- all we would really need to do is to stop the bgwriter from doing
anything during backup. pg_control is only updated at checkpoint. The
current xlog is updated constantly, but this need not be copied because
we are already archiving it as soon as it's full. That leaves the
bgwriter, which is now responsible for both lazy writing and
checkpoints.
So, put a switch into bgwriter to halt for a period, then turn it back
on at the end. Could be a SIGHUP GUC...or...

I don't think we can turn off all file system writes during a backup.
Imagine writing to a tape. Preventing file system writes would make the
system useless.

- please could somebody else code that?... my time is limited

Yes, I think someone else could code this.

#50Simon Riggs
simon@2ndquadrant.com
In reply to: Mark Kirkwood (#44)
Re: Point in Time Recovery

On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote:

I tried what I thought was a straightforward scenario, and seem to have
broken it :-(

Here is the little tale

1) initdb
2) set archive_mode and archive_dest in postgresql.conf
3) startup
4) create database called 'test'
5) connect to 'test' and type 'checkpoint'
6) backup PGDATA using 'tar -zcvf'
7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
8) shutdown and remove PGDATA
9) recover using 'tar -zxvf'
10) copy recovery.conf into PGDATA
11) startup

This is what I get :

LOG: database system was interrupted at 2004-07-15 21:24:04 NZST
LOG: recovery command file found...
LOG: restore_program = cp %s/%s %s
LOG: recovery_target_inclusive = true
LOG: recovery_debug_log = true
LOG: starting archive recovery
LOG: restored log file "0000000000000000" from archive
LOG: checkpoint record is at 0/A48054
LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 496; next OID: 25419
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/A48094
LOG: restored log file "0000000000000001" from archive
LOG: record with zero length at 0/1FFFFE0
LOG: redo done at 0/1FFFF30
LOG: restored log file "0000000000000001" from archive
LOG: restored log file "0000000000000001" from archive
PANIC: concurrent transaction log activity while database system is
shutting down
LOG: startup process (PID 13492) was terminated by signal 6
LOG: aborting startup due to startup process failure

The concurrent access is a bit of a puzzle, as this is my home machine
(i.e. I am *sure* no one else is connected!)

First, thanks for sticking with it to test this.

I've not received such a message myself - this is interesting.

Is it possible to copy that directory to one side and re-run the test?
Add another parameter in postgresql.conf called "archive_debug = true"
Does it happen identically the second time?

What time difference was there between steps 5 and 6? I think I can hear
Andreas saying "told you".... I'm thinking the backup might be somehow
corrupted because the checkpoint occurred during the backup. Hmmm...

Could you also post me the recovery.log file? (don't post to list)

Thanks, Simon Riggs

#51Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#48)
Re: Point in Time Recovery

On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote:

We will get there --- it just seems dark at this time.

Thanks for that. My comments were heartfelt, but not useful right now.

I'm badly overdrawn already on my time budget, though that is my concern
alone. There is more to do than I have time for. Pragmatically, if we
aren't going to get there then I need to stop now, so I can progress
other outstanding issues. All help is appreciated.

I'm aiming for the minimum feature set - which means we do need to take
care over whether that set is insufficient and also to pull any part
that doesn't stand up to close scrutiny over the next few days.

Overall, my primary goal is increased robustness and availability for
PostgreSQL...and then to have a rest!

Best Regards, Simon Riggs

#52Devrim GUNDUZ
devrim@gunduz.org
In reply to: Simon Riggs (#51)
Re: Point in Time Recovery

Simon,

On Thu, 15 Jul 2004, Simon Riggs wrote:

We will get there --- it just seems dark at this time.

Thanks for that. My comments were heartfelt, but not useful right now.

I'm badly overdrawn already on my time budget, though that is my concern
alone. There is more to do than I have time for. Pragmatically, if we
aren't going to get there then I need to stop now, so I can progress
other outstanding issues. All help is appreciated.

Personally, as a PostgreSQL Advocate, I believe that PITR is one of the
most important missing features in PostgreSQL. I've been keeping 'all' of
your e-mails about PITR and I'm really excited about that feature.

Please do not stop working on PITR. I'm pretty sure that most of the
'silent' people in the lists are waiting for PITR for an {Oracle, DB2, ...}-killer
database. In my country (Turkey), too many people spend a lot of
money for proprietary databases, just for some missing features in
PostgreSQL. If you finish your work on PITR (and other guys on NT, Win32
port, etc), then we'll be able to concentrate more on PostgreSQL advocacy, so
that PostgreSQL will be used more and more. (Oh, we also need native
clustering...)

Maybe I should send this e-mail offlist, but I wanted everyone to learn my
feelings.

Regards and best wishes,
--
Devrim GUNDUZ
devrim~gunduz.org devrim.gunduz~linux.org.tr
http://www.tdmsoft.com
http://www.gunduz.org

#53Simon Riggs
simon@2ndquadrant.com
In reply to: HISADAMasaki (#46)
Re: Point in Time Recovery

On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:

Dear Simon,

I've just tested pitr_v5_2.patch and got an error message
during archiving process as follows.

-- begin
LOG: archive command="cp /usr/local/pgsql/data/pg_xlog/0000000000000000 /tmp",return code=-1
-- end

The command called in system(3) works, but system(3) returns -1.
system(3) cannot get the right exit code from its child process
when SIGCHLD is set to SIG_IGN.

So I made the following change to pgarch_Main() in pgarch.c

-- line 236 ---
- pgsignal(SIGCHLD, SIG_IGN);

-- line 236 ---
+ pgsignal(SIGCHLD, SIG_DFL);

Thank you for testing the patch. Very much appreciated.

I was aware of the potential issues of incorrect return codes, and that
exact part of the code is the part I'm least happy with.

I'm not sure I understand why it returned -1, though I'll take your
recommendation. I've not witnessed such an issue. What system are you
running, or is it a default shell issue?

Do people think that the change is appropriate for all systems, or just
the one you're using?

Best Regards, Simon Riggs

#54Simon Riggs
simon@2ndquadrant.com
In reply to: Devrim GUNDUZ (#52)
Re: Point in Time Recovery

On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:

Thanks for the vote of confidence, on or off list.

too many people spend a lot of
money for proprietary databases, just for some missing features in
PostgreSQL

Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
already, even though it doesn't have it, then you are quite likely to be
able to keep going without it.

Most commercial users won't touch anything that doesn't have PITR.

(Oh, we also need native
clustering...)

Next week, OK? :)

Best Regards, Simon

#55Alvaro Herrera
alvherre@dcc.uchile.cl
In reply to: Simon Riggs (#53)
Re: Point in Time Recovery

On Thu, Jul 15, 2004 at 11:44:02PM +0100, Simon Riggs wrote:

On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:

-- line 236 ---
- pgsignal(SIGCHLD, SIG_IGN);

-- line 236 ---
+ pgsignal(SIGCHLD, SIG_DFL);

I'm not sure I understand why it returned -1, though I'll take your
recommendation. I've not witnessed such an issue. What system are you
running, or is it a default shell issue?

Do people think that the change is appropriate for all systems, or just
the one you're using?

My manpage for signal(2) says that you shouldn't assign SIG_IGN to
SIGCHLD, according to POSIX. It goes on to say that BSD and SysV
behaviors differ on this aspect.

(This is on linux BTW)

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Experience tells us that man has peeled potatoes millions of times,
but it had to be admitted that in one case in millions
the potatoes might peel the man" (Ijon Tichy)

#56Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#54)
Re: Point in Time Recovery

Simon Riggs wrote:

On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:

Thanks for the vote of confidence, on or off list.

too many people spend a lot of
money for proprietary databases, just for some missing features in
PostgreSQL

Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
already, even though it doesn't have it, then you are quite likely to be
able to keep going without it.

Most commercial users won't touch anything that doesn't have PITR.

Agreed. I am surprised at how few requests we have gotten for PITR. I
assume people are either using replication or not considering us.

#57Mark Kirkwood
markir@coretech.co.nz
In reply to: Simon Riggs (#50)
Re: Point in Time Recovery

Simon Riggs wrote:

First, thanks for sticking with it to test this.

I've not received such a message myself - this is interesting.

Is it possible to copy that directory to one side and re-run the test?
Add another parameter in postgresql.conf called "archive_debug = true"
Does it happen identically the second time?

Yes, identical results - I re-initdb'ed and ran the process again,
rather than reuse the files.

What time difference was there between steps 5 and 6? I think I can hear
Andreas saying "told you".... I'm thinking the backup might be somehow
corrupted because the checkpoint occurred during the backup. Hmmm...

I was wondering about this, so left a bit more time in between, and
forced a sync as well for good measure.

5) $ psql -d test -c "checkpoint"; sleep 30;sync;sleep 30
6) $ tar -zcvf /data1/dump/pgdata-7.5.tar.gz *

Thanks, Simon Riggs

#58Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#51)
Re: Point in Time Recovery

Simon Riggs wrote:

On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote:

We will get there --- it just seems dark at this time.

Thanks for that. My comments were heartfelt, but not useful right now.

I'm badly overdrawn already on my time budget, though that is my concern
alone. There is more to do than I have time for. Pragmatically, if we
aren't going to get there then I need to stop now, so I can progress
other outstanding issues. All help is appreciated.

I'm aiming for the minimum feature set - which means we do need to take
care over whether that set is insufficient and also to pull any part
that doesn't stand up to close scrutiny over the next few days.

As you can see, we are still chewing on NT. What PITR features are
missing? I assume because we can't stop file system writes during
backup that we will need a backup parameter file like I described. Is
there anything else that PITR needs?

#59Glen Parker
glenebob@nwlink.com
In reply to: Bruce Momjian (#56)
Re: Point in Time Recovery

Simon Riggs wrote:

On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:

Thanks for the vote of confidence, on or off list.

too many people spend a lot of
money for proprietary databases, just for some missing features in
PostgreSQL

Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
already, even though it doesn't have it, then you are quite likely to be
able to keep going without it.

Most commercial users won't touch anything that doesn't have PITR.

Agreed. I am surprised at how few requests we have gotten for PITR. I
assume people are either using replication or not considering us.

Don't forget that there are (must be) lots of us that know it's coming and
are just waiting until it's available. I haven't requested per se, but
believe me, I'm waiting for it :-)

#60Mark Kirkwood
markir@coretech.co.nz
In reply to: Glen Parker (#59)
Re: Point in Time Recovery

Couldn't agree more. Maybe we should have made more noise :-)

Glen Parker wrote:

Simon Riggs wrote:

On Thu, 2004-07-15 at 23:18, Devrim GUNDUZ wrote:

Thanks for the vote of confidence, on or off list.

too many people spend a lot of
money for proprietary databases, just for some missing features in
PostgreSQL

Agreed - PITR isn't aimed at existing users of PostgreSQL. If you use it
already, even though it doesn't have it, then you are quite likely to be
able to keep going without it.

Most commercial users won't touch anything that doesn't have PITR.

Agreed. I am surprised at how few requests we have gotten for PITR. I
assume people are either using replication or not considering us.

Don't forget that there are (must be) lots of us that know it's coming and
are just waiting until it's available. I haven't requested per se, but
believe me, I'm waiting for it :-)

#61Mark Kirkwood
markir@coretech.co.nz
In reply to: Noname (#35)
Re: Point in Time Recovery

Simon Riggs wrote:

So far:

I've tried to re-create the problem as exactly as I can, but it works
for me.

This is clearly an important case to chase down.

I assume that this is the very first time you tried recovery? Second and
subsequent recoveries using the same set have a potential loophole,
which we have been discussing.

Right now, I'm thinking that the "exactly 2 logs worth" of data has
brought you very close to the end of the log file (FFFFE0) ending with 1
and the shutdown checkpoint that is then subsequently written is
failing.
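
The arithmetic behind "very close to the end of the log file" can be sketched as follows. It assumes the default 16 MB segment size and the 16-hex-digit segment names visible in the logs in this thread; `locate` is a hypothetical helper, not a PostgreSQL function.

```python
# Assumption: default WAL segment size of 16 MB.
XLOG_SEG_SIZE = 16 * 1024 * 1024


def locate(record_pointer):
    """Split 'logid/offset' (hex) into (segment name, offset in segment, bytes left)."""
    log_id_text, offset_text = record_pointer.split("/")
    log_id = int(log_id_text, 16)
    offset = int(offset_text, 16)
    segment = offset // XLOG_SEG_SIZE
    within_segment = offset % XLOG_SEG_SIZE
    # Assumption: segment files are named as log id and segment number,
    # each printed as eight hex digits.
    name = "%08X%08X" % (log_id, segment)
    return name, within_segment, XLOG_SEG_SIZE - within_segment


if __name__ == "__main__":
    for pointer in ("0/A48054", "0/1FFFF30", "0/1FFFFE0"):
        print(pointer, locate(pointer))
```

Under these assumptions the zero-length record at 0/1FFFFE0 lies 32 bytes before the end of segment 0000000000000001, and the last redone record at 0/1FFFF30 lies 208 bytes before it.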

Can you repeat this your end?

It is repeatable at my end. It is actually fairly easy to recreate the
example I am using, download

http://sourceforge.net/projects/benchw

and generate the dataset for Pg - but trim the large "fact0.dat" dump
file using head -100000.
Thus step 7 consists of creating the 4 tables and COPYing in the data
for them.

The nearest I can get to the exact record pointers you show is to start
recovery at A4807C and to end at FFFF88.

Overall, PITR changes the recovery process very little, if at all. The
main areas of effect are to do with sequencing of actions and matching
up the right logs with the right backup. I'm not looking for bugs in the
code but in subtle side-effects and "edge" cases. Everything you can
tell me will help me greatly in chasing that down.

I agree - I will try this sort of example again, but will change the
number of rows I am COPYing (currently 100000) and see if that helps.

Best Regards, Simon Riggs

By way of contrast, using the *same* procedure (1-11), but generating 2
logs worth of INSERTS/UPDATES using 10 concurrent processes *works fine* -
e.g.:

LOG: database system was interrupted at 2004-07-16 11:17:52 NZST
LOG: recovery command file found...
LOG: restore_program = cp %s/%s %s
LOG: recovery_target_inclusive = true
LOG: recovery_debug_log = true
LOG: starting archive recovery
LOG: restored log file "0000000000000000" from archive
LOG: checkpoint record is at 0/A4803C
LOG: redo record is at 0/A4803C; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 496; next OID: 25419
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/A4807C
postmaster starting
[postgres@shroudeater 7.5]$ LOG: restored log file "0000000000000001"
from archive
cp: cannot stat `/data1/pgdata/7.5-archive/0000000000000002': No such
file or directory
LOG: could not restore "0000000000000002" from archive
LOG: could not open file "/data1/pgdata/7.5/pg_xlog/0000000000000002"
(log file 0, segment 2): No such file or directory
LOG: redo done at 0/1FFFFD4
LOG: archive recovery complete
LOG: database system is ready
LOG: archiver started
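
For readers following the record pointers in the log above, here is a rough sketch of how a position such as 0/1FFFFD4 lines up with the 16-character segment file names. It assumes 16 MB segments and the "log id, then segment number" naming that the log output suggests; the function name is made up for illustration and is not taken from the patch.

```python
# Assumption: WAL segments are 16 MB, and file names are the log id followed
# by the segment number, each as 8 hex digits (as the log output suggests).
XLOG_SEG_SIZE = 16 * 1024 * 1024


def segment_file_for(position: str) -> str:
    """Map an 'xlogid/xrecoff' position (hex) to the segment file holding it."""
    log_id_hex, offset_hex = position.split("/")
    log_id = int(log_id_hex, 16)
    # The segment number is the offset within the log id, in whole segments.
    segment = int(offset_hex, 16) // XLOG_SEG_SIZE
    return f"{log_id:08X}{segment:08X}"


if __name__ == "__main__":
    for pos in ("0/A4803C", "0/A4807C", "0/1FFFFD4"):
        print(pos, "->", segment_file_for(pos))
```

Under those assumptions the checkpoint at 0/A4803C sits in the first restored file and "redo done at 0/1FFFFD4" sits near the end of the second, which fits recovery stopping once 0000000000000002 could not be found.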

#62Simon Riggs
simon@2ndquadrant.com
In reply to: Alvaro Herrera (#55)
Re: Point in Time Recovery

On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:

On Thu, Jul 15, 2004 at 11:44:02PM +0100, Simon Riggs wrote:

On Thu, 2004-07-15 at 13:16, HISADAMasaki wrote:

-- line 236 ---
- pgsignal(SIGCHLD, SIG_IGN);

-- line 236 ---
+ pgsignal(SIGCHLD, SIG_DFL);

I'm not sure I understand why it returned -1, though I'll take your
recommendation. I've not witnessed such an issue. What system are you
running, or is it a default shell issue?

Do people think that the change is appropriate for all systems, or just
the one you're using?

My manpage for signal(2) says that you shouldn't assign SIG_IGN to
SIGCHLD, according to POSIX. It goes on to say that BSD and SysV
behaviors differ on this aspect.

POSIX rules OK!

So - I should be setting this to SIG_DFL and that's good for everyone?

OK. Will do.

Best regards, Simon Riggs

#63Simon Riggs
simon@2ndquadrant.com
In reply to: Mark Kirkwood (#61)
Re: Point in Time Recovery

On Fri, 2004-07-16 at 00:46, Mark Kirkwood wrote:

By way of contrast, using the *same* procedure (1-11), but generating 2
logs worth of INSERTS/UPDATES using 10 concurrent processes *works fine* -
e.g.:

Great...at least we have shown that something works (or can work) and
have begun to isolate the problem whatever it is.

Best Regards, Simon Riggs

#64Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#62)
Re: Point in Time Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:

My manpage for signal(2) says that you shouldn't assign SIG_IGN to
SIGCHLD, according to POSIX.

So - I should be setting this to SIG_DFL and that's good for everyone?

Yeah, we learned the same lesson in the backend not too many releases
back. SIG_IGN'ing SIGCHLD is bad voodoo; it'll work on some platforms
but not others.

You could do worse than to look at the existing handling of signals in
the postmaster and its children; that code has been beat on pretty
heavily ...

regards, tom lane
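
As an aside on the SIG_IGN versus SIG_DFL exchange above, the portability problem can be seen with a small experiment. This is only an illustration in Python, not the patch under discussion; the helper name is invented. With SIGCHLD ignored, some platforms reap children automatically, so the parent's wait fails and the exit status is lost. That would be consistent with a wait-based call reporting -1, though the thread does not confirm the exact cause.

```python
import os
import signal


def child_exit_status(exit_code, disposition):
    """Fork a child that exits with exit_code and report what the parent sees.

    Returns the child's exit status, or None when the status is unavailable
    (for example, the kernel discarded it and waitpid failed with ECHILD).
    """
    previous = signal.signal(signal.SIGCHLD, disposition)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(exit_code)  # child: exit immediately, skip cleanup
        try:
            _, status = os.waitpid(pid, 0)
        except ChildProcessError:
            return None  # platform-dependent: seen with SIG_IGN on some systems
        return os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
    finally:
        # Put the original disposition back (None means "not set from Python").
        if previous is not None:
            signal.signal(signal.SIGCHLD, previous)


if __name__ == "__main__":
    print("SIG_DFL:", child_exit_status(3, signal.SIG_DFL))
    print("SIG_IGN:", child_exit_status(3, signal.SIG_IGN))
```

With SIG_DFL the parent reliably gets the status back; what happens under SIG_IGN depends on the platform, which is the point being made above.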

#65Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Simon Riggs (#51)
Re: Point in Time Recovery

Thanks for that. My comments were heartfelt, but not useful right now.

Hi Simon, I'm sorry if I gave the impression that I thought your work
wasn't worthwhile, it is :(

I'm badly overdrawn already on my time budget, though that is my concern
alone. There is more to do than I have time for. Pragmatically, if we
aren't going to get there then I need to stop now, so I can progress
other outstanding issues. All help is appreciated.

I've got your patch applied (but having some compilation problem), but
I'm really not sure what to test really. I don't really understand the
whole thing fully :/

I'm aiming for the minimum feature set - which means we do need to take
care over whether that set is insufficient and also to pull any part
that doesn't stand up to close scrutiny over the next few days.

Overall, my primary goal is increased robustness and availability for
PostgreSQL...and then to have a rest!

Definitely!

Chris

#66Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#64)
Re: Point in Time Recovery

On Fri, 2004-07-16 at 04:49, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

On Fri, 2004-07-16 at 00:01, Alvaro Herrera wrote:

My manpage for signal(2) says that you shouldn't assign SIG_IGN to
SIGCHLD, according to POSIX.

So - I should be setting this to SIG_DFL and that's good for everyone?

Yeah, we learned the same lesson in the backend not too many releases
back. SIG_IGN'ing SIGCHLD is bad voodoo; it'll work on some platforms
but not others.

Many thanks all, Best Regards Simon Riggs

#67Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Simon Riggs (#66)
Re: Point in Time Recovery

I'm aiming for the minimum feature set - which means we do need to take
care over whether that set is insufficient and also to pull any part
that doesn't stand up to close scrutiny over the next few days.

As you can see, we are still chewing on NT. What PITR features are
missing? I assume because we can't stop file system writes during
backup that we will need a backup parameter file like I described. Is
there anything else that PITR needs?

No, we don't need to stop writes! Not even to split a mirror.
Other DBs need that to be able to restore, but we don't.
We only need to tell people to backup pg_control first. The rest was only
intended to enforce
1. that pg_control is the first file backed up
2. the dba uses a large enough PIT (or xid) for restore

I think the idea with an extra file with WAL start position was overly
complicated, since all you need is pg_control (+ WAL end position to enforce 2.).

If we don't want to tell people to backup pg_control first, imho the next
best plan would be to add a "WAL start" input (e.g. xlog name) parameter
to recovery.conf, that "fixes" pg_control.

Andreas

#68Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#67)
Re: Point in Time Recovery

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

We only need to tell people to backup pg_control first. The rest was only
intended to enforce
1. that pg_control is the first file backed up
2. the dba uses a large enough PIT (or xid) for restore

Right, but I think Bruce's point is that it is far too easy to get those
things wrong; especially point 2 for which a straight tar dump will
simply not contain the information you need to determine what is a safe
stopping point.

I agree with Bruce that we should have some mechanism that doesn't rely
on the DBA to get this right. Exactly what the mechanism should be is
certainly open for discussion...

regards, tom lane

#69Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#68)
Re: Point in Time Recovery

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

We only need to tell people to backup pg_control first. The rest was only
intended to enforce
1. that pg_control is the first file backed up
2. the dba uses a large enough PIT (or xid) for restore

Right, but I think Bruce's point is that it is far too easy to get those
things wrong; especially point 2 for which a straight tar dump will
simply not contain the information you need to determine what is a safe
stopping point.

I agree with Bruce that we should have some mechanism that doesn't rely
on the DBA to get this right. Exactly what the mechanism should be is
certainly open for discussion...

Right. I am wondering what process people would use to backup
pg_control first? If they do:

tar -f $TAPE ./global/pg_control .

They will get two copies of pg_control, the early one, and one as part
of the directory scan. On restore, they would restore the early one,
but the directory scan would overwrite it. I suppose they could do:

cp global/pg_control global/pg_control.backup
tar -f $TAPE .

then on restore once all the files are restored move the
pg_control.backup to its original name. That gives us the checkpoint
wal/offset but how do we get the start/stop information? Is that not
required? Maybe we should just have start/stop server-side functions
that create a file in the archive directory describing the start/stop
counters and time and the admin would then have to find those values.
Why are the start/stop wal/offset values needed anyway? I know why we
need the checkpoint value. Do we need a checkpoint after the archiving
starts but before the backup begins?
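
The "two copies of pg_control" concern above can be reproduced in miniature. The sketch below builds an in-memory tar archive holding two members with the same name and then extracts it. The file contents are stand-ins and the helper names are invented for illustration.

```python
import io
import os
import tarfile
import tempfile


def add_bytes(tar, name, data):
    """Append an in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def restored_pg_control():
    """Archive pg_control twice, restore everything, return what is on disk."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        add_bytes(tar, "global/pg_control", b"early copy")  # backed up first
        add_bytes(tar, "global/pg_control", b"late copy")   # from the directory scan
    buf.seek(0)
    with tempfile.TemporaryDirectory() as dest:
        with tarfile.open(fileobj=buf, mode="r") as tar:
            tar.extractall(dest)  # members are restored in archive order
        with open(os.path.join(dest, "global", "pg_control"), "rb") as f:
            return f.read()


if __name__ == "__main__":
    print(restored_pg_control())
```

The later member overwrites the earlier one on restore, which is why copying pg_control to a separate name before running tar is the safer variant.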

Also, when you are in recovery mode, how do you get out of recovery
mode, meaning if you have a power failure, how do you prevent the system
from doing another recovery? Do you remove the recovery.conf file?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#70Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#69)
Re: Point in Time Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Also, when you are in recovery mode, how do you get out of recovery
mode, meaning if you have a power failure, how do you prevent the system
from doing another recovery? Do you remove the recovery.conf file?

I do not care for the idea of a recovery.conf file at all, and have been
intending to look to see what we'd need to do to not have one. I find
it hard to believe that there is anything one would put in it that is
really persistent state. The above concern shows why it shouldn't be
treated as a persistent configuration file.

regards, tom lane

#71Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#70)
Re: Point in Time Recovery

then on restore once all the files are restored move the
pg_control.backup to its original name. That gives us the checkpoint
wal/offset but how do we get the start/stop information? Is that not
required?

The checkpoint wal/offset is in pg_control, that is sufficient start
information. The stop info is only necessary as a safeguard.

Do we need a checkpoint after the archiving
starts but before the backup begins?

No.

Also, when you are in recovery mode, how do you get out of recovery
mode, meaning if you have a power failure, how do you prevent the system
from doing another recovery? Do you remove the recovery.conf file?

pg_control could be updated during rollforward (only if that actually
does a checkpoint). So if pg_control is also the recovery start info, then
we can continue from there if we have a power failure.
For the first release it would imho also be ok to simply start over if
you lose power.

I think the filename 'recovery.conf' is misleading, since it is not a
static configuration file, but a command file for one recovery.
How about 'recovery.command' then 'recovery.inprogress', and on recovery
completion it should be renamed to 'recovery.done'

Andreas

#72Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#71)
Re: Point in Time Recovery

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

Do we need a checkpoint after the archiving
starts but before the backup begins?

No.

Actually yes. You have to start at a checkpoint record when replaying
the log, so if no checkpoint occurred between starting to archive WAL
and starting the tar backup, you have a useless backup.

It would be reasonable to issue a CHECKPOINT just before starting the
backup as part of the standard operating procedure for taking PITR
dumps. We need not require this, but it would help to avoid this
particular sort of mistake; and of course it might save a little bit of
replay effort if the backup is ever used.

As far as the business about copying pg_control first goes: there is
another way to think about it, which is to copy pg_control to another
place that will be included in your backup. For example the standard
backup procedure could be

1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.

2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump

3. tar cf /dev/mt $PGDATA

4. do something to record ending WAL position

If we standardized on this way, then the tar archive would automatically
contain the pre-backup checkpoint position in ./pg_control.dump, and
there is no need for any special assumptions about the order in which
tar processes things.
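
Steps 2 and 3 of the procedure above could be scripted roughly as follows. This is a sketch under stated assumptions, not a tested backup tool: the function name and the "data" archive prefix are invented, step 1 (CHECKPOINT) needs a database connection and is left out, and step 4 is left to the caller since the thread has not yet settled how to record the ending WAL position.

```python
import os
import shutil
import tarfile
import tempfile


def take_base_backup(pgdata, tar_path):
    """Copy pg_control aside, then tar the data directory (steps 2 and 3).

    tar_path must live outside pgdata, or the archive would try to include
    itself.
    """
    # Step 2: snapshot pg_control before tar starts reading the directory.
    shutil.copy2(os.path.join(pgdata, "global", "pg_control"),
                 os.path.join(pgdata, "pg_control.dump"))
    # Step 3: archive the whole data directory; file order no longer matters.
    with tarfile.open(tar_path, "w") as tar:
        tar.add(pgdata, arcname="data")


def demo():
    """Run the sketch against a throwaway directory and list the archive."""
    with tempfile.TemporaryDirectory() as tmp:
        pgdata = os.path.join(tmp, "data")
        os.makedirs(os.path.join(pgdata, "global"))
        with open(os.path.join(pgdata, "global", "pg_control"), "wb") as f:
            f.write(b"stand-in control file")
        tar_path = os.path.join(tmp, "backup.tar")
        take_base_backup(pgdata, tar_path)
        with tarfile.open(tar_path) as tar:
            return tar.getnames()


if __name__ == "__main__":
    print(demo())
```

Because pg_control.dump is an ordinary file inside the data directory, it travels with the archive whatever order tar chooses.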

However, once you decide to do things like that, there is no reason why
the copied file has to be an exact image of pg_control. I claim it
would be more useful if the copied file were plain text so that you
could just "cat" it to find out the starting WAL position; that would
let you determine without any special tools what range of WAL archive
files you are going to need to bring back from your archives.

This is pretty much the same chain of reasoning that Bruce and I went
through yesterday to come up with the idea of putting a label file
inside the tar backups. We concluded that it'd be worth putting
both the backup starting time and the checkpoint WAL position into
the label file --- the starting time isn't needed for restore but
might be really helpful as documentation, if you needed to verify
which dump file was which.
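
To make the label file idea above concrete, here is one possible shape for it. The field names and layout are purely hypothetical; the thread only says the file should be plain text and carry the backup start time and the checkpoint WAL position.

```python
from datetime import datetime


def format_backup_label(checkpoint_location, start_time):
    """Render a plain-text label (hypothetical layout, readable with cat)."""
    return (f"START TIME: {start_time.isoformat()}\n"
            f"CHECKPOINT LOCATION: {checkpoint_location}\n")


def parse_backup_label(text):
    """Read the label back into a dict of field name -> value."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(": ")
        fields[key] = value
    return fields


if __name__ == "__main__":
    label = format_backup_label("0/A4803C", datetime(2004, 7, 16, 11, 17, 52))
    print(label, end="")
    print(parse_backup_label(label))
```

A file like this tells the admin where replay has to start without any special tool, which is the property being argued for above.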

regards, tom lane

#73Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#72)
Re: Point in Time Recovery

Do we need a checkpoint after the archiving
starts but before the backup begins?

No.

Actually yes.

Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-(
So yes, you need one checkpoint after archiving starts. Imho turning on xlog
archiving should issue such a checkpoint just to be sure.

Andreas

#74Simon Riggs
simon@2ndquadrant.com
In reply to: Zeugswetter Andreas SB SD (#73)
Re: Point in Time Recovery

On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote:

Do we need a checkpoint after the archiving
starts but before the backup begins?

No.

Actually yes.

Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-(
So yes, you need one checkpoint after archiving starts. Imho turning on xlog
archiving should issue such a checkpoint just to be sure.

By agreement, archive_mode can only be turned on at postmaster startup,
which means you always have a checkpoint - either because you shut it
down cleanly, or you didn't and it recovers, then writes one.

There is always something to start the rollforward.

So, non-issue.

Best regards, Simon Riggs

#75Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#69)
Re: Point in Time Recovery

On Fri, 2004-07-16 at 15:27, Bruce Momjian wrote:

Also, when you are in recovery mode, how do you get out of recovery
mode, meaning if you have a power failure, how do you prevent the system
from doing another recovery? Do you remove the recovery.conf file?

That was the whole point of the recovery.conf file:
it prevents you from re-entering recovery accidentally, as would occur
if the parameters were set in the normal .conf file.

Best Regards, Simon Riggs

#76Simon Riggs
simon@2ndquadrant.com
In reply to: Zeugswetter Andreas SB SD (#71)
Re: Point in Time Recovery

On Fri, 2004-07-16 at 16:25, Zeugswetter Andreas SB SD wrote:

I think the filename 'recovery.conf' is misleading, since it is not a
static configuration file, but a command file for one recovery.
How about 'recovery.command' then 'recovery.inprogress', and on recovery
completion it should be renamed to 'recovery.done'

You understand this and your assessment is correct.

recovery.conf isn't an attempt to persist information. It is a means of
delivering a set of parameters to the recovery process, as well as
signalling overall that archive recovery is required (because the system
default remains the same, which is to recover from the logs it has
locally available to it).

I originally offered a design which used a command, similar to
DB2/Oracle...that was overruled as too complex. The (whatever you call
it) file is just a very simple way of specifying what's required.

There is more to be said here...clearly some explanations are required
and I will provide those later...

Best regards, Simon Riggs

#77Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#74)
Re: Point in Time Recovery

Simon Riggs wrote:

On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote:

Do we need a checkpoint after the archiving
starts but before the backup begins?

No.

Actually yes.

Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-(
So yes, you need one checkpoint after archiving starts. Imho turning on xlog
archiving should issue such a checkpoint just to be sure.

By agreement, archive_mode can only be turned on at postmaster startup,
which means you always have a checkpoint - either because you shut it
down cleanly, or you didn't and it recovers, then writes one.

There is always something to start the rollforward.

So, non-issue.

I don't think so. I can imagine many cases where you want to do a
nightly tar backup without turning archiving on/off or restarting the
postmaster. In those cases, a manual checkpoint would have to be issued
before the backup begins.

Imagine a system that is up for a month, and they don't have enough
archive space to keep a month's worth of WAL files. They would probably
do nightly or weekend tar backups, and then discard the WAL archives.

What procedure would they use? I assume they would copy all their old
WAL files to a save directory, issue a checkpoint, do a tar backup, then
they can delete the saved WAL files. Is that correct?

#78Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#77)
Re: Point in Time Recovery

On Fri, 2004-07-16 at 19:30, Bruce Momjian wrote:

Simon Riggs wrote:

On Fri, 2004-07-16 at 16:58, Zeugswetter Andreas SB SD wrote:

Do we need a checkpoint after the archiving
starts but before the backup begins?

No.

Actually yes.

Sorry, I did incorrectly not connect 'archiving' with the backed up xlogs :-(
So yes, you need one checkpoint after archiving starts. Imho turning on xlog
archiving should issue such a checkpoint just to be sure.

By agreement, archive_mode can only be turned on at postmaster startup,
which means you always have a checkpoint - either because you shut it
down cleanly, or you didn't and it recovers, then writes one.

There is always something to start the rollforward.

So, non-issue.

I was discussing the claim that there might not be a checkpoint to begin
the rollforward from. There always is: if you are in archive_mode=true
then you will always have a checkpoint that can be used for recovery. It
may be "a long way in the past", if there has been no write activity,
but the rollforward will be very quick, since there will be no log
records.

I don't think so. I can imagine many cases where you want to do a
nightly tar backup without turning archiving on/off or restarting the
postmaster.

This is a misunderstanding. I strongly agree with what you say: the
whole system has been designed to avoid any benefit from turning on/off
archiving and there is no requirement to restart postmaster to take
backups.

In those cases, a manual checkpoint would have to be issued
before the backup begins.

A manual checkpoint doesn't HAVE TO be issued. Presumably most systems
will be running checkpoint every few minutes. Wherever the last one was
is where the rollforward would start from.

But you can if that's the way you want to do things; just wait long
enough for the checkpoint to have completed, otherwise your objective of
reducing rollforward time will not be met.

(Please note my earlier reported rollforward performance: recovery ran
at approximately x10 the rate of the original elapsed time. This will
require testing on your own systems.)

Imagine a system that is up for a month, and they don't have enough
archive space to keep a month's worth of WAL files. They would probably
do nightly or weekend tar backups, and then discard the WAL archives.

Yes, that would be normal practice. I would recommend keeping at least
the last 3 full backups and all of the WAL logs to cover that period.

What procedure would they use? I assume they would copy all their old
WAL files to a save directory, issue a checkpoint, do a tar backup, then
they can delete the saved WAL files. Is that correct?

PITR is designed to interface with a wide range of systems, through the
extensible archive/recovery program interface. We shouldn't focus on
just tar backups - if you do, then the whole thing seems less
feature-rich. The current design allows interfacing with tape, remote
backup, internet backup providers, automated standby servers and the
dozen major storage/archive vendors' solutions.

Writing a procedure to backup, assign filenames, keep track of stuff
isn't too difficult if you're a competent DBA with a mild knowledge of
shell or perl scripting. But if data is important, people will want to
invest the time and trouble to adopt one of the open source or licenced
vendors that provide solutions in this area.

Systems management is a discipline and procedures should be in place for
everything. I fully agree with the "automate everything" dictum, but
just don't want to constrain people too much to a particular way of
doing things.

-o-o-

Overall, for first release, I think the complexity of this design is
acceptable. PITR is similar to Oracle7 Backup/Recovery, and easily
recognisable to any DBA with current experience of current SQL Server,
DB2 (MVS, UDB) or Teradata systems. [I can't comment much on Ingres,
Informix, Sybase etc]

My main areas of concern are:
- the formal correctness of the recovery process
As a result of this concern, PITR makes ZERO alterations to the recovery
code itself. The trick is to feed it the right xlog files and to stop,
if required, at the right place and allow normal work to resume.

- the robustness and quality of my implementation
This requires quality checking of the code and full beta testing

-o-o-

We've raised a couple of valid points on the lists in the last few days:
- it's probably a desirable feature (but not essential) to implement a
write suspend feature on the bgwriter; if nothing else it will be a
confidence building feature. As said previously, for many people this
will not be required, but people will no doubt keep asking.
- there is a small window of risk around the possibility that a recovery
target might be set by the user that doesn't rollforward all the way
past the end of the backup. That is real, but in general, people aren't
likely to be performing archive recovery within minutes of a backup
being taken - and if they are, they can always start from the backup
before that one. This is a gap we should close, but it's just something
to be aware of...just like pg_dump not sorting things in the correct
order in its first release.

Not for now, but soon, I would propose:
- a command to suspend/resume bgwriter to allow backups.
- use the suspend/resume feature to write a log record "backup end
marker" which shows when this took place. Ensure that any rollforward
goes through AT LEAST ONE "backup end marker" on its way. (If a Point in
Time is specified too early, we can check this immediately against the
checkpoint record.) We can then refuse to stop at any point in time
earlier than the backup end marker.
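
The safeguard proposed above boils down to a single comparison. The sketch below assumes the backup end time is already known (for instance from the proposed "backup end marker"); the exception and function names are invented for illustration and nothing like this exists in the patch yet.

```python
from datetime import datetime


class RecoveryTargetTooEarly(ValueError):
    """The requested stop point falls before the end of the base backup."""


def check_recovery_target(target_time, backup_end_time):
    """Refuse any recovery target earlier than the end of the backup."""
    if target_time < backup_end_time:
        raise RecoveryTargetTooEarly(
            f"target {target_time} is earlier than backup end {backup_end_time}"
        )
    return target_time


if __name__ == "__main__":
    backup_end = datetime(2004, 7, 16, 11, 30)
    print(check_recovery_target(datetime(2004, 7, 16, 12, 0), backup_end))
    try:
        check_recovery_target(datetime(2004, 7, 16, 11, 0), backup_end)
    except RecoveryTargetTooEarly as err:
        print("refused:", err)
```

The same comparison would work on transaction ids instead of timestamps if the target is given by xid.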

I've written a todo list and will post this again soon.

Best Regards, Simon Riggs

#79Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#72)
Re: Point in Time Recovery

On Fri, 2004-07-16 at 16:47, Tom Lane wrote:

As far as the business about copying pg_control first goes: there is
another way to think about it, which is to copy pg_control to another
place that will be included in your backup. For example the standard
backup procedure could be

1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.

2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump

3. tar cf /dev/mt $PGDATA

4. do something to record ending WAL position

If we standardized on this way, then the tar archive would automatically
contain the pre-backup checkpoint position in ./pg_control.dump, and
there is no need for any special assumptions about the order in which
tar processes things.

Sounds good. That would be familiar to Oracle DBAs doing BACKUP
CONTROLFILE. We can document that and offer it as a suggested procedure.

However, once you decide to do things like that, there is no reason why
the copied file has to be an exact image of pg_control. I claim it
would be more useful if the copied file were plain text so that you
could just "cat" it to find out the starting WAL position; that would
let you determine without any special tools what range of WAL archive
files you are going to need to bring back from your archives.

I wouldn't be in favour of a manual mechanism. If you want an automated
mechanism, what's wrong with using the one that's already there? You can
use pg_controldata to read the controlfile; again, what's wrong with that?

We agreed some time back that an off-line xlog file inspector would be
required to allow us to inspect the logs and make a decision about where
to end recovery. You'd still need that.

It's scary enough having to specify the end point, let alone having to
specify the starting point as well.

At your request, and with Bruce's idea, I designed and built the
recovery system so that you don't need to know what range of xlogs to
bring back. You just run it, it brings back the right files from archive
and does recovery with them, then cleans up - and it works without
running out of disk space on long recoveries.

I've built it now and it works...

This is pretty much the same chain of reasoning that Bruce and I went
through yesterday to come up with the idea of putting a label file
inside the tar backups. We concluded that it'd be worth putting
both the backup starting time and the checkpoint WAL position into
the label file --- the starting time isn't needed for restore but
might be really helpful as documentation, if you needed to verify
which dump file was which.

...if you are doing tar backups...what will you do if you're not using
that mechanism?

If you are: It's common practice to make up a backup filename from
elements such as systemname, databasename, date and time etc. That gives
you the start time, the file last mod date gives you the end time.

I think it's perfectly fine for everybody to do backups any way they
please. There are many licenced variants of PostgreSQL and it might be
appropriate in those to specify particular ways of doing things.

I'll be trusting the management of backup metadata and storage media to
a solution designed for the purpose (open or closed source), just as
I'll be trusting my data to a database solution designed for that
purpose. That for me is one of the good things about PostgreSQL - we use
the filesystem, we don't write our own, we provide language interfaces
not invent our own proprietary server language etc..

Best Regards, Simon Riggs

#80Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#79)
Re: Point in Time Recovery

OK, I think I have some solid ideas and reasons for them.

First, I think we need server-side functions to call when we start/stop
the backup. The advantage of these server-side functions is that they
will do the required work of recording the pg_control values and
creating needed files with little chance for user error. It also allows
us to change the internal operations in later releases without requiring
admins to change their procedures. We are even able to adjust the
internal operation in minor releases without forcing a new procedure on
users.

Second, I think once we start a restore, we should rename recovery.conf
to recovery.in_progress, and when complete rename that to
recovery.done. If the postmaster starts and sees recovery.in_progress,
it will fail to start knowing its recovery was interrupted. This allows
the admin to take appropriate action. (I am not sure what that action
would be. Does he bring back the backup files or just keep going?)
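
The renaming scheme in the second point can be sketched as a tiny state machine driven by which file is present. The file names are the ones proposed above; the function names and the "refuse"/"recover"/"normal" labels are invented for illustration, and what the admin should do after a refusal is left open, as it is in the text.

```python
import os
import tempfile

COMMAND_FILE = "recovery.conf"
IN_PROGRESS_FILE = "recovery.in_progress"
DONE_FILE = "recovery.done"


def startup_action(pgdata):
    """Decide what to do at startup from the recovery files present."""
    if os.path.exists(os.path.join(pgdata, IN_PROGRESS_FILE)):
        return "refuse"   # an earlier recovery was interrupted
    if os.path.exists(os.path.join(pgdata, COMMAND_FILE)):
        return "recover"  # a fresh recovery was requested
    return "normal"       # nothing pending (recovery.done is ignored)


def begin_recovery(pgdata):
    os.rename(os.path.join(pgdata, COMMAND_FILE),
              os.path.join(pgdata, IN_PROGRESS_FILE))


def finish_recovery(pgdata):
    os.rename(os.path.join(pgdata, IN_PROGRESS_FILE),
              os.path.join(pgdata, DONE_FILE))


def demo():
    """Walk through the three states in a throwaway directory."""
    with tempfile.TemporaryDirectory() as pgdata:
        open(os.path.join(pgdata, COMMAND_FILE), "w").close()
        seen = [startup_action(pgdata)]
        begin_recovery(pgdata)
        seen.append(startup_action(pgdata))  # as if power failed mid-recovery
        finish_recovery(pgdata)
        seen.append(startup_action(pgdata))
        return seen


if __name__ == "__main__":
    print(demo())
```

Since each transition is a single rename, a crash leaves exactly one of the three files behind, so the state is never ambiguous.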

Third, I think we need to put a file in the archive location once we
complete a backup, recording the start/stop xid and wal/offsets. This
gives the admin documentation on what archive logs to keep and what xids
are available for recovery. Ideally the recover program would read that
file and check the recover xid to make sure it is after the stop xid
recorded in the file.

How would the recover program know the name of that file? We need to
create it in /data with start contents before the backup, then complete
it with end contents and archive it.

What should we name it? Ideally it would be named by the WAL
name/offset of the start so it orders in the proper spot in the archive
file listing, e.g.:

000000000000093a
000000000000093b
000000000000093b.032b9.start
000000000000093c

Are people going to know they need 000000000000093b for
000000000000093b.032b9.start? I hope so. Another idea is to do:

000000000000093a.xlog
000000000000093b.032b9.start
000000000000093b.xlog
000000000000093c.xlog

This would order properly. It might be a very good idea to add
extensions to these log files now that we are archiving them in strange
places. In fact, maybe we should use *.pg_xlog to document the
directory they came from.
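
The ordering claims above are easy to check with a plain lexical sort, which is how a directory listing would normally order these names. The file names are the examples from this message; only the helper name is made up.

```python
def archive_listing(names):
    """Order archive file names the way a plain lexical listing would."""
    return sorted(names)


BARE = [
    "000000000000093c",
    "000000000000093b.032b9.start",
    "000000000000093a",
    "000000000000093b",
]

WITH_EXTENSION = [
    "000000000000093c.xlog",
    "000000000000093b.xlog",
    "000000000000093b.032b9.start",
    "000000000000093a.xlog",
]

if __name__ == "__main__":
    for group in (BARE, WITH_EXTENSION):
        print("\n".join(archive_listing(group)))
        print()
```

With bare names the .start file lands just after the segment it refers to; with a .xlog extension it lands just before it, because "0" sorts ahead of "x".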

---------------------------------------------------------------------------

Simon Riggs wrote:

On Fri, 2004-07-16 at 16:47, Tom Lane wrote:

As far as the business about copying pg_control first goes: there is
another way to think about it, which is to copy pg_control to another
place that will be included in your backup. For example the standard
backup procedure could be

1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.

2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump

3. tar cf /dev/mt $PGDATA

4. do something to record ending WAL position

If we standardized on this way, then the tar archive would automatically
contain the pre-backup checkpoint position in ./pg_control.dump, and
there is no need for any special assumptions about the order in which
tar processes things.

Sounds good. That would be familiar to Oracle DBAs doing BACKUP
CONTROLFILE. We can document that and off it as a suggested procedure.

However, once you decide to do things like that, there is no reason why
the copied file has to be an exact image of pg_control. I claim it
would be more useful if the copied file were plain text so that you
could just "cat" it to find out the starting WAL position; that would
let you determine without any special tools what range of WAL archive
files you are going to need to bring back from your archives.

I wouldn't be in favour of a manual mechanism. If you want an automated
mechanism, whats wrong with using the one thats already there? You can
use pg_controldata to read the controlfile, again whats wrong with that?

We agreed some time back that an off-line xlog file inspector would be
required to allow us to inspect the logs and make a decision about where
to end recovery. You'd still need that.

It's scary enough having to specify the end point, let alone having to
specify the starting point as well.

At your request, and with Bruce's idea, I designed and built the
recovery system so that you don't need to know what range of xlogs to
bring back. You just run it, it brings back the right files from archive
and does recovery with them, then cleans up - and it works without
running out of disk space on long recoveries.

I've built it now and it works...

This is pretty much the same chain of reasoning that Bruce and I went
through yesterday to come up with the idea of putting a label file
inside the tar backups. We concluded that it'd be worth putting
both the backup starting time and the checkpoint WAL position into
the label file --- the starting time isn't needed for restore but
might be really helpful as documentation, if you needed to verify
which dump file was which.

...if you are doing tar backups...what will you do if you're not using
that mechanism?

If you are: It's common practice to make up a backup filename from
elements such as systemname, databasename, date and time etc. That gives
you the start time, the file last mod date gives you the end time.

I think it's perfectly fine for everybody to do backups any way they
please. There are many licenced variants of PostgreSQL and it might be
appropriate in those to specify particular ways of doing things.

I'll be trusting the management of backup metadata and storage media to
a solution designed for the purpose (open or closed source), just as
I'll be trusting my data to a database solution designed for that
purpose. That for me is one of the good things about PostgreSQL - we use
the filesystem, we don't write our own, we provide language interfaces
not invent our own proprietary server language etc..

Best Regards, Simon Riggs

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#81Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#51)
Re: Point in Time Recovery

Let me address your concerns about PITR getting into 7.5. I think a few
people spoke last week expressing concern about our release process and
wanting to take drastic action. However, looking at the release status
report I am about to post, you will see we are on track for an August 1
beta.

PITR has been neglected only because it has been moving along so well we
haven't needed to get deeply involved. Simon has been able to address
concerns as we raised them and make adjustments quickly with little
guidance.

Now, we certainly don't want PITR to miss 7.5 because we failed to give
it our full attention. Once Tom completes the cursor issues
with NT in the next day or so, I think that removes the last big NT
stumbling block, and we will start to focus on PITR. Unless there is
some major thing we are missing, we fully expect to get PITR in 7.5. We
don't have a crystal ball to know for sure, but our intent is clear.

I know Simon is going away July 26 so we want to get him feedback as
soon as possible. If we wait until after July 26, we will have to make
all the adjustments without Simon's guidance, which will be difficult.

As far as the importance of PITR, it is a _key_ enterprise feature, even
more key than NT. PITR is going to be one of the crowning jewels of the
7.5 release, and I don't want to go into beta without it unless we can't
help it.

So, I know the deadline is looming and everyone is getting nervous,
but keep the faith. I can see the light at the end of the tunnel. I
know this is a tighter schedule than we would like, but I know we can do
it, and I expect we will do it.

---------------------------------------------------------------------------

Simon Riggs wrote:

On Thu, 2004-07-15 at 15:57, Bruce Momjian wrote:

We will get there --- it just seems dark at this time.

Thanks for that. My comments were heartfelt, but not useful right now.

I'm badly overdrawn already on my time budget, though that is my concern
alone. There is more to do than I have time for. Pragmatically, if we
aren't going to get there then I need to stop now, so I can progress
other outstanding issues. All help is appreciated.

I'm aiming for the minimum feature set - which means we do need to take
care over whether that set is insufficient and also to pull any part
that doesn't stand up to close scrutiny over the next few days.

Overall, my primary goal is increased robustness and availability for
PostgreSQL...and then to have a rest!

Best Regards, Simon Riggs

#82Gaetano Mendola
mendola@bigfoot.com
In reply to: Simon Riggs (#32)
Re: [HACKERS] Point in Time Recovery

Simon Riggs wrote:

On Wed, 2004-07-14 at 03:31, Christopher Kings-Lynne wrote:

Can you give us some suggestions of what kind of stuff to test? Is
there a way we can artificially kill the backend in all sorts of nasty
spots to see if recovery works? Does kill -9 simulate a 'power off'?

I was hoping some fiendish plans would be presented to me...

But please start with "this feels like typical usage" and we'll go from
there...the important thing is to try the first one.

I've not done power off tests, yet. They need to be done just to
check...actually you don't need to do this to test PITR...

We need exhaustive tests of...
- power off
- scp and cross network copies
- all the permuted recovery options
- archive_mode = off (i.e. current behaviour)
- deliberately incorrectly set options (idiot-proof testing)

If you also write up how to perform these tests, that would help show
which problems PITR is addressing. I mean, I know it addresses a power-off,
but how would I recover from one?

Regards
Gaetano Mendola

#83Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#38)
Re: [HACKERS] Point in Time Recovery

[ ... some desultory reading of PITR patch ... ]

What is the point of having both archive_program and archive_dest as
GUC variables? Wouldn't it be simpler to fold them into one parameter,
viz

archive_command = 'cp %s /archivedir'

For that matter, do we need a separate archive_mode boolean? The one
thing I can positively guarantee about archive_dest (or archive_command)
is that we cannot come up with a useful default for it (no, /tmp isn't
good). Therefore it does not seem very reasonable to let the user turn
on archiving without having explicitly specified an archive destination.

I propose that we fold all three GUC flags into a single archive_command
string whose built-in default is an empty string, and you enable
archiving by setting it to something nonempty.

regards, tom lane

#84Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#83)
Re: [HACKERS] Point in Time Recovery

Tom Lane wrote:

[ ... some desultory reading of PITR patch ... ]

What is the point of having both archive_program and archive_dest as
GUC variables? Wouldn't it be simpler to fold them into one parameter,
viz

archive_command = 'cp %s /archivedir'

For that matter, do we need a separate archive_mode boolean? The one
thing I can positively guarantee about archive_dest (or archive_command)
is that we cannot come up with a useful default for it (no, /tmp isn't
good). Therefore it does not seem very reasonable to let the user turn
on archiving without having explicitly specified an archive destination.

I assume archive_dest is used for both archive and recovery of archives.

I propose that we fold all three GUC flags into a single archive_command
string whose built-in default is an empty string, and you enable
archiving by setting it to something nonempty.

I think the idea is that you would turn archiving on and off regularly
while you might never change the archive_command value. Also, how would
you disable it? Set it to "", and if you do, you then have no way to
remember your command string when you want to re-enable it.

#85Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#84)
Re: [HACKERS] Point in Time Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

What is the point of having both archive_program and archive_dest as
GUC variables?

I assume archive_dest is used for both archive and recovery of archives.

You assume wrong; it's not used there. There isn't any real good
reason to suppose that the recovery process is going to fetch the files
from exactly where archiving put them, anyhow.

I think the idea is that you would turn archiving on and off regularly

Why in the world would you do that? People who want PITR at all will
want it 24x7.

while you might never change the archive_command value. Also, how would
you disable it? Set it to "", and if you do, you then have no way to
remember your command string when you want to re-enable it.

Leave the original value in a comment, if you're going to want it again
later.

I don't think any of the above arguments outweigh the risk of people
shooting themselves in the foot by enabling archive_mode without
specifying a proper command/destination.

regards, tom lane

#86Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#85)
Re: [HACKERS] Point in Time Recovery

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

What is the point of having both archive_program and archive_dest as
GUC variables?

I assume archive_dest is used for both archive and recovery of archives.

You assume wrong; it's not used there. There isn't any real good
reason to suppose that the recovery process is going to fetch the files
from exactly where archiving put them, anyhow.

I think the idea is that you would turn archiving on and off regularly

Why in the world would you do that? People who want PITR at all will
want it 24x7.

while you might never change the archive_command value. Also, how would
you disable it? Set it to "", and if you do, you then have no way to
remember your command string when you want to re-enable it.

Leave the original value in a comment, if you're going to want it again
later.

I don't think any of the above arguments outweigh the risk of people
shooting themselves in the foot by enabling archive_mode without
specifying a proper command/destination.

So you want to merge them all into a single command string. That does
seem less error-prone. I see a few variables that turn off
when set to '' like unix_socket_*. How would this command string work?
How do you specify the WAL file name to transfer?

#87Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#86)
Re: [HACKERS] Point in Time Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

So you want to merge them all into a single command string. That does
seem less error-prone. I see a few variables that turn off
when set to '' like unix_socket_*. How would this command string work?
How do you specify the WAL file name to transfer?

No different from before, necessarily. However I did not like the
restriction to a single %s in the submitted implementation. What I
have in my local copy is
    %p -> full path of XLOG file to be archived
    %f -> base name of XLOG file to be archived
and the suggested example becomes
    archive_command = 'cp %p /mnt/server/pgarchive/%f'

Note that this example immediately eliminates one of the failure modes
Simon enumerates in his README, which is to try 'cp %s /foo' where /foo
isn't a directory. More generally, though, *only* a cp-to-directory
solution is likely to be very happy with not being able to get at the
base file name. Yes you can make a shellscript and use basename,
but I don't think you should have to do that if it could otherwise
be a one-liner.
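
To make the %p / %f substitution concrete, here is a minimal Python sketch of how such a template might be expanded. It is not the committed C code: the function name is hypothetical, treating %% as a literal percent sign is an assumption, and no shell quoting of the path is attempted.

```python
import os


def expand_archive_command(template: str, xlog_path: str) -> str:
    """Expand %p and %f in an archive_command-style template."""
    substitutions = {
        "p": xlog_path,                    # full path of the XLOG file
        "f": os.path.basename(xlog_path),  # base name of the XLOG file
        "%": "%",                          # assumption: %% is a literal %
    }
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template) and template[i + 1] in substitutions:
            out.append(substitutions[template[i + 1]])
            i += 2
        else:
            out.append(ch)  # anything else is copied through unchanged
            i += 1
    return "".join(out)
```

With the suggested example, expanding 'cp %p /mnt/server/pgarchive/%f' for a segment at /data/pg_xlog/0000000000000001 would give 'cp /data/pg_xlog/0000000000000001 /mnt/server/pgarchive/0000000000000001', so the destination is always a file name and never depends on /mnt/server/pgarchive being a directory that cp can copy into.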

(In case it's not obvious from the above, I am hacking with intent to
commit soon. Maybe tomorrow, if my wife doesn't make me paint the
bathroom instead...)

regards, tom lane

#88Mark Kirkwood
markir@coretech.co.nz
In reply to: Noname (#35)
PITR COPY Failure (was Point in Time Recovery)

I decided to produce a nice simple example, so that anyone could
hopefully replicate what I am seeing.

The scenario is the same as before (the 11 steps), but the CREATE TABLE
and COPY step has been reduced to:

CREATE TABLE test0 (filler VARCHAR(120));
COPY test0 FROM '/data0/dump/test0.dat' USING DELIMITERS ',';

Now the file 'test0.dat' consists of (128293) identical lines, each of
109 'a' characters (plus end of line)

A script to run the whole business can be found here :

http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz

(It will need a bit of editing for things like location of Pg, PGDATA,
and you will need to make your own data file)

The main points of interest are:
- anything <=128392 rows in test0.dat results in 1 archived log, and the
recovery succeeds
- anything >=128393 rows in test0.dat results in 2 or more archived
logs, and recovery fails on the second log (and gives the zero length
redo at 0/1FFFFE0 message).
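
For anyone making their own data file, a small Python sketch along these lines should produce one of the shape described above (identical lines of 109 'a' characters). The function name and its parameters are made up for the illustration; the row count is left as an argument so that both sides of the boundary can be tried.

```python
from pathlib import Path


def write_test_data(path: Path, rows: int, width: int = 109) -> None:
    """Write `rows` identical lines, each `width` 'a' characters long."""
    line = "a" * width + "\n"
    # newline="\n" keeps the line ending a single byte on every platform.
    with open(path, "w", newline="\n") as fh:
        for _ in range(rows):
            fh.write(line)
```

Calling write_test_data(Path("test0.dat"), rows) with a row count just below and just above the threshold reported above would give the two cases to compare.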

Let me know if I can do any more legwork on this (I am considering
re-compiling with WAL_DEBUG now that the example is simpler)

regards

Mark

Simon Riggs wrote:

On Thu, 2004-07-15 at 10:47, Mark Kirkwood wrote:

I tried what I thought was a straightforward scenario, and seem to have
broken it :-(

Here is the little tale

1) initdb
2) set archive_mode and archive_dest in postgresql.conf
3) startup
4) create database called 'test'
5) connect to 'test' and type 'checkpoint'
6) backup PGDATA using 'tar -zcvf'
7) create tables in 'test' and add data using COPY (exactly 2 logs worth)
8) shutdown and remove PGDATA
9) recover using 'tar -zxvf'
10) copy recovery.conf into PGDATA
11) startup

This is what I get :

LOG: database system was interrupted at 2004-07-15 21:24:04 NZST
LOG: recovery command file found...
LOG: restore_program = cp %s/%s %s
LOG: recovery_target_inclusive = true
LOG: recovery_debug_log = true
LOG: starting archive recovery
LOG: restored log file "0000000000000000" from archive
LOG: checkpoint record is at 0/A48054
LOG: redo record is at 0/A48054; undo record is at 0/0; shutdown FALSE
LOG: next transaction ID: 496; next OID: 25419
LOG: database system was not properly shut down; automatic recovery in
progress
LOG: redo starts at 0/A48094
LOG: restored log file "0000000000000001" from archive
LOG: record with zero length at 0/1FFFFE0
LOG: redo done at 0/1FFFF30
LOG: restored log file "0000000000000001" from archive
LOG: restored log file "0000000000000001" from archive
PANIC: concurrent transaction log activity while database system is
shutting down
LOG: startup process (PID 13492) was terminated by signal 6
LOG: aborting startup due to startup process failure

The concurrent access is a bit of a puzzle, as this is my home machine
(i.e. I am *sure* no one else is connected!)

I can see what is wrong now, but you'll have to help me on details your
end...

The log shows that xlog 1 was restored from archive. It contains a zero
length record, which indicates that it isn't yet full (or that's what the
existing recovery code assumes it means). That in turn indicates that it
should never have been archived in the first place, and should not
therefore be a candidate for a restore from archive.

The double message "restored log file" can only occur after you've
retrieved a partially full file from archive - which as I say, shouldn't
be there.

Other messages are essentially spurious in those circumstances.

Either:
- somehow the files have been mixed up in the archive directory, which
can happen in various ways if the filing discipline is not strict;
unfortunately I would guess this to be the most likely
- the file that has been restored has been damaged in some way
- the archiver has archived a file too early (very unlikely, IMHO -
that's the most robust bit of the code)
- some aspect of the code has written a zero length record to WAL (which
is supposed to not be possible, but we mustn't discount an error in
recent committed work)

- there may also be an effect going on with checkpoints that I don't
understand...spurious checkpoint warning messages have already been
observed and reported.

Best regards, Simon Riggs

#89Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Kirkwood (#88)
Re: PITR COPY Failure (was Point in Time Recovery)

Mark Kirkwood <markir@coretech.co.nz> writes:

- anything >=128393 rows in test0.dat results in 2 or more archived
logs, and recovery fails on the second log (and gives the zero length
redo at 0/1FFFFE0 message).

Zero length record is not an error, it's the normal way of detecting
end-of-log.
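
The idea of treating a zero length as "end of log" rather than as corruption can be shown with a toy reader in Python. The record layout here (a 4-byte little-endian length followed by the payload) is invented purely for the illustration and is not the real XLOG record format.

```python
import struct
from typing import List


def read_records(buf: bytes) -> List[bytes]:
    """Read length-prefixed records until a zero length marks end of log."""
    records = []
    pos = 0
    while pos + 4 <= len(buf):
        (length,) = struct.unpack_from("<I", buf, pos)
        if length == 0:
            break  # zeroed space: nothing more was written, not an error
        pos += 4
        records.append(buf[pos:pos + length])
        pos += length
    return records
```

A segment that was only partly filled ends in zeroed space, so the reader simply stops there; whether that segment should have been in the archive at all is a separate question.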

regards, tom lane

#90Mark Kirkwood
markir@coretech.co.nz
In reply to: Mark Kirkwood (#88)
Re: PITR COPY Failure (was Point in Time Recovery)

There are some silly bugs in the script:

- forgot to export PGDATA and PATH after changing them
- forgot to mention the need to edit test.sql (COPY line needs path to
dump file)

Apologies - I will submit a fixed version a little later

regards

Mark

Mark Kirkwood wrote:

A script to run the whole business can be found here :

http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz

(It will need a bit of editing for things like location of Pg, PGDATA,
and you will need to make your own data file)

#91Mark Kirkwood
markir@coretech.co.nz
In reply to: Mark Kirkwood (#90)
Re: PITR COPY Failure (was Point in Time Recovery)

fixed.

Mark Kirkwood wrote:

There are some silly bugs in the script:

- forgot to export PGDATA and PATH after changing them
- forgot to mention the need to edit test.sql (COPY line needs path to
dump file)

Apologies - I will submit a fixed version a little later

regards

Mark

Mark Kirkwood wrote:

A script to run the whole business can be found here :

http://homepages.paradise.net.nz/markir/download/pitr-bug.tar.gz

(It will need a bit of editing for things like location of Pg,
PGDATA, and you will need to make your own data file)

#92Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#87)
Re: [HACKERS] Point in Time Recovery

On Sun, 2004-07-18 at 06:04, Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

So you want to merge them all into a single command string. That does
seem less error-prone. I see a few variables that turn off
when set to '' like unix_socket_*. How would this command string work?
How do you specify the WAL file name to transfer?

GUC-wise, I implemented what we agreed in discussions...

There are many things in need of refactoring, so my focus was on
delivering what we agreed, even knowing it would probably change...

A few notes on the patch (as I submitted it - so as not to confuse with
other versions being worked upon)
- archive_dest is definitely used in both archive and recovery. There
wasn't much need for this GUC apart from that and I think we are better
off without it. Removing it improves recovery flexibility (we cannot
assume the recovery is taking place in anything like the original
configuration).

- archive_mode I would prefer to keep - it is explicit then which mode
you are in, rather than implicit from the command string. In all other
ways I agree with everything Tom has said. It allows us to talk about
"being in archive_mode" without people saying "but I can't work out how
to turn archive mode on".

When archiver starts the FIRST thing it does is run a test to confirm
that the command string works, so setting archive_command to '' would
simply generate an error.

Also, I would suggest this:
- changing archive mode requires a postmaster restart
- changing archive command should just be a SIGHUP...we don't want to
force a restart just to switch to a new kind of archiving

If you can only change archive_program at postmaster start that is
restrictive, but making that SIGHUP would allow people to set it to ''
and turn off archiving while postmaster is up == lurking fault.

No different from before, necessarily. However I did not like the
restriction to a single %s in the submitted implementation. What I
have in my local copy is
%p -> full path of XLOG file to be archived
%f -> base name of XLOG file to be archived
and the suggested example becomes
archive_command = 'cp %p /mnt/server/pgarchive/%f'

I'm happy with those changes and would have done them myself given
time... the 2 or 3 %s parameters weren't the most user friendly way of
doing it.

Note that this example immediately eliminates one of the failure modes
Simon enumerates in his README, which is to try 'cp %s /foo' where /foo
isn't a directory. More generally, though, *only* a cp-to-directory
solution is likely to be very happy with not being able to get at the
base file name. Yes you can make a shellscript and use basename,
but I don't think you should have to do that if it could otherwise
be a one-liner.

Good.

(In case it's not obvious from the above, I am hacking with intent to
commit soon. Maybe tomorrow, if my wife doesn't make me paint the
bathroom instead...)

...just returned from there... :)

Best Regards, Simon Riggs

#93Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#38)
Re: [HACKERS] Point in Time Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

Latest version, pitr_v5_2.patch...

Reviewed and committed with some adjustments.

I see the following significant loose ends:

* Documentation is, um, lacking. (One point in particular is that I
inserted the recovery.conf.sample file into CVS, but did not fill in
the patch's lack of attempt to install it anywhere.)

* As Bruce has pointed out already, the process of making a backup
needs some improvements for more safety: the starting and ending WAL
offsets have got to be recorded somehow.

* As I have pointed out already, we need to invent "timelines" to
allow incompatible WAL segments to exist side-by-side. I will volunteer
to look into this.

* I think creating a .ready file during XLogFileOpen is completely bogus,
for reasons mentioned in committed comments (look for XXX). Possibly
this can go away with timelines.

* I am wondering if it wouldn't be a good idea to remove the local copy
of any segment we successfully obtain from archive. The existing
comments note that we might get a wrong or corrupted file from archive,
but aren't we in at least as much risk of using an obsolete segment
restored from backup if we leave the local segment in place? (The
archive recovery run itself will know not to do this, but if we crash
shortly thereafter, the ensuing recovery run would NOT know not to
trust such files.)

Perhaps the last point is really a backup-process issue. AFAICS there
is no good reason for a backup tarfile to include $PGDATA/pg_xlog at
all, and some good reasons for it not to. Can we redesign either the
backup process or the disk layout so that that will not happen? Then
we could stop worrying about stale local pg_xlog files.

regards, tom lane

#94Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#92)
Re: [HACKERS] Point in Time Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

When archiver starts the FIRST thing it does is run a test to confirm
that the command string works, so setting archive_command to '' would
simply generate an error.

No, it would do no such thing; the test cannot really tell anything more
than whether system("foo") returns zero ... and at least on my machine,
system("") returns zero. It certainly does not prove that any data went
to anyplace safe.
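
The limitation is easy to reproduce. The Python sketch below (function name hypothetical) checks only the exit status of a shell command, which is all a startup test of this kind can observe; on a typical POSIX shell an empty command string exits with status zero.

```python
import subprocess


def command_reports_success(command: str) -> bool:
    """Return True when the shell exits with status 0 for `command`."""
    return subprocess.run(command, shell=True).returncode == 0
```

On such a system command_reports_success("") is expected to return True even though nothing was copied anywhere, which is why a zero exit status from a test run proves nothing about where the data went.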

I diked that test out of the committed patch because I felt it cluttered
the archive area without actually proving anything of interest. We can
revisit the point if you like.

Also, I would suggest this:
- changing archive mode requires a postmaster restart

Why?

- changing archive command should just be a SIGHUP...

Check, as committed [and tested to work...]

regards, tom lane

#95Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#87)
Re: [HACKERS] Point in Time Recovery

What is the process of logging to tape? Ideally we could just do 'dd'
to the tape drive in append mode; however we need a way of signalling
that we want to change tapes.

The only method I can think of is to have PITR dump the files into a
holding directory, and have a daemon that scans the directory and writes
files to tape when they are completely copied (how do we detect that?
Use 'mv' after the copy? Seems like a good use for our new %
parameters). Then we need a control program to signal the daemon to
stop archiving to tape, have it set a flag file so we know it has
suspended tape writes, report that back to the client, change tapes,
then tell it to restart.
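
One way to answer the "how do we detect that?" question is the copy-then-rename idea mentioned above. A rough Python sketch follows, assuming the holding directory sits on a single filesystem so that the rename is atomic; the ".partial" suffix and both function names are hypothetical.

```python
import os
import shutil
from pathlib import Path
from typing import List


def stage_segment(src: Path, holding_dir: Path) -> Path:
    """Copy `src` into `holding_dir` so it appears only once complete."""
    final = holding_dir / src.name
    partial = holding_dir / (src.name + ".partial")  # hypothetical suffix
    shutil.copyfile(src, partial)  # may be observed half-written
    os.replace(partial, final)     # atomic rename within one filesystem
    return final


def completed_segments(holding_dir: Path) -> List[Path]:
    """Files a tape-writing daemon could safely pick up."""
    return sorted(p for p in holding_dir.iterdir() if p.suffix != ".partial")
```

A daemon scanning the directory would then ignore anything still carrying the temporary suffix and write only the fully copied files to tape.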

I am asking to make sure we don't need a PITR pause mode that prevents
WAL files from being archived but also prevents them from being
recycled. If we did that, we could probably append to tape directly,
but then we need to go into 'pause archive' mode in the PITR process,
and such switching seems like a pain and the wrong place to do it.

---------------------------------------------------------------------------

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

So you want to merge them all into a single command string. That does
seem less error-prone. I see a few variables that turn off
when set to '' like unix_socket_*. How would this command string work?
How do you specify the WAL file name to transfer?

No different from before, necessarily. However I did not like the
restriction to a single %s in the submitted implementation. What I
have in my local copy is
%p -> full path of XLOG file to be archived
%f -> base name of XLOG file to be archived
and the suggested example becomes
archive_command = 'cp %p /mnt/server/pgarchive/%f'

Note that this example immediately eliminates one of the failure modes
Simon enumerates in his README, which is to try 'cp %s /foo' where /foo
isn't a directory. More generally, though, *only* a cp-to-directory
solution is likely to be very happy with not being able to get at the
base file name. Yes you can make a shellscript and use basename,
but I don't think you should have to do that if it could otherwise
be a one-liner.

(In case it's not obvious from the above, I am hacking with intent to
commit soon. Maybe tomorrow, if my wife doesn't make me paint the
bathroom instead...)

regards, tom lane

#96Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#93)
Re: [HACKERS] Point in Time Recovery

Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

Latest version, pitr_v5_2.patch...

Reviewed and committed with some adjustments.

I see the following significant loose ends:

* Documentation is, um, lacking. (One point in particular is that I
inserted the recovery.conf.sample file into CVS, but did not fill in
the patch's lack of attempt to install it anywhere.)

I figure it should go in share like the other sample files, and tell
people to copy it to /data and modify it for recovery.

* As Bruce has pointed out already, the process of making a backup
needs some improvements for more safety: the starting and ending WAL
offsets have got to be recorded somehow.

Yep, we need those files in the archive location and the /data directory
tarball.

* As I have pointed out already, we need to invent "timelines" to
allow incompatible WAL segments to exist side-by-side. I will volunteer
to look into this.

Great.

* I think creating a .ready file during XLogFileOpen is completely bogus,
for reasons mentioned in committed comments (look for XXX). Possibly
this can go away with timelines.

* I am wondering if it wouldn't be a good idea to remove the local copy
of any segment we successfully obtain from archive. The existing
comments note that we might get a wrong or corrupted file from archive,
but aren't we in at least as much risk of using an obsolete segment
restored from backup if we leave the local segment in place? (The
archive recovery run itself will know not to do this, but if we crash
shortly thereafter, the ensuing recovery run would NOT know not to
trust such files.)

Perhaps the last point is really a backup-process issue. AFAICS there
is no good reason for a backup tarfile to include $PGDATA/pg_xlog at
all, and some good reasons for it not to. Can we redesign either the
backup process or the disk layout so that that will not happen? Then
we could stop worrying about stale local pg_xlog files.

Seems we should just clear out the /pg_xlog directory before we start
recovery. We are going to rename recovery.conf to recovery.in-progress
or something to prevent us from clearing out the directory after a
crash, right? (I see you rename recovery.conf to recovery.done. Is
that wise? I thought we would disable recovery after a crash, or does
it just keep going? If so, nice.)

#97Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#96)
Re: [HACKERS] Point in Time Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

* Documentation is, um, lacking. (One point in particular is that I
inserted the recovery.conf.sample file into CVS, but did not fill in
the patch's lack of attempt to install it anywhere.)

I figure it should go in share like the other sample files, and tell
people to copy it to /data and modify it for recovery.

It should certainly go to /share as a .sample file. I was thinking that
initdb should perhaps copy it into $PGDATA (still as .sample, not as
.conf!) so it'd be right there when you need it.

Perhaps the last point is really a backup-process issue. AFAICS there
is no good reason for a backup tarfile to include $PGDATA/pg_xlog at
all, and some good reasons for it not to.

Seems we should just clear out the /pg_xlog directory before we start
recovery.

No, that's a horrid idea, because it loses the ability to combine
archival xlog files with recent files in /pg_xlog that are not yet
archived. We need to distinguish old files that were accidentally
captured by backup from very-recent files. I think the cleanest way to
do that is for backup not to capture them in the first place.

We are going to rename recovery.conf to recovery.in-progress
or something to prevent us from clearing out the directory after a
crash, right?

I had second thoughts about that and didn't do it in the committed
patch, though it's certainly still open for debate.

(I see you rename recovery.conf to recovery.done. Is
that wise?

Yes. Once you've done with a PITR recovery you definitely do *not* want
a subsequent crash recovery to think it should obey your recovery_target
limit. But if you fail before you've finished the recovery run it
should theoretically be okay to retry, so I didn't add code to rename to
"recovery.inprogress". We can certainly add it later if we decide it's
a good idea.

regards, tom lane

#98Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#95)
Re: [HACKERS] Point in Time Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

What is the process of logging to tape? Ideally we could just do 'dd'
to the tape drive in append mode; however we need a way of signalling
that we want to change tapes.

The reason we use a user-specifiable shell command for archiving is
so that we do not have to answer the above ;-). It's the user's problem
to write a shell script that does things the way he wants. He can make
it connect to /dev/tty and ask the operator to swap tapes, or whatever.

Personally I am very accustomed to Hewlett-Packard's disk-to-tape backup
program "fbackup", which allows you to provide a shell script to handle
exactly this sort of thing, and it's worked well for me for many years.

I am asking to make sure we don't need a PITR pause mode that prevents
WAL files from being archived but also prevents them from being
recycled.

WAL files will not be recycled until the archiver daemon has set a .done
flag file for them, so I see no problem here. (Note: I took out some
code in Simon's original patch that would start bleating on the basis
of totally unsupportable assumptions about how long archival of a log
segment "ought to" take.)
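
The recycling rule described above can be sketched roughly as follows. This is
only an illustration in Python, not the server's code: the helper name
recyclable_segments, the directory holding the flag files, and the
"<segment>.done" naming are all assumptions made for the example.

```python
import os
import tempfile

def recyclable_segments(segment_names, status_dir):
    """Return the segments that have a matching .done flag file in status_dir."""
    return [name for name in segment_names
            if os.path.exists(os.path.join(status_dir, name + ".done"))]

with tempfile.TemporaryDirectory() as status_dir:
    # Hypothetical state: the archiver has finished only the first segment.
    open(os.path.join(status_dir, "0000000000000000.done"), "w").close()
    segments = ["0000000000000000", "0000000000000001"]
    safe_to_recycle = recyclable_segments(segments, status_dir)
```

A slow archiver therefore only delays recycling; it never causes an unarchived
segment to be reused.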

regards, tom lane

#99Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#94)
Re: [HACKERS] Point in Time Recovery

On Mon, 2004-07-19 at 04:13, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

When archiver starts the FIRST thing it does is run a test to confirm
that the command string works, so setting archive_command to '' would
simply generate an error.

No, it would do no such thing; the test cannot really tell anything more
than whether system("foo") returns zero ... and at least on my machine,
system("") returns zero. It certainly does not prove that any data went
to anyplace safe.
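
The system("") behaviour is easy to check with a small sketch. This uses
Python's subprocess module rather than the C call under discussion, and
command_exit_status is a hypothetical helper name. The zero status for an
empty command is what a POSIX shell gives; other platforms may differ.

```python
import subprocess

def command_exit_status(command: str) -> int:
    """Run a command string through the shell, as system() would, and return its exit status."""
    return subprocess.run(command, shell=True).returncode

# An empty command string "succeeds" on a POSIX shell, so a zero status
# says nothing about whether any WAL data reached a safe place.
empty_status = command_exit_status("")
failing_status = command_exit_status("exit 3")
```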

I diked that test out of the committed patch because I felt it cluttered
the archive area without actually proving anything of interest. We can
revisit the point if you like.

If the test doesn't guarantee success, then it needs to go....

Thanks for removing it.

Best Regards, Simon Riggs

#100Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#93)
Re: [HACKERS] Point in Time Recovery

On Mon, 2004-07-19 at 04:03, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

Latest version, pitr_v5_2.patch...

Reviewed and committed with some adjustments.

Wow! Thanks very much - you work fast.

I'll be re-testing later today.

I see the following significant loose ends:

* Documentation is, um, lacking. (One point in particular is that I
inserted the recovery.conf.sample file into CVS, but did not fill in
the patch's lack of attempt to install it anywhere.)

Yes...wasn't sure what to do with that. Is everybody happy to install it
as a sample into the main Data Directory? (i.e. as recovery.conf.sample
rather than recovery.conf which would be a bad thing).

* As Bruce has pointed out already, the process of making a backup
needs some improvements for more safety: the starting and ending WAL
offsets have got to be recorded somehow.

Haven't got to that yet, but will do.

* As I have pointed out already, we need to invent "timelines" to
allow incompatible WAL segments to exist side-by-side. I will volunteer
to look into this.

Yes, discussing on the other thread.

* I think creating a .ready file during XLogFileOpen is completely bogus,
for reasons mentioned in committed comments (look for XXX). Possibly
this can go away with timelines.

Yes, to some extent it would go away with timelines.

If you have a local copy at the end of a timeline that isn't archived,
then it seems a good idea to archive it, or at least copy it somewhere
safe. If you don't then you will not be able to revert to a full
recovery of that timeline in the future should you choose to do so.

The code and its location may be somewhat more suspect.... :)

* I am wondering if it wouldn't be a good idea to remove the local copy
of any segment we successfully obtain from archive. The existing
comments note that we might get a wrong or corrupted file from archive,
but aren't we in at least as much risk of using an obsolete segment
restored from backup if we leave the local segment in place? (The
archive recovery run itself will know not to do this, but if we crash
shortly thereafter, the ensuing recovery run would NOT know not to
trust such files.)

I agree they're a loose end that needs some thought.

I avoided that decision by going around the files. We originally agreed
that we would keep that data....reason was you can't tell whether the
files have been restored by a backup that forgot to exclude pg_xlog, or
that we are choosing to do a PITR recovery on an otherwise healthy
system (or as the comments explain maybe we lost everything except
pg_xlog).

If we crash during recovery, it doesn't crash-recover and restart.

If we crash after recovery, then the checkpoint record will have moved
forward and so we don't then accidentally re-use those local copies.

Timelines will solve this...

Perhaps the last point is really a backup-process issue. AFAICS there
is no good reason for a backup tarfile to include $PGDATA/pg_xlog at
all, and some good reasons for it not to. Can we redesign either the
backup process or the disk layout so that that will not happen? Then
we could stop worrying about stale local pg_xlog files.

That's the way I saw it.

Seems fairly easy to say "don't backup pg_xlog", but you can't guarantee
they won't, even if you tell them not to...

What is stale today may be considered to be actually your best option
when testing to see whether a recovery has achieved your objectives.

I'll read the whole patch, your comments and test before I respond
further. Thanks for working so hard on this, so quickly.

Best Regards, Simon Riggs

#101Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#97)
Re: [HACKERS] Point in Time Recovery

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

* Documentation is, um, lacking. (One point in particular is that I
inserted the recovery.conf.sample file into CVS, but did not fill in
the patch's lack of attempt to install it anywhere.)

I figure it should go in share like the other sample files, and tell
people to copy it to /data and modify it for recovery.

It should certainly go to /share as a .sample file. I was thinking that
initdb should perhaps copy it into $PGDATA (still as .sample, not as
.conf!) so it'd be right there when you need it.

I think /share is best. I see other *.sample files that aren't used until
you rename them and move them to the right directory, and
recovery.conf.sample seems the same. I think having the sample at the
top of data when for most people it will be unused is strange.

Perhaps the last point is really a backup-process issue. AFAICS there
is no good reason for a backup tarfile to include $PGDATA/pg_xlog at
all, and some good reasons for it not to.

Seems we should just clear out the /pg_xlog directory before we start
recovery.

No, that's a horrid idea, because it loses the ability to combine
archival xlog files with recent files in /pg_xlog that are not yet
archived. We need to distinguish old files that were accidentally
captured by backup from very-recent files. I think the cleanest way to
do that is for backup not to capture them in the first place.

I am confused. Aren't we always doing a restore from a backup? Are you
saying there are cases where we aren't and need the stuff in pg_xlog?
Are you saying we might have some new WAL files that we want to add to
pg_xlog before we do the restore, like the most recent WAL that wasn't
archived because it wasn't finished? Why would we be doing a recovery if
we had such files? I see your point that we wouldn't know which file
to use, the archive version or the pg_xlog version, but actually
wouldn't the archive version always be preferred because we would know
it to be complete?

I don't see any reliable way to prevent people from having pg_xlog in
their backups seeing they might use snapshots, tar, etc.

We are going to rename recovery.conf to recovery.in-progress
or something to prevent us from clearing out the directory after a
crash, right?

I had second thoughts about that and didn't do it in the committed
patch, though it's certainly still open for debate.

How are we handling a crash during recovery?

(I see you rename recovery.conf to recovery.done. Is
that wise?

Yes. Once you've done with a PITR recovery you definitely do *not* want
a subsequent crash recovery to think it should obey your recovery_target
limit. But if you fail before you've finished the recovery run it
should theoretically be okay to retry, so I didn't add code to rename to
"recovery.inprogress". We can certainly add it later if we decide it's
a good idea.

Ah, OK, so it just keeps going. However, we don't know if what is in
pg_xlog was in the process of being copied from the archive at the time
of the crash, no? In fact I am wondering if we should be transferring
the archive files into temporary names then doing an 'mv' to make them
current so we don't get partial files in pg_xlog. However, we can't do
that because we are using a user-supplied command line. Should we pass
a fake name to the command string then do the 'mv' ourselves? With WAL
now, we do an fsync so we know the contents are crash-proof, but I am
not sure how to do that during recovery. I guess this gets back to how
to handle the contents of pg_xlog during recovery.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#102Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#101)
Re: [HACKERS] Point in Time Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

It should certainly go to /share as a .sample file. I was thinking that
initdb should perhaps copy it into $PGDATA (still as .sample, not as
.conf!) so it'd be right there when you need it.

I think /share is best.

Okay, we agree on that part at least; I'll take care of it. If anyone
wants to argue for further copying during initdb, that can be added
later.

I am confused. Aren't we always doing a restore from a backup?

No. This code serves two purposes: recovery from archived WAL and
point-in-time recovery. You might want to do a PITR run at a time
where not all your WAL segments have been pushed to archive. Indeed
the latest one can never be so pushed, since it's unfinished. Suppose
you are trying to do PITR recovery to a time just a few minutes ago
that is still in the latest WAL segment --- there is simply not any
legal way to have that come from the archive.

So we can't simply zero out pg_xlog at the start of a PITR run, even
if there weren't a don't-destroy-data argument against it.

I had second thoughts about that and didn't do it in the committed
patch, though it's certainly still open for debate.

How are we handling a crash during recovery?

Retry, perhaps. It doesn't seem any different from crash-during-recovery
in the non-archived scenario ...

Ah, OK, so it just keeps going. However, we don't know if what is in
pg_xlog was in the process of being copied from the archive at the time
of the crash, no?

Nonissue. It goes into RECOVERYXLOG and we never assume that that's
initially good. See RestoreArchivedXLog().

regards, tom lane

#103Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#102)
Re: [HACKERS] Point in Time Recovery

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

It should certainly go to /share as a .sample file. I was thinking that
initdb should perhaps copy it into $PGDATA (still as .sample, not as
.conf!) so it'd be right there when you need it.

I think /share is best.

Okay, we agree on that part at least; I'll take care of it. If anyone
wants to argue for further copying during initdb, that can be added
later.

I am confused. Aren't we always doing a restore from a backup?

No. This code serves two purposes: recovery from archived WAL and
point-in-time recovery. You might want to do a PITR run at a time
where not all your WAL segments have been pushed to archive. Indeed
the latest one can never be so pushed, since it's unfinished. Suppose
you are trying to do PITR recovery to a time just a few minutes ago
that is still in the latest WAL segment --- there is simply not any
legal way to have that come from the archive.

So we can't simply zero out pg_xlog at the start of a PITR run, even
if there weren't a don't-destroy-data argument against it.

If we had some code that checks pg_xlog on recovery startup, it could
rename each pg_xlog file and then recover the file from the archive. If
it doesn't exist or is truncated, discard it. If it is the right size,
we need to check to see which one has a WAL eof-of-segment marker (we
have one of those, right?). This would seem to catch all the cases:

o file brought back by tar, but complete file in archive
o archive in process of writing during crash
o partially full file in pg_xlog

What it doesn't cover are cases where tar gets a partial copy of a
pg_xlog file but the file never made it to archive yet, and a new
pg_xlog file was created and we get some of that file too. In fact, the
backup could get holes in the pg_xlog file where the backup has zeros
but the real file had data added to it after the zeros:

in tar XXXXX 00000 XXXXX

real XXXXX XXXXX XXXXX

This could happen when file has this:

XXXXX 00000 00000

backup reads this:

XXXXX 00000

database writes this:

XXXXX XXXXX XXXXX

backup reads the remainder of the file:

XXXXX 00000 XXXXX

In this case the end-of-segment marker doesn't even help us, and there
might not be an archive copy of this because it didn't happen yet.
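
The interleaving above can be reproduced with a small simulation. This is a
sketch only: it uses 5-character blocks and the character '0' in place of real
zero bytes to match the diagrams, and read_block is a hypothetical stand-in for
whatever the backup tool does.

```python
BLOCK = 5  # bytes per block, matching the 5-character groups above

def read_block(data: bytearray, index: int) -> bytes:
    """Copy one block out of the live file, as a file-level backup would."""
    return bytes(data[index * BLOCK:(index + 1) * BLOCK])

# The live WAL file: first block written, the rest still zero-filled.
wal_file = bytearray(b"XXXXX" + b"00000" + b"00000")

# The backup reads the first two blocks...
backup = read_block(wal_file, 0) + read_block(wal_file, 1)

# ...the database then fills in the second and third blocks...
wal_file[BLOCK:3 * BLOCK] = b"XXXXX" + b"XXXXX"

# ...and the backup reads the remainder of the file.
backup += read_block(wal_file, 2)

torn_copy = backup.decode()      # what ends up in the tar file
real_file = wal_file.decode()    # what is actually on disk
```

The copy in the tar file has the right length and valid data at both ends, so
a size check alone would not notice the hole in the middle.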

I think I see a solution. We are going to create a file during backup so
we know the wal offsets and xids. If we see that file, we know either
we have a restore of a backup or they are currently running a backup. If we
tell them not to restore while a backup is running (seems pretty
obvious) we can then delete pg_xlog when the backup wal offset file
exists. In other cases, we know the WAL files are valid to use.

#104Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#98)
Re: [HACKERS] Point in Time Recovery

On Mon, 2004-07-19 at 05:54, Tom Lane wrote:

code in Simon's original patch that would start bleating

Code that bleats? LOL :) (is that a new log level?)

Some of it was perhaps a little woolly....

You've made my day, Simon Riggs (still laughing)

#105Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#103)
Re: [HACKERS] Point in Time Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

we need to check to see which one has a WAL eof-of-segment marker (we
have one of those, right?).

No, we don't.

I think I see a solution. We are going to create a file during backup so
we know the wal offsets and xids. If we see that file, we know either
we have a restore of a backup or they are currently running a backup.

... or the last backup attempt failed, but they forgot to remove the
file it left. Or we are doing crash recovery after the system lost
power while a backup was running. Or half a dozen other obvious scenarios.

If we tell them not to restore while a backup is running (seems pretty
obvious) we can then delete pg_xlog when the backup wal offset file
exists. In other cases, we know the WAL files are valid to use.

We're not deleting pg_xlog, period. IMHO it's too dangerous even to
have such a function in the code.

My original suggestion was to *replace* individual xlog files with data
extracted from archive, and only after determining that the archive
indeed has a copy of that particular file (and we can fetch it).
This at least has a fighting chance of not losing information. Wiping
pg_xlog in toto on the basis of a guess about the system status is just
a form of russian roulette. Sooner or later you will wipe some xlog
files that you can't get back from archive.

regards, tom lane

#106Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#105)
Re: [HACKERS] Point in Time Recovery

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

we need to check to see which one has a WAL eof-of-segment marker (we
have one of those, right?).

No, we don't.

I think I see a solution. We are going to create a file during backup so
we know the wal offsets and xids. If we see that file, we know either
we have a restore of a backup or they are currently running a backup.

... or the last backup attempt failed, but they forgot to remove the
file it left. Or we are doing crash recovery after the system lost
power while a backup was running. Or half a dozen other obvious scenarios.

If we tell them not to restore while a backup is running (seems pretty
obvious) we can then delete pg_xlog when the backup wal offset file
exists. In other cases, we know the WAL files are valid to use.

We're not deleting pg_xlog, period. IMHO it's too dangerous even to
have such a function in the code.

My original suggestion was to *replace* individual xlog files with data
extracted from archive, and only after determining that the archive
indeed has a copy of that particular file (and we can fetch it).
This at least has a fighting chance of not losing information. Wiping
pg_xlog in toto on the basis of a guess about the system status is just
a form of russian roulette. Sooner or later you will wipe some xlog
files that you can't get back from archive.

OK, if you don't want to place restrictions on recovery, fine, but how
do you handle the situation where you backup but the WAL file has holes
in the tar backup but you don't have an archive file to use because it
didn't make it to the archive before the drive died? Can we detect
holes in the WAL file recovered from backup? We might, but I am afraid
we might not.

#107Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#102)
Re: [HACKERS] Point in Time Recovery

On Mon, 2004-07-19 at 17:56, Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Tom Lane wrote:

I had second thoughts about that and didn't do it in the committed
patch, though it's certainly still open for debate.

How are we handling a crash during recovery?

Retry, perhaps. It doesn't seem any different from crash-during-recovery
in the non-archived scenario ...

Well, a recovery is just re-applying already written logs at super
speed. We don't need to write WAL because we already wrote it once (and
that would really confuse the timeline issue).

I think if this was an issue, the solution would be to speed up recovery
since that would benefit us more than putting recovery-squared code in.

Just start over...

Best Regards, Simon Riggs

#108Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#80)
Re: Point in Time Recovery

On Sat, 2004-07-17 at 00:57, Bruce Momjian wrote:

OK, I think I have some solid ideas and reasons for them.

Sorry for taking so long to reply...

First, I think we need server-side functions to call when we start/stop
the backup. The advantage of these server-side functions is that they
will do the required work of recording the pg_control values and
creating needed files with little chance for user error. It also allows
us to change the internal operations in later releases without requiring
admins to change their procedures. We are even able to adjust the
internal operation in minor releases without forcing a new procedure on
users.

Yes, I think we should go down this route. ....there's a "but" and that
is we don't absolutely need it for correctness....and so I must decline
adding it to THIS release. I don't imagine I'll stop being associated with
this code for a while yet....

Can we recommend that users should expect to have to call a start and
end backup routine in later releases? Don't expect you'll agree to
that..

Second, I think once we start a restore, we should rename recovery.conf
to recovery.in_progress, and when complete rename that to
recovery.done. If the postmaster starts and sees recovery.in_progress,
it will fail to start knowing its recovery was interrupted. This allows
the admin to take appropriate action. (I am not sure what that action
would be. Does he bring back the backup files or just keep going?)

Superseded by Tom's actions. Two states are required: start and stop.
Recovery isn't going to be checkpoint-restartable anytime soon, IMHO.

Best regards, Simon Riggs

#109Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#108)
Re: Point in Time Recovery

Bruce and I had another phone chat about the problems that can ensue
if you restore a tar backup that contains old (incompletely filled)
versions of WAL segment files. While the current code will ignore them
during the recovery-from-archive run, leaving them lying around seems
awfully dangerous. One nasty possibility is that the archiving
mechanism will pick up these files and overwrite good copies in the
archive area with the obsolete ones from the backup :-(.

Bruce earlier proposed that we simply "rm pg_xlog/*" at the start of
a recovery-from-archive run, but as I said I'm scared to death of code
that does such a thing automatically. In particular this would make it
impossible to handle scenarios where you want to do a PITR recovery but
you need to use some recent WAL segments that didn't make it into your
archive yet. (Maybe you could get around this by forcibly transferring
such segments into the archive, but that seems like a bad idea for
incomplete segments.)

It would really be best for the DBA to make sure that the starting
condition for the recovery run does not have any obsolete segment files
in pg_xlog. He could do this either by setting up his backup policy so
that pg_xlog isn't included in the tar backup in the first place, or by
manually removing the included files just after restoring the backup,
before he tries to start the recovery run.

Of course the objection to that is "what if the DBA forgets to do it?"

The idea that we came to on the phone was for the postmaster, when it
enters recovery mode because a recovery.conf file exists, to look in
pg_xlog for existing segment files and refuse to start if any are there
--- *unless* the user has put a special, non-default overriding flag
into recovery.conf.  Call it "use_unarchived_files" or something like
that.  We'd have to provide good documentation and an extensive HINT of
course, but basically the DBA would have two choices when he gets this
refusal to start:

1. Remove all the segment files in pg_xlog. (This would be the right
thing to do if he knows they all came off the backup.)

2. Verify that pg_xlog contains only segment files that are newer than
what's stored in the WAL archive, and then set the override flag in
recovery.conf. In this case the DBA is taking responsibility for
leaving only segment files that are good to use.
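
As a rough sketch of the proposed startup rule (not committed code): the
function name recovery_start_decision is hypothetical, use_unarchived_files is
the tentative flag name suggested above, and treating every regular file in
pg_xlog as a segment file is a simplifying assumption.

```python
import os
import tempfile

def recovery_start_decision(pg_xlog_dir, use_unarchived_files=False):
    """Return "start" or "refuse" for a recovery run, per the proposed rule."""
    # Assumption: treat every regular file in pg_xlog as a segment file.
    local_segments = [name for name in os.listdir(pg_xlog_dir)
                      if os.path.isfile(os.path.join(pg_xlog_dir, name))]
    if local_segments and not use_unarchived_files:
        return "refuse"   # DBA must remove the files or set the override flag
    return "start"

with tempfile.TemporaryDirectory() as pg_xlog:
    empty_dir = recovery_start_decision(pg_xlog)
    open(os.path.join(pg_xlog, "0000000000000003"), "w").close()
    stale_files = recovery_start_decision(pg_xlog)
    overridden = recovery_start_decision(pg_xlog, use_unarchived_files=True)
```

Nothing is deleted automatically in this scheme; the only automatic action is
the refusal to start.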

One interesting point is that with such a policy, we could use locally
available WAL segments in preference to pulling the same segments from
archive, which would be at least marginally more efficient, and seems
logically cleaner anyway.

In particular it seems that this would be a useful arrangement in cases
where you have questionable WAL segments --- you're not sure if they're
good or not. Rather than having to push questionable data into your WAL
archive, you can leave it local, try a recovery run, and see if you like
the resulting state. If not, it's a lot easier to do-over when you have
not corrupted your archive area.

Comments? Better ideas?

regards, tom lane

#110Mark Kirkwood
markir@coretech.co.nz
In reply to: Mark Kirkwood (#88)
Re: PITR COPY Failure (was Point in Time Recovery)

I have been doing some re-testing with CVS HEAD from about 1 hour ago
using the simplified example posted previously.

It is quite interesting:

i) create the table as:

CREATE TABLE test0 (filler TEXT);

and COPY 100 000 rows of length 109, then recovery succeeds.

ii) create the table as:

CREATE TABLE test0 (filler VARCHAR(120));

and COPY as above, then recovery *fails* with the signal 6 error below.

LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/A4807C
LOG: record with zero length at 0/FFFFE0
LOG: redo done at 0/FFFF30
LOG: restored log file "0000000000000000" from archive
LOG: archive recovery complete
PANIC: concurrent transaction log activity while database system is shutting down
LOG: startup process (PID 17546) was terminated by signal 6
LOG: aborting startup due to startup process failure

(I am pretty sure both TEXT and VARCHAR(120) failed using the original
patch)

Any suggestions for the best way to dig a bit deeper?

regards

Mark

#111Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#102)
Re: [HACKERS] Point in Time Recovery

Okay, we agree on that part at least; I'll take care of it. If anyone
wants to argue for further copying during initdb, that can be added
later.

I reckon it should be copied into $PGDATA :) Otherwise, when I'm in a
panic at recovery time, I'd have to figure out where the heck my package
has installed the share conf file to, conf files usually aren't in
share, etc., etc.

Chris

#112Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#109)
Re: Point in Time Recovery

I've got a PITR set up here that's happily scp'ing WAL files across to
another machine. However, the NIC in the machine is currently stuffed,
so it gets like 50k/s :) What happens in general if you are consistently
generating WAL file bytes faster than they can be copied off?

Also, does the archive dir just basically keep filling up forever? How
do I know when I can prune some files? Anything older than the last
full backup?

Chris

#113Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#112)
Re: Point in Time Recovery

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

I've got a PITR set up here that's happily scp'ing WAL files across to
another machine. However, the NIC in the machine is currently stuffed,
so it gets like 50k/s :) What happens in general if you are consistently
generating WAL file bytes faster than they can be copied off?

If you keep falling further and further behind, eventually your pg_xlog
directory will fill the space available on its disk, and I think at that
point PG will panic and shut down because it can't create any more xlog
segments.

Also, does the archive dir just basically keep filling up forever? How
do I know when I can prune some files? Anything older than the last
full backup?

Anything older than the starting checkpoint of the last full backup that
you might want to restore to. We need to adjust the backup procedure so
that the starting segment number for a backup is more readily visible;
see recent discussions about logging that explicitly in some fashion.
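
Once the starting segment of a backup is known, the pruning rule could be
sketched like this. The helper prunable_segments is hypothetical, and the
sketch assumes segment files carry fixed-width uppercase hex names (like the
"0000000000000000" seen in the recovery log earlier in the thread), so that
plain string comparison matches segment order.

```python
def prunable_segments(archived_names, backup_start_segment):
    """Return archived segments older than the backup's starting segment."""
    # Fixed-width uppercase hex names sort in the same order as their numeric values.
    return sorted(name for name in archived_names if name < backup_start_segment)

archive = ["0000000000000002", "0000000000000000", "0000000000000003", "0000000000000001"]
to_prune = prunable_segments(archive, "0000000000000002")
```

The starting segment passed in should belong to the oldest backup you might
still want to restore, not simply the most recent one.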

regards, tom lane

#114Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#113)
Re: Point in Time Recovery

If you keep falling further and further behind, eventually your pg_xlog
directory will fill the space available on its disk, and I think at that
point PG will panic and shut down because it can't create any more xlog
segments.

Hang on, are you supposed to MOVE or COPY away WAL segments?

Chris

#115Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Christopher Kings-Lynne (#114)
Re: Point in Time Recovery

Christopher Kings-Lynne wrote:

If you keep falling further and further behind, eventually your pg_xlog
directory will fill the space available on its disk, and I think at that
point PG will panic and shut down because it can't create any more xlog
segments.

Hang on, are you supposed to MOVE or COPY away WAL segments?

Copy. pg will delete them once they are archived.

#116Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#114)
Re: Point in Time Recovery

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

If you keep falling further and further behind, eventually your pg_xlog
directory will fill the space available on its disk, and I think at that
point PG will panic and shut down because it can't create any more xlog
segments.

Hang on, are you supposed to MOVE or COPY away WAL segments?

COPY. The checkpoint code will then delete or recycle the segment file,
as appropriate.

regards, tom lane

#117Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#116)
Re: Point in Time Recovery

Hang on, are you supposed to MOVE or COPY away WAL segments?

COPY. The checkpoint code will then delete or recycle the segment file,
as appropriate.

So what happens if you just move it? Postgres breaks?

Chris

#118Tom Lane
tgl@sss.pgh.pa.us
In reply to: Christopher Kings-Lynne (#117)
Re: Point in Time Recovery

Christopher Kings-Lynne <chriskl@familyhealth.com.au> writes:

Hang on, are you supposed to MOVE or COPY away WAL segments?

COPY. The checkpoint code will then delete or recycle the segment file,
as appropriate.

So what happens if you just move it? Postgres breaks?

I don't think so, but it seems like a much less robust way to do things.
What happens if you have a failure partway through? For instance
archive machine dies and loses recent data right after you've rm'd the
source file. The recommended COPY procedure at least provides some
breathing room between when you install the data on the archive and when
the original file is removed.
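
A copy-style archive step might look roughly like the sketch below. It is an
illustration under stated assumptions, not a recommended script:
archive_by_copy and the ".partial" temporary name are hypothetical, and
refusing to overwrite an existing archive file is a design choice of the
sketch rather than something the server requires.

```python
import filecmp
import os
import shutil
import tempfile

def archive_by_copy(source_path, archive_dir):
    """Copy a segment into archive_dir; return 0 on success, 1 on failure."""
    target = os.path.join(archive_dir, os.path.basename(source_path))
    if os.path.exists(target):
        return 1                      # never overwrite an archived segment
    partial = target + ".partial"     # hypothetical temporary name
    shutil.copyfile(source_path, partial)
    if not filecmp.cmp(source_path, partial, shallow=False):
        os.remove(partial)
        return 1
    os.rename(partial, target)        # publish only a complete copy
    return 0                          # the source file is left untouched

with tempfile.TemporaryDirectory() as xlog_dir, tempfile.TemporaryDirectory() as archive_dir:
    segment = os.path.join(xlog_dir, "0000000000000000")
    with open(segment, "wb") as f:
        f.write(b"wal data")
    first_status = archive_by_copy(segment, archive_dir)
    second_status = archive_by_copy(segment, archive_dir)
    source_still_present = os.path.exists(segment)
```

Because the source file is never touched, a failure at any step leaves the
original segment in place for a retry.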

It's not like you save any effort by using a MOVE anyway. You're not
going to have the archive on the same machine as the database (or if you
are, you ain't gonna be *my* DBA ...)

regards, tom lane

#119Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Tom Lane (#118)
Re: Point in Time Recovery

I don't think so, but it seems like a much less robust way to do things.
What happens if you have a failure partway through? For instance
archive machine dies and loses recent data right after you've rm'd the
source file. The recommended COPY procedure at least provides some
breathing room between when you install the data on the archive and when
the original file is removed.

Well, I tried it in 'cross your fingers' mode and it works, at least:

archive_command = 'rm %p'

:)

Chris

#120Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Kirkwood (#110)
Re: PITR COPY Failure (was Point in Time Recovery)

Mark Kirkwood <markir@coretech.co.nz> writes:

I have been doing some re-testing with CVS HEAD from about 1 hour ago
using the simplified example posted previously.

It is quite interesting:

The problem seems to be that the computation of checkPoint.redo at
xlog.c lines 4162-4169 (all line numbers are per CVS tip) is not
allowing for the possibility that XLogInsert will decide it doesn't
want to split the checkpoint record across XLOG files, and will then
insert a WASTED_SPACE record to avoid that (see comment and following
code at lines 758-795). This wouldn't really matter except that there
is a safety crosscheck at line 4268 that tries to detect unexpected
insertions of other records during a shutdown checkpoint.

I think the code in CreateCheckPoint was correct when it was written,
because we only recently changed XLogInsert to not split records
across files. But it's got a boundary-case bug now, which your test
scenario is able to exercise by making the recovery run try to write
a shutdown checkpoint exactly at the end of a WAL file segment.

The quick and dirty solution would be to dike out the safety check at
4268ff. I don't much care for that, but am too tired right now to work
out a better answer. I'm not real sure whether it's better to adjust
the computation of checkPoint.redo or to smarten the safety check
... but one or the other needs to allow for file-end padding, or maybe
we could hack some update of the state in WasteXLInsertBuffer(). (But
at some point you have to say "this is more trouble than it's worth",
so maybe we'll end up taking out the safety check.)

In any case this isn't a fundamental bug, just an insufficiently
smart safety check. But thanks for finding it! As is, the code has
a nonzero probability of failure in the field :-( and I don't know
how we'd have tracked it down without a reproducible test case.

regards, tom lane
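
The boundary case described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the xlog.c logic: positions are plain byte offsets, page and segment headers are ignored, and the function names are invented for the example. It shows only why a position predicted before the insert can differ from the position where the record finally lands once records may not span two segment files.

```python
def naive_redo(insert_pos: int) -> int:
    """Predicted checkpoint position: simply the current insert point."""
    return insert_pos


def actual_record_start(insert_pos: int, rec_len: int, seg_size: int) -> int:
    """Where the record really starts if it may not span two segment files."""
    room = seg_size - (insert_pos % seg_size)
    if rec_len <= room:
        return insert_pos
    # Not enough room: the rest of the segment is padded and the record
    # moves to the start of the next segment.
    return insert_pos + room


def safety_check_passes(insert_pos: int, rec_len: int, seg_size: int) -> bool:
    """Model of the crosscheck: prediction must equal the real position."""
    return naive_redo(insert_pos) == actual_record_start(
        insert_pos, rec_len, seg_size
    )
```

With a hypothetical segment size of 1000 and a 100-byte record, the check passes at offset 500 and fails at offset 950, even though no other record was inserted concurrently.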

#121Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Christopher Kings-Lynne (#119)
Re: Point in Time Recovery

Hang on, are you supposed to MOVE or COPY away WAL segments?

Copy. pg will delete them once they are archived.

Copy. pg will recycle them once they are archived.

Andreas

#122Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#120)
Re: PITR COPY Failure (was Point in Time Recovery)

On Tue, 2004-07-20 at 05:14, Tom Lane wrote:

Mark Kirkwood <markir@coretech.co.nz> writes:

I have been doing some re-testing with CVS HEAD from about 1 hour ago
using the simplified example posted previously.

It is quite interesting:

The problem seems to be that the computation of checkPoint.redo at
xlog.c lines 4162-4169 (all line numbers are per CVS tip) is not
allowing for the possibility that XLogInsert will decide it doesn't
want to split the checkpoint record across XLOG files, and will then
insert a WASTED_SPACE record to avoid that (see comment and following
code at lines 758-795). This wouldn't really matter except that there
is a safety crosscheck at line 4268 that tries to detect unexpected
insertions of other records during a shutdown checkpoint.

I think the code in CreateCheckPoint was correct when it was written,
because we only recently changed XLogInsert to not split records
across files. But it's got a boundary-case bug now, which your test
scenario is able to exercise by making the recovery run try to write
a shutdown checkpoint exactly at the end of a WAL file segment.

Thanks for locating that, I was suspicious of that piece of code, but it
would have taken me longer than this to locate it exactly.

It was clear (to me) that it had to be of this nature, since I've done a
fair amount of recovery testing and not hit anything like that.

The quick and dirty solution would be to dike out the safety check at
4268ff. I don't much care for that, but am too tired right now to work
out a better answer. I'm not real sure whether it's better to adjust
the computation of checkPoint.redo or to smarten the safety check
... but one or the other needs to allow for file-end padding, or maybe
we could hack some update of the state in WasteXLInsertBuffer(). (But
at some point you have to say "this is more trouble than it's worth",
so maybe we'll end up taking out the safety check.)

I'll take a look

In any case this isn't a fundamental bug, just an insufficiently
smart safety check. But thanks for finding it! As is, the code has
a nonzero probability of failure in the field :-( and I don't know
how we'd have tracked it down without a reproducible test case.

All code has a non-zero probability of failure in the field, it's just
they don't tell you that usually. The main thing here is that we write
everything we need to write to the logs in the first place.

If that is true, then the code can always be adjusted or the logs dumped
and re-spliced to recover data.

Definitely: Thanks Mark! Reproducibility is key.

Best regards, Simon Riggs

#123Mark Kirkwood
markir@coretech.co.nz
In reply to: Tom Lane (#120)
Re: PITR COPY Failure (was Point in Time Recovery)

Great that it's not fundamental - and hopefully with this discovery, the
probability you mentioned is being squashed towards zero a bit more :-)

Don't let this early bug detract from what is really a superb piece of work!

regards

Mark

Tom Lane wrote:

In any case this isn't a fundamental bug, just an insufficiently
smart safety check. But thanks for finding it! As is, the code has
a nonzero probability of failure in the field :-( and I don't know
how we'd have tracked it down without a reproducible test case.

#124Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#120)
Re: PITR COPY Failure (was Point in Time Recovery)

On Tue, 2004-07-20 at 05:14, Tom Lane wrote:

Mark Kirkwood <markir@coretech.co.nz> writes:

I have been doing some re-testing with CVS HEAD from about 1 hour ago
using the simplified example posted previously.

It is quite interesting:

The problem seems to be that the computation of checkPoint.redo at
xlog.c lines 4162-4169 (all line numbers are per CVS tip) is not
allowing for the possibility that XLogInsert will decide it doesn't
want to split the checkpoint record across XLOG files, and will then
insert a WASTED_SPACE record to avoid that (see comment and following
code at lines 758-795). This wouldn't really matter except that there
is a safety crosscheck at line 4268 that tries to detect unexpected
insertions of other records during a shutdown checkpoint.

I think the code in CreateCheckPoint was correct when it was written,
because we only recently changed XLogInsert to not split records
across files. But it's got a boundary-case bug now, which your test
scenario is able to exercise by making the recovery run try to write
a shutdown checkpoint exactly at the end of a WAL file segment.

The quick and dirty solution would be to dike out the safety check at
4268ff.

Well, taking out the safety check isn't the answer.

The check produces the last error message "concurrent transaction...",
but it isn't the cause of the mismatch in the first place.

If you take out that check, we still fail because the wasted space at
the end is causing a "record with zero length" error.

I'm not real sure whether it's better to adjust
the computation of checkPoint.redo or to smarten the safety check
... but one or the other needs to allow for file-end padding, or maybe
we could hack some update of the state in WasteXLInsertBuffer(). (But
at some point you have to say "this is more trouble than it's worth",
so maybe we'll end up taking out the safety check.)

...I'm looking at other options now.

Best Regards, Simon Riggs

#125Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#124)
Re: PITR COPY Failure (was Point in Time Recovery)

Simon Riggs <simon@2ndquadrant.com> writes:

The quick and dirty solution would be to dike out the safety check at
4268ff.

If you take out that check, we still fail because the wasted space at
the end is causing a "record with zero length" error.

Ugh. I'm beginning to think we ought to revert the patch that added the
don't-split-across-files logic to XLogInsert; that seems to have broken
more assumptions than I realized. That was added here:

2004-02-11 17:55 tgl

* src/: backend/access/transam/xact.c,
backend/access/transam/xlog.c, backend/access/transam/xlogutils.c,
backend/storage/smgr/md.c, backend/storage/smgr/smgr.c,
bin/pg_controldata/pg_controldata.c,
bin/pg_resetxlog/pg_resetxlog.c, include/access/xact.h,
include/access/xlog.h, include/access/xlogutils.h,
include/pg_config_manual.h, include/catalog/pg_control.h,
include/storage/smgr.h: Commit the reasonably uncontroversial parts
of J.R. Nield's PITR patch, to wit: Add a header record to each WAL
segment file so that it can be reliably identified. Avoid
splitting WAL records across segment files (this is not strictly
necessary, but makes it simpler to incorporate the header records).
Make WAL entries for file creation, deletion, and truncation (as
foreseen but never implemented by Vadim). Also, add support for
making XLOG_SEG_SIZE configurable at compile time, similarly to
BLCKSZ. Fix a couple bugs I introduced in WAL replay during recent
smgr API changes. initdb is forced due to changes in pg_control
contents.

There are other ways to do this, for example we could treat the WAL page
headers as variable-size, and stick the file labeling info into the
first page's header instead of making it be a separate record. The
separate-record way makes it easier to incorporate future additions to
the file labeling info, but I don't really think it's critical to allow
for that.

regards, tom lane

#126Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#125)
Re: PITR COPY Failure (was Point in Time Recovery)

On Tue, 2004-07-20 at 13:51, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

The quick and dirty solution would be to dike out the safety check at
4268ff.

If you take out that check, we still fail because the wasted space at
the end is causing a "record with zero length" error.

Ugh. I'm beginning to think we ought to revert the patch that added the
don't-split-across-files logic to XLogInsert; that seems to have broken
more assumptions than I realized. That was added here:

2004-02-11 17:55 tgl

* src/: backend/access/transam/xact.c,
backend/access/transam/xlog.c, backend/access/transam/xlogutils.c,
backend/storage/smgr/md.c, backend/storage/smgr/smgr.c,
bin/pg_controldata/pg_controldata.c,
bin/pg_resetxlog/pg_resetxlog.c, include/access/xact.h,
include/access/xlog.h, include/access/xlogutils.h,
include/pg_config_manual.h, include/catalog/pg_control.h,
include/storage/smgr.h: Commit the reasonably uncontroversial parts
of J.R. Nield's PITR patch, to wit: Add a header record to each WAL
segment file so that it can be reliably identified. Avoid
splitting WAL records across segment files (this is not strictly
necessary, but makes it simpler to incorporate the header records).
Make WAL entries for file creation, deletion, and truncation (as
foreseen but never implemented by Vadim). Also, add support for
making XLOG_SEG_SIZE configurable at compile time, similarly to
BLCKSZ. Fix a couple bugs I introduced in WAL replay during recent
smgr API changes. initdb is forced due to changes in pg_control
contents.

There are other ways to do this, for example we could treat the WAL page
headers as variable-size, and stick the file labeling info into the
first page's header instead of making it be a separate record. The
separate-record way makes it easier to incorporate future additions to
the file labeling info, but I don't really think it's critical to allow
for that.

I think I've fixed it now...but wait 20

The problem was that a zero length XLOG_WASTED_SPACE record just fell
out of ReadRecord when it shouldn't have. By giving it a helping hand it
makes it through with pointers correctly set, and everything else was
already thought of in the earlier patch, so xlog_redo etc happens.

I'll update again in a few minutes....no point us both looking at this.

Best regards, Simon Riggs

#127Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#126)
Re: PITR COPY Failure (was Point in Time Recovery)

Simon Riggs <simon@2ndquadrant.com> writes:

On Tue, 2004-07-20 at 13:51, Tom Lane wrote:

Ugh. I'm beginning to think we ought to revert the patch that added the
don't-split-across-files logic to XLogInsert; that seems to have broken
more assumptions than I realized.

The problem was that a zero length XLOG_WASTED_SPACE record just fell
out of ReadRecord when it shouldn't have. By giving it a helping hand it
makes it through with pointers correctly set, and everything else was
already thought of in the earlier patch, so xlog_redo etc happens.

Yeah, but the WASTED_SPACE/FILE_HEADER stuff is already pretty ugly, and
adding two more warts to the code to support it is sticking in my craw.
I'm thinking it would be cleaner to treat the extra labeling information
as an extension of the WAL page header.

regards, tom lane

#128Simon Riggs
simon@2ndquadrant.com
In reply to: Simon Riggs (#126)
1 attachment(s)
Re: PITR COPY Failure (was Point in Time Recovery)

On Tue, 2004-07-20 at 14:11, Simon Riggs wrote:

On Tue, 2004-07-20 at 13:51, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

The quick and dirty solution would be to dike out the safety check at
4268ff.

If you take out that check, we still fail because the wasted space at
the end is causing a "record with zero length" error.

Ugh. I'm beginning to think we ought to revert the patch that added the
don't-split-across-files logic to XLogInsert; that seems to have broken
more assumptions than I realized. That was added here:

2004-02-11 17:55 tgl

* src/: backend/access/transam/xact.c,
backend/access/transam/xlog.c, backend/access/transam/xlogutils.c,
backend/storage/smgr/md.c, backend/storage/smgr/smgr.c,
bin/pg_controldata/pg_controldata.c,
bin/pg_resetxlog/pg_resetxlog.c, include/access/xact.h,
include/access/xlog.h, include/access/xlogutils.h,
include/pg_config_manual.h, include/catalog/pg_control.h,
include/storage/smgr.h: Commit the reasonably uncontroversial parts
of J.R. Nield's PITR patch, to wit: Add a header record to each WAL
segment file so that it can be reliably identified. Avoid
splitting WAL records across segment files (this is not strictly
necessary, but makes it simpler to incorporate the header records).
Make WAL entries for file creation, deletion, and truncation (as
foreseen but never implemented by Vadim). Also, add support for
making XLOG_SEG_SIZE configurable at compile time, similarly to
BLCKSZ. Fix a couple bugs I introduced in WAL replay during recent
smgr API changes. initdb is forced due to changes in pg_control
contents.

There are other ways to do this, for example we could treat the WAL page
headers as variable-size, and stick the file labeling info into the
first page's header instead of making it be a separate record. The
separate-record way makes it easier to incorporate future additions to
the file labeling info, but I don't really think it's critical to allow
for that.

I think I've fixed it now...but wait 20

The problem was that a zero length XLOG_WASTED_SPACE record just fell
out of ReadRecord when it shouldn't have. By giving it a helping hand it
makes it through with pointers correctly set, and everything else was
already thought of in the earlier patch, so xlog_redo etc happens.

I'll update again in a few minutes....no point us both looking at this.

This was a very confusing test...Here's what I think happened:

Mark discovered a numerological coincidence that meant that the
XLOG_WASTED_SPACE record was zero length at the end of EACH file he was
writing to, as long as there was just that one writer. So no matter how
many records were inserted, each xlog file had a zero length
XLOG_WASTED_SPACE record at its end.

ReadRecord failed on seeing a zero length record, i.e. when it got to
the FIRST of the XLOG_WASTED_SPACE records. That's why the test fails no
matter how many records you give it, as long as it was more than enough
to write into a second xlog segment file.

By telling ReadRecord that XLOG_WASTED_SPACE records of zero length are
in fact *OK*, it continues happily. (That's just a partial fix, see
later)
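
The behaviour described in the last three paragraphs can be modelled in a few lines. This is a simplified sketch, not ReadRecord itself: records are (kind, length) pairs, and `WASTED_SPACE` is a placeholder tag rather than the real xl_info value.

```python
from typing import List, Tuple

WASTED_SPACE = "wasted_space"  # placeholder tag, not the real xl_info value

Record = Tuple[str, int]  # (kind, payload length)


def replay(records: List[Record], tolerate_padding: bool) -> int:
    """Return how many records are read before the reader gives up."""
    read = 0
    for kind, length in records:
        is_padding = kind == WASTED_SPACE
        if length == 0 and not (tolerate_padding and is_padding):
            break  # "record with zero length": treated as end of valid log
        read += 1
    return read
```

With a zero-length padding record at the end of the first segment, the strict reader stops there however many records follow; the tolerant reader continues, while a zero-length record of any other kind still ends the scan.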

The test works, but gives what looks like strange results: the test
blows away the data directory completely, so the then-current xlog dies
too. That contained the commit for the large COPY, so even though the
recovery now works, the table has zero rows in it. (When things die
you're still likely to lose *some* data).

Anyway, so then we put the "concurrent transaction" test back in and the
test passes because we have now set the pointers correctly.

After all that, I think the wasted space idea is still sensible. You
mustn't have a continuation record across files, otherwise we'll end up
with half a commit one day, which would break ACID.

I'm happy that we have the explicit test in XLogInsert for zero-length
records. Somebody will one day write a resource manager with zero length
records when they didn't mean to and we need to catch that at write
time, not at recovery time like Mark has done. The WasteXLInsertBuffer
was the only part of the code that *can* write a zero-length record, so
we will *not* see another recurrence of this situation --at recovery
time--.

Though further concerns along this theme are:
- what happens when the space at the end of a file is so small we can't
even write a zero-length XLOG_WASTED_SPACE record? Hopefully, you're
gonna say "damn your eyes...couldn't you see that, it's already there".
- if the space at the end of a file was just zeros, then the "concurrent
transaction test" would still fail....we probably need to enhance this
to treat a few zeros at end of file AS IF it was an XLOG_WASTED_SPACE
record and continue. (That scenario would happen if we were doing a
recovery that included a local un-archived xlog that was very close to
being full - probably more likely to occur in crash recovery than
archive recovery)
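
The second concern suggests treating a run of zeros at the end of a file as if it were padding. A minimal sketch of such a test, assuming the segment is available as a byte string and `offset` is where the next record would start; the function name is hypothetical and nothing like it appears in the attached patch:

```python
def tail_is_padding(segment: bytes, offset: int) -> bool:
    """True if a non-empty tail from offset to end of segment is all zeros."""
    tail = segment[offset:]
    # An empty tail means the segment is exactly full: nothing to skip.
    return len(tail) > 0 and not any(tail)
```

A reader using this check would move on to the next segment instead of reporting an invalid record when it returns True.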

The included patch doesn't attempt to address those issues, yet.

Best regards, Simon Riggs

Attachments:

zerolength.patch (text/x-patch)
Index: backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql-server/src/backend/access/transam/xlog.c,v
retrieving revision 1.149
diff -d -c -r1.149 xlog.c
*** backend/access/transam/xlog.c	19 Jul 2004 14:34:39 -0000	1.149
--- backend/access/transam/xlog.c	20 Jul 2004 14:03:12 -0000
***************
*** 2370,2376 ****
  	 * Currently, xl_len == 0 must be bad data, but that might not be true
  	 * forever.  See note in XLogInsert.
  	 */
! 	if (record->xl_len == 0)
  	{
  		ereport(emode,
  				(errmsg("record with zero length at %X/%X",
--- 2370,2380 ----
  	 * Currently, xl_len == 0 must be bad data, but that might not be true
  	 * forever.  See note in XLogInsert.
  	 */
! 	if (record->xl_len == 0 && record->xl_info != XLOG_WASTED_SPACE)
  	{
  		ereport(emode,
  				(errmsg("record with zero length at %X/%X",
***************
*** 2489,2495 ****
  	}
  
  	/* Record does not cross a page boundary */
! 	if (!RecordIsValid(record, *RecPtr, emode))
  		goto next_record_is_invalid;
  	if (BLCKSZ - SizeOfXLogRecord >= RecPtr->xrecoff % BLCKSZ +
  		MAXALIGN(total_len))
--- 2493,2499 ----
  	}
  
  	/* Record does not cross a page boundary */
! 	if (!RecordIsValid(record, *RecPtr, emode) && record->xl_info != XLOG_WASTED_SPACE)
  		goto next_record_is_invalid;
  	if (BLCKSZ - SizeOfXLogRecord >= RecPtr->xrecoff % BLCKSZ +
  		MAXALIGN(total_len))

#129Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#127)
Re: PITR COPY Failure (was Point in Time Recovery)

On Tue, 2004-07-20 at 15:00, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

On Tue, 2004-07-20 at 13:51, Tom Lane wrote:

Ugh. I'm beginning to think we ought to revert the patch that added the
don't-split-across-files logic to XLogInsert; that seems to have broken
more assumptions than I realized.

The problem was that a zero length XLOG_WASTED_SPACE record just fell
out of ReadRecord when it shouldn't have. By giving it a helping hand it
makes it through with pointers correctly set, and everything else was
already thought of in the earlier patch, so xlog_redo etc happens.

Yeah, but the WASTED_SPACE/FILE_HEADER stuff is already pretty ugly, and
adding two more warts to the code to support it is sticking in my craw.
I'm thinking it would be cleaner to treat the extra labeling information
as an extension of the WAL page header.

Sounds like a better solution than scrabbling around at the end of file
with too many edge cases to test properly

...over to you then...

Best Regards, Simon Riggs

#130Noname
markw@osdl.org
In reply to: Tom Lane (#93)
Re: [HACKERS] Point in Time Recovery

On 18 Jul, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

Latest version, pitr_v5_2.patch...

Reviewed and committed with some adjustments.

I pulled from CVS and got the following message when I tried starting
the database with the archive_mode parameter:

FATAL: unrecognized configuration parameter "archive_mode"

Have I missed something since it has been committed?

Mark

#131Klaus Naumann
kn@mgnet.de
In reply to: Noname (#130)
Re: [HACKERS] Point in Time Recovery

On Tue, 20 Jul 2004 markw@osdl.org wrote:

FATAL: unrecognized configuration parameter "archive_mode"

Have I missed something since it has been committed?

Yes, Tom has removed this option in favor of just setting
archive_command to a value which then enables the PITR code also.

But as I've seen this isn't discussed to the very end currently.

My 2ct: I'd prefer to have archive_mode in the config as it really makes
clear that this database is archiving. I fear users will not understand
that giving a program for archival will also enable the PITR function.

Greetings, Klaus

--
Full Name : Klaus Naumann | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964 | E-Mail: (kn@mgnet.de)

#132Simon Riggs
simon@2ndquadrant.com
In reply to: Klaus Naumann (#131)
Re: [HACKERS] Point in Time Recovery

On Tue, 2004-07-20 at 17:29, Klaus Naumann wrote:

On Tue, 20 Jul 2004 markw@osdl.org wrote:

FATAL: unrecognized configuration parameter "archive_mode"

Have I missed something since it has been committed?

Yes, Tom has removed this option in favor of just setting
archive_command to a value which then enables the PITR code also.

But as I've seen this isn't discussed to the very end currently.

My 2ct: I'd prefer to have archive_mode in the config as it really makes
clear that this database is archiving. I fear users will not understand
that giving a program for archival will also enable the PITR function.

I do also think that option should go back in, just to be explicit.

A more important omission is the deletion of a message to indicate that
the server is acting in archive_mode....so there's no visual clue in the
log to warn an admin that it's been turned off now or incorrectly
specified (by somebody else, of course). (At least using the default log
mode).

Best Regards, Simon Riggs

#133Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#108)
Re: Point in Time Recovery

Simon Riggs wrote:

On Sat, 2004-07-17 at 00:57, Bruce Momjian wrote:

OK, I think I have some solid ideas and reasons for them.

Sorry for taking so long to reply...

First, I think we need server-side functions to call when we start/stop
the backup. The advantage of these server-side functions is that they
will do the required work of recording the pg_control values and
creating needed files with little chance for user error. It also allows
us to change the internal operations in later releases without requiring
admins to change their procedures. We are even able to adjust the
internal operation in minor releases without forcing a new procedure on
users.

Yes, I think we should go down this route. ....there's a "but" and that
is we don't absolutely need it for correctness....and so I must decline
adding it to THIS release. I don't imagine I'll stop being associated with
this code for a while yet....

Can we recommend that users should expect to have to call a start and
end backup routine in later releases? Don't expect you'll agree to
that..

I guess my big question is that if we don't do this for 7.5 how will
people doing restores know if the xid they specify is valid for the
backup they used. If we recover to most recent time, is there any check
that will tell them their backup is invalid because there are no archive
records that span the time of their backup?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

#134Mark Kirkwood
markir@coretech.co.nz
In reply to: Klaus Naumann (#131)
Re: [HACKERS] Point in Time Recovery

I'd vote for it as a clarity factor too.

Klaus Naumann wrote:

On Tue, 20 Jul 2004 markw@osdl.org wrote:

FATAL: unrecognized configuration parameter "archive_mode"

Have I missed something since it has been committed?

Yes, Tom has removed this option in favor of just setting
archive_command to a value which then enables the PITR code also.

But as I've seen this isn't discussed to the very end currently.

My 2ct: I'd prefer to have archive_mode in the config as it really makes
clear that this database is archiving. I fear users will not understand
that giving a program for archival will also enable the PITR function.

Greetings, Klaus

#135Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Mark Kirkwood (#134)
Re: [HACKERS] Point in Time Recovery

I'm in favour of how it is now, so long as the comment is clear. It's
the Unix Way :)

Chris

I'd vote for it as a clarity factor too.

Klaus Naumann wrote:

On Tue, 20 Jul 2004 markw@osdl.org wrote:

FATAL: unrecognized configuration parameter "archive_mode"

Have I missed something since it has been committed?

Yes, Tom has removed this option in favor of just setting
archive_command to a value which then enables the PITR code also.

But as I've seen this isn't discussed to the very end currently.

My 2ct: I'd prefer to have archive_mode in the config as it really makes
clear that this database is archiving. I fear users will not understand
that giving a program for archival will also enable the PITR function.

Greetings, Klaus

#136Mark Kirkwood
markir@coretech.co.nz
In reply to: Simon Riggs (#128)
Re: PITR COPY Failure (was Point in Time Recovery)

FYI - I can confirm that the patch fixes the main issue.

Simon Riggs wrote:

This was a very confusing test...Here's what I think happened:
.....
The included patch doesn't attempt to address those issues, yet.

Best regards, Simon Riggs

#137Mark Kirkwood
markir@coretech.co.nz
In reply to: Simon Riggs (#128)
Re: PITR COPY Failure (was Point in Time Recovery)

This is presumably a standard feature of any PITR design - if the
failure event destroys the current transaction log, then you can only
recover transactions that committed in the last *archived* log.

regards

Mark

Simon Riggs wrote:

The test works, but gives what looks like strange results: the test
blows away the data directory completely, so the then-current xlog dies
too. That contained the commit for the large COPY, so even though the
recovery now works, the table has zero rows in it. (When things die
you're still likely to lose *some* data).
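
The outcome described here (recovery succeeds, but the COPY is gone) follows from which segments reached the archive. A toy model, assuming segments are replayed in order and recovery stops at the first one that was never archived; the data layout and names are invented for the example:

```python
from typing import List, Set, Tuple


def recoverable_commits(
    segments: List[Tuple[int, List[str]]], archived: Set[int]
) -> List[str]:
    """Return the commits that can be replayed from the archive alone.

    segments: ordered (segment number, commits recorded in that segment).
    archived: numbers of the segments that were copied to the archive.
    """
    replayed: List[str] = []
    for seg_no, commits in segments:
        if seg_no not in archived:
            break  # the log ends here as far as recovery can tell
        replayed.extend(commits)
    return replayed
```

If the commit record of the large COPY sits in the segment that was still current when the data directory was destroyed, it is not in the result, so the table comes back empty.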

#138Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#132)
Re: [HACKERS] Point in Time Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

A more important omission is the deletion of a message to indicate that
the server is acting in archive_mode....so there's no visual clue in the
log to warn an admin that it's been turned off now or incorrectly
specified (by somebody else, of course). (At least using the default log
mode).

Hmm, we are apparently not reading the same code. My copy shows

LOG: starting archive recovery
LOG: restore_command = "cp /home/postgres/testversion/archive/%f %p"
... blah blah ...
LOG: archive recovery complete

Which part of this is insufficiently clear?

regards, tom lane
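
For readers unfamiliar with the %f and %p placeholders in the restore_command shown above: %f stands for the segment file name and %p for the path the server wants the file copied to (or, in archive_command, copied from). A small sketch of that substitution; the function is illustrative only, and the handling of %% as a literal percent sign follows the convention in the current PostgreSQL documentation rather than anything stated in this thread.

```python
def expand_command(template: str, path: str, filename: str) -> str:
    """Substitute %p, %f and %% in an archive/restore command template."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template):
            nxt = template[i + 1]
            if nxt == "p":
                out.append(path)
                i += 2
                continue
            if nxt == "f":
                out.append(filename)
                i += 2
                continue
            if nxt == "%":
                out.append("%")
                i += 2
                continue
        out.append(ch)  # any other character is copied unchanged
        i += 1
    return "".join(out)
```

The segment name and destination path passed in are supplied by the server at run time; any values used when trying this out are placeholders.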

#139Klaus Naumann
kn@mgnet.de
In reply to: Tom Lane (#138)
Re: [HACKERS] Point in Time Recovery

On Wed, 21 Jul 2004, Tom Lane wrote:

Hi Tom,

Simon doesn't mean the recovery part. Instead he means the "normal"
startup of the server. It has to be absolutely clear (in the logfile!) if
the server was started in archive mode or not. Otherwise you always have
to guess.
On server startup there should be a message like

LOG: Database started in archive mode

or

LOG: Archive mode is DISABLED

To get the user's attention.

Greetings, Klaus

Simon Riggs <simon@2ndquadrant.com> writes:

A more important omission is the deletion of a message to indicate that
the server is acting in archive_mode....so there's no visual clue in the
log to warn an admin that it's been turned off now or incorrectly
specified (by somebody else, of course). (At least using the default log
mode).

Hmm, we are apparently not reading the same code. My copy shows

LOG: starting archive recovery
LOG: restore_command = "cp /home/postgres/testversion/archive/%f %p"
... blah blah ...
LOG: archive recovery complete

Which part of this is insufficiently clear?

regards, tom lane

--
Full Name : Klaus Naumann | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964 | E-Mail: (kn@mgnet.de)

#140Tom Lane
tgl@sss.pgh.pa.us
In reply to: Klaus Naumann (#139)
Re: [HACKERS] Point in Time Recovery

Klaus Naumann <kn@mgnet.de> writes:

Simon doesn't mean the recovery part. Instead he means the "normal"
startup of the server. It has to be absolutely clear (in the logfile!) if
the server was started in archive mode or not. Otherwise you always have
to guess.

Why would you guess? "SHOW archive_command" will tell you, without
question, at any time. I don't see the point of placing such a message
in the postmaster log --- in normal circumstances the postmaster will
still be running long after its starting messages have been discarded
due to log rotation.

Also, the current implementation allows you to stop and start archiving
on-the-fly, so a start-time message would be an unreliable guide to what
the postmaster is actually doing at the moment.

regards, tom lane

#141Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#140)
Re: [HACKERS] Point in Time Recovery

On Wed, 2004-07-21 at 15:53, Tom Lane wrote:

Klaus Naumann <kn@mgnet.de> writes:

Simon doesn't mean the recovery part. Instead he means the "normal"
startup of the server. It has to be absolutely clear (in the logfile!) if
the server was started in archive mode or not. Otherwise you always have
to guess.

Why would you guess? "SHOW archive_command" will tell you, without
question, at any time. I don't see the point of placing such a message
in the postmaster log --- in normal circumstances the postmaster will
still be running long after its starting messages have been discarded
due to log rotation.

Also, the current implementation allows you to stop and start archiving
on-the-fly, so a start-time message would be an unreliable guide to what
the postmaster is actually doing at the moment.

Overall, this is a small point and I think we should leave Tom alone, to
focus on the bigger issues that we care about.

Tom has done an amazingly good job in the last few days of refactoring
some reasonably ugly code on my part, all without a murmur. I relent on
this to allow everything to be finished in time.

The PITR journey has just begun, so there will be further opportunity to
discuss and agree what constitutes real issues and then correct them.
This may not be on that list later.

Best Regards, Simon Riggs

#142Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#129)
Re: PITR COPY Failure (was Point in Time Recovery)

Simon Riggs <simon@2ndquadrant.com> writes:

On Tue, 2004-07-20 at 15:00, Tom Lane wrote:

Yeah, but the WASTED_SPACE/FILE_HEADER stuff is already pretty ugly, and
adding two more warts to the code to support it is sticking in my craw.
I'm thinking it would be cleaner to treat the extra labeling information
as an extension of the WAL page header.

Sounds like a better solution than scrabbling around at the end of file
with too many edge cases to test properly

This is done in CVS tip. Mark, could you retest to verify it's fixed?

regards, tom lane

#143Mark Kirkwood
markir@coretech.co.nz
In reply to: Tom Lane (#142)
Re: PITR COPY Failure (was Point in Time Recovery)

Looks good to me. Log file numbering scheme seems to have changed - is
that part of the fix too?

Tom Lane wrote:

This is done in CVS tip. Mark, could you retest to verify it's fixed?

regards, tom lane

#144Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Kirkwood (#143)
Re: PITR COPY Failure (was Point in Time Recovery)

Mark Kirkwood <markir@coretech.co.nz> writes:

Looks good to me. Log file numbering scheme seems to have changed - is
that part of the fix too?

That's for timelines ... it's not directly related but I thought I
should put in both changes at once to avoid forcing an extra initdb.

regards, tom lane

#145Mark Kirkwood
markir@coretech.co.nz
In reply to: Simon Riggs (#32)
Re: [HACKERS] Point in Time Recovery

Here is one for the 'idiot proof' category:

1) initdb and set archive_command
2) shutdown
3) do a backup
4) startup and run some transactions
5) shutdown and remove PGDATA
6) restore backup
7) startup

Obviously this does not work as the backup is performed with the
database shutdown.

This got me wondering for 2 reasons:

1) Some alternative database servers *require* a procedure like this to
enable their version of PITR - so the potential foot-gun thing is there.

2) Is it possible to make the recovery kick in even though pg_control
says the database state is shutdown?

Simon Riggs wrote:

I was hoping some fiendish plans would be presented to me...

But please start with "this feels like typical usage" and we'll go from
there...the important thing is to try the first one.

I've not done power off tests, yet. They need to be done just to
check...actually you don't need to do this to test PITR...

We need exhaustive tests of...
- power off
- scp and cross network copies
- all the permuted recovery options
- archive_mode = off (i.e. current behaviour)
- deliberately incorrectly set options (idiot-proof testing)

#146Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Kirkwood (#145)
Re: [HACKERS] Point in Time Recovery

Mark Kirkwood <markir@coretech.co.nz> writes:

Here is one for the 'idiot proof' category:
1) initdb and set archive_command
2) shutdown
3) do a backup
4) startup and run some transactions
5) shutdown and remove PGDATA
6) restore backup
7) startup

Obviously this does not work as the backup is performed with the
database shutdown.

Huh? It works fine.

The bit you may be missing is that if you blow away $PGDATA including
pg_xlog/, you won't be able to recover past whatever you have in your WAL
archive area. The archive is certainly not going to include the current
partially-filled WAL segment, and it might be missing a few earlier
segments if the archival process isn't speedy. So you need to keep
those recent segments in pg_xlog/ if you want to recover to current time
or near-current time.

I'm becoming more and more convinced that we should bite the bullet and
move pg_xlog/ to someplace that is not under $PGDATA. It would just
make things a whole lot more reliable, both for backup and to deal with
scenarios like yours above. I tried to talk Bruce into this on the
phone the other day, but he wouldn't bite. I still think it's a good
idea though. It would
(1) eliminate the problem that a tar backup of $PGDATA would restore
stale copies of xlog segments, because the tar wouldn't include
pg_xlog in the first place.
(2) eliminate the problem that a naive "rm -rf $PGDATA" would blow away
xlog segments that you still need.

A possible compromise is that we should strongly suggest that pg_xlog
be pushed out to another place and symlinked if you are going to use
WAL archiving. That's already considered good practice for performance
if you have a separate disk spindle to put WAL on. It'll just have
to be good practice for WAL archiving too.
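
The compromise described above (push pg_xlog out to another place and
symlink it) can be sketched roughly as follows. This is only an
illustration under stated assumptions: the function name and both paths
are hypothetical, the server is assumed to be stopped, and the target
location is assumed not to exist yet.

```python
import os
import shutil


def relocate_wal_dir(pgdata: str, wal_target: str) -> None:
    """Move <pgdata>/pg_xlog to wal_target and leave a symlink behind.

    Hypothetical helper: assumes the server is stopped and that
    wal_target does not exist yet.
    """
    wal_dir = os.path.join(pgdata, "pg_xlog")
    if os.path.islink(wal_dir):
        return  # already relocated, nothing to do
    if os.path.exists(wal_target):
        raise FileExistsError(f"{wal_target} already exists")
    shutil.move(wal_dir, wal_target)   # WAL now lives outside the data directory
    os.symlink(wal_target, wal_dir)    # the server still finds it under pg_xlog
```

With this layout a backup tool that does not follow symlinks would skip
the segments, and an "rm -rf" of the data directory would remove only
the link, which is the property being argued for here.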

regards, tom lane

#147Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#146)
Re: [HACKERS] Point in Time Recovery

I think we should push the partially complete WAL file to the archive
location before shutdown. I talked to you or Jan about it and you (or
Jan) wouldn't bite either, but I think when someone shuts down, they
assume they have things fully archived and can recover fully with a
previous backup and the archive files.

When you are running and finally fill up the WAL file it would then
overwrite the one in the archive but I think that is OK. Maybe we would
need to give it a special file extension so we only use it when we don't
have a full version.

---------------------------------------------------------------------------

Tom Lane wrote:

Mark Kirkwood <markir@coretech.co.nz> writes:

Here is one for the 'idiot proof' category:
1) initdb and set archive_command
2) shutdown
3) do a backup
4) startup and run some transactions
5) shutdown and remove PGDATA
6) restore backup
7) startup

Obviously this does not work as the backup is performed with the
database shutdown.

Huh? It works fine.

The bit you may be missing is that if you blow away $PGDATA including
pg_xlog/, you won't be able to recover past whatever you have in your WAL
archive area. The archive is certainly not going to include the current
partially-filled WAL segment, and it might be missing a few earlier
segments if the archival process isn't speedy. So you need to keep
those recent segments in pg_xlog/ if you want to recover to current time
or near-current time.

I'm becoming more and more convinced that we should bite the bullet and
move pg_xlog/ to someplace that is not under $PGDATA. It would just
make things a whole lot more reliable, both for backup and to deal with
scenarios like yours above. I tried to talk Bruce into this on the
phone the other day, but he wouldn't bite. I still think it's a good
idea though. It would
(1) eliminate the problem that a tar backup of $PGDATA would restore
stale copies of xlog segments, because the tar wouldn't include
pg_xlog in the first place.
(2) eliminate the problem that a naive "rm -rf $PGDATA" would blow away
xlog segments that you still need.

A possible compromise is that we should strongly suggest that pg_xlog
be pushed out to another place and symlinked if you are going to use
WAL archiving. That's already considered good practice for performance
if you have a separate disk spindle to put WAL on. It'll just have
to be good practice for WAL archiving too.

regards, tom lane

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#148Mark Kirkwood
markir@coretech.co.nz
In reply to: Tom Lane (#146)
Re: [HACKERS] Point in Time Recovery

Well that is interesting :_)

Here is what I am doing on the removal front (I am keeping pg_xlog *now*):

$ cd $PGDATA
$ pg_ctl stop
$ ls|grep -v pg_xlog|xargs rm -rf

The contents of the archive directory just before recovery starts:

$ ls -l $PGDATA/../7.5-archive
total 49212
-rw------- 1 postgres postgres 16777216 Jul 22 14:59 000000010000000000000000
-rw------- 1 postgres postgres 16777216 Jul 22 14:59 000000010000000000000001
-rw------- 1 postgres postgres 16777216 Jul 22 14:59 000000010000000000000002

But here is recovery startup log:

LOG: database system was shut down at 2004-07-22 14:58:57 NZST
LOG: starting archive recovery
LOG: restore_command = "cp /data1/pgdata/7.5-archive/%f %p"
cp: cannot stat `/data1/pgdata/7.5-archive/00000001.history': No such file or directory
LOG: restored log file "000000010000000000000000" from archive
LOG: checkpoint record is at 0/A4D3E8
LOG: redo record is at 0/A4D3E8; undo record is at 0/0; shutdown TRUE
LOG: next transaction ID: 496; next OID: 17229
LOG: archive recovery complete
LOG: database system is ready
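
For readers unfamiliar with the restore_command line in the log above:
%f stands for the name of the file wanted from the archive and %p for
the path the server wants it copied to. A minimal sketch of that
substitution, assuming a hypothetical helper name and a made-up
destination path (the server's real code is not reproduced here, and
the %% handling is an assumption of this sketch):

```python
def expand_restore_command(template: str, file_name: str, dest_path: str) -> str:
    """Replace %f with the wanted file name and %p with the destination path.

    Illustrative only; "%%" is treated as a literal percent sign and any
    other sequence is left untouched.
    """
    out = []
    i = 0
    while i < len(template):
        pair = template[i:i + 2]
        if pair == "%f":
            out.append(file_name)
            i += 2
        elif pair == "%p":
            out.append(dest_path)
            i += 2
        elif pair == "%%":
            out.append("%")
            i += 2
        else:
            out.append(template[i])
            i += 1
    return "".join(out)


command = expand_restore_command(
    "cp /data1/pgdata/7.5-archive/%f %p",
    "000000010000000000000000",
    "pg_xlog/restored_segment",  # hypothetical destination path
)
print(command)
```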

regards

Mark

Tom Lane wrote:

Huh? It works fine.

The bit you may be missing is that if you blow away $PGDATA including
pg_xlog/, you won't be able to recover past whatever you have in your WAL
archive area. The archive is certainly not going to include the current
partially-filled WAL segment, and it might be missing a few earlier
segments if the archival process isn't speedy. So you need to keep
those recent segments in pg_xlog/ if you want to recover to current time
or near-current time.

#149Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#147)
Re: [HACKERS] Point in Time Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I think we should push the partially complete WAL file to the archive
location before shutdown. ...
When you are running and finally fill up the WAL file it would then
overwrite the one in the archive but I think that is OK.

I don't think this can fly at all. Here are some off-the-top-of-the-head
objections:

1. We don't have the luxury of spending indefinite amounts of time to
do a database shutdown. Commonly we are under a twenty-second sentence
of death from init. I don't want to spend the 20 seconds waiting to see
if the archiver will manage to push 16MB onto a slow tape drive. Also,
if the archiver does fail to push the data in time, it'll likely leave a
broken (partial) xlog file in the archive, which would be really bad
news if the user then relies on that.

2. What if the archiver process entirely fails to push the file? (Maybe
there's not enough disk space, for instance.) In normal operation we'll
just retry every so often. We definitely can't do that during shutdown.

3. You're blithely assuming that the archival process can easily provide
overwrite semantics for multiple pushes of the same xlog filename. Stop
thinking about "cp to some directory" and start thinking "dump to tape"
or "burn onto CD" or something like that. We'll be raising the ante
considerably if we require the archive_command to deal with this.

I think the last one is really the most significant issue. We have to
keep the archiver API as simple as possible.
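
To make objection 3 concrete, here is a sketch of the simplest kind of
archiver, one that only ever stores a given file name once. The
function name is hypothetical and the "archive" is just a directory;
the point is that write-once is all such a script needs, whereas
re-pushing a partial segment would force every archiver (tape, CD, or
otherwise) to support overwrite.

```python
import os
import shutil


def archive_segment(src_path: str, archive_dir: str) -> bool:
    """Copy one WAL segment into archive_dir, refusing to overwrite.

    Returns True on success and False if a file of that name is already
    archived. A wrapper script would map False to a non-zero exit status.
    """
    dest = os.path.join(archive_dir, os.path.basename(src_path))
    if os.path.exists(dest):
        return False                  # write-once: never replace an archived file
    tmp = dest + ".tmp"
    shutil.copyfile(src_path, tmp)    # copy under a temporary name first
    os.rename(tmp, dest)              # then publish the complete file in one step
    return True
```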

regards, tom lane

#150Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#149)
Re: [HACKERS] Point in Time Recovery

Agreed, it might not be possible, but your report does point out a
limitation in our implementation --- that a shutdown database contains
more information than a backup and the archive logs. That is not
intuitive.

In fact, if you shutdown your database and want to reproduce it on
another machine, how do you do it? Seems you have to copy pg_xlog
directory over to the new machine.

In fact, moving pg_xlog to a new location doesn't make that clear
either. Seems documentation might be the only way to make this clear.

One idea would be to just push the partial WAL file to the archive on
server shutdown and not reuse it and start with a new WAL file on
startup. At least for a normal system shutdown this will give us an
archive that contains all the information that is in pg_xlog.

---------------------------------------------------------------------------

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I think we should push the partially complete WAL file to the archive
location before shutdown. ...
When you are running and finally fill up the WAL file it would then
overwrite the one in the archive but I think that is OK.

I don't think this can fly at all. Here are some off-the-top-of-the-head
objections:

1. We don't have the luxury of spending indefinite amounts of time to
do a database shutdown. Commonly we are under a twenty-second sentence
of death from init. I don't want to spend the 20 seconds waiting to see
if the archiver will manage to push 16MB onto a slow tape drive. Also,
if the archiver does fail to push the data in time, it'll likely leave a
broken (partial) xlog file in the archive, which would be really bad
news if the user then relies on that.

2. What if the archiver process entirely fails to push the file? (Maybe
there's not enough disk space, for instance.) In normal operation we'll
just retry every so often. We definitely can't do that during shutdown.

3. You're blithely assuming that the archival process can easily provide
overwrite semantics for multiple pushes of the same xlog filename. Stop
thinking about "cp to some directory" and start thinking "dump to tape"
or "burn onto CD" or something like that. We'll be raising the ante
considerably if we require the archive_command to deal with this.

I think the last one is really the most significant issue. We have to
keep the archiver API as simple as possible.

regards, tom lane

#151Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#150)
Re: [HACKERS] Point in Time Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Agreed, it might not be possible, but your report does point out a
limitation in our implementation --- that a shutdown database contains
more information than a backup and the archive logs. That is not
intuitive.

That's only because you are clinging to the broken assumption that
pg_xlog/ is part of the database, rather than part of the logs.
Separate that out as a distinct entity, and all gets better.

regards, tom lane

#152Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#151)
Re: [HACKERS] Point in Time Recovery

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Agreed, it might not be possible, but your report does point out a
limitation in our implementation --- that a shutdown database contains
more information than a backup and the archive logs. That is not
intuitive.

That's only because you are clinging to the broken assumption that
pg_xlog/ is part of the database, rather than part of the logs.
Separate that out as a distinct entity, and all gets better.

Imagine this. I stop the server. I have a tar backup and a copy of
the archive. I should be able to take them to another machine and
recover the system to the point I stopped.

You are saying I need a copy of pg_xlog directory too, and I need to
remove pg_xlog after I untar the data directory and put the saved
pg_xlog into there before I recover.

Should we create a server-side function that forces all WAL files to the
archive, including partially written ones? Maybe that fixes the problem
with people deleting pg_xlog before they untar. You tell them to run
the function before recovery. If the system can't be started, then it is
possible the WAL files are no good too, not sure.

#153Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#149)
Re: [HACKERS] Point in Time Recovery

On Thu, 2004-07-22 at 04:29, Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I think we should push the partially complete WAL file to the archive
location before shutdown. ...
When you are running and finally fill up the WAL file it would then
overwrite the one in the archive but I think that is OK.

I don't think this can fly at all. Here are some off-the-top-of-the-head
objections:

1. We don't have the luxury of spending indefinite amounts of time to
do a database shutdown. Commonly we are under a twenty-second sentence
of death from init. I don't want to spend the 20 seconds waiting to see
if the archiver will manage to push 16MB onto a slow tape drive. Also,
if the archiver does fail to push the data in time, it'll likely leave a
broken (partial) xlog file in the archive, which would be really bad
news if the user then relies on that.

2. What if the archiver process entirely fails to push the file? (Maybe
there's not enough disk space, for instance.) In normal operation we'll
just retry every so often. We definitely can't do that during shutdown.

3. You're blithely assuming that the archival process can easily provide
overwrite semantics for multiple pushes of the same xlog filename. Stop
thinking about "cp to some directory" and start thinking "dump to tape"
or "burn onto CD" or something like that. We'll be raising the ante
considerably if we require the archive_command to deal with this.

I think the last one is really the most significant issue. We have to
keep the archiver API as simple as possible.

Not read whole chain of conversation...but this idea came up before and
was rejected then. I agree with the 3 objections to that thought above.

There's already enough copies of full xlogs around to worry about.

If you need more granularity, reduce size of xlog files....

(Tom, SUID would be the correct timeline id in that situation? )

More later, Simon Riggs

#154Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mark Kirkwood (#145)
Re: [HACKERS] Point in Time Recovery

Mark Kirkwood <markir@coretech.co.nz> writes:

2) Is it possible to make the recovery kick in even though pg_control
says the database state is shutdown?

Yeah, I think you are right: presence of recovery.conf should force a
WAL scan even if pg_control claims it's shut down. Fix committed.
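
The rule being committed can be stated as a small decision function.
This is a sketch of the logic only, not the actual startup code; the
function name and the state string are assumptions made for
illustration.

```python
def must_scan_wal(control_state: str, recovery_conf_present: bool) -> bool:
    """Decide whether startup should read WAL beyond the checkpoint record.

    control_state is a stand-in for what pg_control reports; the value
    "shut down" is a label invented for this sketch.
    """
    if recovery_conf_present:
        return True                       # recovery.conf always forces a WAL scan
    return control_state != "shut down"   # otherwise scan only after an unclean stop
```

Under this rule the cold-backup procedure from the start of this
subthread (back up while shut down, restore, add recovery.conf, start)
replays archived WAL instead of stopping at the shutdown checkpoint.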

regards, tom lane

#155Mark Kirkwood
markir@coretech.co.nz
In reply to: Tom Lane (#154)
Re: [HACKERS] Point in Time Recovery

Excellent - Just updated and it is all good!

This change makes the whole "how do I do my backup" business nice and
basic - which is the right way IMHO.

regards

Mark

Tom Lane wrote:

Mark Kirkwood <markir@coretech.co.nz> writes:

2) Is it possible to make the recovery kick in even though pg_control
says the database state is shutdown?

Yeah, I think you are right: presence of recovery.conf should force a
WAL scan even if pg_control claims it's shut down. Fix committed.

regards, tom lane

#156Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#154)
Re: [HACKERS] Point in Time Recovery

On Thu, 2004-07-22 at 21:19, Tom Lane wrote:

Mark Kirkwood <markir@coretech.co.nz> writes:

2) Is it possible to make the recovery kick in even though pg_control
says the database state is shutdown?

Yeah, I think you are right: presence of recovery.conf should force a
WAL scan even if pg_control claims it's shut down. Fix committed.

This *should* be possible but I haven't tested it.

There is a code path on secondary checkpoints that indicates that crash
recovery can occur even when the database was shutdown, since the code
forces recovery whether it was or not. On that basis, this may work, but
is yet untested. I didn't mention this because it might interfere with
getting hot backup to work...

Best Regards, Simon Riggs

#157Mark Kirkwood
markir@coretech.co.nz
In reply to: Simon Riggs (#156)
Re: [HACKERS] Point in Time Recovery

I have tested the "cold" backup - and retested my previous scenarios
using "hot" backup (just to be sure). They all work AFAICS!

cheers

Mark

Simon Riggs wrote:

On Thu, 2004-07-22 at 21:19, Tom Lane wrote:

Mark Kirkwood <markir@coretech.co.nz> writes:

2) Is it possible to make the recovery kick in even though pg_control
says the database state is shutdown?

Yeah, I think you are right: presence of recovery.conf should force a
WAL scan even if pg_control claims it's shut down. Fix committed.

This *should* be possible but I haven't tested it.

There is a code path on secondary checkpoints that indicates that crash
recovery can occur even when the database was shutdown, since the code
forces recovery whether it was or not. On that basis, this may work, but
is yet untested. I didn't mention this because it might interfere with
getting hot backup to work...

Best Regards, Simon Riggs

#158Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#156)
Re: [HACKERS] Point in Time Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

On Thu, 2004-07-22 at 21:19, Tom Lane wrote:

Yeah, I think you are right: presence of recovery.conf should force a
WAL scan even if pg_control claims it's shut down. Fix committed.

This *should* be possible but I haven't tested it.

I did.

It's really not risky. The fact that the code doesn't look beyond the
checkpoint record when things seem to be kosher is just a speed
optimization (and probably a rather pointless one...) We have got to be
able to detect the end of WAL in any case, so we'd just find there are
no more records and stop.

regards, tom lane

#159Simon Riggs
simon@2ndquadrant.com
In reply to: Mark Kirkwood (#157)
Re: [HACKERS] Point in Time Recovery

On Fri, 2004-07-23 at 01:05, Mark Kirkwood wrote:

I have tested the "cold" backup - and retested my previous scenarios
using "hot" backup (just to be sure). They all work AFAICS!

cheers

Yes, I'll drink to that! Thanks for your help.

Best Regards, Simon Riggs

#160Noname
markw@osdl.org
In reply to: Simon Riggs (#159)
Re: [PATCHES] Point in Time Recovery

On 26 Jul, To: simon@2ndquadrant.com wrote:

Sorry I wasn't clearer. I think I have a better idea about what's going
on now. With the archiving enabled, it looks like the database is able
to complete 1 transaction per database connection, but doesn't complete
any subsequent transactions. I'm not sure how to see what's going on.
Perhaps I should try a newer snapshot from CVS while I'm at it?

I tried to do an strace on the postmaster (and child processes) to see
if that might show something, but when the postmaster starts the
database isn't accepting any connections. I have the feeling it's not
really starting up. Trying to shut it down seems to agree with that.
My wild guess is that the database is sitting waiting for something when
a stored function is called but I'm not sure how to verify that.

Mark

#161Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Bruce Momjian (#80)
Re: Point in Time Recovery

We need someone to code two backend functions to complete PITR. The
functions would be called at the start and stop of a backup of the data
directory. The file they write would be checked during restore to make
sure the requested xid is not between the start/stop xids of the backup.
It would also contain timestamps so the admin can easily review the
archive directory.

The start function needs to call checkpoint and create a file in the
data directory that contains a few server parameters. At backup stop,
the function needs to move the file to pg_xlog and set the *.ready
archive flag so it is archived.

As for checking during recovery, the file needs to be retrieved and
checked to see that the requested recovery xid is valid. Tom and I can
help you with that detail.

Don't worry about all the details of the email below. It is just a
general summary. We can give you details once you volunteer.

---------------------------------------------------------------------------

Bruce Momjian wrote:

OK, I think I have some solid ideas and reasons for them.

First, I think we need server-side functions to call when we start/stop
the backup. The advantage of these server-side functions is that they
will do the required work of recording the pg_control values and
creating needed files with little chance for user error. It also allows
us to change the internal operations in later releases without requiring
admins to change their procedures. We are even able to adjust the
internal operation in minor releases without forcing a new procedure on
users.

Second, I think once we start a restore, we should rename recovery.conf
to recovery.in_progress, and when complete rename that to
recovery.done. If the postmaster starts and sees recovery.in_progress,
it will fail to start knowing its recovery was interrupted. This allows
the admin to take appropriate action. (I am not sure what that action
would be. Does he bring back the backup files or just keep going?)
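
The renaming scheme in the second point can be sketched as a pair of
helpers. The function names are hypothetical and the open question
above (what the admin should do after an interrupted recovery) is
deliberately left unanswered: the sketch simply refuses to start.

```python
import os


def begin_recovery(pgdata: str) -> bool:
    """Mark recovery as started; return False if no recovery was requested."""
    conf = os.path.join(pgdata, "recovery.conf")
    in_progress = os.path.join(pgdata, "recovery.in_progress")
    if os.path.exists(in_progress):
        # A previous recovery run never finished: fail rather than guess.
        raise RuntimeError("recovery was interrupted; admin action required")
    if not os.path.exists(conf):
        return False
    os.rename(conf, in_progress)
    return True


def finish_recovery(pgdata: str) -> None:
    """Mark recovery as complete by renaming the in-progress marker."""
    os.rename(
        os.path.join(pgdata, "recovery.in_progress"),
        os.path.join(pgdata, "recovery.done"),
    )
```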

Third, I think we need to put a file in the archive location once we
complete a backup, recording the start/stop xid and wal/offsets. This
gives the admin documentation on what archive logs to keep and what xids
are available for recovery. Ideally the recover program would read that
file and check the recover xid to make sure it is after the stop xid
recorded in the file.
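
The check described in the third point could look something like the
following. The record layout and names are assumptions made for
illustration, the comparison follows the wording above (target after
the stop xid), and xid wraparound is ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupLabel:
    """Hypothetical contents of the file archived when a backup completes."""
    start_xid: int
    stop_xid: int


def target_xid_is_recoverable(label: BackupLabel, target_xid: int) -> bool:
    """True if target_xid lies after the backup's stop xid.

    Simplification: plain integer comparison, no wraparound handling.
    """
    return target_xid > label.stop_xid
```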

How would the recover program know the name of that file? We need to
create it in /data with start contents before the backup, then complete
it with end contents and archive it.

What should we name it? Ideally it would be named by the WAL
name/offset of the start so it orders in the proper spot in the archive
file listing, e.g.:

000000000000093a
000000000000093b
000000000000093b.032b9.start
000000000000093c

Are people going to know they need 000000000000093b for
000000000000093b.032b9.start? I hope so. Another idea is to do:

000000000000093a.xlog
000000000000093b.032b9.start
000000000000093b.xlog
000000000000093c.xlog

This would order properly. It might be a very good idea to add
extensions to these log files now that we are archiving them in strange
places. In fact, maybe we should use *.pg_xlog to document the
directory they came from.
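
Both naming schemes can be checked with a plain byte-wise sort, which
is what "ls" gives in the C locale (other locales may collate
differently). The names are the ones listed above; the helper that maps
a label file back to its segment is a hypothetical convenience.

```python
scheme_plain = [
    "000000000000093c",
    "000000000000093b.032b9.start",
    "000000000000093a",
    "000000000000093b",
]
scheme_with_ext = [
    "000000000000093c.xlog",
    "000000000000093b.xlog",
    "000000000000093b.032b9.start",
    "000000000000093a.xlog",
]


def segment_for_label(label_name: str) -> str:
    """Return the WAL segment name a '<segment>.<offset>.start' file refers to."""
    return label_name.split(".", 1)[0]


sorted_plain = sorted(scheme_plain)
sorted_with_ext = sorted(scheme_with_ext)
print(sorted_plain)
print(sorted_with_ext)
print(segment_for_label("000000000000093b.032b9.start"))
```

In the first scheme the label file sorts just after its segment, in the
second just before it, because "." followed by "0" sorts ahead of "."
followed by "x".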

---------------------------------------------------------------------------

Simon Riggs wrote:

On Fri, 2004-07-16 at 16:47, Tom Lane wrote:

As far as the business about copying pg_control first goes: there is
another way to think about it, which is to copy pg_control to another
place that will be included in your backup. For example the standard
backup procedure could be

1. [somewhat optional] Issue CHECKPOINT and wait till it finishes.

2. cp $PGDATA/global/pg_control $PGDATA/pg_control.dump

3. tar cf /dev/mt $PGDATA

4. do something to record ending WAL position

If we standardized on this way, then the tar archive would automatically
contain the pre-backup checkpoint position in ./pg_control.dump, and
there is no need for any special assumptions about the order in which
tar processes things.

Sounds good. That would be familiar to Oracle DBAs doing BACKUP
CONTROLFILE. We can document that and offer it as a suggested procedure.

However, once you decide to do things like that, there is no reason why
the copied file has to be an exact image of pg_control. I claim it
would be more useful if the copied file were plain text so that you
could just "cat" it to find out the starting WAL position; that would
let you determine without any special tools what range of WAL archive
files you are going to need to bring back from your archives.

I wouldn't be in favour of a manual mechanism. If you want an automated
mechanism, what's wrong with using the one that's already there? You can
use pg_controldata to read the controlfile, again what's wrong with that?

We agreed some time back that an off-line xlog file inspector would be
required to allow us to inspect the logs and make a decision about where
to end recovery. You'd still need that.

It's scary enough having to specify the end point, let alone having to
specify the starting point as well.

At your request, and with Bruce's idea, I designed and built the
recovery system so that you don't need to know what range of xlogs to
bring back. You just run it, it brings back the right files from archive
and does recovery with them, then cleans up - and it works without
running out of disk space on long recoveries.

I've built it now and it works...

This is pretty much the same chain of reasoning that Bruce and I went
through yesterday to come up with the idea of putting a label file
inside the tar backups. We concluded that it'd be worth putting
both the backup starting time and the checkpoint WAL position into
the label file --- the starting time isn't needed for restore but
might be really helpful as documentation, if you needed to verify
which dump file was which.

...if you are doing tar backups...what will you do if you're not using
that mechanism?

If you are: It's common practice to make up a backup filename from
elements such as systemname, databasename, date and time etc. That gives
you the start time, the file last mod date gives you the end time.

I think it's perfectly fine for everybody to do backups any way they
please. There are many licenced variants of PostgreSQL and it might be
appropriate in those to specify particular ways of doing things.

I'll be trusting the management of backup metadata and storage media to
a solution designed for the purpose (open or closed source), just as
I'll be trusting my data to a database solution designed for that
purpose. That for me is one of the good things about PostgreSQL - we use
the filesystem, we don't write our own, we provide language interfaces
not invent our own proprietary server language etc..

Best Regards, Simon Riggs

#162Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#109)
Re: Point in Time Recovery

Oh, here is something else we need to add --- a GUC to control whether
pg_xlog is clean on recovery start.

---------------------------------------------------------------------------

Tom Lane wrote:

Bruce and I had another phone chat about the problems that can ensue
if you restore a tar backup that contains old (incompletely filled)
versions of WAL segment files. While the current code will ignore them
during the recovery-from-archive run, leaving them laying around seems
awfully dangerous. One nasty possibility is that the archiving
mechanism will pick up these files and overwrite good copies in the
archive area with the obsolete ones from the backup :-(.

Bruce earlier proposed that we simply "rm pg_xlog/*" at the start of
a recovery-from-archive run, but as I said I'm scared to death of code
that does such a thing automatically. In particular this would make it
impossible to handle scenarios where you want to do a PITR recovery but
you need to use some recent WAL segments that didn't make it into your
archive yet. (Maybe you could get around this by forcibly transferring
such segments into the archive, but that seems like a bad idea for
incomplete segments.)

It would really be best for the DBA to make sure that the starting
condition for the recovery run does not have any obsolete segment files
in pg_xlog. He could do this either by setting up his backup policy so
that pg_xlog isn't included in the tar backup in the first place, or by
manually removing the included files just after restoring the backup,
before he tries to start the recovery run.

Of course the objection to that is "what if the DBA forgets to do it?"

The idea that we came to on the phone was for the postmaster, when it
enters recovery mode because a recovery.conf file exists, to look in
pg_xlog for existing segment files and refuse to start if any are there
--- *unless* the user has put a special, non-default overriding flag
into recovery.conf.  Call it "use_unarchived_files" or something like
that.  We'd have to provide good documentation and an extensive HINT of
course, but basically the DBA would have two choices when he gets this
refusal to start:

1. Remove all the segment files in pg_xlog. (This would be the right
thing to do if he knows they all came off the backup.)

2. Verify that pg_xlog contains only segment files that are newer than
what's stored in the WAL archive, and then set the override flag in
recovery.conf. In this case the DBA is taking responsibility for
leaving only segment files that are good to use.
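
As a rough illustration of the check proposed above (a sketch only, not the postmaster's code): the function names are hypothetical, the override is modelled as a plain boolean argument rather than a line parsed from recovery.conf, and the assumption that segment files are named with 24 hexadecimal characters is made just for this example.

```python
import os
import re

# Assumption for this sketch: WAL segment files have 24-hex-digit names.
SEGMENT_NAME = re.compile(r"^[0-9A-F]{24}$")


def find_segment_files(pg_xlog_dir):
    """Return the sorted names of files in pg_xlog_dir that look like WAL segments."""
    return sorted(
        name for name in os.listdir(pg_xlog_dir)
        if SEGMENT_NAME.match(name)
        and os.path.isfile(os.path.join(pg_xlog_dir, name))
    )


def check_pg_xlog_before_recovery(pg_xlog_dir, use_unarchived_files=False):
    """Refuse to start recovery if segment files exist and the override is not set.

    Returns the list of local segment files that recovery may use.
    """
    segments = find_segment_files(pg_xlog_dir)
    if segments and not use_unarchived_files:
        # Hypothetical wording; the real HINT would have to be more extensive.
        raise RuntimeError(
            "pg_xlog contains %d segment file(s); remove them, or set "
            "use_unarchived_files if they are newer than the archive"
            % len(segments)
        )
    return segments
```

With an empty pg_xlog the check passes silently; with leftover segments it fails unless the DBA has explicitly taken responsibility by setting the flag.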

One interesting point is that with such a policy, we could use locally
available WAL segments in preference to pulling the same segments from
archive, which would be at least marginally more efficient, and seems
logically cleaner anyway.

In particular it seems that this would be a useful arrangement in cases
where you have questionable WAL segments --- you're not sure if they're
good or not. Rather than having to push questionable data into your WAL
archive, you can leave it local, try a recovery run, and see if you like
the resulting state. If not, it's a lot easier to do-over when you have
not corrupted your archive area.

Comments? Better ideas?

regards, tom lane

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#163Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#141)
Re: [HACKERS] Point in Time Recovery

I do think we need a boolean for start/stop of archiving, rather than
setting it to '' to turn it off. Tom, I think the group agreed to this
on clarity grounds. I would like the server to throw an error if you
try to turn on archiving and the command is set to ''.

---------------------------------------------------------------------------

Simon Riggs wrote:

On Wed, 2004-07-21 at 15:53, Tom Lane wrote:

Klaus Naumann <kn@mgnet.de> writes:

Simon doesn't mean the recovery part. Instead he means the "normal"
startup of the server. It has to be absolutely clear (in the logfile!) if
the server was started in archive mode or not. Otherwise you always have
to guess.

Why would you guess? "SHOW archive_command" will tell you, without
question, at any time. I don't see the point of placing such a message
in the postmaster log --- in normal circumstances the postmaster will
still be running long after its starting messages have been discarded
due to log rotation.

Also, the current implementation allows you to stop and start archiving
on-the-fly, so a start-time message would be an unreliable guide to what
the postmaster is actually doing at the moment.

Overall, this is a small point and I think we should leave Tom alone, to
focus on the bigger issues that we care about.

Tom has done an amazingly good job in the last few days of refactoring
some reasonably ugly code on my part, all without a murmur. I relent on
this to allow everything to be finished in time.

The PITR journey has just begun, so there will be further opportunity to
discuss and agree what constitutes real issues and then correct them.
This may not be on that list later.

Best Regards, Simon Riggs

#164Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#149)
Re: [HACKERS] Point in Time Recovery

Here is another open PITR issue that I think will have to be addressed
in 7.6. If you do a critical transaction, but do nothing else for eight
hours, that critical transaction hasn't been archived yet. It is still
sitting in pg_xlog until the WAL file fills.

I think we will need to document this behavior and address it in some
way in 7.6. We can't assume that we can send multiple copies of pg_xlog
to the archive (partial and full ones) because we might be going to a
tape drive. However, this is a non-intuitive behavior of our archiver.
We might need to tell people to archive the most recent WAL file every
minute to some other location or something.
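
A minimal sketch of that workaround, assuming it is run from cron or a similar scheduler once a minute. The function names are hypothetical, and picking the segment by newest modification time is an assumption of this example. Note that the copy is taken while the file may still be being written, so it is a best-effort safety net and not a substitute for the archived, completely filled segment.

```python
import os
import shutil


def newest_file(directory):
    """Return the path of the most recently modified regular file, or None."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]  # skips subdirectories
    if not files:
        return None
    return max(files, key=os.path.getmtime)


def copy_newest_segment(pg_xlog_dir, dest_dir):
    """Copy the newest file in pg_xlog_dir to dest_dir; return the new path or None."""
    source = newest_file(pg_xlog_dir)
    if source is None:
        return None
    target = os.path.join(dest_dir, os.path.basename(source))
    temp = target + ".tmp"
    shutil.copy2(source, temp)  # copy under a temporary name first
    os.replace(temp, target)    # then rename, so a half-copied file is never visible
    return target
```

Each run overwrites the previous copy of the same segment, which is exactly the overwrite semantics that cannot be assumed for the real archive, so the destination here has to be an ordinary directory.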

---------------------------------------------------------------------------

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I think we should push the partially complete WAL file to the archive
location before shutdown. ...
When you are running and finally fill up the WAL file it would then
overwrite the one in the archive but I think that is OK.

I don't think this can fly at all. Here are some off-the-top-of-the-head
objections:

1. We don't have the luxury of spending indefinite amounts of time to
do a database shutdown. Commonly we are under a twenty-second sentence
of death from init. I don't want to spend the 20 seconds waiting to see
if the archiver will manage to push 16MB onto a slow tape drive. Also,
if the archiver does fail to push the data in time, it'll likely leave a
broken (partial) xlog file in the archive, which would be really bad
news if the user then relies on that.

2. What if the archiver process entirely fails to push the file? (Maybe
there's not enough disk space, for instance.) In normal operation we'll
just retry every so often. We definitely can't do that during shutdown.

3. You're blithely assuming that the archival process can easily provide
overwrite semantics for multiple pushes of the same xlog filename. Stop
thinking about "cp to some directory" and start thinking "dump to tape"
or "burn onto CD" or something like that. We'll be raising the ante
considerably if we require the archive_command to deal with this.

I think the last one is really the most significant issue. We have to
keep the archiver API as simple as possible.

regards, tom lane

#165Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#149)
Re: [ADMIN] Point in Time Recovery

[ Sorry, sent to hackers now.]

#166Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#163)
Re: [HACKERS] Point in Time Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I do think we need a boolean for start/stop of archiving, rather than
setting it to '' to turn it off. Tom, I think the group agreed to this
on clarity grounds.

I didn't see any consensus there, nor do I see a point to it.

regards, tom lane

#167Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#166)
Re: [HACKERS] Point in Time Recovery

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I do think we need a boolean for start/stop of archiving, rather than
setting it to '' to turn it off. Tom, I think the group agreed to this
on clarity grounds.

I didn't see any consensus there, nor do I see a point to it.

I saw a lot of people saying it was a good idea, and only you saying it
was a bad idea.

#168Mark Kirkwood
markir@coretech.co.nz
In reply to: Bruce Momjian (#161)
Re: Point in Time Recovery

I was wondering about this point - might it not be just as reasonable
for the copied file to *be* an exact image of pg_control? Then a very
simple variant of pg_controldata (or maybe even just adding switches to
pg_controldata itself) would enable the relevant info to be extracted

P.S.: would love to be that volunteer - however up to the eyeballs in
Business Objects (cringe) and DB2 for the next week or so....

regards

Mark

Bruce Momjian wrote:

We need someone to code two backend functions to complete PITR.

<snippage>

However, once you decide to do things like that, there is no reason why
the copied file has to be an exact image of pg_control. I claim it
would be more useful if the copied file were plain text so that you
could just "cat" it to find out the starting WAL position; that would
let you determine without any special tools what range of WAL archive
files you are going to need to bring back from your archives.

#169Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Mark Kirkwood (#168)
Re: Point in Time Recovery

Mark Kirkwood wrote:

I was wondering about this point - might it not be just as reasonable
for the copied file to *be* an exact image of pg_control? Then a very
simple variant of pg_controldata (or maybe even just adding switches to
pg_controldata itself) would enable the relevant info to be extracted

We didn't do that so admins could easily read the file contents.

#170Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Bruce Momjian (#163)
Re: [HACKERS] Point in Time Recovery

Bruce Momjian wrote:

I do think we need a boolean for start/stop of archiving, rather than
setting it to '' to turn it off. Tom, I think the group agreed to this
on clarity grounds. I would like the server to throw an error if you
try to turn on archiving and the command is set to ''.

Let me illustrate. To turn off archiving you have to change:

#archive_command = ''
archive_command = 'cp %p /mnt/server/archivedir/%f'

to

archive_command = ''
#archive_command = 'cp %p /mnt/server/archivedir/%f'

and if you comment both or neither, you have problems.

With a boolean it would be:

archive_mode = on
archive_command = 'cp %p /mnt/server/archivedir/%f'

archive_mode = off
archive_command = 'cp %p /mnt/server/archivedir/%f'

Now, if you say people will rarely turn archiving on/off, then one
parameter seems to make more sense.
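
To make the two-parameter behaviour concrete, here is a small sketch of the rule I have in mind. The function name is hypothetical and this is not how GUC processing is written; it only shows the decision, including the error for archive_mode = on with an empty command.

```python
def archiving_enabled(archive_mode, archive_command):
    """Return True if archiving should run under the two-parameter scheme.

    archive_mode is a boolean, archive_command is the configured string.
    Raises ValueError when archiving is switched on without a command.
    """
    if not archive_mode:
        # The command may stay configured; it is simply not used.
        return False
    if archive_command.strip() == "":
        raise ValueError("archive_mode is on but archive_command is empty")
    return True
```

Under the single-parameter scheme the whole test collapses to checking whether archive_command is non-empty, which is the simpler choice if archiving is rarely switched on and off.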

#171Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#170)
Re: [HACKERS] Point in Time Recovery

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Now, if you say people will rarely turn archiving on/off, then one
parameter seems to make more sense.

I really can't envision a situation where people would do that. If you
need PITR at all then you need it 24x7.

regards, tom lane

#172Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#171)
Re: [HACKERS] Point in Time Recovery

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Now, if you say people will rarely turn archiving on/off, then one
parameter seems to make more sense.

I really can't envision a situation where people would do that. If you
need PITR at all then you need it 24x7.

OK, then we are OK. If we find that isn't true, we can reevaluate.

#173Noname
markir@coretech.co.nz
In reply to: Bruce Momjian (#169)
Re: Point in Time Recovery

Quoting Bruce Momjian <pgman@candle.pha.pa.us>:

Mark Kirkwood wrote:

I was wondering about this point - might it not be just as reasonable
for the copied file to *be* an exact image of pg_control? Then a very
simple variant of pg_controldata (or maybe even just adding switches to
pg_controldata itself) would enable the relevant info to be extracted

We didn't do that so admins could easily read the file contents.

Ease of reading is a good thing, no argument there.

However using 'pg_controldata' (or similar) to perform the read is not really
that much harder than using 'cat' - (it is a wee bit harder, I grant you)

When I posted the original mail I was thinking that the pg_control image is good
because it has much more information than just the last wal offset, and could
be used to perform a recovery in the event of the "actual" pg_control being
unsuitable (e.g. backed up last instead of first on a busy system).

Of course this thinking didn't make it into the original mail, sorry about that!

regards

Mark

#174Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Noname (#173)
Re: Point in Time Recovery

I was wondering about this point - might it not be just as reasonable
for the copied file to *be* an exact image of pg_control? Then a very
simple variant of pg_controldata (or maybe even just adding switches to
pg_controldata itself) would enable the relevant info to be extracted

We didn't do that so admins could easily read the file contents.

If you use a readable file you will also need a feature for restore (or a tool)
to create an appropriate pg_control file, or are you intending to still require
that pg_control be the first file backed up.
Another possibility would be that the start function writes the readable file and
also copies pg_control.

Andreas

#175Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Zeugswetter Andreas SB SD (#174)
Re: Point in Time Recovery

Zeugswetter Andreas SB SD wrote:

I was wondering about this point - might it not be just as reasonable
for the copied file to *be* an exact image of pg_control? Then a very
simple variant of pg_controldata (or maybe even just adding switches to
pg_controldata itself) would enable the relevant info to be extracted

We didn't do that so admins could easily read the file contents.

If you use a readable file you will also need a feature for restore (or a tool)
to create an appropriate pg_control file, or are you intending to still require
that pg_control be the first file backed up.
Another possibility would be that the start function writes the readable file and
also copies pg_control.

We will back up pg_control in the tar file but it doesn't have to have
all correct information. The WAL replay will set it properly, I think.
In fact it has to start from the recovery checkpoint settings, not the
backup settings at all.

#176Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#174)
Re: Point in Time Recovery

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

If you use a readable file you will also need a feature for restore
(or a tool) to create an appropriate pg_control file, or are you
intending to still require that pg_control be the first file backed
up.

No, the entire point of this exercise is to get rid of that assumption.
You do need *a* copy of pg_control, but the only reason you'd need to
back it up first rather than later is so that its checkpoint pointer
points to the last checkpoint before the dump starts. Which is the
information we want to put in the archive-label file instead.

If a copy of pg_control were sufficient then I'd be all for using it as
the archive-label file, but it's *not* sufficient because you also need
the ending WAL offset. So we need a different file layout in any case,
and we may as well take some pity on the poor DBA and make the file
easily human-readable.
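
To illustrate why a plain-text label is convenient, here is a sketch of how a script could work out which archived segments to bring back. Everything about the layout is hypothetical: the thread has not settled a file format, so the "key: value" lines and the field names start_segment and stop_segment are invented for this example, as is the assumption that segment file names sort in WAL order.

```python
def parse_label(text):
    """Parse 'key: value' lines of a hypothetical archive-label file into a dict."""
    fields = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def segments_needed(label_text, archived_names):
    """Return the archived segment names between the label's start and stop, inclusive."""
    fields = parse_label(label_text)
    start = fields["start_segment"]  # hypothetical field: segment of the starting checkpoint
    stop = fields["stop_segment"]    # hypothetical field: segment holding the ending WAL offset
    # Assumes the names compare lexicographically in the same order as the WAL itself.
    return sorted(name for name in archived_names if start <= name <= stop)
```

The same information is of course visible by eye with "cat", which is the point of making the file human-readable.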

regards, tom lane

#177Mark Kirkwood
markir@coretech.co.nz
In reply to: Tom Lane (#176)
Re: Point in Time Recovery

Ok - that is a much better way of doing it!

regards

Mark

Tom Lane wrote:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

If you use a readable file you will also need a feature for restore
(or a tool) to create an appropriate pg_control file, or are you
intending to still require that pg_control be the first file backed
up.

No, the entire point of this exercise is to get rid of that assumption.
You do need *a* copy of pg_control, but the only reason you'd need to
back it up first rather than later is so that its checkpoint pointer
points to the last checkpoint before the dump starts. Which is the
information we want to put in the archive-label file instead.

If a copy of pg_control were sufficient then I'd be all for using it as
the archive-label file, but it's *not* sufficient because you also need
the ending WAL offset. So we need a different file layout in any case,
and we may as well take some pity on the poor DBA and make the file
easily human-readable.

regards, tom lane

#178Simon@2ndquadrant.com
simon@2ndquadrant.com
In reply to: Bruce Momjian (#172)
Re: [HACKERS] Point in Time Recovery

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Now, if you say people will rarely turn archiving on/off, then one
parameter seems to make more sense.

I really can't envision a situation where people would do that. If you
need PITR at all then you need it 24x7.

I agree. The second parameter is only there to clarify the intent.

8.0 does introduce two good reasons to turn it on/off, however:
- index build speedups
- COPY speedups

I would opt to make enabling/disabling archive_command require a postmaster
restart. That way there would be no capability to take advantage of the
incentive to turn it on/off.

For TODO:

It would be my intention (in 8.1) to make those available via switches e.g.
NOT LOGGED options on CREATE INDEX and COPY, to allow users to take
advantage of the no logging optimization without turning off PITR system
wide. (Just as this is possible in Oracle and Teradata).

I would also aim to make the first Insert Select into an empty table not
logged (optionally). This is an important optimization for Oracle, teradata
and DB2 (which uses NOT LOGGED INITIALLY).

Best Regards, Simon Riggs

#179Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon@2ndquadrant.com (#178)
Re: [HACKERS] Point in Time Recovery

"Simon@2ndquadrant.com" <simon@2ndquadrant.com> writes:

I would opt to make enabling/disabling archive_command require a postmaster
restart. That way there would be no capability to take advantage of the
incentive to turn it on/off.

We're generally not in the habit of making GUC parameters more rigid
than the implementation absolutely requires.

It would be my intention (in 8.1) to make those available via switches e.g.
NOT LOGGED options on CREATE INDEX and COPY, to allow users to take
advantage of the no logging optimization without turning off PITR system
wide. (Just as this is possible in Oracle and Teradata).

Isn't this in direct conflict with your opinion above? And I cannot say
that I think this one is a good idea. We do not have support for
selective catalog xlogging; if you do something like this then you
*will* have a broken database after recovery, because it will contain
those indexes but with invalid contents.

I would also aim to make the first Insert Select into an empty table not
logged (optionally). This is an important optimization for Oracle, teradata
and DB2 (which uses NOT LOGGED INITIALLY).

This is even worse: not only do you have a broken database, but you have
no way to recover. (At least with an unlogged index you could fix it by
REINDEX.) If you don't care about longevity of the table, then make it
a temp table.

The fact that Oracle does it does not automatically make it a good idea.

regards, tom lane

#180Simon@2ndquadrant.com
simon@2ndquadrant.com
In reply to: Tom Lane (#179)
Re: NOT LOGGED options (was Point in Time Recovery )

Tom Lane wrote
"Simon@2ndquadrant.com" <simon@2ndquadrant.com> writes:

It would be my intention (in 8.1) to make those available via switches e.g.
NOT LOGGED options on CREATE INDEX and COPY, to allow users to take
advantage of the no logging optimization without turning off PITR system
wide. (Just as this is possible in Oracle and Teradata).

Isn't this in direct conflict with your opinion above? And I cannot say
that I think this one is a good idea. We do not have support for
selective catalog xlogging; if you do something like this then you
*will* have a broken database after recovery, because it will contain
those indexes but with invalid contents.

No, it's not in direct conflict. Turning OFF archive_mode would have a system
wide effect. The options described allow individual applications to make a
choice about whether certain very large operations are recoverable, or not.
I don't ever personally want to turn off system wide PITR, but there will be
times when I choose to avoid overhead on individual ops when the situation
dictates. This goes with your oft-mentioned dislike of systems that think
they know better than you do...

The first two optimizations have been included in 8.0 when archive_mode is
off. If there is a problem, then it will affect crash recovery of those
systems also. I suggest using exactly this optimisation, though under user
(application) control, rather than sysadmin control.

The challenges you mention have a solution. I wanted to add these to TODO,
not yet to discuss detailed implementation.

I would also aim to make the first Insert Select into an empty table not
logged (optionally). This is an important optimization for Oracle, teradata
and DB2 (which uses NOT LOGGED INITIALLY).

This is even worse: not only do you have a broken database, but you have
no way to recover. (At least with an unlogged index you could fix it by
REINDEX.) If you don't care about longevity of the table, then make it
a temp table.

It is frequently possible to use that route, though the option remains in
frequent use in other situations.

The fact that Oracle does it does not automatically make it a good idea.

Amen to that. You will note that unless compatibility has been a
requirement, there have been times I have not followed the Oracle path, e.g.
PITR design.

I admit it must seem strange that I tried so hard to put PITR in place, only
to suggest removing it, optionally...

Overall, the options I describe here have been in production use in major
enterprise Data Warehouse systems for almost 15 years now. Oracle and DB2
copied the original Teradata implementation; slowly, because they too
didn't quickly or easily accept the wisdom. There is absolutely no doubt of
the true value of these optimisations - the TPC-H tests for all vendors make
use of those (hidden in the details of which load utility options are used,
or simply the default behaviour).

Logging only has value when the mean time to recover is low enough to make
recovery worthwhile. This can catch you in a bind because you have to decide
whether to reduce MTTR at the expense of 100% data recovery. For some big
systems, recovery is only an option if you exclude the biggest table(s). In
a Data Warehouse, where data is loaded in large volumes, it may only be
feasible to load it when you have this optimisation. In a recovery
situation, re-loading the largest fact tables from their original source
data files is more likely to be the best option, or in some cases, skipped
entirely in favour of loading new data.

I don't claim that everybody would want this, only that it is an extremely
beneficial optimisation for many very large databases - which is much of my
focus.

You've pointed out that I'm new "round here", which is certainly true - but
I have been many places... There are and will be many differences in
thinking that emerge from this; I regard all of this as synergy, not
argument.

Best Regards, Simon Riggs

#181Manfred Spraul
manfred@colorfullife.com
In reply to: Simon@2ndquadrant.com (#180)
Re: NOT LOGGED options (was Point in Time Recovery )

Simon@2ndquadrant.com wrote:

Tom Lane wrote

NOT LOGGED options on CREATE INDEX and COPY, to allow users to take
advantage of the no logging optimization without turning off PITR system
wide. (Just as this is possible in Oracle and Teradata).

Isn't this in direct conflict with your opinion above? And I cannot say
that I think this one is a good idea. We do not have support for
selective catalog xlogging;

Is it possible to skip the xlog fsync for NOT LOGGED transactions?

--
Manfred

#182Simon Riggs
simon@2ndquadrant.com
In reply to: Manfred Spraul (#181)
Re: NOT LOGGED options (was Point in Time Recovery )

Manfred Spraul
Simon@2ndquadrant.com wrote:

Tom Lane wrote

NOT LOGGED options on CREATE INDEX and COPY, to allow users to take
advantage of the no logging optimization without turning off PITR system
wide. (Just as this is possible in Oracle and Teradata).

Isn't this in direct conflict with your opinion above? And I cannot say
that I think this one is a good idea. We do not have support for
selective catalog xlogging;

Is it possible to skip the xlog fsync for NOT LOGGED transactions?

Hmm...good thinking...however,

For very large operations, it's the volume of the xlog writes that's the
problem, not the fsync of the logs. The type of things I'm thinking about
are large CREATE INDEX and large COPY operations, for very large tasks,
i.e. > 1GB. These are most useful in data warehousing operations - which
is about > 20% of the user base according to the survey stats from
www.postgresql.org.

The wal buffer only gets synced at end of transaction, or when the buffer is
full. On long operations there is still only one commit, so not fsyncing
there won't gain much. The buffer will fill up repeatedly and require
flushing - which you can't really skip because when you get to the commit
you need to be certain that everything is down to disk - there's not much
point fsyncing the commit if the previous wal records haven't been.

If there was a way to tell whether a block in the wal buffer had been
written by a NOT LOGGED transaction, then it might be possible to vary the
fsync behaviour accordingly. That's a good idea if that's what you meant,
though it would mean changing some critical, well tested code that every wal
record goes through. I'd rather simply not write wal at all for the certain
specific situations when the user requests it - there are already decision
points in the code for both the situations I've mentioned, since these have
been optimised in 8.0 for when archive_command has not been set. It would be
a simple matter to add in a check at that point.

Anyway...this is probably 8.1 stuff now.

Best Regards, Simon Riggs

#183JEDIDIAH
jedi@nomad.mishnet
In reply to: Tom Lane (#149)
Re: [HACKERS] Point in Time Recovery

On 2004-07-28, Bruce Momjian <pgman@candle.pha.pa.us> wrote:

Here is another open PITR issue that I think will have to be addressed
in 7.6. If you do a critical transaction, but do nothing else for eight
hours, that critical transaction hasn't been archived yet. It is still
sitting in pg_xlog until the WAL file fills.

I think we will need to document this behavior and address it in some
way in 7.6. We can't assume that we can send multiple copies of pg_xlog
to the archive (partial and full ones) because we might be going to a

If a particular transaction is so important that it absolutely
positively needs to be archived offline for PITR, then why not just mark
it that way or allow for the application to trigger archival of this
critical REDO?

tape drive. However, this is a non-intuitive behavior of our archiver.
We might need to tell people to archive the most recent WAL file every
minute to some other location or something.

[deletia]

--
Negligence will never equal intent, no matter how you
attempt to distort reality to do so. This is what separates |||
the real butchers from average Joes (or Fritzes) caught up in / | \
events not in their control.