Point in Time Recovery
Taking advantage of the freeze bubble allowed us... there are some last
minute features to add.
Summarising earlier thoughts, with some detailed digging and design from
myself in last few days - we're now in a position to add Point-in-Time
Recovery, on top of whats been achieved.
The target for the last record to recover to can be specified in 2 ways:
- by transactionId - not that useful, unless you have a means of
identifying what has happened from the log, then using that info to
specify how to recover - coming later - not in next few days :(
- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope....)
Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?
If we did, would that be portable?
Suggestions welcome, because I know very little of the details of
various *nix systems and win* on that topic.
Only COMMIT and ABORT records have timestamps, allowing us to circumvent
any discussion about partial transaction recovery and nested
transactions.
When we do recover, stopping at the timestamp is just half the battle.
We need to leave the xlog in which we stop in a state from which we can
enter production smoothly and cleanly. To do this, we could:
- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.
I'm tempted by the first plan, because it is more straightforward and
stands much less chance of me introducing 50 wierd bugs just before
close.
Also, I think it is straightforward to introduce control file duplexing,
with a second copy stored and maintained in the pg_xlog directory. This
would provide additional protection for pg_control, which takes on more
importance now that archive recovery is working. pg_xlog is a natural
home, since on busy systems it's on its own disk away from everything
else, ensuring that at least one copy survives. I can't see a downside
to that, but others might... We can introduce user specifiable
duplexing, in later releases.
For later, I envisage an off-line utility that can be used to inspect
xlog records. This could provide a number of features:
- validate archived xlogs, to check they are sound.
- produce summary reports, to allow identification of transactionIds and
the effects of particular transactions
- performance analysis to allow decisions to be made about whether group
commit features could be utilised to good effect
(Not now...)
Best regards, Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?
Pretty much everybody supports gettimeofday() (time_t and separate
integer microseconds); you might as well use that. Note that the actual
resolution is not necessarily microseconds, and it'd still not be
certain that successive commits have distinct timestamps --- so maybe
this refinement would be pointless. You'll still have to design a user
interface that allows selection without the assumption of distinct
timestamps.
- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.
Go with plan B; it's best not to destroy data (what if you chose the
wrong restart point the first time)?
Actually this now reminds me of a discussion I had with Patrick
Macdonald some time ago. The DB2 practice in this connection is that
you *never* overwrite existing logfile data when recovering. Instead
you start a brand new xlog segment file, which is given a new "branch
number" so it can be distinguished from the future-time xlog segments
that you chose not to apply. I don't recall what the DB2 terminology
was exactly --- not "branch number" I don't think --- but anyway the
idea is that when you restart the database after an incomplete recovery,
you are now in a sort of parallel universe that has its own history
after the branch point (PITR stop point). You need to be able to
distinguish archived log segments of this parallel universe from those
of previous and subsequent incarnations. I'm not sure whether Vadim
intended our StartUpID to serve this purpose, but it could perhaps be
used that way, if we reflected it in the WAL file names.
regards, tom lane
On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?Pretty much everybody supports gettimeofday() (time_t and separate
integer microseconds); you might as well use that. Note that the actual
resolution is not necessarily microseconds, and it'd still not be
certain that successive commits have distinct timestamps --- so maybe
this refinement would be pointless. You'll still have to design a user
interface that allows selection without the assumption of distinct
timestamps.
Well, I agree, though without the desired-for UI now, I think some finer
grained mechanism would be good. This means extending the xlog commit
record by a couple of bytes...OK, lets live a little.
- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.Go with plan B; it's best not to destroy data (what if you chose the
wrong restart point the first time)?
eh? Which way round? The second plan was the one where I would destroy
data by overwriting it, thats exactly why I preferred the first.
Actually, the files are always copied from archive, so re-recovery is
always an available option in the design thats been implemented.
No matter...
Actually this now reminds me of a discussion I had with Patrick
Macdonald some time ago. The DB2 practice in this connection is that
you *never* overwrite existing logfile data when recovering. Instead
you start a brand new xlog segment file,
Now thats a much better plan...I suppose I just have to rack up the
recovery pointer to the first record on the first page of a new xlog
file, similar to first plan, but just fast-forwarding rather than
forwarding.
My only issue was to do with the secondary Checkpoint marker, which is
always reset to the place you just restored FROM, when you complete a
recovery. That could lead to a situation where you recover, then before
next checkpoint, fail and lose last checkpoint marker, then crash
recover from previous checkpoint (again), but this time replay the
records you were careful to avoid.
which is given a new "branch
number" so it can be distinguished from the future-time xlog segments
that you chose not to apply. I don't recall what the DB2 terminology
was exactly --- not "branch number" I don't think --- but anyway the
idea is that when you restart the database after an incomplete recovery,
you are now in a sort of parallel universe that has its own history
after the branch point (PITR stop point). You need to be able to
distinguish archived log segments of this parallel universe from those
of previous and subsequent incarnations.
Thats a good idea, if only because you so easily screw your test data
during multiple recovery situations. But if its good during testing, it
must be good in production too...since you may well perform
recovery...run for a while, then discover that you got it wrong first
time, then need to re-recover again. I already added that to my list of
gotchas and that would solve it.
I was going to say hats off to the Blue-hued ones, when I remembered
this little gem from last year
http://www.danskebank.com/link/ITreport20030403uk/$file/ITreport20030403uk.pdf
I'm not sure whether Vadim
intended our StartUpID to serve this purpose, but it could perhaps be
used that way, if we reflected it in the WAL file names.
Well, I'm not sure about StartUpId....but certainly the high 2 bytes of
LogId looks pretty certain never to be anything but zeros. You have 2.4
x 10^14...which is 9,000 years at 1000 log file/sec
We could use the scheme you descibe:
add xFFFF to the logid every time you complete an archive recovery...so
the log files look like 0001000000000CE3 after youve recovered a load of
files that look like 0000000000000CE3
If you used StartUpID directly, you might just run out....but its very
unlikely you would ever perform 65000 recovery situations - unless
you've run the <expletive> code as often as I have :(.
Doing that also means we don't have to work out how to do that with
StartUpID. Of course, altering the length and makeup of the xlog files
is possible too, but that will cause other stuff to stop working....
[We'll have to give this a no-SciFi name, unless we want to make
in-roads into the Dr.Who fanbase :) Don't get them started. Better
still, dont give it a name at all.]
I'll sleep on that lot.
Best regards, Simon Riggs
- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope....)Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?
Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.
- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.
I'm tempted by the first plan, because it is more straightforward and
stands much less chance of me introducing 50 wierd bugs just before
close.
But what if you restore because after that PIT everything went haywire
including the log ? Did you mean to apply the remaining changes but not
commit those xids ? I think it is essential to only leave xlogs around
that allow a subsequent rollforward from the same old full backup.
Or is an instant new full backup required after restore ?
Andreas
Import Notes
Resolved by subject fallback
On Tue, 6 Jul 2004, Zeugswetter Andreas SB SD wrote:
Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.
I'd also think that seconds are absolutely sufficient. From my daily
experience I can say that you're normally lucky to know the time
on one minute basis.
If you need to get closer, a log sniffer is unavoidable ...
Greetings, Klaus
--
Full Name : Klaus Naumann | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964 | E-Mail: (kn@mgnet.de)
Simon Riggs wrote:
On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?Pretty much everybody supports gettimeofday() (time_t and separate
integer microseconds); you might as well use that. Note that the actual
resolution is not necessarily microseconds, and it'd still not be
certain that successive commits have distinct timestamps --- so maybe
this refinement would be pointless. You'll still have to design a user
interface that allows selection without the assumption of distinct
timestamps.Well, I agree, though without the desired-for UI now, I think some finer
grained mechanism would be good. This means extending the xlog commit
record by a couple of bytes...OK, lets live a little.
At the risk of irritating people, I'll repeat what I suggested a few
weeks ago...
Add a table: pg_pitr_checkpt (pitr_id SERIAL, pitr_ts timestamptz,
pitr_comment text)
Let the user insert rows in transactions as desired. Let them stop the
restore when a specific (pitr_ts,pitr_comment) gets inserted (or on
pitr_id if they record it).
IMHO time is seldom relevant, event boundaries are.
If you want to add special syntax for this, fine. If not, an INSERT
statement is a convenient way to do this anyway.
--
Richard Huxton
Archonet Ltd
On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope....)Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.
OK, thanks. I'll just leave the time_t datatype just the way it is.
Best Regards, Simon Riggs
On Tue, 2004-07-06 at 20:00, Richard Huxton wrote:
Simon Riggs wrote:
On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?Pretty much everybody supports gettimeofday() (time_t and separate
integer microseconds); you might as well use that. Note that the actual
resolution is not necessarily microseconds, and it'd still not be
certain that successive commits have distinct timestamps --- so maybe
this refinement would be pointless. You'll still have to design a user
interface that allows selection without the assumption of distinct
timestamps.Well, I agree, though without the desired-for UI now, I think some finer
grained mechanism would be good. This means extending the xlog commit
record by a couple of bytes...OK, lets live a little.At the risk of irritating people, I'll repeat what I suggested a few
weeks ago...
All feedback is good. Thanks.
Add a table: pg_pitr_checkpt (pitr_id SERIAL, pitr_ts timestamptz,
pitr_comment text)
Let the user insert rows in transactions as desired. Let them stop the
restore when a specific (pitr_ts,pitr_comment) gets inserted (or on
pitr_id if they record it).
It's a good plan, but the recovery is currently offline recovery and no
SQL is possible. So no way to insert, no way to access tables until
recovery completes. I like that plan and probably would have used it if
it was viable.
IMHO time is seldom relevant, event boundaries are.
Agreed, but time is the universally agreed way of describing two events
as being simultaneous. No other way to say "recover to the point when
the message queue went wild".
As of last post to Andreas, I've said I'll not bother changing the
granularity of the timestamp.
If you want to add special syntax for this, fine. If not, an INSERT
statement is a convenient way to do this anyway.
The special syntax isn't hugely important - I did suggest a kind of
SQL-like syntax previously, but thats gone now. Invoking recovery via a
command file IS, so we are able to tell the system its not in crash
recovery AND that when you've finished I want you to respond to crashes
without re-entering archive recovery.
Thanks for your comments. I'm not making this more complex than needs
be; in fact much of the code is very simple - its just the planning
that's complex.
Best regards, Simon Riggs
On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.Go with plan B; it's best not to destroy data (what if you chose the
wrong restart point the first time)?Actually this now reminds me of a discussion I had with Patrick
Macdonald some time ago. The DB2 practice in this connection is that
you *never* overwrite existing logfile data when recovering. Instead
you start a brand new xlog segment file, which is given a new "branch
number" so it can be distinguished from the future-time xlog segments
that you chose not to apply. I don't recall what the DB2 terminology
was exactly --- not "branch number" I don't think --- but anyway the
idea is that when you restart the database after an incomplete recovery,
you are now in a sort of parallel universe that has its own history
after the branch point (PITR stop point). You need to be able to
distinguish archived log segments of this parallel universe from those
of previous and subsequent incarnations. I'm not sure whether Vadim
intended our StartUpID to serve this purpose, but it could perhaps be
used that way, if we reflected it in the WAL file names.
Some more thoughts...focusing on the what do we do after we've finished
recovering. The objectives, as I see them, are to put the system into a
state, that preserves these features:
1. we never overwrite files, in case we want to re-run recovery
2. we never write files that MIGHT have been written previously
3. we need to ensure that any xlog records skipped at admins request (in
PITR mode) are never in a position to be re-applied to this timeline.
4. ensure we can re-recover, if we need to, without further problems
Tom's concept above, I'm going to call timelines. A timeline is the
sequence of logs created by the execution of a server. If you recover
the database, you create a new timeline. [This is because, if you've
invoked PITR you absolutely definitely want log records written to, say,
xlog15 to be different to those that were written to xlog15 in a
previous timeline that you have chosen not to reapply.]
Objective (1) is complex.
When we are restoring, we always start with archived copies of the xlog,
to make sure we don't finish too soon. We roll forward until we either
reach PITR stop point, or we hit end of archived logs. If we hit end of
logs on archive, then we switch to a local copy, if one exists that is
higher than those, we carry on rolling forward until either we reach
PITR stop point, or we hit end of that log. (Hopefully, there isn't more
than one local xlog higher than the archive, but its possible).
If we are rolling forward on local copies, then they are our only
copies. We'd really like to archive them ASAP, but the archiver's not
running yet - we don't want to force that situation in case the archive
device (say a tape) is the one being used to recover right now. So we
write an archive_status of .ready for that file, ensuring that the
checkpoint won't remove it until it gets copied to archive, whenever
that starts working again. Objective (1) met.
When we have finished recovering we:
- create a new xlog at the start of a new ++timeline
- copy the last applied xlog record to it as the first record
- set the record pointer so that it matches
That way, when we come up and begin running, we never overwrite files
that might have been written previously. Objective (2) met.
We do the other stuff because recovery finishes up by pointing to the
last applied record...which is what was causing all of this extra work
in the first place.
At this point, we also reset the secondary checkpoint record, so that
should recovery be required again before next checkpoint AND the
shutdown checkpoint record written after recovery completes is
wrong/damaged, the recovery will not autorewind back past the PITR stop
point and attempt to recover the records we have just tried so hard to
reverse/ignore. Objective (3) met. (Clearly, that situation seems
unlikely, but I feel we must deal with it...a newly restored system is
actually very fragile, so a crash again within 3 minutes or so is very
commonplace, as far as these things go).
Should we need to re-recover, we can do so because the new timeline
xlogs are further forward than the old timeline, so never get seen by
any processes (all of which look backwards). Re-recovery is possible
without problems, if required. This means you're a lot safer from some
of the mistakes you might of made, such as deciding you need to go into
recovery, then realising it wasn't required (or some other painful
flapping as goes on in computer rooms at 3am).
How do we implement timelines?
The main presumption in the code is that xlogs are sequential. That has
two effects:
1. during recovery, we try to open the "next" xlog by adding one to the
numbers and then looking for that file
2. during checkpoint, we look for filenames less than the current
checkpoint marker
Creating a timeline by adding a larger number to LogId allows us to
prevent (1) from working, yet without breaking (2).
Well, Tom does seem to have something with regard to StartUpIds. I feel
it is easier to force a new timeline by adding a very large number to
the LogId IF, and only if, we have performed an archive recovery. That
way, we do not change at all the behaviour of the system for people that
choose not to implement archive_mode.
Should we implement timelines?
Yes, I think we should. I've already hit the problems that timelines
solve in my testing and so that means they'll be hit when you don't need
the hassle.
Comments much appreciated, assuming you read this far...
Best regards, Simon Riggs
Well, Tom does seem to have something with regard to StartUpIds. I feel
it is easier to force a new timeline by adding a very large number to
the LogId IF, and only if, we have performed an archive recovery. That
way, we do not change at all the behaviour of the system for people that
choose not to implement archive_mode.
Imho you should take a close look at StartUpId, I think it is exactly this
"large number". Maybe you can add +2 to intentionally leave a hole.
Once you increment, I think it is very essential to checkpoint and double
check pg_control, cause otherwise a crashrecovery would read the wrong xlogs.
Should we implement timelines?
Yes :-)
Andreas
Import Notes
Resolved by subject fallback
On Wed, 2004-07-07 at 14:17, Zeugswetter Andreas SB SD wrote:
Well, Tom does seem to have something with regard to StartUpIds. I feel
it is easier to force a new timeline by adding a very large number to
the LogId IF, and only if, we have performed an archive recovery. That
way, we do not change at all the behaviour of the system for people that
choose not to implement archive_mode.Imho you should take a close look at StartUpId, I think it is exactly this
"large number". Maybe you can add +2 to intentionally leave a hole.Once you increment, I think it is very essential to checkpoint and double
check pg_control, cause otherwise a crashrecovery would read the wrong xlogs.
Thanks for your thoughts - you have made me rethink this over some
hours. Trouble is, on this occasion, the other suggestion still seems
the best one, IMVHO.
If we number timelines based upon StartUpId, then there is still some
potential for conflict and this is what we're seeking to avoid.
Simply adding FFFF to the LogId puts the new timeline so far into the
previous timelines future that there isn't any problems. We only
increment the timeline when we recover, so we're not eating up the
numbers quickly. Simply adding a number means that there isn't any
conflict with any normal operations. The timelines aren't numbered
separately, so I'm not introducing a new version of
StartUpID...technically there isn't a new timeline, just a chunk of
geological time between them.
We don't need to mention timelines in the docs, nor do we need to alter
pg_controldata to display it...just a comment in the code to explain why
we add a large number to the LogId after each recovery completes.
Best regards, Simon Riggs
On Thu, 8 Jul 2004, Simon Riggs wrote:
We don't need to mention timelines in the docs, nor do we need to alter
pg_controldata to display it...just a comment in the code to explain why
we add a large number to the LogId after each recovery completes.
I'd disagree on that. Knowing what exactly happens when recovering the
database is a must. It makes people feel more safe about the process. This
is because the software doesn't do things you wouldn't expect.
On Oracle e.g. you create a new database incarnation when you recover a
database (incomplete). Which means your System Change Number and your Log
Sequence is reset to 0.
Knowing this is crucial because you otherwise would wonder "Why is it
doing that? Is this a bug or a feature?"
Just my 2ct :-))
Greetings, Klaus
--
Full Name : Klaus Naumann | (http://www.mgnet.de/) (Germany)
Phone / FAX : ++49/177/7862964 | E-Mail: (kn@mgnet.de)
On Thu, 2004-07-08 at 07:57, spock@mgnet.de wrote:
On Thu, 8 Jul 2004, Simon Riggs wrote:
We don't need to mention timelines in the docs, nor do we need to alter
pg_controldata to display it...just a comment in the code to explain why
we add a large number to the LogId after each recovery completes.I'd disagree on that. Knowing what exactly happens when recovering the
database is a must. It makes people feel more safe about the process. This
is because the software doesn't do things you wouldn't expect.On Oracle e.g. you create a new database incarnation when you recover a
database (incomplete). Which means your System Change Number and your Log
Sequence is reset to 0.
Knowing this is crucial because you otherwise would wonder "Why is it
doing that? Is this a bug or a feature?"
OK, will do.
Best regards, Simon Riggs
On 7/6/2004 3:58 PM, Simon Riggs wrote:
On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope....)Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.
TransactionID and timestamp is only sufficient if the transactions are
selected by their commit order. Especially in read committed mode,
consider this execution:
xid-1: start
xid-2: start
xid-2: update field x
xid-2: commit
xid-1: update field y
xid-1: commit
In this case, the update done by xid-1 depends on the row created by
xid-2. So logically xid-2 precedes xid-1, because it made its changes
earlier.
So you have to apply the log until you find the commit record of the
transaction you want apply last, and then stamp all transactions that
where in progress at that time as aborted.
Jan
OK, thanks. I'll just leave the time_t datatype just the way it is.
Best Regards, Simon Riggs
---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo@postgresql.org so that your
message can get through to the mailing list cleanly
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
On Sat, 2004-07-10 at 15:17, Jan Wieck wrote:
On 7/6/2004 3:58 PM, Simon Riggs wrote:
On Tue, 2004-07-06 at 08:38, Zeugswetter Andreas SB SD wrote:
- by time - but the time stamp on each xlog record only specifies to the
second, which could easily be 10 or more commits (we hope....)Should we use a different datatype than time_t for the commit timestamp,
one that offers more fine grained differentiation between checkpoints?Imho seconds is really sufficient. If you know a more precise position
you will probably know it from backend log or an xlog sniffer. With those
you can easily use the TransactionId way.TransactionID and timestamp is only sufficient if the transactions are
selected by their commit order. Especially in read committed mode,
consider this execution:xid-1: start
xid-2: start
xid-2: update field x
xid-2: commit
xid-1: update field y
xid-1: commitIn this case, the update done by xid-1 depends on the row created by
xid-2. So logically xid-2 precedes xid-1, because it made its changes
earlier.So you have to apply the log until you find the commit record of the
transaction you want apply last, and then stamp all transactions that
where in progress at that time as aborted.
Agreed.
I've implemented this exactly as you say....
This turns out to be very easy because:
- when looking where to stop we only ever stop at commit or aborts -
these are the only records that have timestamps anyway...
- any record that isn't specifically committed is not updated in the
clog and therefore not visible. The clog starts in indeterminate state,
0 and is then updated to either committed or aborted. Aborted and
indeterminate are handled similarly in the current code, to allow for
crash recovery - PITR doesn't change anything there.
So, PITR doesn't do anything that crash recovery doen't already do.
Crash recovery makes no attempt to keep track of in-progress
transactions and doesn't make a special journey to the clog to
specifically mark them as aborted - they just are by default.
So, what we mean by "stop at a transactionId" is "stop applying redo at
the commit/abort record for that transactionId." It has to be an exact
match, not a "greater than", for exactly the reason you mention. That
means that although we stop at the commit record of transactionId X, we
may also have applied records for transactions with later transactionIds
e.g. X+1, X+2...etc (without limit or restriction).
(I'll even admit that as first, I did think we could get away with the
"less than" test that you are warning me about. Overall, I've spent more
time on theory/analysis than on coding, on the idea that you can improve
poor code, but wrong code just needs to be thrown away).
Timestamps are more vague...When time is used, there might easily be 10+
transactions whose commit/abort records have identical timestamp values.
So we either stop at the first or last record depending upon whether we
specified inclusive or exclusive on the recovery target value.
The hard bit, IMHO, is what we do with the part of the log that we have
chosen not to apply....which has been discussed on list in detail also.
Thanks for keeping an eye out for possible errors - this one is
something I'd thought through and catered for (there are comments in my
current latest published code to that effect, although I have not yet
finished coding the clean-up-after-stopping part).
This implies nothing with regard to other possible errors or oversights
and so I very much welcome any questioning of this nature - I am as
prone to error as the next man. It's important we get this right.
Best regards, Simon Riggs
On Tue, 2004-07-06 at 22:40, Simon Riggs wrote:
On Mon, 2004-07-05 at 22:46, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
- when we stop, keep reading records until EOF, just don't apply them.
When we write a checkpoint at end of recovery, the unapplied
transactions are buried alive, never to return.
- stop where we stop, then force zeros to EOF, so that no possible
record remains of previous transactions.Go with plan B; it's best not to destroy data (what if you chose the
wrong restart point the first time)?Actually this now reminds me of a discussion I had with Patrick
Macdonald some time ago. The DB2 practice in this connection is that
you *never* overwrite existing logfile data when recovering. Instead
you start a brand new xlog segment file, which is given a new "branch
number" so it can be distinguished from the future-time xlog segments
that you chose not to apply. I don't recall what the DB2 terminology
was exactly --- not "branch number" I don't think --- but anyway the
idea is that when you restart the database after an incomplete recovery,
you are now in a sort of parallel universe that has its own history
after the branch point (PITR stop point). You need to be able to
distinguish archived log segments of this parallel universe from those
of previous and subsequent incarnations. I'm not sure whether Vadim
intended our StartUpID to serve this purpose, but it could perhaps be
used that way, if we reflected it in the WAL file names.Some more thoughts...focusing on the what do we do after we've finished
recovering. The objectives, as I see them, are to put the system into a
state, that preserves these features:
1. we never overwrite files, in case we want to re-run recovery
2. we never write files that MIGHT have been written previously
3. we need to ensure that any xlog records skipped at admins request (in
PITR mode) are never in a position to be re-applied to this timeline.
4. ensure we can re-recover, if we need to, without further problemsTom's concept above, I'm going to call timelines. A timeline is the
sequence of logs created by the execution of a server. If you recover
the database, you create a new timeline. [This is because, if you've
invoked PITR you absolutely definitely want log records written to, say,
xlog15 to be different to those that were written to xlog15 in a
previous timeline that you have chosen not to reapply.]Objective (1) is complex.
When we are restoring, we always start with archived copies of the xlog,
to make sure we don't finish too soon. We roll forward until we either
reach PITR stop point, or we hit end of archived logs. If we hit end of
logs on archive, then we switch to a local copy, if one exists that is
higher than those, we carry on rolling forward until either we reach
PITR stop point, or we hit end of that log. (Hopefully, there isn't more
than one local xlog higher than the archive, but its possible).
If we are rolling forward on local copies, then they are our only
copies. We'd really like to archive them ASAP, but the archiver's not
running yet - we don't want to force that situation in case the archive
device (say a tape) is the one being used to recover right now. So we
write an archive_status of .ready for that file, ensuring that the
checkpoint won't remove it until it gets copied to archive, whenever
that starts working again. Objective (1) met.When we have finished recovering we:
- create a new xlog at the start of a new ++timeline
- copy the last applied xlog record to it as the first record
- set the record pointer so that it matches
That way, when we come up and begin running, we never overwrite files
that might have been written previously. Objective (2) met.
We do the other stuff because recovery finishes up by pointing to the
last applied record...which is what was causing all of this extra work
in the first place.At this point, we also reset the secondary checkpoint record, so that
should recovery be required again before next checkpoint AND the
shutdown checkpoint record written after recovery completes is
wrong/damaged, the recovery will not autorewind back past the PITR stop
point and attempt to recover the records we have just tried so hard to
reverse/ignore. Objective (3) met. (Clearly, that situation seems
unlikely, but I feel we must deal with it...a newly restored system is
actually very fragile, so a crash again within 3 minutes or so is very
commonplace, as far as these things go).Should we need to re-recover, we can do so because the new timeline
xlogs are further forward than the old timeline, so never get seen by
any processes (all of which look backwards). Re-recovery is possible
without problems, if required. This means you're a lot safer from some
of the mistakes you might of made, such as deciding you need to go into
recovery, then realising it wasn't required (or some other painful
flapping as goes on in computer rooms at 3am).How do we implement timelines?
The main presumption in the code is that xlogs are sequential. That has
two effects:
1. during recovery, we try to open the "next" xlog by adding one to the
numbers and then looking for that file
2. during checkpoint, we look for filenames less than the current
checkpoint marker
Creating a timeline by adding a larger number to LogId allows us to
prevent (1) from working, yet without breaking (2).
Well, Tom does seem to have something with regard to StartUpIds. I feel
it is easier to force a new timeline by adding a very large number to
the LogId IF, and only if, we have performed an archive recovery. That
way, we do not change at all the behaviour of the system for people that
choose not to implement archive_mode.Should we implement timelines?
Yes, I think we should. I've already hit the problems that timelines
solve in my testing and so that means they'll be hit when you don't need
the hassle.
I'm still wrestling with the cleanup-after-stopping-at-point-in-time
code and have some important conclusions.
Moving forward on a timeline is somewhat tricky for xlogs, as shown
above,...but...
My earlier treatment seems to have neglected to include the clog also.
If we stop before end of log, then we also have potentially many (though
presumably at least one) committed transactions that we do not want to
be told about ever again.
The starting a new timeline thought works for xlogs, but not for clogs.
No matter how far you go into the future, there is a small (yet
vanishing) possibility that there is a yet undiscovered committed
transaction in the future. (Because transactions are ordered in the clog
because xids are assigned sequentially at txn start, but not ordered in
the xlog where they are recorded in the order the txns complete).
Please tell me that we can ignore the state of the clog, but I think we
can't - if a new xid re-used a previous xid that had committed AND then
we crashed...we would have inconsistent data. Unless we physically write
zeros to clog for every begin transaction after a recovery...err, no...
The only recourse that I can see is to "truncate the future" of the
clog, which would mean:
- keeping track of the highest xid provided by any record from the xlog,
in xact.c, xact_redo
- using that xid to write zeros to the clog after this point until EOF
- drop any clog segment files past the new "high" segment
- no idea how that effects NT or not...
The timeline idea works for xlog because once we've applied the xlog
records and checkpointed, we can discard the xlog records. We can't do
that with clog records (unless we followed recovery with a vacuum full -
which is possible, but not hugely desirable) - though this doesn't solve
the issue that xlog records don't have any prescribed position in the
file, clog records do.
Right now, I don't know where to start with the clog code and the
opportunity for code-overlap with NT seems way high. These problems can
be conquered, given time and "given enough eyeballs".
I'm all ears for some bright ideas...but I'm getting pretty wary that we
may introduce some unintended features if we try to get this stabilised
within two weeks. My current conclusion is: lets commit archive recovery
in this release, then wait until next dot release for full recovery
target features. We've hit all the features which were a priority and
the fundamental architecture is there, so i think it is time to be happy
with what we've got, for now.
Comments, please....remembering that I'd love it if I've missed
something that simplifies the task. Fire away.
Best regards, Simon Riggs
The starting a new timeline thought works for xlogs, but not for clogs.
No matter how far you go into the future, there is a small (yet
vanishing) possibility that there is a yet undiscovered committed
transaction in the future. (Because transactions are ordered in the clog
because xids are assigned sequentially at txn start, but not ordered in
the xlog where they are recorded in the order the txns complete).
Won't ExtendCLOG take care of this with ZeroCLOGPage ? Else the same problem
would arise at xid wraparound, no ?
Andreas
Import Notes
Resolved by subject fallback
On Tue, 2004-07-13 at 13:18, Zeugswetter Andreas SB SD wrote:
The starting a new timeline thought works for xlogs, but not for clogs.
No matter how far you go into the future, there is a small (yet
vanishing) possibility that there is a yet undiscovered committed
transaction in the future. (Because transactions are ordered in the clog
because xids are assigned sequentially at txn start, but not ordered in
the xlog where they are recorded in the order the txns complete).Won't ExtendCLOG take care of this with ZeroCLOGPage ? Else the same problem
would arise at xid wraparound, no ?
Sounds like a good start...
When PITR ends, we need to stop mid-way through a file. Does that handle
that situation?
Simon Riggs
Simon Riggs <simon@2ndquadrant.com> writes:
Please tell me that we can ignore the state of the clog,
We can.
The reason that keeping track of timelines is interesting for xlog is
simply to take pity on the poor DBA who needs to distinguish the various
archived xlog files he's got laying about, and so that we can detect
errors like supplying inconsistent sets of xlog segments during restore.
This does not apply to clog because it's not archived. It's no more
than a data file. If you think you have trouble recreating clog then
you have the same issues recreating data files.
regards, tom lane
On Tue, 2004-07-13 at 15:29, Tom Lane wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
Please tell me that we can ignore the state of the clog,
We can.
In general, you are of course correct.
The reason that keeping track of timelines is interesting for xlog is
simply to take pity on the poor DBA who needs to distinguish the various
archived xlog files he's got laying about, and so that we can detect
errors like supplying inconsistent sets of xlog segments during restore.This does not apply to clog because it's not archived. It's no more
than a data file. If you think you have trouble recreating clog then
you have the same issues recreating data files.
I'm getting carried away with the improbable....but this is the rather
strange, but possible scenario I foresee:
A sequence of times...
1. We start archiving xlogs
2. We take a checkpoint
3. we commit an important transaction
4. We take a backup
5. We take a checkpoint
As stands currently, when we restore the backup, controlfile says that
last checkpoint was at 2, so we rollforward from 2 THRU 4 and continue
on past 5 until end of logs. Normally, end of logs isn't until after
4...
When we specify a recovery target, it is possible to specify the
rollforward to complete just before point 3. So we use the backup taken
at 4 to rollforward to a point in the past (from the backups
perspective). The backup taken at 4 may now have data and clog records
written by bgwriter.
Given that time between checkpoints is likely to be longer than
previously was the case...this becomes a non-zero situation.
I was trying to solve this problem head on, but the best way is to make
sure we never allow ourselves such a muddled situation:
ISTM the way to avoid this is to insist that we always rollforward
through at least one checkpoint to guarantee that this will not occur.
...then I can forget I ever mentioned the ****** clog again.
I'm ignoring this issue for now....whether it exists or not!
Best Regards, Simon Riggs