PITR, checkpoint, and local relations
As per earlier discussion, I'm working on the hot backup issues as part
of the PITR support. While I was looking at the buffer manager and the
relcache/MyDb issues to figure out the best way to work this, it
occurred to me that PITR will introduce a big problem with the way we
handle local relations.
The basic problem is that local relations (rd_myxactonly == true) are
not part of a checkpoint, so there is no way to get a lower bound on the
starting LSN needed to recover a local relation. In the past this did
not matter, because either the local file would be (effectively)
discarded during recovery because it had not yet become visible, or the
file would be flushed before the transaction creating it made it
visible. Now this is a problem.
So I need a decision from the core team on what to do about the local
buffer manager. My preference would be to forget about the local buffer
manager entirely, or if not that then to allow it only for _true_
temporary data. The only alternative I can devise is to create some way
for all other backends to participate in a checkpoint, perhaps using a
signal. I'm not sure this can be done safely.
Anyway, I'm glad the tuplesort stuff doesn't try to use relation files
:-)
Can the core team let me know if this is acceptable, and whether I
should move ahead with changes to the buffer manager (and some other
stuff) needed to avoid special treatment of rd_myxactonly relations?
Also to Richard: have you guys at multera dealt with this issue already?
Is there some way around this that I'm missing?
Regards,
John Nield
Just as an example of this problem, imagine the following sequence:
1) Transaction TX1 creates a local relation LR1 which will eventually
become a globally visible table. Tuples are inserted into the local
relation, and logged to the WAL file. Some tuples remain in the local
buffer cache and are not yet written out, although they are logged. TX1
is still in progress.
2) Backup starts, and checkpoint is called to get a minimum starting LSN
(MINLSN) for the backed-up files. Only the global buffers are flushed.
3) Backup process copies LR1 into the backup directory. (postulate some
way of coordinating with the local buffer manager, a problem I have not
solved).
4) TX1 commits and flushes its local buffers. A dirty buffer exists
whose LSN is before MINLSN. LR1 becomes globally visible.
5) Backup finishes copying all the files, including the local relations,
and then flushes the log. The log files between MINLSN and the current
LSN are copied to the backup directory, and backup is done.
6) Sometime later, a system administrator restores the backup and plays
the logs forward starting at MINLSN. LR1 will be corrupt, because some
of the log entries required for its restoration will be before MINLSN.
This corruption will not be detected until something goes wrong.
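To make the hole concrete, the six steps above can be reduced to a toy model (all names and structures here are invented for illustration; the real WAL and buffer managers are in C, so this is only a sketch of the failure):

```python
# Toy model of the sequence above: a local-relation write is WAL-logged,
# but its dirty page sits in a local buffer the checkpoint cannot see,
# so MINLSN ends up *after* the only log record for that page.

wal = []                 # write-ahead log: (lsn, page_no, value)
next_lsn = 0

def log_write(page_no, value):
    global next_lsn
    next_lsn += 1
    wal.append((next_lsn, page_no, value))
    return next_lsn

disk = {}                # on-disk state of LR1
local_buffers = {}       # TX1's local buffer cache

# 1) TX1 inserts into LR1: logged, but the page stays local and dirty.
lsn = log_write(0, "tuple-A")
local_buffers[0] = ("tuple-A", lsn)

# 2) Checkpoint for backup flushes only *shared* buffers; MINLSN is
#    taken with no knowledge of the dirty local page.
minlsn = next_lsn + 1

# 3) Backup copies LR1 as it exists on disk (page 0 not there yet).
backup_copy = dict(disk)

# 4) TX1 commits and flushes its local buffers -- too late for the copy.
for page_no, (value, page_lsn) in local_buffers.items():
    disk[page_no] = value

# 6) Restore the backup and replay WAL starting at MINLSN.
restored = dict(backup_copy)
for rec_lsn, page_no, value in wal:
    if rec_lsn >= minlsn:
        restored[page_no] = value

print(restored.get(0))   # None: the record for page 0 predates MINLSN
```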
BTW: The problem doesn't only happen with backup! It occurs at every
checkpoint as well; I just missed it until I started working on the hot
backup issue.
--
J. R. Nield
jrnield@usol.com
J. R. needs comments on this. PITR has problems because local relations
aren't logged to WAL. Suggestions?
---------------------------------------------------------------------------
J. R. Nield wrote:
As per earlier discussion, I'm working on the hot backup issues as part
of the PITR support. While I was looking at the buffer manager and the
relcache/MyDb issues to figure out the best way to work this, it
occurred to me that PITR will introduce a big problem with the way we
handle local relations.
The basic problem is that local relations (rd_myxactonly == true) are
not part of a checkpoint, so there is no way to get a lower bound on the
starting LSN needed to recover a local relation. In the past this did
not matter, because either the local file would be (effectively)
discarded during recovery because it had not yet become visible, or the
file would be flushed before the transaction creating it made it
visible. Now this is a problem.
So I need a decision from the core team on what to do about the local
buffer manager. My preference would be to forget about the local buffer
manager entirely, or if not that then to allow it only for _true_
temporary data. The only alternative I can devise is to create some way
for all other backends to participate in a checkpoint, perhaps using a
signal. I'm not sure this can be done safely.
Anyway, I'm glad the tuplesort stuff doesn't try to use relation files
:-)
Can the core team let me know if this is acceptable, and whether I
should move ahead with changes to the buffer manager (and some other
stuff) needed to avoid special treatment of rd_myxactonly relations?
Also to Richard: have you guys at multera dealt with this issue already?
Is there some way around this that I'm missing?
Regards,
John Nield
Just as an example of this problem, imagine the following sequence:
1) Transaction TX1 creates a local relation LR1 which will eventually
become a globally visible table. Tuples are inserted into the local
relation, and logged to the WAL file. Some tuples remain in the local
buffer cache and are not yet written out, although they are logged. TX1
is still in progress.
2) Backup starts, and checkpoint is called to get a minimum starting LSN
(MINLSN) for the backed-up files. Only the global buffers are flushed.
3) Backup process copies LR1 into the backup directory. (postulate some
way of coordinating with the local buffer manager, a problem I have not
solved).
4) TX1 commits and flushes its local buffers. A dirty buffer exists
whose LSN is before MINLSN. LR1 becomes globally visible.
5) Backup finishes copying all the files, including the local relations,
and then flushes the log. The log files between MINLSN and the current
LSN are copied to the backup directory, and backup is done.
6) Sometime later, a system administrator restores the backup and plays
the logs forward starting at MINLSN. LR1 will be corrupt, because some
of the log entries required for its restoration will be before MINLSN.
This corruption will not be detected until something goes wrong.
BTW: The problem doesn't only happen with backup! It occurs at every
checkpoint as well, I just missed it until I started working on the hot
backup issue.
--
J. R. Nield
jrnield@usol.com
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
On Thu, 2002-08-01 at 17:14, Bruce Momjian wrote:
J.R needs comments on this. PITR has problems because local relations
aren't logged to WAL. Suggestions?
I'm sorry if it wasn't clear. The issue is not that local relations
aren't logged to WAL; they are. The issue is that you can't checkpoint
them. That means if you need a lower bound on the LSN to recover from,
then you either need to wait for all transactions using them to commit
and flush their local buffers, or there needs to be an async way to tell
them all to flush.
I am working on a way to do this with a signal, using holdoffs around
calls into the storage-manager and VFS layers to prevent re-entrant
calls. The local buffer manager is simple enough that it should be
possible to flush them from within a signal handler at most times, but
the VFS and storage manager are not safe to re-enter from a handler.
Does this sound like a good idea?
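A rough sketch of that holdoff pattern (illustrative Python only; every name here is invented, and the real holdoffs would wrap the smgr/VFS entry points in C):

```python
# Sketch: a flush request arriving via signal is deferred while we are
# inside storage-manager/VFS code, and serviced on the way out.

flushed = []             # record of flush actions, for demonstration
holdoff_depth = 0        # > 0 while inside smgr/VFS code
flush_pending = False

def flush_local_buffers():
    flushed.append("local buffers written")

def on_checkpoint_signal():
    """What the signal handler would do."""
    global flush_pending
    if holdoff_depth > 0:
        flush_pending = True         # unsafe to re-enter smgr/VFS: defer
    else:
        flush_local_buffers()        # safe: flush right away

def smgr_enter():                    # entering smgr/VFS code
    global holdoff_depth
    holdoff_depth += 1

def smgr_exit():                     # leaving smgr/VFS code
    global holdoff_depth, flush_pending
    holdoff_depth -= 1
    if holdoff_depth == 0 and flush_pending:
        flush_pending = False
        flush_local_buffers()        # service the deferred request

# Simulate a signal arriving mid-call:
smgr_enter()
on_checkpoint_signal()
assert flushed == []                 # deferred, not run in the handler
smgr_exit()
print(flushed)                       # ['local buffers written']
```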
--
J. R. Nield
jrnield@usol.com
"J. R. Nield" <jrnield@usol.com> writes:
I am working on a way to do this with a signal, using holdoffs around
calls into the storage-manager and VFS layers to prevent re-entrant
calls. The local buffer manager is simple enough that it should be
possible to flush them from within a signal handler at most times, but
the VFS and storage manager are not safe to re-enter from a handler.
Does this sound like a good idea?
No. What happened to "simple"?
Before I'd accept anything like that, I'd rip out the local buffer
manager and just do everything in the shared manager. I've never
seen any proof that the local manager buys any noticeable performance
gain anyway ... how many people really do anything much with a table
during its first transaction of existence?
regards, tom lane
Ok. This is what I wanted to hear, but I had assumed someone decided to
put it in for a reason, and I wasn't going to submit a patch to pull out
the local buffer manager without clearing it first.
The main area where it seems to get heavy use is during index builds,
and for 'CREATE TABLE AS SELECT...'.
So I will remove the local buffer manager as part of the PITR patch,
unless there is further objection.
On Fri, 2002-08-02 at 00:49, Tom Lane wrote:
"J. R. Nield" <jrnield@usol.com> writes:
I am working on a way to do this with a signal, using holdoffs around
calls into the storage-manager and VFS layers to prevent re-entrant
calls. The local buffer manager is simple enough that it should be
possible to flush them from within a signal handler at most times, but
the VFS and storage manager are not safe to re-enter from a handler.
Does this sound like a good idea?
No. What happened to "simple"?
Before I'd accept anything like that, I'd rip out the local buffer
manager and just do everything in the shared manager. I've never
seen any proof that the local manager buys any noticeable performance
gain anyway ... how many people really do anything much with a table
during its first transaction of existence?
regards, tom lane
--
J. R. Nield
jrnield@usol.com
"J. R. Nield" <jrnield@usol.com> writes:
Ok. This is what I wanted to hear, but I had assumed someone decided to
put it in for a reason, and I wasn't going to submit a patch to pull-out
the local buffer manager without clearing it first.
The main area where it seems to get heavy use is during index builds,
Yeah. I do not think it really saves any I/O: unless you abort your
index build, the data is eventually going to end up on disk anyway.
What it saves is contention for shared buffers (the overhead of
acquiring BufMgrLock, for example).
Just out of curiosity, though, what does it matter? On re-reading your
message I think you are dealing with a non problem, or at least the
wrong problem. Local relations do not need to be checkpointed, because
by definition they were created by a transaction that hasn't committed
yet. They must be, and are, checkpointed to disk before the transaction
commits; but up till that time, if you have a crash then the entire
relation should just go away.
That mechanism is there already --- perhaps it needs a few tweaks for
PITR but I do not see any need for cross-backend flush commands for
local relations.
regards, tom lane
On Fri, 2002-08-02 at 10:01, Tom Lane wrote:
Just out of curiosity, though, what does it matter? On re-reading your
message I think you are dealing with a non problem, or at least the
wrong problem. Local relations do not need to be checkpointed, because
by definition they were created by a transaction that hasn't committed
yet. They must be, and are, checkpointed to disk before the transaction
commits; but up till that time, if you have a crash then the entire
relation should just go away.
What happens when we have a local file that is created before the
backup, and it becomes global during the backup?
In order to copy this file, I either need:
1) A copy of all its blocks at the time backup started (or later), plus
all log records between then and the end of the backup.
OR
2) All the log records from the time the local file was created until
the end of the backup.
In the case of an idle uncommitted transaction that suddenly commits
during backup, case 2 might be very far back in the log file. In fact,
the log file might be archived to tape by then.
So I must do case 1, and checkpoint the local relations.
This brings up the question: why do I need to bother backing up files
that were local before the backup started, but became global during the
backup?
We already know that for the backup to be consistent after we restore
it, we must play the logs forward to the completion of the backup to
repair our "fuzzy copies" of the database files. Since the transaction
that makes the local-file into a global one has committed during our
backup, its log entries will be played forward as well.
What would happen if a transaction with a local relation commits during
backup, and there are log entries inserting the catalog tuples into
pg_class? Should I not apply those on restore? How do I know?
That mechanism is there already --- perhaps it needs a few tweaks for
PITR but I do not see any need for cross-backend flush commands for
local relations.
This problem is subtle, and I'm maybe having difficulty explaining it
properly. Do you understand the issue I'm raising? Have I made some kind
of blunder, so that this is really not a problem?
--
J. R. Nield
jrnield@usol.com
pg_copy does not handle "local relations" as you would suspect. To find
the tables and indexes to back up, the backend processing the "ALTER
SYSTEM BACKUP" statement reads the pg_class table. Any tables in the
process of coming into existence are of course not visible. If somehow
they were, then the backup would back up their contents, and any
changes still in private memory would be captured during crash recovery
on the copy of the database. So the question is: is it possible to read
the names of the "local relations" from the pg_class table even though
their creation has not yet been committed?
-regards
richt
-----Original Message-----
From: J. R. Nield [mailto:jrnield@usol.com]
Sent: Friday, August 02, 2002 12:27 PM
To: Tom Lane
Cc: Bruce Momjian; Richard Tucker; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations
On Fri, 2002-08-02 at 10:01, Tom Lane wrote:
Just out of curiosity, though, what does it matter? On re-reading your
message I think you are dealing with a non problem, or at least the
wrong problem. Local relations do not need to be checkpointed, because
by definition they were created by a transaction that hasn't committed
yet. They must be, and are, checkpointed to disk before the transaction
commits; but up till that time, if you have a crash then the entire
relation should just go away.
What happens when we have a local file that is created before the
backup, and it becomes global during the backup?
In order to copy this file, I either need:
1) A copy of all its blocks at the time backup started (or later), plus
all log records between then and the end of the backup.
OR
2) All the log records from the time the local file was created until
the end of the backup.
In the case of an idle uncommitted transaction that suddenly commits
during backup, case 2 might be very far back in the log file. In fact,
the log file might be archived to tape by then.
So I must do case 1, and checkpoint the local relations.
This brings up the question: why do I need to bother backing up files
that were local before the backup started, but became global during the
backup.
We already know that for the backup to be consistent after we restore
it, we must play the logs forward to the completion of the backup to
repair our "fuzzy copies" of the database files. Since the transaction
that makes the local-file into a global one has committed during our
backup, its log entries will be played forward as well.
What would happen if a transaction with a local relation commits during
backup, and there are log entries inserting the catalog tuples into
pg_class. Should I not apply those on restore? How do I know?
That mechanism is there already --- perhaps it needs a few tweaks for
PITR but I do not see any need for cross-backend flush commands for
local relations.
This problem is subtle, and I'm maybe having difficulty explaining it
properly. Do you understand the issue I'm raising? Have I made some kind
of blunder, so that this is really not a problem?
--
J. R. Nield
jrnield@usol.com
"J. R. Nield" <jrnield@usol.com> writes:
What would happen if a transaction with a local relation commits during
backup, and there are log entries inserting the catalog tuples into
pg_class. Should I not apply those on restore? How do I know?
This is certainly a non-problem. You see a WAL log entry, you apply it.
Whether the transaction actually commits later is not your concern (at
least not at that point).
This problem is subtle, and I'm maybe having difficulty explaining it
properly. Do you understand the issue I'm raising? Have I made some kind
of blunder, so that this is really not a problem?
After thinking more, I think you are right, but you didn't explain it
well. The problem is not really relevant to PITR at all, but is a hole
in the initial design of WAL. Consider
transaction starts
transaction creates local rel
transaction writes in local rel...
CHECKPOINT
transaction writes in local rel...
CHECKPOINT
transaction writes in local rel...
transaction flushes local rel pages to disk
transaction commits
system crash
We'll try to replay the log from the latest checkpoint. This works only
if all the local-rel page flushes actually made it to disk, otherwise
the updates of the local rel that happened before the last checkpoint
may be lost. (I think there is still an fsync in local-rel commit to
ensure the flushes happen, but it's sure messy to do it that way.)
We could possibly fix this by logging the local-rel-flush page writes
themselves in the WAL log, but that'd probably more than ruin the
efficiency advantage of the local bufmgr. So I'm back to the idea
that removing it is the way to go. Certainly that would provide
nontrivial simplifications in a number of places (no tests on local vs
global buffer anymore, no special cases for local rel commit, etc).
Might be useful to temporarily dike it out and see what the penalty
for building a large index is.
regards, tom lane
On Fri, 2002-08-02 at 13:50, Richard Tucker wrote:
pg_copy does not handle "local relations" as you would suspect. To find the
tables and indexes to backup the backend in processing the "ALTER SYSTEM
BACKUP" statement reads the pg_class table. Any tables in the process of
coming into existence of course are not visible. If somehow they were then
the backup would backup up their contents. Any in private memory changes
would be captured during crash recovery on the copy of the database. So the
question is: is it possible to read the names of the "local relations" from
the pg_class table even though there creation has not yet been committed?
-regards
richt
No, not really. At least not a consistent view.
The way to do this is to use the filesystem to discover the relfilenodes,
and there are a couple of ways to deal with the problem of files being
pulled out from under you, but you have to be careful about what the
buffer manager does when a file gets dropped.
The predicate for files we MUST (fuzzy) copy is:
File exists at start of backup && File exists at end of backup
Any other file, while it may be copied, doesn't need to be in the backup
because either it will be created and rebuilt during play-forward
recovery, or it will be deleted during play-forward recovery, or both,
assuming those operations are logged. They really must be logged to do
what we want to do.
Also, you can't use the normal relation_open stuff, because local
relations will not have a catalog entry, and it looks like there are
catcache/sinval issues that I haven't completely covered. So you've got
to do 'blind reads' through the buffer manager, which involves a minor
extension to the buffer manager to support this if local relations go
through the shared buffers, or coordinating with the local buffer
manager if they continue to work as they do now, which involves major
changes.
We also have to checkpoint at the start, and flush the log at the end.
--
J. R. Nield
jrnield@usol.com
"J. R. Nield" <jrnield@usol.com> writes:
The predicate for files we MUST (fuzzy) copy is:
File exists at start of backup && File exists at end of backup
Right, which seems to me to negate all these claims about needing a
(horribly messy) way to read uncommitted system catalog entries, do
blind reads, etc. What's wrong with just exec'ing tar after having
done a checkpoint?
(In particular, I *strongly* object to using the buffer manager at all
for reading files for backup. That's pretty much guaranteed to blow out
buffer cache. Use plain OS-level file reads. An OS directory search
will do fine for finding what you need to read, too.)
regards, tom lane
On Fri, 2002-08-02 at 16:01, Tom Lane wrote:
"J. R. Nield" <jrnield@usol.com> writes:
The predicate for files we MUST (fuzzy) copy is:
File exists at start of backup && File exists at end of backup
Right, which seems to me to negate all these claims about needing a
(horribly messy) way to read uncommitted system catalog entries, do
blind reads, etc. What's wrong with just exec'ing tar after having
done a checkpoint?
There is no need to read uncommitted system catalog entries. Just take a
snapshot of the directory to get the OIDs. You don't care whether they
get deleted before you get to them, because the log will take care of
that.
(In particular, I *strongly* object to using the buffer manager at all
for reading files for backup. That's pretty much guaranteed to blow out
buffer cache. Use plain OS-level file reads. An OS directory search
will do fine for finding what you need to read, too.)
How do you get atomic block copies otherwise?
--
J. R. Nield
jrnield@usol.com
The predicate for files we MUST (fuzzy) copy is:
File exists at start of backup && File exists at end of backup
Right, which seems to me to negate all these claims about needing a
(horribly messy) way to read uncommitted system catalog entries, do
blind reads, etc. What's wrong with just exec'ing tar after having
done a checkpoint?
Right.
It looks like insert/update/etc ops over local relations are
WAL-logged, and it's Ok (we have to do this).
So, we only have to use shared buffer pool for local (but probably
not for temporary) relations to close this issue, yes? I personally
don't see any performance issues if we do this.
Vadim
(In particular, I *strongly* object to using the buffer manager at all
for reading files for backup. That's pretty much guaranteed to blow out
buffer cache. Use plain OS-level file reads. An OS directory search
will do fine for finding what you need to read, too.)
How do you get atomic block copies otherwise?
You don't need it.
As long as the whole block is saved in the log on its first change
after the checkpoint (the one you made before the backup).
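The rule reduces to a comparison against the checkpoint's redo location (a sketch with invented names; this is not the actual XLogInsert code):

```python
# If a page has not been modified since the last checkpoint, its first
# change afterwards logs the whole page image, so a torn on-disk copy
# can always be reconstructed from the log during recovery.

def needs_full_page_image(page_lsn, redo_lsn):
    # Page last touched at or before the redo point => this is its first
    # change since the checkpoint => log the full block, not the delta.
    return page_lsn <= redo_lsn

redo_lsn = 1000    # redo pointer of the checkpoint taken before backup
print(needs_full_page_image(900, redo_lsn))    # True: image the block
print(needs_full_page_image(1500, redo_lsn))   # False: already imaged
```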
Vadim
On Fri, 2002-08-02 at 16:59, Mikheev, Vadim wrote:
You don't need it.
As long as whole block is saved in log on first after
checkpoint (you made before backup) change to block.
I thought half the point of PITR was to be able to turn off pre-image
logging so you can trade potential recovery time for speed without fear
of data-loss. Didn't we have this discussion before?
How is this any worse than a table scan?
--
J. R. Nield
jrnield@usol.com
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, August 02, 2002 4:02 PM
To: J. R. Nield
Cc: Richard Tucker; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations
"J. R. Nield" <jrnield@usol.com> writes:
The predicate for files we MUST (fuzzy) copy is:
File exists at start of backup && File exists at end of backup
Right, which seems to me to negate all these claims about needing a
(horribly messy) way to read uncommitted system catalog entries, do
blind reads, etc. What's wrong with just exec'ing tar after having
done a checkpoint?
You do need to make sure to back up the pg_xlog directory last, and you
need to make sure no WAL file gets reused while backing up everything
else.
(In particular, I *strongly* object to using the buffer manager at all
for reading files for backup. That's pretty much guaranteed to blow out
buffer cache. Use plain OS-level file reads. An OS directory search
will do fine for finding what you need to read, too.)
regards, tom lane
"J. R. Nield" <jrnield@usol.com> writes:
(In particular, I *strongly* object to using the buffer manager at all
for reading files for backup. That's pretty much guaranteed to blow out
buffer cache. Use plain OS-level file reads. An OS directory search
will do fine for finding what you need to read, too.)
How do you get atomic block copies otherwise?
Eh? The kernel does that for you, as long as you're reading the
same-size blocks that the backends are writing, no?
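An OS-level read loop along these lines would be the alternative to going through the buffer manager (an illustrative sketch; whether each read of a concurrently-written block really is atomic is exactly the point under debate):

```python
# Copy a relation file block-by-block with plain OS reads, bypassing
# the buffer manager entirely. BLCKSZ must match the backend's block
# size for the same-size-read argument to apply.
import os

BLCKSZ = 8192

def copy_blocks(src_path, dst_path):
    fd_in = os.open(src_path, os.O_RDONLY)
    fd_out = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        while True:
            block = os.read(fd_in, BLCKSZ)   # one read() per block
            if not block:
                break
            os.write(fd_out, block)
    finally:
        os.close(fd_in)
        os.close(fd_out)
```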
regards, tom lane
"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:
So, we only have to use shared buffer pool for local (but probably
not for temporary) relations to close this issue, yes? I personally
don't see any performance issues if we do this.
Hmm. Temporary relations are a whole different story.
It would be nice if updates on temp relations never got WAL-logged at
all, but I'm not sure how feasible that is. Right now we don't really
distinguish temp relations from ordinary ones --- in particular, they
have pg_class entries, which surely will get WAL-logged even if we
persuade the buffer manager not to do it for the data pages. Is that
a problem? Not sure.
regards, tom lane
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Tom Lane
Sent: Friday, August 02, 2002 5:25 PM
To: J. R. Nield
Cc: Richard Tucker; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations
"J. R. Nield" <jrnield@usol.com> writes:
(In particular, I *strongly* object to using the buffer manager at all
for reading files for backup. That's pretty much guaranteed to blow out
buffer cache. Use plain OS-level file reads. An OS directory search
will do fine for finding what you need to read, too.)
How do you get atomic block copies otherwise?
Eh? The kernel does that for you, as long as you're reading the
same-size blocks that the backends are writing, no?
If the OS block size is 4k and the PostgreSQL block size is 8k, do we
know for sure that the write call does not break this into two 4k writes
to the OS buffer cache?
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of J. R. Nield
Sent: Friday, August 02, 2002 5:12 PM
To: Mikheev, Vadim
Cc: Tom Lane; Richard Tucker; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations
On Fri, 2002-08-02 at 16:59, Mikheev, Vadim wrote:
You don't need it.
As long as whole block is saved in log on first after
checkpoint (you made before backup) change to block.
I thought half the point of PITR was to be able to turn off pre-image
logging so you can trade potential recovery time for speed without fear
of data-loss. Didn't we have this discussion before?
Suppose you can turn PostgreSQL's atomic write off and on on the fly,
which means controlling whether XLogInsert writes a copy of the block
into the log file upon its first modification after a checkpoint.
So ALTER SYSTEM BEGIN BACKUP would turn on atomic write and then
checkpoint the database.
While the OS copy of the data files is going on, atomic write would be
enabled, so any read of a partial write would be fixed up by the usual
crash recovery mechanism.
How is this any worse than a table scan?
--
J. R. Nield
jrnield@usol.com