PITR, checkpoint, and local relations

Started by J. R. Nield, over 23 years ago, 61 messages
#1J. R. Nield
jrnield@usol.com

As per earlier discussion, I'm working on the hot backup issues as part
of the PITR support. While I was looking at the buffer manager and the
relcache/MyDb issues to figure out the best way to work this, it
occurred to me that PITR will introduce a big problem with the way we
handle local relations.

The basic problem is that local relations (rd_myxactonly == true) are
not part of a checkpoint, so there is no way to get a lower bound on the
starting LSN needed to recover a local relation. In the past this did
not matter, because either the local file would be (effectively)
discarded during recovery because it had not yet become visible, or the
file would be flushed before the transaction creating it made it
visible. Now this is a problem.

So I need a decision from the core team on what to do about the local
buffer manager. My preference would be to forget about the local buffer
manager entirely, or if not that then to allow it only for _true_
temporary data. The only alternative I can devise is to create some way
for all other backends to participate in a checkpoint, perhaps using a
signal. I'm not sure this can be done safely.

Anyway, I'm glad the tuplesort stuff doesn't try to use relation files
:-)

Can the core team let me know if this is acceptable, and whether I
should move ahead with changes to the buffer manager (and some other
stuff) needed to avoid special treatment of rd_myxactonly relations?

Also to Richard: have you guys at Multera dealt with this issue already?
Is there some way around this that I'm missing?

Regards,

John Nield

Just as an example of this problem, imagine the following sequence:

1) Transaction TX1 creates a local relation LR1 which will eventually
become a globally visible table. Tuples are inserted into the local
relation, and logged to the WAL file. Some tuples remain in the local
buffer cache and are not yet written out, although they are logged. TX1
is still in progress.

2) Backup starts, and checkpoint is called to get a minimum starting LSN
(MINLSN) for the backed-up files. Only the global buffers are flushed.

3) Backup process copies LR1 into the backup directory. (postulate some
way of coordinating with the local buffer manager, a problem I have not
solved).

4) TX1 commits and flushes its local buffers. A dirty buffer exists
whose LSN is before MINLSN. LR1 becomes globally visible.

5) Backup finishes copying all the files, including the local relations,
and then flushes the log. The log files between MINLSN and the current
LSN are copied to the backup directory, and backup is done.

6) Sometime later, a system administrator restores the backup and plays
the logs forward starting at MINLSN. LR1 will be corrupt, because some
of the log entries required for its restoration will be before MINLSN.
This corruption will not be detected until something goes wrong.

BTW: The problem doesn't only happen with backup! It occurs at every
checkpoint as well; I just missed it until I started working on the hot
backup issue.
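
The six steps above can be sketched as a toy simulation. This is only an illustrative model, not PostgreSQL code: `WalRecord`, `redo`, the page numbers, and the LSN values are all hypothetical, and a "page" is reduced to a single string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WalRecord:
    lsn: int      # log sequence number of the change
    page: int     # page of LR1 that was changed
    value: str    # new page contents (toy stand-in for a tuple insert)


def redo(pages, wal, start_lsn):
    """Replay every record with lsn >= start_lsn onto a copy of pages."""
    restored = dict(pages)
    for rec in sorted(wal, key=lambda r: r.lsn):
        if rec.lsn >= start_lsn:
            restored[rec.page] = rec.value
    return restored


# Step 1: TX1 changes two pages of LR1. Both changes are logged, but
# page 1 is still dirty in TX1's local buffers and not on disk.
wal = [WalRecord(10, 0, "a"), WalRecord(20, 1, "b")]
disk = {0: "a"}

# Step 2: the checkpoint flushes only global buffers and yields MINLSN.
min_lsn = 30

# Step 3: the backup copies LR1 as it currently is on disk.
backup_copy = dict(disk)

# Step 4: TX1 logs one more change, commits, and flushes its local buffers.
wal.append(WalRecord(40, 2, "c"))
disk.update({1: "b", 2: "c"})

# Step 6: restore the backup and play the log forward from MINLSN.
restored = redo(backup_copy, wal, min_lsn)
missing = sorted(set(disk) - set(restored))

print(restored)  # {0: 'a', 2: 'c'}
print(missing)   # [1]
```

Page 1 is lost because its only log record (LSN 20) lies before MINLSN and the page was never in the copied file. Replaying from LSN 10 instead would reproduce the live relation, which is why a lower bound on the starting LSN matters.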

--
J. R. Nield
jrnield@usol.com

#2Bruce Momjian
pgman@candle.pha.pa.us
In reply to: J. R. Nield (#1)
Re: PITR, checkpoint, and local relations

J.R. needs comments on this. PITR has problems because local relations
aren't logged to WAL. Suggestions?


-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#3J. R. Nield
jrnield@usol.com
In reply to: Bruce Momjian (#2)
Re: PITR, checkpoint, and local relations

On Thu, 2002-08-01 at 17:14, Bruce Momjian wrote:

J.R. needs comments on this. PITR has problems because local relations
aren't logged to WAL. Suggestions?

I'm sorry if it wasn't clear. The issue is not that local relations
aren't logged to WAL; they are. The issue is that you can't checkpoint
them. That means if you need a lower bound on the LSN to recover from,
then you either need to wait for all transactions using them to commit
and flush their local buffers, or there needs to be an async way to tell
them all to flush.

I am working on a way to do this with a signal, using holdoffs around
calls into the storage-manager and VFS layers to prevent re-entrant
calls. The local buffer manager is simple enough that it should be
possible to flush them from within a signal handler at most times, but
the VFS and storage manager are not safe to re-enter from a handler.

Does this sound like a good idea?

--
J. R. Nield
jrnield@usol.com

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: J. R. Nield (#3)
Re: PITR, checkpoint, and local relations

"J. R. Nield" <jrnield@usol.com> writes:

I am working on a way to do this with a signal, using holdoffs around
calls into the storage-manager and VFS layers to prevent re-entrant
calls. The local buffer manager is simple enough that it should be
possible to flush them from within a signal handler at most times, but
the VFS and storage manager are not safe to re-enter from a handler.

Does this sound like a good idea?

No. What happened to "simple"?

Before I'd accept anything like that, I'd rip out the local buffer
manager and just do everything in the shared manager. I've never
seen any proof that the local manager buys any noticeable performance
gain anyway ... how many people really do anything much with a table
during its first transaction of existence?

regards, tom lane

#5J. R. Nield
jrnield@usol.com
In reply to: Tom Lane (#4)
Re: PITR, checkpoint, and local relations

Ok. This is what I wanted to hear, but I had assumed someone decided to
put it in for a reason, and I wasn't going to submit a patch to pull out
the local buffer manager without clearing it first.

The main area where it seems to get heavy use is during index builds,
and for 'CREATE TABLE AS SELECT...'.

So I will remove the local buffer manager as part of the PITR patch,
unless there is further objection.

--
J. R. Nield
jrnield@usol.com

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: J. R. Nield (#5)
Re: PITR, checkpoint, and local relations

"J. R. Nield" <jrnield@usol.com> writes:

Ok. This is what I wanted to hear, but I had assumed someone decided to
put it in for a reason, and I wasn't going to submit a patch to pull-out
the local buffer manager without clearing it first.

The main area where it seems to get heavy use is during index builds,

Yeah. I do not think it really saves any I/O: unless you abort your
index build, the data is eventually going to end up on disk anyway.
What it saves is contention for shared buffers (the overhead of
acquiring BufMgrLock, for example).

Just out of curiosity, though, what does it matter? On re-reading your
message I think you are dealing with a non-problem, or at least the
wrong problem. Local relations do not need to be checkpointed, because
by definition they were created by a transaction that hasn't committed
yet. They must be, and are, checkpointed to disk before the transaction
commits; but up till that time, if you have a crash then the entire
relation should just go away.

That mechanism is there already --- perhaps it needs a few tweaks for
PITR but I do not see any need for cross-backend flush commands for
local relations.

regards, tom lane

#7J. R. Nield
jrnield@usol.com
In reply to: Tom Lane (#6)
Re: PITR, checkpoint, and local relations

On Fri, 2002-08-02 at 10:01, Tom Lane wrote:

Just out of curiosity, though, what does it matter? On re-reading your
message I think you are dealing with a non problem, or at least the
wrong problem. Local relations do not need to be checkpointed, because
by definition they were created by a transaction that hasn't committed
yet. They must be, and are, checkpointed to disk before the transaction
commits; but up till that time, if you have a crash then the entire
relation should just go away.

What happens when we have a local file that is created before the
backup, and it becomes global during the backup?

In order to copy this file, I either need:

1) A copy of all its blocks at the time backup started (or later), plus
all log records between then and the end of the backup.

OR

2) All the log records from the time the local file was created until
the end of the backup.

In the case of an idle uncommitted transaction that suddenly commits
during backup, case 2 might be very far back in the log file. In fact,
the log file might be archived to tape by then.

So I must do case 1, and checkpoint the local relations.
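
The trade-off between the two cases can be put as a small calculation. This is a hedged sketch only: `required_wal_start` and its parameters are hypothetical names, and LSNs are plain integers.

```python
def required_wal_start(backup_start_lsn, local_rel_creation_lsns,
                       local_rels_checkpointed):
    """Return the oldest LSN whose log records the backup must keep.

    Case 1: local relations were checkpointed at backup start, so the
            log is needed only from the backup's starting LSN.
    Case 2: they were not, so the log is needed from the creation of
            the oldest local relation that may commit during backup.
    """
    if local_rels_checkpointed or not local_rel_creation_lsns:
        return backup_start_lsn
    return min(backup_start_lsn, min(local_rel_creation_lsns))


# An idle transaction created its local relation at LSN 50, long before
# the backup starts at LSN 1000.
print(required_wal_start(1000, [50, 700], local_rels_checkpointed=True))   # 1000
print(required_wal_start(1000, [50, 700], local_rels_checkpointed=False))  # 50
```

In case 2 the required starting point is bounded only by the age of the oldest open transaction, which is the concern about log files already archived to tape.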

This brings up the question: why do I need to bother backing up files
that were local before the backup started, but became global during the
backup?

We already know that for the backup to be consistent after we restore
it, we must play the logs forward to the completion of the backup to
repair our "fuzzy copies" of the database files. Since the transaction
that makes the local-file into a global one has committed during our
backup, its log entries will be played forward as well.

What would happen if a transaction with a local relation commits during
backup, and there are log entries inserting the catalog tuples into
pg_class? Should I not apply those on restore? How do I know?

That mechanism is there already --- perhaps it needs a few tweaks for
PITR but I do not see any need for cross-backend flush commands for
local relations.

This problem is subtle, and I'm maybe having difficulty explaining it
properly. Do you understand the issue I'm raising? Have I made some kind
of blunder, so that this is really not a problem?

--
J. R. Nield
jrnield@usol.com

#8Richard Tucker
richt@multera.com
In reply to: J. R. Nield (#7)
Re: PITR, checkpoint, and local relations

pg_copy does not handle "local relations" as you would suspect. To find the
tables and indexes to back up, the backend reads the pg_class table while
processing the "ALTER SYSTEM BACKUP" statement. Any tables in the process of
coming into existence are of course not visible. If somehow they were, then
the backup would back up their contents. Any changes held in private memory
would be captured during crash recovery on the copy of the database. So the
question is: is it possible to read the names of the "local relations" from
the pg_class table even though their creation has not yet been committed?
-regards
richt


#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: J. R. Nield (#7)
Re: PITR, checkpoint, and local relations

"J. R. Nield" <jrnield@usol.com> writes:

What would happen if a transaction with a local relation commits during
backup, and there are log entries inserting the catalog tuples into
pg_class. Should I not apply those on restore? How do I know?

This is certainly a non-problem. You see a WAL log entry, you apply it.
Whether the transaction actually commits later is not your concern (at
least not at that point).

This problem is subtle, and I'm maybe having difficulty explaining it
properly. Do you understand the issue I'm raising? Have I made some kind
of blunder, so that this is really not a problem?

After thinking more, I think you are right, but you didn't explain it
well. The problem is not really relevant to PITR at all, but is a hole
in the initial design of WAL. Consider

transaction starts
transaction creates local rel
transaction writes in local rel...
CHECKPOINT
transaction writes in local rel...
CHECKPOINT
transaction writes in local rel...
transaction flushes local rel pages to disk
transaction commits
system crash

We'll try to replay the log from the latest checkpoint. This works only
if all the local-rel page flushes actually made it to disk, otherwise
the updates of the local rel that happened before the last checkpoint
may be lost. (I think there is still an fsync in local-rel commit to
ensure the flushes happen, but it's sure messy to do it that way.)

We could possibly fix this by logging the local-rel-flush page writes
themselves in the WAL log, but that'd probably more than ruin the
efficiency advantage of the local bufmgr. So I'm back to the idea
that removing it is the way to go. Certainly that would provide
nontrivial simplifications in a number of places (no tests on local vs
global buffer anymore, no special cases for local rel commit, etc).

Might be useful to temporarily dike it out and see what the penalty
for building a large index is.

regards, tom lane

#10J. R. Nield
jrnield@usol.com
In reply to: Richard Tucker (#8)
Re: PITR, checkpoint, and local relations

On Fri, 2002-08-02 at 13:50, Richard Tucker wrote:

pg_copy does not handle "local relations" as you would suspect. To find the
tables and indexes to back up, the backend reads the pg_class table while
processing the "ALTER SYSTEM BACKUP" statement. Any tables in the process of
coming into existence are of course not visible. If somehow they were, then
the backup would back up their contents. Any changes held in private memory
would be captured during crash recovery on the copy of the database. So the
question is: is it possible to read the names of the "local relations" from
the pg_class table even though their creation has not yet been committed?
-regards
richt

No, not really. At least not a consistent view.

The way to do this is to use the filesystem to discover the relfilenodes,
and there are a couple of ways to deal with the problem of files being
pulled out from under you, but you have to be careful about what the
buffer manager does when a file gets dropped.

The predicate for files we MUST (fuzzy) copy is:
File exists at start of backup && File exists at end of backup

Any other file, while it may be copied, doesn't need to be in the backup
because either it will be created and rebuilt during play-forward
recovery, or it will be deleted during play-forward recovery, or both,
assuming those operations are logged. They really must be logged to do
what we want to do.
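
The predicate can be written down directly as a set intersection over two directory listings. This is a minimal sketch under assumptions: `snapshot` and `must_copy` are hypothetical helpers, and relation files are identified only by their file names.

```python
import os


def snapshot(directory):
    """Return the set of file names currently present in directory."""
    return set(os.listdir(directory))


def must_copy(at_start, at_end):
    """Files that exist at both the start and the end of the backup.

    Files only in at_start were dropped during the backup; files only in
    at_end were created during it. Both kinds are expected to be handled
    by play-forward recovery, provided create and drop are logged.
    """
    return at_start & at_end


at_start = {"16384", "16385", "16386"}
at_end = {"16385", "16386", "16390"}
print(sorted(must_copy(at_start, at_end)))  # ['16385', '16386']
```

In practice the end-of-backup listing is not known when copying begins, so a backup would copy everything in the starting snapshot and rely on the log for the rest.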

Also, you can't use the normal relation_open stuff, because local
relations will not have a catalog entry, and it looks like there are
catcache/sinval issues that I haven't completely covered. So you've got
to do 'blind reads' through the buffer manager, which involves a minor
extension to the buffer manager to support this if local relations go
through the shared buffers, or coordinating with the local buffer
manager if they continue to work as they do now, which involves major
changes.

We also have to checkpoint at the start, and flush the log at the end.
--
J. R. Nield
jrnield@usol.com

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: J. R. Nield (#10)
Re: PITR, checkpoint, and local relations

"J. R. Nield" <jrnield@usol.com> writes:

The predicate for files we MUST (fuzzy) copy is:
File exists at start of backup && File exists at end of backup

Right, which seems to me to negate all these claims about needing a
(horribly messy) way to read uncommitted system catalog entries, do
blind reads, etc. What's wrong with just exec'ing tar after having
done a checkpoint?

(In particular, I *strongly* object to using the buffer manager at all
for reading files for backup. That's pretty much guaranteed to blow out
buffer cache. Use plain OS-level file reads. An OS directory search
will do fine for finding what you need to read, too.)

regards, tom lane

#12J. R. Nield
jrnield@usol.com
In reply to: Tom Lane (#11)
Re: PITR, checkpoint, and local relations

On Fri, 2002-08-02 at 16:01, Tom Lane wrote:

"J. R. Nield" <jrnield@usol.com> writes:

The predicate for files we MUST (fuzzy) copy is:
File exists at start of backup && File exists at end of backup

Right, which seems to me to negate all these claims about needing a
(horribly messy) way to read uncommitted system catalog entries, do
blind reads, etc. What's wrong with just exec'ing tar after having
done a checkpoint?

There is no need to read uncommitted system catalog entries. Just take a
snapshot of the directory to get the OIDs. You don't care whether they
get deleted before you get to them, because the log will take care of
that.

(In particular, I *strongly* object to using the buffer manager at all
for reading files for backup. That's pretty much guaranteed to blow out
buffer cache. Use plain OS-level file reads. An OS directory search
will do fine for finding what you need to read, too.)

How do you get atomic block copies otherwise?

--
J. R. Nield
jrnield@usol.com

#13Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: J. R. Nield (#12)
Re: PITR, checkpoint, and local relations

The predicate for files we MUST (fuzzy) copy is:
File exists at start of backup && File exists at end of backup

Right, which seems to me to negate all these claims about needing a
(horribly messy) way to read uncommitted system catalog entries, do
blind reads, etc. What's wrong with just exec'ing tar after having
done a checkpoint?

Right.

It looks like insert/update/etc ops over local relations are
WAL-logged, and it's Ok (we have to do this).

So, we only have to use shared buffer pool for local (but probably
not for temporary) relations to close this issue, yes? I personally
don't see any performance issues if we do this.

Vadim

#14Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#13)
Re: PITR, checkpoint, and local relations

(In particular, I *strongly* object to using the buffer manager at all
for reading files for backup. That's pretty much guaranteed to blow out
buffer cache. Use plain OS-level file reads. An OS directory search
will do fine for finding what you need to read, too.)

How do you get atomic block copies otherwise?

You don't need it, as long as the whole block is saved in the log on the
first change to that block after the checkpoint (the one you made before
the backup).

Vadim

#15J. R. Nield
jrnield@usol.com
In reply to: Mikheev, Vadim (#14)
Re: PITR, checkpoint, and local relations

On Fri, 2002-08-02 at 16:59, Mikheev, Vadim wrote:

You don't need it.
As long as whole block is saved in log on first after
checkpoint (you made before backup) change to block.

I thought half the point of PITR was to be able to turn off pre-image
logging so you can trade potential recovery time for speed without fear
of data-loss. Didn't we have this discussion before?

How is this any worse than a table scan?

--
J. R. Nield
jrnield@usol.com

#16Richard Tucker
richt@multera.com
In reply to: Tom Lane (#11)
Re: PITR, checkpoint, and local relations

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, August 02, 2002 4:02 PM
To: J. R. Nield
Cc: Richard Tucker; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations

"J. R. Nield" <jrnield@usol.com> writes:

The predicate for files we MUST (fuzzy) copy is:
File exists at start of backup && File exists at end of backup

Right, which seems to me to negate all these claims about needing a
(horribly messy) way to read uncommitted system catalog entries, do
blind reads, etc. What's wrong with just exec'ing tar after having
done a checkpoint?

You do need to make sure to back up the pg_xlog directory last and you need
to make sure no wal file gets reused while backing up everything else.
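
The ordering constraint can be sketched as follows. `backup_order` is a hypothetical helper; it covers only the "pg_xlog last" rule and does nothing about preventing WAL file reuse, which would need cooperation from the server.

```python
def backup_order(entries, wal_dir="pg_xlog"):
    """Return the entries in copy order, with the WAL directory last."""
    rest = sorted(e for e in entries if e != wal_dir)
    return rest + ([wal_dir] if wal_dir in entries else [])


print(backup_order(["pg_xlog", "base", "global"]))
# ['base', 'global', 'pg_xlog']
```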


#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: J. R. Nield (#12)
Re: PITR, checkpoint, and local relations

"J. R. Nield" <jrnield@usol.com> writes:

(In particular, I *strongly* object to using the buffer manager at all
for reading files for backup. That's pretty much guaranteed to blow out
buffer cache. Use plain OS-level file reads. An OS directory search
will do fine for finding what you need to read, too.)

How do you get atomic block copies otherwise?

Eh? The kernel does that for you, as long as you're reading the
same-size blocks that the backends are writing, no?

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#13)
Re: PITR, checkpoint, and local relations

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

So, we only have to use shared buffer pool for local (but probably
not for temporary) relations to close this issue, yes? I personally
don't see any performance issues if we do this.

Hmm. Temporary relations are a whole different story.

It would be nice if updates on temp relations never got WAL-logged at
all, but I'm not sure how feasible that is. Right now we don't really
distinguish temp relations from ordinary ones --- in particular, they
have pg_class entries, which surely will get WAL-logged even if we
persuade the buffer manager not to do it for the data pages. Is that
a problem? Not sure.

regards, tom lane

#19Richard Tucker
richt@multera.com
In reply to: Tom Lane (#17)
Re: PITR, checkpoint, and local relations

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Tom Lane
Sent: Friday, August 02, 2002 5:25 PM
To: J. R. Nield
Cc: Richard Tucker; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations

"J. R. Nield" <jrnield@usol.com> writes:

(In particular, I *strongly* object to using the buffer manager at all
for reading files for backup. That's pretty much guaranteed to blow out
buffer cache. Use plain OS-level file reads. An OS directory search
will do fine for finding what you need to read, too.)

How do you get atomic block copies otherwise?

Eh? The kernel does that for you, as long as you're reading the
same-size blocks that the backends are writing, no?

If the OS block size is 4k and the PostgreSQL block size is 8k do we know
for sure that the write call does not break this into two 4k writes to the
OS buffer cache?


#20Richard Tucker
richt@multera.com
In reply to: J. R. Nield (#15)
Re: PITR, checkpoint, and local relations

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of J. R. Nield
Sent: Friday, August 02, 2002 5:12 PM
To: Mikheev, Vadim
Cc: Tom Lane; Richard Tucker; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations

On Fri, 2002-08-02 at 16:59, Mikheev, Vadim wrote:

You don't need it.
As long as whole block is saved in log on first after
checkpoint (you made before backup) change to block.

I thought half the point of PITR was to be able to turn off pre-image
logging so you can trade potential recovery time for speed without fear
of data-loss. Didn't we have this discussion before?

Suppose you can turn PostgreSQL's atomic write on and off on the fly, which
means turning on or off whether XLogInsert writes a copy of the block into
the log file upon first modification after a checkpoint.
So ALTER SYSTEM BEGIN BACKUP would turn on atomic write and then checkpoint
the database.
So while the OS copy of the data files is going on the atomic write would be
enabled. So any read of a partial write would be fixed up by the usual crash
recovery mechanism.
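
A toy model of why this works, under stated assumptions: a block is split into a first half that carries the page LSN and a second half that carries data, redo skips a record when the page LSN is already at or past it, and only the first change after the checkpoint is modeled. `Page`, `WalRecord`, `log_change`, `redo`, and `backup_and_restore` are hypothetical names, not PostgreSQL's.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Page:
    lsn: int    # LSN of the last change, stored in the first half
    head: str   # payload in the first half of the block
    tail: str   # payload in the second half of the block


@dataclass(frozen=True)
class WalRecord:
    lsn: int
    new_tail: str
    full_page: Optional[Page] = None  # whole-block image, if logged


def log_change(page, lsn, new_tail, full_page_writes):
    """Apply a change to the tail; optionally log the whole block too."""
    new_page = Page(lsn, page.head, new_tail)
    image = new_page if full_page_writes else None
    return new_page, WalRecord(lsn, new_tail, image)


def redo(page, records):
    for rec in records:
        if rec.full_page is not None:
            page = rec.full_page  # image replaces whatever was copied
        elif page.lsn < rec.lsn:
            page = Page(rec.lsn, page.head, rec.new_tail)
        # otherwise the page LSN claims the change is already present
    return page


def backup_and_restore(full_page_writes):
    at_checkpoint = Page(lsn=10, head="h", tail="old")
    current, rec = log_change(at_checkpoint, 11, "new", full_page_writes)
    # The OS-level copy catches a partial write: the first half (with the
    # new LSN) is already on disk, the second half is still the old one.
    torn = Page(current.lsn, current.head, at_checkpoint.tail)
    return redo(torn, [rec]), current


print(backup_and_restore(False))
# (Page(lsn=11, head='h', tail='old'), Page(lsn=11, head='h', tail='new'))
print(backup_and_restore(True))
# (Page(lsn=11, head='h', tail='new'), Page(lsn=11, head='h', tail='new'))
```

Without the block image, redo trusts the LSN in the torn copy and leaves the stale second half in place. With it, the image logged on the first modification after the checkpoint overwrites the torn copy.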


#21Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Richard Tucker (#20)
Re: PITR, checkpoint, and local relations

So, we only have to use shared buffer pool for local (but probably
not for temporary) relations to close this issue, yes? I personally
don't see any performance issues if we do this.

Hmm. Temporary relations are a whole different story.

It would be nice if updates on temp relations never got WAL-logged at
all, but I'm not sure how feasible that is. Right now we don't really

There is no point in logging them.

distinguish temp relations from ordinary ones --- in particular, they
have pg_class entries, which surely will get WAL-logged even if we
persuade the buffer manager not to do it for the data pages. Is that
a problem? Not sure.

It was not about any problem. I just mean that the local buffer pool
could still be used for temporary relations, if someone thinks
that makes any sense. Anyone?

Vadim

#22Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#21)
Re: PITR, checkpoint, and local relations

You don't need it.
As long as whole block is saved in log on first after
checkpoint (you made before backup) change to block.

I thought half the point of PITR was to be able to turn
off pre-image logging so you can trade potential recovery

Correction - *after*-image.

time for speed without fear of data-loss. Didn't we have
this discussion before?

Sorry, I missed this.

So, has it already been discussed what to do about partial
block updates? Suppose the system crashed just after the LSN,
but not the actual tuple etc., was stored in the on-disk block.
On restart you compare the log record's LSN with the
data block's LSN, they are equal, and so you *assume*
that the actual data are in place too, which is not the case.

I always thought that the whole point of PITR is to be
able to restore DB fast (faster than pg_restore) *AND*
up to the last committed transaction (assuming that
log is Ok).

Vadim

#23Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#22)
Re: PITR, checkpoint, and local relations

How do you get atomic block copies otherwise?

Eh? The kernel does that for you, as long as you're reading the
same-size blocks that the backends are writing, no?

Good point.

Vadim

#24Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#23)
Re: PITR, checkpoint, and local relations

As long as the whole block is saved in the log on the first change to
the block after the checkpoint (the one you made before the backup).

I thought half the point of PITR was to be able to
turn off pre-image logging so you can trade potential
recovery time for speed without fear of data-loss.
Didn't we have this discussion before?

Suppose you can turn PostgreSQL's atomic write on or off on
the fly, which means controlling whether XLogInsert
writes a copy of the block into the log file upon first
modification after a checkpoint.
So ALTER SYSTEM BEGIN BACKUP would turn on atomic write
and then checkpoint the database.
While the OS copy of the data files is going on,
atomic write would be enabled, so any read of a partial
write would be fixed up by the usual crash recovery mechanism.

Yes, simple way to satisfy everyone.

Vadim

#25J. R. Nield
jrnield@usol.com
In reply to: Mikheev, Vadim (#23)
Re: PITR, checkpoint, and local relations

Are you sure this is true for all ports? And if so, why would it be
cheaper for the kernel to do it in its buffer manager, compared to us
doing it in ours? This just seems bogus to rely on. Does anyone know
what POSIX has to say about this?

On Fri, 2002-08-02 at 18:01, Mikheev, Vadim wrote:

How do you get atomic block copies otherwise?

Eh? The kernel does that for you, as long as you're reading the
same-size blocks that the backends are writing, no?

Good point.

Vadim

--
J. R. Nield
jrnield@usol.com

#26Richard Tucker
richt@multera.com
In reply to: Mikheev, Vadim (#24)
Re: PITR, checkpoint, and local relations

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org]On Behalf Of Mikheev, Vadim
Sent: Friday, August 02, 2002 6:16 PM
To: 'richt@multera.com'; J. R. Nield
Cc: Tom Lane; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations

As long as the whole block is saved in the log on the first change to
the block after the checkpoint (the one you made before the backup).

I thought half the point of PITR was to be able to
turn off pre-image logging so you can trade potential
recovery time for speed without fear of data-loss.
Didn't we have this discussion before?

Suppose you can turn PostgreSQL's atomic write on or off on
the fly, which means controlling whether XLogInsert
writes a copy of the block into the log file upon first
modification after a checkpoint.
So ALTER SYSTEM BEGIN BACKUP would turn on atomic write
and then checkpoint the database.
While the OS copy of the data files is going on,
atomic write would be enabled, so any read of a partial
write would be fixed up by the usual crash recovery mechanism.

Yes, simple way to satisfy everyone.

By the way, I could supply a patch which turns off the atomic write feature.
It is disabled via a configuration parameter. If the flag enabling or
disabling the feature were added to shared memory, in the XLogCtl structure,
then it could be toggled at runtime.
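
A minimal sketch of such a runtime toggle, in Python rather than the backend's C. `SharedXLogState`, `needs_full_image`, `begin_backup`, and the field names are hypothetical; only the rule itself (log a copy of the block on its first modification after a checkpoint, and only while the feature is enabled) comes from the discussion above.

```python
from dataclasses import dataclass


@dataclass
class SharedXLogState:
    """Stand-in for state kept in shared memory (hypothetical fields)."""
    atomic_write: bool = True   # the flag to be toggled at runtime
    redo_lsn: int = 0           # LSN where the latest checkpoint started


def needs_full_image(state: SharedXLogState, page_lsn: int) -> bool:
    """True if this change is the first one to the page since the checkpoint."""
    return state.atomic_write and page_lsn <= state.redo_lsn


def begin_backup(state: SharedXLogState, current_lsn: int) -> None:
    """Turn block logging on, then checkpoint."""
    state.atomic_write = True
    state.redo_lsn = current_lsn


state = SharedXLogState(atomic_write=False, redo_lsn=50)
before = needs_full_image(state, page_lsn=40)   # feature off: no image

begin_backup(state, current_lsn=200)
first_touch = needs_full_image(state, page_lsn=40)    # untouched since checkpoint
second_touch = needs_full_image(state, page_lsn=210)  # already modified after it
```

Because every backend would consult the same shared flag, flipping it takes effect for the next change to any page, without a restart.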

So I think what will work then is pg_copy (hot backup) would:
1) Issue an ALTER SYSTEM BEGIN BACKUP command which turns on atomic write,
checkpoints the database and disables further checkpoints (so WAL files
won't be reused) until the backup is complete.
2) Change ALTER SYSTEM BACKUP DATABASE TO <directory> to read the database
directory to find which files it should back up, rather than pg_class, and for
each file just use system(cp...) to copy it to the backup directory.
3) ALTER SYSTEM FINISH BACKUP does as it does now and backs up the pg_xlog
directory and re-enables database checkpointing.

Does this sound right?

BTW I will be on vacation until next Wednesday.

Vadim


#27Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Richard Tucker (#26)
Re: PITR, checkpoint, and local relations

Are you sure this is true for all ports?

Well, maybe you're right and it's not.
But with "after-image blocks in log after checkpoint"
you really shouldn't worry about block atomicity, right?
And the ability to turn block logging on/off, as suggested
by Richard, looks appropriate for everyone, no?

And if so, why would it be cheaper for the kernel to do it in
its buffer manager, compared to us doing it in ours? This just
seems bogus to rely on. Does anyone know what POSIX has to say
about this?

Does "doing it in ours" mean reading all data files through
our shared buffer pool? Sorry, I just don't see the point in this
when tar etc. will work just fine. At least for the first release
tar is SuperOK, because there will certainly be other
problems/bugs, unrelated to how the data files are read, and so
the sooner we start testing the better.

Vadim

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Tucker (#26)
Re: PITR, checkpoint, and local relations

Richard Tucker <richt@multera.com> writes:

1) Issue an ALTER SYSTEM BEGIN BACKUP command which turns on atomic write,
checkpoints the database and disables further checkpoints (so WAL files
won't be reused) until the backup is complete.
2) Change ALTER SYSTEM BACKUP DATABASE TO <directory> to read the database
directory to find which files it should back up, rather than pg_class, and for
each file just use system(cp...) to copy it to the backup directory.
3) ALTER SYSTEM FINISH BACKUP does as it does now and backs up the pg_xlog
directory and re-enables database checkpointing.

Does this sound right?

I really dislike the notion of turning off checkpointing. What if the
backup process dies or gets stuck (eg, it's waiting for some operator to
change a tape, but the operator has gone to lunch)? IMHO, backup
systems that depend on breaking the system's normal operational behavior
are broken. It should be sufficient to force a checkpoint when you
start and when you're done --- altering normal operation in between is
a bad design.

regards, tom lane

#29Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#28)
Re: PITR, checkpoint, and local relations

So I think what will work then is pg_copy (hot backup) would:
1) Issue an ALTER SYSTEM BEGIN BACKUP command which turns on atomic write,
checkpoints the database and disables further checkpoints (so WAL files
won't be reused) until the backup is complete.
2) Change ALTER SYSTEM BACKUP DATABASE TO <directory> to read the database
directory to find which files it should back up, rather than pg_class, and for
each file just use system(cp...) to copy it to the backup directory.

Did you consider saving backup on the client host (ie from where
pg_copy started)?

3) ALTER SYSTEM FINISH BACKUP does as it does now and backs up the pg_xlog
directory and re-enables database checkpointing.

Well, wouldn't a single ALTER SYSTEM BACKUP command be enough?
What's the point of having 3 commands?

(If all of this is already discussed then sorry - I'm not going
to start new discussion).

Vadim

#30Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Mikheev, Vadim (#29)
Re: PITR, checkpoint, and local relations

I really dislike the notion of turning off checkpointing. What if the
backup process dies or gets stuck (eg, it's waiting for some operator to
change a tape, but the operator has gone to lunch)? IMHO, backup
systems that depend on breaking the system's normal operational behavior
are broken. It should be sufficient to force a checkpoint when you
start and when you're done --- altering normal operation in between is
a bad design.

But you have to prevent log files from being reused while you copy data files.
That's why I asked whether 3 commands from pg_copy are required, and whether
the backup couldn't be accomplished by issuing a single command

ALTER SYSTEM BACKUP <dir | stdout (to copy data to client side)>

(even from pgsql) so the backup process would die with the entire system -:)
As for tape changing, maybe we could use some timeout and then just
stop the backup process.

Vadim

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#30)
Re: PITR, checkpoint, and local relations

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

It should be sufficient to force a checkpoint when you
start and when you're done --- altering normal operation in between is
a bad design.

But you have to prevent log files from being reused while you copy data files.

No, I don't think so. If you are using PITR then you presumably have
some process responsible for archiving off log files on a continuous
basis. The backup process should leave that normal operational behavior
in place, not muck with it.

regards, tom lane

#32Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Tom Lane (#31)
Re: PITR, checkpoint, and local relations

It should be sufficient to force a checkpoint when you
start and when you're done --- altering normal operation in between is
a bad design.

But you have to prevent log files from being reused while you copy data files.

No, I don't think so. If you are using PITR then you presumably have
some process responsible for archiving off log files on a continuous
basis. The backup process should leave that normal
operational behavior in place, not muck with it.

Well, PITR without log archiving could be an alternative to
pg_dump/pg_restore, but I agree that it's not a big
feature to worry about.

Vadim

#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Mikheev, Vadim (#32)
Re: PITR, checkpoint, and local relations

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

No, I don't think so. If you are using PITR then you presumably have
some process responsible for archiving off log files on a continuous
basis. The backup process should leave that normal
operational behavior in place, not muck with it.

Well, PITR without log archiving could be an alternative to
pg_dump/pg_restore, but I agree that it's not a big
feature to worry about.

Seems like a pointless "feature" to me. A pg_dump dump serves just
as well to capture a snapshot --- in fact better, since it's likely
smaller, definitely more portable, amenable to selective restore, etc.

I think we should design the PITR dump to do a good job for PITR,
not a poor job of both PITR and pg_dump.

regards, tom lane

#34Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#28)
Re: PITR, checkpoint, and local relations

Tom Lane wrote:

Richard Tucker <richt@multera.com> writes:

1) Issue an ALTER SYSTEM BEGIN BACKUP command which turns on atomic write,
checkpoints the database and disables further checkpoints (so WAL files
won't be reused) until the backup is complete.
2) Change ALTER SYSTEM BACKUP DATABASE TO <directory> to read the database
directory to find which files it should back up, rather than pg_class, and for
each file just use system(cp...) to copy it to the backup directory.
3) ALTER SYSTEM FINISH BACKUP does as it does now and backs up the pg_xlog
directory and re-enables database checkpointing.

Does this sound right?

I really dislike the notion of turning off checkpointing. What if the
backup process dies or gets stuck (eg, it's waiting for some operator to
change a tape, but the operator has gone to lunch)? IMHO, backup
systems that depend on breaking the system's normal operational behavior
are broken. It should be sufficient to force a checkpoint when you
start and when you're done --- altering normal operation in between is
a bad design.

Yes, and we have the same issue with turning on/off after-image writes.
How do we reset this after a PITR crash? The failure mode is
only poorer performance, but it may stay that way for a long time without
the administrator knowing it.

I wonder if we could SET the value in a transaction and keep the session
connection open. When we complete, we abort the transaction and
disconnect. If we die, the session terminates and the SET variable goes
back to the original value. (I am relying on the feature that ignores
SET in aborted transactions.)

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#35Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#31)
Re: PITR, checkpoint, and local relations

Tom Lane wrote:

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

It should be sufficient to force a checkpoint when you
start and when you're done --- altering normal operation in between is
a bad design.

But you have to prevent log files from being reused while you copy data files.

No, I don't think so. If you are using PITR then you presumably have
some process responsible for archiving off log files on a continuous
basis. The backup process should leave that normal operational behavior
in place, not muck with it.

But what if you normally log continuously to tape, and now you want to
back up to tape? You can't use the same tape drive for both operations.
Is that typical? I know sites that had only one tape drive that did
that.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#36Mikheev, Vadim
vmikheev@SECTORBASE.COM
In reply to: Bruce Momjian (#35)
Re: PITR, checkpoint, and local relations

Well, PITR without log archiving could be an alternative to
pg_dump/pg_restore, but I agree that it's not a big
feature to worry about.

Seems like a pointless "feature" to me. A pg_dump dump serves just
as well to capture a snapshot --- in fact better, since it's likely
smaller, definitely more portable, amenable to selective restore, etc.

But pg_restore will probably take longer than copying the data files
back and re-applying the log.

I think we should design the PITR dump to do a good job for PITR,
not a poor job of both PITR and pg_dump.

As I already said - agreed -:)

Vadim

#37Christopher Kings-Lynne
chriskl@familyhealth.com.au
In reply to: Bruce Momjian (#2)
Re: PITR, checkpoint, and local relations

The main area where it seems to get heavy use is during index builds,
and for 'CREATE TABLE AS SELECT...'.

So I will remove the local buffer manager as part of the PITR patch,
unless there is further objection.

Would someone mind filling me in as to what the local buffer manager is and
how it is different (and not useful) compared to the shared buffer manager?

Chris

#38Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Christopher Kings-Lynne (#37)
Re: PITR, checkpoint, and local relations

Christopher Kings-Lynne wrote:

The main area where it seems to get heavy use is during index builds,
and for 'CREATE TABLE AS SELECT...'.

So I will remove the local buffer manager as part of the PITR patch,
unless there is further objection.

Would someone mind filling me in as to what the local buffer manager is and
how it is different (and not useful) compared to the shared buffer manager?

Sure. I think I can handle that.

When you create a table in a transaction, there isn't any committed
state to the table yet, so any table modifications are kept in a local
buffer, which is local memory to the backend(?). No one needs to see it
because it isn't visible to anyone yet. Same for indexes.

Anyway, the WAL activity doesn't handle local buffers the same as shared
buffers because there is no crisis if the system crashes.

There is debate on whether the local buffers are even valuable
considering the headache they cause in other parts of the system.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#38)
Re: PITR, checkpoint, and local relations

Bruce Momjian <pgman@candle.pha.pa.us> writes:

There is debate on whether the local buffers are even valuable
considering the headache they cause in other parts of the system.

More specifically, the issue is that when (if) you commit, the contents
of the new table now have to be pushed out to shared storage. This is
moderately annoying in itself (among other things, it implies fsync'ing
those tables before commit). But the real reason it comes up now is
that the proposed PITR scheme can't cope gracefully with tables that
are suddenly there but weren't participating in checkpoints before.

It looks to me like we should stop using local buffers for ordinary
tables that happen to be in their first transaction of existence.
But, per Vadim's suggestion, we shouldn't abandon the local buffer
manager altogether. What we could and should use it for is TEMP tables,
which have no need to be checkpointed or WAL-logged or fsync'd or
accessible to other backends *ever*. Also, a temp table can leave
blocks in local buffers across transactions, which makes local buffers
considerably more useful than they are now.

If temp tables didn't use the shared bufmgr nor did updates to them get
WAL-logged, they'd be noticeably more efficient than plain tables, which
IMHO would be a Good Thing. Such tables would be essentially invisible
to WAL and PITR (at least their contents would be --- I assume we'd
still log file creation and deletion). But I can't see anything wrong
with that.

In short, the proposal runs something like this:

* Regular tables that happen to be in their first transaction of
existence are not treated differently from any other regular table so
far as buffer management or WAL or PITR go. (rd_myxactonly either goes
away or is used for much less than it is now.)

* TEMP tables use the local buffer manager for their entire existence.
(This probably means adding an "rd_istemp" flag to relcache entries, but
I can't see anything wrong with that.)

* Local bufmgr semantics are twiddled to reflect this reality --- in
particular, data in local buffers can be held across transactions, there
is no end-of-transaction write (much less fsync). A TEMP table that
isn't too large might never touch disk at all.

* Data operations in TEMP tables do not get WAL-logged, nor do we
WAL-log page images of local-buffer pages.
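
The four points above amount to a single routing decision per relation. A toy sketch follows, in Python for brevity; `Relation`, `choose_buffer_pool`, and `should_wal_log` are hypothetical names, and `rd_istemp` is the flag proposed above rather than an existing relcache field.

```python
from dataclasses import dataclass


@dataclass
class Relation:
    name: str
    rd_istemp: bool = False          # flag proposed in this message
    created_in_this_xact: bool = False


def choose_buffer_pool(rel: Relation) -> str:
    """TEMP relations use local buffers; everything else is shared."""
    return "local" if rel.rd_istemp else "shared"


def should_wal_log(rel: Relation) -> bool:
    """Data changes in TEMP relations are not WAL-logged."""
    return not rel.rd_istemp


new_table = Relation("orders", created_in_this_xact=True)
temp_table = Relation("scratch", rd_istemp=True)

routing = {
    r.name: (choose_buffer_pool(r), should_wal_log(r))
    for r in (new_table, temp_table)
}
```

Note that `created_in_this_xact` plays no part in either decision, which is the point of the first bullet: a table in its first transaction of existence is handled like any other regular table.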

These changes seem very attractive to me even without regard for making
the world safer for PITR. I'm willing to volunteer to make them happen,
if there are no objections.

regards, tom lane

#40Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#39)
Re: PITR, checkpoint, and local relations

Sounds like a win all around; make PITR easier and temp tables faster.

---------------------------------------------------------------------------

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

There is debate on whether the local buffers are even valuable
considering the headache they cause in other parts of the system.

More specifically, the issue is that when (if) you commit, the contents
of the new table now have to be pushed out to shared storage. This is
moderately annoying in itself (among other things, it implies fsync'ing
those tables before commit). But the real reason it comes up now is
that the proposed PITR scheme can't cope gracefully with tables that
are suddenly there but weren't participating in checkpoints before.

It looks to me like we should stop using local buffers for ordinary
tables that happen to be in their first transaction of existence.
But, per Vadim's suggestion, we shouldn't abandon the local buffer
manager altogether. What we could and should use it for is TEMP tables,
which have no need to be checkpointed or WAL-logged or fsync'd or
accessible to other backends *ever*. Also, a temp table can leave
blocks in local buffers across transactions, which makes local buffers
considerably more useful than they are now.

If temp tables didn't use the shared bufmgr nor did updates to them get
WAL-logged, they'd be noticeably more efficient than plain tables, which
IMHO would be a Good Thing. Such tables would be essentially invisible
to WAL and PITR (at least their contents would be --- I assume we'd
still log file creation and deletion). But I can't see anything wrong
with that.

In short, the proposal runs something like this:

* Regular tables that happen to be in their first transaction of
existence are not treated differently from any other regular table so
far as buffer management or WAL or PITR go. (rd_myxactonly either goes
away or is used for much less than it is now.)

* TEMP tables use the local buffer manager for their entire existence.
(This probably means adding an "rd_istemp" flag to relcache entries, but
I can't see anything wrong with that.)

* Local bufmgr semantics are twiddled to reflect this reality --- in
particular, data in local buffers can be held across transactions, there
is no end-of-transaction write (much less fsync). A TEMP table that
isn't too large might never touch disk at all.

* Data operations in TEMP tables do not get WAL-logged, nor do we
WAL-log page images of local-buffer pages.

These changes seem very attractive to me even without regard for making
the world safer for PITR. I'm willing to volunteer to make them happen,
if there are no objections.

regards, tom lane

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#41Greg Copeland
greg@CopelandConsulting.Net
In reply to: Tom Lane (#39)
Re: PITR, checkpoint, and local relations

On Sat, 2002-08-03 at 21:01, Tom Lane wrote:

* Local bufmgr semantics are twiddled to reflect this reality --- in
particular, data in local buffers can be held across transactions, there
is no end-of-transaction write (much less fsync). A TEMP table that
isn't too large might never touch disk at all.

Curious. Is there currently such a criterion? What exactly constitutes
"too large"?

Greg

#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Greg Copeland (#41)
Re: PITR, checkpoint, and local relations

Greg Copeland <greg@copelandconsulting.net> writes:

On Sat, 2002-08-03 at 21:01, Tom Lane wrote:

* Local bufmgr semantics are twiddled to reflect this reality --- in
particular, data in local buffers can be held across transactions, there
is no end-of-transaction write (much less fsync). A TEMP table that
isn't too large might never touch disk at all.

Curious. Is there currently such a criterion? What exactly constitutes
"too large"?

"too large" means "doesn't fit in the local buffer set". At the moment
the maximum number of local buffers seems to be frozen at 64. I was
thinking of exposing that as a configuration parameter while we're at
it.
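
To make "doesn't fit in the local buffer set" concrete, here is a toy model. `LocalBufferPool` and its least-recently-used eviction policy are assumptions for illustration, not the logic in localbuf.c; the only figure taken from the thread is the limit of 64 buffers.

```python
from collections import OrderedDict


class LocalBufferPool:
    """Toy backend-local buffer set with least-recently-used eviction."""

    def __init__(self, nbuffers=64):
        self.nbuffers = nbuffers
        self.buffers = OrderedDict()   # (relation, block number) -> dirty flag
        self.disk_writes = 0

    def write_block(self, rel, blockno):
        key = (rel, blockno)
        if key in self.buffers:
            self.buffers.move_to_end(key)
        elif len(self.buffers) >= self.nbuffers:
            # No free buffer: evict the oldest one, writing it if dirty.
            _, dirty = self.buffers.popitem(last=False)
            if dirty:
                self.disk_writes += 1
        self.buffers[key] = True


small = LocalBufferPool()
for blk in range(64):          # exactly fills the buffer set
    small.write_block("temp_a", blk)

large = LocalBufferPool()
for blk in range(100):         # 36 blocks more than fit
    large.write_block("temp_b", blk)
```

In this model the 64-block table never causes a write, while the 100-block table forces one write per block beyond the limit.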

regards, tom lane

#43J. R. Nield
jrnield@usol.com
In reply to: Bruce Momjian (#40)
Re: PITR, checkpoint, and local relations

This is great Tom. I will try to get what I have to you, Vadim, and
other interested parties tonight (Mon), assuming none of my tests fail
and reveal major bugs. It will do most of the important stuff except
your changes to the local buffer manager. I just have a few more minor
tweaks, and I would like to test it a little first.

On your advice I have made it use direct OS calls to copy the files,
using BLCKSZ aligned read() requests, instead of going through the
buffer manager for reads. I can think more about the correctness of this
later, since the rest of the code doesn't depend on which method is
used.
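
A rough sketch of such a copy loop, in Python for brevity and assuming a POSIX system. `copy_relation_file` is a hypothetical helper, and the 8192-byte `BLCKSZ` is PostgreSQL's default block size, assumed here. Whether each `read()` returns an atomic image of a block is exactly the open question in this thread, so the sketch says nothing about correctness under concurrent writes.

```python
import os
import tempfile

BLCKSZ = 8192  # PostgreSQL's default block size (assumed here)


def copy_relation_file(src_path, dst_path):
    """Copy a file using block-aligned, block-sized read() calls."""
    src = os.open(src_path, os.O_RDONLY)
    dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    blocks = 0
    try:
        while True:
            # Each request asks for one whole block, starting where the
            # previous one ended, so offsets stay multiples of BLCKSZ.
            chunk = os.read(src, BLCKSZ)
            if not chunk:
                break
            os.write(dst, chunk)
            blocks += 1
    finally:
        os.close(src)
        os.close(dst)
    return blocks


with tempfile.TemporaryDirectory() as tmp:
    src_file = os.path.join(tmp, "16384")
    dst_file = os.path.join(tmp, "16384.bak")
    with open(src_file, "wb") as f:
        f.write(os.urandom(3 * BLCKSZ))
    copied = copy_relation_file(src_file, dst_file)
    with open(src_file, "rb") as a, open(dst_file, "rb") as b:
        identical = a.read() == b.read()
```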

To Richard Tucker: I think duplicating the WAL files the way you plan is
not the way I want to do it. I'd rather have a log archiving system be
used for this. One thing that does need to be done is an interactive
recovery mode, and as soon as I finish getting my current work out for
review I'd be glad to have you write it if you want. You'll need to see
this in order to interface properly.

Regards,

John Nield

On Sat, 2002-08-03 at 22:52, Bruce Momjian wrote:

Sounds like a win all around; make PITR easier and temp tables faster.

---------------------------------------------------------------------------

Tom Lane wrote:

These changes seem very attractive to me even without regard for making
the world safer for PITR. I'm willing to volunteer to make them happen,
if there are no objections.

regards, tom lane

-- 
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

--
J. R. Nield
jrnield@usol.com

#44Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#39)
Re: PITR, checkpoint, and local relations

I said:

In short, the proposal runs something like this:

* Regular tables that happen to be in their first transaction of
existence are not treated differently from any other regular table so
far as buffer management or WAL or PITR go. (rd_myxactonly either goes
away or is used for much less than it is now.)

* TEMP tables use the local buffer manager for their entire existence.
(This probably means adding an "rd_istemp" flag to relcache entries, but
I can't see anything wrong with that.)

* Local bufmgr semantics are twiddled to reflect this reality --- in
particular, data in local buffers can be held across transactions, there
is no end-of-transaction write (much less fsync). A TEMP table that
isn't too large might never touch disk at all.

* Data operations in TEMP tables do not get WAL-logged, nor do we
WAL-log page images of local-buffer pages.

I've committed changes to implement these ideas. One thing that proved
interesting was that transactions that only made changes in existing
TEMP tables failed to commit --- RecordTransactionCommit thought it
didn't need to do anything, because no WAL entries had been made! This
was fixed by introducing another flag that gets set when we skip making
a WAL record because we're working in a TEMP relation.
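
The fix can be pictured as follows. This is a sketch, not the committed code: `XactState`, `made_wal_entry`, and `skipped_temp_wal` are hypothetical names for the existing condition and the new flag.

```python
from dataclasses import dataclass


@dataclass
class XactState:
    made_wal_entry: bool = False     # some WAL record was written
    skipped_temp_wal: bool = False   # a record was skipped for a TEMP relation


def modify(xact: XactState, rel_is_temp: bool) -> None:
    """Record the effect of one data change on the transaction state."""
    if rel_is_temp:
        xact.skipped_temp_wal = True
    else:
        xact.made_wal_entry = True


def commit_needs_work_old(xact: XactState) -> bool:
    """Behaviour before the fix: only WAL activity counts."""
    return xact.made_wal_entry


def commit_needs_work_new(xact: XactState) -> bool:
    """Behaviour after the fix: skipped TEMP records count too."""
    return xact.made_wal_entry or xact.skipped_temp_wal


temp_only = XactState()
modify(temp_only, rel_is_temp=True)
```

For a transaction that touched only TEMP relations, the old check reports nothing to do, while the new one recognizes that the transaction still has to be committed.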

I have not done anything about exporting NLocBuffer as a GUC parameter.
The algorithms in localbuf.c are, um, pretty sucky, and would run very
slowly if NLocBuffer were large. It'd make sense to install a hash
index table similar to the one used for shared buffers, and then we
could allow people to set NLocBuffer as large as their system can stand.
I figured that was a task for another day, however.
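
For illustration, the difference between a linear search and a hash lookup over buffer tags looks roughly like this. The descriptor layout and function names are hypothetical, and Python's `dict` stands in for the hash table; nothing here reflects the actual contents of localbuf.c.

```python
def find_linear(descriptors, tag):
    """Scan every descriptor until the tag matches: O(number of buffers)."""
    for buf_id, buf_tag in enumerate(descriptors):
        if buf_tag == tag:
            return buf_id
    return None


def build_index(descriptors):
    """Map each (relation, block number) tag to its buffer id."""
    return {tag: buf_id for buf_id, tag in enumerate(descriptors)}


def find_hashed(index, tag):
    """Single hash probe: cost does not grow with the number of buffers."""
    return index.get(tag)


descriptors = [("temp_a", blk) for blk in range(10_000)]
index = build_index(descriptors)

target = ("temp_a", 9_999)
linear_hit = find_linear(descriptors, target)
hashed_hit = find_hashed(index, target)
miss = find_hashed(index, ("temp_b", 0))
```

Both lookups return the same buffer id; the linear version simply has to walk all 10,000 descriptors to get there, which is why a large NLocBuffer would be slow without an index.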

regards, tom lane

#45Andrew Sullivan
andrew@libertyrms.info
In reply to: Bruce Momjian (#35)
Re: PITR, checkpoint, and local relations

On Fri, Aug 02, 2002 at 08:52:27PM -0400, Bruce Momjian wrote:

But what if you normally log continuously to tape, and now you want to
back up to tape? You can't use the same tape drive for both operations.
Is that typical? I know sites that had only one tape drive that did
that.

I have seen such installations. They always seemed like a real false
economy to me. Tape drives are not so expensive that, if you really
need to ensure your data is well and truly safe, you can't afford two
of them. But that's just my 2 cents. (Or, I guess in this case, 4
cents.)

A

-- 
----
Andrew Sullivan                               87 Mowat Avenue 
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110
#46Greg Copeland
greg@CopelandConsulting.Net
In reply to: Andrew Sullivan (#45)
Re: PITR, checkpoint, and local relations

When I've seen this done, I've seen DLTs used, as they allow
multiple channels to be streamed to tape at the same time. If your tape
device does not allow multiple concurrent input streams, you're
going to have to obtain multiple drives.

Please keep in mind, my DLT experience is limited.

Greg

On Tue, 2002-08-06 at 13:35, Andrew Sullivan wrote:

On Fri, Aug 02, 2002 at 08:52:27PM -0400, Bruce Momjian wrote:

But what if you normally log continuously to tape, and now you want to
back up to tape? You can't use the same tape drive for both operations.
Is that typical? I know sites that had only one tape drive that did
that.

I have seen such installations. They always seemed like a real false
economy to me. Tape drives are not so expensive that, if you really
need to ensure your data is well and truly safe, you can't afford two
of them. But that's just my 2 cents. (Or, I guess in this case, 4
cents.)

A

-- 
----
Andrew Sullivan                               87 Mowat Avenue 
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
+1 416 646 3304 x110


#47Richard Tucker
richt@multera.com
In reply to: Mikheev, Vadim (#29)
Re: PITR, checkpoint, and local relations

-----Original Message-----
From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]
Sent: Friday, August 02, 2002 7:51 PM
To: 'richt@multera.com'; J. R. Nield
Cc: Tom Lane; Bruce Momjian; PostgreSQL Hacker
Subject: RE: [HACKERS] PITR, checkpoint, and local relations

So I think what will work then is pg_copy (hot backup) would:
1) Issue an ALTER SYSTEM BEGIN BACKUP command which turns on atomic write,
checkpoints the database and disables further checkpoints (so WAL files
won't be reused) until the backup is complete.
2) Change ALTER SYSTEM BACKUP DATABASE TO <directory> to read the database
directory to find which files it should back up, rather than pg_class, and for
each file just use system(cp...) to copy it to the backup directory.

Did you consider saving backup on the client host (ie from where
pg_copy started)?

No, pg_copy just uses the libpq interface.

3) ALTER SYSTEM FINISH BACKUP does as it does now and backs up the pg_xlog
directory and re-enables database checkpointing.

I think now it could be just one command. My implementation was reading
pg_class to find the tables and indexes that needed backing up. Now reading
pg_database would be sufficient to find the directories containing files
that needed to be archived, so it could all be done in one command.

Well, wouldn't a single ALTER SYSTEM BACKUP command be enough?
What's the point of having 3 commands?

(If all of this is already discussed then sorry - I'm not going
to start new discussion).

Vadim

#48Richard Tucker
richt@multera.com
In reply to: Tom Lane (#28)
Re: PITR, checkpoint, and local relations

Maybe we don't have to turn off checkpointing, but we DO have to make sure no
WAL files get re-used while the backup is running. The WAL files must be
archived after everything else has been archived. Furthermore, if we don't
stop checkpointing then care must be taken to back up the pg_control file
first.
-regards
richt
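
The ordering constraint can be sketched as a sort key. This is an illustration only: `backup_order` is a hypothetical helper, and the example paths are assumptions about the data-directory layout (a `pg_control` file and a `pg_xlog` directory, as named in the thread).

```python
from pathlib import PurePosixPath


def backup_order(paths):
    """Order files so pg_control is copied first and WAL segments last."""
    def rank(path):
        parts = PurePosixPath(path).parts
        if parts and parts[-1] == "pg_control":
            return 0          # control file: before anything else
        if "pg_xlog" in parts:
            return 2          # WAL: only after everything else is archived
        return 1              # ordinary data files in between
    # sorted() is stable, so files keep their relative order within a rank.
    return sorted(paths, key=rank)


files = [
    "pg_xlog/0000000000000001",
    "base/1/1259",
    "global/pg_control",
    "base/1/1249",
]
ordered = backup_order(files)
```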

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, August 02, 2002 7:49 PM
To: richt@multera.com
Cc: Mikheev, Vadim; J. R. Nield; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations

Richard Tucker <richt@multera.com> writes:

1) Issue an ALTER SYSTEM BEGIN BACKUP command which turns on atomic write,
checkpoints the database and disables further checkpoints (so WAL files
won't be reused) until the backup is complete.
2) Change ALTER SYSTEM BACKUP DATABASE TO <directory> to read the database
directory to find which files it should back up, rather than pg_class, and for
each file just use system(cp...) to copy it to the backup directory.
3) ALTER SYSTEM FINISH BACKUP does as it does now and backs up the pg_xlog
directory and re-enables database checkpointing.

Does this sound right?

I really dislike the notion of turning off checkpointing. What if the
backup process dies or gets stuck (eg, it's waiting for some operator to
change a tape, but the operator has gone to lunch)? IMHO, backup
systems that depend on breaking the system's normal operational behavior
are broken. It should be sufficient to force a checkpoint when you
start and when you're done --- altering normal operation in between is
a bad design.

regards, tom lane

#49Richard Tucker
richt@multera.com
In reply to: Bruce Momjian (#35)
Re: PITR, checkpoint, and local relations

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
Sent: Friday, August 02, 2002 8:52 PM
To: Tom Lane
Cc: Mikheev, Vadim; richt@multera.com; J. R. Nield; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations

Tom Lane wrote:

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

It should be sufficient to force a checkpoint when you
start and when you're done --- altering normal operation in between is
a bad design.

But you have to prevent log files from being reused while you copy data files.

No, I don't think so. If you are using PITR then you presumably have
some process responsible for archiving off log files on a continuous
basis. The backup process should leave that normal operational behavior
in place, not muck with it.

But what if you normally LOG continuously to tape, and now you want to
back up to tape? You can't use the same tape drive for both operations.
Is that typical? I know sites that had only one tape drive that did
that.

Our implementation of pg_copy did not archive to tape. Archiving to tape adds
a lot of complications, so I thought we would just make a disk-to-disk copy,
and the disk copy could then be archived to tape at the user's discretion.

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#50Richard Tucker
richt@multera.com
In reply to: Bruce Momjian (#34)
Re: PITR, checkpoint, and local relations

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
Sent: Friday, August 02, 2002 8:51 PM
To: Tom Lane
Cc: richt@multera.com; Mikheev, Vadim; J. R. Nield; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations

Tom Lane wrote:

Richard Tucker <richt@multera.com> writes:

1) Issue an ALTER SYSTEM BEGIN BACKUP command which turns on atomic write,
checkpoints the database and disables further checkpoints (so WAL files
won't be reused) until the backup is complete.
2) Change ALTER SYSTEM BACKUP DATABASE TO <directory> to read the database
directory to find which files it should back up, rather than pg_class, and for
each file just use system(cp...) to copy it to the backup directory.
3) ALTER SYSTEM FINISH BACKUP does as it does now and backs up the pg_xlog
directory and re-enables database checkpointing.

Does this sound right?

I really dislike the notion of turning off checkpointing. What if the
backup process dies or gets stuck (eg, it's waiting for some operator to
change a tape, but the operator has gone to lunch)? IMHO, backup
systems that depend on breaking the system's normal operational behavior
are broken. It should be sufficient to force a checkpoint when you
start and when you're done --- altering normal operation in between is
a bad design.

Yes, and we have the same issue with turning on/off after-image writes.
How do we reset this after a PITR crash? The failure mode is only poorer
performance, but it may stay that way for a long time without the
administrator knowing it.

I wonder if we could SET the value in a transaction and keep the session
connection open. When we complete, we abort the transaction and
disconnect. If we die, the session terminates and the SET variable goes
back to the original value. (I am relying on the feature whereby SET is
ignored in aborted transactions.)

I think all these concerns are addressed if the ALTER SYSTEM BACKUP is done
as a single command. In what I implemented, the checkpoint process, while
polling for the checkpoint lock, tested whether backup processing was still
alive and, if not, reset everything back to the pre-backup settings.
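
The liveness check described here might look roughly like the sketch below. It is a Python illustration, not the pg_copy implementation: the BackupState class, its field names, the setting names, and the POSIX-only os.kill(pid, 0) probe are all assumptions. The probe is passed in as a parameter so the reset logic can be exercised without a real backup process.

```python
import os

def process_alive(pid):
    """POSIX-only probe: signal 0 tests for existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # the process exists but belongs to another user
    return True

class BackupState:
    """Hypothetical record of what a backup changed, so it can be undone."""
    def __init__(self, backup_pid, saved_settings):
        self.backup_pid = backup_pid
        self.saved_settings = dict(saved_settings)
        self.active = True

    def poll(self, settings, alive=process_alive):
        """Called from the checkpoint loop; restores the pre-backup
        settings once the backup process is found to be gone."""
        if self.active and not alive(self.backup_pid):
            settings.update(self.saved_settings)
            self.active = False
        return self.active

# Settings as changed by a running backup (names are made up)
settings = {"checkpoints_enabled": False, "atomic_write": True}
state = BackupState(12345, {"checkpoints_enabled": True, "atomic_write": False})
print(state.poll(settings, alive=lambda pid: True))    # backup still running
print(state.poll(settings, alive=lambda pid: False))   # backup has died
print(settings)                                        # pre-backup values again
```

Whether the checkpoint process is the right place for this watchdog is a design question the sketch does not settle; it only shows that the reset can be made automatic.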

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#51Richard Tucker
richt@multera.com
In reply to: Tom Lane (#31)
Re: PITR, checkpoint, and local relations

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Friday, August 02, 2002 8:06 PM
To: Mikheev, Vadim
Cc: richt@multera.com; J. R. Nield; Bruce Momjian; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations

"Mikheev, Vadim" <vmikheev@SECTORBASE.COM> writes:

It should be sufficient to force a checkpoint when you
start and when you're done --- altering normal operation in between is
a bad design.

But you have to prevent log files from being reused while you copy data files.

No, I don't think so. If you are using PITR then you presumably have
some process responsible for archiving off log files on a continuous
basis. The backup process should leave that normal operational behavior
in place, not muck with it.

You want the log files necessary for recovering the database to be in the
backup copy -- don't you?

regards, tom lane

#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Tucker (#51)
Re: PITR, checkpoint, and local relations

Richard Tucker <richt@multera.com> writes:

But you have to prevent log files from being reused while you copy data files.

No, I don't think so. If you are using PITR then you presumably have
some process responsible for archiving off log files on a continuous
basis. The backup process should leave that normal operational behavior
in place, not muck with it.

You want the log files necessary for recovering the database to be in the
backup copy -- don't you?

Why? As far as I can see, this entire feature only makes sense in the
context where you are continuously archiving log files to someplace
(let's say tape, for purposes of discussion). Every so often you make a
backup, and what that does is it lets you recycle the log-archive tapes
older than the start of the backup. You still need the log segments
newer than the start of the backup, and you might as well just keep the
tapes that they're going to be on anyway. Doing it the way you propose
(ie, causing a persistent change in the behavior of the log archiving
process) simply makes the whole operation more complex and more fragile,
without any actual gain in functionality that I can detect.

regards, tom lane
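
The point about which archived log segments remain necessary can be illustrated with a small sketch. Segment numbers are plain integers here and the function name split_archive is made up for the example; real segment naming is not modeled, and the sketch assumes no older backup is being kept that would still need the older segments.

```python
def split_archive(archived_segments, backup_start_segment):
    """Split archived log segments into recyclable and still-needed lists.

    Segments older than the one that was current when the backup started
    are not needed to roll forward from that backup.
    """
    recyclable = [s for s in archived_segments if s < backup_start_segment]
    needed = [s for s in archived_segments if s >= backup_start_segment]
    return recyclable, needed

recyclable, needed = split_archive([10, 11, 12, 13, 14], backup_start_segment=12)
print(recyclable)  # [10, 11]
print(needed)      # [12, 13, 14]
```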

#53Richard Tucker
richt@multera.com
In reply to: Mikheev, Vadim (#23)
Re: PITR, checkpoint, and local relations

-----Original Message-----
From: Mikheev, Vadim [mailto:vmikheev@SECTORBASE.COM]
Sent: Friday, August 02, 2002 6:01 PM
To: 'Tom Lane'; J. R. Nield
Cc: Richard Tucker; Bruce Momjian; PostgreSQL Hacker
Subject: RE: [HACKERS] PITR, checkpoint, and local relations

How do you get atomic block copies otherwise?

Eh? The kernel does that for you, as long as you're reading the
same-size blocks that the backends are writing, no?

Good point.

We know for sure the kernel does this? I think this is a dubious
assumption.

Vadim

#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Richard Tucker (#53)
Re: PITR, checkpoint, and local relations

Richard Tucker <richt@multera.com> writes:

Eh? The kernel does that for you, as long as you're reading the
same-size blocks that the backends are writing, no?

We know for sure the kernel does this? I think this is a dubious
assumption.

Yeah, as someone pointed out later, it doesn't work if the kernel's
internal buffer size is smaller than our BLCKSZ. So we do still need
the page images in WAL --- that protection against non-atomic writes
at the hardware level should serve for this problem too.

regards, tom lane
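
To make the failure mode concrete, here is a toy simulation. It assumes a block size of 8192 bytes and a kernel that transfers 4096 bytes at a time; both numbers and all function names are illustrative. A reader that copies a block while it is being rewritten can see a new first half and an old second half, and replaying a full page image from the log repairs the copy.

```python
BLCKSZ = 8192        # assumed block size for this example
KERNEL_BUF = 4096    # assumed smaller kernel transfer unit

def torn_read(old_page, new_page, chunks_written):
    """Return what a reader sees if only `chunks_written` kernel-sized
    chunks of new_page had reached the file when it read the block."""
    cut = chunks_written * KERNEL_BUF
    return new_page[:cut] + old_page[cut:]

def replay_page_image(copied_page, wal_page_image):
    """Recovery overwrites the whole block with the logged image,
    so the contents of the torn copy do not matter."""
    assert len(wal_page_image) == BLCKSZ
    return wal_page_image

old = b"A" * BLCKSZ
new = b"B" * BLCKSZ
copied = torn_read(old, new, chunks_written=1)
print(copied == new)                           # False: half new, half old
print(replay_page_image(copied, new) == new)   # True
```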

#55Richard Tucker
richt@multera.com
In reply to: J. R. Nield (#43)
Re: PITR, checkpoint, and local relations

-----Original Message-----
From: J. R. Nield [mailto:jrnield@usol.com]
Sent: Monday, August 05, 2002 12:58 PM
To: Bruce Momjian
Cc: Tom Lane; Christopher Kings-Lynne; Richard Tucker; PostgreSQL Hacker
Subject: Re: [HACKERS] PITR, checkpoint, and local relations

This is great Tom. I will try to get what I have to you, Vadim, and
other interested parties tonight (Mon), assuming none of my tests fail
and reveal major bugs. It will do most of the important stuff except
your changes to the local buffer manager. I just have a few more minor
tweaks, and I would like to test it a little first.

On your advice I have made it use direct OS calls to copy the files,
using BLCKSZ aligned read() requests, instead of going through the
buffer manager for reads. I can think more about the correctness of this
later, since the rest of the code doesn't depend on which method is
used.
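
A file copy built on block-sized reads, as described above, might look like the following Python sketch. It is not the code under discussion; copy_in_blocks is a hypothetical name and 8192 is an assumed block size.

```python
import os
import tempfile

BLCKSZ = 8192  # assumed block size

def copy_in_blocks(src_path, dst_path, blocksize=BLCKSZ):
    """Copy src to dst with unbuffered reads of one block each, so every
    read request starts on a block boundary. Returns the block count."""
    blocks = 0
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(blocksize)   # one read request per block
            if not block:
                break
            dst.write(block)
            blocks += 1
    return blocks

# Demonstration on a scratch file of exactly three blocks
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "relfile")
    dst = os.path.join(d, "relfile.bak")
    with open(src, "wb") as f:
        f.write(os.urandom(3 * BLCKSZ))
    n = copy_in_blocks(src, dst)
    with open(src, "rb") as a, open(dst, "rb") as b:
        same = a.read() == b.read()
    print(n, same)
```

Block-aligned reads alone do not guarantee that each block is read atomically with respect to concurrent writers; that question is taken up elsewhere in this thread.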

To Richard Tucker: I think duplicating the WAL files the way you plan is
not the way I want to do it. I'd rather have a log archiving system be
used for this. One thing that does need to be done is an interactive
recovery mode, and as soon as I finish getting my current work out for
review I'd be glad to have you write it if you want. You'll need to see
this in order to interface properly.

If you don't duplicate (mirror) the log, then in the event you need to restore
a database with roll-forward recovery, won't the restored database be missing,
on average, half a log segment's worth of changes?

Regards,

John Nield

On Sat, 2002-08-03 at 22:52, Bruce Momjian wrote:

Sounds like a win all around; make PITR easier and temp tables faster.

---------------------------------------------------------------------------

Tom Lane wrote:

These changes seem very attractive to me even without regard for making
the world safer for PITR. I'm willing to volunteer to make them happen,
if there are no objections.

regards, tom lane

--
Bruce Momjian                        |  http://candle.pha.pa.us
pgman@candle.pha.pa.us               |  (610) 853-3000
+  If your life is a hard drive,     |  830 Blythe Avenue
+  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

--
J. R. Nield
jrnield@usol.com

#56J. R. Nield
jrnield@usol.com
In reply to: Richard Tucker (#55)
Re: PITR, checkpoint, and local relations

On Wed, 2002-08-07 at 11:52, Richard Tucker wrote:

If you don't duplicate (mirror) the log, then in the event you need to restore
a database with roll-forward recovery, won't the restored database be missing,
on average, half a log segment's worth of changes?

The xlog code must allow us to force an advance to the next log file,
and truncate the archived file when it's copied so as not to waste
space. This also prevents the sysadmin from confusing two logfiles with
the same name and different data.

This complicates both the recovery logic and XLogInsert, and I'm trying
to kill the "last" latent bug in that feature now. Hopefully I can even
convince myself that the code is correct and covers all the cases.

As a side effect, the refactoring of XLogInsert makes it easy to add a
special record as the first XLogRecord of each file. This can contain
information useful to the system administrator, like what database
installation the file came from. Since it's at a fixed offset after the
page header, external tools can read it in a simple way.

--
J. R. Nield
jrnield@usol.com
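
The idea of an identifying first record at a fixed offset can be sketched with a made-up layout. Nothing below reflects the real WAL format: the 16-byte page header size, the field layout, and the names are assumptions chosen only to show why a fixed offset makes the record easy for external tools to read.

```python
import struct

PAGE_HEADER_SIZE = 16                 # hypothetical page header length
FILE_INFO = struct.Struct("<Q32s")    # hypothetical: install id + label

def make_log_file_prefix(install_id, label):
    """Build a placeholder page header followed by the info record."""
    header = b"\x00" * PAGE_HEADER_SIZE
    return header + FILE_INFO.pack(install_id, label.encode("ascii"))

def read_file_info(data):
    """What an external tool would do: skip the header and unpack."""
    install_id, raw = FILE_INFO.unpack_from(data, PAGE_HEADER_SIZE)
    return install_id, raw.rstrip(b"\x00").decode("ascii")

prefix = make_log_file_prefix(42, "prod-db-1")
print(read_file_info(prefix))
```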

#57Tom Lane
tgl@sss.pgh.pa.us
In reply to: J. R. Nield (#56)
Re: PITR, checkpoint, and local relations

"J. R. Nield" <jrnield@usol.com> writes:

The xlog code must allow us to force an advance to the next log file,
and truncate the archived file when it's copied so as not to waste
space.

Uh, why? Why not just force a checkpoint and remember the exact
location of the checkpoint within the current log file?

When and if you roll back to a prior checkpoint, you'd want to start the
system running forward with a new xlog file, I think (compare what
pg_resetxlog does). But it doesn't follow that you MUST force an xlog
file boundary simply because you're taking a backup.

This complicates both the recovery logic and XLogInsert, and I'm trying
to kill the "last" latent bug in that feature now.

Indeed. How about keeping it simple, instead?

regards, tom lane

#58Vadim Mikheev
vmikheev@sectorbase.com
In reply to: Richard Tucker (#55)
Re: PITR, checkpoint, and local relations

The xlog code must allow us to force an advance to the next log file,
and truncate the archived file when it's copied so as not to waste
space.

Uh, why? Why not just force a checkpoint and remember the exact
location of the checkpoint within the current log file?

Yes, why not just save pg_control's content with the new checkpoint
position in it? Didn't we agree (or at least I don't remember objections
to Tom's suggestion) that the backup will not save log files at all and that
this will be the task of the log archiving procedure? Even if we are going to
reconsider this approach, I would just save the required portion of the
log at *this moment* and do that space optimization *later*.

Vadim
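
The simpler alternative argued for here, recording where the checkpoint landed rather than forcing a file boundary, amounts to storing a (log file, offset) pair with the backup. A sketch follows, with invented LogPosition and BackupLabel types and integer positions standing in for whatever pg_control actually stores. It is simplified: it treats the saved checkpoint position itself as the point from which replay begins.

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class LogPosition:
    log_file: int    # which log file the checkpoint record is in
    offset: int      # byte offset of the checkpoint record in that file

@dataclass(frozen=True)
class BackupLabel:
    start_checkpoint: LogPosition

    def must_replay(self, record_pos):
        """Records at or after the saved checkpoint are replayed on restore."""
        return record_pos >= self.start_checkpoint

label = BackupLabel(LogPosition(log_file=7, offset=4096))
print(label.must_replay(LogPosition(7, 100)))    # False: before the checkpoint
print(label.must_replay(LogPosition(7, 8192)))   # True
print(label.must_replay(LogPosition(8, 0)))      # True: later log file
```

Because positions compare first by file and then by offset, no file boundary has to coincide with the start of the backup.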

#59J. R. Nield
jrnield@usol.com
In reply to: Tom Lane (#57)
Re: PITR, checkpoint, and local relations

On Wed, 2002-08-07 at 23:41, Tom Lane wrote:

"J. R. Nield" <jrnield@usol.com> writes:

The xlog code must allow us to force an advance to the next log file,
and truncate the archived file when it's copied so as not to waste
space.

Uh, why? Why not just force a checkpoint and remember the exact
location of the checkpoint within the current log file?

If I do a backup with PITR and save it to tape, I need to be able to
restore it even if my machine is destroyed in a fire, and all the logs
since the end of a backup are destroyed. If we don't allow the user to
force a log advance, how will he do this? I don't want to copy the log
file, and then have the original be written to later, because it will
become confusing as to which log file to use.

Is the complexity really that big of a problem with this?

When and if you roll back to a prior checkpoint, you'd want to start the
system running forward with a new xlog file, I think (compare what
pg_resetxlog does). But it doesn't follow that you MUST force an xlog
file boundary simply because you're taking a backup.

This complicates both the recovery logic and XLogInsert, and I'm trying
to kill the "last" latent bug in that feature now.

Indeed. How about keeping it simple, instead?

regards, tom lane

--
J. R. Nield
jrnield@usol.com

#60Tom Lane
tgl@sss.pgh.pa.us
In reply to: J. R. Nield (#59)
Re: PITR, checkpoint, and local relations

"J. R. Nield" <jrnield@usol.com> writes:

Uh, why? Why not just force a checkpoint and remember the exact
location of the checkpoint within the current log file?

If I do a backup with PITR and save it to tape, I need to be able to
restore it even if my machine is destroyed in a fire, and all the logs
since the end of a backup are destroyed.

And for your next trick, restore it even if the backup tape itself is
destroyed. C'mon, be a little reasonable here. The backups and the
log archive tapes are *both* critical data in any realistic view of
the world.

Is the complexity really that big of a problem with this?

Yes, it is. Didn't you just admit to struggling with bugs introduced
by exactly this complexity?? I don't care *how* spiffy the backup
scheme is, if when push comes to shove my backup doesn't restore because
there was a software bug in the backup scheme. In this context there
simply is not any virtue greater than "simple and reliable".

regards, tom lane

#61Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#60)
Re: PITR, checkpoint, and local relations

Tom Lane wrote:

"J. R. Nield" <jrnield@usol.com> writes:

Uh, why? Why not just force a checkpoint and remember the exact
location of the checkpoint within the current log file?

If I do a backup with PITR and save it to tape, I need to be able to
restore it even if my machine is destroyed in a fire, and all the logs
since the end of a backup are destroyed.

And for your next trick, restore it even if the backup tape itself is
destroyed. C'mon, be a little reasonable here. The backups and the
log archive tapes are *both* critical data in any realistic view of
the world.

Tom, just because he doesn't agree with you doesn't mean he is
unreasonable.

I think it is an admirable goal to allow the PITR backup to restore a
consistent copy of the database _without_ needing the logs. In fact, I
consider something that _needs_ the logs to restore to a consistent
state to be broken.

If you are doing offsite backup, which people should be doing, requiring
the log tape for restore means you have to recycle the log tape _after_
the PITR backup, and to restore to a point in the future, you need two
log tapes, one that was done during the backup, and another current.

If you can restore the PITR backup without a log tape, you can take just
the PITR backup tape off site _and_ you can recycle the log tape _before_
the PITR backup, meaning you only need one tape for a restore to a point
in the future. I think there are good reasons to have the PITR backup be
restorable on its own, if possible.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073