WAL logging freezing

Started by Heikki Linnakangas · over 19 years ago · 45 messages · pgsql-hackers
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com

Here's a patch for WAL logging tuple freezes in vacuum, per discussion
on pgsql-bugs.

This patch is against CVS head. Should this be backported to stable
branches? I think it should.

After writing the patch, I realized that backporting it needs some
thought, because the WAL records for removing tuples and for freezing
tuples share the same heapam opcode, XLOG_HEAP_CLEAN, and are only
differentiated by a flag. If we applied the patch as it is, and someone
then replayed WAL generated by a newer, patched version on an older,
unpatched version, the older version would interpret the freeze records
as dead-tuple removals and remove live tuples. I would have liked to
give freezing a new opcode, but we've run out of them (see htup.h).
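
For illustration only, here is a minimal, self-contained sketch of the
hazard just described, using hypothetical type and field names rather than
the real heapam WAL structures: both operations share one opcode, and only
a flag in the record body says whether it is a prune or a freeze, so an
unpatched server replaying a patched server's WAL takes the prune path for
a freeze record.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical record body for the shared opcode; the real
     * xl_heap_clean layout differs, this only models the flag. */
    typedef struct
    {
        unsigned relid;     /* relation affected */
        unsigned block;     /* heap page affected */
        bool     freeze;    /* flag added by the patch: freeze, don't remove */
    } clean_record_sketch;

    /* Replay on a server that knows about the flag. */
    static void replay_patched(const clean_record_sketch *rec)
    {
        if (rec->freeze)
            printf("freeze tuples on rel %u, block %u\n", rec->relid, rec->block);
        else
            printf("remove dead tuples on rel %u, block %u\n", rec->relid, rec->block);
    }

    /* Replay on an older server that predates the flag: it is ignored,
     * so a freeze record is misread as a dead-tuple removal. */
    static void replay_unpatched(const clean_record_sketch *rec)
    {
        printf("remove dead tuples on rel %u, block %u\n", rec->relid, rec->block);
    }

    int main(void)
    {
        clean_record_sketch rec = { 16384, 7, true };
        replay_patched(&rec);     /* freezes, as intended */
        replay_unpatched(&rec);   /* removes live tuples: the backport hazard */
        return 0;
    }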

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

logfreezes.diff (text/plain) +114 -90
#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#1)
Re: WAL logging freezing

Heikki Linnakangas wrote:

Here's a patch for WAL logging tuple freezes in vacuum, per discussion
on pgsql-bugs.

This patch is against CVS head. Should this be backported to stable
branches? I think it should.

Keep in mind that previous releases do not use the same method for
determining the pg_clog cutoff point.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: WAL logging freezing

Heikki Linnakangas <heikki@enterprisedb.com> writes:

I would've liked to give freezing a new opcode,
but we've run out of them (see htup.h).

Hardly ... we have plenty of unused rmgr id's still.

The real issue that still has to be resolved is the interaction of all
this stuff with PITR scenarios --- is it still safe to not log hint-bit
updates when PITR is on? I think it's premature to start writing
patches until we've decided how this really needs to work.

regards, tom lane

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#3)
Re: WAL logging freezing

Tom Lane wrote:

Heikki Linnakangas <heikki@enterprisedb.com> writes:

I would've liked to give freezing a new opcode,
but we've run out of them (see htup.h).

Hardly ... we have plenty of unused rmgr id's still.

Good point.

The real issue that still has to be resolved is the interaction of all
this stuff with PITR scenarios --- is it still safe to not log hint-bit
updates when PITR is on? I think it's premature to start writing
patches until we've decided how this really needs to work.

Not logging hint-bit updates seems safe to me. As long as we have the
clog, the hint-bit is just a hint. The problem with freezing is that
after freezing tuples, the corresponding clog page can go away.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#4)
Re: WAL logging freezing

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

Tom Lane wrote:

I think it's premature to start writing
patches until we've decided how this really needs to work.

Not logging hint-bit updates seems safe to me. As long as we have the
clog, the hint-bit is just a hint. The problem with freezing is that
after freezing tuples, the corresponding clog page can go away.

Actually clog can go away much sooner than that, at least in normal
operation --- that's what datvacuumxid is for, to track where we can
truncate clog. Maybe it's OK to say that during WAL replay we keep it
all the way back to the freeze horizon, but I'm not sure how we keep the
system from wiping clog it still needs right after switching to normal
operation. Maybe we should somehow not xlog updates of datvacuumxid?

Another thing I'm concerned about is the scenario where a PITR
hot-standby machine tracks a master over a period of more than 4 billion
transactions. I'm not sure what will happen in the slave's pg_clog
directory, but I'm afraid it won't be good :-(

regards, tom lane

#6Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#5)
Re: [HACKERS] WAL logging freezing

On Fri, 2006-10-27 at 12:01 -0400, Tom Lane wrote:

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

Tom Lane wrote:

I think it's premature to start writing
patches until we've decided how this really needs to work.

Not logging hint-bit updates seems safe to me. As long as we have the
clog, the hint-bit is just a hint. The problem with freezing is that
after freezing tuples, the corresponding clog page can go away.

Actually clog can go away much sooner than that, at least in normal
operation --- that's what datvacuumxid is for, to track where we can
truncate clog.

So we definitely have a nasty problem here.

VACUUM FREEZE is just a loaded gun right now.

Maybe it's OK to say that during WAL replay we keep it
all the way back to the freeze horizon, but I'm not sure how we keep the
system from wiping clog it still needs right after switching to normal
operation. Maybe we should somehow not xlog updates of datvacuumxid?

Thinking...

Also, we should probably be setting all the hint bits for pages during
recovery then, so we don't need to rewrite them later.

Another thing I'm concerned about is the scenario where a PITR
hot-standby machine tracks a master over a period of more than 4 billion
transactions. I'm not sure what will happen in the slave's pg_clog
directory, but I'm afraid it won't be good :-(

I think we'll need to error out at that point, plus produce messages
once we pass 2 billion transactions recovered. It makes sense to produce
a new base backup regularly anyway.

We'll also need to produce an error message on the primary server so
that we take a new base backup every 2 billion transactions.

There are better solutions, but I'm not sure it makes sense to try to
fix them right now, since that could well delay the release. If we think
it is a necessary fix for the 8.2 line then we could get a better fix
into 8.2.1.

[I've just coded the relcache invalidation WAL logging patch also.]

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#6)
Re: [HACKERS] WAL logging freezing

"Simon Riggs" <simon@2ndquadrant.com> writes:

[I've just coded the relcache invalidation WAL logging patch also.]

What? That doesn't make any sense to me.

regards, tom lane

#8Simon Riggs
simon@2ndQuadrant.com
In reply to: Simon Riggs (#6)
Re: [HACKERS] WAL logging freezing

On Fri, 2006-10-27 at 22:19 +0100, Simon Riggs wrote:

So we definitely have a nasty problem here.

VACUUM FREEZE is just a loaded gun right now.

Maybe it's OK to say that during WAL replay we keep it
all the way back to the freeze horizon, but I'm not sure how we keep the
system from wiping clog it still needs right after switching to normal
operation. Maybe we should somehow not xlog updates of datvacuumxid?

Thinking...

Suggestions:

1. Create a new Utility rmgr that can issue XLOG_UTIL_FREEZE messages
for each block that has had any tuples frozen on it during normal
VACUUMs. We need to log only the relid, blockid, and vacuum's xid to
redo the freeze operation (see the sketch after this list).

2. VACUUM FREEZE need not generate any additional WAL records, but will
do an immediate sync following execution and before clog truncation.
That way the large number of changed blocks will all reach disk before
we do the updates to the catalog.

3. We don't truncate the clog during WAL replay, so the clog will grow
during recovery. Nothing to do there to make things safe.

4. When InArchiveRecovery we should set all of the datminxid and
datvacuumxid fields to be the Xid from where recovery started, so that
clog is not truncated soon after recovery. Performing a VACUUM FREEZE
after a recovery would be mentioned as an optional task at the end of a
PITR recovery on a failover/second server.

5. At 3.5 billion records during recovery we should halt the replay, do
a full database scan to set hint bits, truncate clog, then restart
replay. (Automatically within the recovery process).

6. During WAL replay, put out a warning message every 1 billion rows
saying that a hint bit scan will eventually be required if recovery
continues.
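
As an illustration of point 1 above, a minimal sketch of what such a
record might carry, with hypothetical names; the real rmgr registration
and redo plumbing are omitted:

    /* Hypothetical XLOG_UTIL_FREEZE record body: just enough to redo
     * "freeze every qualifying tuple on this page" during replay. */
    typedef struct
    {
        unsigned relid;        /* relation whose page had tuples frozen */
        unsigned block;        /* block number within that relation */
        unsigned cutoff_xid;   /* vacuum's xid/cutoff used for the freeze */
    } util_freeze_record_sketch;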

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#9Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#5)
Re: WAL logging freezing

Tom Lane wrote:

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

Tom Lane wrote:

I think it's premature to start writing
patches until we've decided how this really needs to work.

Not logging hint-bit updates seems safe to me. As long as we have the
clog, the hint-bit is just a hint. The problem with freezing is that
after freezing tuples, the corresponding clog page can go away.

Actually clog can go away much sooner than that, at least in normal
operation --- that's what datvacuumxid is for, to track where we can
truncate clog. Maybe it's OK to say that during WAL replay we keep it
all the way back to the freeze horizon, but I'm not sure how we keep the
system from wiping clog it still needs right after switching to normal
operation. Maybe we should somehow not xlog updates of datvacuumxid?

Oh, I just understood what you were trying to say. We really don't keep
clog around all the way back to the freeze horizon.

Well, how about doing just that? With a FreezeLimit of 1 billion
transactions, that's 128 megabytes of clog. We could freeze more
aggressively to make it less.

We can't just not xlog updates of datvacuumxid, because that buffer
might still be written to disk before all the hint-bit updates. One
trick would be to include the old value in the WAL record. WAL recovery
would have to keep track of those records, and reset any datvacuumxid to
the old value if there's no checkpoint record afterwards.
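
A minimal sketch of the trick being suggested here, with hypothetical
names: the record carries both the old and the new value, so replay can
fall back to the old one when no later checkpoint confirms that the
corresponding hint bits reached disk.

    #include <stdbool.h>

    /* Hypothetical WAL record body for a datvacuumxid update. */
    typedef struct
    {
        unsigned dboid;           /* database whose datvacuumxid is updated */
        unsigned old_vacuumxid;   /* value before the update */
        unsigned new_vacuumxid;   /* value VACUUM wants to install */
    } vacuumxid_update_sketch;

    /* During replay, the new value is only trusted if a checkpoint record
     * was seen after this update; otherwise the hint bits it depends on may
     * never have reached disk, so we reset to the old value. */
    static unsigned
    vacuumxid_to_apply(const vacuumxid_update_sketch *rec, bool checkpoint_seen_after)
    {
        return checkpoint_seen_after ? rec->new_vacuumxid : rec->old_vacuumxid;
    }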

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#5)
Re: WAL logging freezing

We just discussed this in detail with Simon, and it looks like we have
5 (!) different but related problems:

1) The original problem of freeze then crash, leaving too high values in
relminxid and datminxid. If you then run vacuum, it might truncate CLOG
and you lose the commit status of the records that were supposed to be
frozen.

To fix this, we need to WAL log freezing as already discussed.

2) vactuple_get_minxid doesn't take into account xmax's of tuples that
have HEAP_XMAX_INVALID set. That's a problem:

transaction 1001 - BEGIN; DELETE FROM foo where key = 1;
transaction 1001 - ROLLBACK;
transaction 1002 - VACUUM foo;
crash

VACUUM foo will set relminxid to 1002, because HEAP_XMAX_INVALID was set
(possibly by vacuum itself) on the tuple touched by the rolled-back
deletion. However, that hint-bit update hasn't hit the disk yet, so
after recovery the tuple will have an xmax of 1001 with no hint bit,
while relminxid is 1002.

The simplest fix for this issue is to ignore the HEAP_XMAX_INVALID hint
bit, and take any xmax other than InvalidXid into account when
calculating the relminxid (a sketch of this rule follows the list of
problems below).

3) If you recover from a PITR backup (or have a hot stand-by), with base
backup that's more than 4 billion transactions older than the newest WAL
record, the clog entries of old transactions in the base backup will
overlap with the clog entries of new transactions that are in the WAL
records. This is the problem you also pointed out below.

To fix this, we need to emit a WAL record when truncating the clog. We
must also make sure that recovery of any WAL record type doesn't rely on
clog, because if we truncate the clog and then crash, recovery won't
have the clog available for the old transactions. At the moment,
TruncateCLOG issues a checkpoint to protect from that, but that's not
going to work when rolling forward logs in PITR, right?

4) If we fix issue 2 so that vactuple_get_minxid always takes xmax into
account, even if HEAP_XMAX_INVALID is set, a tuple with an aborted xmax
will keep us from advancing relminxid and truncating clog etc. That
doesn't lead to data corruption, but you will eventually hit the
transaction wrap-around limit. We don't have the same problem with xmin,
because we freeze tuples that are older than FreezeLimit to avoid it,
but we don't do that for xmax.

To fix this, replace any xmax older than FreezeLimit with InvalidXid
during vacuum. That also needs to be WAL-logged (also covered by the
sketch below).

5) We don't freeze tuples that are in RECENTLY_DEAD or
DELETE_IN_PROGRESS state. That doesn't lead to data corruption, but it
might make you hit the transaction wrap-around limit. That can happen if
you have a transaction that deletes or updates a very old, but not yet
frozen tuple. If you run vacuum while the deleting transaction is in
progress, vacuum won't freeze the tuple, and won't advance the
wrap-around limit because of the old tuple. That's not serious if the
deleting transaction commits, because the next vacuum will then remove
the tuple, but if it aborts, we might run into the same problem on the
next vacuum, and the next one, and the next one, until we reach the
wrap-around.

To fix this, simply do the freezing for tuples in RECENTLY_DEAD and
DELETE_IN_PROGRESS states as well.
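
To make the xmax handling in fixes 2 and 4 concrete, here is a small
self-contained sketch with hypothetical names; real code would use
PostgreSQL's wraparound-aware xid comparisons and would WAL-log the xmax
clearing, as noted above.

    /* Stand-in for InvalidXid in this sketch. */
    #define INVALID_XID 0u

    /* Fix 2: fold one tuple's xids into the relation's running minimum.
     * xmax counts whenever it is a real xid, even if the HEAP_XMAX_INVALID
     * hint bit happens to be set in memory. */
    static unsigned
    fold_tuple_minxid(unsigned minxid, unsigned xmin, unsigned xmax)
    {
        if (xmin != INVALID_XID && xmin < minxid)
            minxid = xmin;
        if (xmax != INVALID_XID && xmax < minxid)
            minxid = xmax;
        return minxid;
    }

    /* Fix 4: an xmax older than the freeze limit on a still-visible tuple
     * can only belong to an aborted deleter, so clear it; that keeps old
     * aborted xmaxes from holding back relminxid forever. */
    static unsigned
    freeze_xmax(unsigned xmax, unsigned freeze_limit)
    {
        if (xmax != INVALID_XID && xmax < freeze_limit)
            return INVALID_XID;
        return xmax;
    }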

Am I missing something? Finding this many bugs makes me nervous...

Simon volunteered to make the clog changes for 3 because it's a PITR
related issue. I can write a patch/patches for the other changes if it
helps.

Tom Lane wrote:

"Heikki Linnakangas" <heikki@enterprisedb.com> writes:

Tom Lane wrote:

I think it's premature to start writing
patches until we've decided how this really needs to work.

Not logging hint-bit updates seems safe to me. As long as we have the
clog, the hint-bit is just a hint. The problem with freezing is that
after freezing tuples, the corresponding clog page can go away.

Actually clog can go away much sooner than that, at least in normal
operation --- that's what datvacuumxid is for, to track where we can
truncate clog. Maybe it's OK to say that during WAL replay we keep it
all the way back to the freeze horizon, but I'm not sure how we keep the
system from wiping clog it still needs right after switching to normal
operation. Maybe we should somehow not xlog updates of datvacuumxid?

Another thing I'm concerned about is the scenario where a PITR
hot-standby machine tracks a master over a period of more than 4 billion
transactions. I'm not sure what will happen in the slave's pg_clog
directory, but I'm afraid it won't be good :-(

regards, tom lane

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#10)
Re: WAL logging freezing

Heikki Linnakangas wrote:

We just discussed this in detail with Simon, and it looks like we have
5 (!) different but related problems:

Wow, four of them are mine :-(

2) vactuple_get_minxid doesn't take into account xmax's of tuples that
have HEAP_XMAX_INVALID set. That's a problem:

transaction 1001 - BEGIN; DELETE FROM foo where key = 1;
transaction 1001 - ROLLBACK;
transaction 1002 - VACUUM foo;
crash

VACUUM foo will set relminxid to 1002, because HEAP_XMAX_INVALID was set
(possibly by vacuum itself) on the tuple touched by the rolled-back
deletion. However, that hint-bit update hasn't hit the disk yet, so
after recovery the tuple will have an xmax of 1001 with no hint bit,
while relminxid is 1002.

The simplest fix for this issue is to ignore the HEAP_XMAX_INVALID hint
bit, and take any xmax other than InvalidXid into account when
calculating the relminxid.

Ugh. Is there another solution to this? Say, sync the buffer so that
the hint bits are written to disk? The bug (4) below is problematic if
you take this approach; basically it removes all the optimization won by
the relminxid patch.

Simon volunteered to make the clog changes for 3 because it's a PITR
related issue. I can write a patch/patches for the other changes if it
helps.

I'm swamped at the moment, so I'd appreciate it.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#12Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#11)
Re: WAL logging freezing

Alvaro Herrera <alvherre@commandprompt.com> writes:

Ugh. Is there another solution to this? Say, sync the buffer so that
the hint bits are written to disk?

Yeah. The original design for all this is explained by the notes for
TruncateCLOG:

* When this is called, we know that the database logically contains no
* reference to transaction IDs older than oldestXact. However, we must
* not truncate the CLOG until we have performed a checkpoint, to ensure
* that no such references remain on disk either; else a crash just after
* the truncation might leave us with a problem.

The pre-8.2 coding is actually perfectly safe within a single database,
because TruncateCLOG is only called at the end of a database-wide
vacuum, and so the checkpoint is guaranteed to have flushed valid hint
bits for all tuples to disk. There is a risk in other databases though.
I think that in the 8.2 structure the equivalent notion must be that
VACUUM has to flush and fsync a table before it can advance the table's
relminxid.

That still leaves us with the problem of hint bits not being updated
during WAL replay. I think the best solution for this is for WAL replay
to force relvacuumxid to equal relminxid (btw, these field names seem
poorly chosen, and the comment in catalogs.sgml isn't self-explanatory...)
rather than adopting the value shown in the WAL record. This probably
is best done by abandoning the generic "overwrite tuple" WAL record type
in favor of something specific to minxid updates. The effect would then
be that a PITR slave would not truncate its clog beyond the freeze
horizon until it had performed a vacuum of its own.
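
A sketch of the replay rule being proposed, with hypothetical field
names: on replay of a minxid-update record the slave clamps relvacuumxid
to relminxid rather than adopting the master's value, so it cannot
truncate clog past the freeze horizon until it has vacuumed for itself.

    /* Hypothetical pg_class xid fields, as discussed above. */
    typedef struct
    {
        unsigned relminxid;     /* oldest unfrozen xid in the relation */
        unsigned relvacuumxid;  /* horizon up to which clog may be truncated */
    } rel_xids_sketch;

    /* Redo sketch: take relminxid from the WAL record, but do not trust the
     * recorded relvacuumxid, since the hint bits it promises were never
     * WAL-logged and may not exist on the slave's disks. */
    static void
    redo_minxid_update(rel_xids_sketch *rel, unsigned wal_relminxid)
    {
        rel->relminxid    = wal_relminxid;
        rel->relvacuumxid = wal_relminxid;   /* clamp: no clog truncation past this */
    }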

The point about aborted xmax being a risk factor is a good one. I don't
think the risk is material for ordinary crash recovery scenarios,
because ordinarily we'd have many opportunities to set the hint bit
before anything really breaks, but it's definitely an issue for
long-term PITR replay scenarios.

I'll work on this as soon as I get done with the btree-index issue I'm
messing with now.

regards, tom lane

#13Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#12)
Re: [HACKERS] WAL logging freezing

On Mon, 2006-10-30 at 12:05 -0500, Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Ugh. Is there another solution to this? Say, sync the buffer so that
the hint bits are written to disk?

Yeah. The original design for all this is explained by the notes for
TruncateCLOG:

* When this is called, we know that the database logically contains no
* reference to transaction IDs older than oldestXact. However, we must
* not truncate the CLOG until we have performed a checkpoint, to ensure
* that no such references remain on disk either; else a crash just after
* the truncation might leave us with a problem.

The pre-8.2 coding is actually perfectly safe within a single database,
because TruncateCLOG is only called at the end of a database-wide
vacuum, and so the checkpoint is guaranteed to have flushed valid hint
bits for all tuples to disk. There is a risk in other databases though.
I think that in the 8.2 structure the equivalent notion must be that
VACUUM has to flush and fsync a table before it can advance the table's
relminxid.

Ouch! We did discuss that also. Flushing the buffer cache is nasty with
very large caches, so this makes autovacuum much less friendly, and it
could take a seriously long time if you enforce the vacuum cost delay
settings.

ISTM we only need to flush iff the clog would be truncated when we
update relminxid. Otherwise we are safe to update even if we crash,
since the clog will not have been truncated.

That still leaves us with the problem of hint bits not being updated
during WAL replay. I think the best solution for this is for WAL replay
to force relvacuumxid to equal relminxid (btw, these field names seem
poorly chosen, and the comment in catalogs.sgml isn't self-explanatory...)
rather than adopting the value shown in the WAL record. This probably
is best done by abandoning the generic "overwrite tuple" WAL record type
in favor of something specific to minxid updates. The effect would then
be that a PITR slave would not truncate its clog beyond the freeze
horizon until it had performed a vacuum of its own.

Sounds good. Methinks we do still need the TruncateCLOG patch to ensure
we do WAL replay for the truncation? I'm posting that now to -patches as
a prototype.
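
For illustration, a minimal sketch of what WAL-logging the truncation
might look like, with hypothetical names; the attached prototype patch is
the authoritative version. The record only needs to say how far clog was
cut, and redo repeats the cut on the standby.

    #include <stdio.h>

    /* Hypothetical clog-truncation WAL record body. */
    typedef struct
    {
        unsigned truncate_before_xid;   /* clog for xids older than this may go */
    } clog_truncate_record_sketch;

    /* Redo sketch: repeat the truncation, so the standby's pg_clog does not
     * grow without bound and stays consistent with the primary. */
    static void
    redo_clog_truncate(const clog_truncate_record_sketch *rec)
    {
        printf("truncate pg_clog segments wholly before xid %u\n",
               rec->truncate_before_xid);
    }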

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Attachments:

xlog_clog_truncate.patch (text/x-patch; charset=UTF-8) +36 -17
#14Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Simon Riggs (#13)
Re: [HACKERS] WAL logging freezing

Simon Riggs wrote:

On Mon, 2006-10-30 at 12:05 -0500, Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Ugh. Is there another solution to this? Say, sync the buffer so that
the hint bits are written to disk?

Yeah. The original design for all this is explained by the notes for
TruncateCLOG:

* When this is called, we know that the database logically contains no
* reference to transaction IDs older than oldestXact. However, we must
* not truncate the CLOG until we have performed a checkpoint, to ensure
* that no such references remain on disk either; else a crash just after
* the truncation might leave us with a problem.

The pre-8.2 coding is actually perfectly safe within a single database,
because TruncateCLOG is only called at the end of a database-wide
vacuum, and so the checkpoint is guaranteed to have flushed valid hint
bits for all tuples to disk. There is a risk in other databases though.
I think that in the 8.2 structure the equivalent notion must be that
VACUUM has to flush and fsync a table before it can advance the table's
relminxid.

Ouch! We did discuss that also. Flushing the buffer cache is nasty with
very large caches, so this makes autovacuum much less friendly, and it
could take a seriously long time if you enforce the vacuum cost delay
settings.

ISTM we only need to flush iff the clog would be truncated when we
update relminxid. Otherwise we are safe to update even if we crash,
since the clog will not have been truncated.

I don't understand. When clog is actually going to be truncated, if
it's determined that there's any page that can be truncated, then a
checkpoint is forced. If no page is going to be removed then there's no
checkpoint, which makes a lot of sense and of course avoids the problem
of useless flushes.

In fact I don't understand what's the point about multiple databases vs.
a single database. Surely a checkpoint would flush all buffers in all
databases, no? This would flush all hint bits, everywhere. So this bug
does not really exist.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#13)
Re: [HACKERS] WAL logging freezing

"Simon Riggs" <simon@2ndquadrant.com> writes:

ISTM we only need to flush iff the clog would be truncated when we
update relminxid.

Wrong :-( If the relvacuumxid change (not relminxid ... as I said, these
names aren't very transparent) makes it to disk but not all the hint
bits do, you're at risk. Crash, restart, vacuum some other table, and
*now* the global min vacuumxid advances. The fact that we're
WAL-logging the relvacuumxid change makes this scenario exceedingly
probable, if no action is taken to force out the hint bits.

The only alternative I can see is the one Heikki suggested: don't
truncate clog until the freeze horizon. That's safe (given the planned
change to WAL-log tuple freezing) and clean and simple, but a permanent
requirement of 250MB+ for pg_clog would put the final nail in the coffin
of PG's usability in small-disk-footprint environments. So I don't like
it much. I suppose it could be made more tolerable by reducing the
freeze horizon, say to 100M instead of 1G transactions. Anyone for a
GUC parameter? In a high-volume DB you'd want the larger setting to
minimize the amount of tuple freezing work. OTOH it seems like making
this configurable creates a nasty risk for PITR situations: a slave
that's configured with a smaller freeze window than the master is
probably not safe.
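
As a rough check of the space figures in this discussion: pg_clog stores
two status bits per transaction, i.e. four transactions per byte, so the
permanent footprint scales directly with the freeze horizon. A quick
calculation (decimal megabytes, assuming nothing beyond that
2-bits-per-xact layout):

    #include <stdio.h>

    int main(void)
    {
        const double xacts_per_byte = 4.0;    /* 2 status bits per transaction */

        /* ~250 MB kept back to a 1-billion-transaction freeze horizon ... */
        printf("1G-xact horizon:   %.0f MB\n", 1e9 / xacts_per_byte / 1e6);
        /* ... versus ~25 MB for a 100-million-transaction horizon. */
        printf("100M-xact horizon: %.0f MB\n", 1e8 / xacts_per_byte / 1e6);
        return 0;
    }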

regards, tom lane

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#14)
Re: [HACKERS] WAL logging freezing

Alvaro Herrera <alvherre@commandprompt.com> writes:

In fact I don't understand what's the point about multiple databases vs.
a single database. Surely a checkpoint would flush all buffers in all
databases, no?

Yeah --- all the ones that are dirty *now*. Consider the case where you
vacuum DB X, update its datvacuumxid, and don't checkpoint because the
global min didn't advance. Now you crash, possibly leaving some hint
bits unwritten; but the datvacuumxid change did make it to disk. After
restart, vacuum DB Y, update its datvacuumxid, and find that the global
min *did* advance. You checkpoint, and that guarantees that DB Y is
clean for the clog truncation. But DB X isn't.

The 8.2 changes have created the equivalent risk at the level of each
individual table. We can't write a vacuumxid change unless we are sure
that the hint-bit changes it promises are actually down to disk.

regards, tom lane

#17Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#15)
Re: [HACKERS] WAL logging freezing

On Mon, 2006-10-30 at 16:58 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

ISTM we only need to flush iff the clog would be truncated when we
update relminxid.

Wrong :-( If the relvacuumxid change (not relminxid ... as I said, these
names aren't very transparent) makes it to disk but not all the hint
bits do, you're at risk. Crash, restart, vacuum some other table, and
*now* the global min vacuumxid advances. The fact that we're
WAL-logging the relvacuumxid change makes this scenario exceedingly
probable, if no action is taken to force out the hint bits.

I don't agree: If the truncation points are at 1 million, 2 million etc,
then if we advance the relvacuumxid from 1.2 million to 1.5 million,
then crash, the hint bits for that last vacuum are lost. Sounds bad,
but we have not truncated clog, so there is no danger. In order to
truncate up to 2 million we need to re-vacuum; at that point we discover
that the 1.5 million setting was wrong, realise it should have been 1.2
million but don't care because we now set it to 1.8 million etc. No
problem, even with repeated crashes. We only flush when we move the
counter past a truncation point.

If you look at this another way, maybe you'll see what I'm saying: only
update relvacuumxid iff the update would allow us to truncate the clog.
That way we leap forwards in 1-million-xid chunks, rounded down. No
change to clog => no update => no flush needed to stay safe.
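
For reference, the roughly one-million-transaction spacing of those
truncation points follows from the clog layout, since truncation removes
whole SLRU segments; a quick calculation, assuming the standard 8 kB block
size and 32 pages per segment:

    #include <stdio.h>

    int main(void)
    {
        const long page_bytes        = 8192;  /* one clog page */
        const long xacts_per_byte    = 4;     /* 2 status bits per transaction */
        const long pages_per_segment = 32;    /* SLRU segment size */

        /* 8192 * 4 * 32 = 1,048,576 transactions per removable segment. */
        printf("xacts per clog segment: %ld\n",
               page_bytes * xacts_per_byte * pages_per_segment);
        return 0;
    }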

The only alternative I can see is the one Heikki suggested: don't
truncate clog until the freeze horizon. That's safe (given the planned
change to WAL-log tuple freezing) and clean and simple, but a permanent
requirement of 250MB+ for pg_clog would put the final nail in the coffin
of PG's usability in small-disk-footprint environments. So I don't like
it much. I suppose it could be made more tolerable by reducing the
freeze horizon, say to 100M instead of 1G transactions. Anyone for a
GUC parameter? In a high-volume DB you'd want the larger setting to
minimize the amount of tuple freezing work. OTOH it seems like making
this configurable creates a nasty risk for PITR situations: a slave
that's configured with a smaller freeze window than the master is
probably not safe.

If we need to, just put the CLOG seg size in pg_config_manual.h

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#17)
Re: [HACKERS] WAL logging freezing

"Simon Riggs" <simon@2ndquadrant.com> writes:

I don't agree: If the truncation points are at 1 million, 2 million etc,
then if we advance the relvacuumxid from 1.2 million to 1.5 million,
then crash, the hint bits for that last vacuum are lost. Sounds bad,
but we have not truncated clog, so there is no danger.

You're still wrong though. Suppose that VACUUM moves a particular rel's
relvacuumxid from 1.9 to 2.1 million, but because this rel is not
currently the oldest vacuumxid, it doesn't truncate clog. Then we crash
and lose hint bits, but not the relvacuumxid change. Then VACUUM
vacuums some other rel and advances its relvacuumxid from 1.9 to 2.1
million --- but this time that *was* the globally oldest value, and now
we think we can truncate clog at 2 million. But the first rel might
still have some unhinted xids around 1.9 million.

If you look at this another way, maybe you'll see what I'm saying: Only
update relvacuumxid iff the update would allow us to truncate the clog.

Then you'll never update it at all, because there will always be some
other rel constraining the global min.

regards, tom lane

#19Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#18)
Re: [HACKERS] WAL logging freezing

On Mon, 2006-10-30 at 19:18 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

I don't agree: If the truncation points are at 1 million, 2 million etc,
then if we advance the relvacuumxid from 1.2 million to 1.5 million,
then crash, the hint bits for that last vacuum are lost. Sounds bad,
but we have not truncated clog, so there is no danger.

You're still wrong though.

Frequently, I'd say :-)

Suppose that VACUUM moves a particular rel's
relvacuumxid from 1.9 to 2.1 million, but because this rel is not
currently the oldest vacuumxid, it doesn't truncate clog. Then we crash
and lose hint bits, but not the relvacuumxid change. Then VACUUM
vacuums some other rel and advances its relvacuumxid from 1.9 to 2.1
million --- but this time that *was* the globally oldest value, and now
we think we can truncate clog at 2 million. But the first rel might
still have some unhinted xids around 1.9 million.

That was understood; in the above example I agree you need to flush. If
you don't pass a truncation point, you don't need to flush whether or
not you actually truncate. So we don't need to flush *every* time, and
IMHO we don't need to play it safe by keeping clog the size of an iceberg.

Anyway, if PITR is safe again, I'd like to sleep....zzzzzzzzzzzzzz

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#19)
Re: [HACKERS] WAL logging freezing

"Simon Riggs" <simon@2ndquadrant.com> writes:

That was understood; in the above example I agree you need to flush. If
you don't pass a truncation point, you don't need to flush whether or
not you actually truncate. So we don't need to flush *every* time,

OK, but does that actually do much of anything for your performance
complaint? Just after GlobalXmin has passed a truncation point, *every*
vacuum the system does will start performing a flush-n-fsync, which
seems like exactly what you didn't like. If the syncs were spread out
in time for different rels then maybe this idea would help, but AFAICS
they won't be.

regards, tom lane

#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#15)
#22Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#20)
#23Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#21)
#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#21)
#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
#26Bruce Momjian
bruce@momjian.us
In reply to: Tom Lane (#24)
#27Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Bruce Momjian (#26)
#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#24)
#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#27)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#28)
#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#24)
#32Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#31)
#33Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#31)
#34Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Zeugswetter Andreas SB SD (#33)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#31)
#36Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#35)
#37Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#35)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#36)
#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#37)
#40Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#38)
#41Simon Riggs
simon@2ndQuadrant.com
In reply to: Tom Lane (#40)
#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#41)
#43Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#41)
#44Simon Riggs
simon@2ndQuadrant.com
In reply to: Bruce Momjian (#43)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#44)