Unexpected VACUUM FULL failure

Started by Tom Laneover 18 years ago19 messages
#1Tom Lane
tgl@sss.pgh.pa.us

This is a bit disturbing:

*** ./expected/vacuum.out	Sat Jul 20 00:58:14 2002
--- ./results/vacuum.out	Wed Aug  8 20:16:45 2007
***************
*** 50,55 ****
--- 50,56 ----

DELETE FROM vactst WHERE i != 0;
VACUUM FULL vactst;
+ ERROR: HEAP_MOVED_OFF was expected
DELETE FROM vactst;
SELECT * FROM vactst;
i

======================================================================

This is today's CVS HEAD, plus some startup/shutdown logic changes in
postmaster.c that hardly seem like they could be related.

I couldn't reproduce it in a few tries. A reasonable guess is that
it's triggered by autovacuum deciding to vacuum the table sometime
before the VACUUM FULL starts. Anyone want to try to reproduce it?

regards, tom lane

#2Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#1)
Re: Unexpected VACUUM FULL failure

Tom Lane wrote:

This is a bit disturbing:

*** ./expected/vacuum.out	Sat Jul 20 00:58:14 2002
--- ./results/vacuum.out	Wed Aug  8 20:16:45 2007
***************
*** 50,55 ****
--- 50,56 ----

DELETE FROM vactst WHERE i != 0;
VACUUM FULL vactst;
+ ERROR: HEAP_MOVED_OFF was expected
DELETE FROM vactst;
SELECT * FROM vactst;
i

======================================================================

This is today's CVS HEAD, plus some startup/shutdown logic changes in
postmaster.c that hardly seem like they could be related.

I couldn't reproduce it in a few tries. A reasonable guess is that
it's triggered by autovacuum deciding to vacuum the table sometime
before the VACUUM FULL starts. Anyone want to try to reproduce it?

Hum, aren't vacuums supposed to be blocked by each other? Maybe this is
about a toast table not being locked enough against concurrent vacuuming
of the main table.

I'm currently away on vacation, which is why I've missed all the stuff
going on here. Sorry for not letting people know.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#2)
Re: Unexpected VACUUM FULL failure

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

I couldn't reproduce it in a few tries. A reasonable guess is that
it's triggered by autovacuum deciding to vacuum the table sometime
before the VACUUM FULL starts. Anyone want to try to reproduce it?

Hum, aren't vacuums supposed to be blocked by each other?

Sure. I'm not thinking it's a case of concurrent vacuums (if it is,
we've got worse problems than anyone imagined), but rather that the
autovac left the table in a state that exposes a bug in the subsequent
VACUUM FULL. Since we've whacked the tqual.c logic around recently,
the problem might actually lie there...

regards, tom lane

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#3)
Re: Unexpected VACUUM FULL failure

I wrote:

... Since we've whacked the tqual.c logic around recently,
the problem might actually lie there...

In fact, I bet this is a result of the async-commit patch. The places
where vacuum.c bleats "HEAP_MOVED_OFF was expected" are all places where
it is looking at a tuple not marked XMIN_COMMITTED; it expects that
after its first pass over the table, *every* tuple is either
XMIN_COMMITTED or one that it moved. Async commit changed tqual.c
so that tuples that are in fact known committed might not get marked
XMIN_COMMITTED right away. The patch tries to prevent this from
happening within VACUUM FULL by means of

/*
* VACUUM FULL assumes that all tuple states are well-known prior to
* moving tuples around --- see comment "known dead" in repair_frag(),
* as well as simplifications in tqual.c. So before we start we must
* ensure that any asynchronously-committed transactions with changes
* against this table have been flushed to disk. It's sufficient to do
* this once after we've acquired AccessExclusiveLock.
*/
XLogAsyncCommitFlush();

but I bet lunch that that's not good enough. I still haven't reproduced
it, but I'm thinking that the inexact bookkeeping that we created for
clog page LSNs allows tuples to not get marked if the right sort of
timing of concurrent transactions happens.

Not sure about the best solution for this.

regards, tom lane

#5Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#4)
1 attachment(s)
Re: Unexpected VACUUM FULL failure

On Wed, 2007-08-08 at 23:23 -0400, Tom Lane wrote:

I wrote:

... Since we've whacked the tqual.c logic around recently,
the problem might actually lie there...

In fact, I bet this is a result of the async-commit patch. The places
where vacuum.c bleats "HEAP_MOVED_OFF was expected" are all places where
it is looking at a tuple not marked XMIN_COMMITTED; it expects that
after its first pass over the table, *every* tuple is either
XMIN_COMMITTED or one that it moved. Async commit changed tqual.c
so that tuples that are in fact known committed might not get marked
XMIN_COMMITTED right away. The patch tries to prevent this from
happening within VACUUM FULL by means of

/*
* VACUUM FULL assumes that all tuple states are well-known prior to
* moving tuples around --- see comment "known dead" in repair_frag(),
* as well as simplifications in tqual.c. So before we start we must
* ensure that any asynchronously-committed transactions with changes
* against this table have been flushed to disk. It's sufficient to do
* this once after we've acquired AccessExclusiveLock.
*/
XLogAsyncCommitFlush();

but I bet lunch that that's not good enough. I still haven't reproduced
it, but I'm thinking that the inexact bookkeeping that we created for
clog page LSNs allows tuples to not get marked if the right sort of
timing of concurrent transactions happens.

Not sure about the best solution for this.

Good hunch. I plugged this hole earlier, but on further inspection I can
see the plug wasn't wide enough. XLogAsyncCommitFlush() is good enough,
but HeapTupleSatisfiesVacuum() still allowed the inexact bookkeeping to
sometimes skip hint bit setting, when executed with concurrent
transactions touching other tables.

ISTM that if we call HeapTupleSatisfiesVacuum() with an additional
boolean parameter, force, we can tell VF to always set the hint bits in
every case, not just HEAP_MOVED_IN and HEAP_MOVED_OUT.

Patch enclosed, but a little crufty. Gotta run now, talk later.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Attachments:

async_commit_bug.v1.patchtext/x-patch; charset=utf-8; name=async_commit_bug.v1.patchDownload
Index: src/backend/commands/vacuum.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/vacuum.c,v
retrieving revision 1.354
diff -c -r1.354 vacuum.c
*** src/backend/commands/vacuum.c	1 Aug 2007 22:45:08 -0000	1.354
--- src/backend/commands/vacuum.c	9 Aug 2007 07:24:41 -0000
***************
*** 1384,1390 ****
  			tuple.t_len = ItemIdGetLength(itemid);
  			ItemPointerSet(&(tuple.t_self), blkno, offnum);
  
! 			switch (HeapTupleSatisfiesVacuum(tuple.t_data, OldestXmin, buf))
  			{
  				case HEAPTUPLE_DEAD:
  					tupgone = true;		/* we can delete the tuple */
--- 1384,1390 ----
  			tuple.t_len = ItemIdGetLength(itemid);
  			ItemPointerSet(&(tuple.t_self), blkno, offnum);
  
! 			switch (HeapTupleSatisfiesVacuumFull(tuple.t_data, OldestXmin, buf))
  			{
  				case HEAPTUPLE_DEAD:
  					tupgone = true;		/* we can delete the tuple */
***************
*** 1998,2004 ****
  						break;
  					}
  					/* must check for DEAD or MOVED_IN tuple, too */
! 					nextTstatus = HeapTupleSatisfiesVacuum(nextTdata,
  														   OldestXmin,
  														   nextBuf);
  					if (nextTstatus == HEAPTUPLE_DEAD ||
--- 1998,2004 ----
  						break;
  					}
  					/* must check for DEAD or MOVED_IN tuple, too */
! 					nextTstatus = HeapTupleSatisfiesVacuumFull(nextTdata,
  														   OldestXmin,
  														   nextBuf);
  					if (nextTstatus == HEAPTUPLE_DEAD ||
Index: src/backend/utils/time/tqual.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/time/tqual.c,v
retrieving revision 1.103
diff -c -r1.103 tqual.c
*** src/backend/utils/time/tqual.c	1 Aug 2007 22:45:09 -0000	1.103
--- src/backend/utils/time/tqual.c	9 Aug 2007 07:24:43 -0000
***************
*** 77,82 ****
--- 77,85 ----
  TransactionId RecentGlobalXmin = InvalidTransactionId;
  
  /* local functions */
+ static HTSV_Result HeapTupleSatisfiesVacuumInternal(HeapTupleHeader tuple, 
+ 						TransactionId OldestXmin, Buffer buffer, bool locked);
+ 
  static bool XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot);
  
  
***************
*** 1045,1054 ****
   * deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
   * still be visible to some open transaction, so we can't remove them,
   * even if we see that the deleting transaction has committed.
   */
  HTSV_Result
! HeapTupleSatisfiesVacuum(HeapTupleHeader tuple, TransactionId OldestXmin,
! 						 Buffer buffer)
  {
  	/*
  	 * Has inserting transaction committed?
--- 1048,1060 ----
   * deleted by XIDs >= OldestXmin are deemed "recently dead"; they might
   * still be visible to some open transaction, so we can't remove them,
   * even if we see that the deleting transaction has committed.
+  *
+  * If the heap we are checking is exclusively locked, we can skip certain checks
+  * because we know that no transactions that wrote data are still active.
   */
  HTSV_Result
! HeapTupleSatisfiesVacuumInternal(HeapTupleHeader tuple, TransactionId OldestXmin,
! 						 Buffer buffer, bool locked)
  {
  	/*
  	 * Has inserting transaction committed?
***************
*** 1106,1112 ****
  		}
  		else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
  			HeapTupleSetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
! 								 HeapTupleHeaderGetXmin(tuple));
  		else
  		{
  			/*
--- 1112,1119 ----
  		}
  		else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
  			HeapTupleSetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
! 								 (locked ? InvalidTransactionId : 
! 											HeapTupleHeaderGetXmin(tuple)));
  		else
  		{
  			/*
***************
*** 1144,1155 ****
  		{
  			if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
  			{
! 				if (MultiXactIdIsRunning(HeapTupleHeaderGetXmax(tuple)))
  					return HEAPTUPLE_LIVE;
  			}
  			else
  			{
! 				if (TransactionIdIsInProgress(HeapTupleHeaderGetXmax(tuple)))
  					return HEAPTUPLE_LIVE;
  			}
  
--- 1151,1162 ----
  		{
  			if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
  			{
! 				if (!locked && MultiXactIdIsRunning(HeapTupleHeaderGetXmax(tuple)))
  					return HEAPTUPLE_LIVE;
  			}
  			else
  			{
! 				if (!locked && TransactionIdIsInProgress(HeapTupleHeaderGetXmax(tuple)))
  					return HEAPTUPLE_LIVE;
  			}
  
***************
*** 1173,1183 ****
  
  	if (!(tuple->t_infomask & HEAP_XMAX_COMMITTED))
  	{
! 		if (TransactionIdIsInProgress(HeapTupleHeaderGetXmax(tuple)))
  			return HEAPTUPLE_DELETE_IN_PROGRESS;
  		else if (TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple)))
  			HeapTupleSetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
! 								 HeapTupleHeaderGetXmax(tuple));
  		else
  		{
  			/*
--- 1180,1191 ----
  
  	if (!(tuple->t_infomask & HEAP_XMAX_COMMITTED))
  	{
! 		if (!locked && TransactionIdIsInProgress(HeapTupleHeaderGetXmax(tuple)))
  			return HEAPTUPLE_DELETE_IN_PROGRESS;
  		else if (TransactionIdDidCommit(HeapTupleHeaderGetXmax(tuple)))
  			HeapTupleSetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
! 								 (locked ? InvalidTransactionId : 
! 											HeapTupleHeaderGetXmax(tuple)));
  		else
  		{
  			/*
***************
*** 1221,1226 ****
--- 1229,1247 ----
  	return HEAPTUPLE_DEAD;
  }
  
+ HTSV_Result
+ HeapTupleSatisfiesVacuum(HeapTupleHeader tuple, TransactionId OldestXmin,
+ 						 Buffer buffer)
+ {
+ 	return HeapTupleSatisfiesVacuumInternal(tuple, OldestXmin, buffer, false);
+ }
+ 
+ HTSV_Result
+ HeapTupleSatisfiesVacuumFull(HeapTupleHeader tuple, TransactionId OldestXmin,
+ 						 Buffer buffer)
+ {
+ 	return HeapTupleSatisfiesVacuumInternal(tuple, OldestXmin, buffer, true);
+ }
  
  /*
   * GetTransactionSnapshot
Index: src/include/utils/tqual.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/utils/tqual.h,v
retrieving revision 1.67
diff -c -r1.67 tqual.h
*** src/include/utils/tqual.h	25 Jul 2007 12:22:54 -0000	1.67
--- src/include/utils/tqual.h	9 Aug 2007 07:24:45 -0000
***************
*** 144,149 ****
--- 144,151 ----
  						 CommandId curcid, Buffer buffer);
  extern HTSV_Result HeapTupleSatisfiesVacuum(HeapTupleHeader tuple,
  						 TransactionId OldestXmin, Buffer buffer);
+ extern HTSV_Result HeapTupleSatisfiesVacuumFull(HeapTupleHeader tuple,
+ 						 TransactionId OldestXmin, Buffer buffer);
  
  extern Snapshot GetTransactionSnapshot(void);
  extern Snapshot GetLatestSnapshot(void);
#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#5)
Re: Unexpected VACUUM FULL failure

"Simon Riggs" <simon@2ndquadrant.com> writes:

Good hunch. I plugged this hole earlier, but on further inspection I can
see the plug wasn't wide enough. XLogAsyncCommitFlush() is good enough,
but HeapTupleSatisfiesVacuum() still allowed the inexact bookkeeping to
sometimes skip hint bit setting, when executed with concurrent
transactions touching other tables.

ISTM that if we call HeapTupleSatisfiesVacuum() with an additional
boolean parameter, force, we can tell VF to always set the hint bits in
every case, not just HEAP_MOVED_IN and HEAP_MOVED_OUT.

Surely this approach is no good: won't it allow hint bits to reach disk
in advance of their transaction?

I think it'd be safer, and a lot less ugly, to recast the tests in
VACUUM FULL. If we make the first pass clear any old MOVED_IN/MOVED_OUT
bits then the last pass can key off those instead of assuming that
XMIN_COMMITTED is set everywhere. Then we'd not need
XLogAsyncCommitFlush, which is a kluge anyway.

regards, tom lane

#7Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#6)
Re: Unexpected VACUUM FULL failure

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

"Simon Riggs" <simon@2ndquadrant.com> writes:

ISTM that if we call HeapTupleSatisfiesVacuum() with an additional
boolean parameter, force, we can tell VF to always set the hint bits in
every case, not just HEAP_MOVED_IN and HEAP_MOVED_OUT.

Surely this approach is no good: won't it allow hint bits to reach disk
in advance of their transaction?

I don't think so since it sounds like he's saying to still sync the log and
VACUUM FULL has an exclusive lock on the table. So any committed (or aborted)
changes it sees in the table must have been committed or aborted before the
log sync.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#8Heikki Linnakangas
heikki@enterprisedb.com
In reply to: Gregory Stark (#7)
Re: Unexpected VACUUM FULL failure

Gregory Stark wrote:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

"Simon Riggs" <simon@2ndquadrant.com> writes:

ISTM that if we call HeapTupleSatisfiesVacuum() with an additional
boolean parameter, force, we can tell VF to always set the hint bits in
every case, not just HEAP_MOVED_IN and HEAP_MOVED_OUT.

Surely this approach is no good: won't it allow hint bits to reach disk
in advance of their transaction?

I don't think so since it sounds like he's saying to still sync the log and
VACUUM FULL has an exclusive lock on the table. So any committed (or aborted)
changes it sees in the table must have been committed or aborted before the
log sync.

Hint bit updates are not WAL-logged, so there's no mechanism to keep the
data page from hitting the disk before the COMMIT record. That's the
reason why we can't just set the hint bits for async committed
transactions in the first place.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#8)
Re: Unexpected VACUUM FULL failure

Heikki Linnakangas <heikki@enterprisedb.com> writes:

Gregory Stark wrote:

I don't think so since it sounds like he's saying to still sync the log and
VACUUM FULL has an exclusive lock on the table. So any committed (or aborted)
changes it sees in the table must have been committed or aborted before the
log sync.

Hint bit updates are not WAL-logged, so there's no mechanism to keep the
data page from hitting the disk before the COMMIT record. That's the
reason why we can't just set the hint bits for async committed
transactions in the first place.

Well the theory was that the flush at the start would ensure that VF's
first scan could always set the hint bits. But I remembered this
morning why I felt uneasy about that: it's not guaranteed true for
system catalogs. We sometimes release locks on catalogs before
committing. (Whether that is a good idea is a different discussion.)

What I'm now thinking we should do is to have the first scan check
whether XMIN_COMMITTED is actually set (after HeapTupleSatisfiesVacuum
claims it's good), and abandon defrag if not, the same as we already do
in the other corner cases where we find not-guaranteed-committed tuples
(see INSERT_IN_PROGRESS/DELETE_IN_PROGRESS cases in scan_heap).

If we do that, we don't actually need XLogAsyncCommitFlush() at the start.
I'm inclined to remove it just because it's ugly. Comments?

BTW, something else that just penetrated my brain is that this failure
can only happen with synchronous_commit = off. In the sync-commit case,
even though we have inexact bookkeeping for clog LSNs, it's always true
that every LSN recorded for a clog page will have been flushed first.
So we will always think we can set hint bits even though we might be
testing an LSN later than the particular transaction in question.
I just rechecked and verified that I'd been (without really thinking
about it) running my test build with synchronous_commit = off for the
past few days. If I happened to see this in one of the relatively small
number of parallel regression sets I've run since then, then it should
have been obvious in the buildfarm. The reason it wasn't was that none
of the buildfarm machines are testing async-commit. We need to do
something about that.

regards, tom lane

#10Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#9)
Re: Unexpected VACUUM FULL failure

On Fri, 2007-08-10 at 11:11 -0400, Tom Lane wrote:

Heikki Linnakangas <heikki@enterprisedb.com> writes:

Gregory Stark wrote:

I don't think so since it sounds like he's saying to still sync the log and
VACUUM FULL has an exclusive lock on the table. So any committed (or aborted)
changes it sees in the table must have been committed or aborted before the
log sync.

Hint bit updates are not WAL-logged, so there's no mechanism to keep the
data page from hitting the disk before the COMMIT record. That's the
reason why we can't just set the hint bits for async committed
transactions in the first place.

Well the theory was that the flush at the start would ensure that VF's
first scan could always set the hint bits. But I remembered this
morning why I felt uneasy about that: it's not guaranteed true for
system catalogs. We sometimes release locks on catalogs before
committing. (Whether that is a good idea is a different discussion.)

Yes, I see comments in the VF code along those lines.

Seems like we should have code comments explaining exactly where we do
this, why and which things it causes issues for.

What I'm now thinking we should do is to have the first scan check
whether XMIN_COMMITTED is actually set (after HeapTupleSatisfiesVacuum
claims it's good), and abandon defrag if not, the same as we already do
in the other corner cases where we find not-guaranteed-committed tuples
(see INSERT_IN_PROGRESS/DELETE_IN_PROGRESS cases in scan_heap).

I now think we *must* do this for system catalogs, or something else at
least. Personally I would prefer preventing async commits from occurring
when a system catalog has been touched at all, which would make the VF
situation and a few other situations go away entirely.

However, I see no reason to do as you suggest for other tables. It seems
like this would be an annoying feature to have VF sometimes fail,
depending upon the timing of other transactions. There would be a rare
failure if an async commit touched an early block in a table immediately
prior to a VF. It's rare but possible. This would be extremely annoying
if a DBA ran a job to VF a table overnight and then came back in to see
the ERROR message. It is fairly easy to code it this way though, if you
really think this is the best way.

If we do that, we don't actually need XLogAsyncCommitFlush() at the start.
I'm inclined to remove it just because it's ugly. Comments?

I'm not clear why XLogAsyncCommitFlush() is ugly. It's one line of code
that doesn't make anything else any harder. Apart from system catalogs,
doing it that way would be error free for all normal VFs.

BTW, something else that just penetrated my brain is that this failure
can only happen with synchronous_commit = off. In the sync-commit case,
even though we have inexact bookkeeping for clog LSNs, it's always true
that every LSN recorded for a clog page will have been flushed first.
So we will always think we can set hint bits even though we might be
testing an LSN later than the particular transaction in question.
I just rechecked and verified that I'd been (without really thinking
about it) running my test build with synchronous_commit = off for the
past few days. If I happened to see this in one of the relatively small
number of parallel regression sets I've run since then, then it should
have been obvious in the buildfarm. The reason it wasn't was that none
of the buildfarm machines are testing async-commit. We need to do
something about that.

Yes, this issue effects async commit only.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#10)
Re: Unexpected VACUUM FULL failure

"Simon Riggs" <simon@2ndquadrant.com> writes:

On Fri, 2007-08-10 at 11:11 -0400, Tom Lane wrote:

If we do that, we don't actually need XLogAsyncCommitFlush() at the start.
I'm inclined to remove it just because it's ugly. Comments?

I'm not clear why XLogAsyncCommitFlush() is ugly. It's one line of code
that doesn't make anything else any harder. Apart from system catalogs,
doing it that way would be error free for all normal VFs.

Please recall that the failure that started this thread was on a regular
user table. To do what you want will introduce what I'm now thinking
is an unacceptable amount of fragility. In particular your patch of
yesterday to force hint-bit setting in VF scares the heck out of me.

The reason why XLogAsyncCommitFlush() is ugly is that it doesn't
actually accomplish a darn thing, as we now see from this failure:
it does not guarantee that hint bits will get set, because of the
inexact bookkeeping in clog.c. It might marginally improve the
probability that they get set, but that's all. The reason I want
to take it out is mainly that its very existence tempts people to make
the same mistake that was made here.

I now think we *must* do this for system catalogs, or something else at
least. Personally I would prefer preventing async commits from occurring
when a system catalog has been touched at all, which would make the VF
situation and a few other situations go away entirely.

I think that that is completely the wrong line of thought: we should be
making async commit work everywhere, or as nearly so as possible, not
inventing arbitrary restrictions to place on it. The restriction
you suggest here would probably cost more performance (by forcing sync
commits) than could ever be gained back by being able to assume hint
bits are set. In the patch as committed, I took out most of the
restrictions you had on async commit --- the only utility commands that
force sync commit are the ones that have direct effects on the filesystem.

Another argument is that VACUUM FULL is a dinosaur that should probably
go away entirely someday (a view I believe you share); it should
therefore not be allowed to drive the design of other parts of the
system.

regards, tom lane

#12Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#11)
Re: Unexpected VACUUM FULL failure

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

The reason why XLogAsyncCommitFlush() is ugly is that it doesn't
actually accomplish a darn thing, as we now see from this failure:
it does not guarantee that hint bits will get set, because of the
inexact bookkeeping in clog.c. It might marginally improve the
probability that they get set, but that's all. The reason I want
to take it out is mainly that its very existence tempts people to make
the same mistake that was made here.

I don't understand your reasoning here and I would like to understand it so if
you don't mind I would like to see if I can work out what you're talking
about. Regardless of this point I think the impact of what you were proposing
to do to VF instead was much less than Simon seemed to think it was so it
seems like a perfectly acceptable solution.

As far as I understand the Xlog flush in combination with keeping an exclusive
lock on table and always holding locks until the end of the transaction make
forcing the hint bits entirely safe.

The fragility you see comes from depending on how those three things interact
and the difficulty in ensuring that all of those properties are always true.
By "marginally improve the probability" you're making a judgement of the
probability that programmers will manage to maintain all those properties
consistently, not about the probability that the race will occur at run-time?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#13Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#11)
Re: Unexpected VACUUM FULL failure

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Another argument is that VACUUM FULL is a dinosaur that should probably
go away entirely someday (a view I believe you share); it should
therefore not be allowed to drive the design of other parts of the
system.

Incidentally, every time it comes up we recommend using CLUSTER or ALTER
TABLE. And explaining the syntax for ALTER TABLE is always a bit fiddly. I
wonder if it would make sense to add a "VACUUM REWRITE" which just did the
same as the noop ALTER TABLE we're recommending people do anyways. Then we
could have a HINT from VACUUM FULL which suggests considering VACUUM REWRITE.

I would think this would be 8.4 stuff except if all we want it to do is a
straight noop alter table it's pretty trivial. The hardest part is coming up
with a name for it.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#14Decibel!
decibel@decibel.org
In reply to: Gregory Stark (#13)
Re: Unexpected VACUUM FULL failure

On Fri, Aug 10, 2007 at 07:53:06PM +0100, Gregory Stark wrote:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

Another argument is that VACUUM FULL is a dinosaur that should probably
go away entirely someday (a view I believe you share); it should
therefore not be allowed to drive the design of other parts of the
system.

Incidentally, every time it comes up we recommend using CLUSTER or ALTER
TABLE. And explaining the syntax for ALTER TABLE is always a bit fiddly. I
wonder if it would make sense to add a "VACUUM REWRITE" which just did the
same as the noop ALTER TABLE we're recommending people do anyways. Then we
could have a HINT from VACUUM FULL which suggests considering VACUUM REWRITE.

I would think this would be 8.4 stuff except if all we want it to do is a
straight noop alter table it's pretty trivial. The hardest part is coming up
with a name for it.

One question... should we have a vacuum variant that also reindexes? Or
does that just naturally fall out of the rewrite?

BTW, rewrite sounds fine to me... anything but full, which is constantly
confused with a "full database vacuum".
--
Decibel!, aka Jim Nasby decibel@decibel.org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)

#15Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#11)
Re: Unexpected VACUUM FULL failure

On Fri, 2007-08-10 at 13:47 -0400, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

On Fri, 2007-08-10 at 11:11 -0400, Tom Lane wrote:

If we do that, we don't actually need XLogAsyncCommitFlush() at the start.
I'm inclined to remove it just because it's ugly. Comments?

I'm not clear why XLogAsyncCommitFlush() is ugly. It's one line of code
that doesn't make anything else any harder. Apart from system catalogs,
doing it that way would be error free for all normal VFs.

Please recall that the failure that started this thread was on a regular
user table. To do what you want will introduce what I'm now thinking
is an unacceptable amount of fragility. In particular your patch of
yesterday to force hint-bit setting in VF scares the heck out of me.

The reason why XLogAsyncCommitFlush() is ugly is that it doesn't
actually accomplish a darn thing, as we now see from this failure:
it does not guarantee that hint bits will get set, because of the
inexact bookkeeping in clog.c. It might marginally improve the
probability that they get set, but that's all. The reason I want
to take it out is mainly that its very existence tempts people to make
the same mistake that was made here.

I think this *can* work, but I accept you don't like it, even if I'm not
sure why. The bug was in the assumption that all flushed async commits
can be confirmed to be flushed, which isn't true, not in the flush
itself.

If VACUUM FULL has problems with catalog tables, then that is an
existing bug, not one introduced by the async patch. Catalog tables
might be unlocked and yet have uncommitted transactions in them, which
violates the assumption in the current VF code that all hint bits will
be set prior to repair_frag(). But the comments in vacuum.c line 1821
say "A tuple is either: (a) a tuple in a system catalog, inserted or
deleted by a not yet committed transaction... in case (a) we would not
be in repair_frag() at all" (it doesn't give a reason).

If the current comments are correct, then we're OK to fix this by having
a special case HeapTupleSatisfies, maybe coded slightly differently.

If the current comments are false, in that (a) is possible, then VF has
a pre-existing bug, that *must* be fixed in the way you suggest, but
that has nothing to do with async.

...but...

Another argument is that VACUUM FULL is a dinosaur that should probably
go away entirely someday (a view I believe you share); it should
therefore not be allowed to drive the design of other parts of the
system.

You're right. Let's make that day today.

I vote to completely replace VF now with a cluster-style rebuild of the
table. That way we won't have to keep fixing something we're gonna kill
anyway.

It would be better to release 8.3 with a new, clean, fast implementation
of VF, than to release it with code that we *think* is right, but has so
far proved a source of some truly obscure bugs.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#16Gregory Stark
stark@enterprisedb.com
In reply to: Simon Riggs (#15)
Re: Unexpected VACUUM FULL failure

"Simon Riggs" <simon@2ndquadrant.com> writes:

It would be better to release 8.3 with a new, clean, fast implementation
of VF, than to release it with code that we *think* is right, but has so
far proved a source of some truly obscure bugs.

Well building a suitable replacement for VACUUM FULL is more work than I was
proposing. The no-op ALTER TABLE rebuild still has the disadvantage of
requiring 2x the space taken by the table (and potentially more if you have
large indexes).

I also think it's a safer course of action to create a new command, spend one
or two releases improving it until we're sure it meets everyone's needs, then
drop the old command.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#16)
Re: Unexpected VACUUM FULL failure

Gregory Stark <stark@enterprisedb.com> writes:

"Simon Riggs" <simon@2ndquadrant.com> writes:

It would be better to release 8.3 with a new, clean, fast implementation
of VF, than to release it with code that we *think* is right, but has so
far proved a source of some truly obscure bugs.

Well building a suitable replacement for VACUUM FULL is more work than I was
proposing.

Reimplementing VACUUM FULL for 8.3 is right out :-(. I do want to ship
a release in calendar 2007 ...

I also think it's a safer course of action to create a new command, spend one
or two releases improving it until we're sure it meets everyone's needs, then
drop the old command.

Agreed. While I'd like to see V.F. dead, we need a proven replacement
first. (At the same time, we shouldn't spend more effort on V.F. than
we have to.)

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#13)
Re: Unexpected VACUUM FULL failure

Gregory Stark <stark@enterprisedb.com> writes:

Incidentally, every time it comes up we recommend using CLUSTER or ALTER
TABLE. And explaining the syntax for ALTER TABLE is always a bit fiddly. I
wonder if it would make sense to add a "VACUUM REWRITE" which just did the
same as the noop ALTER TABLE we're recommending people do anyways. Then we
could have a HINT from VACUUM FULL which suggests considering VACUUM REWRITE.

Not that syntax, please :-(. The trouble with VACUUM [adjective] is
that "adjective" has to become a fully reserved keyword, else the parser
can't tell it from a table name. This is all right for FULL because
that's a reserved word anyway due to the outer join syntax, but I really
don't want to do it for any words that aren't otherwise reserved.

regards, tom lane

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#12)
Re: Unexpected VACUUM FULL failure

Gregory Stark <stark@enterprisedb.com> writes:

"Tom Lane" <tgl@sss.pgh.pa.us> writes:

The reason why XLogAsyncCommitFlush() is ugly is that it doesn't
actually accomplish a darn thing, as we now see from this failure:
it does not guarantee that hint bits will get set, because of the
inexact bookkeeping in clog.c. It might marginally improve the
probability that they get set, but that's all. The reason I want
to take it out is mainly that its very existence tempts people to make
the same mistake that was made here.

I don't understand your reasoning here and I would like to understand it so if
you don't mind I would like to see if I can work out what you're talking
about.

Well, both Simon and I thought that XLogAsyncCommitFlush would allow
VACUUM FULL to assume that hint bits were good. We were both wrong
about that, and in hindsight that goes against the whole trend of PG
development on this topic: hint bits are hints, full stop. I think
that having XLogAsyncCommitFlush in the API will just encourage people
to use it when they shouldn't.

As far as I understand the Xlog flush in combination with keeping an exclusive
lock on table and always holding locks until the end of the transaction make
forcing the hint bits entirely safe.

I don't currently see a hole in that line of reasoning, but it's a bit
longer chain of reasoning than I like for a critical correctness
property. Especially when we have a boatload of code that violates the
last assumption, including deeply-embedded APIs (heap_close) that make
it easy to violate the assumption. We could maybe go out next week and
fix all the core code to not release any locks early, but what of
third-party backend add-ons? Worse, might we not regret the change
later due to increased deadlock risks?

By "marginally improve the probability" you're making a judgement of the
probability that programmers will manage to maintain all those properties
consistently, not about the probability that the race will occur at run-time?

No, I was thinking about the latter. The current CVS tip code doesn't
have a huge window between logically committing a transaction and being
able to set hint bits for it. In situations where you think "hmm, I
just made a big update, maybe a VACUUM FULL would be a good idea",
you'll be quite safe by the time you've managed to type VACUUM FULL.
A V.F. that is automatically issued while a lot of other stuff is
going on will be at larger risk of having the defragment step disabled,
but at the same time it's not obvious that anyone will care much.

regards, tom lane