B-tree page deletion boundary cases

Started by Noah Mischover 13 years ago6 messages
#1Noah Misch
noah@leadboat.com
1 attachment(s)

For the sake of concurrency, our B-tree implementation has a phased process
for reusing empty pages. Excerpting from nbtree/README:

A deleted page cannot be reclaimed immediately, since there may be other
processes waiting to reference it (ie, search processes that just left the
parent, or scans moving right or left from one of the siblings). These
processes must observe that the page is marked dead and recover
accordingly. Searches and forward scans simply follow the right-link
until they find a non-dead page --- this will be where the deleted page's
key-space moved to.

...

A deleted page can only be reclaimed once there is no scan or search that
has a reference to it; until then, it must stay in place with its
right-link undisturbed. We implement this by waiting until all
transactions that were running at the time of deletion are dead; which is
overly strong, but is simple to implement within Postgres. When marked
dead, a deleted page is labeled with the next-transaction counter value.
VACUUM can reclaim the page for re-use when this transaction number is
older than the oldest open transaction.

The VACUUM that deletes a page's last tuple calls _bt_pagedel(), which flags
the page BTP_DELETED and stores therein the result of ReadNewTransactionId().
When a later VACUUM visits such a page and observes that the stored XID is now
less than or equal to RecentXmin, it adds the page to the FSM. An INSERT or
UPDATE will pull the page from the FSM and repurpose it.

As I mentioned[1]http://archives.postgresql.org/message-id/20111122031745.GA10556@tornado.leadboat.com peripherally back in November, that algorithm has been
insufficient since the introduction of non-XID-bearing transactions in
PostgreSQL 8.3. Such transactions do not restrain RecentXmin. If no running
transaction has an XID, RecentXmin == ReadNewTransactionId() and the page
incorrectly becomes available for immediate reuse. This is difficult to
encounter in practice. VACUUM acquires a cleanup lock on every leaf page in
each index. Consequently, a problem can only arise around a leaf page
deletion when two VACUUMs visit the page during the narrow window when
_bt_steppage() has released the pin on its left sibling and not yet acquired a
read lock on the target. Non-leaf deletions might not require such narrow
conditions, but they are also exponentially less frequent. Here is a test
procedure illustrating the bug:

1. In session S1, run these commands:
BEGIN;
CREATE TABLE t (x int, filler character(400));
INSERT INTO t SELECT *, '' FROM generate_series(1, 10000);
CREATE INDEX ON t(x);
DELETE FROM t WHERE x >= 2000 AND x < 4000;
COMMIT;

BEGIN;
SET LOCAL enable_seqscan = off;
SET LOCAL enable_bitmapscan = off;
DECLARE c CURSOR FOR SELECT x FROM t WHERE x >= 1990 AND x < 4510;
FETCH 5 c;

2. Attach gdb to S1 and set a breakpoint on _bt_getbuf.
3. In S1, run "FETCH 10 c". This will hit the breakpoint.
4. In another session S2, run these commands:
VACUUM VERBOSE t; -- mark some pages BTP_DELETED
VACUUM VERBOSE t; -- update FSM to know about the pages
-- reuse the pages
INSERT INTO t SELECT * FROM generate_series(10001, 12000);

5. Exit gdb to free up S1. The FETCH only returns five rows.

(The "filler" column makes each index page correspond to more heap pages.
Without it, heap page pins prevent removing some of the tuples on the index
page under test.)

The fix is to compare the stored XID to RecentGlobalXmin, not RecentXmin. We
already use RecentGlobalXmin when wal_level = hot_standby. If no running
transaction has an XID and all running transactions began since the last
transaction that did bear an XID, RecentGlobalXmin == ReadNewTransactionId().
Therefore, the correct test is btpo.xact < RecentGlobalXmin, not btpo.xact <=
RecentGlobalXmin as we have today. This also cleanly removes the need for the
bit of code in _bt_getbuf() that decrements btpo.xact before sending it down
for ResolveRecoveryConflictWithSnapshot(). I suggested[2]http://archives.postgresql.org/message-id/20110616144746.GA13694@tornado.leadboat.com that decrement on
an unprincipled basis; it was just masking the off-by-one of using "<=
RecentGlobalXmin" instead of "< RecentGlobalXmin" in _bt_page_recyclable().

This change makes empty B-tree pages wait through two generations of running
transactions before reuse, so some additional bloat will arise. Furthermore,
the set of transactions having snapshots precluding reuse of the page will
continue to grow until the next transaction to allocate an XID commits. The
alternative of occasionally returning wrong query results won't do, though.
While we could explore fundamentally-different page deletion algorithms for
PostgreSQL 9.3, this is the only fix coming to mind that's suitable for today.

For purposes of log message writing, this patch effectively reverts commits
758bd2a433d64bed00ca084203b3e5ccfdea4499 and
e1cd66f74862936d84acf3008118d6094c56ad58. I've attempted to document in
comments all the questions raised over on the SP-GiST/hot standby thread[3]http://archives.postgresql.org/message-id/20120419065516.GC12337@tornado.leadboat.com.

Thanks,
nm

[1]: http://archives.postgresql.org/message-id/20111122031745.GA10556@tornado.leadboat.com
[2]: http://archives.postgresql.org/message-id/20110616144746.GA13694@tornado.leadboat.com
[3]: http://archives.postgresql.org/message-id/20120419065516.GC12337@tornado.leadboat.com

Attachments:

btree-page-del-v1.patchtext/plain; charset=us-asciiDownload
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 561ffbb..6932c2b 100644
*** a/src/backend/access/nbtree/README
--- b/src/backend/access/nbtree/README
***************
*** 261,272 **** we need to be sure we don't miss or re-scan any items.
  
  A deleted page can only be reclaimed once there is no scan or search that
  has a reference to it; until then, it must stay in place with its
! right-link undisturbed.  We implement this by waiting until all
! transactions that were running at the time of deletion are dead; which is
  overly strong, but is simple to implement within Postgres.  When marked
  dead, a deleted page is labeled with the next-transaction counter value.
  VACUUM can reclaim the page for re-use when this transaction number is
! older than the oldest open transaction.
  
  Reclaiming a page doesn't actually change its state on disk --- we simply
  record it in the shared-memory free space map, from which it will be
--- 261,274 ----
  
  A deleted page can only be reclaimed once there is no scan or search that
  has a reference to it; until then, it must stay in place with its
! right-link undisturbed.  We implement this by waiting until all active
! snapshots and registered snapshots as of the deletion are gone; which is
  overly strong, but is simple to implement within Postgres.  When marked
  dead, a deleted page is labeled with the next-transaction counter value.
  VACUUM can reclaim the page for re-use when this transaction number is
! older than RecentGlobalXmin.  As collateral damage, this implementation
! also waits for running XIDs with no snapshots and for snapshots taken
! until the next transaction to allocate an XID commits.
  
  Reclaiming a page doesn't actually change its state on disk --- we simply
  record it in the shared-memory free space map, from which it will be
diff --git a/src/backend/access/nbtree/index c5e147f..e6dec61 100644
*** a/src/backend/access/nbtree/nbtpage.c
--- b/src/backend/access/nbtree/nbtpage.c
***************
*** 558,576 **** _bt_getbuf(Relation rel, BlockNumber blkno, int access)
  					 */
  					if (XLogStandbyInfoActive())
  					{
- 						TransactionId latestRemovedXid;
- 
  						BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
  
! 						/*
! 						 * opaque->btpo.xact is the threshold value not the
! 						 * value to measure conflicts against. We must retreat
! 						 * by one from it to get the correct conflict xid.
! 						 */
! 						latestRemovedXid = opaque->btpo.xact;
! 						TransactionIdRetreat(latestRemovedXid);
! 
! 						_bt_log_reuse_page(rel, blkno, latestRemovedXid);
  					}
  
  					/* Okay to use page.  Re-initialize and return it */
--- 558,566 ----
  					 */
  					if (XLogStandbyInfoActive())
  					{
  						BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
  
! 						_bt_log_reuse_page(rel, blkno, opaque->btpo.xact);
  					}
  
  					/* Okay to use page.  Re-initialize and return it */
***************
*** 685,691 **** bool
  _bt_page_recyclable(Page page)
  {
  	BTPageOpaque opaque;
- 	TransactionId cutoff;
  
  	/*
  	 * It's possible to find an all-zeroes page in an index --- for example, a
--- 675,680 ----
***************
*** 698,715 **** _bt_page_recyclable(Page page)
  
  	/*
  	 * Otherwise, recycle if deleted and too old to have any processes
! 	 * interested in it.  If we are generating records for Hot Standby
! 	 * defer page recycling until RecentGlobalXmin to respect user
! 	 * controls specified by vacuum_defer_cleanup_age or hot_standby_feedback.
  	 */
- 	if (XLogStandbyInfoActive())
- 		cutoff = RecentGlobalXmin;
- 	else
- 		cutoff = RecentXmin;
- 
  	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
  	if (P_ISDELETED(opaque) &&
! 		TransactionIdPrecedesOrEquals(opaque->btpo.xact, cutoff))
  		return true;
  	return false;
  }
--- 687,697 ----
  
  	/*
  	 * Otherwise, recycle if deleted and too old to have any processes
! 	 * interested in it.
  	 */
  	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
  	if (P_ISDELETED(opaque) &&
! 		TransactionIdPrecedes(opaque->btpo.xact, RecentGlobalXmin))
  		return true;
  	return false;
  }
***************
*** 1376,1382 **** _bt_pagedel(Relation rel, Buffer buf, BTStack stack)
  
  	/*
  	 * Mark the page itself deleted.  It can be recycled when all current
! 	 * transactions are gone.
  	 */
  	page = BufferGetPage(buf);
  	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
--- 1358,1370 ----
  
  	/*
  	 * Mark the page itself deleted.  It can be recycled when all current
! 	 * transactions are gone.  Storing GetTopTransactionId() would work, but
! 	 * we're in VACUUM and would not otherwise have an XID.  Having already
! 	 * updated links to the target, ReadNewTransactionId() suffices as an
! 	 * upper bound.  Any scan having retained a now-stale link is advertising
! 	 * in its PGXACT an xmin less than or equal to the value we read here.  It
! 	 * will continue to do so, holding back RecentGlobalXmin, for the duration
! 	 * of that scan.
  	 */
  	page = BufferGetPage(buf);
  	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
diff --git a/src/backend/access/nbtree/nbtindex 3b351a8..deca38c 100644
*** a/src/backend/access/nbtree/nbtxlog.c
--- b/src/backend/access/nbtree/nbtxlog.c
***************
*** 968,974 **** btree_redo(XLogRecPtr lsn, XLogRecord *record)
  				/*
  				 * Btree reuse page records exist to provide a conflict point
  				 * when we reuse pages in the index via the FSM. That's all it
! 				 * does though.
  				 */
  				{
  					xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);
--- 968,978 ----
  				/*
  				 * Btree reuse page records exist to provide a conflict point
  				 * when we reuse pages in the index via the FSM. That's all it
! 				 * does though. latestRemovedXid was the page's btpo.xact. The
! 				 * btpo.xact < RecentGlobalXmin test in _bt_page_recyclable()
! 				 * conceptually mirrors the pgxact->xmin > limitXmin test in
! 				 * GetConflictingVirtualXIDs().  Consequently, one XID value
! 				 * achieves the same exclusion effect on master and standby.
  				 */
  				{
  					xl_btree_reuse_page *xlrec = (xl_btree_reuse_page *) XLogRecGetData(record);
#2Nikhil Sontakke
nikkhils@gmail.com
In reply to: Noah Misch (#1)
Re: B-tree page deletion boundary cases

Hi Noah,

Was wondering if there's a similar bug which gets triggered while using
VACUUM FULL. See for instance this thread:

http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html

This issue has been reported on-off from time to time and in most cases
VACUUM or VACUUM FULL appears to be involved. We have usually attributed it
to hardware issues and reindex has been recommended by default as a
solution/work around..

Regards,
Nikhils

On Sat, Apr 21, 2012 at 10:22 PM, Noah Misch <noah@leadboat.com> wrote:

Show quoted text

For the sake of concurrency, our B-tree implementation has a phased process
for reusing empty pages. Excerpting from nbtree/README:

A deleted page cannot be reclaimed immediately, since there may be
other
processes waiting to reference it (ie, search processes that just
left the
parent, or scans moving right or left from one of the siblings).
These
processes must observe that the page is marked dead and recover
accordingly. Searches and forward scans simply follow the
right-link
until they find a non-dead page --- this will be where the deleted
page's
key-space moved to.

...

A deleted page can only be reclaimed once there is no scan or
search that
has a reference to it; until then, it must stay in place with its
right-link undisturbed. We implement this by waiting until all
transactions that were running at the time of deletion are dead;
which is
overly strong, but is simple to implement within Postgres. When
marked
dead, a deleted page is labeled with the next-transaction counter
value.
VACUUM can reclaim the page for re-use when this transaction number
is
older than the oldest open transaction.

The VACUUM that deletes a page's last tuple calls _bt_pagedel(), which
flags
the page BTP_DELETED and stores therein the result of
ReadNewTransactionId().
When a later VACUUM visits such a page and observes that the stored XID is
now
less than or equal to RecentXmin, it adds the page to the FSM. An INSERT
or
UPDATE will pull the page from the FSM and repurpose it.

As I mentioned[1] peripherally back in November, that algorithm has been
insufficient since the introduction of non-XID-bearing transactions in
PostgreSQL 8.3. Such transactions do not restrain RecentXmin. If no
running
transaction has an XID, RecentXmin == ReadNewTransactionId() and the page
incorrectly becomes available for immediate reuse. This is difficult to
encounter in practice. VACUUM acquires a cleanup lock on every leaf page
in
each index. Consequently, a problem can only arise around a leaf page
deletion when two VACUUMs visit the page during the narrow window when
_bt_steppage() has released the pin on its left sibling and not yet
acquired a
read lock on the target. Non-leaf deletions might not require such narrow
conditions, but they are also exponentially less frequent. Here is a test
procedure illustrating the bug:

1. In session S1, run these commands:
BEGIN;
CREATE TABLE t (x int, filler character(400));
INSERT INTO t SELECT *, '' FROM generate_series(1, 10000);
CREATE INDEX ON t(x);
DELETE FROM t WHERE x >= 2000 AND x < 4000;
COMMIT;

BEGIN;
SET LOCAL enable_seqscan = off;
SET LOCAL enable_bitmapscan = off;
DECLARE c CURSOR FOR SELECT x FROM t WHERE x >= 1990 AND x < 4510;
FETCH 5 c;

2. Attach gdb to S1 and set a breakpoint on _bt_getbuf.
3. In S1, run "FETCH 10 c". This will hit the breakpoint.
4. In another session S2, run these commands:
VACUUM VERBOSE t; -- mark some pages BTP_DELETED
VACUUM VERBOSE t; -- update FSM to know about the pages
-- reuse the pages
INSERT INTO t SELECT * FROM generate_series(10001, 12000);

5. Exit gdb to free up S1. The FETCH only returns five rows.

(The "filler" column makes each index page correspond to more heap pages.
Without it, heap page pins prevent removing some of the tuples on the index
page under test.)

The fix is to compare the stored XID to RecentGlobalXmin, not RecentXmin.
We
already use RecentGlobalXmin when wal_level = hot_standby. If no running
transaction has an XID and all running transactions began since the last
transaction that did bear an XID, RecentGlobalXmin ==
ReadNewTransactionId().
Therefore, the correct test is btpo.xact < RecentGlobalXmin, not btpo.xact
<=
RecentGlobalXmin as we have today. This also cleanly removes the need for
the
bit of code in _bt_getbuf() that decrements btpo.xact before sending it
down
for ResolveRecoveryConflictWithSnapshot(). I suggested[2] that decrement
on
an unprincipled basis; it was just masking the off-by-one of using "<=
RecentGlobalXmin" instead of "< RecentGlobalXmin" in _bt_page_recyclable().

This change makes empty B-tree pages wait through two generations of
running
transactions before reuse, so some additional bloat will arise.
Furthermore,
the set of transactions having snapshots precluding reuse of the page will
continue to grow until the next transaction to allocate an XID commits.
The
alternative of occasionally returning wrong query results won't do, though.
While we could explore fundamentally-different page deletion algorithms for
PostgreSQL 9.3, this is the only fix coming to mind that's suitable for
today.

For purposes of log message writing, this patch effectively reverts commits
758bd2a433d64bed00ca084203b3e5ccfdea4499 and
e1cd66f74862936d84acf3008118d6094c56ad58. I've attempted to document in
comments all the questions raised over on the SP-GiST/hot standby
thread[3].

Thanks,
nm

[1]
http://archives.postgresql.org/message-id/20111122031745.GA10556@tornado.leadboat.com
[2]
http://archives.postgresql.org/message-id/20110616144746.GA13694@tornado.leadboat.com
[3]
http://archives.postgresql.org/message-id/20120419065516.GC12337@tornado.leadboat.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Noah Misch
noah@leadboat.com
In reply to: Nikhil Sontakke (#2)
Re: B-tree page deletion boundary cases

On Sun, Apr 22, 2012 at 12:13:34AM +0530, Nikhil Sontakke wrote:

Was wondering if there's a similar bug which gets triggered while using
VACUUM FULL. See for instance this thread:

http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html

This issue has been reported on-off from time to time and in most cases
VACUUM or VACUUM FULL appears to be involved. We have usually attributed it
to hardware issues and reindex has been recommended by default as a
solution/work around..

I do not perceive much similarity. The bug I've raised can produce wrong
query results transiently. It might permit injecting a tuple into the wrong
spot in the tree, yielding persistent wrong results. It would not introduce
tree-structural anomalies like sibling pointers directed at zeroed pages or
internal pages in an 1-level tree. Given the symptoms you reported, I share
Robert's suspicion of WAL replay in your scenario.

#4Nikhil Sontakke
nikkhils@gmail.com
In reply to: Noah Misch (#3)
Re: B-tree page deletion boundary cases

Was wondering if there's a similar bug which gets triggered while using
VACUUM FULL. See for instance this thread:

http://postgresql.1045698.n5.nabble.com/index-corruption-in-PG-8-3-13-td4257589.html

This issue has been reported on-off from time to time and in most cases
VACUUM or VACUUM FULL appears to be involved. We have usually attributed

it

to hardware issues and reindex has been recommended by default as a
solution/work around..

I do not perceive much similarity. The bug I've raised can produce wrong
query results transiently. It might permit injecting a tuple into the
wrong
spot in the tree, yielding persistent wrong results. It would not
introduce
tree-structural anomalies like sibling pointers directed at zeroed pages or
internal pages in an 1-level tree. Given the symptoms you reported, I
share
Robert's suspicion of WAL replay in your scenario.

Thanks for taking the time to analyze this Noah.

Regards,
Nikhils

#5Simon Riggs
simon@2ndQuadrant.com
In reply to: Noah Misch (#1)
Re: B-tree page deletion boundary cases

On Sat, Apr 21, 2012 at 5:52 PM, Noah Misch <noah@leadboat.com> wrote:

As I mentioned[1] peripherally back in November, that algorithm has been
insufficient since the introduction of non-XID-bearing transactions in
PostgreSQL 8.3.  Such transactions do not restrain RecentXmin.  If no running
transaction has an XID, RecentXmin == ReadNewTransactionId() and the page
incorrectly becomes available for immediate reuse.

Good observation.

The fix is to compare the stored XID to RecentGlobalXmin, not RecentXmin.  We
already use RecentGlobalXmin when wal_level = hot_standby.  If no running
transaction has an XID and all running transactions began since the last
transaction that did bear an XID, RecentGlobalXmin == ReadNewTransactionId().
Therefore, the correct test is btpo.xact < RecentGlobalXmin, not btpo.xact <=
RecentGlobalXmin as we have today.  This also cleanly removes the need for the
bit of code in _bt_getbuf() that decrements btpo.xact before sending it down
for ResolveRecoveryConflictWithSnapshot().  I suggested[2] that decrement on
an unprincipled basis; it was just masking the off-by-one of using "<=
RecentGlobalXmin" instead of "< RecentGlobalXmin" in _bt_page_recyclable().

Looks like the right fix. I'll apply this to 9.0/9.1/HEAD.

This change makes empty B-tree pages wait through two generations of running
transactions before reuse, so some additional bloat will arise.

Could arise, in some circumstances. But that assumes VACUUMs are
fairly frequent and that they would be delayed/rendered less effective
by this, which I don't think will be the case.

I note that we don't take any account of the number of pages that may
be reused when we VACUUM, so when HOT avoids a VACUUM we may
accumulate pages for a longer period. Looks like there is more work to
be done yet in cleaning indexes.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#6Noah Misch
noah@leadboat.com
In reply to: Simon Riggs (#5)
Re: B-tree page deletion boundary cases

On Tue, Apr 24, 2012 at 09:08:36AM +0100, Simon Riggs wrote:

On Sat, Apr 21, 2012 at 5:52 PM, Noah Misch <noah@leadboat.com> wrote:

The fix is to compare the stored XID to RecentGlobalXmin, not RecentXmin. ?We
already use RecentGlobalXmin when wal_level = hot_standby. ?If no running
transaction has an XID and all running transactions began since the last
transaction that did bear an XID, RecentGlobalXmin == ReadNewTransactionId().
Therefore, the correct test is btpo.xact < RecentGlobalXmin, not btpo.xact <=
RecentGlobalXmin as we have today. ?This also cleanly removes the need for the
bit of code in _bt_getbuf() that decrements btpo.xact before sending it down
for ResolveRecoveryConflictWithSnapshot(). ?I suggested[2] that decrement on
an unprincipled basis; it was just masking the off-by-one of using "<=
RecentGlobalXmin" instead of "< RecentGlobalXmin" in _bt_page_recyclable().

Looks like the right fix. I'll apply this to 9.0/9.1/HEAD.

Thanks. 8.3 and 8.4 need it, too, albeit simplified due to some of the
affected code not existing yet.

This change makes empty B-tree pages wait through two generations of running
transactions before reuse, so some additional bloat will arise.

Could arise, in some circumstances. But that assumes VACUUMs are
fairly frequent and that they would be delayed/rendered less effective
by this, which I don't think will be the case.

I note that we don't take any account of the number of pages that may
be reused when we VACUUM, so when HOT avoids a VACUUM we may
accumulate pages for a longer period. Looks like there is more work to
be done yet in cleaning indexes.

Agreed. Taking one VACUUM visit to mark the page BTP_DELETED and another to
add it to the FSM is fairly pessimistic; perhaps we can do better.

nm