Resurrecting per-page cleaner for btree
Hi Hackers,
Can we resurrect the patch proposed by Junji TERAMOTO?
It removes unnecessary items before btree pages split.
http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
There was a problem in the patch when we restarted scans from deleted tuples.
But now we scan pages at-a-time, so the problem is resolved, isn't it?
http://archives.postgresql.org/pgsql-patches/2006-05/msg00008.php
I think this feature is independent of the SITC project and useful for
heavily-updated indexes. If it is worthwhile, I'll revise the patch to
bring it up to date with HEAD.
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
This is a revised patch originated by Junji TERAMOTO for HEAD.
[BTree vacuum before page splitting]
http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
I think we can resurrect his idea because we will scan btree pages
at-a-time now; the missing-restarting-point problem went away.
Have I missed something? Comments welcome.
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Attachments:
btsplitvacuum.patch (application/octet-stream)
diff -cpr pgsql-orig/src/backend/access/nbtree/README pgsql/src/backend/access/nbtree/README
*** pgsql-orig/src/backend/access/nbtree/README Thu Jul 13 09:36:05 2006
--- pgsql/src/backend/access/nbtree/README Thu Jul 13 10:47:57 2006
*************** item doesn't fit on the split page where
*** 128,146 ****
The deletion algorithm
----------------------
! Before deleting a leaf item, we get a super-exclusive lock on the target
! page, so that no other backend has a pin on the page when the deletion
! starts. This is not necessary for correctness in terms of the btree index
! operations themselves; as explained above, index scans logically stop
! "between" pages and so can't lose their place. The reason we do it is to
! provide an interlock between non-full VACUUM and indexscans. Since VACUUM
! deletes index entries before deleting tuples, the super-exclusive lock
! guarantees that VACUUM can't delete any heap tuple that an indexscanning
! process might be about to visit. (This guarantee works only for simple
! indexscans that visit the heap in sync with the index scan, not for bitmap
! scans. We only need the guarantee when using non-MVCC snapshot rules such
! as SnapshotNow, so in practice this is only important for system catalog
! accesses.)
Because a page can be split even while someone holds a pin on it, it is
possible that an indexscan will return items that are no longer stored on
--- 128,148 ----
The deletion algorithm
----------------------
! Before deleting a leaf item and the corresponding heap tuple, we get a
! super-exclusive lock on the target page, so that no other backend has a
! pin on the page when the deletion starts. This is not necessary for
! correctness in terms of the btree index operations themselves; as explained
! above, index scans logically stop "between" pages and so can't lose their
! place. The reason we do it is to provide an interlock between non-full
! VACUUM and indexscans. Since VACUUM deletes index entries before deleting
! tuples, the super-exclusive lock guarantees that VACUUM can't delete any
! heap tuple that an indexscanning process might be about to visit. (This
! guarantee works only for simple indexscans that visit the heap in sync with
! the index scan, not for bitmap scans. We only need the guarantee when
! using non-MVCC snapshot rules such as SnapshotNow, so in practice this is
! only important for system catalog accesses.) When we sweep only leaf pages
! and don't delete heap tuples, we take just an exclusive lock on the leaf page.
! We use this leaf-only sweep just before a leaf page would be split.
Because a page can be split even while someone holds a pin on it, it is
possible that an indexscan will return items that are no longer stored on
diff -cpr pgsql-orig/src/backend/access/nbtree/nbtinsert.c pgsql/src/backend/access/nbtree/nbtinsert.c
*** pgsql-orig/src/backend/access/nbtree/nbtinsert.c Thu Jul 13 09:36:05 2006
--- pgsql/src/backend/access/nbtree/nbtinsert.c Thu Jul 13 09:53:58 2006
*************** static void _bt_pgaddtup(Relation rel, P
*** 64,69 ****
--- 64,70 ----
OffsetNumber itup_off, const char *where);
static bool _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
int keysz, ScanKey scankey);
+ static void _bt_vacuumpage(Relation rel, Buffer buffer);
/*
*************** _bt_check_unique(Relation rel, IndexTupl
*** 261,266 ****
--- 262,268 ----
hbuffer) == HEAPTUPLE_DEAD)
{
curitemid->lp_flags |= LP_DELETE;
+ opaque->btpo_flags |= BTP_HAS_GARBAGE;
if (nbuf != InvalidBuffer)
SetBufferCommitInfoNeedsSave(nbuf);
else
*************** _bt_insertonpg(Relation rel,
*** 423,433 ****
*/
bool movedright = false;
! while (PageGetFreeSpace(page) < itemsz &&
! !P_RIGHTMOST(lpageop) &&
! _bt_compare(rel, keysz, scankey, page, P_HIKEY) == 0 &&
! random() > (MAX_RANDOM_VALUE / 100))
{
/*
* step right to next non-dead page
*
--- 425,449 ----
*/
bool movedright = false;
! while (PageGetFreeSpace(page) < itemsz)
{
+ Buffer rbuf;
+
+ /*
+ * Erase dead items to obtain enough space.
+ */
+ if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
+ {
+ _bt_vacuumpage(rel, buf);
+ if (PageGetFreeSpace(page) >= itemsz)
+ break; /* Ok, now we have enough space. */
+ }
+
+ if (P_RIGHTMOST(lpageop) ||
+ _bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
+ random() <= (MAX_RANDOM_VALUE / 100))
+ break;
+
/*
* step right to next non-dead page
*
*************** _bt_insertonpg(Relation rel,
*** 437,443 ****
* pages won't do because we don't know when they will get
* de-linked from the tree.
*/
! Buffer rbuf = InvalidBuffer;
for (;;)
{
--- 453,459 ----
* pages won't do because we don't know when they will get
* de-linked from the tree.
*/
! rbuf = InvalidBuffer;
for (;;)
{
*************** _bt_isequal(TupleDesc itupdesc, Page pag
*** 1649,1652 ****
--- 1665,1704 ----
/* if we get here, the keys are equal */
return true;
+ }
+
+ /*
+ * _bt_vacuumpage --- Vacuum just one page.
+ *
+ * Simplified version of btvacuumpage. It cleans the passed page only.
+ * The passed buffer must be exclusive-locked.
+ */
+ static void
+ _bt_vacuumpage(Relation rel, Buffer buffer)
+ {
+ OffsetNumber deletable[MaxOffsetNumber];
+ int ndeletable = 0;
+ OffsetNumber offnum,
+ minoff,
+ maxoff;
+ Page page = BufferGetPage(buffer);
+ BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+ /*
+ * Scan over all items to see which ones need to be deleted
+ * according to LP_DELETE flags.
+ */
+ minoff = P_FIRSTDATAKEY(opaque);
+ maxoff = PageGetMaxOffsetNumber(page);
+ for (offnum = minoff;
+ offnum <= maxoff;
+ offnum = OffsetNumberNext(offnum))
+ {
+ ItemId itemId = PageGetItemId(page, offnum);
+ if (ItemIdDeleted(itemId))
+ deletable[ndeletable++] = offnum;
+ }
+
+ if (ndeletable > 0)
+ _bt_delitems(rel, buffer, deletable, ndeletable);
}
diff -cpr pgsql-orig/src/backend/access/nbtree/nbtpage.c pgsql/src/backend/access/nbtree/nbtpage.c
*** pgsql-orig/src/backend/access/nbtree/nbtpage.c Thu Jul 13 09:36:05 2006
--- pgsql/src/backend/access/nbtree/nbtpage.c Thu Jul 13 09:45:19 2006
*************** _bt_delitems(Relation rel, Buffer buf,
*** 666,671 ****
--- 666,672 ----
*/
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
opaque->btpo_cycleid = 0;
+ opaque->btpo_flags &= ~BTP_HAS_GARBAGE;
MarkBufferDirty(buf);
diff -cpr pgsql-orig/src/include/access/nbtree.h pgsql/src/include/access/nbtree.h
*** pgsql-orig/src/include/access/nbtree.h Thu Jul 13 09:36:09 2006
--- pgsql/src/include/access/nbtree.h Thu Jul 13 09:45:19 2006
*************** typedef BTPageOpaqueData *BTPageOpaque;
*** 71,76 ****
--- 71,77 ----
#define BTP_META (1 << 3) /* meta-page */
#define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
+ #define BTP_HAS_GARBAGE (1 << 6) /* page has dead tuples */
/*
*************** typedef struct BTMetaPageData
*** 163,168 ****
--- 164,170 ----
#define P_ISROOT(opaque) ((opaque)->btpo_flags & BTP_ROOT)
#define P_ISDELETED(opaque) ((opaque)->btpo_flags & BTP_DELETED)
#define P_IGNORE(opaque) ((opaque)->btpo_flags & (BTP_DELETED|BTP_HALF_DEAD))
+ #define P_HAS_GARBAGE(opaque) ((opaque)->btpo_flags & BTP_HAS_GARBAGE)
/*
* Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
On Thu, 2006-07-13 at 10:49 +0900, ITAGAKI Takahiro wrote:
This is a revised patch originated by Junji TERAMOTO for HEAD.
[BTree vacuum before page splitting]
http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
I think we can resurrect his idea because we will scan btree pages
at-a-time now; the missing-restarting-point problem went away.
Have I missed something? Comments welcome.
The new locking system does allow this now.
So this is an option for us, AFAICS.
--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
On Wed, Jul 12, 2006 at 03:59:01PM +0900, ITAGAKI Takahiro wrote:
Hi Hackers,
Can we resurrect the patch proposed by Junji TERAMOTO?
It removes unnecessary items before btree pages split.
http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
There was a problem in the patch when we restarted scans from deleted tuples.
But now we scan pages at-a-time, so the problem is resolved, isn't it?
http://archives.postgresql.org/pgsql-patches/2006-05/msg00008.php
I think this feature is independent of the SITC project and useful for
heavily-updated indexes. If it is worthwhile, I'll revise the patch to
bring it up to date with HEAD.
Tom's comment about the patch needing better comments still holds. If
nothing else, do the best you can with the comments in English and
someone else can clean the grammar up.
It's also not clear to me whether Tom's comment about not deleting LP_DELETE
tuples at will is still valid.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
This is a revised patch originated by Junji TERAMOTO for HEAD.
[BTree vacuum before page splitting]
http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
I think we can resurrect his idea because we will scan btree pages
at-a-time now; the missing-restarting-point problem went away.
Have I missed something? Comments welcome.
I think the only serious objection to this would be that it'd mean that
tuples that should have an index entry might not have one. The current
form of VACUUM does not care, but people keep raising the idea of doing
"retail" vacuuming that operates by looking up index entries explicitly.
You could certainly make a retail vacuumer do nothing if it fails to
find the expected index entry, but ISTM that'd be a rather serious loss
of consistency checking --- you could not tell the someone-already-
deleted-it case apart from a bug in the vacuumer's index value
computation or lookup.
Personally I don't think retail vacuuming in that form will ever fly
anyway, so I have no problem with installing the proposed patch,
but I thought I'd better throw this comment out to see if anyone
thinks it's a big deal.
regards, tom lane
On Mon, 24 Jul 2006, Tom Lane wrote:
Personally I don't think retail vacuuming in that form will ever fly
anyway, so I have no problem with installing the proposed patch,
but I thought I'd better throw this comment out to see if anyone
thinks it's a big deal.
My feeling is that retail vacuuming would be useful some day. But it's
certainly not going to be there in 8.2, so I have no objection to the
patch. It's a fairly localized change; it can easily be reverted later if
necessary.
- Heikki
Tom Lane wrote:
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
This is a revised patch originated by Junji TERAMOTO for HEAD.
[BTree vacuum before page splitting]
http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
I think we can resurrect his idea because we will scan btree pages
at-a-time now; the missing-restarting-point problem went away.
Have I missed something? Comments welcome.
I think the only serious objection to this would be that it'd mean that
tuples that should have an index entry might not have one. The current
form of VACUUM does not care, but people keep raising the idea of doing
"retail" vacuuming that operates by looking up index entries explicitly.
You could certainly make a retail vacuumer do nothing if it fails to
find the expected index entry, but ISTM that'd be a rather serious loss
of consistency checking --- you could not tell the someone-already-
deleted-it case apart from a bug in the vacuumer's index value
computation or lookup.
Personally I don't think retail vacuuming in that form will ever fly
anyway, so I have no problem with installing the proposed patch,
but I thought I'd better throw this comment out to see if anyone
thinks it's a big deal.
Agreed. Reverse lookup of index entries will always be too slow.
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Your patch has been added to the PostgreSQL unapplied patches list at:
http://momjian.postgresql.org/cgi-bin/pgpatches
It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think the only serious objection to this would be that it'd mean that
tuples that should have an index entry might not have one.
This can occur when we do REINDEX after DELETE, because dead tuples are
excluded during REINDEX. So we cannot guarantee that all heap tuples have
corresponding index entries; vacuumers should not assume they exist.
But I see the change can be confusing, too.
I'll add more comments.
# CREATE TABLE test (i int);
# INSERT INTO test SELECT generate_series(1, 1000);
# CREATE INDEX test_idx ON test (i);
# DELETE FROM test WHERE i % 10 = 0;
# REINDEX INDEX test_idx;
# SELECT tuple_count, dead_tuple_count FROM pgstattuple('test');
tuple_count | dead_tuple_count
-------------+------------------
900 | 100
# SELECT tuple_count, dead_tuple_count FROM pgstattuple('test_idx');
tuple_count | dead_tuple_count
-------------+------------------
900 | 0
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
I think the only serious objection to this would be that it'd mean that
tuples that should have an index entry might not have one.
This can occur when we do REINDEX after DELETE, because dead tuples are
excluded during REINDEX.
Good point. So we already have that problem anyway ...
regards, tom lane
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
I think we can resurrect his idea because we will scan btree pages
at-a-time now; the missing-restarting-point problem went away.
Have I missed something? Comments welcome.
I was thinking for a while just now that this would break the interlock
that guarantees VACUUM can't delete a heap tuple that an indexscanning
process is about to visit. After further thought, it doesn't, but it's
non-obvious. I've added the attached commentary to nbtree/README:
On-the-fly deletion of index tuples
-----------------------------------
If a process visits a heap tuple and finds that it's dead and removable
(ie, dead to all open transactions, not only that process), then we can
return to the index and mark the corresponding index entry "known dead",
allowing subsequent index scans to skip visiting the heap tuple. The
"known dead" marking uses the LP_DELETE bit in ItemIds. This is currently
only done in plain indexscans, not bitmap scans, because only plain scans
visit the heap and index "in sync" and so there's not a convenient way
to do it for bitmap scans.
Once an index tuple has been marked LP_DELETE it can actually be removed
from the index immediately; since index scans only stop "between" pages,
no scan can lose its place from such a deletion. We separate the steps
because we allow LP_DELETE to be set with only a share lock (it's exactly
like a hint bit for a heap tuple), but physically removing tuples requires
exclusive lock. In the current code we try to remove LP_DELETE tuples when
we are otherwise faced with having to split a page to do an insertion (and
hence have exclusive lock on it already).
This leaves the index in a state where it has no entry for a dead tuple
that still exists in the heap. This is not a problem for the current
implementation of VACUUM, but it could be a problem for anything that
explicitly tries to find index entries for dead tuples. (However, the
same situation is created by REINDEX, since it doesn't enter dead
tuples into the index.)
It's sufficient to have an exclusive lock on the index page, not a
super-exclusive lock, to do deletion of LP_DELETE items. It might seem
that this breaks the interlock between VACUUM and indexscans, but that is
not so: as long as an indexscanning process has a pin on the page where
the index item used to be, VACUUM cannot complete its btbulkdelete scan
and so cannot remove the heap tuple. This is another reason why
btbulkdelete has to get super-exclusive lock on every leaf page, not only
the ones where it actually sees items to delete.
regards, tom lane
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
This is a revised patch originated by Junji TERAMOTO for HEAD.
[BTree vacuum before page splitting]
http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
I think we can resurrect his idea because we will scan btree pages
at-a-time now; the missing-restarting-point problem went away.
I've applied this but I'm now having some second thoughts about it,
because I'm seeing an actual *decrease* in pgbench numbers from the
immediately prior CVS HEAD code. Using
pgbench -i -s 10 bench
pgbench -c 10 -t 1000 bench (repeat this half a dozen times)
with fsync off but all other settings factory-stock, what I'm seeing
is that the first run looks really good but subsequent runs tail off in
spectacular fashion :-( Pre-patch there was only minor degradation in
successive runs.
What I think is happening is that because pgbench depends so heavily on
updating existing records, we get into a state where an index page is
about full and there's one dead tuple on it, and then for each insertion
we have
* check for uniqueness marks one more tuple dead (the
next-to-last version of the tuple)
* newly added code removes one tuple and does a write
* now there's enough room to insert one tuple
* lather, rinse, repeat, never splitting the page.
The problem is that we've traded splitting a page every few hundred
inserts for doing a PageIndexMultiDelete, and emitting an extra WAL
record, on *every* insert. This is not good.
Had you done any performance testing on this patch, and if so what
tests did you use? I'm a bit hesitant to try to fix it on the basis
of pgbench results alone.
One possible fix that comes to mind is to only perform the cleanup
if we are able to remove more than one dead tuple (perhaps about 10
would be good). Or do the deletion anyway, but then go ahead and
split the page unless X amount of space has been freed (where X is
more than just barely enough for the incoming tuple).
After all the thought we've put into this, it seems a shame to
just abandon it :-(. But it definitely needs more tweaking.
regards, tom lane
The attached patch requires the new row to fit, and 10% to be free on
the page. Would someone test that?
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Attachments:
/pgpatches/btree (text/x-diff)
Index: src/backend/access/nbtree/nbtinsert.c
===================================================================
RCS file: /cvsroot/pgsql/src/backend/access/nbtree/nbtinsert.c,v
retrieving revision 1.142
diff -c -c -r1.142 nbtinsert.c
*** src/backend/access/nbtree/nbtinsert.c 25 Jul 2006 19:13:00 -0000 1.142
--- src/backend/access/nbtree/nbtinsert.c 26 Jul 2006 01:35:52 -0000
***************
*** 438,445 ****
if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
{
_bt_vacuum_one_page(rel, buf);
! if (PageGetFreeSpace(page) >= itemsz)
! break; /* OK, now we have enough space */
}
/*
--- 438,451 ----
if (P_ISLEAF(lpageop) && P_HAS_GARBAGE(lpageop))
{
_bt_vacuum_one_page(rel, buf);
! /*
! * Free space should be large enough for the new tuple, and
! * should be at least 10% of the block, because scanning the
! * page over and over again to reclaim just a little free
! * space is inefficient.
! */
! if (PageGetFreeSpace(page) >= itemsz &&
! PageGetFreeSpace(page) >= BLCKSZ / 10)
! break;
}
/*
Tom, I ran your tests with fsync off (as you did), and saw numbers
bouncing between 400-700 tps without my patch, and sticking at 700 tps
with my patch.
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Tom Lane <tgl@sss.pgh.pa.us> wrote:
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
I've applied this but I'm now having some second thoughts about it,
because I'm seeing an actual *decrease* in pgbench numbers from the
immediately prior CVS HEAD code.
Had you done any performance testing on this patch, and if so what
tests did you use? I'm a bit hesitant to try to fix it on the basis
of pgbench results alone.
Thank you for applying and adding documentation.
But hmm... I tested on DBT-2 and pgbench, and the patch worked well on both.
I used disks with battery backup for WAL and enabled the writeback cache, so
the extra WAL records might be of no significance.
On DBT-2, the bottleneck was entirely the data disks. Cache usage was
important there, so avoiding splits was useful for saving memory.
On pgbench, the full dataset was cached in memory, so the bottleneck was CPU.
Cleanup of branches_pkey was especially effective because it was frequently
updated and had many LP_DELETE'd tuples. The degradation was reduced on my machine.
The problem is that we've traded splitting a page every few hundred
inserts for doing a PageIndexMultiDelete, and emitting an extra WAL
record, on *every* insert. This is not good.
I suspect PageIndexMultiDelete() consumes CPU. If there are only one or two
dead tuples, PageIndexTupleDelete() is called for each, and a memmove (4KB on
average) plus adjustment of the line-pointer offsets is performed every time.
I think this is a heavy operation. But if the index entry stored uppermost on
the page is the same size as the dead tuple, we could just move that entry
into the hole and avoid touching all the other tuples. Is this change acceptable?
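To illustrate the idea (only a toy sketch with made-up structures, not the
real PostgreSQL Page/ItemId layout): when the dead tuple and the tuple stored
at pd_upper happen to have the same length, we can copy that single tuple into
the hole and give its space back at pd_upper, instead of memmove'ing the whole
data area and re-adjusting every offset.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins for the page structures; illustration only. */
typedef struct ToyItemId
{
    uint16_t    off;            /* offset of the tuple data in page->data */
    uint16_t    len;            /* length of the tuple data */
    bool        dead;           /* LP_DELETE analogue */
} ToyItemId;

typedef struct ToyPage
{
    char        data[8192];
    ToyItemId   items[256];     /* line pointers, kept in key order */
    int         nitems;
    uint16_t    pd_upper;       /* lowest offset used by tuple data */
} ToyPage;

/*
 * Try to reclaim the space of the dead item 'victim' cheaply.  Returns
 * false if the sizes don't match, in which case the caller falls back to
 * full compaction (the PageIndexTupleDelete()-style memmove of everything
 * above the hole plus re-adjusting every moved item's offset).
 */
static bool
toy_reclaim_cheap(ToyPage *page, int victim)
{
    ToyItemId  *v = &page->items[victim];
    int         uppermost = -1;
    int         i;

    /* Find the item whose data sits at pd_upper (the "uppermost" tuple). */
    for (i = 0; i < page->nitems; i++)
        if (page->items[i].off == page->pd_upper)
            uppermost = i;

    if (uppermost < 0 || page->items[uppermost].len != v->len)
        return false;           /* different size: cheap move not possible */

    if (uppermost != victim)
    {
        /* Copy the uppermost tuple into the hole and repoint its item;
         * no other tuple data moves and no other offset changes. */
        memmove(page->data + v->off,
                page->data + page->items[uppermost].off, v->len);
        page->items[uppermost].off = v->off;
    }

    /* Either way, the space at pd_upper is now free. */
    page->pd_upper += v->len;

    /* Close the gap in the (small) line pointer array. */
    memmove(&page->items[victim], &page->items[victim + 1],
            (page->nitems - victim - 1) * sizeof(ToyItemId));
    page->nitems--;
    return true;
}

Of course this only helps when the sizes match, which mostly happens with
fixed-width keys such as pgbench's integers, and it does nothing about the
extra WAL record.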
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
Tom Lane <tgl@sss.pgh.pa.us> wrote:
The problem is that we've traded splitting a page every few hundred
inserts for doing a PageIndexMultiDelete, and emitting an extra WAL
record, on *every* insert. This is not good.
I suspect PageIndexMultiDelete() consumes CPU.
That's part of the problem, but only part: the extra WAL record is
expensive too.
If there are only one or two
dead tuples, PageIndexTupleDelete() is called for each, and a memmove (4KB on
average) plus adjustment of the line-pointer offsets is performed every time.
I think this is a heavy operation. But if the index entry stored uppermost on
the page is the same size as the dead tuple, we could just move that entry
into the hole and avoid touching all the other tuples. Is this change acceptable?
I'm inclined to think that this is too special-purpose to be a good
solution. It will help pgbench because that test uses only integer
keys, but it won't help for any variable-width datatype. In any case
we'd still have the WAL overhead...
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
This is a revised patch originated by Junji TERAMOTO for HEAD.
[BTree vacuum before page splitting]
http://archives.postgresql.org/pgsql-patches/2006-01/msg00301.php
I think we can resurrect his idea because we will scan btree pages
at-a-time now; the missing-restarting-point problem went away.
Have I missed something? Comments welcome.
I think the only serious objection to this would be that it'd mean that
tuples that should have an index entry might not have one. The current
form of VACUUM does not care, but people keep raising the idea of doing
"retail" vacuuming that operates by looking up index entries explicitly.
You could certainly make a retail vacuumer do nothing if it fails to
find the expected index entry, but ISTM that'd be a rather serious loss
of consistency checking --- you could not tell the someone-already-
deleted-it case apart from a bug in the vacuumer's index value
computation or lookup.
Well, you already have that case anyway due to online index builds (if a
vacuum gets in between the two transactions). I suppose you could go and check
whether the pg_index entry for the index indicates that it's valid already, but
that's not something anyone has to look for currently.
Personally I don't think retail vacuuming in that form will ever fly
anyway, so I have no problem with installing the proposed patch,
but I thought I'd better throw this comment out to see if anyone
thinks it's a big deal.
Well it's not like the existing vacuum checks for this. It's not even
convenient to check for it in the current vacuum architecture. It would have
to maintain the list of expired htids that haven't been found yet in the
bulkdelete and see if there are any left when it's done.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Gregory Stark <gsstark@mit.edu> writes:
... Well it's not like the existing vacuum checks for this.
Right, that's exactly why the patch works at all. But the point here is
that the existing vacuum does not rely on re-computing index keys; all
it cares about is matching TIDs. The retail-vacuum idea depends on the
assumption that you can look at the tuple and re-compute the same index
keys that you computed the first time; which is an assumption much
shakier than the assumption that TID comparison works. (In fact, it's
trivial to see how user-defined functions that are mislabeled immutable
could make this fail.) So retail vacuum without any cross-check that
you got all the index tuples is a scary proposition IMHO.
regards, tom lane
[snip] (In fact, it's
trivial to see how user-defined functions that are mislabeled immutable
could make this fail.) So retail vacuum without any cross-check that
you got all the index tuples is a scary proposition IMHO.
Wouldn't it work to restrict that kind of vacuum to only tables which have
no indexes using user-defined functions? That would be a very small
restriction, I guess; probably 99.9% of indexes won't use user-defined
functions...
I actually wonder if such a vacuum would be useful for my scenario,
where I have some pretty big tables and update a relatively small
percentage of them. Would such a vacuum be faster than the current one?
One example would be a ~100 million row table where I have 1-4 million
updates per day. Could I run vacuum multiple times a day for this table
and expect that individual runs are relatively fast?
Cheers,
Csaba.
Csaba Nagy <nagy@ecircle-ag.com> writes:
[snip] (In fact, it's
trivial to see how user-defined functions that are mislabeled immutable
could make this fail.) So retail vacuum without any cross-check that
you got all the index tuples is a scary proposition IMHO.
Wouldn't it work to restrict that kind of vacuum to only tables which have
no indexes using user-defined functions?
Of course, we never have bugs in PG core. Nope, doesn't happen ...
I actually wonder if such a vacuum would be useful for my scenario,
where I have some pretty big tables, and update a relatively small
percentage of it. Would it be faster to run such a vacuum against the
current one ?
So far, the case hasn't been made for retail vacuum even ignoring the
not-so-immutable-function risk.
regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> writes:
So far, the case hasn't been made for retail vacuum even ignoring the
not-so-immutable-function risk.
Well, the desire for it comes from a very well-established need for dealing
with extremely large tables with relatively small hot spots. The basic problem
being that currently the cost of vacuum is proportional to the size of the
table rather than the amount of dead space. There's no link between those
variables (at least in one direction) and any time they're far out of whack it
means excruciating pain for the DBA.
The case that perhaps hasn't been made is for whether retail vacuum is the
best solution. The only other candidate that I've seen proposed (many many
times) is some form of segregating the older tuples a la Oracle.
That doesn't mean retail vacuuming is a good idea. It may just be that we
haven't thought of the best option yet. But it also may be that retail
vacuuming or some kind of rollback segment is just the least bad idea.
--
greg
On Wed, Jul 26, 2006 at 12:47:57PM -0400, Greg Stark wrote:
Tom Lane <tgl@sss.pgh.pa.us> writes:
So far, the case hasn't been made for retail vacuum even ignoring the
not-so-immutable-function risk.
Well, the desire for it comes from a very well-established need for dealing
with extremely large tables with relatively small hot spots. The basic problem
being that currently the cost of vacuum is proportional to the size of the
table rather than the amount of dead space. There's no link between those
variables (at least in one direction) and any time they're far out of whack it
means excruciating pain for the DBA.
I thought the suggested solution for that was the dead space map. That
way vacuum can ignore parts of the table that haven't changed...
Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.
On Wednesday, 2006-07-26 at 23:02, Martijn van Oosterhout wrote:
On Wed, Jul 26, 2006 at 12:47:57PM -0400, Greg Stark wrote:
Tom Lane <tgl@sss.pgh.pa.us> writes:
So far, the case hasn't been made for retail vacuum even ignoring the
not-so-immutable-function risk.
Well, the desire for it comes from a very well-established need for dealing
with extremely large tables with relatively small hot spots. The basic problem
being that currently the cost of vacuum is proportional to the size of the
table rather than the amount of dead space. There's no link between those
variables (at least in one direction) and any time they're far out of whack it
means excruciating pain for the DBA.
I thought the suggested solution for that was the dead space map. That
way vacuum can ignore parts of the table that haven't changed...
It can ignore parts of the *table* but still has to scan full *indexes*.
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia
Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Bruce Momjian <bruce@momjian.us> wrote:
The attached patch requires the new row to fit, and 10% to be free on
the page. Would someone test that?
This is another solution for the same purpose. We can avoid calling
PageIndexMultiDelete() to remove only one tuple. _bt_split() now ignores
tuples with LP_DELETE, because such tuples can now reach it.
The cleaner threshold (currently 2) might need further tweaking.
Can this change resolve your performance problem?
diff -cpr pgsql-orig/src/backend/access/nbtree/nbtinsert.c pgsql/src/backend/access/nbtree/nbtinsert.c
*** pgsql-orig/src/backend/access/nbtree/nbtinsert.c Wed Jul 26 11:15:20 2006
--- pgsql/src/backend/access/nbtree/nbtinsert.c Thu Jul 27 18:27:25 2006
*************** _bt_split(Relation rel, Buffer buf, Offs
*** 788,793 ****
--- 788,798 ----
for (i = P_FIRSTDATAKEY(oopaque); i <= maxoff; i = OffsetNumberNext(i))
{
itemid = PageGetItemId(origpage, i);
+
+ /* We can ignore items with LP_DELETE. */
+ if (ItemIdDeleted(itemid))
+ continue;
+
itemsz = ItemIdGetLength(itemid);
item = (IndexTuple) PageGetItem(origpage, itemid);
*************** _bt_vacuum_one_page(Relation rel, Buffer
*** 1707,1713 ****
deletable[ndeletable++] = offnum;
}
! if (ndeletable > 0)
_bt_delitems(rel, buffer, deletable, ndeletable);
/*
* Note: if we didn't find any LP_DELETE items, then the page's
--- 1712,1723 ----
deletable[ndeletable++] = offnum;
}
! /*
! * We delete items if there are at least 2 deletable index tuples
! * because scanning the page over and over again to get just a little
! * free space is inefficient.
! */
! if (ndeletable >= 2)
_bt_delitems(rel, buffer, deletable, ndeletable);
/*
* Note: if we didn't find any LP_DELETE items, then the page's
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
On Jul 26, 2006, at 4:29 PM, Hannu Krosing wrote:
Well, the desire for it comes from a very well-established need for dealing
with extremely large tables with relatively small hot spots. The basic
problem being that currently the cost of vacuum is proportional to the size
of the table rather than the amount of dead space. There's no link between
those variables (at least in one direction) and any time they're far out of
whack it means excruciating pain for the DBA.
I thought the suggested solution for that was the dead space map. That way
vacuum can ignore parts of the table that haven't changed...
It can ignore parts of the *table* but still has to scan full *indexes*.
Even if we stopped right there it would still be a huge win in many
(most?) cases. How often do the indexes on a table comprise even 50%
of the table's size? If indexes are 10% of the table's size, and you
only scan 10% of the table that's dead, you've gone from scanning
1.10*X pages (where X is the number of pages in the table) to 0.20*X
pages. For large tables, that's a drastic improvement. Even in the
50% case, you've gone from 1.5X to .6X (assuming 10% dead space). And
I'm actually ignoring that we currently scan the heap twice; if you
throw that into the mix the argument for doing this is even stronger.
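Putting the same arithmetic in a general form (with X heap pages, r the
index-to-heap size ratio, and f the dead fraction of the heap; the 10% and
50% figures above are just the assumed values):
$$C_{\mathrm{full}} \approx (1 + r)\,X \qquad C_{\mathrm{DSM}} \approx (f + r)\,X$$
so r = f = 0.10 gives 1.10X vs 0.20X, and r = 0.50 gives 1.50X vs 0.60X.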
While it would be ideal to eliminate the need to scan the indexes,
I'd certainly rather have the 'half-solution' of not scanning the
entire heap and still scanning the indexes over what we have today.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Jul 26, 2006, at 10:29 AM, Tom Lane wrote:
Gregory Stark <gsstark@mit.edu> writes:
... Well it's not like the existing vacuum checks for this.
Right, that's exactly why the patch works at all. But the point here is
that the existing vacuum does not rely on re-computing index keys; all
it cares about is matching TIDs. The retail-vacuum idea depends on the
assumption that you can look at the tuple and re-compute the same index
keys that you computed the first time; which is an assumption much
shakier than the assumption that TID comparison works. (In fact, it's
trivial to see how user-defined functions that are mislabeled immutable
could make this fail.) So retail vacuum without any cross-check that
you got all the index tuples is a scary proposition IMHO.
Are there other cases that could cause a problem besides mislabeled
user function indexes (a case I don't feel we need to protect
against) and bugs in core?
If we don't worry about users who can't figure out what immutable
means, we should be able to cover core bugs by having the buildfarm
occasionally do long-running pgbench (or some other workload) runs
with autovacuum turned on. Knowing what autovacuum is set to, we can
calculate approximately how much dead/empty space should exist in
indexes, and make sure we're within reason (though this might mean
having a mode that disables deleting known-dead tuples before
splitting a page). Another possibility would be actually inspecting
the indexes for invalid entries.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim Nasby <jnasby@pervasive.com> writes:
Even if we stopped right there it would still be a huge win in many (most?)
cases. How often do the indexes on a table comprise even 50% of the table's
size?
I would say they're usually roughly comparable actually. It depends on how
wide your table is of course but the wider your table rows the more indexes
you're likely to have on the table too.
Even in the 50% case, you've gone from 1.5X to .6X
Sure, and a 3x speedup is nothing to sneeze at, that would be a great
improvement to vacuum. But it's still just a linear speedup and doesn't
address the algorithmic problem.
The fundamental problem is we have a process that's O(m) where m is the total
space taken by a table and its indexes. The actual amount of space it has to
reclaim is n. Other than n<m there's basically no relationship between these
figures. As long as that's the case vacuum may as well be O(n^2) or O(n!).
We frequently assume -- and often it's a valid assumption -- that these
figures are roughly proportional. Hence all the talk about databases reaching
a "steady-state" where the amount of dead space is constantly being reclaimed
at more or less the same speed it's being generated. But there are also plenty
of use cases where a complete vacuum pass takes thousands of times longer than
the i/o it took to generate those dead tuples. Currently Postgres just isn't
that great a tool for those use cases.
Unfortunately while I'm convinced of the problem I'm equally unconvinced of
the solution. I tried to solve online index builds using retail index lookups
in a very similar way to what's being discussed here. And ran into the same
problems. I eventually decided that while it could be made to work that way it
would be far too much code, far too unsafe, and far too invasive in the index
access methods to be the right approach.
Our existing method works with minimal help from the index access methods
which allows for an enormous degree of freedom in the index design. To be
able to support retail vacuum you would have to force index method
implementors to keep information in a way that allowed them to look up a
particular value/tid efficiently which would limit the kinds of indexes you
could support drastically.
--
greg
On Thu, Jul 27, 2006 at 05:24:35PM -0400, Greg Stark wrote:
Jim Nasby <jnasby@pervasive.com> writes:
Even if we stopped right there it would still be a huge win in many (most?)
cases. How often do the indexes on a table comprise even 50% of the table's
size?
I would say they're usually roughly comparable actually. It depends on how
wide your table is of course but the wider your table rows the more indexes
you're likely to have on the table too.
I think the number of fields in a table will correlate with the number
of indexes, but I don't think the width of those fields matters.
Even in the 50% case, you've gone from 1.5X to .6X
Sure, and a 3x speedup is nothing to sneeze at, that would be a great
improvement to vacuum. But it's still just a linear speedup and doesn't
address the algorithmic problem.
The fundamental problem is we have a process that's O(m) where m is the total
space taken by a table and its indexes. The actual amount of space it has to
reclaim is n. Other than n<m there's basically no relationship between these
figures. As long as that's the case vacuum may as well be O(n^2) or O(n!).
We frequently assume -- and often it's a valid assumption -- that these
figures are roughly proportional. Hence all the talk about databases reaching
a "steady-state" where the amount of dead space is constantly being reclaimed
at more or less the same speed it's being generated. But there are also plenty
of use cases where a complete vacuum pass takes thousands of times longer than
the i/o it took to generate those dead tuples. Currently Postgres just isn't
that great a tool for those use cases.
This is exactly why I'm suggesting that we stop waiting for the perfect
vacuum that will only hit the exact tuples in both the heap and indexes
that it needs to and at least take the giant leap forward of only
hitting heap tuples/pages that we know are dead. Even if indexes are the
same size as the heap, you've still gained nearly a 2x speed improvement
(depending on how long you wait to vacuum).
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Have we made a decision on this issue? Should I apply my patch that
still forces a split unless 10% of the page has been freed?
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
Have we made a decision on this issue? Should I apply my patch that
still forces a split unless 10% of the page has been freed?
I haven't gotten back to doing any more performance testing. Please
stick that patch on the pending queue, so we don't forget it, but
don't apply it yet ...
regards, tom lane
Tom, should I apply this patch now? Are you still considering other
options for this?
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
Tom, should I apply this patch now? Are you still considering other
options for this?
Please wait. This issue is very far down the to-do list in terms of size
or significance ... but I'll get to it.
regards, tom lane
Bruce Momjian <bruce@momjian.us> writes:
Tom Lane wrote:
I've applied this but I'm now having some second thoughts about it,
because I'm seeing an actual *decrease* in pgbench numbers from the
immediately prior CVS HEAD code.
The attached patch requires the new row to fit, and 10% to be free on
the page. Would someone test that?
At the moment, I cannot replicate any consistent difference between
CVS head with the patch, without the patch, with the patch plus your
BLCKSZ/10 limit addition, or with a variant BLCKSZ/32 limit addition.
That's whether I use HEAD's broken version of pgbench or one from late
July. So I'm feeling a tad frustrated ... but I have no evidence in
favor of changing what is in CVS, and accordingly recommend that we
leave well enough alone for 8.2.
regards, tom lane
OK, removed from open item list.
--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +