Pre-allocated free space for row updating (like PCTFREE)

Started by Satoshi Nagayasu over 20 years ago, 28 messages
#1Satoshi Nagayasu
nagayasus@nttdata.co.jp
1 attachment(s)

Hi all,

I've done a quick hack to implement PCTFREE on PostgreSQL.

As you know, it's inspired by Oracle's PCTFREE.

http://www.csee.umbc.edu/help/oracle8/server.815/a67772/schema.htm#990
http://www.comp.hkbu.edu.hk/docs/o/oracle10g/server.101/b10743/cncpt031.gif

Pre-allocated space in each block (page) can improve heap_update() performance,
because heap_update() looks for free space in the same block
to insert the new row.

According to my experiments, pgbench score was improved 10% or more
with 1024 bytes free space.
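
A minimal sketch of the check the attached patch adds, written as a standalone function rather than PostgreSQL code. The names `RESERVED_BYTES` and `page_accepts_tuple` are made up for illustration; the 1024-byte reserve and the insert-only rule come from the patch itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative constant: the patch hard-codes 1024 bytes of reserved space. */
#define RESERVED_BYTES 1024

/*
 * Return true if a tuple of tuple_len bytes may go on a page that has
 * page_free bytes available. Inserts must leave RESERVED_BYTES untouched;
 * updates may use all of the free space.
 */
static bool
page_accepts_tuple(size_t tuple_len, size_t page_free, bool for_insert)
{
    size_t needed = for_insert ? tuple_len + RESERVED_BYTES : tuple_len;

    return needed <= page_free;
}
```

An insert of a 100-byte tuple is therefore refused on a page with 1123 bytes free, while an update writing a new 100-byte row version to that same page is accepted.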

Any comments? Is this idea good, or not?

Thanks.
--
NAGAYASU Satoshi <nagayasus@nttdata.co.jp>

Attachments:

pctfree.001.diff (text/plain; name=pctfree.001.diff)
diff -rc postgresql-8.0.0.orig/src/backend/access/heap/heapam.c postgresql-8.0.0.pctfree/src/backend/access/heap/heapam.c
*** postgresql-8.0.0.orig/src/backend/access/heap/heapam.c	2005-01-01 06:59:16.000000000 +0900
--- postgresql-8.0.0.pctfree/src/backend/access/heap/heapam.c	2005-08-20 23:20:45.017901208 +0900
***************
*** 1151,1157 ****
  		heap_tuple_toast_attrs(relation, tup, NULL);
  
  	/* Find buffer to insert this tuple into */
! 	buffer = RelationGetBufferForTuple(relation, tup->t_len, InvalidBuffer);
  
  	/* NO EREPORT(ERROR) from here till changes are logged */
  	START_CRIT_SECTION();
--- 1151,1160 ----
  		heap_tuple_toast_attrs(relation, tup, NULL);
  
  	/* Find buffer to insert this tuple into */
! 	buffer = RelationGetBufferForTuple(relation,
! 									   tup->t_len,
! 									   InvalidBuffer,
! 									   true);
  
  	/* NO EREPORT(ERROR) from here till changes are logged */
  	START_CRIT_SECTION();
***************
*** 1671,1678 ****
  		if (newtupsize > pagefree)
  		{
  			/* Assume there's no chance to put newtup on same page. */
! 			newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
! 											   buffer);
  		}
  		else
  		{
--- 1674,1683 ----
  		if (newtupsize > pagefree)
  		{
  			/* Assume there's no chance to put newtup on same page. */
! 			newbuf = RelationGetBufferForTuple(relation,
! 											   newtup->t_len,
! 											   buffer,
! 											   false);
  		}
  		else
  		{
***************
*** 1688,1695 ****
  				 * should seldom be taken.
  				 */
  				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
! 				newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
! 												   buffer);
  			}
  			else
  			{
--- 1693,1702 ----
  				 * should seldom be taken.
  				 */
  				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
! 				newbuf = RelationGetBufferForTuple(relation,
! 												   newtup->t_len,
! 												   buffer,
! 												   false);
  			}
  			else
  			{
diff -rc postgresql-8.0.0.orig/src/backend/access/heap/hio.c postgresql-8.0.0.pctfree/src/backend/access/heap/hio.c
*** postgresql-8.0.0.orig/src/backend/access/heap/hio.c	2005-01-01 06:59:16.000000000 +0900
--- postgresql-8.0.0.pctfree/src/backend/access/heap/hio.c	2005-08-20 23:35:44.986085248 +0900
***************
*** 89,95 ****
   */
  Buffer
  RelationGetBufferForTuple(Relation relation, Size len,
! 						  Buffer otherBuffer)
  {
  	Buffer		buffer = InvalidBuffer;
  	Page		pageHeader;
--- 89,95 ----
   */
  Buffer
  RelationGetBufferForTuple(Relation relation, Size len,
! 						  Buffer otherBuffer, bool forInsert)
  {
  	Buffer		buffer = InvalidBuffer;
  	Page		pageHeader;
***************
*** 136,142 ****
  		 * We have no cached target page, so ask the FSM for an initial
  		 * target.
  		 */
! 		targetBlock = GetPageWithFreeSpace(&relation->rd_node, len);
  
  		/*
  		 * If the FSM knows nothing of the rel, try the last page before
--- 136,142 ----
  		 * We have no cached target page, so ask the FSM for an initial
  		 * target.
  		 */
! 		targetBlock = GetPageWithFreeSpace(&relation->rd_node, len, forInsert);
  
  		/*
  		 * If the FSM knows nothing of the rel, try the last page before
***************
*** 192,198 ****
  		 */
  		pageHeader = (Page) BufferGetPage(buffer);
  		pageFreeSpace = PageGetFreeSpace(pageHeader);
! 		if (len <= pageFreeSpace)
  		{
  			/* use this page as future insert target, too */
  			relation->rd_targblock = targetBlock;
--- 192,198 ----
  		 */
  		pageHeader = (Page) BufferGetPage(buffer);
  		pageFreeSpace = PageGetFreeSpace(pageHeader);
! 		if ((forInsert ? (len+1024) : len) <= pageFreeSpace)
  		{
  			/* use this page as future insert target, too */
  			relation->rd_targblock = targetBlock;
***************
*** 221,227 ****
  		targetBlock = RecordAndGetPageWithFreeSpace(&relation->rd_node,
  													targetBlock,
  													pageFreeSpace,
! 													len);
  	}
  
  	/*
--- 221,228 ----
  		targetBlock = RecordAndGetPageWithFreeSpace(&relation->rd_node,
  													targetBlock,
  													pageFreeSpace,
! 													len,
! 													forInsert);
  	}
  
  	/*
diff -rc postgresql-8.0.0.orig/src/backend/storage/freespace/freespace.c postgresql-8.0.0.pctfree/src/backend/storage/freespace/freespace.c
*** postgresql-8.0.0.orig/src/backend/storage/freespace/freespace.c	2005-01-01 07:00:54.000000000 +0900
--- postgresql-8.0.0.pctfree/src/backend/storage/freespace/freespace.c	2005-08-20 23:18:21.413732360 +0900
***************
*** 229,235 ****
  static void unlink_fsm_rel_usage(FSMRelation *fsmrel);
  static void link_fsm_rel_storage(FSMRelation *fsmrel);
  static void unlink_fsm_rel_storage(FSMRelation *fsmrel);
! static BlockNumber find_free_space(FSMRelation *fsmrel, Size spaceNeeded);
  static BlockNumber find_index_free_space(FSMRelation *fsmrel);
  static void fsm_record_free_space(FSMRelation *fsmrel, BlockNumber page,
  					  Size spaceAvail);
--- 229,237 ----
  static void unlink_fsm_rel_usage(FSMRelation *fsmrel);
  static void link_fsm_rel_storage(FSMRelation *fsmrel);
  static void unlink_fsm_rel_storage(FSMRelation *fsmrel);
! static BlockNumber find_free_space(FSMRelation *fsmrel,
! 								   Size spaceNeeded,
! 								   bool forInsert);
  static BlockNumber find_index_free_space(FSMRelation *fsmrel);
  static void fsm_record_free_space(FSMRelation *fsmrel, BlockNumber page,
  					  Size spaceAvail);
***************
*** 359,365 ****
   * extend the relation.
   */
  BlockNumber
! GetPageWithFreeSpace(RelFileNode *rel, Size spaceNeeded)
  {
  	FSMRelation *fsmrel;
  	BlockNumber freepage;
--- 361,367 ----
   * extend the relation.
   */
  BlockNumber
! GetPageWithFreeSpace(RelFileNode *rel, Size spaceNeeded, bool forInsert)
  {
  	FSMRelation *fsmrel;
  	BlockNumber freepage;
***************
*** 384,390 ****
  		cur_avg += ((int) spaceNeeded - cur_avg) / 32;
  		fsmrel->avgRequest = (Size) cur_avg;
  	}
! 	freepage = find_free_space(fsmrel, spaceNeeded);
  	LWLockRelease(FreeSpaceLock);
  	return freepage;
  }
--- 386,392 ----
  		cur_avg += ((int) spaceNeeded - cur_avg) / 32;
  		fsmrel->avgRequest = (Size) cur_avg;
  	}
! 	freepage = find_free_space(fsmrel, spaceNeeded, forInsert);
  	LWLockRelease(FreeSpaceLock);
  	return freepage;
  }
***************
*** 399,405 ****
  RecordAndGetPageWithFreeSpace(RelFileNode *rel,
  							  BlockNumber oldPage,
  							  Size oldSpaceAvail,
! 							  Size spaceNeeded)
  {
  	FSMRelation *fsmrel;
  	BlockNumber freepage;
--- 401,408 ----
  RecordAndGetPageWithFreeSpace(RelFileNode *rel,
  							  BlockNumber oldPage,
  							  Size oldSpaceAvail,
! 							  Size spaceNeeded,
! 							  bool forInsert)
  {
  	FSMRelation *fsmrel;
  	BlockNumber freepage;
***************
*** 429,435 ****
  		fsmrel->avgRequest = (Size) cur_avg;
  	}
  	/* Do the Get */
! 	freepage = find_free_space(fsmrel, spaceNeeded);
  	LWLockRelease(FreeSpaceLock);
  	return freepage;
  }
--- 432,438 ----
  		fsmrel->avgRequest = (Size) cur_avg;
  	}
  	/* Do the Get */
! 	freepage = find_free_space(fsmrel, spaceNeeded, forInsert);
  	LWLockRelease(FreeSpaceLock);
  	return freepage;
  }
***************
*** 1204,1210 ****
   * if no success.
   */
  static BlockNumber
! find_free_space(FSMRelation *fsmrel, Size spaceNeeded)
  {
  	FSMPageData *info;
  	int			pagesToCheck,	/* outer loop counter */
--- 1207,1213 ----
   * if no success.
   */
  static BlockNumber
! find_free_space(FSMRelation *fsmrel, Size spaceNeeded, bool forInsert)
  {
  	FSMPageData *info;
  	int			pagesToCheck,	/* outer loop counter */
***************
*** 1225,1231 ****
  		Size		spaceAvail = FSMPageGetSpace(page);
  
  		/* Check this page */
! 		if (spaceAvail >= spaceNeeded)
  		{
  			/*
  			 * Found what we want --- adjust the entry, and update
--- 1228,1234 ----
  		Size		spaceAvail = FSMPageGetSpace(page);
  
  		/* Check this page */
! 		if (spaceAvail >= (forInsert ? (spaceNeeded + 1024) : spaceNeeded) )
  		{
  			/*
  			 * Found what we want --- adjust the entry, and update
diff -rc postgresql-8.0.0.orig/src/include/access/hio.h postgresql-8.0.0.pctfree/src/include/access/hio.h
*** postgresql-8.0.0.orig/src/include/access/hio.h	2005-01-01 07:03:21.000000000 +0900
--- postgresql-8.0.0.pctfree/src/include/access/hio.h	2005-08-20 23:13:20.267513544 +0900
***************
*** 19,24 ****
  extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
  					 HeapTuple tuple);
  extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
! 						  Buffer otherBuffer);
  
  #endif   /* HIO_H */
--- 19,24 ----
  extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
  					 HeapTuple tuple);
  extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
! 						  Buffer otherBuffer, bool forInsert);
  
  #endif   /* HIO_H */
diff -rc postgresql-8.0.0.orig/src/include/storage/freespace.h postgresql-8.0.0.pctfree/src/include/storage/freespace.h
*** postgresql-8.0.0.orig/src/include/storage/freespace.h	2005-01-01 07:03:42.000000000 +0900
--- postgresql-8.0.0.pctfree/src/include/storage/freespace.h	2005-08-20 23:18:46.661894056 +0900
***************
*** 39,49 ****
  extern void InitFreeSpaceMap(void);
  extern int	FreeSpaceShmemSize(void);
  
! extern BlockNumber GetPageWithFreeSpace(RelFileNode *rel, Size spaceNeeded);
  extern BlockNumber RecordAndGetPageWithFreeSpace(RelFileNode *rel,
  							  BlockNumber oldPage,
  							  Size oldSpaceAvail,
! 							  Size spaceNeeded);
  extern Size GetAvgFSMRequestSize(RelFileNode *rel);
  extern void RecordRelationFreeSpace(RelFileNode *rel,
  						int nPages,
--- 39,52 ----
  extern void InitFreeSpaceMap(void);
  extern int	FreeSpaceShmemSize(void);
  
! extern BlockNumber GetPageWithFreeSpace(RelFileNode *rel,
! 										Size spaceNeeded,
! 										bool forInsert);
  extern BlockNumber RecordAndGetPageWithFreeSpace(RelFileNode *rel,
  							  BlockNumber oldPage,
  							  Size oldSpaceAvail,
! 							  Size spaceNeeded,
! 												 bool forInsert);
  extern Size GetAvgFSMRequestSize(RelFileNode *rel);
  extern void RecordRelationFreeSpace(RelFileNode *rel,
  						int nPages,
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Satoshi Nagayasu (#1)
Re: Pre-allocated free space for row updating (like PCTFREE)

Satoshi Nagayasu <nagayasus@nttdata.co.jp> writes:

I've done a quick hack to implement PCTFREE on PostgreSQL.
...
According to my experiments, pgbench score was improved 10% or more
with 1024 bytes free space.

I'm not very enthused about this. Enforcing 12.5% PCTFREE means that
you pay 12.5% extra I/O costs across the board for INSERT and SELECT
and then hope you can make it back (plus some more) on UPDATEs.
pgbench is a completely UPDATE-dominated benchmark and thus it makes
such a patch look much better than it would on other workloads.
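
The 12.5% figure follows from reserving 1024 bytes in the default 8192-byte block. A rough sketch of that arithmetic, assuming the default block size and ignoring page headers and tuple boundaries (the function names are illustrative only):

```c
#include <stddef.h>

#define BLOCK_SIZE 8192         /* assumed: PostgreSQL's default block size */

/* Share of each page that is set aside, in percent. */
static double
reserved_percent(size_t reserved_bytes)
{
    return 100.0 * (double) reserved_bytes / BLOCK_SIZE;
}

/*
 * Pages needed to hold data_bytes when every page keeps reserved_bytes
 * free. Page headers and tuple boundaries are ignored for simplicity.
 */
static size_t
pages_needed(size_t data_bytes, size_t reserved_bytes)
{
    size_t usable = BLOCK_SIZE - reserved_bytes;

    return (data_bytes + usable - 1) / usable;  /* round up */
}
```

Under these simplifications, data that fills 7 pages without a reserve needs 8 pages with one, so the page count of a freshly loaded table grows by roughly 14%.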

I think the reason Oracle offers this has to do with their
overwrite-based storage management; it's not obvious that the tradeoff
is as useful for us. There are some relevant threads in our archives
here, here, and here:
http://archives.postgresql.org/pgsql-patches/2005-04/msg00078.php
http://archives.postgresql.org/pgsql-performance/2004-08/msg00402.php
http://archives.postgresql.org/pgsql-performance/2003-10/msg00618.php

regards, tom lane

#3Satoshi Nagayasu
nagayasus@nttdata.co.jp
In reply to: Tom Lane (#2)
Re: Pre-allocated free space for row updating (like PCTFREE)

Tom Lane wrote:

I'm not very enthused about this. Enforcing 12.5% PCTFREE means that
you pay 12.5% extra I/O costs across the board for INSERT and SELECT
and then hope you can make it back (plus some more) on UPDATEs.
pgbench is a completely UPDATE-dominated benchmark and thus it makes
such a patch look much better than it would on other workloads.

Yes. I'm thinking about update-intensive workloads or batch jobs
which generate huge amounts of updates.

I know pgbench is just an update-intensive benchmark; however,
I don't like updates causing many smgrextend() calls and degraded performance,
because there are many workload types in the real world.

I believe some of us need more options for these types of workloads.

(And I also know we need more tricks on page repair.)

I think the reason Oracle offers this has to do with their
overwrite-based storage management; it's not obvious that the tradeoff
is as useful for us. There are some relevant threads in our archives
here, here, and here:

I think this topic keeps being raised because
some people need this.

The important point is that we need several options
for our own workloads (or access patterns).

--
NAGAYASU Satoshi <nagayasus@nttdata.co.jp>

#4Jim C. Nasby
jnasby@pervasive.com
In reply to: Tom Lane (#2)
Re: Pre-allocated free space for row updating (like PCTFREE)

On Sun, Aug 21, 2005 at 09:50:10PM -0400, Tom Lane wrote:

Satoshi Nagayasu <nagayasus@nttdata.co.jp> writes:

I've done a quick hack to implement PCTFREE on PostgreSQL.
...
According to my experiments, pgbench score was improved 10% or more
with 1024 bytes free space.

I'm not very enthused about this. Enforcing 12.5% PCTFREE means that
you pay 12.5% extra I/O costs across the board for INSERT and SELECT
and then hope you can make it back (plus some more) on UPDATEs.
pgbench is a completely UPDATE-dominated benchmark and thus it makes
such a patch look much better than it would on other workloads.

I think the reason Oracle offers this has to do with their
overwrite-based storage management; it's not obvious that the tradeoff
is as useful for us. There are some relevant threads in our archives
here, here, and here:
http://archives.postgresql.org/pgsql-patches/2005-04/msg00078.php
http://archives.postgresql.org/pgsql-performance/2004-08/msg00402.php
http://archives.postgresql.org/pgsql-performance/2003-10/msg00618.php

It should be possible to see what the crossover point is in terms of
benefit using dbt2 and tweaking the transactions that are run, something
I can do if there's interest. But I agree with Satoshi; if there are
people who will benefit from this option (which doesn't hurt those who
choose not to use it), why not put it in?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com 512-569-9461

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jim C. Nasby (#4)
Re: Pre-allocated free space for row updating (like PCTFREE)

"Jim C. Nasby" <jnasby@pervasive.com> writes:

... But I agree with Satoshi; if there are
people who will benefit from this option (which doesn't hurt those who
choose not to use it), why not put it in?

Because there's no such thing as a free lunch. Every option we support
costs us in initial implementation time, documentation effort, and
ongoing maintenance. Plus it confuses users who don't know what to do
with it. (Note Josh's nearby lobbying to remove some GUC parameters.
While I opposed him on that particular item, I sympathize with his
point in general.)

Oracle's approach of "offer every knob you can think of" is not one
that I care to emulate. We have to strike a balance between flexibility
and not having a database that's too complex to administer for anyone
except an expert.

regards, tom lane

#6Satoshi Nagayasu
nagayasus@nttdata.co.jp
In reply to: Tom Lane (#5)
Re: Pre-allocated free space for row updating (like PCTFREE)

Tom Lane wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

... But I agree with Satoshi; if there are
people who will benefit from this option (which doesn't hurt those who
choose not to use it), why not put it in?

Because there's no such thing as a free lunch. Every option we support
costs us in initial implementation time, documentation effort, and
ongoing maintenance. Plus it confuses users who don't know what to do
with it. (Note Josh's nearby lobbying to remove some GUC parameters.
While I opposed him on that particular item, I sympathize with his
point in general.)

Oracle's approach of "offer every knob you can think of" is not one
that I care to emulate. We have to strike a balance between flexibility
and not having a database that's too complex to administer for anyone
except an expert.

I understand what you mean, but I think we have to provide more flexibility
or options for PostgreSQL to be used more widely in the real world.

In my case, if many updates reduce the system performance and there is no option,
our customer will change their DBMS from PostgreSQL to MySQL or Oracle.

If DBAs have fewer options to choose from, the cost of managing (monitoring)
system performance gets higher, because a simple architecture sometimes forces
complex operations (or tricks) in real applications (such as performance vs. vacuum).
That is also part of the user's TCO.

I know there is no free lunch.
However, it also means that if we can pay more, we can get a better lunch.

Just my thought...
--
NAGAYASU Satoshi <nagayasus@nttdata.co.jp>

#7Mark Kirkwood
markir@paradise.net.nz
In reply to: Jim C. Nasby (#4)
Re: Pre-allocated free space for row updating (like PCTFREE)

Jim C. Nasby wrote:

It should be possible to see what the crossover point is in terms of
benefit using dbt2 and tweaking the transactions that are run, something
I can do if there's interest. But I agree with Satoshi; if there are
people who will benefit from this option (which doesn't hurt those who
choose not to use it), why not put it in?

ISTM that this patch could be beneficial for the 'web session table'
type workload (i.e. huge number of updates on relatively few rows), that
is (well - last time I tried anyway) a bit of a challenge to rein in.

There was a thread about this a while ago (late 2004), so in some sense
it is a 'real world' scenario:

http://archives.postgresql.org/pgsql-hackers/2004-06/msg00282.php

regards

Mark

#8Josh Berkus
josh@agliodbs.com
In reply to: Jim C. Nasby (#4)
Re: Pre-allocated free space for row updating (like PCTFREE)

Jim, Satoshi,

It should be possible to see what the crossover point is in terms of
benefit using dbt2 and tweaking the transactions that are run, something
I can do if there's interest. But I agree with Satoshi; if there are
people who will benefit from this option (which doesn't hurt those who
choose not to use it), why not put it in?

Because your predicate is still disputed? That is, we don't know that people
will benefit yet -- pgbench is a pretty useless benchmark for real
performance comparisons.

Satoshi, if you can package up a patch on current CVS, I'll throw it at DBT2.

--
Josh Berkus
Aglio Database Solutions
San Francisco

#9Jim C. Nasby
jnasby@pervasive.com
In reply to: Tom Lane (#5)
Re: Pre-allocated free space for row updating (like PCTFREE)

On Mon, Aug 22, 2005 at 10:18:25PM -0400, Tom Lane wrote:

"Jim C. Nasby" <jnasby@pervasive.com> writes:

... But I agree with Satoshi; if there are
people who will benefit from this option (which doesn't hurt those who
choose not to use it), why not put it in?

Because there's no such thing as a free lunch. Every option we support
costs us in initial implementation time, documentation effort, and
ongoing maintenance. Plus it confuses users who don't know what to do
with it. (Note Josh's nearby lobbying to remove some GUC parameters.
While I opposed him on that particular item, I sympathize with his
point in general.)

Oracle's approach of "offer every knob you can think of" is not one
that I care to emulate. We have to strike a balance between flexibility
and not having a database that's too complex to administer for anyone
except an expert.

The problem is that unless you're going to put a lot of AI in the
database[1] (something Oracle is now doing...), you're going to end up
limiting yourself. As the PostgreSQL code continues to improve
performance-wise, we're going to run into more and more situations where
the way to get more performance means adding more tunables. Look at the
knobs that have been added for bgwriter and delayed vacuum for example.
These were added because the code had gotten to a point where the
problems they solve had become bigger and bigger bottlenecks. I know
there's hope that eventually these can be turned into simple 1-10 knobs
or something, but I'm doubtful that something that simple will suffice
for all situations.

I do understand the issue of having 100s of knobs, though. I don't think
we should go adding knobs willy-nilly (Josh made the good point that
there's currently no testing to validate the usefulness of this free
space knob, for example). But I also think that the way to control
'knob-bloat' isn't to do everything possible not to add knobs, but to
look at how to limit their exposure to people who don't need to know
about them.

For example, there are fewer than half a dozen knobs that people always ask
about when they post performance questions: shared_buffers, work_mem,
effective_cache_size, etc. These are knobs that almost every user needs
to know about. Call them 'level 1' knobs. Level 2 might be things like
vacuum_cost_delay, maintenance_work_mem, max_fsm_pages, and
max_connections. And so on. By grouping in this fashion we can limit
exposure to things that most users won't need to mess with, but give
users who have need to change these things the ability to do so.

[1]: I'm all in favor of making things self-tuning wherever possible,
but that's generally a lot more work than just exposing a GUC, so I
suspect it will be some time before we get to that point.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com 512-569-9461

#10Satoshi Nagayasu
nagayasus@nttdata.co.jp
In reply to: Josh Berkus (#8)
Re: Pre-allocated free space for row updating (like PCTFREE)

Josh Berkus wrote:

Satoshi, if you can package up a patch on current CVS, I'll throw it at DBT2.

Ok. I'll do it.
--
NAGAYASU Satoshi <nagayasus@nttdata.co.jp>

#11Satoshi Nagayasu
nagayasus@nttdata.co.jp
In reply to: Satoshi Nagayasu (#10)
Re: Pre-allocated free space for row updating (like PCTFREE)

Satoshi Nagayasu wrote:

Josh Berkus wrote:

Satoshi, if you can package up a patch on current CVS, I'll throw it at DBT2.

Ok. I'll do it.

I've created a new patch which can be applied to the current cvs tree.

http://dpsql.sourceforge.net/pctfree.cvs.diff

--
NAGAYASU Satoshi <nagayasus@nttdata.co.jp>

#12Josh Berkus
josh@agliodbs.com
In reply to: Satoshi Nagayasu (#11)
Re: Pre-allocated free space for row updating (like PCTFREE)

Satoshi,

I've created a new patch which can be applied to the current cvs tree.

http://dpsql.sourceforge.net/pctfree.cvs.diff

Hmmm ... I don't see where I set the GUC. How am I supposed to vary the
PCTFREE amount?

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

#13Satoshi Nagayasu
nagayasus@nttdata.co.jp
In reply to: Josh Berkus (#12)
Re: Pre-allocated free space for row updating (like PCTFREE)

Josh,

Josh Berkus wrote:

Hmmm ... I don't see where I set the GUC. How am I supposed to vary the
PCTFREE amount?

Well, currently the PCTFREE size (1024 bytes) is fixed in the code,
because this hack was written just to check the effect of PCTFREE.

I will move the variable into the GUC later.

Thanks.
--
NAGAYASU Satoshi <nagayasus@nttdata.co.jp>

#14Simon Riggs
simon@2ndquadrant.com
In reply to: Josh Berkus (#12)
Re: Pre-allocated free space for row updating (like

On Wed, 2005-08-24 at 17:24 -0700, Josh Berkus wrote:

Satoshi,

I've created a new patch which can be applied to the current cvs tree.

http://dpsql.sourceforge.net/pctfree.cvs.diff

Hmmm ... I don't see where I set the GUC. How am I supposed to vary the
PCTFREE amount?

This is strikingly similar to a patch I wrote in February and submitted
in March for performance prototyping (pgsql-patches). We followed up on
that patch with a detailed discussion on how we would implement that
feature. My patch was slated in just the same way this has been (and
rightfully so...).

The summary was:

1. Have a PCTFREE column added on a table by table basis
2. Apply PCTFREE for Inserts only
3. Allow Updates to use the full space in the block.
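
The three points above could be sketched as follows. `TableOptions`, `reserved_bytes`, and `tuple_fits` are hypothetical names, not existing PostgreSQL structures or functions, and the default 8192-byte block size is assumed.

```c
#include <stdbool.h>
#include <stddef.h>

#define BLOCK_SIZE 8192         /* assumed: PostgreSQL's default block size */

/* Hypothetical per-table setting; not an existing PostgreSQL structure. */
typedef struct TableOptions
{
    int pctfree;                /* percent of each page kept free on INSERT */
} TableOptions;

/* Bytes an INSERT must leave free on each page of this table. */
static size_t
reserved_bytes(const TableOptions *opts)
{
    return (size_t) BLOCK_SIZE * (size_t) opts->pctfree / 100;
}

/* Inserts honor the table's reserve; updates may use the full free space. */
static bool
tuple_fits(const TableOptions *opts, size_t tuple_len,
           size_t page_free, bool for_insert)
{
    size_t reserve = for_insert ? reserved_bytes(opts) : 0;

    return tuple_len + reserve <= page_free;
}
```

With `pctfree = 0` the check reduces to the existing behavior, so an insert-only table such as HISTORY would be unaffected.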

Having PCTFREE set for all tables will not produce a good performance
result. This definitely needs to be on a table by table basis because
different tables have different ratios of INSERT/UPDATE/DELETEs.

If you look at DBT-2, you'll see that only the STOCK table would benefit
from this optimization, since it has 100% UPDATEs and is also the
heaviest hit table in the workload. Other tables would not benefit at
all from having PCTFREE set... for example the HISTORY table which has
100% INSERTs would see a drop in performance as a result.

Best Regards, Simon Riggs

#15Satoshi Nagayasu
nagayasus@nttdata.co.jp
In reply to: Simon Riggs (#14)
Re: Pre-allocated free space for row updating (like PCTFREE)

Simon Riggs wrote:

The summary was:

1. Have a PCTFREE column added on a table by table basis

I think a good place to keep the PCTFREE value is a new column
in pg_class, and ALTER TABLE should be able to change this value.

2. Apply PCTFREE for Inserts only
3. Allow Updates to use the full space in the block.

4. Allow fragmentation within each page to be repaired.

This matters because updates fragment the page.

To make PCTFREE more effective, we need to keep a large
contiguous free area in each page.
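
To illustrate what repairing fragmentation within a single page means, here is a toy slotted page, not PostgreSQL's page layout. All names (`ToyPage`, `ToyItem`, `toy_page_compact`) are invented for this sketch; PostgreSQL's own routine for this job is `PageRepairFragmentation()`.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 256
#define MAX_ITEMS 16

typedef struct ToyItem
{
    size_t off;                 /* start of the tuple within data[] */
    size_t len;                 /* tuple length; 0 marks a dead slot */
} ToyItem;

typedef struct ToyPage
{
    ToyItem       items[MAX_ITEMS];
    int           nitems;
    unsigned char data[PAGE_SIZE];
} ToyPage;

/*
 * Pack all live tuples against the end of data[] and return the size of
 * the single contiguous free area [0, result) that is left. Slot numbers
 * are preserved; only the offsets stored in the slots change.
 */
static size_t
toy_page_compact(ToyPage *page)
{
    unsigned char scratch[PAGE_SIZE];
    size_t        upper = PAGE_SIZE;
    int           i;

    for (i = 0; i < page->nitems; i++)
    {
        ToyItem *item = &page->items[i];

        if (item->len == 0)
            continue;           /* dead slot: its space is reclaimed */
        upper -= item->len;
        memcpy(scratch + upper, page->data + item->off, item->len);
        item->off = upper;
    }
    /* Copy the packed tuples back in one step to avoid overlapping moves. */
    memcpy(page->data + upper, scratch + upper, PAGE_SIZE - upper);
    return upper;
}
```

Keeping slot numbers fixed and changing only offsets means that anything referring to a tuple by slot number is unaffected by the move.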

--
NAGAYASU Satoshi <nagayasus@nttdata.co.jp>

#16Simon Riggs
simon@2ndquadrant.com
In reply to: Satoshi Nagayasu (#15)
Re: Pre-allocated free space for row updating

On Wed, 2005-08-31 at 08:32 +0900, Satoshi Nagayasu wrote:

Simon Riggs wrote:

The summary was:

1. Have a PCTFREE column added on a table by table basis

I think a good place to keep the PCTFREE value is a new column
in pg_class, and ALTER TABLE should be able to change this value.

Agreed

2. Apply PCTFREE for Inserts only
3. Allow Updates to use the full space in the block.

4. Allow fragmentation within each page to be repaired.

This matters because updates fragment the page.

To make PCTFREE more effective, we need to keep a large
contiguous free area in each page.

...doesn't VACUUM already do that?

Anyway, if the setting is for each specific table then the performance
benefit is very clear.

Best Regards, Simon Riggs

#17Satoshi Nagayasu
nagayasus@nttdata.co.jp
In reply to: Simon Riggs (#16)
Re: Pre-allocated free space for row updating (like PCTFREE)

Simon Riggs wrote:

4. Allow fragmentation within each page to be repaired.

This matters because updates fragment the page.

To make PCTFREE more effective, we need to keep a large
contiguous free area in each page.

...doesn't VACUUM already do that?

VACUUM generates a huge load because it repairs all pages
in the table file.

I think a more lightweight repair of a single page
is needed to maintain free space in that specific page.

--
NAGAYASU Satoshi <nagayasus@nttdata.co.jp>

#18Hannu Krosing
hannu@skype.net
In reply to: Satoshi Nagayasu (#17)
Re: Pre-allocated free space for row

On Wed, 2005-08-31 at 16:50 +0900, Satoshi Nagayasu wrote:

Simon Riggs wrote:

4. Allow fragmentation within each page to be repaired.

This matters because updates fragment the page.

To make PCTFREE more effective, we need to keep a large
contiguous free area in each page.

...doesn't VACUUM already do that?

VACUUM generates a huge load because it repairs all pages
in the table file.

I think a more lightweight repair of a single page
is needed to maintain free space in that specific page.

There have been plans floating around for adding a more lightweight
vacuum, which uses something similar to FSM to keep track of pages which
need vacuuming. And possibly integrated with the background writer to make
effective use of I/O resources.

I guess it could be used for this case of "cheap page cleanups" as well.

--
Hannu Krosing <hannu@skype.net>

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#18)
Re: Pre-allocated free space for row

Hannu Krosing <hannu@skype.net> writes:

On Wed, 2005-08-31 at 16:50 +0900, Satoshi Nagayasu wrote:

VACUUM generates a huge load because it repairs all pages
in the table file.

I think a more lightweight repair of a single page
is needed to maintain free space in that specific page.

There have been plans floating around for adding a more lightweight
vacuum, which uses something similar to FSM to keep track of pages which
need vacuuming. And possibly integrated with the background writer to make
effective use of I/O resources.

I guess it could be used for this case of "cheap page cleanups" as well.

Pretty much all of these ideas fall down when you remember that you have
to fix indexes too. There's no such thing as a "cheap page cleanup",
except maybe in a table with no indexes. Cleaning out the indexes
efficiently requires a certain amount of batch processing, which leads
straight back to VACUUM.

regards, tom lane

#20Hannu Krosing
hannu@skype.net
In reply to: Tom Lane (#19)
Re: Pre-allocated free space for row

On Wed, 2005-08-31 at 10:33 -0400, Tom Lane wrote:

Hannu Krosing <hannu@skype.net> writes:

On Wed, 2005-08-31 at 16:50 +0900, Satoshi Nagayasu wrote:

VACUUM generates a huge load because it repairs all pages
in the table file.

I think a more lightweight repair of a single page
is needed to maintain free space in that specific page.

There have been plans floating around for adding a more lightweight
vacuum, which uses something similar to FSM to keep track of pages which
need vacuuming. And possibly integrated with the background writer to make
effective use of I/O resources.

I guess it could be used for this case of "cheap page cleanups" as well.

Pretty much all of these ideas fall down when you remember that you have
to fix indexes too. There's no such thing as a "cheap page cleanup",
except maybe in a table with no indexes. Cleaning out the indexes
efficiently requires a certain amount of batch processing, which leads
straight back to VACUUM.

What I was aiming for here is the case where bgwriter kicks in after it is
safe to do the cleanup but before the changed page and its changed
index pages are flushed to disk.

I think that for OLTP scenarios this is what happens quite often.

Even more so if we consider that we do mark guaranteed-invisible pages
in index as well.

My wild guess is that deleting all index pointers for a removed index is
more-or-less the same cost as creating new ones for inserted/updated
page. If so, the max cost factor for doing so is 2X, but usually less,
as many of the needed pages are already in memory even at the time when
it is safe to remove old tuple, which in OLTP usage is a few seconds
(usually even less than a second) after the original delete is done.

It is often more agreeable to take a continuous up-to-2X performance hit
than an unpredictable hit at unknown (or even at a known) time.

--
Hannu Krosing <hannu@skype.net>

#21Tom Lane
tgl@sss.pgh.pa.us
In reply to: Hannu Krosing (#20)
Re: Pre-allocated free space for row

Hannu Krosing <hannu@skype.net> writes:

My wild guess is that deleting all index pointers for a removed index is
more-or-less the same cost as creating new ones for inserted/updated
page.

Only if you are willing to make the removal process recalculate the
index keys from looking at the deleted tuple. This opens up a ton of
gotchas for user-defined index functions, particularly for doing it in
the bgwriter which is not really capable of running transactions.
Removing index entries also requires writing WAL log records, which
is something we probably want to minimize in the bgwriter to avoid
contention issues.

It is often more agreeable to take a continuous up-to-2X performance hit
than an unpredictable hit at unknown (or even at a known) time.

Well, you can have that sort of tradeoff today, by running autovacuum
continuously with the right delay parameters.

The only vacuum optimization idea I've heard that makes any sense to me
is the one about keeping a bitmap of changed pages so that vacuum need
not read in pages that have not changed since last time. Everything
else is just shuffling the same work around, and in most cases doing it
less efficiently than we do now and in more performance-critical places.
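
As a rough illustration of the changed-pages bitmap idea, here is a minimal
sketch in C. It is not PostgreSQL code: the PageBitmap type and every
function name below are hypothetical, and the sketch assumes one bit per
heap page that is set whenever an update or delete leaves an expired row
behind. It covers only the heap side; index entries would still need a
separate cleanup pass.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical structure: one bit per heap page, set when the page may
 * contain expired rows. */
typedef struct PageBitmap
{
    uint32_t npages;    /* number of heap pages tracked */
    uint8_t *bits;      /* npages bits, rounded up to whole bytes */
} PageBitmap;

PageBitmap *
page_bitmap_create(uint32_t npages)
{
    PageBitmap *bm = malloc(sizeof(PageBitmap));

    if (bm == NULL)
        return NULL;
    bm->npages = npages;
    bm->bits = calloc((npages + 7) / 8, 1);
    if (bm->bits == NULL)
    {
        free(bm);
        return NULL;
    }
    return bm;
}

void
page_bitmap_free(PageBitmap *bm)
{
    free(bm->bits);
    free(bm);
}

/* Record that a page has changed and may need vacuuming. */
void
page_bitmap_mark(PageBitmap *bm, uint32_t page)
{
    assert(page < bm->npages);
    bm->bits[page / 8] |= (uint8_t) (1u << (page % 8));
}

int
page_bitmap_test(const PageBitmap *bm, uint32_t page)
{
    assert(page < bm->npages);
    return (bm->bits[page / 8] >> (page % 8)) & 1;
}

/* After a crash the bitmap cannot be trusted, so mark every page and let
 * the next pass fall back to a full scan. */
void
page_bitmap_invalidate(PageBitmap *bm)
{
    memset(bm->bits, 0xFF, (bm->npages + 7) / 8);
}

/* Visit only the marked pages, clearing each bit as the page is handled.
 * clean_page may be NULL. Returns the number of pages visited. */
uint32_t
vacuum_marked_pages(PageBitmap *bm, void (*clean_page) (uint32_t page))
{
    uint32_t visited = 0;

    for (uint32_t page = 0; page < bm->npages; page++)
    {
        if (!page_bitmap_test(bm, page))
            continue;
        if (clean_page != NULL)
            clean_page(page);
        bm->bits[page / 8] &= (uint8_t) ~(1u << (page % 8));
        visited++;
    }
    return visited;
}
```

With 100 pages tracked and two of them marked, vacuum_marked_pages() visits
two pages instead of 100; after page_bitmap_invalidate() it visits all 100
again.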

regards, tom lane

#22Simon Riggs
simon@2ndquadrant.com
In reply to: Satoshi Nagayasu (#17)
Re: Pre-allocated free space for row

On Wed, 2005-08-31 at 16:50 +0900, Satoshi Nagayasu wrote:

Simon Riggs wrote:

4. Allow to repair fragmentation in each page.

Because updates cause fragmentation in the page.

So we need to keep large continuous free space in each page,
if we want to get more effective on PCTFREE feature.

...doesn't VACUUM already do that?

VACUUM generates a huge load because it repairs all pages
on the table file.

I think (more light-weight) repairing on a single page
is needed to maintain free space in the specific page.

So PCTFREE is an OK idea, but let's drop #4, which is a separate idea and
not one that has gained consensus.

Best Regards, Simon Riggs

#23Hannu Krosing
hannu@skype.net
In reply to: Tom Lane (#21)
Re: Pre-allocated free space for row

On Wed, 2005-08-31 at 12:23 -0400, Tom Lane wrote:

Hannu Krosing <hannu@skype.net> writes:

My wild guess is that deleting all index pointers for a removed index is
more-or-less the same cost as creating new ones for inserted/updated
page.

Only if you are willing to make the removal process recalculate the
index keys from looking at the deleted tuple. This opens up a ton of
gotchas for user-defined index functions, particularly for doing it in
the bgwriter which is not really capable of running transactions.

Would it be OK in the non-functional index case?

Removing index entries also requires writing WAL log records, which
is something we probably want to minimize in the bgwriter to avoid
contention issues.

But the WAL log records have to be written at some point anyway, so this
should not increase the general load.

It is often more agreeable to take a continuous up-to-2X performance hit
than an unpredictable hit at unknown (or even at a known) time.

Well, you can have that sort of tradeoff today, by running autovacuum
continuously with the right delay parameters.

The only vacuum optimization idea I've heard that makes any sense to me
is the one about keeping a bitmap of changed pages so that vacuum need
not read in pages that have not changed since last time. Everything
else is just shuffling the same work around, and in most cases doing it
less efficiently than we do now and in more performance-critical places.

Not really, I was aiming at the case where the old and new *index*
entries are also on the same page (quite likely after an update of a
non-index field, or only one of the indexed fields). In this case we are
possibly shuffling around the CPU work, but we have a good chance of
avoiding I/O work. This is similar to placing the updated heap tuple on
the same page as old one to avoid extra page writes.

Another interesting idea is to have a counter in heap tuple for "index
entries pointing to this tuple", so that instead of setting the too-old-
to-be-visible bit, we could just remove the index entry, and decrease
that counter, and remove the counter when it's zero.
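
A minimal sketch of the counter idea in C, under some loudly stated
assumptions: the SketchTuple type and the function names are hypothetical
(this is not PostgreSQL's tuple header), and "remove the counter when it's
zero" is read here as "the tuple can be reclaimed once no index entry points
at it any more".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tuple header; not PostgreSQL's HeapTupleHeaderData. */
typedef struct SketchTuple
{
    uint16_t index_refs;    /* index entries still pointing at this tuple */
    bool     dead;          /* tuple is invisible to every transaction */
    bool     reclaimed;     /* tuple space has been given back to the page */
} SketchTuple;

/* An index entry pointing at the tuple was inserted. */
void
sketch_index_entry_added(SketchTuple *tup)
{
    tup->index_refs++;
}

/*
 * An index scan reached a dead tuple. Instead of only flagging the index
 * entry, the entry is dropped and the counter decremented. Returns true
 * once the last entry is gone and the tuple has been reclaimed.
 */
bool
sketch_index_entry_removed(SketchTuple *tup)
{
    assert(tup->dead);
    assert(tup->index_refs > 0);

    tup->index_refs--;
    if (tup->index_refs == 0)
        tup->reclaimed = true;
    return tup->reclaimed;
}
```

For a dead tuple referenced from two indexes, the first call returns false
and the second returns true. The sketch ignores locking and WAL logging, both
of which a real implementation would have to address.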

--
Hannu Krosing <hannu@skype.net>

#24Zeugswetter Andreas DAZ SD
ZeugswetterA@spardat.at
In reply to: Hannu Krosing (#23)
Re: Pre-allocated free space for row

My wild guess is that deleting all index pointers for a removed index
is more-or-less the same cost as creating new ones for
inserted/updated page.

Only if you are willing to make the removal process
recalculate the index keys from looking at the deleted tuple.

The bgwriter could "update" all columns of dead heap tuples in heap
pages to NULL and thus also gain free space without the need to touch
the indexes.
The slot would stay used but it would need less space.
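
To make the space gain concrete, here is a minimal sketch in C. The
SketchItem type, the function name, and the 32-byte header size are all
assumptions chosen for illustration, not PostgreSQL's actual page layout.
Each dead tuple is shrunk to a bare header while its slot stays in use, so
nothing in the indexes has to change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed fixed header size in bytes; illustrative only. */
#define SKETCH_HEADER_LEN 32

/* Hypothetical per-slot bookkeeping for one heap page. */
typedef struct SketchItem
{
    uint16_t len;   /* bytes the tuple occupies, header included */
    bool     dead;  /* tuple is invisible to every transaction */
} SketchItem;

/*
 * Shrink every dead tuple to a bare header. The slot itself stays in use,
 * so index entries that reference it remain valid. Returns the number of
 * bytes released on the page.
 */
size_t
sketch_truncate_dead(SketchItem *items, size_t nitems)
{
    size_t freed = 0;

    assert(items != NULL || nitems == 0);

    for (size_t i = 0; i < nitems; i++)
    {
        /* Skip live tuples and tuples that are already header-only. */
        if (!items[i].dead || items[i].len <= SKETCH_HEADER_LEN)
            continue;
        freed += (size_t) (items[i].len - SKETCH_HEADER_LEN);
        items[i].len = SKETCH_HEADER_LEN;
    }
    return freed;
}
```

On a page holding a dead 200-byte tuple, a live 150-byte tuple, and a dead
header-only tuple, one pass releases 168 bytes and leaves the live tuple
untouched.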

Andreas

#25Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas DAZ SD (#24)
Re: Pre-allocated free space for row

"Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes:

The bgwriter could "update" all columns of dead heap tuples in heap
pages to NULL and thus also gain free space without the need to touch
the indexes.
The slot would stay used but it would need less space.

Not unless it's running a transaction (consider TOAST updates).

regards, tom lane

#26Zeugswetter Andreas DAZ SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#25)
Re: Pre-allocated free space for row

The bgwriter could "update" all columns of dead heap tuples in heap
pages to NULL and thus also gain free space without the need to touch
the indexes.
The slot would stay used but it would need less space.

Not unless it's running a transaction (consider TOAST updates).

Ok, you could leave all toast pointers and the toast table as is.

Andreas

#27Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#21)
Re: Pre-allocated free space for row

Tom Lane wrote:

Hannu Krosing <hannu@skype.net> writes:

My wild guess is that deleting all index pointers for a removed index is
more-or-less the same cost as creating new ones for inserted/updated
page.

Only if you are willing to make the removal process recalculate the
index keys from looking at the deleted tuple. This opens up a ton of
gotchas for user-defined index functions, particularly for doing it in
the bgwriter which is not really capable of running transactions.
Removing index entries also requires writing WAL log records, which
is something we probably want to minimize in the bgwriter to avoid
contention issues.

It is often more agreeable to take a continuous up-to-2X performance hit
than an unpredictable hit at unknown (or even at a known) time.

Well, you can have that sort of tradeoff today, by running autovacuum
continuously with the right delay parameters.

The only vacuum optimization idea I've heard that makes any sense to me
is the one about keeping a bitmap of changed pages so that vacuum need
not read in pages that have not changed since last time. Everything
else is just shuffling the same work around, and in most cases doing it
less efficiently than we do now and in more performance-critical places.

I assume that for a vacuum that only hit pages indicated in the bitmap,
it would still be necessary to do an index scan to remove the heap
pointers in the index, right?

I have added the last sentence to the TODO entry:

* Create a bitmap of pages that need vacuuming

Instead of sequentially scanning the entire table, have the background
writer or some other process record pages that have expired rows, then
VACUUM can look at just those pages rather than the entire table. In
the event of a system crash, the bitmap would probably be invalidated.
One complexity is that index entries still have to be vacuumed, and
doing this without an index scan (by using the heap values to find the
index entry) might be slow and unreliable, especially for user-defined
index functions.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#27)
Re: Pre-allocated free space for row

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I assume that for a vacuum that only hit pages indicated in the bitmap,
it would still be necessary to do an index scan to remove the heap
pointers in the index, right?

Given the current vacuum technology, yes. However, bearing in mind that
indexes should generally be much smaller than their tables, cutting down
the table traversal is certainly the first-order problem. (See also
discussion with Simon from today.)

regards, tom lane