BufferSync and bgwriter

Started by Simon Riggs about 21 years ago (21 messages)
#1 Simon Riggs
simon@2ndquadrant.com
1 attachment(s)

The idea that bgwriter smooths out the response time of transactions is
only true if the buffer lists T1 and T2 have *some* clean buffers
available for use when performing I/O. The alternative is that
transactions unlucky enough to encounter the no-clean-buffers situation
have to clean a space for themselves, effectively making the bgwriter
redundant.

In BufferSync, we start off by calling StrategyDirtyBufferList to make a
list of all the dirty buffers. Even though we know we are limited to
maxpages, we still scan the whole of shared_buffers (...making it a very
expensive call and thereby causing us to increase bgwriter_delay, which
then negates the cleaning effect as described above).

Once we've got the list, we limit ourselves to only using maxpages of
the list that we just built. We do it that way round to allow
bgwriter_percent to calculate how many of the dirty buffers it should
flush, on the assumption that percent < 100.
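
For illustration, the existing arithmetic works like this (a standalone
sketch, not the bufmgr.c code itself; the variable names mirror
BufferSync() but the input values are invented):

    #include <stdio.h>

    /* Standalone illustration of the 8.0 BufferSync() clamping arithmetic.
     * num_buffer_dirty, percent and maxpages mirror the variables in
     * bufmgr.c; the starting values are made up for the example. */
    int main(void)
    {
        int num_buffer_dirty = 8500;   /* dirty buffers found by the full scan */
        int percent = 2;               /* bgwriter_percent */
        int maxpages = 100;            /* bgwriter_maxpages */

        /* Round up: 8500 * 2 / 100 = 170 (the "+ 99" makes it a ceiling) */
        if (percent > 0)
            num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;

        /* ...then cap at maxpages: 170 becomes 100 */
        if (maxpages > 0 && num_buffer_dirty > maxpages)
            num_buffer_dirty = maxpages;

        printf("buffers to write this cycle: %d\n", num_buffer_dirty);
        return 0;
    }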

If the bgwriter_percent = 100, then we should actually do the sensible
thing and prepare the list that we need, i.e. limit
StrategyDirtyBufferList to finding at most bgwriter_maxpages.

Thus if you have a large shared_buffers, you can still have a relatively
short bgwriter_delay, so that the bgwriter can keep the LRUs of the
T1 and T2 lists free for use...and so let backends get on with useful
work.

Patch which implements this attached, for discussion.

Mark, any chance we could run this patch on STP to test whether it has a
beneficial performance effect? Re-run test 207 to compare?

I'll be asking for this in 8.0, if it works, for all the same
performance reasons discussed previously as well as coming under the
header of "bgwriter default changes" since this affects the default
behaviour when bgwriter_percent=100.

There are some other ideas for 8.1, but that can wait.

--
Best Regards, Simon Riggs

Attachments:

100pct.patch (text/x-patch)
Index: buffer/bufmgr.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.182
diff -d -c -r1.182 bufmgr.c
*** buffer/bufmgr.c	24 Nov 2004 02:56:17 -0000	1.182
--- buffer/bufmgr.c	11 Dec 2004 17:09:31 -0000
***************
*** 681,686 ****
--- 681,687 ----
  {
  	BufferDesc **dirty_buffers;
  	BufferTag  *buftags;
+     int         maxdirty;
  	int			num_buffer_dirty;
  	int			i;
  
***************
*** 688,704 ****
  	if (percent == 0 || maxpages == 0)
  		return 0;
  
  	/*
  	 * Get a list of all currently dirty buffers and how many there are.
  	 * We do not flush buffers that get dirtied after we started. They
  	 * have to wait until the next checkpoint.
  	 */
! 	dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
! 	buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));
  
  	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
! 	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
! 											   NBuffers);
  
  	/*
  	 * If called by the background writer, we are usually asked to only
--- 689,714 ----
  	if (percent == 0 || maxpages == 0)
  		return 0;
  
+     /* If we know we will write all dirty buffers, up to the limit of
+      * maxpages, then we can make a cheaper call to StrategyDirtyBufferList.
+      */
+ 	if (percent == 100)
+     	maxdirty = maxpages;
+     else
+     	maxdirty = NBuffers;
+ 
  	/*
  	 * Get a list of all currently dirty buffers and how many there are.
  	 * We do not flush buffers that get dirtied after we started. They
  	 * have to wait until the next checkpoint.
  	 */
! 	dirty_buffers = (BufferDesc **) palloc(maxdirty * sizeof(BufferDesc *));
! 	buftags = (BufferTag *) palloc(maxdirty * sizeof(BufferTag));
  
  	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
!    	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
! 											   maxdirty);
  
  	/*
  	 * If called by the background writer, we are usually asked to only
#2 Neil Conway
neilc@samurai.com
In reply to: Simon Riggs (#1)
Re: [Testperf-general] BufferSync and bgwriter

I wonder if we even need to retain the bgwriter_percent GUC var. Is
there actually a situation in which the combination of bgwriter_maxpages
and bgwriter_delay does not give the DBA sufficient flexibility in
tuning bgwriter behavior?

Simon Riggs wrote:

If the bgwriter_percent = 100, then we should actually do the sensible
thing and prepare the list that we need, i.e. limit
StrategyDirtyBufferList to finding at most bgwriter_maxpages.

Is the plan to make bgwriter_percent = 100 the default setting?

-Neil

#3 Simon Riggs
simon@2ndquadrant.com
In reply to: Neil Conway (#2)
1 attachment(s)
Re: [Testperf-general] BufferSync and bgwriter

On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
Simon Riggs wrote:

If the bgwriter_percent = 100, then we should actually do the sensible
thing and prepare the list that we need, i.e. limit
StrategyDirtyBufferList to finding at most bgwriter_maxpages.

Is the plan to make bgwriter_percent = 100 the default setting?

Hmm...must confess that my only plan is:
i) discover dynamic behaviour of bgwriter
ii) fix any bugs or weirdness as quickly as possible
iii) try to find a way to set the bgwriter defaults

I'm worried that we're late in the day for changes, but I'm equally
worried that a) the bgwriter is very tuning sensitive b) we don't really
have much info on how to set the defaults in a meaningful way for the
majority of cases c) there are some issues that greatly reduce the
effectiveness of the bgwriter in many circumstances.

The 100pct.patch was my first attempt at getting something acceptable in
the next few days that gives sufficient room for the DBA to perform
tuning.

On Sun, 2004-12-12 at 05:46, Neil Conway wrote:

I wonder if we even need to retain the bgwriter_percent GUC var. Is
there actually a situation in which the combination of bgwriter_maxpages
and bgwriter_delay does not give the DBA sufficient flexibility in
tuning bgwriter behavior?

Yes, I do now think that only two GUCs are required to tune the
behaviour; but you make me think - which two? Right now, bgwriter_delay
is useless because the O(N) behaviour makes it impossible to set any
lower when you have a large shared_buffers. (I see that as a bug)

Your question has made me rethink the exact objective of the bgwriter's
actions: The way it is coded now the bgwriter looks for dirty blocks, no
matter where they are in the list. What we are bothered about is the
number of clean buffers at the LRU, which has a direct influence on the
probability that BufferAlloc() will need to call FlushBuffer(), since
StrategyGetBuffer() returns the first unpinned buffer, dirty or not.
After further thought, I would prefer a subtle change in behaviour so
that the bgwriter checks that clean blocks are available at the LRUs for
when buffer replacement occurs. With that slight change, I'd keep the
bgwriter_percent GUC but make it mean something different.
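
As a rough sketch of why the LRU end matters (simplified stand-ins, not
the real bufmgr structures):

    #include <stdio.h>
    #include <stdbool.h>

    /* Simplified stand-ins for the bufmgr behaviour described above;
     * an illustration, not PostgreSQL code. */
    typedef struct { bool pinned; bool dirty; } Buf;

    static Buf pool[4] = {
        { true,  true  },   /* pinned: skipped */
        { false, true  },   /* first unpinned -- but dirty */
        { false, false },
        { false, false },
    };

    /* StrategyGetBuffer() analogue: returns the first unpinned buffer,
     * dirty or not. */
    static Buf *get_victim(void)
    {
        for (int i = 0; i < 4; i++)
            if (!pool[i].pinned)
                return &pool[i];
        return NULL;
    }

    int main(void)
    {
        Buf *victim = get_victim();

        /* If no clean buffers sit at the LRU end, the backend pays for
         * the write itself -- the stall the bgwriter should prevent. */
        if (victim && victim->dirty)
            printf("backend must FlushBuffer() before reuse\n");
        return 0;
    }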

bgwriter_percent would be the % of shared_buffers that are searched
(from the LRU end) to see if they contain dirty buffers, which are then
written to disk. That means the number of dirty blocks written to disk
is between 0 and the number of buffers searched, but we're not hugely
bothered what that number is... [This change to StrategyDirtyBufferList
resolves the unusability of the bgwriter with large shared_buffers]
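
A minimal standalone sketch of the proposed bounded search, with a plain
array standing in for the ARC CDB lists (all names and demo values are
invented):

    #include <stdio.h>
    #include <stdbool.h>

    #define NBUFFERS 1000

    typedef struct { bool dirty; } Buf;

    /* lru_order[0] is the LRU end; a stand-in for walking the T1/T2
     * lists head-first as StrategyDirtyBufferList() does. */
    static Buf pool[NBUFFERS];
    static int lru_order[NBUFFERS];

    /* Proposed behaviour: examine only the first N% of buffers from the
     * LRU end, returning however many of them happen to be dirty. */
    static int collect_dirty_near_lru(int percent, int *out)
    {
        int max_search = (NBUFFERS * percent + 99) / 100;  /* 2% of 1000 = 20 */
        int ndirty = 0;

        for (int i = 0; i < max_search; i++)
        {
            int buf_id = lru_order[i];
            if (pool[buf_id].dirty)
                out[ndirty++] = buf_id;
        }
        return ndirty;   /* anywhere between 0 and max_search */
    }

    int main(void)
    {
        int victims[NBUFFERS];

        for (int i = 0; i < NBUFFERS; i++)
        {
            lru_order[i] = i;
            pool[i].dirty = (i % 3 == 0);   /* arbitrary demo pattern */
        }
        printf("dirty buffers in the 2%% window: %d\n",
               collect_dirty_near_lru(2, victims));
        return 0;
    }

The scan cost is then fixed by the window size, not by how dirty the
pool is.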

Writing away dirty blocks towards the MRU end is more likely to be
wasted effort. If a block stays near the MRU then it will be dirty again
in the wink of an eye, so you gain nothing at checkpoint time by
cleaning it. Also, since it isn't near the LRU, cleaning it has no
effect on buffer replacement I/O. If a block is at the LRU, then it is
by definition the least likely to be reused, and is a candidate for
replacement anyway. So concentrating on the LRU, not the number of dirty
buffers seems to be the better thing to do.

That would then be a much simpler way of setting the defaults. With that
definition, we would set the defaults:

bgwriter_percent = 2 (according to my new suggestion here)
bgwriter_delay = 200
bgwriter_maxpages = -1 (i.e. mostly ignore it, but keep it for fine
tuning)

Thus, for the default shared_buffers=1000 the bgwriter would clear a
space of up to 20 blocks each cycle.
For a config with shared_buffers=60000, the bgwriter default would clear
space for 1200 blocks (max) each cycle - a reasonable setting.

Overall that would need very little specific tuning, because it would
scale upwards as you changed the shared_buffers higher.

So, that interpretation of bgwriter_percent gives these advantages:
- we bound the StrategyDirtyBufferList scan to a small % of the whole
list, rather than the whole list...so we could realistically set the
bgwriter_delay lower if required
- we can set a default that scales, so would not often need to change it
- the parameter is defined in terms of the thing we really care about:
sufficient clean blocks at the LRU of the buffer lists
- these changes are very isolated and actually minor - just a different
way of specifying which buffers the bgwriter will clean

Patch attached...again for discussion and to help understanding of this
proposal. Will submit to patches if we agree it seems like the best way
to allow the bgwriter defaults to be sensibly set.

[...and yes, everybody, I do know where we are in the release cycle]

--
Best Regards, Simon Riggs

Attachments:

bg2.patch (text/x-patch)
Index: buffer/bufmgr.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.182
diff -d -c -r1.182 bufmgr.c
*** buffer/bufmgr.c	24 Nov 2004 02:56:17 -0000	1.182
--- buffer/bufmgr.c	12 Dec 2004 21:53:10 -0000
***************
*** 681,686 ****
--- 681,687 ----
  {
  	BufferDesc **dirty_buffers;
  	BufferTag  *buftags;
+     int         maxdirty;
  	int			num_buffer_dirty;
  	int			i;
  
***************
*** 688,717 ****
  	if (percent == 0 || maxpages == 0)
  		return 0;
  
  	/*
  	 * Get a list of all currently dirty buffers and how many there are.
  	 * We do not flush buffers that get dirtied after we started. They
  	 * have to wait until the next checkpoint.
  	 */
! 	dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
! 	buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));
  
  	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
- 	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
- 											   NBuffers);
  
! 	/*
! 	 * If called by the background writer, we are usually asked to only
! 	 * write out some portion of dirty buffers now, to prevent the IO
! 	 * storm at checkpoint time.
! 	 */
! 	if (percent > 0)
! 	{
! 		Assert(percent <= 100);
! 		num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;
! 	}
! 	if (maxpages > 0 && num_buffer_dirty > maxpages)
! 		num_buffer_dirty = maxpages;
  
  	/* Make sure we can handle the pin inside the loop */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 689,720 ----
  	if (percent == 0 || maxpages == 0)
  		return 0;
  
+     /* Set the number of buffers we will clean at the LRUs of the buffer
+      * lists. If no limits are set, then clean the whole of shared_buffers.
+      */
+ 	if (maxpages > 0)
+     	maxdirty = maxpages;
+     else {
+     	if (percent > 0) {
+        		Assert(percent <= 100);
+         	maxdirty = (NBuffers * percent + 99) / 100;
+         }
+         else
+         	maxdirty = NBuffers;
+     }
+ 
  	/*
  	 * Get a list of all currently dirty buffers and how many there are.
  	 * We do not flush buffers that get dirtied after we started. They
  	 * have to wait until the next checkpoint.
  	 */
! 	dirty_buffers = (BufferDesc **) palloc(maxdirty * sizeof(BufferDesc *));
! 	buftags = (BufferTag *) palloc(maxdirty * sizeof(BufferTag));
  
  	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
  
!    	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
! 											   maxdirty);
  
  	/* Make sure we can handle the pin inside the loop */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
Index: buffer/freelist.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.48
diff -d -c -r1.48 freelist.c
*** buffer/freelist.c	16 Sep 2004 16:58:31 -0000	1.48
--- buffer/freelist.c	12 Dec 2004 21:53:11 -0000
***************
*** 735,741 ****
   * StrategyDirtyBufferList
   *
   * Returns a list of dirty buffers, in priority order for writing.
-  * Note that the caller may choose not to write them all.
   *
   * The caller must beware of the possibility that a buffer is no longer dirty,
   * or even contains a different page, by the time he reaches it.  If it no
--- 735,740 ----
***************
*** 755,760 ****
--- 754,760 ----
  	int			cdb_id_t2;
  	int			buf_id;
  	BufferDesc *buf;
+ 	int			i;
  
  	/*
  	 * Traverse the T1 and T2 list LRU to MRU in "parallel" and add all
***************
*** 765,771 ****
  	cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
  	cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
  
! 	while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
  	{
  		if (cdb_id_t1 >= 0)
  		{
--- 765,771 ----
  	cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
  	cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
  
! 	for (i = 0; i < max_buffers; i++)
  	{
  		if (cdb_id_t1 >= 0)
  		{
***************
*** 779,786 ****
  					buffers[num_buffer_dirty] = buf;
  					buftags[num_buffer_dirty] = buf->tag;
  					num_buffer_dirty++;
- 					if (num_buffer_dirty >= max_buffers)
- 						break;
  				}
  			}
  
--- 779,784 ----
***************
*** 799,806 ****
  					buffers[num_buffer_dirty] = buf;
  					buftags[num_buffer_dirty] = buf->tag;
  					num_buffer_dirty++;
- 					if (num_buffer_dirty >= max_buffers)
- 						break;
  				}
  			}
  
--- 797,802 ----
#4 Neil Conway
neilc@samurai.com
In reply to: Simon Riggs (#3)
Re: [Testperf-general] BufferSync and bgwriter

On Sun, 2004-12-12 at 22:08 +0000, Simon Riggs wrote:

On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
Is the plan to make bgwriter_percent = 100 the default setting?

Hmm...must confess that my only plan is:
i) discover dynamic behaviour of bgwriter
ii) fix any bugs or weirdness as quickly as possible
iii) try to find a way to set the bgwriter defaults

I was just curious why you were bothering to special-case
bgwriter_percent = 100 if it's not going to be the default setting (in
which case I would be surprised if more than 1 in 10 users would take
advantage of the patch).

Right now, bgwriter_delay
is useless because the O(N) behaviour makes it impossible to set any
lower when you have a large shared_buffers.

BTW, I wouldn't be _too_ worried about O(N) behavior, except that we do
this scan while holding the BufMgrLock, which is a well known source of
contention. So reducing the time we hold that lock would be good.

Your question has made me rethink the exact objective of the bgwriter's
actions: The way it is coded now the bgwriter looks for dirty blocks, no
matter where they are in the list.

Not sure what you mean. StrategyDirtyBufferList() returns the specified
number of dirty buffers in order, starting with the T1/T2 LRUs and going
back to the MRUs of both lists. bgwriter_percent effectively ignores
some portion of the tail of that list, so we end up just flushing the
buffers closest to the T1/T2 LRUs. How is this different from what
you're describing?

bgwriter_percent would be the % of shared_buffers that are searched
(from the LRU end) to see if they contain dirty buffers, which are
then written to disk.

By definition, buffers closest to the LRU end of the lists are not
frequently accessed. If we only search the N% of the lists closest to
LRU, we will probably end up flushing just those pages to disk -- and
then not flushing anything else to disk in the subsequent bgwriter calls
because all the buffers close to the LRU will be non-dirty. That's okay
if all we're concerned about is avoiding write() by a real backend, but
we also want to smooth out checkpoint load, which I don't think this
approach would do well.

I suggest just getting rid of bgwriter_percent: AFAICS bgwriter_maxpages
is all the tuning we need, and I think "max # of pages to write" is a
simpler and more logical tuning knob than "% of the buffer pool to scan
looking for dirty buffers." So at each bufmgr invocation, we pick at
most bgwriter_maxpages dirty pages from the pool, using the pages
closest to the LRUs of T1 and T2. I'd be happy to supply a patch to
implement that if you think it sounds okay.

-Neil

#5 Mark Kirkwood
markir@coretech.co.nz
In reply to: Simon Riggs (#3)
1 attachment(s)
Re: [Testperf-general] BufferSync and bgwriter

Simon,

I am seeing a reasonably reproducible performance boost after applying
your patch (I'm not sure if that was one of the main objectives, but it
certainly is nice).

I *was* seeing a noticeable decrease between 7.4.6 and 8.0.0RC1 running
pgbench. However, after applying your patch, 8.0 is pretty much back to
being the same.

Now I know pgbench is ..err... not always the most reliable for this
sort of thing, so I am interested if this seems like a reasonable sort
of thing to be noticing (and also if anyone else has noticed the
decrement)?

(The attached brief results are for Linux x86, but I can see a similar
performance decrement 7.4.6->8.0.0RC1 on FreeBSD 5.3 x86)

regards

Mark
Simon Riggs wrote:

Hmm...must confess that my only plan is:
i) discover dynamic behaviour of bgwriter
ii) fix any bugs or weirdness as quickly as possible
iii) try to find a way to set the bgwriter defaults

Attachments:

pgbench.results (text/plain)
#6 Simon Riggs
simon@2ndquadrant.com
In reply to: Mark Kirkwood (#5)
Re: [Testperf-general] BufferSync and bgwriter

On Mon, 2004-12-13 at 04:39, Mark Kirkwood wrote:

I am seeing a reasonably reproducible performance boost after applying
your patch (I'm not sure if that was one of the main objectives, but it
certainly is nice).

I *was* seeing a noticeable decrease between 7.4.6 and 8.0.0RC1 running
pgbench. However, after applying your patch, 8.0 is pretty much back to
being the same.

Thanks Mark - brilliant to have some confirming test results back so
quickly.

The tests indicate that we're on the right track here and that we should
test this on the OSDL platform also on a long run, to check out the
effects of both normal running and checkpointing.

Given these test settings:
bgwriter_delay = 200
bgwriter_percent = 2
bgwriter_maxpages = 100

This shows the importance of reducing the time the BufMgrLock is held in
StrategyDirtyBufferList() -- which I think Neil also agrees is the main
problem here.

______________________________________________________________________
System
------
P4 2.8Ghz 1G 1xSeagate Barracuda 40G
Linux 2.6.9 glibc 2.3.3 gcc 3.4.2
Postgresql 7.4.6 | 8.0.0RC1

Test
----
Pgbench with scale factor = 200

Pg 7.4.6
--------

clients  transactions  tps
      1          1000  65.1
      2          1000  72.5
      4          1000  69.2
      8          1000  48.3

Pg 8.0.0RC1
-----------

clients  transactions  tps   tps (new buff patch + settings)
      1          1000  55.8  70.9
      2          1000  68.3  77.9
      4          1000  38.4  62.8
      8          1000  29.4  38.1

(averages over 3 runs, database dropped and recreated after each set, with a
checkpoint performed after each individual run)

Parameters
----------

Non default postgresql.conf parameters:

tcpip_socket = true [listen_addresses = "*"]
max_connections = 100
shared_buffers = 10000
wal_buffers = 1024
checkpoint_segments = 10
effective_cache_size = 40000
random_page_cost = 0.8

bgwriter settings (used with patch only)

bgwriter_delay = 200
bgwriter_percent = 2
bgwriter_maxpages = 100

--
Best Regards, Simon Riggs

#7 Simon Riggs
simon@2ndquadrant.com
In reply to: Neil Conway (#4)
1 attachment(s)
Re: [Testperf-general] BufferSync and bgwriter

On Mon, 2004-12-13 at 02:43, Neil Conway wrote:

On Sun, 2004-12-12 at 22:08 +0000, Simon Riggs wrote:

On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
Is the plan to make bgwriter_percent = 100 the default setting?

Hmm...must confess that my only plan is:
i) discover dynamic behaviour of bgwriter
ii) fix any bugs or weirdness as quickly as possible
iii) try to find a way to set the bgwriter defaults

I was just curious why you were bothering to special-case
bgwriter_percent = 100 if it's not going to be the default setting (in
which case I would be surprised if more than 1 in 10 users would take
advantage of the patch).

Right now, bgwriter_delay
is useless because the O(N) behaviour makes it impossible to set any
lower when you have a large shared_buffers.

BTW, I wouldn't be _too_ worried about O(N) behavior, except that we do
this scan while holding the BufMgrLock, which is a well known source of
contention. So reducing the time we hold that lock would be good.

Yes, the duration of the BufMgrLock held during StrategyDirtyBufferList
and its effect on system performance is my concern. Reducing that is one
of the primary objectives here (point (ii)).

bgwriter_percent would be the % of shared_buffers that are searched
(from the LRU end) to see if they contain dirty buffers, which are
then written to disk.

By definition, buffers closest to the LRU end of the lists are not
frequently accessed. If we only search the N% of the lists closest to
LRU, we will probably end up flushing just those pages to disk -- and
then not flushing anything else to disk in the subsequent bgwriter calls
because all the buffers close to the LRU will be non-dirty. That's okay
if all we're concerned about is avoiding write() by a real backend, but
we also want to smooth out checkpoint load, which I don't think this
approach would do well.

My argument for that was: N% of lists closest to LRU approach gives
- constant search time (searching for N dirty buffers causes a variable
number of buffers to be searched, so lock time varies...)
- if blocks are no longer used, they eventually migrate to the LRU, so
they then get written away by bgwriter rather than at checkpoint time.
- the blocks near the MRU get dirtied again fairly quickly, so still
need to be flushed again at checkpoint
So, overall, I think this would smooth out the checkpoint load

We've little time left: If we do not manage to perform a performance
test that shows that this argument is valid, then I'd agree that we drop
that idea (for now) because of the risk that it does have the
side-effect you mention.

Longer term, I think possibly having two types of bgwriter activity
would be worthwhile:
1) short and frequent LRU cleaning
2) longer but less frequent mini-checkpoints that reach up towards the
MRU
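
A rough sketch of what that split might look like (every name, knob and
stub here is invented for illustration, not proposed code):

    #include <stdio.h>
    #include <unistd.h>

    /* Invented knobs for illustration only -- not actual GUCs. */
    #define LRU_DELAY_MS        200   /* short, frequent LRU cleaning */
    #define MINICKPT_EVERY       25   /* every Nth cycle, reach toward the MRU */
    #define LRU_MAXPAGES        100
    #define MINICKPT_MAXPAGES  1000

    /* Stubs standing in for the two kinds of write pass. */
    static void write_dirty_from_lru(int maxpages)
    { printf("LRU pass: up to %d pages\n", maxpages); }

    static void write_dirty_toward_mru(int maxpages)
    { printf("mini-checkpoint pass: up to %d pages\n", maxpages); }

    int main(void)
    {
        for (int cycle = 0; cycle < 50; cycle++)
        {
            /* 1) short and frequent: keep the LRU ends stocked with
             *    clean buffers for backends */
            write_dirty_from_lru(LRU_MAXPAGES);

            /* 2) longer but less frequent: also write hotter buffers
             *    nearer the MRU, spreading checkpoint I/O over time */
            if (cycle % MINICKPT_EVERY == 0)
                write_dirty_toward_mru(MINICKPT_MAXPAGES);

            usleep(LRU_DELAY_MS * 1000);
        }
        return 0;
    }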

I suggest just getting rid of bgwriter_percent: AFAICS bgwriter_maxpages
is all the tuning we need, and I think "max # of pages to write" is a
simpler and more logical tuning knob than "% of the buffer pool to scan
looking for dirty buffers." So at each bufmgr invocation, we pick the at
most bgwriter_maxpages dirty pages from the pool, using the pages
closest to the LRUs of T1 and T2. I'd be happy to supply a patch to
implement that if you think it sounds okay.

Whichever way we do it, we agree that bgwriter_maxpages is all the
tuning that you and I need.

My suggestion was to provide both the tuning knob AND removing the need
for the knob completely for the (as you say) 9 out of 10 people that
never will perform any tuning, by using bgwriter_percent to set a value
that is approximately correct all of the time.

Anyway, thanks for taking the time to read all of these postings. We're
clearly agreed on the main aspect of this, AFAICS.

I'd be happy to supply a patch to

implement that if you think it sounds okay.

...my understanding is that you'd only be touching BufferSync() to
simplify it, and to remove all of the bgwriter_percent GUC stuff and its
call path to BufferSync()?

I've hacked my patch down to show what I think you mean for the
BufferSync() changes.... to allow perf comparisons if time allows.
Clearly your own patch will more accurately portray those...

--
Best Regards, Simon Riggs

Attachments:

bg3.patch (text/x-patch)
Index: buffer/bufmgr.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.182
diff -d -c -r1.182 bufmgr.c
*** buffer/bufmgr.c	24 Nov 2004 02:56:17 -0000	1.182
--- buffer/bufmgr.c	12 Dec 2004 21:53:10 -0000
***************
*** 688,717 ****
  	if (percent == 0 || maxpages == 0)
  		return 0;
  
  	/*
  	 * Get a list of all currently dirty buffers and how many there are.
  	 * We do not flush buffers that get dirtied after we started. They
  	 * have to wait until the next checkpoint.
  	 */
! 	dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
! 	buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));
  
  	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
- 	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
- 											   NBuffers);
  
- 	/*
- 	 * If called by the background writer, we are usually asked to only
- 	 * write out some portion of dirty buffers now, to prevent the IO
- 	 * storm at checkpoint time.
- 	 */
- 	if (percent > 0)
- 	{
- 		Assert(percent <= 100);
- 		num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;
- 	}
- 	if (maxpages > 0 && num_buffer_dirty > maxpages)
- 		num_buffer_dirty = maxpages;
  
  	/* Make sure we can handle the pin inside the loop */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 689,720 ----
  	if (percent == 0 || maxpages == 0)
  		return 0;
  
  	/*
  	 * Get a list of all currently dirty buffers and how many there are.
  	 * We do not flush buffers that get dirtied after we started. They
  	 * have to wait until the next checkpoint.
  	 */
! 	dirty_buffers = (BufferDesc **) palloc(maxpages * sizeof(BufferDesc *));
! 	buftags = (BufferTag *) palloc(maxpages * sizeof(BufferTag));
  
  	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
  
!    	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
! 											   maxpages);
  
  	/* Make sure we can handle the pin inside the loop */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
#8 Mark Wong
markw@osdl.org
In reply to: Simon Riggs (#7)
Re: [Testperf-general] BufferSync and bgwriter

Sorry for the delay; here are results with the bg3.patch with database
parameters that should match run 207. I haven't been able to take the
time too look over the results myself, but I tried to make sure this
run was the same as 207:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/207

Mark

#9 Mark Wong
markw@osdl.org
In reply to: Mark Wong (#8)
Re: [Testperf-general] BufferSync and bgwriter

Sorry, wrong link, right one here:
http://www.osdl.org/projects/dbt2dev/results/dev4-010/211

Mark

#10 Simon Riggs
simon@2ndquadrant.com
In reply to: Mark Wong (#9)
Re: [Testperf-general] BufferSync and bgwriter

On Wed, 2004-12-15 at 00:00, Mark Wong wrote:

http://www.osdl.org/projects/dbt2dev/results/dev4-010/211

Thanks Mark for turning that around so quickly. Looks good...

Runs performed to compare:
test 207
http://www.osdl.org/projects/dbt2dev/results/dev4-010/207

test 211 with bg3.patch, which matches Neil's and my option (3)
http://www.osdl.org/projects/dbt2dev/results/dev4-010/211

The overall results show 3% throughput gain. The negative effects of
checkpointing are significantly reduced and this shows up in the New
Order Transaction response time max dropping from 37s to 25s, which
looks like a significant user-visible performance gain. Similar
reduction in max response times is shown for all transaction types:
consistent removal of the longest wait times.

The gains come from greater effectiveness of the bgwriter, which reduces
I/O wait time spikes to almost zero once the shared_buffers are
completely full. (see Processor Utilization graph: wait)

It looks to me that reducing the bgwriter_delay slightly might yield
additional gains, say to 180 or 160. That should now be possible since
the cost of doing so has been greatly reduced. StrategyDirtyBufferList
has now dropped way down the list in oprofile results.

Neil very kindly points out privately that the patch is missing a
sanity check, a bug which has shown up in Neil's testing. That
wouldn't affect these performance results, however. I leave it to Neil
to post a corrected version as a result of his efforts.

I leave it to the consensus to decide whether these results represent
significant gains and whether to add to 8.0, or defer.

Neil's suggestion (2) also needs to be considered - test results
could still show that as the better option, so I keep an open mind.

--
Best Regards, Simon Riggs

#11 Jan Wieck
JanWieck@Yahoo.com
In reply to: Simon Riggs (#3)
Re: [Testperf-general] BufferSync and bgwriter

On 12/12/2004 5:08 PM, Simon Riggs wrote:

On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
Simon Riggs wrote:

If the bgwriter_percent = 100, then we should actually do the sensible
thing and prepare the list that we need, i.e. limit
StrategyDirtyBufferList to finding at most bgwriter_maxpages.

Is the plan to make bgwriter_percent = 100 the default setting?

Hmm...must confess that my only plan is:
i) discover dynamic behaviour of bgwriter
ii) fix any bugs or weirdness as quickly as possible
iii) try to find a way to set the bgwriter defaults

I'm worried that we're late in the day for changes, but I'm equally
worried that a) the bgwriter is very tuning sensitive b) we don't really
have much info on how to set the defaults in a meaningful way for the
majority of cases c) there are some issues that greatly reduce the
effectiveness of the bgwriter in many circumstances.

The 100pct.patch was my first attempt at getting something acceptable in
the next few days that gives sufficient room for the DBA to perform
tuning.

Doesn't cranking up the bgwriter_percent to 100 effectively make the
entire shared memory a write-through cache? In other words, with 100%
the bgwriter will always write all dirty blocks out and it becomes
unlikely to avoid an IO for subsequent modifications to the same data block.

Jan


--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#12 Jan Wieck
JanWieck@Yahoo.com
In reply to: Neil Conway (#4)
Re: [Testperf-general] BufferSync and bgwriter

On 12/12/2004 9:43 PM, Neil Conway wrote:

On Sun, 2004-12-12 at 22:08 +0000, Simon Riggs wrote:

On Sun, 2004-12-12 at 05:46, Neil Conway wrote:
Is the plan to make bgwriter_percent = 100 the default setting?

Hmm...must confess that my only plan is:
i) discover dynamic behaviour of bgwriter
ii) fix any bugs or weirdness as quickly as possible
iii) try to find a way to set the bgwriter defaults

I was just curious why you were bothering to special-case
bgwriter_percent = 100 if it's not going to be the default setting (in
which case I would be surprised if more than 1 in 10 users would take
advantage of the patch).

Right now, bgwriter_delay
is useless because the O(N) behaviour makes it impossible to set any
lower when you have a large shared_buffers.

BTW, I wouldn't be _too_ worried about O(N) behavior, except that we do
this scan while holding the BufMgrLock, which is a well known source of
contention. So reducing the time we hold that lock would be good.

Your question has made me rethink the exact objective of the bgwriter's
actions: The way it is coded now the bgwriter looks for dirty blocks, no
matter where they are in the list.

Not sure what you mean. StrategyDirtyBufferList() returns the specified
number of dirty buffers in order, starting with the T1/T2 LRUs and going
back to the MRUs of both lists. bgwriter_percent effectively ignores
some portion of the tail of that list, so we end up just flushing the
buffers closest to the T1/T2 LRUs. How is this different from what
you're describing?

bgwriter_percent would be the % of shared_buffers that are searched
(from the LRU end) to see if they contain dirty buffers, which are
then written to disk.

By definition, buffers closest to the LRU end of the lists are not
frequently accessed. If we only search the N% of the lists closest to
LRU, we will probably end up flushing just those pages to disk -- and
then not flushing anything else to disk in the subsequent bgwriter calls
because all the buffers close to the LRU will be non-dirty. That's okay
if all we're concerned about is avoiding write() by a real backend, but
we also want to smooth out checkpoint load, which I don't think this
approach would do well.

I suggest just getting rid of bgwriter_percent: AFAICS bgwriter_maxpages
is all the tuning we need, and I think "max # of pages to write" is a
simpler and more logical tuning knob than "% of the buffer pool to scan
looking for dirty buffers." So at each bufmgr invocation, we pick at
most bgwriter_maxpages dirty pages from the pool, using the pages
closest to the LRUs of T1 and T2. I'd be happy to supply a patch to
implement that if you think it sounds okay.

I too don't think that this approach will retain the checkpoint smoothing
effect that the current implementation has.

The real problem is that the "cleaner" the buffer pool is, the longer
the scan for dirty buffers will take because the dirty blocks tend to be
at the very end of the scan order. The real solution for this would be
not to scan the whole pool, but to maintain a separate chain of only
dirty buffers in LRU order.
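
A minimal standalone sketch of such a chain (the structure and names are
invented for illustration, not actual PostgreSQL code): dirty buffers
live on their own doubly-linked list, maintained when a buffer is
dirtied or cleaned, so the bgwriter never walks clean buffers at all:

    #include <stdio.h>

    typedef struct DirtyNode
    {
        int               buf_id;
        struct DirtyNode *prev, *next;
    } DirtyNode;

    static DirtyNode *dirty_head, *dirty_tail;  /* head = LRU end */

    /* Called when a buffer first becomes dirty: append at the MRU end,
     * keeping the chain roughly in order of dirtying. */
    static void mark_dirty(DirtyNode *n)
    {
        n->prev = dirty_tail;
        n->next = NULL;
        if (dirty_tail) dirty_tail->next = n; else dirty_head = n;
        dirty_tail = n;
    }

    /* Called when a buffer is written out: unlink in O(1). */
    static void mark_clean(DirtyNode *n)
    {
        if (n->prev) n->prev->next = n->next; else dirty_head = n->next;
        if (n->next) n->next->prev = n->prev; else dirty_tail = n->prev;
    }

    int main(void)
    {
        DirtyNode bufs[3] = {{.buf_id = 7}, {.buf_id = 42}, {.buf_id = 3}};

        for (int i = 0; i < 3; i++) mark_dirty(&bufs[i]);
        mark_clean(&bufs[1]);

        /* The scan now touches only dirty buffers, however clean the
         * pool is -- no O(NBuffers) walk while holding BufMgrLock. */
        for (DirtyNode *n = dirty_head; n; n = n->next)
            printf("dirty buf %d\n", n->buf_id);
        return 0;
    }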

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#13 Josh Berkus
josh@agliodbs.com
In reply to: Jan Wieck (#12)
Re: [Testperf-general] BufferSync and bgwriter

Jan,

I too don't think that this approach will retain the checkpoint smoothing
effect that the current implementation has.

The real problem is that the "cleaner" the buffer pool is, the longer
the scan for dirty buffers will take because the dirty blocks tend to be
at the very end of the scan order. The real solution for this would be
not to scan the whole pool, but to maintain a separate chain of only
dirty buffers in LRU order.

Hmmm, I've not seen this. For example, with people who are having trouble
with checkpoint spikes on Linux, I've taken to recommending that they call
sync() (via cron) every 5-10 seconds (thanks, Bruce, for the suggestion!).
Believe it or not, this does help smooth out the spikes and give better
overall performance in a many-small-writes situation.
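
As a sketch, the workaround amounts to nothing more than this loop (a
cron job calling sync works just as well; the 5-second interval is the
one suggested above):

    #include <unistd.h>

    int main(void)
    {
        /* Flush dirty kernel buffers on a short fixed interval instead
         * of letting them all hit the disk at checkpoint time. */
        for (;;)
        {
            sync();     /* schedule all dirty kernel pages for write-out */
            sleep(5);   /* every 5 seconds, per the recommendation above */
        }
    }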

Simon, one of the problems with the OSDL-DBT2 test is that it's too steady.
DBT2 gives a constant stream of small writes at a regular, predictable rate.
This does not, in fact, match any real-world application I know.

To allow DBT2 to be used for real bgwriter benchmarking, Mark would need to
change the following:

1) Randomize the timing of the commits, so that sometimes there are only 30
writes/minute, and other times there are 300. A timing pattern that would
produce a "sine wave" with occasional random spikes would be best; in my
experience, OLTP applications tend to have wave-like spikes and lulls.

2) Include a sprinkling of random or regular "large writes" which affect
several tables and 1000's of rows. For example, once per hour, change
10,000 pending orders to "shipped", and archive 10,000 "old orders" to an
archive table.
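
A toy sketch of such a load shape (all constants are illustrative, not a
concrete DBT2 proposal):

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    int main(void)
    {
        const double PI = 3.14159265358979;

        /* Base rate oscillates between roughly 30 and 300 writes/minute
         * on a one-hour wave, with occasional random spikes. */
        for (int minute = 0; minute < 180; minute++)
        {
            double phase = 2.0 * PI * minute / 60.0;   /* 60-minute wave */
            double rate  = 165.0 + 135.0 * sin(phase); /* ~30..300 */

            if (rand() % 60 == 0)                      /* rare random spike */
                rate *= 3.0;

            printf("minute %3d: target %.0f writes/min\n", minute, rate);
        }
        return 0;
    }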

However, this would require "splitting" DBT2; there's the DBT2 which simulates
the TPC-C test, and the DBT2 which will help us tune for real-world
applications. The two tests will not be the same.

--
Josh Berkus
Aglio Database Solutions
San Francisco

#14 Josh Berkus
josh@agliodbs.com
In reply to: Josh Berkus (#13)
Re: [Testperf-general] BufferSync and bgwriter

Folks,

To allow DBT2 to be used for real bgwriter benchmarking, Mark would need to
change the following:

1) Randomize the timing of the commits, so that sometimes there are only 30
writes/minute, and other times there are 300. A timing pattern that would
produce a "sine wave" with occasional random spikes would be best; in my
experience, OLTP applications tend to have wave-like spikes and lulls.

2) Include a sprinkling of random or regular "large writes" which affect
several tables and 1000's of rows. For example, once per hour, change
10,000 pending orders to "shipped", and archive 10,000 "old orders" to an
archive table.

Oh, also we need to:
3) Run the test for 3+ hours after scaling up, and turn on autovacuum.

--
Josh Berkus
Aglio Database Solutions
San Francisco

#15 Simon Riggs
simon@2ndquadrant.com
In reply to: Josh Berkus (#14)
Re: [Testperf-general] BufferSync and bgwriter

Josh Berkus <josh@agliodbs.com> wrote on 15.12.2004, 18:36:53:

Hmmm, I've not seen this. For example, with people who are having trouble
with checkpoint spikes on Linux, I've taken to recommending that they call
sync() (via cron) every 5-10 seconds (thanks, Bruce, for the suggestion!).
Believe it or not, this does help smooth out the spikes and give better
overall performance in a many-small-writes situation.

Yes, but bgwriter needs to issue the writes first before the kernel
cache can be flushed, which is the activity I've been focusing on. If
the bgwriter isn't writing enough, flushing the cache is pointless. If
the bgwriter is writing too much, then that's a waste and likely causing
buffer list contention.

Simon, one of the problems with the OSDL-DBT2 test is that it's too steady.
DBT2 gives a constant stream of small writes at a regular, predictable rate.
This does not, in fact, match any real-world application I know.

Clearly, OSDL-DBT2 is not a real world test! That is its benefit, since
it is heavily instrumented and we are able to re-run it many times
without different parameter settings. The application is well known and
doesn't suffer that badly from factors that would allow certain effects
to be swamped. If it had too much randomness or variation, it would be
difficult to interpret.

My goal has been to tune the server, not to derive marketing numbers.
What DBT-2 does reveal is where contention occurs within the PostgreSQL
server. If things are bad in a static, well known workload then they
will be much more erratic in the choppy waters of the real world.
Simulating reality isn't what any of us need to do - there's always a
system to look at and be confused by its peaks and troughs, user
complaints and hardware idiosyncrasies.

DBT2 is just one workload amongst many you can choose as your "tuning
goal". The investigations on that have, IMHO, been genuinely useful in
discovering performance problems in the server. Mark's efforts to
improve the instrumentation of the tests will be useful on other
workloads also.

I'd encourage you to develop variations of DBT2 that can also be used to
tune the server, hopefully running on OSDL. I support open testing
methods as much as I support open development methods and projects.

DBT3 next...

Best Regards, Simon Riggs

#16 Josh Berkus
josh@agliodbs.com
In reply to: Noname (#15)
Re: [Testperf-general] BufferSync and bgwriter

Simon,

Clearly, OSDL-DBT2 is not a real world test! That is its benefit, since
it is heavily instrumented and we are able to re-run it many times
without different parameter settings. The application is well known and
doesn't suffer that badly from factors that would allow certain effects
to be swamped. If it had too much randomness or variation, it would be
difficult to interpret.

I don't think you followed me. The issue is that for parameters designed to
"smooth out spikes" like bgwriter and vacuum delay, it helps to have really
bad spikes to begin with. There's a possibility that the parameters (and
calculations) that work well for a "steady-state" OLTP application are
actually bad for an application with much more erratic usage, just as a high
sort_mem is good for DSS and bad for OLTP.

Mark's efforts to
improve the instrumentation of the tests will be useful on other
workloads also.

Yep, it's been a lot of help. Heck, this is the first time we've had
parameters based on planned tests and not just anecdotes. That's a huge step
forward.

I'm just suggesting that we can improve the test still further specifically
for testing things like bgwriter.

I'd encourage you to develop variations of DBT2 that can also be used to
tune the server, hopefully running on OSDL. I support open testing
methods as much as I support open development methods and projects.

Yeah, I'll just have to do it in a different programming language ;-b

DBT3 next...

Yes, I started setting up a 200GB DBT3 database on one of OSDL's machines.
You're welcome to it, I don't see myself completing those tests before the
holidays. Want login?

--Josh

--
__Aglio Database Solutions_______________
Josh Berkus Consultant
josh@agliodbs.com www.agliodbs.com
Ph: 415-752-2500 Fax: 415-752-2387
2166 Hayes Suite 200 San Francisco, CA

#17 Zeugswetter Andreas DAZ SD
ZeugswetterA@spardat.at
In reply to: Josh Berkus (#16)
Re: [Testperf-general] BufferSync and bgwriter

Hmmm, I've not seen this. For example, with people who are having trouble
with checkpoint spikes on Linux, I've taken to recommending that they call
sync() (via cron) every 5-10 seconds (thanks, Bruce, for the suggestion!).
Believe it or not, this does help smooth out the spikes and give better
overall performance in a many-small-writes situation.

The reason is imho that the checkpoint otherwise also syncs all other
writes. These can be writes other backends had to do to replace a buffer.
Linux obviously lacks a mechanism to distribute the IO for cached writes
over time ala bgwriter (or does not do it when already faced with an IO bottleneck).

Andreas

#18 Richard Huxton
dev@archonet.com
In reply to: Josh Berkus (#16)
Re: [Testperf-general] BufferSync and bgwriter

Josh Berkus wrote:

Simon,

Clearly, OSDL-DBT2 is not a real world test! That is its benefit, since
it is heavily instrumented and we are able to re-run it many times
without different parameter settings. The application is well known and
doesn't suffer that badly from factors that would allow certain effects
to be swamped. If it had too much randomness or variation, it would be
difficult to interpret.

I don't think you followed me. The issue is that for parameters designed to
"smooth out spikes" like bgwriter and vacuum delay, it helps to have really
bad spikes to begin with. There's a possibility that the parameters (and
calculations) that work well for a "steady-state" OLTP application are
actually bad for an application with much more erratic usage, just as a high
sort_mem is good for DSS and bad for OLTP.

I'm a little concerned that in an erratic, or even just a changing
environment there isn't going to be any set of values that are "correct".

If I've got this right, the behaviour we're trying to get is:
1. Starting from the oldest dirty block,
2. Write as many dirty blocks as you can, but don't...
3. Re-write frequently used blocks too much (wasteful)

So, can we not just keep track of two numbers:
1. Change in the number of dirty blocks this time vs last
2. Number of re-writes we perform (count collisions in a hash or
similar - doesn't need to be perfect).

If #1 is increasing, then we need to become more active (reduce
bgwriter_delay, increase bgwriter_maxpages).
If #2 starts to go up, or goes past some threshold then we reduce
activity (increase bgwriter_delay, decrease bgwriter_maxpages).
If of the last N blocks written, C have been collisions then assume
we've run out of low-activity blocks to write, stop and sleep.

This has a downside that the figures will never be completely accurate,
but has the advantage that it will automatically track activity.
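
A standalone sketch of that feedback loop (every name and threshold here
is invented for illustration):

    #include <stdio.h>

    static int delay_ms = 200;   /* bgwriter_delay analogue    */
    static int maxpages = 100;   /* bgwriter_maxpages analogue */

    static void adapt(int dirty_now, int dirty_last,
                      int rewrites, int written)
    {
        /* #1: dirty count growing -- become more active. */
        if (dirty_now > dirty_last)
        {
            if (delay_ms > 50)   delay_ms -= 10;
            if (maxpages < 1000) maxpages += 10;
        }

        /* #2: too many re-writes of hot blocks -- back off. */
        if (written > 0 && rewrites * 10 > written)   /* >10% collisions */
        {
            if (delay_ms < 1000) delay_ms += 10;
            if (maxpages > 10)   maxpages -= 10;
        }
    }

    int main(void)
    {
        /* One imagined cycle: dirty count rose from 400 to 500, and 5
         * of the 100 buffers written were re-writes. */
        adapt(500, 400, 5, 100);
        printf("delay=%dms maxpages=%d\n", delay_ms, maxpages);
        return 0;
    }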

I'm clearly beyond my technical knowledge here, so if I haven't
understood / it's impractical / will never work, then don't be afraid to
step up and let me know. If it helps, you could always think of me as an
idiot savant who failed his savant exams :-)

--
Richard Huxton
Archonet Ltd

#19 Greg Stark
gsstark@mit.edu
In reply to: Jan Wieck (#11)
Re: [Testperf-general] BufferSync and bgwriter

Jan Wieck <JanWieck@yahoo.com> writes:

Doesn't cranking up the bgwriter_percent to 100 effectively make the entire
shared memory a write-through cache? In other words, with 100% the bgwriter
will always write all dirty blocks out and it becomes unlikely to avoid an IO
for subsequent modifications to the same data block.

If the goal is to not write out hot pages why look in T1 at all? Why not just
flush 100% of the dirty pages from T2 and ignore T1 entirely?

--
greg

#20 Simon Riggs
simon@2ndquadrant.com
In reply to: Richard Huxton (#18)
Re: [Testperf-general] BufferSync and bgwriter

On Thu, 2004-12-16 at 17:54, Richard Huxton wrote:

Josh Berkus wrote:

Clearly, OSDL-DBT2 is not a real world test! That is its benefit, since
it is heavily instrumented and we are able to re-run it many times
without different parameter settings. The application is well known and
doesn't suffer that badly from factors that would allow certain effects
to be swamped. If it had too much randomness or variation, it would be
difficult to interpret.

I don't think you followed me. The issue is that for parameters designed to
"smooth out spikes" like bgwriter and vacuum delay, it helps to have really
bad spikes to begin with. There's a possibility that the parameters (and
calculations) that work well for a "steady-state" OLTP application are
actually bad for an application with much more erratic usage, just as a high
sort_mem is good for DSS and bad for OLTP.

I'm a little concerned that in an erratic, or even just a changing
environment there isn't going to be any set of values that are "correct".

I think this expresses my own thoughts most clearly, however: There have
been many good ideas expressed on this thread, though none of them,
including my own, are IMHO suitable for inclusion in 8.0, given the
stage of the release process we are now at.

***
Please give your support now to the addition of Neil's recent bgwriter
patch to the 8.0 release. It simplifies tuning, is proven to remove a
clear performance blockage, yet does so without changing the underlying
algorithm used by the bgwriter - so there is no case to answer along the
lines that this might not apply in some situations. Neil's bgwriter does
the same thing, just avoids holding a critical lock for longer than it
needs to.
***

I will happily discuss further ideas for 8.1 at a later stage.

I'll be around tomorrow for further discussion and better replies to
different individual points. Please excuse my slow answers during this
debate.

--
Best Regards, Simon Riggs

#21 Simon Riggs
simon@2ndquadrant.com
In reply to: Richard Huxton (#18)
Re: [Testperf-general] BufferSync and bgwriter

On Thu, 2004-12-16 at 17:54, Richard Huxton wrote:

Josh Berkus wrote:

Clearly, OSDL-DBT2 is not a real world test! That is its benefit, since
it is heavily instrumented and we are able to re-run it many times
without different parameter settings. The application is well known and
doesn't suffer that badly from factors that would allow certain effects
to be swamped. If it had too much randomness or variation, it would be
difficult to interpret.

I don't think you followed me. The issue is that for parameters designed to
"smooth out spikes" like bgwriter and vacuum delay, it helps to have really
bad spikes to begin with. There's a possibility that the parameters (and
calculations) that work well for for a "steady-state" OLTP application are
actually bad for an application with much more erratic usage, just as a high
sort_mem is good for DSS and bad for OLTP.

I'm a little concerned that in an erratic, or even just a changing
environment there isn't going to be any set of values that are "correct".

If I've got this right, the behaviour we're trying to get is:
1. Starting from the oldest dirty block,
2. Write as many dirty blocks as you can, but don't...
3. Re-write frequently used blocks too much (wasteful)

So, can we not just keep track of two numbers:
1. Change in the number of dirty blocks this time vs last
2. Number of re-writes we perform (count collisions in a hash or
similar - doesn't need to be perfect).

If #1 is increasing, then we need to become more active (reduce
bgwriter_delay, increase bgwriter_maxpages).
If #2 starts to go up, or goes past some threshold then we reduce
activity (increase bgwriter_delay, decrease bgwriter_maxpages).
If of the last N blocks written, C have been collisions then assume
we've run out of low-activity blocks to write, stop and sleep.

This has a downside that the figures will never be completely accurate,
but has the advantage that it will automatically track activity.

I'm clearly beyond my technical knowledge here, so if I haven't
understood / it's impractical / will never work, then don't be afraid to
step up and let me know. If it helps, you could always think of me as an
idiot savant who failed his savant exams :-)

Richard,

I like your ideas very much.

For 8.1 or beyond, it seems clear to me that a self-adapting bgwriter
with no/few parameters is the way forward.

My first step will be to instrument the bgwriter, so we have more input
about the dynamic behaviour of the ARC lists and their effect. Then use
that information to trial an adaptive mechanism along the general lines
you suggest.

--
Best Regards, Simon Riggs