bgwriter changes

Started by Neil Conway · about 21 years ago · 22 messages
#1 Neil Conway
neilc@samurai.com
1 attachment(s)

In recent discussion[1] with Simon Riggs, there has been some talk of
making some changes to the bgwriter. To summarize the problem, the
bgwriter currently scans the entire T1+T2 buffer lists and returns a
list of all the currently dirty buffers. It then selects a subset of
that list (computed using bgwriter_percent and bgwriter_maxpages) to
flush to disk. Not only does this mean we can end up scanning a
significant portion of shared_buffers for every invocation of the
bgwriter, we also do the scan while holding the BufMgrLock, likely
hurting scalability.

I think a fix for this in some fashion is warranted for 8.0. Possible
solutions:

(1) Special-case bgwriter_percent=100. The only reason we need to return
a list of all the dirty buffers is so that we can choose n% of them to
satisfy bgwriter_percent. That is obviously unnecessary if we have
bgwriter_percent=100. I think this change won't help most users,
*unless* we also change bgwriter_percent=100 in the default configuration.

(2) Remove bgwriter_percent. I have yet to hear anyone argue that
there's an actual need for bgwriter_percent in tuning bgwriter behavior,
and one less GUC var is a good thing, all else being equal. This is
effectively the same as #1 with the default changed, only less flexibility.

(3) Change the meaning of bgwriter_percent, per Simon's proposal. Make
it mean "the percentage of the buffer pool to scan, at most, to look for
dirty buffers". I don't think this is workable, at least not at this
point in the release cycle, because it means we might not smooth out
checkpoint load, one of the primary goals of the bgwriter (in this
proposal bgwriter would only ever consider writing out a small subset of
the total shared buffer cache: the least-recently-used n%, with 2% being
a suggested default). Some variant of this might be worth exploring for
8.1 though.

A patch (implementing #2) is attached -- any benchmark results would be
helpful. Increasing shared_buffers (to 10,000 or more) should make the
problem noticeable.

Opinions on which route is the best, or on some alternative solution? My
inclination is toward #2, but I'm not dead-set on it.

-Neil

[1]: http://archives.postgresql.org/pgsql-hackers/2004-12/msg00386.php

Attachments:

bgwriter_rem_percent-1.patch (text/x-patch)
Index: doc/src/sgml/runtime.sgml
===================================================================
RCS file: /var/lib/cvs/pgsql/doc/src/sgml/runtime.sgml,v
retrieving revision 1.296
diff -c -r1.296 runtime.sgml
*** doc/src/sgml/runtime.sgml	13 Dec 2004 18:05:09 -0000	1.296
--- doc/src/sgml/runtime.sgml	14 Dec 2004 04:52:26 -0000
***************
*** 1350,1382 ****
          <para>
           Specifies the delay between activity rounds for the
           background writer.  In each round the writer issues writes
!          for some number of dirty buffers (controllable by the
!          following parameters).  The selected buffers will always be
!          the least recently used ones among the currently dirty
!          buffers.  It then sleeps for <varname>bgwriter_delay</>
!          milliseconds, and repeats.  The default value is 200. Note
!          that on many systems, the effective resolution of sleep
!          delays is 10 milliseconds; setting <varname>bgwriter_delay</>
!          to a value that is not a multiple of 10 may have the same
!          results as setting it to the next higher multiple of 10.
!          This option can only be set at server start or in the
!          <filename>postgresql.conf</filename> file.
!         </para>
!        </listitem>
!       </varlistentry>
! 
!       <varlistentry id="guc-bgwriter-percent" xreflabel="bgwriter_percent">
!        <term><varname>bgwriter_percent</varname> (<type>integer</type>)</term>
!        <indexterm>
!         <primary><varname>bgwriter_percent</> configuration parameter</primary>
!        </indexterm>
!        <listitem>
!         <para>
!          In each round, no more than this percentage of the currently
!          dirty buffers will be written (rounding up any fraction to
!          the next whole number of buffers).  The default value is
!          1. This option can only be set at server start or in the
!          <filename>postgresql.conf</filename> file.
          </para>
         </listitem>
        </varlistentry>
--- 1350,1367 ----
          <para>
           Specifies the delay between activity rounds for the
           background writer.  In each round the writer issues writes
!          for some number of dirty buffers (controllable by
!          <varname>bgwriter_maxpages</varname>).  The selected buffers
!          will always be the least recently used ones among the
!          currently dirty buffers.  It then sleeps for
!          <varname>bgwriter_delay</> milliseconds, and repeats.  The
!          default value is 200. Note that on many systems, the
!          effective resolution of sleep delays is 10 milliseconds;
!          setting <varname>bgwriter_delay</> to a value that is not a
!          multiple of 10 may have the same results as setting it to the
!          next higher multiple of 10.  This option can only be set at
!          server start or in the <filename>postgresql.conf</filename>
!          file.
          </para>
         </listitem>
        </varlistentry>
***************
*** 1398,1409 ****
       </variablelist>
  
       <para>
!       Smaller values of <varname>bgwriter_percent</varname> and
!       <varname>bgwriter_maxpages</varname> reduce the extra I/O load
!       caused by the background writer, but leave more work to be done
!       at checkpoint time.  To reduce load spikes at checkpoints,
!       increase the values.  To disable background writing entirely,
!       set <varname>bgwriter_percent</varname> and/or
        <varname>bgwriter_maxpages</varname> to zero.
       </para>
      </sect3>
--- 1383,1396 ----
       </variablelist>
  
       <para>
!       Decreasing <varname>bgwriter_maxpages</varname> or increasing
!       <varname>bgwriter_delay</varname> will reduce the extra I/O load
!       caused by the background writer, but will leave more work to be
!       done at checkpoint time. To reduce load spikes at checkpoints,
!       increase the number of pages written per round
!       (<varname>bgwriter_maxpages</varname>) or reduce the delay
!       between rounds (<varname>bgwriter_delay</varname>). To disable
!       background writing entirely, set
        <varname>bgwriter_maxpages</varname> to zero.
       </para>
      </sect3>
Index: src/backend/catalog/index.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/catalog/index.c,v
retrieving revision 1.242
diff -c -r1.242 index.c
*** src/backend/catalog/index.c	1 Dec 2004 19:00:39 -0000	1.242
--- src/backend/catalog/index.c	14 Dec 2004 04:32:39 -0000
***************
*** 1062,1068 ****
  		/* Send out shared cache inval if necessary */
  		if (!IsBootstrapProcessingMode())
  			CacheInvalidateHeapTuple(pg_class, tuple);
! 		BufferSync(-1, -1);
  	}
  	else if (dirty)
  	{
--- 1062,1068 ----
  		/* Send out shared cache inval if necessary */
  		if (!IsBootstrapProcessingMode())
  			CacheInvalidateHeapTuple(pg_class, tuple);
! 		BufferSync(-1);
  	}
  	else if (dirty)
  	{
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/commands/dbcommands.c,v
retrieving revision 1.147
diff -c -r1.147 dbcommands.c
*** src/backend/commands/dbcommands.c	18 Nov 2004 01:14:26 -0000	1.147
--- src/backend/commands/dbcommands.c	14 Dec 2004 04:40:19 -0000
***************
*** 332,338 ****
  	 * up-to-date for the copy.  (We really only need to flush buffers for
  	 * the source database, but bufmgr.c provides no API for that.)
  	 */
! 	BufferSync(-1, -1);
  
  	/*
  	 * Close virtual file descriptors so the kernel has more available for
--- 332,338 ----
  	 * up-to-date for the copy.  (We really only need to flush buffers for
  	 * the source database, but bufmgr.c provides no API for that.)
  	 */
! 	BufferSync(-1);
  
  	/*
  	 * Close virtual file descriptors so the kernel has more available for
***************
*** 1206,1212 ****
  		 * up-to-date for the copy.  (We really only need to flush buffers for
  		 * the source database, but bufmgr.c provides no API for that.)
  		 */
! 		BufferSync(-1, -1);
  
  #ifndef WIN32
  
--- 1206,1212 ----
  		 * up-to-date for the copy.  (We really only need to flush buffers for
  		 * the source database, but bufmgr.c provides no API for that.)
  		 */
! 		BufferSync(-1);
  
  #ifndef WIN32
  
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.11
diff -c -r1.11 bgwriter.c
*** src/backend/postmaster/bgwriter.c	5 Nov 2004 17:11:28 -0000	1.11
--- src/backend/postmaster/bgwriter.c	14 Dec 2004 04:44:26 -0000
***************
*** 116,122 ****
   * GUC parameters
   */
  int			BgWriterDelay = 200;
- int			BgWriterPercent = 1;
  int			BgWriterMaxPages = 100;
  
  int			CheckPointTimeout = 300;
--- 116,121 ----
***************
*** 372,378 ****
  			n = 1;
  		}
  		else
! 			n = BufferSync(BgWriterPercent, BgWriterMaxPages);
  
  		/*
  		 * Nap for the configured time or sleep for 10 seconds if there
--- 371,377 ----
  			n = 1;
  		}
  		else
! 			n = BufferSync(BgWriterMaxPages);
  
  		/*
  		 * Nap for the configured time or sleep for 10 seconds if there
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.182
diff -c -r1.182 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	24 Nov 2004 02:56:17 -0000	1.182
--- src/backend/storage/buffer/bufmgr.c	14 Dec 2004 04:40:18 -0000
***************
*** 671,717 ****
   *
   * This is called at checkpoint time to write out all dirty shared buffers,
   * and by the background writer process to write out some of the dirty blocks.
!  * percent/maxpages should be -1 in the former case, and limit values (>= 0)
   * in the latter.
   *
   * Returns the number of buffers written.
   */
  int
! BufferSync(int percent, int maxpages)
  {
  	BufferDesc **dirty_buffers;
  	BufferTag  *buftags;
  	int			num_buffer_dirty;
  	int			i;
  
! 	/* If either limit is zero then we are disabled from doing anything... */
! 	if (percent == 0 || maxpages == 0)
  		return 0;
  
  	/*
  	 * Get a list of all currently dirty buffers and how many there are.
  	 * We do not flush buffers that get dirtied after we started. They
  	 * have to wait until the next checkpoint.
  	 */
! 	dirty_buffers = (BufferDesc **) palloc(NBuffers * sizeof(BufferDesc *));
! 	buftags = (BufferTag *) palloc(NBuffers * sizeof(BufferTag));
  
  	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
  	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
! 											   NBuffers);
! 
! 	/*
! 	 * If called by the background writer, we are usually asked to only
! 	 * write out some portion of dirty buffers now, to prevent the IO
! 	 * storm at checkpoint time.
! 	 */
! 	if (percent > 0)
! 	{
! 		Assert(percent <= 100);
! 		num_buffer_dirty = (num_buffer_dirty * percent + 99) / 100;
! 	}
! 	if (maxpages > 0 && num_buffer_dirty > maxpages)
! 		num_buffer_dirty = maxpages;
  
  	/* Make sure we can handle the pin inside the loop */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
--- 671,710 ----
   *
   * This is called at checkpoint time to write out all dirty shared buffers,
   * and by the background writer process to write out some of the dirty blocks.
!  * maxpages should be -1 in the former case, and a limit value (>= 0)
   * in the latter.
   *
   * Returns the number of buffers written.
   */
  int
! BufferSync(int maxpages)
  {
  	BufferDesc **dirty_buffers;
  	BufferTag  *buftags;
  	int			num_buffer_dirty;
  	int			i;
  
! 	/* If maxpages is zero then we're effectively disabled */
! 	if (maxpages == 0)
  		return 0;
  
+ 	/* If -1, flush all dirty buffers */
+ 	if (maxpages == -1)
+ 		maxpages = NBuffers;
+ 
  	/*
+ 	 * Get a list of up to "maxpages" dirty buffers, in LRU order.
  	 * Get a list of all currently dirty buffers and how many there are.
  	 * We do not flush buffers that get dirtied after we started. They
  	 * have to wait until the next checkpoint.
  	 */
! 	dirty_buffers = (BufferDesc **) palloc(maxpages * sizeof(BufferDesc *));
! 	buftags = (BufferTag *) palloc(maxpages * sizeof(BufferTag));
  
  	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
  	num_buffer_dirty = StrategyDirtyBufferList(dirty_buffers, buftags,
! 											   maxpages);
! 	Assert(num_buffer_dirty <= maxpages);
  
  	/* Make sure we can handle the pin inside the loop */
  	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
***************
*** 947,953 ****
  void
  FlushBufferPool(void)
  {
! 	BufferSync(-1, -1);
  	smgrsync();
  }
  
--- 940,946 ----
  void
  FlushBufferPool(void)
  {
! 	BufferSync(-1);
  	smgrsync();
  }
  
Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.48
diff -c -r1.48 freelist.c
*** src/backend/storage/buffer/freelist.c	16 Sep 2004 16:58:31 -0000	1.48
--- src/backend/storage/buffer/freelist.c	14 Dec 2004 04:22:02 -0000
***************
*** 753,810 ****
  	int			num_buffer_dirty = 0;
  	int			cdb_id_t1;
  	int			cdb_id_t2;
- 	int			buf_id;
- 	BufferDesc *buf;
  
  	/*
! 	 * Traverse the T1 and T2 list LRU to MRU in "parallel" and add all
! 	 * dirty buffers found in that order to the list. The ARC strategy
! 	 * keeps all used buffers including pinned ones in the T1 or T2 list.
! 	 * So we cannot miss any dirty buffers.
  	 */
  	cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
  	cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
  
  	while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
  	{
  		if (cdb_id_t1 >= 0)
  		{
  			buf_id = StrategyCDB[cdb_id_t1].buf_id;
- 			buf = &BufferDescriptors[buf_id];
- 
- 			if (buf->flags & BM_VALID)
- 			{
- 				if ((buf->flags & BM_DIRTY) || (buf->cntxDirty))
- 				{
- 					buffers[num_buffer_dirty] = buf;
- 					buftags[num_buffer_dirty] = buf->tag;
- 					num_buffer_dirty++;
- 					if (num_buffer_dirty >= max_buffers)
- 						break;
- 				}
- 			}
- 
  			cdb_id_t1 = StrategyCDB[cdb_id_t1].next;
  		}
! 
! 		if (cdb_id_t2 >= 0)
  		{
  			buf_id = StrategyCDB[cdb_id_t2].buf_id;
! 			buf = &BufferDescriptors[buf_id];
  
! 			if (buf->flags & BM_VALID)
  			{
! 				if ((buf->flags & BM_DIRTY) || (buf->cntxDirty))
! 				{
! 					buffers[num_buffer_dirty] = buf;
! 					buftags[num_buffer_dirty] = buf->tag;
! 					num_buffer_dirty++;
! 					if (num_buffer_dirty >= max_buffers)
! 						break;
! 				}
  			}
- 
- 			cdb_id_t2 = StrategyCDB[cdb_id_t2].next;
  		}
  	}
  
--- 753,797 ----
  	int			num_buffer_dirty = 0;
  	int			cdb_id_t1;
  	int			cdb_id_t2;
  
  	/*
! 	 * Traverse the T1 and T2 list from LRU to MRU in "parallel" and
! 	 * add all dirty buffers found in that order to the list. The ARC
! 	 * strategy keeps all used buffers including pinned ones in the T1
! 	 * or T2 list.  So we cannot miss any dirty buffers.
  	 */
  	cdb_id_t1 = StrategyControl->listHead[STRAT_LIST_T1];
  	cdb_id_t2 = StrategyControl->listHead[STRAT_LIST_T2];
  
  	while (cdb_id_t1 >= 0 || cdb_id_t2 >= 0)
  	{
+ 		int			buf_id;
+ 		BufferDesc *buf;
+ 
  		if (cdb_id_t1 >= 0)
  		{
  			buf_id = StrategyCDB[cdb_id_t1].buf_id;
  			cdb_id_t1 = StrategyCDB[cdb_id_t1].next;
  		}
! 		else
  		{
+ 			Assert(cdb_id_t2 >= 0);
  			buf_id = StrategyCDB[cdb_id_t2].buf_id;
! 			cdb_id_t2 = StrategyCDB[cdb_id_t2].next;
! 		}
! 
! 		buf = &BufferDescriptors[buf_id];
  
! 		if (buf->flags & BM_VALID)
! 		{
! 			if ((buf->flags & BM_DIRTY) || (buf->cntxDirty))
  			{
! 				buffers[num_buffer_dirty] = buf;
! 				buftags[num_buffer_dirty] = buf->tag;
! 				num_buffer_dirty++;
! 				if (num_buffer_dirty >= max_buffers)
! 					break;
  			}
  		}
  	}
  
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/utils/misc/guc.c,v
retrieving revision 1.250
diff -c -r1.250 guc.c
*** src/backend/utils/misc/guc.c	24 Nov 2004 19:51:03 -0000	1.250
--- src/backend/utils/misc/guc.c	14 Dec 2004 04:44:40 -0000
***************
*** 1249,1263 ****
  	},
  
  	{
- 		{"bgwriter_percent", PGC_SIGHUP, RESOURCES,
- 			gettext_noop("Background writer percentage of dirty buffers to flush per round"),
- 			NULL
- 		},
- 		&BgWriterPercent,
- 		1, 0, 100, NULL, NULL
- 	},
- 
- 	{
  		{"bgwriter_maxpages", PGC_SIGHUP, RESOURCES,
  			gettext_noop("Background writer maximum number of pages to flush per round"),
  			NULL
--- 1249,1254 ----
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /var/lib/cvs/pgsql/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.134
diff -c -r1.134 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample	5 Nov 2004 19:16:16 -0000	1.134
--- src/backend/utils/misc/postgresql.conf.sample	14 Dec 2004 04:54:47 -0000
***************
*** 96,106 ****
  #vacuum_cost_page_dirty = 20	# 0-10000 credits
  #vacuum_cost_limit = 200	# 0-10000 credits
  
! # - Background writer -
  
  #bgwriter_delay = 200		# 10-10000 milliseconds between rounds
! #bgwriter_percent = 1		# 0-100% of dirty buffers in each round
! #bgwriter_maxpages = 100	# 0-1000 buffers max per round
  
  
  #---------------------------------------------------------------------------
--- 96,105 ----
  #vacuum_cost_page_dirty = 20	# 0-10000 credits
  #vacuum_cost_limit = 200	# 0-10000 credits
  
! # - Background Writer -
  
  #bgwriter_delay = 200		# 10-10000 milliseconds between rounds
! #bgwriter_maxpages = 100	# max buffers written per round, 0 disables
  
  
  #---------------------------------------------------------------------------
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /var/lib/cvs/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.3
diff -c -r1.3 bgwriter.h
*** src/include/postmaster/bgwriter.h	29 Aug 2004 04:13:09 -0000	1.3
--- src/include/postmaster/bgwriter.h	14 Dec 2004 04:44:44 -0000
***************
*** 18,24 ****
  
  /* GUC options */
  extern int	BgWriterDelay;
- extern int	BgWriterPercent;
  extern int	BgWriterMaxPages;
  extern int	CheckPointTimeout;
  extern int	CheckPointWarning;
--- 18,23 ----
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /var/lib/cvs/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.88
diff -c -r1.88 bufmgr.h
*** src/include/storage/bufmgr.h	16 Oct 2004 18:57:26 -0000	1.88
--- src/include/storage/bufmgr.h	14 Dec 2004 04:40:09 -0000
***************
*** 150,156 ****
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern int	BufferSync(int percent, int maxpages);
  
  extern void InitLocalBuffer(void);
  
--- 150,156 ----
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern int	BufferSync(int maxpages);
  
  extern void InitLocalBuffer(void);
  
#2 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Neil Conway (#1)
Re: bgwriter changes

Neil Conway wrote:

(2) Remove bgwriter_percent. I have yet to hear anyone argue that
there's an actual need for bgwriter_percent in tuning bgwriter behavior,
and one less GUC var is a good thing, all else being equal. This is
effectively the same as #1 with the default changed, only less flexibility.

I prefer #2, and agree with you and Simon that something has to be done
for 8.0.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Neil Conway (#1)
Re: bgwriter changes

Neil Conway <neilc@samurai.com> writes:

...
(2) Remove bgwriter_percent. I have yet to hear anyone argue that
there's an actual need for bgwriter_percent in tuning bgwriter behavior,
...

Of the three offered solutions, I agree that that makes the most sense
(unless Jan steps up with a strong argument why this knob is needed).

However, due consideration should also be given to

(4) Do nothing until 8.1.

At this point in the release cycle I'm not sure we should be making
any significant changes for anything less than a crashing bug.

A patch (implementing #2) is attached -- any benchmark results would be
helpful. Increasing shared_buffers (to 10,000 or more) should make the
problem noticeable.

I'd want to see some pretty impressive benchmark results before we
consider making a change now.

regards, tom lane

#4 Andrew Dunstan
andrew@dunslane.net
In reply to: Tom Lane (#3)
Re: bgwriter changes

Tom Lane wrote:

However, due consideration should also be given to

(4) Do nothing until 8.1.

At this point in the release cycle I'm not sure we should be making
any significant changes for anything less than a crashing bug.

If that's not the policy, then I don't understand the dev cycle state
labels used.

In the commercial world, my approach would be that if this was
determined to be necessary (about which I am moderately agnostic) then
we would abort the current RC stage, effectively postponing the release.

cheers

andrew

#5 Zeugswetter Andreas DAZ SD
ZeugswetterA@spardat.at
In reply to: Andrew Dunstan (#4)
Re: bgwriter changes

(2) Remove bgwriter_percent. I have yet to hear anyone argue that
there's an actual need for bgwriter_percent in tuning
bgwriter behavior,

One argument for it is to avoid writing very hot pages.

(3) Change the meaning of bgwriter_percent, per Simon's proposal. Make
it mean "the percentage of the buffer pool to scan, at most, to look for
dirty buffers". I don't think this is workable, at least not at this

In the long run I think we want to avoid the checkpoint having to do a lot of
writing, without writing hot pages too often. This can only reasonably be
defined via a maximum number of pages we allow to be dirty at checkpoint time.
bgwriter_percent comes close to this meaning, although in this sense the value
would need to be high, like 80%.

I think we do want two settings. Think of one as a short-term value
(so the bgwriter does not write everything in one run) and one as a long-term
target over multiple runs.

Is it possible to do a patch that produces a dirty buffer list in LRU order
and stops early when either maxpages is reached or bgwriter_percent
pages are scanned?

Andreas

#6 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas DAZ SD (#5)
Re: bgwriter changes

"Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes:

Is it possible to do a patch that produces a dirty buffer list in LRU order
and stops early when either maxpages is reached or bgwriter_percent
pages are scanned?

Only if you redefine the meaning of bgwriter_percent. At present it's
defined by reference to the total number of dirty pages, and that can't
be known without collecting them all.

If it were, say, a percentage of the total length of the T1/T2 lists,
then we'd have some chance of stopping the scan early.

regards, tom lane

#7 Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#6)
Re: bgwriter changes

On Tue, 2004-12-14 at 19:40, Tom Lane wrote:

"Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes:

Is it possible to do a patch that produces a dirty buffer list in LRU order
and stops early when either maxpages is reached or bgwriter_percent
pages are scanned?

Only if you redefine the meaning of bgwriter_percent. At present it's
defined by reference to the total number of dirty pages, and that can't
be known without collecting them all.

If it were, say, a percentage of the total length of the T1/T2 lists,
then we'd have some chance of stopping the scan early.

...which was exactly what was proposed for option (3).

--
Best Regards, Simon Riggs

#8 Neil Conway
neilc@samurai.com
In reply to: Tom Lane (#3)
Re: bgwriter changes

On Tue, 2004-12-14 at 09:23 -0500, Tom Lane wrote:

At this point in the release cycle I'm not sure we should be making
any significant changes for anything less than a crashing bug.

Yes, that's true, and I am definitely hesitant to make changes during
RC. That said, "adjust bgwriter defaults" has been on the "open items"
list for quite some time -- in some sense #2 is just a variant on that
idea.

I'd want to see some pretty impressive benchmark results before we
consider making a change now.

http://archives.postgresql.org/pgsql-hackers/2004-12/msg00426.php

is with a patch from Simon that implements #3. While that's not exactly
the same as #2, it does seem to suggest that the performance difference
is rather noticeable. If the problem does indeed exacerbate BufMgrLock
contention, it might be more noticeable still on an SMP machine.

I'm going to try and get some more benchmark data; if anyone else wants
to try the patch and contribute results they are welcome to.

-Neil

#9 Simon Riggs
simon@2ndquadrant.com
In reply to: Neil Conway (#1)
Re: bgwriter changes

On Tue, 2004-12-14 at 13:30, Neil Conway wrote:

In recent discussion[1] with Simon Riggs, there has been some talk of
making some changes to the bgwriter. To summarize the problem, the
bgwriter currently scans the entire T1+T2 buffer lists and returns a
list of all the currently dirty buffers. It then selects a subset of
that list (computed using bgwriter_percent and bgwriter_maxpages) to
flush to disk. Not only does this mean we can end up scanning a
significant portion of shared_buffers for every invocation of the
bgwriter, we also do the scan while holding the BufMgrLock, likely
hurting scalability.

Neil's summary is very clear, many thanks.

There have been many suggestions, patches, and test results, so I have
attempted to summarise everything here, using Neil's post to give
structure to the other information:

I think a fix for this in some fashion is warranted for 8.0. Possible
solutions:

I add two things to this structure:
i) the name of the patch that implements it (author's initials)
ii) the benchmark results published for each

(1) Special-case bgwriter_percent=100. The only reason we need to return
a list of all the dirty buffers is so that we can choose n% of them to
satisfy bgwriter_percent. That is obviously unnecessary if we have
bgwriter_percent=100. I think this change won't help most users,
*unless* we also change bgwriter_percent=100 in the default configuration.

100pct.patch (SR)

Test results to date:
1. Mark Kirkwood ([HACKERS] [Testperf-general] BufferSync and bgwriter)
pgbench 1xCPU 1xDisk shared_buffers=10000
showed 8.0RC1 had regressed compared with 7.4.6, but patch improved
performance significantly against 8.0RC1

Discounted now by both Neil and myself, since the same idea has been
more generally implemented as ideas (2) and (3) below.

(2) Remove bgwriter_percent. I have yet to hear anyone argue that
there's an actual need for bgwriter_percent in tuning bgwriter behavior,
and one less GUC var is a good thing, all else being equal. This is
effectively the same as #1 with the default changed, only less flexibility.

There are 2 patches published which do same thing:
- Partially implemented following Neil's suggestion: bg3.patch (SR)
- Fully implemented: bgwriter_rem_percent-1.patch (NC)
Patches have an identical effect on performance.

Test results to date:
1. Neil's testing was "inconclusive" for shared_buffers = 2500 on a
single cpu, single disk system (test used bgwriter_rem_percent-1.patch)
2. Mark Wong's OSDL tests published as test 211
analysis already posted on this thread;
dbt-2 4 CPU, many disk, shared_buffers=60000 (test used bg3.patch)
3% overall benefit, greatly reduced max transaction times
3. Mark Kirkwood's tests
pgbench 2xCPU 2xdisk, shared_buffers=10000 (test used
bgwriter_rem_percent-1.patch)
Showed slight regression against RC1 - must be test variability because
the patch does less work and is very unlikely to cause a regression

(3) Change the meaning of bgwriter_percent, per Simon's proposal. Make
it mean "the percentage of the buffer pool to scan, at most, to look for
dirty buffers". I don't think this is workable, at least not at this
point in the release cycle, because it means we might not smooth out
checkpoint load, one of the primary goals of the bgwriter (in this
proposal bgwriter would only ever consider writing out a small subset of
the total shared buffer cache: the least-recently-used n%, with 2% being
a suggested default). Some variant of this might be worth exploring for
8.1 though.

Implemented as bg2.patch (SR)
Contains a small bug, easily fixed, which would not affect performance.

Test results to date:
1. Mark Kirkwood's tests
pgbench 2xCPU 2xdisk, shared_buffers=10000 (test used bg2.patch)
Showed improvement on RC1 and best option out of all three tests
(compared RC1, bg2.patch, bgwriter_rem_percent-1.patch), possibly
similar within bounds of test variability - but interesting enough to
investigate further.

Current situation seems to be:
- all test results indicate performance regressions in RC1 when
shared_buffers >= 10000 and using multi-cpu/multi-disk systems
- option (2) has the most thoroughly confirmable test results and is
thought by all parties to be the simplest and most robust approach.
- some more test results would be useful to compare, to ensure that
applying the patch would be useful in all circumstances.

Approach (3) looks interesting and should be investigated for 8.1, since
it introduces a subtly different algorithm that may have "interesting
flight characteristics" and is more of a risk to the 8.0 release.

Thanks very much to all performance testers. It's important work.

--
Best Regards, Simon Riggs

#10 Zeugswetter Andreas DAZ SD
ZeugswetterA@spardat.at
In reply to: Simon Riggs (#9)
Re: bgwriter changes

and stops early when either maxpages is reached or bgwriter_percent
pages are scanned?

Only if you redefine the meaning of bgwriter_percent. At present it's
defined by reference to the total number of dirty pages, and that can't
be known without collecting them all.

If it were, say, a percentage of the total length of the T1/T2 lists,
then we'd have some chance of stopping the scan early.

...which was exactly what was proposed for option (3).

But the benchmark run was with bgwriter_percent = 100. I wanted to point out
that I think 100% is too much (it writes hot pages multiple times between
checkpoints). In the benchmark, the bgwriter obviously falls behind; the delay
is too long. But if you reduce the delay you will start to see what I mean.

Actually I think what is really needed is a maximum number of pages we want
dirty at checkpoint time. Since that would again require scanning all pages,
the next best definition would imho be to stop at a percentage (or a number of
pages short) of total T1/T2. Then you can still calculate a worst-case I/O for
checkpoint (assuming that all hot pages are dirty).

Andreas

#11 Noname
simon@2ndquadrant.com
In reply to: Zeugswetter Andreas DAZ SD (#10)
Re: Re: bgwriter changes

Zeugswetter Andreas DAZ SD <ZeugswetterA@spardat.at> wrote on
15.12.2004, 11:39:44:

and stops early when either maxpages is reached or bgwriter_percent
pages are scanned?

Only if you redefine the meaning of bgwriter_percent. At present it's
defined by reference to the total number of dirty pages, and that can't
be known without collecting them all.

If it were, say, a percentage of the total length of the T1/T2 lists,
then we'd have some chance of stopping the scan early.

...which was exactly what was proposed for option (3).

But the benchmark run was with bgwriter_percent 100.

Yes, it was for run 211, but the patch that was used effectively
disabled bgwriter_percent in favour of looking only at
bgwriter_maxpages.

The patch used was not exactly what was being discussed here. In that
patch, StrategyDirtyBufferList scans until it finds bgwriter_maxpages
dirty pages, then stops. That means a varying number of buffers on the
list are scanned, starting from the LRU.

What is being suggested here was implemented for bg2.patch. The
algorithm in there was for StrategyDirtyBufferList to scan until it had
looked at the dirty/clean status of bgwriter_maxpages buffers. That
means a constant number of buffers on the list are scanned, starting
from the LRU.

The two alternative algorithms are similar, but have these differences:
The former (option (2)) finds a constant number of dirty pages, though
has varying search time. The latter (option (3)) has constant search
time, yet finds a varying number of dirty pages. Both alternatives
avoid scanning the whole of the buffer list, as is the case in 8.0RC1,
allowing the bgwriter to act more frequently at lower cost.
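The contrast between the two stopping rules can be put in a toy simulation (hypothetical Python, not the actual StrategyDirtyBufferList code; the pool size and dirty ratio are made-up stand-ins):

```python
import random

def scan_until_n_dirty(buffers, maxpages):
    # Option (2): walk from the LRU end until maxpages dirty buffers
    # are found -- constant yield, but variable search time.
    found, examined = [], 0
    for i, dirty in enumerate(buffers):
        examined += 1
        if dirty:
            found.append(i)
            if len(found) == maxpages:
                break
    return found, examined

def scan_fixed_window(buffers, maxpages):
    # Option (3): examine exactly maxpages buffers from the LRU end
    # -- constant search time, but variable yield.
    window = buffers[:maxpages]
    return [i for i, d in enumerate(window) if d], len(window)

random.seed(42)
pool = [random.random() < 0.3 for _ in range(1000)]  # ~30% dirty, LRU first

dirty2, examined2 = scan_until_n_dirty(pool, 100)
dirty3, examined3 = scan_fixed_window(pool, 100)
print(len(dirty2), examined2)  # always 100 dirty found; examined varies
print(len(dirty3), examined3)  # dirty count varies; always examines 100
```

Either way only a prefix of the list is scanned, which is the point: the 8.0RC1 code walks the entire T1+T2 list under the BufMgrLock.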

There's some evidence that the second algorithm may be better, but it may
have other characteristics or side-effects that we don't yet know. So
at this stage of the game, I'm happier not to progress option (3) any
further for 8.0, since option (2) is closest to the one that has been
through beta-testing.

Best Regards, Simon Riggs

#12Zeugswetter Andreas DAZ SD
ZeugswetterA@spardat.at
In reply to: Noname (#11)
Re: bgwriter changes

The two alternative algorithms are similar, but have these
differences:
The former (option (2)) finds a constant number of dirty pages, though
has varying search time.

This has the disadvantage of converging toward 0 dirty pages.
A system that has less than maxpages dirty will write every page with
every bgwriter run.

The latter (option (3)) has constant search
time, yet finds a varying number of dirty pages.

This might have the disadvantage of either leaving too much for the
checkpoint or writing too many dirty pages in one run. Is writing a lot
in one run actually a problem though ? Or does the bgwriter pause
periodically while writing the pages of one run ?
If this is expressed in pages it would naturally need to be more than the
current maxpages (to accommodate clean pages). The suggested 2% sounded
way too low for me (that leaves 98% to the checkpoint).

Also I think we are doing too frequent checkpoints with bgwriter in
place. Every 15-30 minutes should be sufficient, even for benchmarks.
We need a tuned bgwriter for this though.

Andreas

#13Noname
simon@2ndquadrant.com
In reply to: Zeugswetter Andreas DAZ SD (#12)
Re: RE: Re: bgwriter changes

Zeugswetter Andreas DAZ SD <ZeugswetterA@spardat.at> wrote on
15.12.2004, 15:33:16:

The two alternative algorithms are similar, but have these
differences:
The former (option (2)) finds a constant number of dirty pages, though
has varying search time.

This has the disadvantage of converging toward 0 dirty pages.
A system that has less than maxpages dirty will write every page with
every bgwriter run.

Yes, that is my issue with that algorithm.... it causes more contention
when there are fewer dirty pages.

The latter (option (3)) has constant search
time, yet finds a varying number of dirty pages.

This might have the disadvantage of either leaving too much for the
checkpoint or writing too many dirty pages in one run. Is writing a lot
in one run actually a problem though ? Or does the bgwriter pause
periodically while writing the pages of one run ?
If this is expressed in pages it would naturally need to be more than the
current maxpages (to accommodate clean pages). The suggested 2% sounded
way too low for me (that leaves 98% to the checkpoint).

This remains to be seen. We have Mark Kirkwood's test results that show
that the algorithm may work better, but no large scale OSDL run as yet.

My view is that the 2% is misleading. The whole buffer list is like a
conveyor belt moving towards the LRU. It is my *conjecture* that
cleaning the LRU would be sufficient to clean the whole list
eventually. Blocks in the buffer list that always stay near the MRU
would be dirtied again quickly even if you did clean them, so if they
don't reach nearly to the LRU then there is less benefit in cleaning
them. (1%, 2% or 5% would need to be a tunable factor; 2% was the
suggested default)

If the bgwriter writes too often it would get in the way of other work,
so there is clearly an optimum setting for any workload.

Also I think we are doing too frequent checkpoints with bgwriter in
place. Every 15-30 minutes should be sufficient, even for benchmarks.
We need a tuned bgwriter for this though.

Well, yes, you're right. ...but the bug limiting us to 255 files
restricts us there for higher performance situations.

Best Regards, Simon Riggs

#14Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#6)
Re: bgwriter changes

On 12/14/2004 2:40 PM, Tom Lane wrote:

"Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at> writes:

Is it possible to do a patch that produces a dirty buffer list in LRU order
and stops early when either maxpages is reached or bgwriter_percent
pages are scanned ?

Only if you redefine the meaning of bgwriter_percent. At present it's
defined by reference to the total number of dirty pages, and that can't
be known without collecting them all.

If it were, say, a percentage of the total length of the T1/T2 lists,
then we'd have some chance of stopping the scan early.

That definition is identical to a fixed maximum number of pages to write
per call. And since that parameter exists too, it would be redundant.

The other way around would make sense. In order to avoid writing the
busiest buffers at all (except for checkpointing), the parameter should
mean "don't scan the last x% of the queue at all".
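That inverted parameter could look something like this (a hypothetical Python sketch; the name skip_percent and the list layout are illustrative only):

```python
def dirty_pages_excluding_hot_tail(queue, skip_percent):
    # queue is ordered LRU -> MRU; leave the last skip_percent of it
    # (the busiest buffers) untouched, so the hottest pages are only
    # ever written by the checkpoint.
    limit = len(queue) * (100 - skip_percent) // 100
    return [i for i, dirty in enumerate(queue[:limit]) if dirty]

# ten buffers; the hottest one (index 9) is dirty but gets skipped
queue = [True, False, True, False, True, False, True, False, True, True]
print(dirty_pages_excluding_hot_tail(queue, 10))  # -> [0, 2, 4, 6, 8]
```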

Still, we need to avoid scanning over all the clean blocks of a large
buffer pool, so there is need for a separate dirty-LRU.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#15Tom Lane
tgl@sss.pgh.pa.us
In reply to: Jan Wieck (#14)
Re: bgwriter changes

Jan Wieck <JanWieck@Yahoo.com> writes:

Still, we need to avoid scanning over all the clean blocks of a large
buffer pool, so there is need for a separate dirty-LRU.

That's not happening, unless you want to undo the cntxDirty stuff,
with unknown implications for performance and deadlock safety. It's
definitely not happening in 8.0 ;-)

regards, tom lane

#16Jan Wieck
JanWieck@Yahoo.com
In reply to: Tom Lane (#15)
Re: bgwriter changes

On 12/15/2004 12:10 PM, Tom Lane wrote:

Jan Wieck <JanWieck@Yahoo.com> writes:

Still, we need to avoid scanning over all the clean blocks of a large
buffer pool, so there is need for a separate dirty-LRU.

That's not happening, unless you want to undo the cntxDirty stuff,
with unknown implications for performance and deadlock safety. It's
definitely not happening in 8.0 ;-)

Sure not.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

#17Mark Kirkwood
markir@coretech.co.nz
In reply to: Simon Riggs (#9)
Re: bgwriter changes

Simon Riggs wrote:

100pct.patch (SR)

Test results to date:
1. Mark Kirkwood ([HACKERS] [Testperf-general] BufferSync and bgwriter)
pgbench 1xCPU 1xDisk shared_buffers=10000
showed 8.0RC1 had regressed compared with 7.4.6, but patch improved
performance significantly against 8.0RC1

It occurs to me that cranking up the number of transactions (say
1000->100000) and seeing if said regression persists would be
interesting. This would give the smoothing effect of the bgwriter (plus
the ARC) a better chance to shine.

regards

Mark

#18Zeugswetter Andreas DAZ SD
ZeugswetterA@spardat.at
In reply to: Mark Kirkwood (#17)
Re: bgwriter changes

Only if you redefine the meaning of bgwriter_percent. At present it's
defined by reference to the total number of dirty pages, and that can't
be known without collecting them all.

If it were, say, a percentage of the total length of the T1/T2 lists,
then we'd have some chance of stopping the scan early.

The other way around would make sense. In order to avoid writing the
busiest buffers at all (except for checkpointing), the parameter should
mean "don't scan the last x% of the queue at all".

Your meaning is 1 minus the above meaning (at least that is what Tom and I meant),
but it is probably easier to understand (== Informix LRU_MIN_DIRTY).

Still, we need to avoid scanning over all the clean blocks of a large
buffer pool, so there is need for a separate dirty-LRU.

Maybe a "may be dirty" bitmap would be easier to do without being deadlock prone ?

Andreas

#19Neil Conway
neilc@samurai.com
In reply to: Zeugswetter Andreas DAZ SD (#12)
Re: bgwriter changes

Zeugswetter Andreas DAZ SD wrote:

This has the disadvantage of converging toward 0 dirty pages.
A system that has less than maxpages dirty will write every page with
every bgwriter run.

Yeah, I'm concerned about the bgwriter being overly aggressive if we
disable bgwriter_percent. If we leave the settings as they are (delay =
200, maxpages = 100, shared_buffers = 1000 by default), we will be
writing all the dirty pages to disk every 2 seconds, which seems far too
much.

It might also be good to reduce the delay, in order to more proactively
keep the LRUs clean (e.g. scanning to find N dirty pages once per second
is likely to reach farther away from the LRU than scanning for N/M pages
once per 1/M seconds). On the other hand the more often the bgwriter
scans the buffer pool, the more times the BufMgrLock needs to be
acquired -- and in a system in which pages aren't being dirtied very
rapidly (or the dirtied pages tend to be very hot), each of those scans
is going to take a while to find enough dirty pages using #2. So perhaps
it is best to leave the delay as is for 8.0.
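The delay/maxpages trade-off is easy to put in rough numbers (illustrative arithmetic only; the actual lock and write costs depend on the workload):

```python
def bgwriter_rates(delay_ms, maxpages):
    # Upper bound on pages written per second, and how often the
    # BufMgrLock must be taken, for given bgwriter settings.
    scans_per_sec = 1000.0 / delay_ms
    return scans_per_sec * maxpages, scans_per_sec

# default settings: delay=200ms, maxpages=100
print(bgwriter_rates(200, 100))  # (500.0, 5.0): 500 pages/s, 5 lock acquisitions/s
# same write budget in smaller, more frequent slices
print(bgwriter_rates(50, 25))    # (500.0, 20.0): same pages/s, 4x the lock traffic
```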

This might have the disadvantage of either leaving too much for the
checkpoint or writing too many dirty pages in one run. Is writing a lot
in one run actually a problem though ? Or does the bgwriter pause
periodically while writing the pages of one run ?

The bgwriter does not pause between writing pages. What would be the
point of doing that? The kernel is going to be caching the write() anyway.

If this is expressed in pages it would naturally need to be more than the
current maxpages (to accommodate clean pages). The suggested 2% sounded
way too low for me (that leaves 98% to the checkpoint).

I agree this might be a problem, but it doesn't necessarily leave 98% to
be written at checkpoint: if the buffers in the LRU change over time,
the set of pages searched by the bgwriter will also change. I'm not sure
how quickly the pages near the LRU change in a "typical workload";
moreover, I think this would vary between different workloads.

-Neil

#20Mark Kirkwood
markir@coretech.co.nz
In reply to: Mark Kirkwood (#17)
Re: bgwriter changes

Mark Kirkwood wrote:

It occurs to me that cranking up the number of transactions (say
1000->100000) and seeing if said regression persists would be
interesting. This would give the smoothing effect of the bgwriter
(plus the ARC) a better chance to shine.

I ran a few of these over the weekend - since it rained here :-) , and
the results are quite interesting:

[2xPIII, 2G, 2xATA RAID 0, FreeBSD 5.3 with the same non default Pg
parameters as before]

clients = 4 transactions = 100000 (/client), each test run twice

Version              tps
7.4.6                 49
8.0.0.0RC1            50
8.0.0.0RC1 + rem      49
8.0.0.0RC1 + bg2      50

Needless to say, all well within measurement error of each other (the
variability was about 1).

I suspect that my previous tests had too few transactions to trigger
many (or any) checkpoints. With them now occurring in the test, they
look to be the most significant factor (contrast with 70-80 tps for 4
clients with 1000 transactions).

Also with a small number of transactions, the fsync'ed blocks may have
all fitted in the ATA disk caches (2x2M). In hindsight I should have
disabled this! (might run the smaller no. transactions again with
hw.ata.wc=0 and see if this is enlightening)

regards

Mark

#21Simon Riggs
simon@2ndquadrant.com
In reply to: Mark Kirkwood (#20)
Re: bgwriter changes

On Mon, 2004-12-20 at 01:17, Mark Kirkwood wrote:

Mark Kirkwood wrote:

It occurs to me that cranking up the number of transactions (say
1000->100000) and seeing if said regression persists would be
interesting. This would give the smoothing effect of the bgwriter
(plus the ARC) a better chance to shine.

I ran a few of these over the weekend - since it rained here :-) , and
the results are quite interesting:

[2xPIII, 2G, 2xATA RAID 0, FreeBSD 5.3 with the same non default Pg
parameters as before]

clients = 4 transactions = 100000 (/client), each test run twice

Version              tps
7.4.6                 49
8.0.0.0RC1            50
8.0.0.0RC1 + rem      49
8.0.0.0RC1 + bg2      50

Needless to say, all well within measurement error of each other (the
variability was about 1).

I suspect that my previous tests had too few transactions to trigger
many (or any) checkpoints. With them now occurring in the test, they
look to be the most significant factor (contrast with 70-80 tps for 4
clients with 1000 transactions).

Also with a small number of transactions, the fsync'ed blocks may have
all fitted in the ATA disk caches (2x2M). In hindsight I should have
disabled this! (might run the smaller no. transactions again with
hw.ata.wc=0 and see if this is enlightening)

These test results do seem to have greatly reduced variability: thanks.

From what you say, this means the parameter settings were: (?)

shared_buffers = 10000
bgwriter_delay = 200
bgwriter_maxpages = 100

My interpretation of this is that the bgwriter is not effective with
these (the default) parameter settings.

I think optimum performance comes from reducing both bgwriter_delay and
bgwriter_maxpages, though reducing the delay isn't sensibly possible
with 8.0RCn when shared_buffers is large.

--
Best Regards, Simon Riggs

#22Simon Riggs
simon@2ndquadrant.com
In reply to: Neil Conway (#19)
Re: bgwriter changes

On Thu, 2004-12-16 at 11:07, Neil Conway wrote:

Zeugswetter Andreas DAZ SD wrote:

This has the disadvantage of converging toward 0 dirty pages.
A system that has less than maxpages dirty will write every page with
every bgwriter run.

Yeah, I'm concerned about the bgwriter being overly aggressive if we
disable bgwriter_percent. If we leave the settings as they are (delay =
200, maxpages = 100, shared_buffers = 1000 by default), we will be
writing all the dirty pages to disk every 2 seconds, which seems far too
much.

It might also be good to reduce the delay, in order to more proactively
keep the LRUs clean (e.g. scanning to find N dirty pages once per second
is likely to reach farther away from the LRU than scanning for N/M pages
once per 1/M seconds). On the other hand the more often the bgwriter
scans the buffer pool, the more times the BufMgrLock needs to be
acquired -- and in a system in which pages aren't being dirtied very
rapidly (or the dirtied pages tend to be very hot), each of those scans
is going to take a while to find enough dirty pages using #2. So perhaps
it is best to leave the delay as is for 8.0.

I think this is probably the right thing to do, since the majority of
users will have low/medium workloads, not the extremes of performance
that we have mainly been discussing.

This might have the disadvantage of either leaving too much for the
checkpoint or writing too many dirty pages in one run. Is writing a lot
in one run actually a problem though ? Or does the bgwriter pause
periodically while writing the pages of one run ?

The bgwriter does not pause between writing pages. What would be the
point of doing that? The kernel is going to be caching the write() anyway.

If this is expressed in pages it would naturally need to be more than the
current maxpages (to accommodate clean pages). The suggested 2% sounded
way too low for me (that leaves 98% to the checkpoint).

I agree this might be a problem, but it doesn't necessarily leave 98% to
be written at checkpoint: if the buffers in the LRU change over time,
the set of pages searched by the bgwriter will also change.

Agreed.

I'm not sure
how quickly the pages near the LRU change in a "typical workload";
moreover, I think this would vary between different workloads.

Yes, clearly we need to be able to change the parameters according to
the workload....and long term have them vary as needs change.

My concern at the moment is that the bgwriter_delay looks to me like it
needs to be set lower for busier workloads, yet that is not possible
because of the contention for the BufMgrLock. Investigating optimal
parameter settings isn't possible while this contention exists.

Incidentally, setting debug_shared_buffers also causes some contention
which I'll look at reducing for 8.1, so it can be used more
frequently as a log_ setting.

--
Best Regards, Simon Riggs