drop duplicate buffers in OS

Started by KONDO Mitsumasaalmost 12 years ago8 messages
#1KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
1 attachment(s)

Hi,

I create patch that can drop duplicate buffers in OS using usage_count
alogorithm. I have developed this patch since last summer. This feature seems to
be discussed in hot topic, so I submit it more faster than my schedule.

When usage_count is high in shared_buffers, they are hard to drop from
shared_buffers. However, these buffers wasn't required in file cache. Because
they aren't accessed by postgres(postgres access to shared_buffers).
So I create algorithm that dropping file cache which is high usage_count in
shared_buffers and is clean state in OS. If file cache are clean state in OS, and
executing posix_fadvice DONTNEED, it can only free in file cache without writing
physical disk. This algorithm will solve double-buffered situation problem and
can use memory more efficiently.

I am testing DBT-2 benchmark now...

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachments:

drop_duplicate_buffers_v4.patchtext/x-diff; name=drop_duplicate_buffers_v4.patchDownload
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1107,1112 **** include 'filename'
--- 1107,1130 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-drop-duplicate-buffers" xreflabel="drop-duplicate-buffers">
+       <term><varname>drop-duplicate-buffers</varname> (<type>integer</type>)</term>
+       <indexterm>
+        <primary><varname>drop-duplicate-buffers</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Sets target min usage count in shared_buffers, which is droped in file cache.
+         When you use this parameter, you set large shared_buffers that is about 50% 
+         or higher, and set parameter with 4 or 5. It need to set carefuly. If you 
+         use this parameter with small shared_buffers, transaction performance will be
+         decliend. This parameter is needed to support posix_fadvise() system call.
+         If your system doesn't support posix_fadvise(), it doesn't work at all.
+        </para>
+ 
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-temp-buffers" xreflabel="temp_buffers">
        <term><varname>temp_buffers</varname> (<type>integer</type>)</term>
        <indexterm>
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 69,74 ****
--- 69,75 ----
  /* GUC variables */
  bool		zero_damaged_pages = false;
  int			bgwriter_lru_maxpages = 100;
+ int		DropDuplicateBuffers = -1;
  double		bgwriter_lru_multiplier = 2.0;
  bool		track_io_timing = false;
  
***************
*** 111,116 **** static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
--- 112,118 ----
  static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
  static void AtProcExit_Buffers(int code, Datum arg);
  static int	rnode_comparator(const void *p1, const void *p2);
+ static void DropDuplicateOSCache(volatile BufferDesc *bufHdr);
  
  
  /*
***************
*** 835,841 **** BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  	 * 1 so that the buffer can survive one clock-sweep pass.)
  	 */
  	buf->tag = newTag;
! 	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
  	if (relpersistence == RELPERSISTENCE_PERMANENT)
  		buf->flags |= BM_TAG_VALID | BM_PERMANENT;
  	else
--- 837,843 ----
  	 * 1 so that the buffer can survive one clock-sweep pass.)
  	 */
  	buf->tag = newTag;
! 	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT | BM_FADVED);
  	if (relpersistence == RELPERSISTENCE_PERMANENT)
  		buf->flags |= BM_TAG_VALID | BM_PERMANENT;
  	else
***************
*** 1101,1107 **** PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
--- 1103,1120 ----
  		if (strategy == NULL)
  		{
  			if (buf->usage_count < BM_MAX_USAGE_COUNT)
+ 			{
  				buf->usage_count++;
+ 
+ 				/*
+ 				 * If the buffer is clean, we can remove duplicate buffers
+ 				 * from the file cache in OS without physical disk writing,
+ 				 * which is for more efficient whole memory using. It is needed
+ 				 * to execute at once per buffers.
+ 				 */
+ 				if (!(buf->flags & BM_FADVED) && !(buf->flags & BM_JUST_DIRTIED))
+ 					DropDuplicateOSCache(buf);
+ 			}
  		}
  		else
  		{
***************
*** 1953,1958 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
--- 1966,1974 ----
  	 */
  	recptr = BufferGetLSN(buf);
  
+ 	/* mark this buffer's file cache in OS isn't clean */
+ 	buf->flags |= BM_FADVED;
+ 
  	/* To check if block content changes while flushing. - vadim 01/17/97 */
  	buf->flags &= ~BM_JUST_DIRTIED;
  	UnlockBufHdr(buf);
***************
*** 2101,2106 **** BufferGetLSNAtomic(Buffer buffer)
--- 2117,2144 ----
  	return lsn;
  }
  
+ /*
+  * Drop duplicate OS cache which is in OS and shared_buffers. This purpose
+  * is for more efficient file cache space and dirty file cache management.
+  */
+ static void
+ DropDuplicateOSCache(volatile BufferDesc *bufHdr)
+ {
+ 	SMgrRelation    reln;
+ 	MdfdVec         *v;
+ 	off_t           seekpos;
+ 
+ 	/* Drop OS cache which is higher usage_count in shared_buffer */
+ 	if(DropDuplicateBuffers != -1 && bufHdr->usage_count >= DropDuplicateBuffers)
+ 	{
+ 		reln = smgropen(bufHdr->tag.rnode, InvalidBackendId);
+ 		v = _mdfd_getseg(reln, bufHdr->tag.forkNum, bufHdr->tag.blockNum, false, 0);
+ 		seekpos = (off_t) BLCKSZ *(bufHdr->tag.blockNum % ((BlockNumber) RELSEG_SIZE));
+ 		FileCacheAdvise(v->mdfd_vfd, seekpos, (off_t) BLCKSZ, POSIX_FADV_DONTNEED);
+ 		bufHdr->flags |= BM_FADVED;
+ 	}
+ }
+ 
  /* ---------------------------------------------------------------------
   *		DropRelFileNodeBuffers
   *
***************
*** 2414,2419 **** FlushRelationBuffers(Relation rel)
--- 2452,2459 ----
  				error_context_stack = &errcallback;
  
  				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
+ 				
+ 				bufHdr->flags |= BM_FADVED;
  
  				smgrwrite(rel->rd_smgr,
  						  bufHdr->tag.forkNum,
*** a/src/backend/storage/buffer/localbuf.c
--- b/src/backend/storage/buffer/localbuf.c
***************
*** 204,209 **** LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
--- 204,211 ----
  
  		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
  
+ 		bufHdr->flags |= BM_FADVED;
+ 
  		/* And write... */
  		smgrwrite(oreln,
  				  bufHdr->tag.forkNum,
*** a/src/backend/storage/file/fd.c
--- b/src/backend/storage/file/fd.c
***************
*** 383,388 **** pg_flush_data(int fd, off_t offset, off_t amount)
--- 383,403 ----
  	return 0;
  }
  
+ /*
+  * pg_fadvise --- advise OS that the cache will need or not
+  *
+  * Not all platforms have posix_fadvise. If it does not support posix_fadvise,
+  * we do nothing about here.
+  */
+ int
+ pg_fadvise(int fd, off_t offset, off_t amount, int advise)
+ {
+ #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+ 	return posix_fadvise(fd, offset, amount, advise);
+ #else
+ 	return 0;
+ #endif
+ }
  
  /*
   * fsync_fname -- fsync a file or directory, handling errors properly
***************
*** 1142,1147 **** OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
--- 1157,1171 ----
  }
  
  /*
+  * Controling OS file cache using posix_fadvise()
+  */
+ int
+ FileCacheAdvise(File file, off_t offset, off_t amount, int advise)
+ {
+ 	return pg_fadvise(VfdCache[file].fd, offset, amount, advise);
+ }
+ 
+ /*
   * close a file when done with it
   */
  void
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 109,121 ****
   *	All MdfdVec objects are palloc'd in the MdCxt memory context.
   */
  
- typedef struct _MdfdVec
- {
- 	File		mdfd_vfd;		/* fd number in fd.c's pool */
- 	BlockNumber mdfd_segno;		/* segment number, from 0 */
- 	struct _MdfdVec *mdfd_chain;	/* next segment, or NULL */
- } MdfdVec;
- 
  static MemoryContext MdCxt;		/* context for all md.c allocations */
  
  
--- 109,114 ----
***************
*** 163,175 **** static CycleCtr mdsync_cycle_ctr = 0;
  static CycleCtr mdckpt_cycle_ctr = 0;
  
  
- typedef enum					/* behavior for mdopen & _mdfd_getseg */
- {
- 	EXTENSION_FAIL,				/* ereport if segment not present */
- 	EXTENSION_RETURN_NULL,		/* return NULL if not present */
- 	EXTENSION_CREATE			/* create new segments as needed */
- } ExtensionBehavior;
- 
  /* local routines */
  static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
  			 bool isRedo);
--- 156,161 ----
***************
*** 183,190 **** static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
  			  BlockNumber segno);
  static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
  			  BlockNumber segno, int oflags);
- static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
- 			 BlockNumber blkno, bool skipFsync, ExtensionBehavior behavior);
  static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
  		   MdfdVec *seg);
  
--- 169,174 ----
***************
*** 1701,1707 **** _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
   * segment, according to "behavior".  Note: skipFsync is only used in the
   * EXTENSION_CREATE case.
   */
! static MdfdVec *
  _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
  			 bool skipFsync, ExtensionBehavior behavior)
  {
--- 1685,1691 ----
   * segment, according to "behavior".  Note: skipFsync is only used in the
   * EXTENSION_CREATE case.
   */
! extern MdfdVec *
  _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
  			 bool skipFsync, ExtensionBehavior behavior)
  {
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 61,66 ****
--- 61,67 ----
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
  #include "storage/bufmgr.h"
+ #include "storage/buf_internals.h"
  #include "storage/dsm_impl.h"
  #include "storage/standby.h"
  #include "storage/fd.h"
***************
*** 2056,2061 **** static struct config_int ConfigureNamesInt[] =
--- 2057,2072 ----
  	},
  
  	{
+ 		{"drop_duplicate_buffers", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ 			gettext_noop("drop duplicate buffers in OS file cache"),
+ 			NULL
+ 		},
+ 		&DropDuplicateBuffers,
+ 		-1, -1, BM_MAX_USAGE_COUNT,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
  		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
  			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
  			NULL,
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 115,120 ****
--- 115,122 ----
  
  #shared_buffers = 32MB			# min 128kB
  					# (change requires restart)
+ #drop_duplicate_buffers = -1            # set target min usage_count in shared_buffers
+ 					# -1 is disables
  #temp_buffers = 8MB			# min 800kB
  #max_prepared_transactions = 0		# zero disables the feature
  					# (change requires restart)
*** a/src/include/postmaster/bgwriter.h
--- b/src/include/postmaster/bgwriter.h
***************
*** 23,28 ****
--- 23,29 ----
  extern int	BgWriterDelay;
  extern int	CheckPointTimeout;
  extern int	CheckPointWarning;
+ extern int	DropDuplicateBuffers;
  extern double CheckPointCompletionTarget;
  
  extern void BackgroundWriterMain(void) __attribute__((noreturn));
*** a/src/include/storage/buf_internals.h
--- b/src/include/storage/buf_internals.h
***************
*** 40,45 ****
--- 40,46 ----
  #define BM_CHECKPOINT_NEEDED	(1 << 7)		/* must write for checkpoint */
  #define BM_PERMANENT			(1 << 8)		/* permanent relation (not
  												 * unlogged) */
+ #define BM_FADVED		(1 << 9)			/* remove duplicate file cache flag */
  
  typedef bits16 BufFlags;
  
*** a/src/include/storage/fd.h
--- b/src/include/storage/fd.h
***************
*** 68,73 **** extern int	max_safe_fds;
--- 68,74 ----
  extern File PathNameOpenFile(FileName fileName, int fileFlags, int fileMode);
  extern File OpenTemporaryFile(bool interXact);
  extern void FileClose(File file);
+ extern int	FileCacheAdvise(File file, off_t offset, off_t amount, int advise);
  extern int	FilePrefetch(File file, off_t offset, int amount);
  extern int	FileRead(File file, char *buffer, int amount);
  extern int	FileWrite(File file, char *buffer, int amount);
***************
*** 113,118 **** extern int	pg_fsync_no_writethrough(int fd);
--- 114,120 ----
  extern int	pg_fsync_writethrough(int fd);
  extern int	pg_fdatasync(int fd);
  extern int	pg_flush_data(int fd, off_t offset, off_t amount);
+ extern int	pg_fadvise(int fd, off_t offset, off_t amount, int advise);
  extern void fsync_fname(char *fname, bool isdir);
  
  /* Filename components for OpenTemporaryFile */
*** a/src/include/storage/smgr.h
--- b/src/include/storage/smgr.h
***************
*** 17,23 ****
  #include "fmgr.h"
  #include "storage/block.h"
  #include "storage/relfilenode.h"
! 
  
  /*
   * smgr.c maintains a table of SMgrRelation objects, which are essentially
--- 17,23 ----
  #include "fmgr.h"
  #include "storage/block.h"
  #include "storage/relfilenode.h"
! #include "storage/fd.h"
  
  /*
   * smgr.c maintains a table of SMgrRelation objects, which are essentially
***************
*** 76,81 **** typedef SMgrRelationData *SMgrRelation;
--- 76,96 ----
  #define SmgrIsTemp(smgr) \
  	RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
  
+ /* If you want to get more detail about MdfdVec, you read src/storage/smgr/md.c */
+ typedef struct _MdfdVec
+ {
+ 	File		mdfd_vfd;	/* fd number in fd.c's pool */
+ 	BlockNumber mdfd_segno;		/* segment number, from 0 */
+ 	struct _MdfdVec *mdfd_chain;	/* next segment, or NULL */
+ } MdfdVec;
+ 
+ typedef enum		/* behavior for mdopen & _mdfd_getseg */
+ {
+ 	EXTENSION_FAIL,			/* ereport if segment not present */
+ 	EXTENSION_RETURN_NULL,		/* return NULL if not present */
+ 	EXTENSION_CREATE		/* create new segments as needed */
+ } ExtensionBehavior;
+ 
  extern void smgrinit(void);
  extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
  extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
***************
*** 128,133 **** extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
--- 143,150 ----
  extern void mdpreckpt(void);
  extern void mdsync(void);
  extern void mdpostckpt(void);
+ extern MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
+ 		BlockNumber blkno, bool skipFsync, ExtensionBehavior behavior);
  
  extern void SetForwardFsyncRequests(void);
  extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
*** a/src/include/utils/guc_tables.h
--- b/src/include/utils/guc_tables.h
***************
*** 63,68 **** enum config_group
--- 63,69 ----
  	RESOURCES_KERNEL,
  	RESOURCES_VACUUM_DELAY,
  	RESOURCES_BGWRITER,
+ 	RESOURCES_CHECKPOINTER,
  	RESOURCES_ASYNCHRONOUS,
  	WAL,
  	WAL_SETTINGS,
#2Robert Haas
robertmhaas@gmail.com
In reply to: KONDO Mitsumasa (#1)
Re: drop duplicate buffers in OS

On Wed, Jan 15, 2014 at 1:53 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

I create patch that can drop duplicate buffers in OS using usage_count
alogorithm. I have developed this patch since last summer. This feature seems to
be discussed in hot topic, so I submit it more faster than my schedule.

When usage_count is high in shared_buffers, they are hard to drop from
shared_buffers. However, these buffers wasn't required in file cache. Because
they aren't accessed by postgres(postgres access to shared_buffers).
So I create algorithm that dropping file cache which is high usage_count in
shared_buffers and is clean state in OS. If file cache are clean state in OS, and
executing posix_fadvice DONTNEED, it can only free in file cache without writing
physical disk. This algorithm will solve double-buffered situation problem and
can use memory more efficiently.

I am testing DBT-2 benchmark now...

The thing about this is that our usage counts for shared_buffers don't
really work right now; it's common for everything, or nearly
everything, to have a usage count of 5. So I'm reluctant to rely on
that for much of anything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Aidan Van Dyk
aidan@highrise.ca
In reply to: Robert Haas (#2)
Re: drop duplicate buffers in OS

Can we just get the backend that dirties the page to the posix_fadvice
DONTNEED?

Or have another helper that sweeps the shared buffers and does this
post-first-dirty?

a.

On Wed, Jan 15, 2014 at 1:34 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jan 15, 2014 at 1:53 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

I create patch that can drop duplicate buffers in OS using usage_count
alogorithm. I have developed this patch since last summer. This feature

seems to

be discussed in hot topic, so I submit it more faster than my schedule.

When usage_count is high in shared_buffers, they are hard to drop from
shared_buffers. However, these buffers wasn't required in file cache.

Because

they aren't accessed by postgres(postgres access to shared_buffers).
So I create algorithm that dropping file cache which is high usage_count

in

shared_buffers and is clean state in OS. If file cache are clean state

in OS, and

executing posix_fadvice DONTNEED, it can only free in file cache without

writing

physical disk. This algorithm will solve double-buffered situation

problem and

can use memory more efficiently.

I am testing DBT-2 benchmark now...

The thing about this is that our usage counts for shared_buffers don't
really work right now; it's common for everything, or nearly
everything, to have a usage count of 5. So I'm reluctant to rely on
that for much of anything.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a
slave.

#4KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Aidan Van Dyk (#3)
Re: drop duplicate buffers in OS

(2014/01/16 21:38), Aidan Van Dyk wrote:

Can we just get the backend that dirties the page to the posix_fadvice DONTNEED?

No, it can remove clean page in OS file caches. Because if page is dirtied, it
cause physical-disk-writing. However, it is experimental patch so it might be
changed by future benchmark testing.

Or have another helper that sweeps the shared buffers and does this post-first-dirty?

We can add DropDuplicateOSCache() function to checkpointer process or other
process. And we can chenged posix_fadvice() DONTNEED to sync_file_range(). It can
cause physical-disk-writing in target buffer, not to free OS file caches.

I'm considering that sync_file_range() SYNC_FILE_RANGE_WAIT_BEFORE |
SYNC_FILE_RANGE_WRITE in executing checkpoint. It can avoid fsync freeze
situaition in part of of finnal checkpoint.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Robert Haas (#2)
Re: drop duplicate buffers in OS

(2014/01/16 3:34), Robert Haas wrote:

On Wed, Jan 15, 2014 at 1:53 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

I create patch that can drop duplicate buffers in OS using usage_count
alogorithm. I have developed this patch since last summer. This feature seems to
be discussed in hot topic, so I submit it more faster than my schedule.

When usage_count is high in shared_buffers, they are hard to drop from
shared_buffers. However, these buffers wasn't required in file cache. Because
they aren't accessed by postgres(postgres access to shared_buffers).
So I create algorithm that dropping file cache which is high usage_count in
shared_buffers and is clean state in OS. If file cache are clean state in OS, and
executing posix_fadvice DONTNEED, it can only free in file cache without writing
physical disk. This algorithm will solve double-buffered situation problem and
can use memory more efficiently.

I am testing DBT-2 benchmark now...

The thing about this is that our usage counts for shared_buffers don't
really work right now; it's common for everything, or nearly
everything, to have a usage count of 5. So I'm reluctant to rely on
that for much of anything.

This patch aims to large shared_buffers situations, so 10% memory shared_buffers
situaition
might be not effective. This patch is in experimental and to show how to solve
the double-buffers for one of a example.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Jeff Janes
jeff.janes@gmail.com
In reply to: Robert Haas (#2)
Re: drop duplicate buffers in OS

On Wed, Jan 15, 2014 at 10:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Jan 15, 2014 at 1:53 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

I create patch that can drop duplicate buffers in OS using usage_count
alogorithm. I have developed this patch since last summer. This feature

seems to

be discussed in hot topic, so I submit it more faster than my schedule.

When usage_count is high in shared_buffers, they are hard to drop from
shared_buffers. However, these buffers wasn't required in file cache.

Because

they aren't accessed by postgres(postgres access to shared_buffers).
So I create algorithm that dropping file cache which is high usage_count

in

shared_buffers and is clean state in OS. If file cache are clean state

in OS, and

executing posix_fadvice DONTNEED, it can only free in file cache without

writing

physical disk. This algorithm will solve double-buffered situation

problem and

can use memory more efficiently.

I am testing DBT-2 benchmark now...

Have you had any luck with it? I have reservations about this approach.
Among other reasons, if the buffer is truly nailed in shared_buffers for
the long term, the kernel won't see any activity on it and will be able to
evict it fairly efficiently on its own.

So I'm reluctant to do a detailed review if the author cannot demonstrate a
performance improvement. I'm going to mark it waiting-on-author for that
reason.

The thing about this is that our usage counts for shared_buffers don't
really work right now; it's common for everything, or nearly
everything, to have a usage count of 5.

I'm surprised that that is common. The only cases I've seen that was
either when the database exactly fits in shared_buffers, or when the
database is mostly appended, and the appends are done with inserts in a
loop rather than COPY.

Cheers,

Jeff

#7KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Jeff Janes (#6)
1 attachment(s)
Re: drop duplicate buffers in OS

Hi,

Attached is latest patch.
I change little bit at PinBuffer() in bufmgr.c. It will evict target clean file
cache in OS more exactly.

- if (!(buf->flags & BM_FADVED) && !(buf->flags & BM_JUST_DIRTIED))
+ if (!(buf->flags & BM_DIRTY) && !(buf->flags & BM_FADVED) && !(buf->flags & 
BM_JUST_DIRTIED))

(2014/01/29 8:20), Jeff Janes wrote:

On Wed, Jan 15, 2014 at 10:34 AM, Robert Haas <robertmhaas@gmail.com
<mailto:robertmhaas@gmail.com>> wrote:

On Wed, Jan 15, 2014 at 1:53 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp <mailto:kondo.mitsumasa@lab.ntt.co.jp>> wrote:

I create patch that can drop duplicate buffers in OS using usage_count
alogorithm. I have developed this patch since last summer. This feature

seems to

be discussed in hot topic, so I submit it more faster than my schedule.

When usage_count is high in shared_buffers, they are hard to drop from
shared_buffers. However, these buffers wasn't required in file cache. Because
they aren't accessed by postgres(postgres access to shared_buffers).
So I create algorithm that dropping file cache which is high usage_count in
shared_buffers and is clean state in OS. If file cache are clean state in

OS, and

executing posix_fadvice DONTNEED, it can only free in file cache without

writing

physical disk. This algorithm will solve double-buffered situation problem and
can use memory more efficiently.

I am testing DBT-2 benchmark now...

Have you had any luck with it? I have reservations about this approach. Among
other reasons, if the buffer is truly nailed in shared_buffers for the long term,
the kernel won't see any activity on it and will be able to evict it fairly
efficiently on its own.

My patch aims not to evict other useful file cache in OS. If we doesn't evict
useful file caches in shered_buffers, they will be evicted with other useful file
cache in OS. But if we evict them as soon as possible, it will be difficult to
evict other useful file cache all the more.

So I'm reluctant to do a detailed review if the author cannot demonstrate a
performance improvement. I'm going to mark it waiting-on-author for that reason.

Will you review my patch? Thank you so much! However, My patch performance is be
little bit better. It might be error range. Optimize kernel readahead patch is grate.
Too readahead in OS is too bad, and to be full of not useful file cache in OS.
Here is the test result. Plain result is tested before(readahead patch test).

* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk: 146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB
OS: RHEL 6.4(x86_64)
FS: Ext4

* DBT-2 result(WH400, SESSION=100, ideal_score=5160)
Method | score | average | 90%tile | Maximum
------------------------------------------------
plain | 3589 | 9.751 | 33.680 | 87.8036
patched | 3799 | 9.914 | 22.451 | 119.4259

* Main Settings
shared_buffers= 2458MB
drop_duplicate_buffers = 5 // patched only

I tested benchmark with drop_duplicate_buffers = 3 and 4 settings. But I didn't
get good result. So I will test with more larger shared_buffers and these settings.

[detail settings]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/dbserver/param.out

* Detail results (uploading now. please wait for a hour...)
[plain]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/index_thput.html

[patched]
http://pgstatsinfo.projects.pgfoundry.org/drop_os_cache/drop_dupulicate_cache20140129/HTML/index_thput.html

We can see faster response time at OS witeback situation(maybe) and executing
CHECKPOINT. Because when these are happened, read transactions hit file cache
more in my patch. So responce times are better than plain.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachments:

remove_duplicate_buffers_v05.patchtext/x-diff; name=remove_duplicate_buffers_v05.patchDownload
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1107,1112 **** include 'filename'
--- 1107,1130 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-drop-duplicate-buffers" xreflabel="drop-duplicate-buffers">
+       <term><varname>drop-duplicate-buffers</varname> (<type>integer</type>)</term>
+       <indexterm>
+        <primary><varname>drop-duplicate-buffers</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Sets target min usage count in shared_buffers, which is droped in file cache.
+         When you use this parameter, you set large shared_buffers that is about 50% 
+         or higher, and set parameter with 4 or 5. It need to set carefuly. If you 
+         use this parameter with small shared_buffers, transaction performance will be
+         decliend. This parameter is needed to support posix_fadvise() system call.
+         If your system doesn't support posix_fadvise(), it doesn't work at all.
+        </para>
+ 
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="guc-temp-buffers" xreflabel="temp_buffers">
        <term><varname>temp_buffers</varname> (<type>integer</type>)</term>
        <indexterm>
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 69,74 ****
--- 69,75 ----
  /* GUC variables */
  bool		zero_damaged_pages = false;
  int			bgwriter_lru_maxpages = 100;
+ int		DropDuplicateBuffers = -1;
  double		bgwriter_lru_multiplier = 2.0;
  bool		track_io_timing = false;
  
***************
*** 111,116 **** static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
--- 112,118 ----
  static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
  static void AtProcExit_Buffers(int code, Datum arg);
  static int	rnode_comparator(const void *p1, const void *p2);
+ static void DropDuplicateOSCache(volatile BufferDesc *bufHdr);
  
  
  /*
***************
*** 838,844 **** BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  	 * 1 so that the buffer can survive one clock-sweep pass.)
  	 */
  	buf->tag = newTag;
! 	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
  	if (relpersistence == RELPERSISTENCE_PERMANENT)
  		buf->flags |= BM_TAG_VALID | BM_PERMANENT;
  	else
--- 840,846 ----
  	 * 1 so that the buffer can survive one clock-sweep pass.)
  	 */
  	buf->tag = newTag;
! 	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT | BM_FADVED);
  	if (relpersistence == RELPERSISTENCE_PERMANENT)
  		buf->flags |= BM_TAG_VALID | BM_PERMANENT;
  	else
***************
*** 1104,1110 **** PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy)
--- 1106,1123 ----
  		if (strategy == NULL)
  		{
  			if (buf->usage_count < BM_MAX_USAGE_COUNT)
+ 			{
  				buf->usage_count++;
+ 
+ 				/*
+ 				 * If the buffer is clean, we can remove duplicate buffers
+ 				 * from the file cache in OS without physical disk writing,
+ 				 * which is for more efficient whole memory using. It is needed
+ 				 * to execute at once per buffers.
+ 				 */
+ 				if (!(buf->flags & BM_DIRTY) && !(buf->flags & BM_FADVED) && !(buf->flags & BM_JUST_DIRTIED))
+ 					DropDuplicateOSCache(buf);
+ 			}
  		}
  		else
  		{
***************
*** 1956,1961 **** FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln)
--- 1969,1977 ----
  	 */
  	recptr = BufferGetLSN(buf);
  
+ 	/* mark this buffer's file cache in OS isn't clean */
+ 	buf->flags |= BM_FADVED;
+ 
  	/* To check if block content changes while flushing. - vadim 01/17/97 */
  	buf->flags &= ~BM_JUST_DIRTIED;
  	UnlockBufHdr(buf);
***************
*** 2104,2109 **** BufferGetLSNAtomic(Buffer buffer)
--- 2120,2147 ----
  	return lsn;
  }
  
+ /*
+  * Drop duplicate OS cache which is in OS and shared_buffers. This purpose
+  * is for more efficient file cache space and dirty file cache management.
+  */
+ static void
+ DropDuplicateOSCache(volatile BufferDesc *bufHdr)
+ {
+ 	SMgrRelation    reln;
+ 	MdfdVec         *v;
+ 	off_t           seekpos;
+ 
+ 	/* Drop OS cache which is higher usage_count in shared_buffer */
+ 	if(DropDuplicateBuffers != -1 && bufHdr->usage_count >= DropDuplicateBuffers)
+ 	{
+ 		reln = smgropen(bufHdr->tag.rnode, InvalidBackendId);
+ 		v = _mdfd_getseg(reln, bufHdr->tag.forkNum, bufHdr->tag.blockNum, false, 0);
+ 		seekpos = (off_t) BLCKSZ *(bufHdr->tag.blockNum % ((BlockNumber) RELSEG_SIZE));
+ 		FileCacheAdvise(v->mdfd_vfd, seekpos, (off_t) BLCKSZ, POSIX_FADV_DONTNEED);
+ 		bufHdr->flags |= BM_FADVED;
+ 	}
+ }
+ 
  /* ---------------------------------------------------------------------
   *		DropRelFileNodeBuffers
   *
***************
*** 2417,2422 **** FlushRelationBuffers(Relation rel)
--- 2455,2462 ----
  				error_context_stack = &errcallback;
  
  				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
+ 				
+ 				bufHdr->flags |= BM_FADVED;
  
  				smgrwrite(rel->rd_smgr,
  						  bufHdr->tag.forkNum,
*** a/src/backend/storage/buffer/localbuf.c
--- b/src/backend/storage/buffer/localbuf.c
***************
*** 204,209 **** LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
--- 204,211 ----
  
  		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
  
+ 		bufHdr->flags |= BM_FADVED;
+ 
  		/* And write... */
  		smgrwrite(oreln,
  				  bufHdr->tag.forkNum,
*** a/src/backend/storage/file/fd.c
--- b/src/backend/storage/file/fd.c
***************
*** 383,388 **** pg_flush_data(int fd, off_t offset, off_t amount)
--- 383,403 ----
  	return 0;
  }
  
+ /*
+  * pg_fadvise --- advise OS that the cache will need or not
+  *
+  * Not all platforms have posix_fadvise. If it does not support posix_fadvise,
+  * we do nothing about here.
+  */
+ int
+ pg_fadvise(int fd, off_t offset, off_t amount, int advise)
+ {
+ #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
+ 	return posix_fadvise(fd, offset, amount, advise);
+ #else
+ 	return 0;
+ #endif
+ }
  
  /*
   * fsync_fname -- fsync a file or directory, handling errors properly
***************
*** 1142,1147 **** OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
--- 1157,1171 ----
  }
  
  /*
+  * Controling OS file cache using posix_fadvise()
+  */
+ int
+ FileCacheAdvise(File file, off_t offset, off_t amount, int advise)
+ {
+ 	return pg_fadvise(VfdCache[file].fd, offset, amount, advise);
+ }
+ 
+ /*
   * close a file when done with it
   */
  void
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 109,121 ****
   *	All MdfdVec objects are palloc'd in the MdCxt memory context.
   */
  
- typedef struct _MdfdVec
- {
- 	File		mdfd_vfd;		/* fd number in fd.c's pool */
- 	BlockNumber mdfd_segno;		/* segment number, from 0 */
- 	struct _MdfdVec *mdfd_chain;	/* next segment, or NULL */
- } MdfdVec;
- 
  static MemoryContext MdCxt;		/* context for all md.c allocations */
  
  
--- 109,114 ----
***************
*** 163,175 **** static CycleCtr mdsync_cycle_ctr = 0;
  static CycleCtr mdckpt_cycle_ctr = 0;
  
  
- typedef enum					/* behavior for mdopen & _mdfd_getseg */
- {
- 	EXTENSION_FAIL,				/* ereport if segment not present */
- 	EXTENSION_RETURN_NULL,		/* return NULL if not present */
- 	EXTENSION_CREATE			/* create new segments as needed */
- } ExtensionBehavior;
- 
  /* local routines */
  static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
  			 bool isRedo);
--- 156,161 ----
***************
*** 183,190 **** static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
  			  BlockNumber segno);
  static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
  			  BlockNumber segno, int oflags);
- static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
- 			 BlockNumber blkno, bool skipFsync, ExtensionBehavior behavior);
  static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
  		   MdfdVec *seg);
  
--- 169,174 ----
***************
*** 1701,1707 **** _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
   * segment, according to "behavior".  Note: skipFsync is only used in the
   * EXTENSION_CREATE case.
   */
! static MdfdVec *
  _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
  			 bool skipFsync, ExtensionBehavior behavior)
  {
--- 1685,1691 ----
   * segment, according to "behavior".  Note: skipFsync is only used in the
   * EXTENSION_CREATE case.
   */
! extern MdfdVec *
  _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
  			 bool skipFsync, ExtensionBehavior behavior)
  {
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 61,66 ****
--- 61,67 ----
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
  #include "storage/bufmgr.h"
+ #include "storage/buf_internals.h"
  #include "storage/dsm_impl.h"
  #include "storage/standby.h"
  #include "storage/fd.h"
***************
*** 2056,2061 **** static struct config_int ConfigureNamesInt[] =
--- 2057,2072 ----
  	},
  
  	{
+ 		{"drop_duplicate_buffers", PGC_SIGHUP, RESOURCES_CHECKPOINTER,
+ 			gettext_noop("drop duplicate buffers in OS file cache"),
+ 			NULL
+ 		},
+ 		&DropDuplicateBuffers,
+ 		-1, -1, BM_MAX_USAGE_COUNT,
+ 		NULL, NULL, NULL
+ 	},
+ 
+ 	{
  		{"wal_buffers", PGC_POSTMASTER, WAL_SETTINGS,
  			gettext_noop("Sets the number of disk-page buffers in shared memory for WAL."),
  			NULL,
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 115,120 ****
--- 115,122 ----
  
  #shared_buffers = 32MB			# min 128kB
  					# (change requires restart)
+ #drop_duplicate_buffers = -1            # set target min usage_count in shared_buffers
+ 					# -1 is disables
  #temp_buffers = 8MB			# min 800kB
  #max_prepared_transactions = 0		# zero disables the feature
  					# (change requires restart)
*** a/src/include/postmaster/bgwriter.h
--- b/src/include/postmaster/bgwriter.h
***************
*** 23,28 ****
--- 23,29 ----
  extern int	BgWriterDelay;
  extern int	CheckPointTimeout;
  extern int	CheckPointWarning;
+ extern int	DropDuplicateBuffers;
  extern double CheckPointCompletionTarget;
  
  extern void BackgroundWriterMain(void) __attribute__((noreturn));
*** a/src/include/storage/buf_internals.h
--- b/src/include/storage/buf_internals.h
***************
*** 40,45 ****
--- 40,46 ----
  #define BM_CHECKPOINT_NEEDED	(1 << 7)		/* must write for checkpoint */
  #define BM_PERMANENT			(1 << 8)		/* permanent relation (not
  												 * unlogged) */
+ #define BM_FADVED		(1 << 9)			/* remove duplicate file cache flag */
  
  typedef bits16 BufFlags;
  
*** a/src/include/storage/fd.h
--- b/src/include/storage/fd.h
***************
*** 68,73 **** extern int	max_safe_fds;
--- 68,74 ----
  extern File PathNameOpenFile(FileName fileName, int fileFlags, int fileMode);
  extern File OpenTemporaryFile(bool interXact);
  extern void FileClose(File file);
+ extern int	FileCacheAdvise(File file, off_t offset, off_t amount, int advise);
  extern int	FilePrefetch(File file, off_t offset, int amount);
  extern int	FileRead(File file, char *buffer, int amount);
  extern int	FileWrite(File file, char *buffer, int amount);
***************
*** 113,118 **** extern int	pg_fsync_no_writethrough(int fd);
--- 114,120 ----
  extern int	pg_fsync_writethrough(int fd);
  extern int	pg_fdatasync(int fd);
  extern int	pg_flush_data(int fd, off_t offset, off_t amount);
+ extern int	pg_fadvise(int fd, off_t offset, off_t amount, int advise);
  extern void fsync_fname(char *fname, bool isdir);
  
  /* Filename components for OpenTemporaryFile */
*** a/src/include/storage/smgr.h
--- b/src/include/storage/smgr.h
***************
*** 17,23 ****
  #include "fmgr.h"
  #include "storage/block.h"
  #include "storage/relfilenode.h"
! 
  
  /*
   * smgr.c maintains a table of SMgrRelation objects, which are essentially
--- 17,23 ----
  #include "fmgr.h"
  #include "storage/block.h"
  #include "storage/relfilenode.h"
! #include "storage/fd.h"
  
  /*
   * smgr.c maintains a table of SMgrRelation objects, which are essentially
***************
*** 76,81 **** typedef SMgrRelationData *SMgrRelation;
--- 76,96 ----
  #define SmgrIsTemp(smgr) \
  	RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
  
+ /* If you want to get more detail about MdfdVec, you read src/storage/smgr/md.c */
+ typedef struct _MdfdVec
+ {
+ 	File		mdfd_vfd;	/* fd number in fd.c's pool */
+ 	BlockNumber mdfd_segno;		/* segment number, from 0 */
+ 	struct _MdfdVec *mdfd_chain;	/* next segment, or NULL */
+ } MdfdVec;
+ 
+ typedef enum		/* behavior for mdopen & _mdfd_getseg */
+ {
+ 	EXTENSION_FAIL,			/* ereport if segment not present */
+ 	EXTENSION_RETURN_NULL,		/* return NULL if not present */
+ 	EXTENSION_CREATE		/* create new segments as needed */
+ } ExtensionBehavior;
+ 
  extern void smgrinit(void);
  extern SMgrRelation smgropen(RelFileNode rnode, BackendId backend);
  extern bool smgrexists(SMgrRelation reln, ForkNumber forknum);
***************
*** 128,133 **** extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
--- 143,150 ----
  extern void mdpreckpt(void);
  extern void mdsync(void);
  extern void mdpostckpt(void);
+ extern MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
+ 		BlockNumber blkno, bool skipFsync, ExtensionBehavior behavior);
  
  extern void SetForwardFsyncRequests(void);
  extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
*** a/src/include/utils/guc_tables.h
--- b/src/include/utils/guc_tables.h
***************
*** 63,68 **** enum config_group
--- 63,69 ----
  	RESOURCES_KERNEL,
  	RESOURCES_VACUUM_DELAY,
  	RESOURCES_BGWRITER,
+ 	RESOURCES_CHECKPOINTER,
  	RESOURCES_ASYNCHRONOUS,
  	WAL,
  	WAL_SETTINGS,
#8Robert Haas
robertmhaas@gmail.com
In reply to: KONDO Mitsumasa (#7)
Re: drop duplicate buffers in OS

On Wed, Jan 29, 2014 at 2:53 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

Attached is latest patch.
I change little bit at PinBuffer() in bufmgr.c. It will evict target clean
file cache in OS more exactly.

- if (!(buf->flags & BM_FADVED) && !(buf->flags & BM_JUST_DIRTIED))
+ if (!(buf->flags & BM_DIRTY) && !(buf->flags & BM_FADVED) && !(buf->flags
& BM_JUST_DIRTIED))

(2014/01/29 8:20), Jeff Janes wrote:

On Wed, Jan 15, 2014 at 10:34 AM, Robert Haas <robertmhaas@gmail.com
<mailto:robertmhaas@gmail.com>> wrote:

On Wed, Jan 15, 2014 at 1:53 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp <mailto:kondo.mitsumasa@lab.ntt.co.jp>>
wrote:

I create patch that can drop duplicate buffers in OS using

usage_count

alogorithm. I have developed this patch since last summer. This

feature
seems to

be discussed in hot topic, so I submit it more faster than my

schedule.

When usage_count is high in shared_buffers, they are hard to drop

from

shared_buffers. However, these buffers wasn't required in file

cache. Because

they aren't accessed by postgres(postgres access to

shared_buffers).

So I create algorithm that dropping file cache which is high

usage_count in

shared_buffers and is clean state in OS. If file cache are clean

state in
OS, and

executing posix_fadvice DONTNEED, it can only free in file cache

without
writing

physical disk. This algorithm will solve double-buffered situation

problem and

can use memory more efficiently.

I am testing DBT-2 benchmark now...

Have you had any luck with it? I have reservations about this approach.
Among
other reasons, if the buffer is truly nailed in shared_buffers for the
long term,
the kernel won't see any activity on it and will be able to evict it
fairly
efficiently on its own.

My patch aims not to evict other useful file cache in OS. If we doesn't
evict useful file caches in shered_buffers, they will be evicted with other
useful file cache in OS. But if we evict them as soon as possible, it will
be difficult to evict other useful file cache all the more.

So I'm reluctant to do a detailed review if the author cannot demonstrate
a
performance improvement. I'm going to mark it waiting-on-author for that
reason.

Will you review my patch? Thank you so much! However, My patch performance
is be
little bit better. It might be error range. Optimize kernel readahead patch
is grate.
Too readahead in OS is too bad, and to be full of not useful file cache in
OS.
Here is the test result. Plain result is tested before(readahead patch
test).

* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk: 146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB
OS: RHEL 6.4(x86_64)
FS: Ext4

* DBT-2 result(WH400, SESSION=100, ideal_score=5160)
Method | score | average | 90%tile | Maximum
------------------------------------------------
plain | 3589 | 9.751 | 33.680 | 87.8036
patched | 3799 | 9.914 | 22.451 | 119.4259

* Main Settings
shared_buffers= 2458MB
drop_duplicate_buffers = 5 // patched only

I tested benchmark with drop_duplicate_buffers = 3 and 4 settings. But I
didn't get good result. So I will test with more larger shared_buffers and
these settings.

[detail settings]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/dbserver/param.out

* Detail results (uploading now. please wait for a hour...)
[plain]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/index_thput.html
[patched]
http://pgstatsinfo.projects.pgfoundry.org/drop_os_cache/drop_dupulicate_cache20140129/HTML/index_thput.html

We can see faster response time at OS witeback situation(maybe) and
executing CHECKPOINT. Because when these are happened, read transactions hit
file cache more in my patch. So responce times are better than plain.

I think it's pretty clear that these results are not good enough to
justify committing this patch. To do something like this, we need to
have a lot of confidence that this will be a win not just on one
particular system or workload, but rather that it's got to be a
general win across many systems and workloads. I'm not convinced
that's true, and if it is true the test results submitted thus far are
nowhere near sufficient to establish it, and I can't see that changing
in the next few weeks. So I think it's pretty clear that we should
mark this Returned with Feedback for now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers