Optimize kernel readahead using buffer access strategy

Started by KONDO Mitsumasa about 12 years ago, 27 messages
#1 KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
3 attachment(s)

Hi,

I have created a patch that improves disk reads and the use of the OS file cache. It
tunes the kernel readahead parameter using the buffer access strategy and
posix_fadvise() in various disk-read situations.

In a typical OS, the readahead parameter is decided dynamically from the disk-read
pattern. After a long run of disk reads, the readahead parameter becomes large.
However, because this relies on an empirical, heuristic algorithm, in some cases it
causes wasted disk reads and evicts useful pages from the OS file cache. That hurts
disk-read performance considerably.

My proposed method controls the OS readahead parameter by combining the buffer
access strategy in PostgreSQL with the posix_fadvise() system call, which can
control OS readahead. That said, this is a common technique in databases.
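
As an illustration only (this sketch is not taken from the attached patch), steering kernel readahead with posix_fadvise() might look roughly like the following. It assumes a Linux-like platform where posix_fadvise() and the POSIX_FADV_* constants exist; access_pattern, advice_for_pattern and hint_readahead are hypothetical names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L	/* ask libc to declare posix_fadvise() */
#endif

#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical classification of how a file is about to be read. */
typedef enum
{
	ACCESS_SEQUENTIAL,			/* e.g. a large sequential scan */
	ACCESS_RANDOM				/* e.g. scattered single-page lookups */
} access_pattern;

/* Map the expected access pattern to a posix_fadvise() advice value. */
int
advice_for_pattern(access_pattern pattern)
{
	return (pattern == ACCESS_SEQUENTIAL) ? POSIX_FADV_SEQUENTIAL
										  : POSIX_FADV_RANDOM;
}

/*
 * Hint the kernel about upcoming reads on fd.  Offset 0 with length 0
 * covers the whole file.  Returns 0 on success or an error number on
 * failure (posix_fadvise() returns the error directly, not via errno).
 */
int
hint_readahead(int fd, access_pattern pattern)
{
	assert(pattern == ACCESS_SEQUENTIAL || pattern == ACCESS_RANDOM);
	return posix_fadvise(fd, 0, 0, advice_for_pattern(pattern));
}
```

The hint is advisory: the kernel may ignore it, and its exact effect on the readahead window depends on the kernel version.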

To show the effect of this patch, I ran pgbench against a database that fits in
memory and one that exceeds memory, using the postgresql.conf settings we
normally use. The results suggest that performance improves. I also think
that this feature will be necessary for business intelligence, which is to be
realized in PostgreSQL version 10. I seriously believe Simon's presentation at
PostgreSQL Conference Europe 2013! It was very exciting!!!

However, PostgreSQL has many kinds of disk-read methods, which are selected by the
planner. I think that we need to discuss situations other than pgbench, as well as
cold-cache situations. Optimizing the kernel readahead parameter while taking the
planner into account seems quite difficult, so I am seriously recruiting a
co-author for this patch :-)

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachments:

pgbench-result_in_RAM.jpg (image/jpeg)
pgbench-result_over_RAM.jpg (image/jpeg)
optimize_kernel-readahead_using_buffer-access-strategy_v1.patch (text/x-diff)
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0b31f55..e4b411f 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -9117,7 +9117,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 		/* If we got a cancel signal during the copy of the data, quit */
 		CHECK_FOR_INTERRUPTS();
 
-		smgrread(src, forkNum, blkno, buf);
+		smgrread(src, forkNum, blkno, buf, BAS_BULKREAD);
 
 		if (!PageIsVerified(page, blkno))
 			ereport(ERROR,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f848391..488cdf1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -41,6 +41,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
+#include "storage/buf.h"
 #include "storage/buf_internals.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
@@ -451,7 +452,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			if (track_io_timing)
 				INSTR_TIME_SET_CURRENT(io_start);
 
-			smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+			smgrread(smgr, forkNum, blockNum, (char *) bufBlock, strategy);
 
 			if (track_io_timing)
 			{
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index de4d902..8cda2f9 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -73,8 +73,10 @@
 #include "catalog/pg_tablespace.h"
 #include "common/relpath.h"
 #include "pgstat.h"
+#include "storage/buf.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/bufmgr.h"
 #include "utils/guc.h"
 #include "utils/resowner_private.h"
 
@@ -383,6 +385,21 @@ pg_flush_data(int fd, off_t offset, off_t amount)
 	return 0;
 }
 
+/*
+ * pg_fadvise --- advise OS that the cache will need or not
+ *
+ * Not all platforms have posix_fadvise. If it does not support posix_fadvise,
+ * we do nothing about here.
+ */
+int
+pg_fadvise(int fd, off_t offset, off_t amount, int advise)
+{
+#if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) && defined(POSIX_FADV_RANDOM) && defined(POSIX_FADV_SEQUENTIAL)
+	return posix_fadvise(fd, offset, amount, advise);
+#else
+	return 0;
+#endif
+}
 
 /*
  * fsync_fname -- fsync a file or directory, handling errors properly
@@ -1142,6 +1159,33 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
 }
 
 /*
+ * Controling OS file cache using posix_fadvise()
+ */
+int
+FileCacheAdvise(File file, off_t offset, off_t amount, int advise)
+{
+	return pg_fadvise(VfdCache[file].fd, offset, amount, advise);
+}
+
+/*
+ * Select OS readahead strategy using buffer hint. If we select POSIX_FADV_SEQUENTIAL,
+ * readahead parameter becomes the maximum and can read more faster. On the other hand,
+ * if we select POSIX_FADV_RANDOM, readahead wasn't executed at all and file cache
+ * replace algorithm will be more smart. Because it can calculate correct number of accesses
+ * which are hot data.
+ */
+int
+BufferHintIOAdvise(File file, off_t offset, off_t amount, char *strategy)
+{
+	if(strategy != NULL)
+			/* use maximum readahead setting in kernel, we can read more faster */
+			return FileCacheAdvise(file, offset, amount, POSIX_FADV_SEQUENTIAL);
+	else
+			/* don't use readahead in kernel, so we can more effectively use OS file cache */
+			return FileCacheAdvise(file, offset, amount, POSIX_FADV_RANDOM);
+}
+
+/*
  * close a file when done with it
  */
 void
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e629181..e8ff0b0 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -653,7 +653,7 @@ mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
  */
 void
 mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-	   char *buffer)
+	   char *buffer, char *strategy)
 {
 	off_t		seekpos;
 	int			nbytes;
@@ -677,6 +677,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 				 errmsg("could not seek to block %u in file \"%s\": %m",
 						blocknum, FilePathName(v->mdfd_vfd))));
 
+	BufferHintIOAdvise(v->mdfd_vfd, buffer, BLCKSZ, strategy);
 	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
 
 	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f7f1437..7a38aec 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,7 +50,7 @@ typedef struct f_smgr
 	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
 											  BlockNumber blocknum);
 	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
-										  BlockNumber blocknum, char *buffer);
+					  BlockNumber blocknum, char *buffer, char *strategy);
 	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, char *buffer, bool skipFsync);
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -588,9 +588,9 @@ smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
  */
 void
 smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-		 char *buffer)
+		 char *buffer, char *strategy)
 {
-	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer);
+	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer, strategy);
 }
 
 /*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6dc031e..ca9a16a 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -44,6 +44,8 @@ typedef enum
 /* in globals.c ... this duplicates miscadmin.h */
 extern PGDLLIMPORT int NBuffers;
 
+
+
 /* in bufmgr.c */
 extern bool zero_damaged_pages;
 extern int	bgwriter_lru_maxpages;
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 2a60229..3922c0a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -68,6 +68,7 @@ extern int	max_safe_fds;
 extern File PathNameOpenFile(FileName fileName, int fileFlags, int fileMode);
 extern File OpenTemporaryFile(bool interXact);
 extern void FileClose(File file);
+extern int	FileCacheAdvise(File file, off_t offset, off_t amount, int advise);
 extern int	FilePrefetch(File file, off_t offset, int amount);
 extern int	FileRead(File file, char *buffer, int amount);
 extern int	FileWrite(File file, char *buffer, int amount);
@@ -75,6 +76,7 @@ extern int	FileSync(File file);
 extern off_t FileSeek(File file, off_t offset, int whence);
 extern int	FileTruncate(File file, off_t offset);
 extern char *FilePathName(File file);
+extern int	BufferHintIOAdvise(File file, off_t offset, off_t amount, char *strategy);
 
 /* Operations that allow use of regular stdio --- USE WITH CAUTION */
 extern FILE *AllocateFile(const char *name, const char *mode);
@@ -113,6 +115,7 @@ extern int	pg_fsync_no_writethrough(int fd);
 extern int	pg_fsync_writethrough(int fd);
 extern int	pg_fdatasync(int fd);
 extern int	pg_flush_data(int fd, off_t offset, off_t amount);
+extern int	pg_fadvise(int fd, off_t offset, off_t amount, int advise);
 extern void fsync_fname(char *fname, bool isdir);
 
 /* Filename components for OpenTemporaryFile */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 98b6f13..0c4f14e 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,7 +92,7 @@ extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
 extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber blocknum);
 extern void smgrread(SMgrRelation reln, ForkNumber forknum,
-		 BlockNumber blocknum, char *buffer);
+			BlockNumber blocknum, char *buffer, char *strategy);
 extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
 		  BlockNumber blocknum, char *buffer, bool skipFsync);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -118,7 +118,7 @@ extern void mdextend(SMgrRelation reln, ForkNumber forknum,
 extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber blocknum);
 extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
-	   char *buffer);
+	   char *buffer, char *strategy);
 extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
 		BlockNumber blocknum, char *buffer, bool skipFsync);
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
#2 Claudio Freire
klaussfreire@gmail.com
In reply to: KONDO Mitsumasa (#1)
Re: Optimize kernel readahead using buffer access strategy

On Thu, Nov 14, 2013 at 9:09 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

I have created a patch that improves disk reads and the use of the OS file cache. It
tunes the kernel readahead parameter using the buffer access strategy and
posix_fadvise() in various disk-read situations.

In a typical OS, the readahead parameter is decided dynamically from the disk-read
pattern. After a long run of disk reads, the readahead parameter becomes large.
However, because this relies on an empirical, heuristic algorithm, in some cases it
causes wasted disk reads and evicts useful pages from the OS file cache. That hurts
disk-read performance considerably.

It would be relevant to know which kernel you used for those tests.

@@ -677,6 +677,7 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 				 errmsg("could not seek to block %u in file \"%s\": %m",
 						blocknum, FilePathName(v->mdfd_vfd))));
 
+	BufferHintIOAdvise(v->mdfd_vfd, buffer, BLCKSZ, strategy);
 	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
 
 	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,

A while back, I tried to use posix_fadvise to prefetch index pages. I
ended up finding out that interleaving posix_fadvise with I/O like
that severely hinders (i.e., completely disables) the kernel's read-ahead
algorithm.

How exactly did you set up those benchmarks? pgbench defaults?

pgbench does not exercise heavy sequential access patterns, or long
index scans. It performs many single-page index lookups per
transaction and that's it. You may want to try your patch with more
real workloads, and maybe you'll confirm what I found out last time I
messed with posix_fadvise. If my experience is still relevant, those
patterns will have suffered a severe performance penalty with this
patch, because it will disable kernel read-ahead on sequential index
access. It may still work for sequential heap scans, because the
access strategy will tell the kernel to do read-ahead, but many other
access methods will suffer.

Try OLAP-style queries.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3 Fujii Masao
masao.fujii@gmail.com
In reply to: KONDO Mitsumasa (#1)
Re: Optimize kernel readahead using buffer access strategy

On Thu, Nov 14, 2013 at 9:09 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

Hi,

I have created a patch that improves disk reads and the use of the OS file cache. It
tunes the kernel readahead parameter using the buffer access strategy and
posix_fadvise() in various disk-read situations.

When I compiled the HEAD code with this patch on MacOS, I got the following
error and warnings.

gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -I../../../../src/include -c -o fd.o fd.c
fd.c: In function 'BufferHintIOAdvise':
fd.c:1182: error: 'POSIX_FADV_SEQUENTIAL' undeclared (first use in
this function)
fd.c:1182: error: (Each undeclared identifier is reported only once
fd.c:1182: error: for each function it appears in.)
fd.c:1185: error: 'POSIX_FADV_RANDOM' undeclared (first use in this function)
make[4]: *** [fd.o] Error 1
make[3]: *** [file-recursive] Error 2
make[2]: *** [storage-recursive] Error 2
make[1]: *** [install-backend-recurse] Error 2
make: *** [install-src-recurse] Error 2

tablecmds.c:9120: warning: passing argument 5 of 'smgrread' makes
pointer from integer without a cast
bufmgr.c:455: warning: passing argument 5 of 'smgrread' from
incompatible pointer type

Regards,

--
Fujii Masao

#4 KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Claudio Freire (#2)
Re: Optimize kernel readahead using buffer access strategy

Hi Claudio,

(2013/11/14 22:53), Claudio Freire wrote:

On Thu, Nov 14, 2013 at 9:09 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

I have created a patch that improves disk reads and the use of the OS file cache. It
tunes the kernel readahead parameter using the buffer access strategy and
posix_fadvise() in various disk-read situations.

In a typical OS, the readahead parameter is decided dynamically from the disk-read
pattern. After a long run of disk reads, the readahead parameter becomes large.
However, because this relies on an empirical, heuristic algorithm, in some cases it
causes wasted disk reads and evicts useful pages from the OS file cache. That hurts
disk-read performance considerably.

It would be relevant to know which kernel you used for those tests.

I used CentOS 6.4 with kernel version 2.6.32-358.23.2.el6.x86_64 in this test.

A while back, I tried to use posix_fadvise to prefetch index pages.

I searched for your past work. Are you referring to this mailing-list thread, or is
there a more recent discussion? Your patch looks interesting, but it was not
submitted to a CommitFest and the discussion stopped.
/messages/by-id/CAGTBQpZzf70n0PYJ=VQLd+jb3wJGo=2TXmY+SkJD6G_vjC5QNg@mail.gmail.com

I ended up finding out that interleaving posix_fadvise with I/O like
that severely hinders (i.e., completely disables) the kernel's read-ahead
algorithm.

Your patch switches to maximum readahead when a query uses an index range scan. Is
that right? I think that your patch assumes that pages are ordered by index data.
This assumption is only partially correct. If it always held, we would not need the
CLUSTER command. In practice, running CLUSTER gives better performance than not
running it.

How exactly did you set up those benchmarks? pgbench defaults?

My detailed test settings are as follows:
* Server info
CPU: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz (2U/12C)
RAM: 6GB
-> I reduced it intentionally with an OS parameter, because tests with large
memory take a long time.
HDD: SEAGATE Model: ST2000NM0001 @ 7200rpm * 1
RAID: none.

* postgresql.conf(summarized)
shared_buffers = 600MB (10% of RAM = 6GB)
work_mem = 1MB
maintenance_work_mem = 64MB
wal_level = archive
fsync = on
archive_mode = on
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.7

* pgbench settings
pgbench -j 4 -c 32 -T 600 pgbench

pgbench does not exercise heavy sequential access patterns, or long
index scans. It performs many single-page index lookups per
transaction and that's it.

Yes, your argument is right. It is also a fact that performance improves
in these situations.

You may want to try your patch with more
real workloads, and maybe you'll confirm what I found out last time I
messed with posix_fadvise. If my experience is still relevant, those
patterns will have suffered a severe performance penalty with this
patch, because it will disable kernel read-ahead on sequential index
access. It may still work for sequential heap scans, because the
access strategy will tell the kernel to do read-ahead, but many other
access methods will suffer.

The decisive difference from your patch is that my patch uses the buffer hint
control architecture, so it can control readahead more intelligently in some cases.
However, my patch is still a work in progress and needs more improvement. I am going
to add a way to control readahead through a GUC, so that users can freely choose
the readahead parameter in their transactions.

Try OLAP-style queries.

I have the DBT-3 (TPC-H) benchmark tools. If you don't like TPC-H, could you
recommend a good OLAP benchmark tool?

Regards,
--
Mitsumasa KONDO
NTT Open Source Software

#5 Peter Geoghegan
pg@heroku.com
In reply to: KONDO Mitsumasa (#1)
Re: Optimize kernel readahead using buffer access strategy

On Thu, Nov 14, 2013 at 6:18 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

I will fix it. Could you tell me your Mac OS version and gcc version? I only
have a MacBook Air with OS X Mavericks (10.9).

I have an idea that Mac OSX doesn't have posix_fadvise at all. Didn't
you use the relevant macros so that the code at least builds on those
platforms?
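
As an illustration only (not part of the original message), one common way to keep such code building on platforms without posix_fadvise() is to test for the advice constants at compile time and fall back to a no-op. In this sketch io_hint, advise_supported and advise_file are hypothetical names, not PostgreSQL functions.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L	/* ask libc to declare posix_fadvise() where it exists */
#endif

#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/* Platform-neutral hints, so callers never name POSIX_FADV_* directly. */
typedef enum
{
	IO_HINT_SEQUENTIAL,
	IO_HINT_RANDOM
} io_hint;

/* Report whether hints are actually passed to the kernel in this build. */
int
advise_supported(void)
{
#if defined(POSIX_FADV_SEQUENTIAL) && defined(POSIX_FADV_RANDOM)
	return 1;
#else
	return 0;
#endif
}

/*
 * Pass a readahead hint to the kernel where possible.  Returns 0 on
 * success or when hints are unsupported, otherwise an error number.
 */
int
advise_file(int fd, off_t offset, off_t len, io_hint hint)
{
	assert(hint == IO_HINT_SEQUENTIAL || hint == IO_HINT_RANDOM);
#if defined(POSIX_FADV_SEQUENTIAL) && defined(POSIX_FADV_RANDOM)
	return posix_fadvise(fd, offset, len,
						 hint == IO_HINT_SEQUENTIAL ? POSIX_FADV_SEQUENTIAL
						 : POSIX_FADV_RANDOM);
#else
	/* No posix_fadvise() here: mark the arguments as used and do nothing. */
	(void) fd;
	(void) offset;
	(void) len;
	return 0;
#endif
}
```

Because the POSIX_FADV_* names appear only inside the guarded branch, the caller's code compiles unchanged whether or not the platform provides them.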

--
Peter Geoghegan

#6 KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Fujii Masao (#3)
Re: Optimize kernel readahead using buffer access strategy

(2013/11/15 2:03), Fujii Masao wrote:

On Thu, Nov 14, 2013 at 9:09 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

Hi,

I have created a patch that improves disk reads and the use of the OS file cache. It
tunes the kernel readahead parameter using the buffer access strategy and
posix_fadvise() in various disk-read situations.

When I compiled the HEAD code with this patch on MacOS, I got the following
error and warnings.

gcc -O0 -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -g -I../../../../src/include -c -o fd.o fd.c
fd.c: In function 'BufferHintIOAdvise':
fd.c:1182: error: 'POSIX_FADV_SEQUENTIAL' undeclared (first use in
this function)
fd.c:1182: error: (Each undeclared identifier is reported only once
fd.c:1182: error: for each function it appears in.)
fd.c:1185: error: 'POSIX_FADV_RANDOM' undeclared (first use in this function)
make[4]: *** [fd.o] Error 1
make[3]: *** [file-recursive] Error 2
make[2]: *** [storage-recursive] Error 2
make[1]: *** [install-backend-recurse] Error 2
make: *** [install-src-recurse] Error 2

tablecmds.c:9120: warning: passing argument 5 of 'smgrread' makes
pointer from integer without a cast
bufmgr.c:455: warning: passing argument 5 of 'smgrread' from
incompatible pointer type

Thank you for your report!
I will fix it. Could you tell me your Mac OS version and gcc version? I only
have a MacBook Air with OS X Mavericks (10.9).

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

#7 KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Peter Geoghegan (#5)
Re: Optimize kernel readahead using buffer access strategy

(2013/11/15 11:17), Peter Geoghegan wrote:

On Thu, Nov 14, 2013 at 6:18 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

I will fix it. Could you tell me your Mac OS version and gcc version? I only
have a MacBook Air with OS X Mavericks (10.9).

I have an idea that Mac OSX doesn't have posix_fadvise at all. Didn't
you use the relevant macros so that the code at least builds on those
platforms?

Thank you for your good advice, too.
I will try to fix the macro handling.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

#8 Claudio Freire
klaussfreire@gmail.com
In reply to: KONDO Mitsumasa (#4)
Re: Optimize kernel readahead using buffer access strategy

On Thu, Nov 14, 2013 at 11:13 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

Hi Claudio,

(2013/11/14 22:53), Claudio Freire wrote:

On Thu, Nov 14, 2013 at 9:09 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

I have created a patch that improves disk reads and the use of the OS file cache. It
tunes the kernel readahead parameter using the buffer access strategy and
posix_fadvise() in various disk-read situations.

In a typical OS, the readahead parameter is decided dynamically from the disk-read
pattern. After a long run of disk reads, the readahead parameter becomes large.
However, because this relies on an empirical, heuristic algorithm, in some cases it
causes wasted disk reads and evicts useful pages from the OS file cache. That hurts
disk-read performance considerably.

It would be relevant to know which kernel you used for those tests.

I used CentOS 6.4 with kernel version 2.6.32-358.23.2.el6.x86_64 in this test.

That's close to the kernel version I was using, so you should see the
same effect.

A while back, I tried to use posix_fadvise to prefetch index pages.

I searched for your past work. Are you referring to this mailing-list thread, or is
there a more recent discussion? Your patch looks interesting, but it was not
submitted to a CommitFest and the discussion stopped.
/messages/by-id/CAGTBQpZzf70n0PYJ=VQLd+jb3wJGo=2TXmY+SkJD6G_vjC5QNg@mail.gmail.com

Yes, I didn't, exactly because of that bad interaction with the
kernel. It needs either more smarts to only do fadvise on known-random
patterns (what you did mostly), or an accompanying kernel patch (which
I was working on, but ran out of test machines).

I ended up finding out that interleaving posix_fadvise with I/O like
that severely hinders (i.e., completely disables) the kernel's read-ahead
algorithm.

Your patch switches to maximum readahead when a query uses an index range
scan. Is that right?

Ehm... sorta.

I think that your patch assumes that pages are ordered by
index data.

No. It just knows which pages will be needed, and fadvises them. No
guessing involved, except the guess that the scan will not be aborted.
There's a heuristic to stop limited scans from attempting to fadvise,
and that's that prefetch strategy is applied only from the Nth+ page
walk.
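
As an illustration only (not code from that earlier patch), the threshold heuristic described above could be sketched as a simple counter. The names scan_state, PREFETCH_START_PAGE and should_prefetch are hypothetical, and the threshold value is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical threshold: start prefetching from this page walk onward. */
#define PREFETCH_START_PAGE 3

typedef struct
{
	unsigned	pages_walked;	/* index pages visited so far in this scan */
} scan_state;

/*
 * Record one more page walk and report whether prefetch hints should be
 * issued for it.  Scans that end before the threshold never pay the cost
 * of issuing hints.
 */
bool
should_prefetch(scan_state *scan)
{
	assert(scan != NULL);
	scan->pages_walked++;
	return scan->pages_walked >= PREFETCH_START_PAGE;
}
```

With this value, the first two page walks of a scan issue no hints and every later walk does.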

It improves index-only scans the most, but I also attempted to handle
heap prefetches. That's where the kernel started conspiring against
me, because I used many naturally-clustered indexes, and THERE
performance was adversely affected because of that kernel bug.

You may want to try your patch with more
real workloads, and maybe you'll confirm what I found out last time I
messed with posix_fadvise. If my experience is still relevant, those
patterns will have suffered a severe performance penalty with this
patch, because it will disable kernel read-ahead on sequential index
access. It may still work for sequential heap scans, because the
access strategy will tell the kernel to do read-ahead, but many other
access methods will suffer.

The decisive difference from your patch is that my patch uses the buffer hint
control architecture, so it can control readahead more intelligently in some cases.

Indeed, but it's not enough. See my above comment about naturally
clustered indexes. The planner expects that, and plans accordingly. It
will notice correlation between a PK and physical location, and will
treat an index scan over PK to be almost sequential. With your patch,
that assumption will be broken I believe.

However, my patch is still a work in progress and needs more improvement. I am going
to add a way to control readahead through a GUC, so that users can freely choose
the readahead parameter in their transactions.

Rather, I'd try to avoid fadvising consecutive or almost-consecutive
blocks. Detecting that is hard at the block level, but maybe you can
tie that detection into the planner, and specify a sequential strategy
when the planner expects index-heap correlation?
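
As an illustration only (not part of the original message), block-level detection of consecutive or almost-consecutive accesses could be sketched as below. It remembers only the previous block number, which hints at why such detection is hard: interleaved scans on the same file would defeat it. The names block_tracker, NEAR_SEQUENTIAL_WINDOW and is_near_sequential are hypothetical, and the window size is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tolerance: a forward gap up to this size counts as "almost consecutive". */
#define NEAR_SEQUENTIAL_WINDOW 8

typedef struct
{
	bool		has_last;		/* false until the first block has been seen */
	uint32_t	last_block;		/* block number of the previous access */
} block_tracker;

/*
 * Return true if blkno is at, or shortly after, the previously seen block,
 * then remember blkno for the next call.  The first access and any
 * backward jump are reported as not sequential.
 */
bool
is_near_sequential(block_tracker *t, uint32_t blkno)
{
	bool		close_by = false;

	assert(t != NULL);
	if (t->has_last && blkno >= t->last_block)
		close_by = (blkno - t->last_block) <= NEAR_SEQUENTIAL_WINDOW;
	t->has_last = true;
	t->last_block = blkno;
	return close_by;
}
```

A caller could skip the fadvise call whenever this returns true, leaving the kernel's own read-ahead undisturbed for near-sequential runs.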

Try OLAP-style queries.

I have the DBT-3 (TPC-H) benchmark tools. If you don't like TPC-H, could you
recommend a good OLAP benchmark tool?

I don't really know. Skimming the specs, I'm not sure if those queries
generate large index range queries. You could try, maybe with
auto_explain?

#9 Peter Eisentraut
peter_e@gmx.net
In reply to: KONDO Mitsumasa (#1)
Re: Optimize kernel readahead using buffer access strategy

On 11/14/13, 7:09 AM, KONDO Mitsumasa wrote:

I have created a patch that improves disk reads and the use of the OS file cache. It
tunes the kernel readahead parameter using the buffer access strategy and
posix_fadvise() in various disk-read situations.

Various compiler warnings:

tablecmds.c: In function ‘copy_relation_data’:
tablecmds.c:9120:3: warning: passing argument 5 of ‘smgrread’ makes pointer from integer without a cast [enabled by default]
In file included from tablecmds.c:79:0:
../../../src/include/storage/smgr.h:94:13: note: expected ‘char *’ but argument is of type ‘int’

bufmgr.c: In function ‘ReadBuffer_common’:
bufmgr.c:455:4: warning: passing argument 5 of ‘smgrread’ from incompatible pointer type [enabled by default]
In file included from ../../../../src/include/storage/buf_internals.h:22:0,
from bufmgr.c:45:
../../../../src/include/storage/smgr.h:94:13: note: expected ‘char *’ but argument is of type ‘BufferAccessStrategy’

md.c: In function ‘mdread’:
md.c:680:2: warning: passing argument 2 of ‘BufferHintIOAdvise’ makes integer from pointer without a cast [enabled by default]
In file included from md.c:34:0:
../../../../src/include/storage/fd.h:79:12: note: expected ‘off_t’ but argument is of type ‘char *’
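
As an illustration only (not part of the original message), the md.c warning arises because a buffer pointer is passed where a byte offset (off_t) is expected. The sketch below shows how a byte offset within a segment file can be derived from a block number, assuming PostgreSQL's default build settings of 8192-byte blocks and 131072 blocks per 1 GB segment. The constants are redefined locally under hypothetical names, as is block_offset_in_segment.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>

/* Local stand-ins for the default BLCKSZ and RELSEG_SIZE build settings. */
#define SKETCH_BLCKSZ 8192
#define SKETCH_RELSEG_SIZE 131072

/*
 * Byte offset of a block inside its segment file.  Block numbers wrap to
 * offset 0 at each segment boundary.
 */
off_t
block_offset_in_segment(uint32_t blocknum)
{
	off_t		offset;

	offset = (off_t) SKETCH_BLCKSZ * (off_t) (blocknum % SKETCH_RELSEG_SIZE);
	assert(offset >= 0);
	return offset;
}
```

An offset computed this way, rather than the buffer address, is the kind of value an off_t parameter expects.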

#10 KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Claudio Freire (#8)
Re: Optimize kernel readahead using buffer access strategy

(2013/11/15 13:48), Claudio Freire wrote:

On Thu, Nov 14, 2013 at 11:13 PM, KONDO Mitsumasa

I used CentOS 6.4 with kernel version 2.6.32-358.23.2.el6.x86_64 in this test.

That's close to the kernel version I was using, so you should see the
same effect.

OK. You proposed a patch that maximizes readahead. I think it benefits
performance, and that part of your argument is certainly true.

Your patch switches to maximum readahead when a query uses an index range
scan. Is that right?

Ehm... sorta.

I think that your patch assumes that pages are ordered by
index data.

No. It just knows which pages will be needed, and fadvises them. No
guessing involved, except the guess that the scan will not be aborted.
There's a heuristic to stop limited scans from attempting to fadvise,
and that's that prefetch strategy is applied only from the Nth+ page
walk.

We may be able to fully optimize kernel readahead in PostgreSQL in the future,
but it is very difficult, and achieving it completely from the start would take a
long time. So I propose a GUC switch that users can set in their transactions. (I
will create this patch in this CF.) If someone wants to turn readahead off to use
the file cache more efficiently in his transactions, he can run "SET readahead = off".
PostgreSQL is open source, and I think that, as many people use it, it will become
clear in which cases this is effective.

It improves index-only scans the most, but I also attempted to handle
heap prefetches. That's where the kernel started conspiring against
me, because I used many naturally-clustered indexes, and THERE
performance was adversely affected because of that kernel bug.

I am also creating a Gaussian-distributed pgbench and will submit it to this CF. It
should clarify, at least partially, in which situations this is effective.

You may want to try your patch with more
real workloads, and maybe you'll confirm what I found out last time I
messed with posix_fadvise. If my experience is still relevant, those
patterns will have suffered a severe performance penalty with this
patch, because it will disable kernel read-ahead on sequential index
access. It may still work for sequential heap scans, because the
access strategy will tell the kernel to do read-ahead, but many other
access methods will suffer.

The decisive difference from your patch is that my patch uses the buffer hint
control architecture, so it can control readahead more intelligently in some cases.

Indeed, but it's not enough. See my above comment about naturally
clustered indexes. The planner expects that, and plans accordingly. It
will notice correlation between a PK and physical location, and will
treat an index scan over PK to be almost sequential. With your patch,
that assumption will be broken I believe.

~

However, my patch is still a work in progress and needs more improvement. I am going
to add a way to control readahead through a GUC, so that users can freely choose
the readahead parameter in their transactions.

Rather, I'd try to avoid fadvising consecutive or almost-consecutive
blocks. Detecting that is hard at the block level, but maybe you can
tie that detection into the planner, and specify a sequential strategy
when the planner expects index-heap correlation?

I think we had better develop these patches step by step, because it is difficult
to fully achieve readahead optimization from the start in a single patch. We need
a framework in these patches first.

Try OLAP-style queries.

I have the DBT-3 (TPC-H) benchmark tools. If you don't like TPC-H, could you
recommend a good OLAP benchmark tool?

I don't really know. Skimming the specs, I'm not sure if those queries
generate large index range queries. You could try, maybe with
auto_explain?

OK, I will. I will also use simple large index range queries with the EXPLAIN command.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

#11 Claudio Freire
klaussfreire@gmail.com
In reply to: KONDO Mitsumasa (#10)
Re: Optimize kernel readahead using buffer access strategy

On Sun, Nov 17, 2013 at 11:02 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

However, my patch is still a work in progress and needs more improvement. I am going
to add a way to control readahead through a GUC, so that users can freely choose
the readahead parameter in their transactions.

Rather, I'd try to avoid fadvising consecutive or almost-consecutive
blocks. Detecting that is hard at the block level, but maybe you can
tie that detection into the planner, and specify a sequential strategy
when the planner expects index-heap correlation?

I think we had better develop these patches step by step, because it is
difficult to fully achieve readahead optimization from the beginning in one
patch. We need a framework in these patches first.

Well, problem is, that without those smarts, I don't think this patch
can be enabled by default. It will considerably hurt common use cases
for postgres.

But I guess we'll have a better idea about that when we see how much
of a performance impact it makes when you run those tests, so no need
to guess in the dark.
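
The block-level detection of consecutive or almost-consecutive blocks discussed above could look roughly like the following. This is only a minimal sketch under stated assumptions, not code from the patch or from PostgreSQL: the names BlockNum, SeqDetector, SEQ_WINDOW and SEQ_THRESHOLD are hypothetical, and the window and threshold values are arbitrary. It also cannot see what the planner knows about index-heap correlation, which is why tying the decision into the planner was suggested instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t BlockNum;		/* hypothetical stand-in for a block number */

#define SEQ_WINDOW		2		/* assumed: forward gap still counted as "almost consecutive" */
#define SEQ_THRESHOLD	4		/* assumed: near-consecutive reads needed before we say "sequential" */

typedef struct SeqDetector
{
	BlockNum	last_block;		/* block number of the previous read */
	bool		has_last;		/* false until the first read is recorded */
	unsigned	run_length;		/* current run of near-consecutive reads */
} SeqDetector;

void
seq_detector_init(SeqDetector *d)
{
	assert(d != NULL);
	d->last_block = 0;
	d->has_last = false;
	d->run_length = 0;
}

/*
 * Record one block read.  Returns true while the recent reads look
 * sequential, in which case a caller might skip POSIX_FADV_RANDOM.
 */
bool
seq_detector_observe(SeqDetector *d, BlockNum blk)
{
	assert(d != NULL);
	if (d->has_last && blk > d->last_block &&
		blk - d->last_block <= SEQ_WINDOW)
		d->run_length++;		/* small forward step: extend the run */
	else
		d->run_length = 0;		/* jump or backward step: start over */

	d->last_block = blk;
	d->has_last = true;
	return d->run_length >= SEQ_THRESHOLD;
}
```

A detector like this would need one instance per scan (or per relation and backend), and interleaved scans over the same file would still confuse it, which is part of why detecting this at the block level is hard.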


#12KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Claudio Freire (#11)
Re: Optimize kernel readahead using buffer access strategy

(2013/11/18 11:25), Claudio Freire wrote:

On Sun, Nov 17, 2013 at 11:02 PM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

However, my patch is still in progress and needs more improvement. I am going
to add a way to control readahead by GUC, so that users can freely select the
readahead parameter in their transactions.

Rather, I'd try to avoid fadvising consecutive or almost-consecutive
blocks. Detecting that is hard at the block level, but maybe you can
tie that detection into the planner, and specify a sequential strategy
when the planner expects index-heap correlation?

I think we had better develop these patches step by step, because it is
difficult to fully achieve readahead optimization from the beginning in one
patch. We need a framework in these patches first.

Well, problem is, that without those smarts, I don't think this patch
can be enabled by default. It will considerably hurt common use cases
for postgres.

Yes. Like you, I think the default setting should leave this optimization off
(use normal readahead as before). The next version of my patch will do so.

But I guess we'll have a better idea about that when we see how much
of a performance impact it makes when you run those tests, so no need
to guess in the dark.

Yes, sure.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


#13KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: KONDO Mitsumasa (#7)
4 attachment(s)
Re: Optimize kernel readahead using buffer access strategy

Hi,

I revised this patch and re-ran the performance test. It works correctly on Linux
and compiles without warnings. I added a GUC, enable_kernel_readahead, in the new
version. When this GUC is on (the default), reads use POSIX_FADV_NORMAL, which is
the general readahead in the OS. When it is off, reads use POSIX_FADV_RANDOM or
POSIX_FADV_SEQUENTIAL, chosen by the buffer hint in Postgres, so the readahead
parameter is optimized by Postgres. Users can change this parameter in their
transactions anywhere and at any time.
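
The selection rule described above can be sketched in isolation as follows. This is a minimal illustration, not the patch itself: choose_advice and advise_block are hypothetical names, the strategy is modeled as an opaque pointer that is NULL when no buffer access strategy is in use, and the sketch assumes a platform that provides posix_fadvise() (for example Linux).

```c
#define _POSIX_C_SOURCE 200112L	/* exposes posix_fadvise() in <fcntl.h> */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of the patch): map the GUC value and the
 * presence of a buffer access strategy to a posix_fadvise() advice value.
 */
int
choose_advice(bool enable_kernel_readahead, const void *strategy)
{
	if (enable_kernel_readahead)
		return POSIX_FADV_NORMAL;		/* leave readahead to the kernel */
	if (strategy != NULL)
		return POSIX_FADV_SEQUENTIAL;	/* strategy in use: ask for more readahead */
	return POSIX_FADV_RANDOM;			/* no strategy: ask for no readahead */
}

/*
 * Advise the kernel about one block before reading it.  block_offset is
 * assumed to be the byte offset of the block within the file.  Returns 0
 * on success or an error number (posix_fadvise() does not set errno).
 */
int
advise_block(int fd, off_t block_offset, off_t block_size,
			 bool enable_kernel_readahead, const void *strategy)
{
	assert(block_offset >= 0 && block_size > 0);
	return posix_fadvise(fd, block_offset, block_size,
						 choose_advice(enable_kernel_readahead, strategy));
}
```

Keeping the decision in a small pure function makes the three cases easy to check without touching a file.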

* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk: 146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB
OS: RHEL 6.4(x86_64)
FS: Ext4

* Test setting
I use "pgbench -c 8 -j 4 -T 2400 -S -P 10 -a"
I also use my accurate patch in this test. So I executed the following
commands before each benchmark.
1. cluster all database
2. truncate pgbench_history
3. checkpoint
4. sync
5. checkpoint

* postgresql.conf
shared_buffers = 2048MB
maintenance_work_mem = 64MB
wal_level = minimal
checkpoint_segments = 300
checkpoint_timeout = 15min
checkpoint_completion_target = 0.7

* Performance test result
** In memory database size
scale factor=1000 | run 1 | run 2 | run 3 |   avg
------------------+-------+-------+-------+-------
readahead=on      | 39836 | 40229 | 40055 | 40040
readahead=off     | 31259 | 29656 | 30693 | 30536
ratio (off/on)    |   78% |   74% |   77% |   76%

** Over memory database size
scale factor=2000 | run 1 | run 2 | run 3 |   avg
------------------+-------+-------+-------+-------
readahead=on      |  1288 |  1370 |  1367 |  1341
readahead=off     |  1683 |  1688 |  1395 |  1589
ratio (off/on)    |  131% |  123% |  102% |  118%

scale factor=3000 | run 1 | run 2 | run 3 |   avg
------------------+-------+-------+-------+-------
readahead=on      |   965 |   862 |   993 |   940
readahead=off     |  1113 |  1098 |   935 |  1049
ratio (off/on)    |  115% |  127% |   94% |  112%
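
For reference, the ratio rows above are the readahead=off value divided by the readahead=on value. A minimal sketch of that arithmetic follows; percent_ratio is a hypothetical helper name, and rounding to the nearest whole percent is an assumption that happens to reproduce the figures in the tables.

```c
#include <assert.h>

/*
 * Ratio of the readahead=off result to the readahead=on result, as an
 * integer percentage rounded to the nearest whole percent.
 */
long
percent_ratio(long off_value, long on_value)
{
	assert(off_value >= 0 && on_value > 0);
	return (off_value * 100 + on_value / 2) / on_value;
}
```

For example, percent_ratio(1683, 1288) gives 131 and percent_ratio(935, 993) gives 94. A value above 100 means readahead=off was faster in that run.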

Performance looks good except at scale factor 1000. When the readahead
parameter is off, disk I/O is kept to the necessary minimum, so it is faster than
"readahead=on", which does useless disk I/O. For example, over many
transactions, which is faster: 8KB random reads or 12KB random reads from disk?
It is self-evident that the former is faster.

At scale factor 1000, the buffers become hot more slowly than with
"readahead=on", so performance appears lower. But that is inherent in how the
performance is measured, and you can confirm it in the attached benchmark
graphs. We can use this parameter when the buffer is relatively hot. If you want
to see graphs of the other trials, I will send them.

I will also add MacOS support and write documentation for this patch this week.
# My MacOS machine is at home.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachments:

optimizing_kernel-readahead_using_buffer-access-strategy_v3.patchtext/x-diff; name=optimizing_kernel-readahead_using_buffer-access-strategy_v3.patchDownload
*** a/configure
--- b/configure
***************
*** 19937,19943 **** LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
  
  
  
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do
  as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
  { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5
--- 19937,19943 ----
  
  
  
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do
  as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
  { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 9119,9125 **** copy_relation_data(SMgrRelation src, SMgrRelation dst,
  		/* If we got a cancel signal during the copy of the data, quit */
  		CHECK_FOR_INTERRUPTS();
  
! 		smgrread(src, forkNum, blkno, buf);
  
  		if (!PageIsVerified(page, blkno))
  			ereport(ERROR,
--- 9119,9125 ----
  		/* If we got a cancel signal during the copy of the data, quit */
  		CHECK_FOR_INTERRUPTS();
  
! 		smgrread(src, forkNum, blkno, buf, (char *) BAS_BULKREAD);
  
  		if (!PageIsVerified(page, blkno))
  			ereport(ERROR,
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 41,46 ****
--- 41,47 ----
  #include "pg_trace.h"
  #include "pgstat.h"
  #include "postmaster/bgwriter.h"
+ #include "storage/buf.h"
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"
  #include "storage/ipc.h"
***************
*** 451,457 **** ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  			if (track_io_timing)
  				INSTR_TIME_SET_CURRENT(io_start);
  
! 			smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
  
  			if (track_io_timing)
  			{
--- 452,458 ----
  			if (track_io_timing)
  				INSTR_TIME_SET_CURRENT(io_start);
  
! 			smgrread(smgr, forkNum, blockNum, (char *) bufBlock, (char *) strategy);
  
  			if (track_io_timing)
  			{
*** a/src/backend/storage/file/fd.c
--- b/src/backend/storage/file/fd.c
***************
*** 73,80 ****
--- 73,82 ----
  #include "catalog/pg_tablespace.h"
  #include "common/relpath.h"
  #include "pgstat.h"
+ #include "storage/buf.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
+ #include "storage/bufmgr.h"
  #include "utils/guc.h"
  #include "utils/resowner_private.h"
  
***************
*** 123,129 **** int			max_files_per_process = 1000;
   * setting this variable, and so need not be tested separately.
   */
  int			max_safe_fds = 32;	/* default if not changed */
! 
  
  /* Debugging.... */
  
--- 125,131 ----
   * setting this variable, and so need not be tested separately.
   */
  int			max_safe_fds = 32;	/* default if not changed */
! bool			enable_kernel_readahead = true ;
  
  /* Debugging.... */
  
***************
*** 383,388 **** pg_flush_data(int fd, off_t offset, off_t amount)
--- 385,405 ----
  	return 0;
  }
  
+ /*
+  * pg_fadvise --- advise the OS whether cached data will be needed or not
+  *
+  * Not all platforms have posix_fadvise. If the platform does not support
+  * posix_fadvise, we do nothing here.
+  */
+ int
+ pg_fadvise(int fd, off_t offset, off_t amount, int advise)
+ {
+ #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) && defined(POSIX_FADV_RANDOM) && defined(POSIX_FADV_SEQUENTIAL)
+ 	return posix_fadvise(fd, offset, amount, advise);
+ #else
+ 	return 0;
+ #endif
+ }
  
  /*
   * fsync_fname -- fsync a file or directory, handling errors properly
***************
*** 1142,1147 **** OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
--- 1159,1195 ----
  }
  
  /*
+  * Control the OS file cache using posix_fadvise()
+  */
+ int
+ FileCacheAdvise(File file, off_t offset, off_t amount, int advise)
+ {
+ 	return pg_fadvise(VfdCache[file].fd, offset, amount, advise);
+ }
+ 
+ /*
+  * Select the OS readahead strategy using the buffer hint. If we select
+  * POSIX_FADV_SEQUENTIAL, the readahead parameter becomes the maximum and reads are faster.
+  * On the other hand, if we select POSIX_FADV_RANDOM, readahead is not executed at all
+  * and the file cache replacement algorithm becomes smarter, because it can count
+  * accesses to hot data correctly.
+  */
+ int
+ BufferHintIOAdvise(File file, char *offset, off_t amount, char *strategy)
+ {
+ 	if(enable_kernel_readahead)
+ 		return FileCacheAdvise(file, (off_t) offset, amount, POSIX_FADV_NORMAL);
+ 
+ 	/* readahead optimization */
+ 	if(strategy != NULL)
+ 		/* use maximum readahead setting in kernel, we can read more faster */
+ 		return FileCacheAdvise(file, (off_t) offset, amount, POSIX_FADV_SEQUENTIAL);
+ 	else
+ 		/* don't use readahead in kernel, so we can more effectively use OS file cache */
+ 		return FileCacheAdvise(file, (off_t) offset, amount, POSIX_FADV_RANDOM);
+ }
+ 
+ /*
   * close a file when done with it
   */
  void
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 162,168 **** static List *pendingUnlinks = NIL;
  static CycleCtr mdsync_cycle_ctr = 0;
  static CycleCtr mdckpt_cycle_ctr = 0;
  
- 
  typedef enum					/* behavior for mdopen & _mdfd_getseg */
  {
  	EXTENSION_FAIL,				/* ereport if segment not present */
--- 162,167 ----
***************
*** 653,659 **** mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
   */
  void
  mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer)
  {
  	off_t		seekpos;
  	int			nbytes;
--- 652,658 ----
   */
  void
  mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer, char *strategy)
  {
  	off_t		seekpos;
  	int			nbytes;
***************
*** 677,682 **** mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
--- 676,683 ----
  				 errmsg("could not seek to block %u in file \"%s\": %m",
  						blocknum, FilePathName(v->mdfd_vfd))));
  
+ 	/* Control buffered IO in OS by using posix_fadvise() */
+ 	BufferHintIOAdvise(v->mdfd_vfd, buffer, BLCKSZ, strategy);
  	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
  
  	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
*** a/src/backend/storage/smgr/smgr.c
--- b/src/backend/storage/smgr/smgr.c
***************
*** 50,56 **** typedef struct f_smgr
  	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
  											  BlockNumber blocknum);
  	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
! 										  BlockNumber blocknum, char *buffer);
  	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
  						 BlockNumber blocknum, char *buffer, bool skipFsync);
  	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
--- 50,56 ----
  	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
  											  BlockNumber blocknum);
  	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
! 					  BlockNumber blocknum, char *buffer, char *strategy);
  	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
  						 BlockNumber blocknum, char *buffer, bool skipFsync);
  	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
***************
*** 588,596 **** smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
   */
  void
  smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 		 char *buffer)
  {
! 	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer);
  }
  
  /*
--- 588,596 ----
   */
  void
  smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 		 char *buffer, char *strategy)
  {
! 	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer, strategy);
  }
  
  /*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 762,767 **** static struct config_bool ConfigureNamesBool[] =
--- 762,776 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"enable_kernel_readahead", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("On is optimize readahead by kernel, off is optimized by postgres."),
+ 			NULL
+ 		},
+ 		&enable_kernel_readahead,
+ 		true,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
  			gettext_noop("Enables genetic query optimization."),
  			gettext_noop("This algorithm attempts to do planning without "
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 135,140 ****
--- 135,142 ----
  
  #temp_file_limit = -1			# limits per-session temp file space
  					# in kB, or -1 for no limit
+ #enable_kernel_readahead = on		# on is optimized by OS,
+ 					# off is optimized by postgres
  
  # - Kernel Resource Usage -
  
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 44,55 **** typedef enum
--- 44,58 ----
  /* in globals.c ... this duplicates miscadmin.h */
  extern PGDLLIMPORT int NBuffers;
  
+ 
+ 
  /* in bufmgr.c */
  extern bool zero_damaged_pages;
  extern int	bgwriter_lru_maxpages;
  extern double bgwriter_lru_multiplier;
  extern bool track_io_timing;
  extern int	target_prefetch_pages;
+ extern bool	enable_kernel_readahead;
  
  /* in buf_init.c */
  extern PGDLLIMPORT char *BufferBlocks;
*** a/src/include/storage/fd.h
--- b/src/include/storage/fd.h
***************
*** 68,73 **** extern int	max_safe_fds;
--- 68,74 ----
  extern File PathNameOpenFile(FileName fileName, int fileFlags, int fileMode);
  extern File OpenTemporaryFile(bool interXact);
  extern void FileClose(File file);
+ extern int	FileCacheAdvise(File file, off_t offset, off_t amount, int advise);
  extern int	FilePrefetch(File file, off_t offset, int amount);
  extern int	FileRead(File file, char *buffer, int amount);
  extern int	FileWrite(File file, char *buffer, int amount);
***************
*** 75,80 **** extern int	FileSync(File file);
--- 76,82 ----
  extern off_t FileSeek(File file, off_t offset, int whence);
  extern int	FileTruncate(File file, off_t offset);
  extern char *FilePathName(File file);
+ extern int	BufferHintIOAdvise(File file, char *offset, off_t amount, char *strategy);
  
  /* Operations that allow use of regular stdio --- USE WITH CAUTION */
  extern FILE *AllocateFile(const char *name, const char *mode);
***************
*** 113,118 **** extern int	pg_fsync_no_writethrough(int fd);
--- 115,121 ----
  extern int	pg_fsync_writethrough(int fd);
  extern int	pg_fdatasync(int fd);
  extern int	pg_flush_data(int fd, off_t offset, off_t amount);
+ extern int	pg_fadvise(int fd, off_t offset, off_t amount, int advise);
  extern void fsync_fname(char *fname, bool isdir);
  
  /* Filename components for OpenTemporaryFile */
*** a/src/include/storage/smgr.h
--- b/src/include/storage/smgr.h
***************
*** 92,98 **** extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
  extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
  			 BlockNumber blocknum);
  extern void smgrread(SMgrRelation reln, ForkNumber forknum,
! 		 BlockNumber blocknum, char *buffer);
  extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
  		  BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
--- 92,98 ----
  extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
  			 BlockNumber blocknum);
  extern void smgrread(SMgrRelation reln, ForkNumber forknum,
! 			BlockNumber blocknum, char *buffer, char *strategy);
  extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
  		  BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
***************
*** 118,124 **** extern void mdextend(SMgrRelation reln, ForkNumber forknum,
  extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
  		   BlockNumber blocknum);
  extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer);
  extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
  		BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
--- 118,124 ----
  extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
  		   BlockNumber blocknum);
  extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer, char *strategy);
  extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
  		BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
*** a/src/test/regress/expected/rangefuncs.out
--- b/src/test/regress/expected/rangefuncs.out
***************
*** 1,18 ****
  SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
!          name         | setting 
! ----------------------+---------
!  enable_bitmapscan    | on
!  enable_hashagg       | on
!  enable_hashjoin      | on
!  enable_indexonlyscan | on
!  enable_indexscan     | on
!  enable_material      | on
!  enable_mergejoin     | on
!  enable_nestloop      | on
!  enable_seqscan       | on
!  enable_sort          | on
!  enable_tidscan       | on
! (11 rows)
  
  CREATE TABLE foo2(fooid int, f2 int);
  INSERT INTO foo2 VALUES(1, 11);
--- 1,19 ----
  SELECT name, setting FROM pg_settings WHERE name LIKE 'enable%';
!           name           | setting 
! -------------------------+---------
!  enable_bitmapscan       | on
!  enable_hashagg          | on
!  enable_hashjoin         | on
!  enable_indexonlyscan    | on
!  enable_indexscan        | on
!  enable_kernel_readahead | on
!  enable_material         | on
!  enable_mergejoin        | on
!  enable_nestloop         | on
!  enable_seqscan          | on
!  enable_sort             | on
!  enable_tidscan          | on
! (12 rows)
  
  CREATE TABLE foo2(fooid int, f2 int);
  INSERT INTO foo2 VALUES(1, 11);
s=1000-try1.pngimage/png; name="s=1000-try1.png"Download
s=2000-try1.pngimage/png; name="s=2000-try1.png"Download
s=3000-try1.pngimage/png; name="s=3000-try1.png"Download
1������� M���q���j�;6����c�T��A�E��_b��������� D	�KT�KaJ!���x�@E&������0�����a"������oaL
��rQ�v(>�).@��u1�F	�(fY
�Z��C	�q(>��	������y@�$������8��P|J��C	�q(>%@
0�=�r�f1U��+�|������%^J����$@�E��K��(g�J�$�/����(��[[���p�]�������`M�H��������c�%�����7��������	P���=�-���R�`#�xKC�^
�'���t�����"��0�$	@Z)�K_��������Z�"B
f,�!a|�H,I�� ~{l��2T��3��%@I����?�@`u�I����/��w���J�%������K
O]E��?���u$L��������P|J��C	P	��������yrwr�8�$���K��.���|M7��r5��U�9����%���A"_�G������z{/N�O.!��X��������p��o��D
/,�%dJ�uA~�=ob��;�S���3&T��Z�:��Xz}1���^��oOj�����������,J��?��S��Y@-��r��M��E\k�	]����w�R��WR��
S��8Imr�&�Q	L�6�X���u���&��,f�"�8mr�&1a1.�����U!D�P���=�P_�R�J���9�M�P(
�B�P|�9�������$�O��z���I4)�y&E�UJc�
��-�2�oR�I�g�55)��g�����)P�-��O�����^I��3'K���'%��-������'���jR��L���>���L�s&������nR�)��d
������28d�
�*���S�D����	 ��/�-�dI�\���N�4h���6�P(
�B�P(
�B�P(OD�,�/�j��2yZ�%-+^�^%+^~��B����+��h��w���9*/O�P�I�����>�J	�px7@�3���5���x	��?t�o�=�B�P(
�B1�(�ro<�!�IEND�B`�
#14Claudio Freire
klaussfreire@gmail.com
In reply to: KONDO Mitsumasa (#13)
Re: Optimize kernel readahead using buffer access strategy

On Tue, Dec 10, 2013 at 5:03 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

I revised this patch and re-ran the performance test; it works correctly on
Linux and compiles without warnings. I added a GUC option,
enable_kernel_readahead, in the new version. When this GUC is on (the default),
reads use POSIX_FADV_NORMAL, which is the general readahead in the OS. When it
is off, reads use POSIX_FADV_RANDOM or POSIX_FADV_SEQUENTIAL, chosen from the
buffer hint in Postgres, so the readahead parameter is optimized by Postgres.
Users can change this parameter in their own transactions at any time.

I'd change the naming to

enable_readahead=os|fadvise

with os = on, fadvise = off

And, if you want to keep the on/off values, I'd reverse them. Because
off reads more like "I don't do anything special", and in your patch
it's quite the opposite.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Claudio Freire (#14)
Re: Optimize kernel readahead using buffer access strategy

(2013/12/10 22:55), Claudio Freire wrote:

On Tue, Dec 10, 2013 at 5:03 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

I revised this patch and re-ran the performance test; it works correctly on
Linux and compiles without warnings. I added a GUC option,
enable_kernel_readahead, in the new version. When this GUC is on (the default),
reads use POSIX_FADV_NORMAL, which is the general readahead in the OS. When it
is off, reads use POSIX_FADV_RANDOM or POSIX_FADV_SEQUENTIAL, chosen from the
buffer hint in Postgres, so the readahead parameter is optimized by Postgres.
Users can change this parameter in their own transactions at any time.

I'd change the naming to

OK. I also think the "on"/"off" naming is not good.

enable_readahead=os|fadvise

with os = on, fadvise = off

Hmm. fadvise is a method, not a purpose, so I considered other ideas for this GUC.

1) readahead_strategy=os|pg
This naming leaves room for future implementations. If we later want a setting
that always uses POSIX_FADV_SEQUENTIAL to get the maximum readahead, we can
add "max".

2) readahead_optimizer=os|pg or readahead_strategist=os|pg
This naming makes it easy to understand who optimizes readahead,
but it is not extensible to future implementations.

And, if you want to keep the on/off values, I'd reverse them. Because
off reads more like "I don't do anything special", and in your patch
it's quite the opposite.

I understand your feeling. If we adopt an "on|off" setting, I would like to name
the GUC optimized_readahead=off|on.
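
A minimal sketch of how idea 1) could map option text to a value while leaving room for later modes such as "max". The option strings come from the message above; the type name ReadaheadStrategy and the function parse_readahead_strategy are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical values for a readahead_strategy setting (names from idea 1). */
typedef enum
{
	READAHEAD_OS,				/* leave readahead to the kernel */
	READAHEAD_PG,				/* let Postgres choose the hint per read */
	READAHEAD_MAX				/* always ask for sequential readahead */
} ReadaheadStrategy;

/* Table-driven mapping, so adding a mode later is a one-row change. */
static const struct
{
	const char *name;
	ReadaheadStrategy value;
}			readahead_options[] = {
	{"os", READAHEAD_OS},
	{"pg", READAHEAD_PG},
	{"max", READAHEAD_MAX},
};

/*
 * Parse an option string. Returns 1 and stores the value in *result on
 * success, or 0 if the string is not a known option.
 */
int
parse_readahead_strategy(const char *text, ReadaheadStrategy *result)
{
	size_t		i;
	size_t		n = sizeof(readahead_options) / sizeof(readahead_options[0]);

	assert(text != NULL && result != NULL);
	for (i = 0; i < n; i++)
	{
		if (strcmp(text, readahead_options[i].name) == 0)
		{
			*result = readahead_options[i].value;
			return 1;
		}
	}
	return 0;
}
```

With this shape, "on" and "off" are simply rejected, which avoids the question of which of the two means "do nothing special".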

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


#16Claudio Freire
klaussfreire@gmail.com
In reply to: KONDO Mitsumasa (#15)
Re: Optimize kernel readahead using buffer access strategy

On Wed, Dec 11, 2013 at 3:14 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

enable_readahead=os|fadvise

with os = on, fadvise = off

Hmm. fadvise is a method, not a purpose, so I considered other ideas for
this GUC.

Yeah, I was thinking of opening the door for readahead=aio, but
whatever clearer than on-off would work ;)


#17KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Claudio Freire (#16)
Re: Optimize kernel readahead using buffer access strategy

(2013/12/12 9:30), Claudio Freire wrote:

On Wed, Dec 11, 2013 at 3:14 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

enable_readahead=os|fadvise

with os = on, fadvise = off

Hmm. fadvise is a method, not a purpose, so I considered other ideas for
this GUC.

Yeah, I was thinking of opening the door for readahead=aio, but
whatever clearer than on-off would work ;)

I'm very interested in Postgres with libaio, and I'd like to see the performance
improvements. I don't know libaio well, but I expect asynchronous IO will face
the exclusive buffer lock problem.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


#18Simon Riggs
simon@2ndQuadrant.com
In reply to: KONDO Mitsumasa (#1)
Re: Optimize kernel readahead using buffer access strategy

On 14 November 2013 12:09, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

To show the effect of this patch, I ran pgbench against an in-memory-size
database and a larger-than-memory database, using the postgresql.conf settings
we always use. It seems to improve performance. I also think this feature will
be necessary for business intelligence, which will be realized in PostgreSQL
version 10. I seriously believe Simon's presentation at PostgreSQL Conference
Europe 2013! It was very exciting!!!

Thank you.

I like the sound of this patch, sorry I've not been able to help as yet.

Your tests seem to relate to pgbench. Do we have tests on more BI related tasks?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#19Mitsumasa KONDO
kondo.mitsumasa@gmail.com
In reply to: Simon Riggs (#18)
Re: Optimize kernel readahead using buffer access strategy

2013/12/12 Simon Riggs <simon@2ndquadrant.com>

On 14 November 2013 12:09, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

To show the effect of this patch, I ran pgbench against an in-memory-size
database and a larger-than-memory database, using the postgresql.conf settings
we always use. It seems to improve performance. I also think this feature will
be necessary for business intelligence, which will be realized in PostgreSQL
version 10. I seriously believe Simon's presentation at PostgreSQL Conference
Europe 2013! It was very exciting!!!

Thank you.

I like the sound of this patch, sorry I've not been able to help as yet.

Your tests seem to relate to pgbench. Do we have tests on more BI related
tasks?

Yes, of course! We will need other benchmark tests before reaching a conclusion
on this patch.
What kind of benchmark do you have in mind?

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

#20Simon Riggs
simon@2ndQuadrant.com
In reply to: Mitsumasa KONDO (#19)
Re: Optimize kernel readahead using buffer access strategy

On 12 December 2013 13:43, Mitsumasa KONDO <kondo.mitsumasa@gmail.com> wrote:

Your tests seem to relate to pgbench. Do we have tests on more BI related
tasks?

Yes, of course! We will need other benchmark tests before reaching a conclusion
on this patch.
What kind of benchmark do you have in mind?

I suggest isolating SeqScan and IndexScan and BitmapIndex/HeapScan
examples, as well as some of the simpler TPC-H queries.

But start with some SeqScan and VACUUM cases.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#21KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: KONDO Mitsumasa (#17)
2 attachment(s)
Re: Optimize kernel readahead using buffer access strategy

Hi,

I fixed the patch to improve the following:

- It can now be compiled on MacOS.
- The GUC name is changed from enable_kernel_readahead to readahead_strategy.
- POSIX_FADV_SEQUENTIAL is changed to POSIX_FADV_NORMAL when the sequential
access strategy is selected; the reason is explained later.

I tested the following two simple access patterns on pgbench tables with scale
factor 1000.

A) SELECT count(aid) FROM pgbench_accounts; (Index only scan)
B) SELECT count(bid) FROM pgbench_accounts; (Seq scan)

I restarted postgres and dropped the file cache before each test.
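
The message does not say how the file cache was dropped. As one possibility on Linux, the cached pages of a single file can be released by asking the kernel with posix_fadvise() and POSIX_FADV_DONTNEED. This is only a sketch under that assumption; drop_file_cache is a hypothetical helper and is not part of the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Ask the kernel to drop the cached pages of one file.
 * Returns 0 on success, -1 if the file cannot be opened, or the positive
 * error number returned by posix_fadvise().
 */
int
drop_file_cache(const char *path)
{
	int			fd;
	int			rc;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Dirty pages cannot be dropped, so try to flush them first. */
	(void) fsync(fd);

	/* offset 0 with length 0 means "from the start to the end of the file". */
	rc = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	close(fd);
	return rc;
}
```

The call is only a request, so a benchmark that depends on a cold cache should still verify the effect, for example by comparing the first and second read times.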

Unpatched PG was about 1.3 times faster than patched PG for both query A and
query B. The result for query A is as expected, because patched PG does not
execute readahead at all, so a cold cache is bad for patched PG. However, it
might be good when the cache is hot, because then no disk IO is needed and the
OS can count file cache accesses correctly and know which cached data is
important.

However, the result for query B was unexpected: my patch selects
POSIX_FADV_SEQUENTIAL correctly, but it is slow. I cannot understand why, even
though I read the kernel source code... Next, I changed POSIX_FADV_SEQUENTIAL
to POSIX_FADV_NORMAL in my patch, and query B became as fast as on unpatched PG.
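
To compare the hints outside Postgres, a small reader like the following could be timed on a cold cache with each advice value. It is a sketch, not the test that produced the numbers above; read_with_advice and READ_CHUNK are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#define READ_CHUNK 8192			/* same size as a default Postgres block */

/*
 * Read a whole file front to back after giving the kernel one readahead hint
 * (POSIX_FADV_NORMAL, POSIX_FADV_SEQUENTIAL or POSIX_FADV_RANDOM).
 * Returns the number of bytes read, or -1 on error.
 */
long
read_with_advice(const char *path, int advice)
{
	char		buf[READ_CHUNK];
	long		total = 0;
	ssize_t		n;
	int			fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	/* Length 0 applies the hint up to the end of the file. */
	if (posix_fadvise(fd, 0, 0, advice) != 0)
	{
		close(fd);
		return -1;
	}

	while ((n = read(fd, buf, sizeof(buf))) > 0)
		total += n;

	close(fd);
	return (n < 0) ? -1 : total;
}
```

The Linux posix_fadvise(2) manual page describes POSIX_FADV_NORMAL as the default readahead window, POSIX_FADV_SEQUENTIAL as doubling it, and POSIX_FADV_RANDOM as disabling readahead, with the change affecting the whole open file rather than only the given range. That may matter when a hint is issued before every block read.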

In heavily random-access benchmark tests, pgbench and DBT-2, my patched PG is
about 1.1 - 1.3 times faster than unpatched PG. But the postgres buffer hint
strategy algorithm has not been optimized for readahead strategy yet, and I have
not fixed that. It is still designed only for the ring buffer algorithm in
shared_buffers.

The attached printf-debug patch shows the buffer strategy inside postgres. When
you see "S", it selected the sequential access strategy; on the other hand, when
you see "R", it selected the random access strategy. You might find it
interesting. It's very visual.

Example output is here.

[mitsu-ko@localhost postgresql]$ bin/vacuumdb
SSSSSSSSSSSSSSSSSSSSSSSSSSS~~SSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
[mitsu-ko@localhost postgresql]$ bin/psql -c "EXPLAIN ANALYZE SELECT count(aid) FROM pgbench_accounts"
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2854.29..2854.30 rows=1 width=4) (actual time=33.438..33.438 rows=1 loops=1)
-> Index Only Scan using pgbench_accounts_pkey on pgbench_accounts (cost=0.29..2604.29 rows=100000 width=4) (actual time=0.072..20.912 rows=100000 loops=1)
Heap Fetches: 0
Total runtime: 33.552 ms
(4 rows)

RRRRRRRRRRRRRRRRRRRRRRRRRRR~~RRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
[mitsu-ko@localhost postgresql]$ bin/psql -c "EXPLAIN ANALYZE SELECT count(bid) FROM pgbench_accounts"
SSSSSSSSSSSSSSSSSSSSSSSSSSS~~SSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2890.00..2890.01 rows=1 width=4) (actual time=40.315..40.315 rows=1 loops=1)
-> Seq Scan on pgbench_accounts (cost=0.00..2640.00 rows=100000 width=4) (actual time=0.112..23.001 rows=100000 loops=1)
Total runtime: 40.472 ms
(3 rows)
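
The S/R trace above follows the decision made in the v4 patch: a read with a non-NULL buffer access strategy takes the sequential path, and a read with no strategy takes the random path. A standalone sketch of that decision is below. The function names and the pg_decides flag are hypothetical, and the way the debug patch actually prints its letters is not shown in this excerpt, so the fputc call is an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Pick the posix_fadvise() hint for one block read.
 * pg_decides: stand-in for the readahead_strategy setting being on.
 * strategy:   non-NULL when the caller passed a buffer access strategy.
 */
int
choose_advice(bool pg_decides, const void *strategy)
{
	if (!pg_decides)
		return POSIX_FADV_NORMAL;	/* kernel heuristics stay in charge */
	if (strategy != NULL)
		return POSIX_FADV_NORMAL;	/* sequential path; v4 uses NORMAL here */
	return POSIX_FADV_RANDOM;	/* random path: no kernel readahead */
}

/* Trace letter: 'S' for the sequential path, 'R' for the random path. */
char
trace_letter(const void *strategy)
{
	return (strategy != NULL) ? 'S' : 'R';
}

/* Emit one letter per read on stderr (assumed output channel). */
void
trace_read(const void *strategy)
{
	char		letter = trace_letter(strategy);

	assert(letter == 'S' || letter == 'R');
	fputc(letter, stderr);
}
```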

That's all for now.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachments:

optimizing_kernel-readahead_using_buffer-access-strategy_v4.patchtext/x-diff; name=optimizing_kernel-readahead_using_buffer-access-strategy_v4.patchDownload
*** a/configure
--- b/configure
***************
*** 19937,19943 **** LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
  
  
  
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do
  as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
  { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5
--- 19937,19943 ----
  
  
  
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do
  as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
  { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1252,1257 **** include 'filename'
--- 1252,1281 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-readahead-strategy" xreflabel="readahead_strategy">
+       <term><varname>readahead_strategy</varname> (<type>boolean</type>)</term>
+       <indexterm>
+        <primary><varname>readahead_strategy</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Selects which component chooses the readahead strategy. When set to
+         off (the default), the readahead strategy is chosen by the OS. When
+         set to on, the readahead strategy is chosen by Postgres.
+         In typical situations the OS readahead strategy works well;
+         however, Postgres often knows the better readahead strategy before
+         executing a disk access. For example, the access pattern is easy to
+         predict once an SQL statement has been planned, because the planner
+         decides which access pattern reads the data fastest, and that is
+         either a random or a sequential access pattern. Using this knowledge
+         can reduce disk IO and make more efficient use of the OS file cache,
+         which improves performance. However, this optimization is not yet
+         complete, so enable it only after considering the workload. The
+         default is off, and the setting can be changed at any time.
+        </para>
+       </listitem>
+      </varlistentry> 
+ 
       </variablelist>
       </sect2>
  
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 9119,9125 **** copy_relation_data(SMgrRelation src, SMgrRelation dst,
  		/* If we got a cancel signal during the copy of the data, quit */
  		CHECK_FOR_INTERRUPTS();
  
! 		smgrread(src, forkNum, blkno, buf);
  
  		if (!PageIsVerified(page, blkno))
  			ereport(ERROR,
--- 9119,9125 ----
  		/* If we got a cancel signal during the copy of the data, quit */
  		CHECK_FOR_INTERRUPTS();
  
! 		smgrread(src, forkNum, blkno, buf, (char *) BAS_BULKREAD);
  
  		if (!PageIsVerified(page, blkno))
  			ereport(ERROR,
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 41,46 ****
--- 41,47 ----
  #include "pg_trace.h"
  #include "pgstat.h"
  #include "postmaster/bgwriter.h"
+ #include "storage/buf.h"
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"
  #include "storage/ipc.h"
***************
*** 451,457 **** ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  			if (track_io_timing)
  				INSTR_TIME_SET_CURRENT(io_start);
  
! 			smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
  
  			if (track_io_timing)
  			{
--- 452,458 ----
  			if (track_io_timing)
  				INSTR_TIME_SET_CURRENT(io_start);
  
! 			smgrread(smgr, forkNum, blockNum, (char *) bufBlock, (char *) strategy);
  
  			if (track_io_timing)
  			{
*** a/src/backend/storage/file/fd.c
--- b/src/backend/storage/file/fd.c
***************
*** 73,80 ****
--- 73,82 ----
  #include "catalog/pg_tablespace.h"
  #include "common/relpath.h"
  #include "pgstat.h"
+ #include "storage/buf.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
+ #include "storage/bufmgr.h"
  #include "utils/guc.h"
  #include "utils/resowner_private.h"
  
***************
*** 123,129 **** int			max_files_per_process = 1000;
   * setting this variable, and so need not be tested separately.
   */
  int			max_safe_fds = 32;	/* default if not changed */
! 
  
  /* Debugging.... */
  
--- 125,131 ----
   * setting this variable, and so need not be tested separately.
   */
  int			max_safe_fds = 32;	/* default if not changed */
! bool			readahead_strategy = false ;
  
  /* Debugging.... */
  
***************
*** 383,388 **** pg_flush_data(int fd, off_t offset, off_t amount)
--- 385,405 ----
  	return 0;
  }
  
+ /*
+  * pg_fadvise --- give the OS a hint about how file data will be accessed
+  *
+  * Not all platforms have posix_fadvise. On platforms that do not support
+  * it, we do nothing here.
+  */
+ int
+ pg_fadvise(int fd, off_t offset, off_t amount, int advise)
+ {
+ #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) && defined(POSIX_FADV_RANDOM) && defined(POSIX_FADV_SEQUENTIAL)
+ 	return posix_fadvise(fd, offset, amount, advise);
+ #else
+ 	return 0;
+ #endif
+ }
  
  /*
   * fsync_fname -- fsync a file or directory, handling errors properly
***************
*** 1142,1147 **** OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
--- 1159,1201 ----
  }
  
  /*
+  * Control the OS file cache using posix_fadvise()
+  */
+ int
+ FileCacheAdvise(File file, off_t offset, off_t amount, int advise)
+ {
+ 	return pg_fadvise(VfdCache[file].fd, offset, amount, advise);
+ }
+ 
+ /*
+  * Select the OS readahead strategy using the buffer hint in postgres. If we select
+  * POSIX_FADV_RANDOM, readahead isn't executed at all and the OS file cache replacement
+  * algorithm works better, because it can count accesses to hot data correctly.
+  */
+ int
+ BufferHintIOAdvise(File file, char *offset, off_t amount, char *strategy)
+ {
+ #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) && defined(POSIX_FADV_RANDOM)
+ 	if(!readahead_strategy)
+ 		return FileCacheAdvise(file, (off_t) offset, amount, POSIX_FADV_NORMAL);
+ 
+ 	/* readahead optimization */
+ 	if(strategy != NULL)
+ 	{
+ 		/* use normal readahead setting, we confirmed POSIX_FADV_NORMAL is faster than FADV_SEQUENTIAL in linux. */
+ 		return FileCacheAdvise(file, (off_t) offset, amount, POSIX_FADV_NORMAL);
+ 	}
+ 	else
+ 	{
+ 		/* don't use readahead in kernel, so we can more effectively use OS file cache */
+ 		return FileCacheAdvise(file, (off_t) offset, amount, POSIX_FADV_RANDOM);
+ 	}
+ #else
+ 	return 0;
+ #endif
+ }
+ 
+ /*
   * close a file when done with it
   */
  void
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 162,168 **** static List *pendingUnlinks = NIL;
  static CycleCtr mdsync_cycle_ctr = 0;
  static CycleCtr mdckpt_cycle_ctr = 0;
  
- 
  typedef enum					/* behavior for mdopen & _mdfd_getseg */
  {
  	EXTENSION_FAIL,				/* ereport if segment not present */
--- 162,167 ----
***************
*** 653,659 **** mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
   */
  void
  mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer)
  {
  	off_t		seekpos;
  	int			nbytes;
--- 652,658 ----
   */
  void
  mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer, char *strategy)
  {
  	off_t		seekpos;
  	int			nbytes;
***************
*** 677,682 **** mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
--- 676,683 ----
  				 errmsg("could not seek to block %u in file \"%s\": %m",
  						blocknum, FilePathName(v->mdfd_vfd))));
  
+ 	/* Control buffered IO in OS by using posix_fadvise() */
+ 	BufferHintIOAdvise(v->mdfd_vfd, buffer, BLCKSZ, strategy);
  	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
  
  	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
*** a/src/backend/storage/smgr/smgr.c
--- b/src/backend/storage/smgr/smgr.c
***************
*** 50,56 **** typedef struct f_smgr
  	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
  											  BlockNumber blocknum);
  	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
! 										  BlockNumber blocknum, char *buffer);
  	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
  						 BlockNumber blocknum, char *buffer, bool skipFsync);
  	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
--- 50,56 ----
  	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
  											  BlockNumber blocknum);
  	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
! 					  BlockNumber blocknum, char *buffer, char *strategy);
  	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
  						 BlockNumber blocknum, char *buffer, bool skipFsync);
  	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
***************
*** 588,596 **** smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
   */
  void
  smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 		 char *buffer)
  {
! 	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer);
  }
  
  /*
--- 588,596 ----
   */
  void
  smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 		 char *buffer, char *strategy)
  {
! 	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer, strategy);
  }
  
  /*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 762,767 **** static struct config_bool ConfigureNamesBool[] =
--- 762,776 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"readahead_strategy", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("off is optimized readahead by kernel, on is optimized by postgres."),
+ 			NULL
+ 		},
+ 		&readahead_strategy,
+ 		false,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
  			gettext_noop("Enables genetic query optimization."),
  			gettext_noop("This algorithm attempts to do planning without "
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 135,140 ****
--- 135,142 ----
  
  #temp_file_limit = -1			# limits per-session temp file space
  					# in kB, or -1 for no limit
+ #readahead_strategy = off		# off is optimized by OS,
+ 					# on is optimized by postgres
  
  # - Kernel Resource Usage -
  
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 44,55 **** typedef enum
--- 44,58 ----
  /* in globals.c ... this duplicates miscadmin.h */
  extern PGDLLIMPORT int NBuffers;
  
+ 
+ 
  /* in bufmgr.c */
  extern bool zero_damaged_pages;
  extern int	bgwriter_lru_maxpages;
  extern double bgwriter_lru_multiplier;
  extern bool track_io_timing;
  extern int	target_prefetch_pages;
+ extern bool	readahead_strategy;
  
  /* in buf_init.c */
  extern PGDLLIMPORT char *BufferBlocks;
*** a/src/include/storage/fd.h
--- b/src/include/storage/fd.h
***************
*** 68,73 **** extern int	max_safe_fds;
--- 68,74 ----
  extern File PathNameOpenFile(FileName fileName, int fileFlags, int fileMode);
  extern File OpenTemporaryFile(bool interXact);
  extern void FileClose(File file);
+ extern int	FileCacheAdvise(File file, off_t offset, off_t amount, int advise);
  extern int	FilePrefetch(File file, off_t offset, int amount);
  extern int	FileRead(File file, char *buffer, int amount);
  extern int	FileWrite(File file, char *buffer, int amount);
***************
*** 75,80 **** extern int	FileSync(File file);
--- 76,82 ----
  extern off_t FileSeek(File file, off_t offset, int whence);
  extern int	FileTruncate(File file, off_t offset);
  extern char *FilePathName(File file);
+ extern int	BufferHintIOAdvise(File file, char *offset, off_t amount, char *strategy);
  
  /* Operations that allow use of regular stdio --- USE WITH CAUTION */
  extern FILE *AllocateFile(const char *name, const char *mode);
***************
*** 113,118 **** extern int	pg_fsync_no_writethrough(int fd);
--- 115,121 ----
  extern int	pg_fsync_writethrough(int fd);
  extern int	pg_fdatasync(int fd);
  extern int	pg_flush_data(int fd, off_t offset, off_t amount);
+ extern int	pg_fadvise(int fd, off_t offset, off_t amount, int advise);
  extern void fsync_fname(char *fname, bool isdir);
  
  /* Filename components for OpenTemporaryFile */
*** a/src/include/storage/smgr.h
--- b/src/include/storage/smgr.h
***************
*** 92,98 **** extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
  extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
  			 BlockNumber blocknum);
  extern void smgrread(SMgrRelation reln, ForkNumber forknum,
! 		 BlockNumber blocknum, char *buffer);
  extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
  		  BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
--- 92,98 ----
  extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
  			 BlockNumber blocknum);
  extern void smgrread(SMgrRelation reln, ForkNumber forknum,
! 			BlockNumber blocknum, char *buffer, char *strategy);
  extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
  		  BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
***************
*** 118,124 **** extern void mdextend(SMgrRelation reln, ForkNumber forknum,
  extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
  		   BlockNumber blocknum);
  extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer);
  extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
  		BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
--- 118,124 ----
  extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
  		   BlockNumber blocknum);
  extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer, char *strategy);
  extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
  		BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
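
In the mdread() hunk of this patch, the second argument passed to BufferHintIOAdvise() is the buffer pointer, which is then cast to off_t and used as the file offset for posix_fadvise(). The following standalone sketch shows the same advise-then-read shape with the block's byte offset passed instead, on the assumption that this was the intent. advise_and_read_block and SKETCH_BLCKSZ are hypothetical names, and pread() stands in for the FileSeek()/FileRead() pair.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192		/* stand-in for BLCKSZ */

/*
 * Give the kernel a hint about one block, then read that block.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t
advise_and_read_block(int fd, unsigned int blocknum, char *buffer, int advice)
{
	/* Byte position of the block inside the file. */
	off_t		seekpos = (off_t) blocknum * SKETCH_BLCKSZ;

	assert(buffer != NULL);

	/* The hint describes a file range, not the memory buffer. */
	if (posix_fadvise(fd, seekpos, SKETCH_BLCKSZ, advice) != 0)
		return -1;

	return pread(fd, buffer, SKETCH_BLCKSZ, seekpos);
}
```

Taking the offset as off_t rather than char * also removes the pointer-to-integer cast inside the helper.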
optimizing_kernel-readahead_using_buffer-access-strategy_v4_printf-debug.patchtext/x-diff; name=optimizing_kernel-readahead_using_buffer-access-strategy_v4_printf-debug.patchDownload
*** a/configure
--- b/configure
***************
*** 19937,19943 **** LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
  
  
  
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do
  as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
  { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5
--- 19937,19943 ----
  
  
  
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do
  as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
  { $as_echo "$as_me:$LINENO: checking for $ac_func" >&5
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1252,1257 **** include 'filename'
--- 1252,1281 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-readahead-strategy" xreflabel="readahead_strategy">
+       <term><varname>readahead_strategy</varname> (<type>boolean</type>)</term>
+       <indexterm>
+        <primary><varname>readahead_strategy</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Selects which component chooses the readahead strategy. When set to
+         off (the default), the readahead strategy is chosen by the OS. When
+         set to on, the readahead strategy is chosen by Postgres.
+         In typical situations the OS readahead strategy works well;
+         however, Postgres often knows the better readahead strategy before
+         executing a disk access. For example, the access pattern is easy to
+         predict once an SQL statement has been planned, because the planner
+         decides which access pattern reads the data fastest, and that is
+         either a random or a sequential access pattern. Using this knowledge
+         can reduce disk IO and make more efficient use of the OS file cache,
+         which improves performance. However, this optimization is not yet
+         complete, so enable it only after considering the workload. The
+         default is off, and the setting can be changed at any time.
+        </para>
+       </listitem>
+      </varlistentry> 
+ 
       </variablelist>
       </sect2>
  
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 9119,9125 **** copy_relation_data(SMgrRelation src, SMgrRelation dst,
  		/* If we got a cancel signal during the copy of the data, quit */
  		CHECK_FOR_INTERRUPTS();
  
! 		smgrread(src, forkNum, blkno, buf);
  
  		if (!PageIsVerified(page, blkno))
  			ereport(ERROR,
--- 9119,9125 ----
  		/* If we got a cancel signal during the copy of the data, quit */
  		CHECK_FOR_INTERRUPTS();
  
! 		smgrread(src, forkNum, blkno, buf, (char *) BAS_BULKREAD);
  
  		if (!PageIsVerified(page, blkno))
  			ereport(ERROR,
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 41,46 ****
--- 41,47 ----
  #include "pg_trace.h"
  #include "pgstat.h"
  #include "postmaster/bgwriter.h"
+ #include "storage/buf.h"
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"
  #include "storage/ipc.h"
***************
*** 451,457 **** ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  			if (track_io_timing)
  				INSTR_TIME_SET_CURRENT(io_start);
  
! 			smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
  
  			if (track_io_timing)
  			{
--- 452,458 ----
  			if (track_io_timing)
  				INSTR_TIME_SET_CURRENT(io_start);
  
! 			smgrread(smgr, forkNum, blockNum, (char *) bufBlock, (char *) strategy);
  
  			if (track_io_timing)
  			{
*** a/src/backend/storage/file/fd.c
--- b/src/backend/storage/file/fd.c
***************
*** 73,80 ****
--- 73,82 ----
  #include "catalog/pg_tablespace.h"
  #include "common/relpath.h"
  #include "pgstat.h"
+ #include "storage/buf.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
+ #include "storage/bufmgr.h"
  #include "utils/guc.h"
  #include "utils/resowner_private.h"
  
***************
*** 123,129 **** int			max_files_per_process = 1000;
   * setting this variable, and so need not be tested separately.
   */
  int			max_safe_fds = 32;	/* default if not changed */
! 
  
  /* Debugging.... */
  
--- 125,131 ----
   * setting this variable, and so need not be tested separately.
   */
  int			max_safe_fds = 32;	/* default if not changed */
! bool			readahead_strategy = false ;
  
  /* Debugging.... */
  
***************
*** 383,388 **** pg_flush_data(int fd, off_t offset, off_t amount)
--- 385,405 ----
  	return 0;
  }
  
+ /*
+  * pg_fadvise --- advise the OS about the expected access pattern of a file range
+  *
+  * Not all platforms have posix_fadvise. On platforms that do not support it,
+  * we do nothing here.
+  */
+ int
+ pg_fadvise(int fd, off_t offset, off_t amount, int advise)
+ {
+ #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) && defined(POSIX_FADV_RANDOM) && defined(POSIX_FADV_SEQUENTIAL)
+ 	return posix_fadvise(fd, offset, amount, advise);
+ #else
+ 	return 0;
+ #endif
+ }
  
  /*
   * fsync_fname -- fsync a file or directory, handling errors properly
***************
*** 1142,1147 **** OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
--- 1159,1203 ----
  }
  
  /*
+  * Control the OS file cache using posix_fadvise()
+  */
+ int
+ FileCacheAdvise(File file, off_t offset, off_t amount, int advise)
+ {
+ 	return pg_fadvise(VfdCache[file].fd, offset, amount, advise);
+ }
+ 
+ /*
+  * Select the OS readahead strategy using the buffer access hint from postgres. If we select
+  * POSIX_FADV_RANDOM, readahead is not executed at all and the OS file cache replacement
+  * algorithm works better, because it can count accesses to hot data accurately.
+  */
+ int
+ BufferHintIOAdvise(File file, char *offset, off_t amount, char *strategy)
+ {
+ #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) && defined(POSIX_FADV_RANDOM)
+ 	if(!readahead_strategy)
+ 		return FileCacheAdvise(file, (off_t) offset, amount, POSIX_FADV_NORMAL);
+ 
+ 	/* readahead optimization */
+ 	if(strategy != NULL)
+ 	{
+ 		printf("S");
+ 		/* use normal readahead setting, we confirmed POSIX_FADV_NORMAL is faster than FADV_SEQUENTIAL in linux. */
+ 		return FileCacheAdvise(file, (off_t) offset, amount, POSIX_FADV_NORMAL);
+ 	}
+ 	else
+ 	{
+ 		printf("R");
+ 		/* don't use readahead in kernel, so we can more effectively use OS file cache */
+ 		return FileCacheAdvise(file, (off_t) offset, amount, POSIX_FADV_RANDOM);
+ 	}
+ #else
+ 	return 0;
+ #endif
+ }
+ 
+ /*
   * close a file when done with it
   */
  void
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 162,168 **** static List *pendingUnlinks = NIL;
  static CycleCtr mdsync_cycle_ctr = 0;
  static CycleCtr mdckpt_cycle_ctr = 0;
  
- 
  typedef enum					/* behavior for mdopen & _mdfd_getseg */
  {
  	EXTENSION_FAIL,				/* ereport if segment not present */
--- 162,167 ----
***************
*** 653,659 **** mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
   */
  void
  mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer)
  {
  	off_t		seekpos;
  	int			nbytes;
--- 652,658 ----
   */
  void
  mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer, char *strategy)
  {
  	off_t		seekpos;
  	int			nbytes;
***************
*** 677,682 **** mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
--- 676,683 ----
  				 errmsg("could not seek to block %u in file \"%s\": %m",
  						blocknum, FilePathName(v->mdfd_vfd))));
  
+ 	/* Control buffered IO in OS by using posix_fadvise() */
+ 	BufferHintIOAdvise(v->mdfd_vfd, buffer, BLCKSZ, strategy);
  	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
  
  	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
*** a/src/backend/storage/smgr/smgr.c
--- b/src/backend/storage/smgr/smgr.c
***************
*** 50,56 **** typedef struct f_smgr
  	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
  											  BlockNumber blocknum);
  	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
! 										  BlockNumber blocknum, char *buffer);
  	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
  						 BlockNumber blocknum, char *buffer, bool skipFsync);
  	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
--- 50,56 ----
  	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
  											  BlockNumber blocknum);
  	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
! 					  BlockNumber blocknum, char *buffer, char *strategy);
  	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
  						 BlockNumber blocknum, char *buffer, bool skipFsync);
  	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
***************
*** 588,596 **** smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
   */
  void
  smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 		 char *buffer)
  {
! 	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer);
  }
  
  /*
--- 588,596 ----
   */
  void
  smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 		 char *buffer, char *strategy)
  {
! 	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer, strategy);
  }
  
  /*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 762,767 **** static struct config_bool ConfigureNamesBool[] =
--- 762,776 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"readahead_strategy", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("off is optimized readahead by kernel, on is optimized by postgres."),
+ 			NULL
+ 		},
+ 		&readahead_strategy,
+ 		false,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
  			gettext_noop("Enables genetic query optimization."),
  			gettext_noop("This algorithm attempts to do planning without "
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 135,140 ****
--- 135,142 ----
  
  #temp_file_limit = -1			# limits per-session temp file space
  					# in kB, or -1 for no limit
+ #readahead_strategy = off		# off is optimized by OS,
+ 					# on is optimized by postgres
  
  # - Kernel Resource Usage -
  
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 44,55 **** typedef enum
--- 44,58 ----
  /* in globals.c ... this duplicates miscadmin.h */
  extern PGDLLIMPORT int NBuffers;
  
+ 
+ 
  /* in bufmgr.c */
  extern bool zero_damaged_pages;
  extern int	bgwriter_lru_maxpages;
  extern double bgwriter_lru_multiplier;
  extern bool track_io_timing;
  extern int	target_prefetch_pages;
+ extern bool	readahead_strategy;
  
  /* in buf_init.c */
  extern PGDLLIMPORT char *BufferBlocks;
*** a/src/include/storage/fd.h
--- b/src/include/storage/fd.h
***************
*** 68,73 **** extern int	max_safe_fds;
--- 68,74 ----
  extern File PathNameOpenFile(FileName fileName, int fileFlags, int fileMode);
  extern File OpenTemporaryFile(bool interXact);
  extern void FileClose(File file);
+ extern int	FileCacheAdvise(File file, off_t offset, off_t amount, int advise);
  extern int	FilePrefetch(File file, off_t offset, int amount);
  extern int	FileRead(File file, char *buffer, int amount);
  extern int	FileWrite(File file, char *buffer, int amount);
***************
*** 75,80 **** extern int	FileSync(File file);
--- 76,82 ----
  extern off_t FileSeek(File file, off_t offset, int whence);
  extern int	FileTruncate(File file, off_t offset);
  extern char *FilePathName(File file);
+ extern int	BufferHintIOAdvise(File file, char *offset, off_t amount, char *strategy);
  
  /* Operations that allow use of regular stdio --- USE WITH CAUTION */
  extern FILE *AllocateFile(const char *name, const char *mode);
***************
*** 113,118 **** extern int	pg_fsync_no_writethrough(int fd);
--- 115,121 ----
  extern int	pg_fsync_writethrough(int fd);
  extern int	pg_fdatasync(int fd);
  extern int	pg_flush_data(int fd, off_t offset, off_t amount);
+ extern int	pg_fadvise(int fd, off_t offset, off_t amount, int advise);
  extern void fsync_fname(char *fname, bool isdir);
  
  /* Filename components for OpenTemporaryFile */
*** a/src/include/storage/smgr.h
--- b/src/include/storage/smgr.h
***************
*** 92,98 **** extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
  extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
  			 BlockNumber blocknum);
  extern void smgrread(SMgrRelation reln, ForkNumber forknum,
! 		 BlockNumber blocknum, char *buffer);
  extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
  		  BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
--- 92,98 ----
  extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
  			 BlockNumber blocknum);
  extern void smgrread(SMgrRelation reln, ForkNumber forknum,
! 			BlockNumber blocknum, char *buffer, char *strategy);
  extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
  		  BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
***************
*** 118,124 **** extern void mdextend(SMgrRelation reln, ForkNumber forknum,
  extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
  		   BlockNumber blocknum);
  extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer);
  extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
  		BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
--- 118,124 ----
  extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
  		   BlockNumber blocknum);
  extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer, char *strategy);
  extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
  		BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
#22Simon Riggs
simon@2ndQuadrant.com
In reply to: KONDO Mitsumasa (#21)
Re: Optimize kernel readahead using buffer access strategy

On 17 December 2013 11:50, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

Unpatched PG is faster than patched PG in the A and B queries, by about 1.3
times. The result of the A query is as expected, because patched PG cannot
execute readahead at all, so a cache-cold situation is bad for patched PG.
However, it might be good for a cache-hot situation, because it does not do
disk IO at all and can calculate file cache usage and know which cache is
important.

However, the result of the B query was unexpected, because my patch selects
POSIX_FADV_SEQUENTIAL correctly, but it is slow. I cannot understand that,
even though I read the kernel source code... Next, I changed
POSIX_FADV_SEQUENTIAL to POSIX_FADV_NORMAL in my patch. The B query was then
as fast as unpatched PG.

In heavily random access benchmark tests, namely pgbench and DBT-2, my
patched PG is about 1.1 - 1.3 times faster than unpatched PG. But the postgres
buffer hint strategy algorithm has not been optimized for readahead strategy
yet, and I have not fixed it. It is still only for the ring buffer algorithm
in shared_buffers.
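
As a rough illustration of the behaviour described in the quoted results, the advice selection can be sketched as below. This is a minimal standalone sketch, not the patch itself: `choose_readahead_advice` and `advise_block` are hypothetical names, and `has_strategy` stands in for the patch's `strategy != NULL` test.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose posix_fadvise() and POSIX_FADV_* */
#endif
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper: pick the advice value to hand to the kernel.
 * readahead_strategy mirrors the proposed setting; has_strategy mirrors
 * "strategy != NULL" (a ring-buffer access strategy such as bulk read).
 */
int
choose_readahead_advice(bool readahead_strategy, bool has_strategy)
{
    if (!readahead_strategy)
        return POSIX_FADV_NORMAL;   /* setting off: leave readahead to the OS */
    if (has_strategy)
        return POSIX_FADV_NORMAL;   /* bulk access: keep normal readahead */
    return POSIX_FADV_RANDOM;       /* ordinary access: disable readahead */
}

/* Apply the chosen advice to one range of an already opened file. */
int
advise_block(int fd, off_t offset, off_t len,
             bool readahead_strategy, bool has_strategy)
{
    return posix_fadvise(fd, offset, len,
                         choose_readahead_advice(readahead_strategy,
                                                 has_strategy));
}
```

With the setting on, only accesses without a buffer access strategy get POSIX_FADV_RANDOM; everything else stays on the kernel's normal readahead, which matches the change from SEQUENTIAL to NORMAL reported above.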

These are interesting results. Good research.

They also show that the benefit of this is very specific to the exact
task being performed. I can't see any future for a setting that
applies to everything or nothing. We must be more selective.

We also need much better benchmark results, clearly laid out, so they
can be reproduced and discussed.
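
One way such measurements could be made easier to reproduce is a small standalone reader that applies a single hint and times the read. The following is only a sketch under assumptions: `timed_read` is a hypothetical helper, the 8 kB chunk size simply mirrors the default PostgreSQL block size, and the OS cache state (cold or hot) has to be controlled outside this code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* posix_fadvise(), clock_gettime() */
#endif
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical micro-benchmark helper: read a whole file in 8 kB chunks
 * after giving the kernel one access-pattern hint, then report the bytes
 * read and the elapsed wall-clock seconds. Returns 0 on success, -1 on error.
 */
int
timed_read(const char *path, int advice, long long *bytes, double *seconds)
{
    char        buf[8192];
    struct timespec start;
    struct timespec end;
    ssize_t     n;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    *bytes = 0;
    (void) posix_fadvise(fd, 0, 0, advice);   /* a hint only; failure is ignored */
    clock_gettime(CLOCK_MONOTONIC, &start);
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        *bytes += n;
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd);
    *seconds = (end.tv_sec - start.tv_sec) +
        (end.tv_nsec - start.tv_nsec) / 1e9;
    return (n < 0) ? -1 : 0;
}
```

Calling it once per advice value (for example POSIX_FADV_NORMAL, POSIX_FADV_SEQUENTIAL and POSIX_FADV_RANDOM) on the same file gives directly comparable numbers, provided the cache state is the same for each run.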

Please keep working on this.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: Simon Riggs (#22)
Re: Optimize kernel readahead using buffer access strategy

(2013/12/17 21:29), Simon Riggs wrote:

These are interesting results. Good research.

Thanks!

They also show that the benefit of this is very specific to the exact
task being performed. I can't see any future for a setting that
applies to everything or nothing. We must be more selective.

This patch still needs some human judgement about whether readahead should be
on or off. But it might already be useful for clever users. However, I'd like
to implement at least a minimum of additional optimization.

We also need much better benchmark results, clearly laid out, so they
can be reproduced and discussed.

I think this feature is a big benefit for OLTP, and it might be useful for BI
now. BI queries are mostly complicated, so we will need to test more
situations. Printf debugging is very useful for debugging my patch, and it
will accelerate the optimization.

Please keep working on this.

OK. I will work on it patiently.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


#24KONDO Mitsumasa
kondo.mitsumasa@lab.ntt.co.jp
In reply to: KONDO Mitsumasa (#23)
1 attachment(s)
Re: Optimize kernel readahead using buffer access strategy

Hi,

I have fixed this patch and submitted it to CF4.

My previous patch had a significant bug: the offset passed to posix_fadvise()
was calculated incorrectly :-( However, it worked without problems in pgbench,
because pgbench transactions are always random access...
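
For reference, the earlier patch passed the buffer address where a file offset was expected, while v5 passes `seekpos`. A minimal sketch of the block-to-offset calculation follows; `block_seekpos` is a hypothetical name, and the BLCKSZ and RELSEG_SIZE values are the usual build defaults, assumed here rather than taken from this thread.

```c
#include <stdint.h>
#include <sys/types.h>

#define BLCKSZ      8192      /* assumed default block size in bytes */
#define RELSEG_SIZE 131072    /* assumed default blocks per 1 GB segment file */

/*
 * Byte offset of a block inside its segment file. This is the value that
 * should be handed to posix_fadvise(), not the address of the read buffer.
 */
off_t
block_seekpos(uint32_t blocknum)
{
    return (off_t) BLCKSZ * (blocknum % (uint32_t) RELSEG_SIZE);
}
```

Block 3 maps to byte 24576 of the first segment, and block 131072 wraps around to offset 0 of the next segment file.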

I also tested my patch with the DBT-2 benchmark. The results are as follows.

* Test server
Server: HP Proliant DL360 G7
CPU: Xeon E5640 2.66GHz (1P/4C)
Memory: 18GB(PC3-10600R-9)
Disk: 146GB(15k)*4 RAID1+0
RAID controller: P410i/256MB
OS: RHEL 6.4(x86_64)
FS: Ext4

* DBT-2 result (WH400, SESSION=100, ideal_score=5160)
Method     | score | average | 90%tile | Maximum
-----------+-------+---------+---------+--------
plain      |  3589 |   9.751 |  33.680 | 87.8036
option=off |  3670 |   9.107 |  34.267 | 79.3773
option=on  |  4222 |   5.140 |   7.619 | 102.473

"option" is the "readahead_strategy" setting, and "on" is my proposed method.
"average", "90%tile", and "Maximum" are latencies.
The average latency with option=on is about half that of plain!
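
The relative changes in the table can be checked with a few lines of arithmetic. The sketch below uses only the numbers reported above; the `Dbt2Row` type and the helper names are hypothetical.

```c
#include <stdio.h>

/* One row of the DBT-2 table above (latencies in the units reported there). */
typedef struct
{
    const char *method;
    double      score;
    double      avg_latency;
    double      p90_latency;
    double      max_latency;
} Dbt2Row;

static const Dbt2Row dbt2_plain = {"plain", 3589, 9.751, 33.680, 87.8036};
static const Dbt2Row dbt2_on = {"option=on", 4222, 5.140, 7.619, 102.473};

/* How many times larger a is than b. */
double
ratio(double a, double b)
{
    return a / b;
}

/* Print how option=on compares with plain for each column. */
void
print_dbt2_summary(void)
{
    printf("score:           x%.2f higher\n",
           ratio(dbt2_on.score, dbt2_plain.score));
    printf("average latency: x%.2f lower\n",
           ratio(dbt2_plain.avg_latency, dbt2_on.avg_latency));
    printf("90%%tile latency: x%.2f lower\n",
           ratio(dbt2_plain.p90_latency, dbt2_on.p90_latency));
    printf("maximum latency: x%.2f higher\n",
           ratio(dbt2_on.max_latency, dbt2_plain.max_latency));
}
```

By these numbers the score rises by roughly 18%, the average latency drops by a factor of about 1.9 and the 90th percentile by about 4.4, while the maximum latency is roughly 17% worse than plain.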

* Detailed results (uploading now; please wait about an hour...)
[plain]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/normal_20140109/HTML/index_thput.html
[option=off]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/readahead_off_20140109/HTML/index_thput.html
[option=on]
http://pgstatsinfo.projects.pgfoundry.org/readahead_dbt2/readahead_on_20140109/HTML/index_thput.html

We can see some periods of very slow latency in the test of my proposed method.
The share of active transactions is 20%, and the share of slow transactions is
80%. It might be the Pareto principle in CHECKPOINT ;-)
# That's a joke.

I will test the performance of some join queries and the TPC-3 benchmark this week or next.

Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center

Attachments:

optimizing_kernel-readahead_using_buffer-access-strategy_v5.patch (text/x-diff)
*** a/configure
--- b/configure
***************
*** 11303,11309 **** fi
  LIBS_including_readline="$LIBS"
  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
  
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do :
    as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
  ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
--- 11303,11309 ----
  LIBS_including_readline="$LIBS"
  LIBS=`echo "$LIBS" | sed -e 's/-ledit//g' -e 's/-lreadline//g'`
  
! for ac_func in cbrt dlopen fdatasync getifaddrs getpeerucred getrlimit mbstowcs_l memmove poll posix_fadvise pstat readlink setproctitle setsid shm_open sigprocmask symlink sync_file_range towlower utime utimes wcstombs wcstombs_l
  do :
    as_ac_var=`$as_echo "ac_cv_func_$ac_func" | $as_tr_sh`
  ac_fn_c_check_func "$LINENO" "$ac_func" "$as_ac_var"
*** a/contrib/pg_prewarm/pg_prewarm.c
--- b/contrib/pg_prewarm/pg_prewarm.c
***************
*** 179,185 **** pg_prewarm(PG_FUNCTION_ARGS)
  		 */
  		for (block = first_block; block <= last_block; ++block)
  		{
! 			smgrread(rel->rd_smgr, forkNumber, block, blockbuffer);
  			++blocks_done;
  		}
  	}
--- 179,185 ----
  		 */
  		for (block = first_block; block <= last_block; ++block)
  		{
! 			smgrread(rel->rd_smgr, forkNumber, block, blockbuffer, (char *) BAS_BULKREAD);
  			++blocks_done;
  		}
  	}
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1293,1298 **** include 'filename'
--- 1293,1322 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="guc-readahead-strategy" xreflabel="readahead_strategy">
+       <term><varname>readahead_strategy</varname> (<type>boolean</type>)</term>
+       <indexterm>
+        <primary><varname>readahead_strategy</> configuration parameter</primary>
+       </indexterm>
+       <listitem>
+        <para>
+         Selects which readahead strategy is used. When this is set to
+         off (the default), the readahead strategy is chosen by the OS. On
+         the other hand, when it is set to on, the readahead strategy is
+         chosen by Postgres. In typical situations the OS readahead strategy
+         works well, but Postgres often knows a better readahead strategy
+         before it executes a disk access. For example, the access pattern
+         can be predicted easily when an SQL statement is submitted, because
+         the Postgres planner decides on an efficient access pattern, which
+         may be random or sequential. Using this knowledge can reduce disk
+         I/O and use the OS file cache more efficiently, which can improve
+         performance. However, this optimization is not yet complete, so the
+         setting should be chosen carefully for each situation. The default
+         is off (optimized by the OS), and it can be changed at any time.
+        </para>
+       </listitem>
+      </varlistentry> 
+ 
       </variablelist>
       </sect2>
  
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 9125,9131 **** copy_relation_data(SMgrRelation src, SMgrRelation dst,
  		/* If we got a cancel signal during the copy of the data, quit */
  		CHECK_FOR_INTERRUPTS();
  
! 		smgrread(src, forkNum, blkno, buf);
  
  		if (!PageIsVerified(page, blkno))
  			ereport(ERROR,
--- 9125,9131 ----
  		/* If we got a cancel signal during the copy of the data, quit */
  		CHECK_FOR_INTERRUPTS();
  
! 		smgrread(src, forkNum, blkno, buf, (char *) BAS_BULKREAD);
  
  		if (!PageIsVerified(page, blkno))
  			ereport(ERROR,
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 41,46 ****
--- 41,47 ----
  #include "pg_trace.h"
  #include "pgstat.h"
  #include "postmaster/bgwriter.h"
+ #include "storage/buf.h"
  #include "storage/buf_internals.h"
  #include "storage/bufmgr.h"
  #include "storage/ipc.h"
***************
*** 451,457 **** ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
  			if (track_io_timing)
  				INSTR_TIME_SET_CURRENT(io_start);
  
! 			smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
  
  			if (track_io_timing)
  			{
--- 452,458 ----
  			if (track_io_timing)
  				INSTR_TIME_SET_CURRENT(io_start);
  
! 			smgrread(smgr, forkNum, blockNum, (char *) bufBlock, (char *) strategy);
  
  			if (track_io_timing)
  			{
*** a/src/backend/storage/file/fd.c
--- b/src/backend/storage/file/fd.c
***************
*** 73,80 ****
--- 73,82 ----
  #include "catalog/pg_tablespace.h"
  #include "common/relpath.h"
  #include "pgstat.h"
+ #include "storage/buf.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
+ #include "storage/bufmgr.h"
  #include "utils/guc.h"
  #include "utils/resowner_private.h"
  
***************
*** 123,129 **** int			max_files_per_process = 1000;
   * setting this variable, and so need not be tested separately.
   */
  int			max_safe_fds = 32;	/* default if not changed */
! 
  
  /* Debugging.... */
  
--- 125,131 ----
   * setting this variable, and so need not be tested separately.
   */
  int			max_safe_fds = 32;	/* default if not changed */
! bool			readahead_strategy = false ;
  
  /* Debugging.... */
  
***************
*** 383,388 **** pg_flush_data(int fd, off_t offset, off_t amount)
--- 385,405 ----
  	return 0;
  }
  
+ /*
+  * pg_fadvise --- advise the OS about the expected access pattern of a file range
+  *
+  * Not all platforms have posix_fadvise. On platforms that do not support it,
+  * we do nothing here.
+  */
+ int
+ pg_fadvise(int fd, off_t offset, off_t amount, int advise)
+ {
+ #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) && defined(POSIX_FADV_RANDOM) && defined(POSIX_FADV_SEQUENTIAL)
+ 	return posix_fadvise(fd, offset, amount, advise);
+ #else
+ 	return 0;
+ #endif
+ }
  
  /*
   * fsync_fname -- fsync a file or directory, handling errors properly
***************
*** 1142,1147 **** OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
--- 1159,1201 ----
  }
  
  /*
+  * Control the OS file cache using posix_fadvise()
+  */
+ int
+ FileCacheAdvise(File file, off_t offset, off_t amount, int advise)
+ {
+ 	return pg_fadvise(VfdCache[file].fd, offset, amount, advise);
+ }
+ 
+ /*
+  * Select the OS readahead strategy using the buffer access hint from postgres. If we select
+  * POSIX_FADV_RANDOM, readahead is not executed at all and the OS file cache replacement
+  * algorithm works better, because it can count accesses to hot data accurately.
+  */
+ int
+ BufferHintIOAdvise(File file, off_t offset, off_t amount, char *strategy)
+ {
+ #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED) && defined(POSIX_FADV_RANDOM)
+ 	if(!readahead_strategy)
+ 		return FileCacheAdvise(file, offset, amount, POSIX_FADV_NORMAL);
+ 
+ 	/* readahead optimization */
+ 	if(strategy != NULL)
+ 	{
+ 		/* use normal readahead setting, we confirmed POSIX_FADV_NORMAL is faster than FADV_SEQUENTIAL in linux. */
+ 		return FileCacheAdvise(file, offset, amount, POSIX_FADV_NORMAL);
+ 	}
+ 	else
+ 	{
+ 		/* don't use readahead in kernel, so we can more effectively use OS file cache */
+ 		return FileCacheAdvise(file, offset, amount, POSIX_FADV_RANDOM);
+ 	}
+ #else
+ 	return 0;
+ #endif
+ }
+ 
+ /*
   * close a file when done with it
   */
  void
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 653,659 **** mdprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
   */
  void
  mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer)
  {
  	off_t		seekpos;
  	int			nbytes;
--- 653,659 ----
   */
  void
  mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer, char *strategy)
  {
  	off_t		seekpos;
  	int			nbytes;
***************
*** 677,682 **** mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
--- 677,684 ----
  				 errmsg("could not seek to block %u in file \"%s\": %m",
  						blocknum, FilePathName(v->mdfd_vfd))));
  
+ 	/* Control buffered IO in OS by using posix_fadvise() */
+ 	BufferHintIOAdvise(v->mdfd_vfd, seekpos, BLCKSZ, strategy);
  	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
  
  	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
*** a/src/backend/storage/smgr/smgr.c
--- b/src/backend/storage/smgr/smgr.c
***************
*** 50,56 **** typedef struct f_smgr
  	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
  											  BlockNumber blocknum);
  	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
! 										  BlockNumber blocknum, char *buffer);
  	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
  						 BlockNumber blocknum, char *buffer, bool skipFsync);
  	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
--- 50,56 ----
  	void		(*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
  											  BlockNumber blocknum);
  	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
! 					  BlockNumber blocknum, char *buffer, char *strategy);
  	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
  						 BlockNumber blocknum, char *buffer, bool skipFsync);
  	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
***************
*** 588,596 **** smgrprefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
   */
  void
  smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 		 char *buffer)
  {
! 	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer);
  }
  
  /*
--- 588,596 ----
   */
  void
  smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 		 char *buffer, char *strategy)
  {
! 	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer, strategy);
  }
  
  /*
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 769,774 **** static struct config_bool ConfigureNamesBool[] =
--- 769,783 ----
  		NULL, NULL, NULL
  	},
  	{
+ 		{"readahead_strategy", PGC_USERSET, QUERY_TUNING_METHOD,
+ 			gettext_noop("off is optimized readahead by kernel, on is optimized by postgres."),
+ 			NULL
+ 		},
+ 		&readahead_strategy,
+ 		false,
+ 		NULL, NULL, NULL
+ 	},
+ 	{
  		{"geqo", PGC_USERSET, QUERY_TUNING_GEQO,
  			gettext_noop("Enables genetic query optimization."),
  			gettext_noop("This algorithm attempts to do planning without "
*** a/src/backend/utils/misc/postgresql.conf.sample
--- b/src/backend/utils/misc/postgresql.conf.sample
***************
*** 138,143 ****
--- 138,145 ----
  
  #temp_file_limit = -1			# limits per-session temp file space
  					# in kB, or -1 for no limit
+ #readahead_strategy = off		# off is optimized by OS,
+ 					# on is optimized by postgres
  
  # - Kernel Resource Usage -
  
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 50,55 **** extern int	bgwriter_lru_maxpages;
--- 50,56 ----
  extern double bgwriter_lru_multiplier;
  extern bool track_io_timing;
  extern int	target_prefetch_pages;
+ extern bool	readahead_strategy;
  
  /* in buf_init.c */
  extern PGDLLIMPORT char *BufferBlocks;
*** a/src/include/storage/fd.h
--- b/src/include/storage/fd.h
***************
*** 68,73 **** extern int	max_safe_fds;
--- 68,74 ----
  extern File PathNameOpenFile(FileName fileName, int fileFlags, int fileMode);
  extern File OpenTemporaryFile(bool interXact);
  extern void FileClose(File file);
+ extern int	FileCacheAdvise(File file, off_t offset, off_t amount, int advise);
  extern int	FilePrefetch(File file, off_t offset, int amount);
  extern int	FileRead(File file, char *buffer, int amount);
  extern int	FileWrite(File file, char *buffer, int amount);
***************
*** 75,80 **** extern int	FileSync(File file);
--- 76,82 ----
  extern off_t FileSeek(File file, off_t offset, int whence);
  extern int	FileTruncate(File file, off_t offset);
  extern char *FilePathName(File file);
+ extern int	BufferHintIOAdvise(File file, off_t offset, off_t amount, char *strategy);
  
  /* Operations that allow use of regular stdio --- USE WITH CAUTION */
  extern FILE *AllocateFile(const char *name, const char *mode);
***************
*** 113,118 **** extern int	pg_fsync_no_writethrough(int fd);
--- 115,121 ----
  extern int	pg_fsync_writethrough(int fd);
  extern int	pg_fdatasync(int fd);
  extern int	pg_flush_data(int fd, off_t offset, off_t amount);
+ extern int	pg_fadvise(int fd, off_t offset, off_t amount, int advise);
  extern void fsync_fname(char *fname, bool isdir);
  
  /* Filename components for OpenTemporaryFile */
*** a/src/include/storage/smgr.h
--- b/src/include/storage/smgr.h
***************
*** 92,98 **** extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
  extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
  			 BlockNumber blocknum);
  extern void smgrread(SMgrRelation reln, ForkNumber forknum,
! 		 BlockNumber blocknum, char *buffer);
  extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
  		  BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
--- 92,98 ----
  extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
  			 BlockNumber blocknum);
  extern void smgrread(SMgrRelation reln, ForkNumber forknum,
! 			BlockNumber blocknum, char *buffer, char *strategy);
  extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
  		  BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
***************
*** 118,124 **** extern void mdextend(SMgrRelation reln, ForkNumber forknum,
  extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
  		   BlockNumber blocknum);
  extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer);
  extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
  		BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
--- 118,124 ----
  extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
  		   BlockNumber blocknum);
  extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
! 	   char *buffer, char *strategy);
  extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
  		BlockNumber blocknum, char *buffer, bool skipFsync);
  extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
#25Claudio Freire
klaussfreire@gmail.com
In reply to: KONDO Mitsumasa (#24)
Re: Optimize kernel readahead using buffer access strategy

On Tue, Jan 14, 2014 at 8:58 AM, KONDO Mitsumasa
<kondo.mitsumasa@lab.ntt.co.jp> wrote:

My previous patch had a significant bug: the offset passed to posix_fadvise()
was calculated incorrectly :-( However, it worked without problems in pgbench,
because pgbench transactions are always random access...

Did you notice any difference?

AFAIK, when specifying read patterns (ie, RANDOM, SEQUENTIAL and stuff
like that), the offset doesn't matter. At least in linux.
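
If the pattern hints do apply to the whole open file, as suggested above for Linux, the hint could be set once per descriptor instead of once per block read. A minimal sketch under that assumption; `set_file_access_pattern` is a hypothetical helper and not part of the patch.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose posix_fadvise() and POSIX_FADV_* */
#endif
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helper: set the access-pattern hint once for a whole file.
 * offset = 0 and len = 0 mean "from the start to the end of the file".
 * Returns 0 on success or an errno value, like posix_fadvise() itself.
 */
int
set_file_access_pattern(int fd, bool random_access)
{
    return posix_fadvise(fd, 0, 0,
                         random_access ? POSIX_FADV_RANDOM : POSIX_FADV_NORMAL);
}
```

Whether a single call is enough on other platforms is not established in this thread, so the per-read call in the patch may still be the more portable choice.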


#27Andres Freund
andres@2ndquadrant.com
In reply to: KONDO Mitsumasa (#24)
Re: Optimize kernel readahead using buffer access strategy

Hi,

On 2014-01-14 20:58:20 +0900, KONDO Mitsumasa wrote:

I will test the performance of some join queries and the TPC-3 benchmark this week or next.

This patch has been marked as "Waiting For Author" for nearly two months
now. Marked as "Returned with Feedback".

Greetings,

Andres Freund
