Prefetch index pages for B-Tree index scans
While running some reporting queries a while ago, I noticed that index
scans fail to saturate server resources on compute-intensive queries.
The problem is that, just after fetching a page, postgres starts
computing before fetching the next one. The resulting I/O - compute -
I/O - compute alternation leaves both CPU and disk partially idle.
I've also noticed that earlier patches attempted to implement prefetching
of index pages, yet they never seem to have been committed, and I'm
wondering why. The patches themselves were quite complex, attempting to
prefetch heap tuples in addition to index tuples. I can see how that
would be beneficial, but heap prefetch is notoriously more complicated,
and with the advent of index-only scans, maybe prefetching index pages
alone would be of use.
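For context, the mechanism I'm relying on is the same one effective_io_concurrency
already drives for bitmap heap scans: PrefetchBuffer(), which on Linux ends up in
posix_fadvise(POSIX_FADV_WILLNEED). Conceptually it amounts to this minimal sketch
(not the patch itself; prefetch_block, fd and blcksz are illustrative names):
    #include <fcntl.h>
    /* Hint the kernel to start reading a block asynchronously, so the
     * synchronous read issued later finds it already in the page cache. */
    static void
    prefetch_block(int fd, unsigned int blkno, unsigned int blcksz)
    {
        (void) posix_fadvise(fd, (off_t) blkno * blcksz, blcksz,
                             POSIX_FADV_WILLNEED);
    }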
To test this hypothesis, I wrote the attached patch. pgbench doesn't
see any performance impact (expected, since pgbench uses really tiny
queries that touch a single index page). Using pgbench's data at scale
1000, I tried some other queries that exercise index-only scans more
heavily. This was done on my personal computer, with a really lame I/O
subsystem (compared to database servers), and with 9.2.1 rather than
git, but I think the results should still be meaningful. I will try to
verify on RAID-ed disks, but I'm in short supply of test systems at the
moment.
Pgbench's biggest index (on pgbench_accounts.aid) is, unsurprisingly,
quite sequential on disk, since it has just been created. So I tested
both forward and backward index scans, to get an idea of how the patch
affects sequential and non-sequential workloads. I haven't been able to
test truly random ones yet - I'm trying to load a dump of a production
database that's both big and messy enough to test with.
A few things worth noting: the patch avoids prefetching for single-page
index accesses, and it only issues prefetches when block numbers aren't
sequentially increasing (i.e. only when the index is being scanned
nonsequentially), since I noticed the fadvise call was counterproductive
in the sequential case.
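Concretely, the forward-scan side of that heuristic in the attached diff boils
down to something like this (paraphrasing the hunk, not quoting it verbatim):
    /* Only prefetch the right sibling when the scan will continue past this
     * page and that sibling is not the physically next block; for strictly
     * sequential access the kernel's own readahead already does the job. */
    if (prefetch && so->currPos.moreRight &&
        opaque->btpo_next != blkno + 1)
        _bt_prefetchbuf(rel, opaque->btpo_next);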
So, here we go:
The base query I used is:
explain analyze
select aid, count(*)
from pgbench_accounts
where aid between 10000000 and 200000000
group by aid
order by aid;
For backward scans, just order by aid desc.
The full command used is:
sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches' ; \
pg_ctl start -l ${PGLOG} ; sleep 5 ; \
( sleep 10 ; \
  echo 'set effective_io_concurrency to 0;' ; \
  echo 'explain analyze select aid, count(*) from pgbench_accounts where aid between 10000000 and 200000000 group by aid order by aid;' \
) | psql -h localhost -p 5433 pgbench ; \
sleep 5 ; pg_ctl stop
server starting
The server is started and stopped every time to make sure the shared
cache is empty, and the sleeps are there so that the I/O backlash caused
by dropping the caches doesn't influence the benchmark.
Results:
With effective_io_concurrency set to 0 (patch disabled):
Forward:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.00..4149039.04 rows=90257289 width=4) (actual
time=47.552..31113.353 rows=90000001 loops=1)
-> Index Only Scan using pgbench_accounts_pkey on pgbench_accounts
(cost=0.00..2795179.71 rows=90257289 width=4) (actual
time=47.542..13197.982 rows=90000001 loops=1)
Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
Heap Fetches: 0
Total runtime: 33648.500 ms
I/O throughput averages 60MB/s (clearly sequential)
I/O utilization averages around 30%
Backward:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.00..4149039.04 rows=90257289 width=4) (actual
time=10.279..159704.590 rows=90000001 loops=1)
-> Index Only Scan Backward using pgbench_accounts_pkey on
pgbench_accounts (cost=0.00..2795179.71 rows=90257289 width=4)
(actual time=10.266..132853.382 rows=90000001 loops=1)
Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
Heap Fetches: 0
Total runtime: 163202.869 ms
I/O throughput averages 11MB/s (clearly not fully random, but not sequential either)
I/O utilization averages 68%
With effective_io_concurrency set to 1 (patch enabled):
Forward:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.00..4149039.04 rows=90257289 width=4) (actual
time=47.582..30474.222 rows=90000001 loops=1)
-> Index Only Scan using pgbench_accounts_pkey on pgbench_accounts
(cost=0.00..2795179.71 rows=90257289 width=4) (actual
time=47.571..12208.340 rows=90000001 loops=1)
Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
Heap Fetches: 0
Total runtime: 32875.695 ms
I/O throughput still averages 60MB/s (the tiny performance improvement
is probably just variation, although across the multiple tests I've made
there is a systematic bias towards lower times, probably the effect of
prefetch in the few cases where the index jumps nonsequentially)
I/O utilization remains around 35%
Backward:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.00..4149039.04 rows=90257289 width=4) (actual
time=28.190..157708.405 rows=90000001 loops=1)
-> Index Only Scan Backward using pgbench_accounts_pkey on
pgbench_accounts (cost=0.00..2795179.71 rows=90257289 width=4)
(actual time=28.178..135282.317 rows=90000001 loops=1)
Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
Heap Fetches: 0
Total runtime: 160735.539 ms
I/O throughput averages 12MB/s (a small increase), and the 3-second
difference seems related to it (it's consistent).
I/O utilization averages 88% (important increase)
This last result makes me think deeper prefetching could potentially be
beneficial (it would result in read merges), but it's rather hard to
implement without a skiplist of leaf pages. Maybe the backward-sequential
pattern could be detected. I'll have to tinker with that.
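The rough shape of the detection I have in mind (an illustrative sketch only;
run_len and next_back are hypothetical state fields, not something in the
attached patch):
    /* If the left sibling is physically the previous block, treat it as part
     * of a back-sequential run and keep extending a prefetch window of
     * decreasing block numbers; any jump breaks the run. */
    if (opaque->btpo_prev == blkno - 1)
    {
        if (run_len++ == 0)
            next_back = blkno - 1;
        if (next_back > 0)
            _bt_prefetchbuf(rel, next_back--);
    }
    else
        run_len = 0;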
With a hot OS cache, there's no discernible difference in timings.
For forward heap-including scans (using sum(abalance) instead of
count(*)), the times are 418522.583ms with the patch and 425497.436ms
without (still a very small, but consistent, improvement).
For backward heap-including scans, the times are 1764391.372ms with
(I/O at 8.4MB/s and 95% utilization) and 1794715.164ms without
(7.8MB/s at 92% utilization). So, not whopping, but better.
However small the improvement (5% on backward scans), since I see no
ill effects, I thought I'd post the patch. Besides, I'm convinced that,
first, with more chaotic indexes the improvement should be noticeably
bigger, and second, with more complex queries the benefit from
overlapping I/O and computation may be too, especially for queries
involving more than one index scan.
I believe this because, if I attach strace to the backend process, the
improvement with prefetch jumps to 30% on backward scans, probably due
to the increased computation window between reads.
Attachments:
postgresql-git-bt_prefetch.diff (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 016ce22..1115afc 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -487,6 +487,19 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
}
/*
+ * _bt_prefetchbuf() -- Prefetch a buffer by block number
+ */
+void
+_bt_prefetchbuf(Relation rel, BlockNumber blkno)
+{
+ if (blkno != P_NEW && blkno != P_NONE)
+ {
+ /* Just prefetch an existing block of the relation */
+ PrefetchBuffer(rel, MAIN_FORKNUM, blkno);
+ }
+}
+
+/*
* _bt_getbuf() -- Get a buffer by block number for read or write.
*
* blkno == P_NEW means to get an unallocated index page. The page
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e0c9523..22c6d34 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -28,7 +28,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir,
+ bool prefetch);
static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -961,7 +962,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
@@ -1008,7 +1009,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
@@ -1021,7 +1022,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
@@ -1180,7 +1181,7 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
* locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
*/
static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+_bt_steppage(IndexScanDesc scan, ScanDirection dir, bool prefetch)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -1243,8 +1244,17 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque))) {
+ if (prefetch && so->currPos.moreRight
+ && (opaque->btpo_next != (blkno+1)))
+ {
+ /* start prefetch on next page, but not
+ if we're reading sequentially already,
+ as it's counterproductive in those cases */
+ _bt_prefetchbuf(rel, opaque->btpo_next);
+ }
break;
+ }
}
/* nope, keep going */
blkno = opaque->btpo_next;
@@ -1291,8 +1301,13 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page))) {
+ if (prefetch && so->currPos.moreLeft) {
+ /* start prefetch on next page */
+ _bt_prefetchbuf(rel, opaque->btpo_prev);
+ }
break;
+ }
}
}
}
@@ -1587,7 +1602,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d4941e0..3360b6b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -628,6 +628,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern void _bt_prefetchbuf(Relation rel, BlockNumber blkno);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
On Thu, Oct 18, 2012 at 5:30 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
Backward:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.00..4149039.04 rows=90257289 width=4) (actual
time=28.190..157708.405 rows=90000001 loops=1)
-> Index Only Scan Backward using pgbench_accounts_pkey on
pgbench_accounts (cost=0.00..2795179.71 rows=90257289 width=4)
(actual time=28.178..135282.317 rows=90000001 loops=1)
Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
Heap Fetches: 0
Total runtime: 160735.539 ms
I/O throughput averages 12MB/s (a small increase), and the 3-second
difference seems related to it (it's consistent).
I/O utilization averages 88% (important increase)
This last result makes me think deeper prefetching could be
potentially beneficial (it would result in read merges), but it's
rather hard to implement without a skiplist of leaf pages. Maybe the
backward-sequential pattern could be detected. I'll have to tinker
with that.
Fun. That didn't take long.
With the attached anti-sequential scan patch, and effective_io_concurrency=8:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.00..4149039.04 rows=90257289 width=4) (actual
time=26.964..84299.789 rows=90000001 loops=1)
-> Index Only Scan Backward using pgbench_accounts_pkey on
pgbench_accounts (cost=0.00..2795179.71 rows=90257289 width=4)
(actual time=26.955..62761.774 rows=90000001 loops=1)
Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
Heap Fetches: 0
Total runtime: 87170.355 ms
I/O throughput 22MB/s (twice as fast)
I/O utilization 95% (I was expecting 100% but... hey... good enough)
With e_i_c=24, it gets to 100% utilization and 30MB/s (that's 3 times
faster). So, I'd like to know what you think, but maybe for
back-sequential scans the prefetch distance should be set to a multiple
(e.g. x24) of e_i_c, in order to exploit read request merges.
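For instance, something along these lines inside the back-sequential branch
(an illustrative sketch; the x16 factor and the 64-page cap are placeholders,
not tuned values):
    /* Widen the prefetch distance for back-sequential runs so the kernel can
     * merge the individual fadvise requests into larger reads. */
    unsigned int back_prefetch_pages = target_prefetch_pages * 16;
    if (back_prefetch_pages > 64)
        back_prefetch_pages = 64;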
Attachments:
postgresql-git-bt_prefetch_backseq.diff (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 016ce22..1115afc 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -487,6 +487,19 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
}
/*
+ * _bt_prefetchbuf() -- Prefetch a buffer by block number
+ */
+void
+_bt_prefetchbuf(Relation rel, BlockNumber blkno)
+{
+ if (blkno != P_NEW && blkno != P_NONE)
+ {
+ /* Just prefetch an existing block of the relation */
+ PrefetchBuffer(rel, MAIN_FORKNUM, blkno);
+ }
+}
+
+/*
* _bt_getbuf() -- Get a buffer by block number for read or write.
*
* blkno == P_NEW means to get an unallocated index page. The page
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b055eaf..a2b9d99 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -432,6 +432,9 @@ btbeginscan(PG_FUNCTION_ARGS)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->backSeqRun = 0;
+ so->backSeqPos = 0;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e0c9523..f815bab 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -28,7 +28,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir,
+ bool prefetch);
static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -961,7 +962,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
@@ -1008,7 +1009,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
@@ -1021,7 +1022,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
@@ -1075,9 +1076,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/*
* we must save the page's right-link while scanning it; this tells us
* where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+ * corresponding need for the left-link, since splits always go right,
+ * but we need it for back-sequential scan detection.
*/
so->currPos.nextPage = opaque->btpo_next;
+ so->currPos.prevPage = opaque->btpo_prev;
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -1180,7 +1183,7 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
* locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
*/
static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+_bt_steppage(IndexScanDesc scan, ScanDirection dir, bool prefetch)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -1243,8 +1246,17 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque))) {
+ if (prefetch && so->currPos.moreRight
+ && (opaque->btpo_next != (blkno+1)))
+ {
+ /* start prefetch on next page, but not
+ if we're reading sequentially already,
+ as it's counterproductive in those cases */
+ _bt_prefetchbuf(rel, opaque->btpo_next);
+ }
break;
+ }
}
/* nope, keep going */
blkno = opaque->btpo_next;
@@ -1288,11 +1300,49 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_IGNORE(opaque))
{
+ /* We must rely on the previously saved prevPage link! */
+ BlockNumber blkno = so->currPos.prevPage;
+
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page))) {
+ if (prefetch && so->currPos.moreLeft) {
+ /* detect back-sequential runs and increase
+ prefetch window blindly downwards 2 blocks
+ at a time */
+ if (blkno > 0 && opaque->btpo_prev == (blkno-1)) {
+ BlockNumber backPos;
+
+ if (so->backSeqRun == 0)
+ backPos = (blkno-1);
+ else
+ backPos = so->backSeqPos;
+ so->backSeqRun++;
+
+ if (backPos > 0 && (blkno - backPos) <= target_prefetch_pages) {
+ _bt_prefetchbuf(rel, backPos--);
+ /* don't start back-seq prefetch too early */
+ if (so->backSeqRun >= target_prefetch_pages
+ && backPos > 0
+ && (blkno - backPos) <= target_prefetch_pages)
+ {
+ _bt_prefetchbuf(rel, backPos--);
+ }
+ }
+
+ so->backSeqPos = backPos;
+ } else {
+ /* start prefetch on next page */
+ if (so->backSeqRun != 0) {
+ if (opaque->btpo_prev > blkno || opaque->btpo_prev < so->backSeqPos)
+ so->backSeqRun = 0;
+ }
+ _bt_prefetchbuf(rel, opaque->btpo_prev);
+ }
+ }
break;
+ }
}
}
}
@@ -1587,7 +1637,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d4941e0..9087d2d 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -496,6 +496,7 @@ typedef struct BTScanPosData
Buffer buf; /* if valid, the buffer is pinned */
BlockNumber nextPage; /* page's right link when we scanned it */
+ BlockNumber prevPage; /* page's left link when we scanned it */
/*
* moreLeft and moreRight track whether we think there may be matching
@@ -576,6 +577,9 @@ typedef struct BTScanOpaqueData
int markItemIndex; /* itemIndex, or -1 if not valid */
/* keep these last in struct for efficiency */
+ int backSeqRun; /* number of back-sequential pages in a run */
+ int backSeqPos; /* blkid last prefetched in back-sequential
+ runs */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
} BTScanOpaqueData;
@@ -628,6 +632,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern void _bt_prefetchbuf(Relation rel, BlockNumber blkno);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
On Thu, Oct 18, 2012 at 7:42 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
Fun. That didn't take long.
With the attached anti-sequential scan patch, and effective_io_concurrency=8:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.00..4149039.04 rows=90257289 width=4) (actual
time=26.964..84299.789 rows=90000001 loops=1)
-> Index Only Scan Backward using pgbench_accounts_pkey on
pgbench_accounts (cost=0.00..2795179.71 rows=90257289 width=4)
(actual time=26.955..62761.774 rows=90000001 loops=1)
Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
Heap Fetches: 0
Total runtime: 87170.355 ms
I/O throughput 22MB/s (twice as fast)
I/O utilization 95% (I was expecting 100% but... hey... good enough)
With e_i_c=24, it gets to 100% utilization and 30MB/s (that's 3 times
faster). So, I'd like to know what you think, but maybe for
back-sequential scans, prefetch should be set to a multiple (ie: x24)
of e_i_c, in order to exploit read request merges.
The earlier patch had a regression for backward heap-including scans on
RAID, so I made the back-sequential optimization apply only to index-only
scans, and now I can find no regression. make check runs fine, btw.
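In this version that amounts to gating the back-sequential branch on the
index-only case, roughly (paraphrasing the hunk in the attached diff):
    /* Only treat the scan as back-sequential (and widen the prefetch window)
     * when the executor asked for index tuples, i.e. for index-only scans. */
    if (scan->xs_want_itup && blkno > 0 && opaque->btpo_prev == blkno - 1)
    {
        /* back-sequential prefetch path */
    }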
I hope I'm not talking to myself.
Attachments:
postgresql-git-bt_prefetch_backseq.diff (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 016ce22..1115afc 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -487,6 +487,19 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
}
/*
+ * _bt_prefetchbuf() -- Prefetch a buffer by block number
+ */
+void
+_bt_prefetchbuf(Relation rel, BlockNumber blkno)
+{
+ if (blkno != P_NEW && blkno != P_NONE)
+ {
+ /* Just prefetch an existing block of the relation */
+ PrefetchBuffer(rel, MAIN_FORKNUM, blkno);
+ }
+}
+
+/*
* _bt_getbuf() -- Get a buffer by block number for read or write.
*
* blkno == P_NEW means to get an unallocated index page. The page
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b055eaf..a2b9d99 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -432,6 +432,9 @@ btbeginscan(PG_FUNCTION_ARGS)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->backSeqRun = 0;
+ so->backSeqPos = 0;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e0c9523..a7a9e2e 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -28,7 +28,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir,
+ bool prefetch);
static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -961,7 +962,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
@@ -1008,7 +1009,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
@@ -1021,7 +1022,7 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
@@ -1075,9 +1076,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/*
* we must save the page's right-link while scanning it; this tells us
* where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+ * corresponding need for the left-link, since splits always go right,
+ * but we need it for back-sequential scan detection.
*/
so->currPos.nextPage = opaque->btpo_next;
+ so->currPos.prevPage = opaque->btpo_prev;
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -1180,7 +1183,7 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
* locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
*/
static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+_bt_steppage(IndexScanDesc scan, ScanDirection dir, bool prefetch)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -1243,8 +1246,17 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque))) {
+ if (prefetch && so->currPos.moreRight
+ && (opaque->btpo_next != (blkno+1)))
+ {
+ /* start prefetch on next page, but not
+ if we're reading sequentially already,
+ as it's counterproductive in those cases */
+ _bt_prefetchbuf(rel, opaque->btpo_next);
+ }
break;
+ }
}
/* nope, keep going */
blkno = opaque->btpo_next;
@@ -1288,11 +1300,50 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_IGNORE(opaque))
{
+ /* We must rely on the previously saved prevPage link! */
+ BlockNumber blkno = so->currPos.prevPage;
+
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page))) {
+ if (prefetch && so->currPos.moreLeft) {
+ /* detect back-sequential runs and increase
+ prefetch window blindly downwards 2 blocks
+ at a time. This only works in our favor
+ for index-only scans */
+ if (scan->xs_want_itup && blkno > 0 && opaque->btpo_prev == (blkno-1)) {
+ BlockNumber backPos;
+
+ if (so->backSeqRun == 0)
+ backPos = (blkno-1);
+ else
+ backPos = so->backSeqPos;
+ so->backSeqRun++;
+
+ if (backPos > 0 && (blkno - backPos) <= target_prefetch_pages) {
+ _bt_prefetchbuf(rel, backPos--);
+ /* don't start back-seq prefetch too early */
+ if (so->backSeqRun >= target_prefetch_pages
+ && backPos > 0
+ && (blkno - backPos) <= target_prefetch_pages)
+ {
+ _bt_prefetchbuf(rel, backPos--);
+ }
+ }
+
+ so->backSeqPos = backPos;
+ } else {
+ /* start prefetch on next page */
+ if (so->backSeqRun != 0) {
+ if (opaque->btpo_prev > blkno || opaque->btpo_prev < so->backSeqPos)
+ so->backSeqRun = 0;
+ }
+ _bt_prefetchbuf(rel, opaque->btpo_prev);
+ }
+ }
break;
+ }
}
}
}
@@ -1587,7 +1638,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d4941e0..71f1565 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -496,6 +496,7 @@ typedef struct BTScanPosData
Buffer buf; /* if valid, the buffer is pinned */
BlockNumber nextPage; /* page's right link when we scanned it */
+ BlockNumber prevPage; /* page's left link when we scanned it */
/*
* moreLeft and moreRight track whether we think there may be matching
@@ -576,6 +577,9 @@ typedef struct BTScanOpaqueData
int markItemIndex; /* itemIndex, or -1 if not valid */
/* keep these last in struct for efficiency */
+ unsigned int backSeqRun; /* number of back-sequential pages in a run */
+ BlockNumber backSeqPos; /* blkid last prefetched in back-sequential
+ runs */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
} BTScanOpaqueData;
@@ -628,6 +632,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern void _bt_prefetchbuf(Relation rel, BlockNumber blkno);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
From: Claudio Freire <klaussfreire@gmail.com>
I hope I'm not talking to myself.
Indeed not. I also looked into prefetching for pure index scans for b-trees (and extension to use async io).
http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl
I am not where I have a proper setup this week but will reply at greater length next week.
John
On Tue, Oct 23, 2012 at 9:44 AM, John Lumby <johnlumby@hotmail.com> wrote:
From: Claudio Freire <klaussfreire@gmail.com>
I hope I'm not talking to myself.
Indeed not. I also looked into prefetching for pure index scans for
b-trees (and extension to use async io).
http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl
Yes, I've seen that, though I thought it was only an improvement on
PrefetchBuffer. That patch would interact quite nicely with mine.
I'm now trying to prefetch heap tuples, and I got to a really nice
place where I get an extra 30% speedup even on forward scans, but the
patch is rather green now for a review.
I am not where I have a proper setup this week but will reply at greater
length next week.
Great - will go on improving the patch in the meanwhile ;-)
On Tue, Oct 23, 2012 at 10:54 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
Indeed not. I also looked into prefetching for pure index scans for
b-trees (and extension to use async io).
http://archives.postgresql.org/message-id/BLU0-SMTP31709961D846CCF4F5EB4C2A3930%40phx.gbl
Yes, I've seen that, though I thought it was only an improvement on
PrefetchBuffer. That patch would interact quite nicely with mine.
I'm now trying to prefetch heap tuples, and I got to a really nice
place where I get an extra 30% speedup even on forward scans, but the
patch is rather green now for a review.
I am not where I have a proper setup this week but will reply at greater
length next week.
Great - will go on improving the patch in the meanwhile ;-)
Ok, this is the best I could come up with, without some real test hardware.
The only improvements I see in single-disk scenarios are:
* Huge speedup of back-sequential index-only scans
* Marginal speedup on forward index-only scans (5% or less)
* No discernible difference in heap-including scans, even with heap
prefetch (see the sketch of its trigger condition right after this
list), but I'm pretty sure a real RAID setup would change this
* No change in pgbench (so I guess no regression for small transactions)
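The heap prefetch in the attached patch is deliberately conservative: it only
kicks in once the scan has shown it is really hitting the heap. The trigger
condition from the diff is roughly this:
    /* Only start prefetching heap blocks after 256 TIDs have been seen, and
     * only if at least ~94% (15/16) of them actually required a heap fetch,
     * i.e. the visibility map is rarely letting us skip the heap visit. */
    if (scan->heap_tids_seen > 256 &&
        (scan->heap_tids_seen - scan->heap_tids_seen / 16) <= scan->heap_tids_fetched)
    {
        /* prefetch heap blocks for the upcoming TIDs on this index page */
    }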
If I manage to get my hands on test hardware, I'll post results. But
we just had to bring some machines offline in our testing datacenter,
which effectively shrank my options, rather than expanding them. I
don't see that improving soon, so I'll post the patch and hope someone
else tests it.
PS: should I add it to the commit fest? Should we compare notes with
John Lumby's patch first?
On Mon, Oct 29, 2012 at 12:53 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
Yes, I've seen that, though I thought it was only an improvement on
PrefetchBuffer. That patch would interact quite nicely with mine.
I'm now trying to prefetch heap tuples, and I got to a really nice
place where I get an extra 30% speedup even on forward scans, but the
patch is rather green now for a review.
I am not where I have a proper setup this week but will reply at greater
length next week.
Great - will go on improving the patch in the meanwhile ;-)
Ok, this is the best I could come up with, without some real test hardware.
Oops - forgot to actually attach the patch.
Attachments:
postgresql-git-bt_fullfetch.diff (application/octet-stream)
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index d64df31..d3f11f5 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -300,7 +300,10 @@ index_beginscan_internal(Relation indexRelation,
PointerGetDatum(indexRelation),
Int32GetDatum(nkeys),
Int32GetDatum(norderbys)));
-
+
+ scan->heap_tids_seen = 0;
+ scan->heap_tids_fetched = 0;
+
return scan;
}
@@ -469,6 +472,11 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
}
pgstat_count_index_tuples(scan->indexRelation, 1);
+ if (scan->heap_tids_seen++ >= (~0)) {
+ /* Avoid integer overflow */
+ scan->heap_tids_seen = 1;
+ scan->heap_tids_fetched = 0;
+ }
/* Return the TID of the tuple we found. */
return &scan->xs_ctup.t_self;
@@ -508,6 +516,7 @@ index_fetch_heap(IndexScanDesc scan)
scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
scan->heapRelation,
ItemPointerGetBlockNumber(tid));
+ scan->heap_tids_fetched++;
/*
* Prune page, but only if we weren't already on this page
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 016ce22..1115afc 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -487,6 +487,19 @@ _bt_log_reuse_page(Relation rel, BlockNumber blkno, TransactionId latestRemovedX
}
/*
+ * _bt_prefetchbuf() -- Prefetch a buffer by block number
+ */
+void
+_bt_prefetchbuf(Relation rel, BlockNumber blkno)
+{
+ if (blkno != P_NEW && blkno != P_NONE)
+ {
+ /* Just prefetch an existing block of the relation */
+ PrefetchBuffer(rel, MAIN_FORKNUM, blkno);
+ }
+}
+
+/*
* _bt_getbuf() -- Get a buffer by block number for read or write.
*
* blkno == P_NEW means to get an unallocated index page. The page
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b055eaf..bb60588 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -432,6 +432,12 @@ btbeginscan(PG_FUNCTION_ARGS)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ so->backSeqRun = 0;
+ so->backSeqPos = 0;
+ so->prefetchItemIndex = 0;
+ so->lastHeapPrefetchBlkno = P_NONE;
+ so->prefetchBlockCount = 0;
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index e0c9523..21c492e 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -28,7 +28,8 @@ static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
OffsetNumber offnum);
static void _bt_saveitem(BTScanOpaque so, int itemIndex,
OffsetNumber offnum, IndexTuple itup);
-static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
+static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir,
+ bool prefetch);
static Buffer _bt_walk_left(Relation rel, Buffer buf);
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
@@ -961,7 +962,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
@@ -996,6 +997,8 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
BTScanPosItem *currItem;
+ BlockNumber prevblkno = ItemPointerGetBlockNumber(
+ &scan->xs_ctup.t_self);
/*
* Advance to next tuple on current page; or if there's no more, try to
@@ -1008,11 +1011,51 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+ if (target_prefetch_pages > 0) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreRight
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex <= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex + 1;
+ while ( (so->prefetchItemIndex <= so->currPos.lastItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex++].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) {
+ /* start prefetch on next page, but not
+ if we're reading sequentially already,
+ as it's counterproductive in those cases */
+ if (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1)) {
+ _bt_prefetchbuf(scan->heapRelation, blkno );
+ nonsequential = true;
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
}
else
{
@@ -1021,11 +1064,51 @@ _bt_next(IndexScanDesc scan, ScanDirection dir)
/* We must acquire lock before applying _bt_steppage */
Assert(BufferIsValid(so->currPos.buf));
LockBuffer(so->currPos.buf, BT_READ);
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, target_prefetch_pages > 0))
return false;
/* Drop the lock, but not pin, on the new page */
LockBuffer(so->currPos.buf, BUFFER_LOCK_UNLOCK);
}
+
+ if (target_prefetch_pages > 0) {
+ BlockNumber currblkno = ItemPointerGetBlockNumber(
+ &so->currPos.items[so->currPos.itemIndex].heapTid);
+
+ if (currblkno != prevblkno) {
+ if (so->prefetchBlockCount > 0)
+ so->prefetchBlockCount--;
+
+ /* If we have heap fetch frequency stats, and it's above ~94%,
+ * initiate heap prefetches */
+ if (so->currPos.moreLeft
+ && scan->heap_tids_seen > 256
+ && ( (scan->heap_tids_seen - scan->heap_tids_seen/16)
+ <= scan->heap_tids_fetched ) )
+ {
+ bool nonsequential = false;
+
+ if (so->prefetchItemIndex >= so->currPos.itemIndex)
+ so->prefetchItemIndex = so->currPos.itemIndex - 1;
+ while ( (so->prefetchItemIndex >= so->currPos.firstItem)
+ && (so->prefetchBlockCount < target_prefetch_pages) )
+ {
+ ItemPointer tid = &so->currPos.items[so->prefetchItemIndex--].heapTid;
+ BlockNumber blkno = ItemPointerGetBlockNumber(tid);
+ if (blkno != so->lastHeapPrefetchBlkno) {
+ /* start prefetch on next page, but not
+ if we're reading sequentially already,
+ as it's counterproductive in those cases */
+ if (nonsequential || blkno != (so->lastHeapPrefetchBlkno+1)) {
+ _bt_prefetchbuf(scan->heapRelation, blkno );
+ nonsequential = true;
+ }
+ so->lastHeapPrefetchBlkno = blkno;
+ so->prefetchBlockCount++;
+ }
+ }
+ }
+ }
+ }
}
/* OK, itemIndex says what to return */
@@ -1075,9 +1158,11 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
/*
* we must save the page's right-link while scanning it; this tells us
* where to step right to after we're done with these items. There is no
- * corresponding need for the left-link, since splits always go right.
+ * corresponding need for the left-link, since splits always go right,
+ * but we need it for back-sequential scan detection.
*/
so->currPos.nextPage = opaque->btpo_next;
+ so->currPos.prevPage = opaque->btpo_prev;
/* initialize tuple workspace to empty */
so->currPos.nextTupleOffset = 0;
@@ -1112,6 +1197,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
so->currPos.firstItem = 0;
so->currPos.lastItem = itemIndex - 1;
so->currPos.itemIndex = 0;
+ so->prefetchItemIndex = 0;
}
else
{
@@ -1143,6 +1229,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
so->currPos.firstItem = itemIndex;
so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+ so->prefetchItemIndex = MaxIndexTuplesPerPage - 1;
}
return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1180,7 +1267,7 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
* locks and pins, set so->currPos.buf to InvalidBuffer, and return FALSE.
*/
static bool
-_bt_steppage(IndexScanDesc scan, ScanDirection dir)
+_bt_steppage(IndexScanDesc scan, ScanDirection dir, bool prefetch)
{
BTScanOpaque so = (BTScanOpaque) scan->opaque;
Relation rel;
@@ -1243,8 +1330,17 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
PredicateLockPage(rel, blkno, scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreRight if we can stop */
- if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
+ if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque))) {
+ if (prefetch && so->currPos.moreRight
+ && (opaque->btpo_next != (blkno+1)))
+ {
+ /* start prefetch on next page, but not
+ if we're reading sequentially already,
+ as it's counterproductive in those cases */
+ _bt_prefetchbuf(rel, opaque->btpo_next);
+ }
break;
+ }
}
/* nope, keep going */
blkno = opaque->btpo_next;
@@ -1288,11 +1384,55 @@ _bt_steppage(IndexScanDesc scan, ScanDirection dir)
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (!P_IGNORE(opaque))
{
+ /* We must rely on the previously saved prevPage link! */
+ BlockNumber blkno = so->currPos.prevPage;
+
PredicateLockPage(rel, BufferGetBlockNumber(so->currPos.buf), scan->xs_snapshot);
/* see if there are any matches on this page */
/* note that this will clear moreLeft if we can stop */
- if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page)))
+ if (_bt_readpage(scan, dir, PageGetMaxOffsetNumber(page))) {
+ if (prefetch && so->currPos.moreLeft) {
+ /* detect back-sequential runs and increase prefetch window blindly
+ * downwards 2 blocks at a time. This only works in our favor
+ * for index-only scans, by merging read requests at the kernel,
+ * so we want to inflate target_prefetch_pages since merged
+ * back-sequential requests are about as expensive as a single one
+ */
+ if (scan->xs_want_itup && blkno > 0 && opaque->btpo_prev == (blkno-1)) {
+ BlockNumber backPos;
+ unsigned int back_prefetch_pages = target_prefetch_pages * 16;
+ if (back_prefetch_pages > 64)
+ back_prefetch_pages = 64;
+
+ if (so->backSeqRun == 0)
+ backPos = (blkno-1);
+ else
+ backPos = so->backSeqPos;
+ so->backSeqRun++;
+
+ if (backPos > 0 && (blkno - backPos) <= back_prefetch_pages) {
+ _bt_prefetchbuf(rel, backPos--);
+ /* don't start back-seq prefetch too early */
+ if (so->backSeqRun >= back_prefetch_pages
+ && backPos > 0
+ && (blkno - backPos) <= back_prefetch_pages)
+ {
+ _bt_prefetchbuf(rel, backPos--);
+ }
+ }
+
+ so->backSeqPos = backPos;
+ } else {
+ /* start prefetch on next page */
+ if (so->backSeqRun != 0) {
+ if (opaque->btpo_prev > blkno || opaque->btpo_prev < so->backSeqPos)
+ so->backSeqRun = 0;
+ }
+ _bt_prefetchbuf(rel, opaque->btpo_prev);
+ }
+ }
break;
+ }
}
}
}
@@ -1587,7 +1727,7 @@ _bt_endpoint(IndexScanDesc scan, ScanDirection dir)
* There's no actually-matching data on this page. Try to advance to
* the next page. Return false if there's no matching data at all.
*/
- if (!_bt_steppage(scan, dir))
+ if (!_bt_steppage(scan, dir, false))
return false;
}
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d4941e0..b2a4eea 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -496,6 +496,7 @@ typedef struct BTScanPosData
Buffer buf; /* if valid, the buffer is pinned */
BlockNumber nextPage; /* page's right link when we scanned it */
+ BlockNumber prevPage; /* page's left link when we scanned it */
/*
* moreLeft and moreRight track whether we think there may be matching
@@ -575,6 +576,15 @@ typedef struct BTScanOpaqueData
*/
int markItemIndex; /* itemIndex, or -1 if not valid */
+ /* prefetch logic state */
+ unsigned int backSeqRun; /* number of back-sequential pages in a run */
+ BlockNumber backSeqPos; /* blkid last prefetched in back-sequential
+ runs */
+ BlockNumber lastHeapPrefetchBlkno; /* blkid last prefetched from heap */
+ int prefetchItemIndex; /* item index within currPos last
+ fetched by heap prefetch */
+ int prefetchBlockCount; /* number of prefetched heap blocks */
+
/* keep these last in struct for efficiency */
BTScanPosData currPos; /* current position data */
BTScanPosData markPos; /* marked position, if any */
@@ -628,6 +638,7 @@ extern Buffer _bt_getroot(Relation rel, int access);
extern Buffer _bt_gettrueroot(Relation rel);
extern void _bt_checkpage(Relation rel, Buffer buf);
extern Buffer _bt_getbuf(Relation rel, BlockNumber blkno, int access);
+extern void _bt_prefetchbuf(Relation rel, BlockNumber blkno);
extern Buffer _bt_relandgetbuf(Relation rel, Buffer obuf,
BlockNumber blkno, int access);
extern void _bt_relbuf(Relation rel, Buffer buf);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index 87acc8e..3343fda 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -90,6 +90,10 @@ typedef struct IndexScanDescData
/* NB: if xs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
bool xs_recheck; /* T means scan keys must be rechecked */
+ /* heap fetch statistics for read-ahead logic */
+ unsigned int heap_tids_seen;
+ unsigned int heap_tids_fetched;
+
/* state data for traversing HOT chains in index_getnext */
bool xs_continue_hot; /* T if must keep walking HOT chain */
} IndexScanDescData;
Ok, this is the best I could come up with, without some real test hardware.
The only improvement I see in single-disk scenarios:
* Huge speedup of back-sequential index-only scans
* Marginal speedup on forward index-only scans (5% or less)
* No discernible difference in heap-including scans (even with heap
prefetch), but I'm pretty sure a real RAID setup would change this
* No change in pgbench (so I guess no regression for small transactions)
If the gain is visible mostly for backward scans and not for other access
patterns, I suggest checking the work done on backward prefetching in Linux.
http://thread.gmane.org/gmane.linux.kernel.mm/73837 for example
I don't know how others (BSD, windows, ...) handle this case.
Maybe the strategy of using our own prefetch is better; then I would like
to use it also in places where we used to resort to hacks to make Linux
understand that we will benefit from prefetching.
--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On Mon, Oct 29, 2012 at 4:17 PM, Cédric Villemain
<cedric@2ndquadrant.com> wrote:
Ok, this is the best I could come up with, without some real test hardware.
The only improvement I see in single-disk scenarios:
* Huge speedup of back-sequential index-only scans
* Marginal speedup on forward index-only scans (5% or less)
* No discernible difference in heap-including scans (even with heap
prefetch), but I'm pretty sure a real RAID setup would change this
* No change in pgbench (so I guess no regression for small transactions)
If the gain is visible mostly for the backward and not for other access
patterns I suggest to check the work done in backward-prefetching in linux.
http://thread.gmane.org/gmane.linux.kernel.mm/73837 for example
That patch seems very similar (but in kernel terms) to this patch, so
I imagine it would also do the job.
But it also looks forgotten. Bringing it back to life would mean
building the latest kernel with that patch included, replicating the
benchmarks I ran here, sans pg patch, but with patched kernel, and
reporting the (hopefully equally dramatic) performance improvements in
the kernel ML. That would take me quite some time (not used to playing
with kernels, though it wouldn't be my first time either), though it
might be worth the effort.
I don't know how others (BSD, windows, ...) handle this case.
I don't even think windows supports posix_fadvise, but if async_io is
used (as hinted by the link Lumby posted), it would probably also work
in windows.
BSD probably supports it the same way linux does.
Maybe the strategy to use our own prefetch is better, then I would like to use
it also in places where we used to hack to make linux understand that we will
benefits from prefetching.
It would at least benefit those installations without the
latest-in-the-future kernel-with-backwards-readahead.
To which places are you referring to, btw?
But it also looks forgotten. Bringing it back to life would mean
building the latest kernel with that patch included, replicating the
benchmarks I ran here, sans pg patch, but with patched kernel, and
reporting the (hopefully equally dramatic) performance improvements in
the kernel ML. That would take me quite some time (not used to playing
with kernels, though it wouldn't be my first time either), though it
might be worth the effort.
Well, informing linux hackers may help.
I don't know how others (BSD, windows, ...) handle this case.
I don't even think windows supports posix_fadvise, but if async_io is
used (as hinted by the link Lumby posted), it would probably also work
in windows.
BSD probably supports it the same way linux does.
I thought of it the other way around: how do other kernels handle
backwards prefetch?
Maybe the strategy to use our own prefetch is better, then I would like
to use it also in places where we used to hack to make linux understand
that we will benefits from prefetching.
It would at least benefit those installations without the
latest-in-the-future kernel-with-backwards-readahead.
We're speaking of PostgreSQL 9.3: running cutting-edge PostgreSQL on an old
kernel at the end of 2013... maybe it won't be so latest-in-the-future by then.
Btw, the improvements you are making look good; I'm just adding some
information about what has been achieved around us.
To which places are you referring to, btw?
Maintenance tasks.
--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation
On Mon, Oct 29, 2012 at 7:07 PM, Cédric Villemain
<cedric@2ndquadrant.com> wrote:
But it also looks forgotten. Bringing it back to life would mean
building the latest kernel with that patch included, replicating the
benchmarks I ran here, sans pg patch, but with patched kernel, and
reporting the (hopefully equally dramatic) performance improvements in
the kernel ML. That would take me quite some time (not used to playing
with kernels, though it wouldn't be my first time either), though it
might be worth the effort.
Well, informing linux hackers may help.
I'm a bit hesitant to subscribe to yet another mailing list, but I
happen to agree.
I don't know how others (BSD, windows, ...) handle this case.
I don't even think windows supports posix_fadvise, but if async_io is
used (as hinted by the link Lumby posted), it would probably also work
in windows.
BSD probably supports it the same way linux does.
I though of the opposite way: how do other kernels handle the backwards
prefetch.
From what I saw (while researching that statement above), BSD's
read-ahead and fadvise implementations are way behind Linux's.
Functional, though. I haven't been able to find the code responsible
for readahead in FreeBSD yet, to confirm whether they have anything
supporting back-sequential patterns.
Maybe the strategy to use our own prefetch is better, then I would like
to use it also in places where we used to hack to make linux understand
that we will benefits from prefetching.
It would at least benefit those installations without the
latest-in-the-future kernel-with-backwards-readahead.
We're speaking of PostgreSQL 9.3, running cutting edge PostgreSQL and old
kernel in end 2013... Maybe it won't be so latest-in-the-future at this time.
Good point.
Claudio wrote :
Oops - forgot to effectively attach the patch.
I've read through your patch and the earlier posts by you and Cédric.
This is very interesting. You chose to prefetch index btree (key-ptr) pages
whereas I chose to prefetch the data pages pointed to by the key-ptr pages.
Never mind why -- I think they should work very well together - as both have
been demonstrated to produce improvements. I will see if I can combine them,
git permitting (as of course their changed file lists overlap).
I was surprised by this design decision:
/* start prefetch on next page, but not if we're reading sequentially already, as it's counterproductive in those cases */
Is it really? Are you assuming that it's redundant with posix_fadvise for this case?
I think possibly when async_io is also in use by the postgresql prefetcher,
this decision could change.
Cédric wrote:
If the gain is visible mostly for the backward and not for other access
building the latest kernel with that patch included, replicating the
I found improvement from forward scans.
Actually I did not even try backward but only because I did not have time.
It should help both.
I don't even think windows supports posix_fadvise, but if async_io is
used (as hinted by the link Lumby posted), it would probably also work
in windows.
Windows has async io, and I think it would not be hard to extend my implementation
to Windows (although I don't plan to do it myself). Actually, about 95% of the code I wrote
to implement async-io in postgresql concerns not the async io, which is trivial,
but the buffer management. With async io, PrefetchBuffer must allocate and pin a
buffer (not too hard), but now every other part of buf mgr must also know about the
possibility that a buffer may be in async_io_in_progress state and be prepared to
determine its possible completion (quite hard) - and also, if and when the prefetch requester
comes again with ReadBuffer, buf mgr has to remember that this buffer was pinned by
this backend during the previous prefetch and must not be re-pinned a second time
(very hard without increasing the size of the shared descriptor, which was important
since there could be a very large number of these).
It ended up with a major change to bufmgr.c plus one new file for handling
buffer management aspects of starting, checking and terminating async io.
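For readers less familiar with bufmgr.c, the bookkeeping described above amounts
to roughly the following - a simplified sketch only, with an invented flag, field
names and helper functions that do not correspond to the real buffer descriptor:

    #define BM_AIO_IN_PROGRESS  0x0100   /* invented flag: async read outstanding */

    typedef struct SketchBufferDesc
    {
        unsigned int flags;              /* would include BM_AIO_IN_PROGRESS       */
        int          aio_pin_owner;      /* backend that pinned the buffer when it
                                          * issued the prefetch, or -1; lets a
                                          * later ReadBuffer from that backend
                                          * avoid taking a second pin             */
    } SketchBufferDesc;

    /* Sketch of the ReadBuffer-side check being described (helpers hypothetical) */
    static void
    claim_buffer(SketchBufferDesc *desc, int my_backend_id)
    {
        if (desc->flags & BM_AIO_IN_PROGRESS)
            wait_for_aio_completion(desc);   /* stands in for the completion check */

        if (desc->aio_pin_owner == my_backend_id)
            desc->aio_pin_owner = -1;        /* reuse the pin taken at prefetch    */
        else
            pin_buffer(desc);                /* stands in for bufmgr's pin path    */
    }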
However I think in some environments the async-io has significant benefits over
posix-fadvise, especially (of course!) where access is very non-sequential,
but even also for sequential if there are many concurrent conflicting sets of sequential
command streams from different backends
(always assuming the RAID can manage them concurrently).
I've attached a snapshot patch of just the non-bitmap-index-scan changes I've made.
You can't compile it as is, because I had to change the interface to PrefetchBuffer
and add a new DiscardBuffer, which I did not include in this snapshot to avoid confusion.
John
Attachment: postgresql-9.3.121031.indexdatapagepfch.121101-120004.patch (text/x-diff)
--- src/backend/executor/nodeIndexscan.c.orig 2012-10-31 15:24:12.083163547 -0400
+++ src/backend/executor/nodeIndexscan.c 2012-11-01 11:45:16.244967963 -0400
@@ -35,8 +35,13 @@
#include "utils/rel.h"
+
static TupleTableSlot *IndexNext(IndexScanState *node);
+#ifdef USE_PREFETCH
+extern unsigned int prefetch_dbOid; /* database oid of relations on which prefetching to be done - 0 means all */
+extern unsigned int prefetch_index_scans; /* boolean whether to prefetch bitmap heap scans */
+#endif /* USE_PREFETCH */
/* ----------------------------------------------------------------
* IndexNext
@@ -418,7 +423,17 @@ ExecEndIndexScan(IndexScanState *node)
* close the index relation (no-op if we didn't open it)
*/
if (indexScanDesc)
+ {
index_endscan(indexScanDesc);
+
+#ifdef USE_PREFETCH
+ if ( indexScanDesc->do_prefetch
+ && ( (struct BlockIdData*)0 != indexScanDesc->pfch_list )
+ ) {
+ pfree(indexScanDesc->pfch_list);
+ }
+#endif /* USE_PREFETCH */
+ }
if (indexRelationDesc)
index_close(indexRelationDesc, NoLock);
@@ -609,6 +624,25 @@ ExecInitIndexScan(IndexScan *node, EStat
indexstate->iss_NumScanKeys,
indexstate->iss_NumOrderByKeys);
+#ifdef USE_PREFETCH
+ /* initialize prefetching */
+ if ( prefetch_index_scans
+ && (!RelationUsesLocalBuffers(indexstate->iss_ScanDesc->heapRelation)) /* I think this must always be true for an indexed heap ? */
+ && ( ( (prefetch_dbOid > 0)
+ && (prefetch_dbOid == indexstate->iss_ScanDesc->heapRelation->rd_node.dbNode)
+ )
+ || (prefetch_dbOid == 0)
+ )
+ ) {
+ indexstate->iss_ScanDesc->pfch_list = palloc( target_prefetch_pages * sizeof(struct BlockIdData) );
+ if ( (struct BlockIdData*)0 != indexstate->iss_ScanDesc->pfch_list ) {
+ indexstate->iss_ScanDesc->pfch_used = 0;
+ indexstate->iss_ScanDesc->pfch_next = target_prefetch_pages; /* ensure first entry is at index 0 */
+ indexstate->iss_ScanDesc->do_prefetch = 1;
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* If no run-time keys to calculate, go ahead and pass the scankeys to the
* index AM.
--- src/backend/executor/instrument.c.orig 2012-10-31 15:24:12.083163547 -0400
+++ src/backend/executor/instrument.c 2012-11-01 11:36:52.855258120 -0400
@@ -41,6 +41,12 @@ InstrAlloc(int n, int instrument_options
{
instr[i].need_bufusage = need_buffers;
instr[i].need_timer = need_timer;
+ instr[i].bufusage_start.aio_read_noneed = 0;
+ instr[i].bufusage_start.aio_read_noblok = 0;
+ instr[i].bufusage_start.aio_read_failed = 0;
+ instr[i].bufusage_start.aio_read_wasted = 0;
+ instr[i].bufusage_start.aio_read_waited = 0;
+ instr[i].bufusage_start.aio_read_ontime = 0;
}
}
@@ -145,6 +151,14 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+
+ dst->aio_read_noneed += add->aio_read_noneed - sub->aio_read_noneed;
+ dst->aio_read_noblok += add->aio_read_noblok - sub->aio_read_noblok;
+ dst->aio_read_failed += add->aio_read_failed - sub->aio_read_failed;
+ dst->aio_read_wasted += add->aio_read_wasted - sub->aio_read_wasted;
+ dst->aio_read_waited += add->aio_read_waited - sub->aio_read_waited;
+ dst->aio_read_ontime += add->aio_read_ontime - sub->aio_read_ontime;
+
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
--- src/backend/access/nbtree/nbtree.c.orig 2012-11-01 08:58:22.972791424 -0400
+++ src/backend/access/nbtree/nbtree.c 2012-11-01 11:36:52.827258026 -0400
@@ -30,6 +30,8 @@
#include "tcop/tcopprot.h"
#include "utils/memutils.h"
+Datum
+btpeeknexttuple(IndexScanDesc scan);
/* Working state for btbuild and its callback */
typedef struct
@@ -338,6 +340,66 @@ btgettuple(PG_FUNCTION_ARGS)
PG_RETURN_BOOL(res);
}
+/*
+ * btpeeknexttuple() -- peek at the next tuple different from any blocknum in pfch_list
+ * without reading a new index page
+ * and without causing any side-effects such as altering values in control blocks
+ * if found, store blocknum in next element of pfch_list
+ */
+Datum
+btpeeknexttuple(IndexScanDesc scan)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ bool res = false;
+ int itemIndex; /* current index in items[] */
+
+ /*
+ * If we've already initialized this scan, we can just advance it in
+ * the appropriate direction. If we haven't done so yet, bail out
+ */
+ if ( BTScanPosIsValid(so->currPos) ) {
+
+ itemIndex = so->currPos.itemIndex+1; /* next item */
+
+ /* This loop handles advancing till we find different data block or end of index page */
+ while (itemIndex <= so->currPos.lastItem) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdEquals(((scan->pfch_list)+pfchx) , &(so->currPos.items[itemIndex].heapTid.ip_blkid))) {
+ goto block_match;
+ }
+ }
+
+ /* if we reach here, no block in list matched this item */
+ res = true;
+ /* set item in prefetch list
+ ** prefer unused entry if there is one, else overwrite
+ */
+ if (scan->pfch_used < target_prefetch_pages) {
+ scan->pfch_next = scan->pfch_used;
+ } else {
+ scan->pfch_next++;
+ if (scan->pfch_next >= target_prefetch_pages) {
+ scan->pfch_next = 0;
+ }
+ }
+
+ BlockIdCopy((scan->pfch_list + scan->pfch_next) , &(so->currPos.items[itemIndex].heapTid.ip_blkid));
+ /* elog(LOG,"btpeeknexttuple added blocknum %u" ,BlockIdGetBlockNumber(&(so->currPos.items[itemIndex].heapTid.ip_blkid))); */
+ if (scan->pfch_used <= scan->pfch_next) {
+ scan->pfch_used = (scan->pfch_next + 1);
+ }
+
+ goto peek_complete;
+
+ block_match: itemIndex++;
+ }
+ }
+
+ peek_complete:
+ PG_RETURN_BOOL(res);
+}
+
/*
* btgetbitmap() -- gets all matching tuples, and adds them to a bitmap
*/
--- src/backend/access/heap/syncscan.c.orig 2012-10-31 15:24:12.047163492 -0400
+++ src/backend/access/heap/syncscan.c 2012-11-01 11:36:52.799257932 -0400
@@ -90,6 +90,7 @@ typedef struct ss_scan_location_t
{
RelFileNode relfilenode; /* identity of a relation */
BlockNumber location; /* last-reported location in the relation */
+ BlockNumber prefetchHWM; /* high-water-mark of prefetched Blocknum */
} ss_scan_location_t;
typedef struct ss_lru_item_t
@@ -113,7 +114,7 @@ static ss_scan_locations_t *scan_locatio
/* prototypes for internal functions */
static BlockNumber ss_search(RelFileNode relfilenode,
- BlockNumber location, bool set);
+ BlockNumber location, bool set , BlockNumber *prefetchHWMp);
/*
@@ -160,6 +161,7 @@ SyncScanShmemInit(void)
item->location.relfilenode.dbNode = InvalidOid;
item->location.relfilenode.relNode = InvalidOid;
item->location.location = InvalidBlockNumber;
+ item->location.prefetchHWM = InvalidBlockNumber;
item->prev = (i > 0) ?
(&scan_locations->items[i - 1]) : NULL;
@@ -185,7 +187,7 @@ SyncScanShmemInit(void)
* data structure.
*/
static BlockNumber
-ss_search(RelFileNode relfilenode, BlockNumber location, bool set)
+ss_search(RelFileNode relfilenode, BlockNumber location, bool set , BlockNumber *prefetchHWMp)
{
ss_lru_item_t *item;
@@ -206,6 +208,22 @@ ss_search(RelFileNode relfilenode, Block
{
item->location.relfilenode = relfilenode;
item->location.location = location;
+ /* if prefetch information requested,
+ ** then reconcile and either update or report back the new HWM.
+ */
+ if (prefetchHWMp)
+ {
+ if ( (item->location.prefetchHWM == InvalidBlockNumber)
+ || (item->location.prefetchHWM < *prefetchHWMp)
+ )
+ {
+ item->location.prefetchHWM = *prefetchHWMp;
+ }
+ else
+ {
+ *prefetchHWMp = item->location.prefetchHWM;
+ }
+ }
}
else if (set)
item->location.location = location;
@@ -252,7 +270,7 @@ ss_get_location(Relation rel, BlockNumbe
BlockNumber startloc;
LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
- startloc = ss_search(rel->rd_node, 0, false);
+ startloc = ss_search(rel->rd_node, 0, false , 0);
LWLockRelease(SyncScanLock);
/*
@@ -282,7 +300,7 @@ ss_get_location(Relation rel, BlockNumbe
* same relfilenode.
*/
void
-ss_report_location(Relation rel, BlockNumber location)
+ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp)
{
#ifdef TRACE_SYNCSCAN
if (trace_syncscan)
@@ -306,7 +324,7 @@ ss_report_location(Relation rel, BlockNu
{
if (LWLockConditionalAcquire(SyncScanLock, LW_EXCLUSIVE))
{
- (void) ss_search(rel->rd_node, location, true);
+ (void) ss_search(rel->rd_node, location, true , prefetchHWMp);
LWLockRelease(SyncScanLock);
}
#ifdef TRACE_SYNCSCAN
--- src/backend/access/index/indexam.c.orig 2012-11-01 08:58:22.928791446 -0400
+++ src/backend/access/index/indexam.c 2012-11-01 11:41:33.380206093 -0400
@@ -76,6 +76,45 @@
#include "utils/tqual.h"
+#ifdef USE_PREFETCH
+bool BlocknotinBuffer(Buffer buffer, Relation relation, BlockNumber blockNum);
+BlockNumber BlocknumOfBuffer(Buffer buffer);
+void index_evict_block(IndexScanDesc scan , BlockNumber blocknumber);
+
+/* if specified block number is present in the prefetch array, then evict it */
+void index_evict_block(IndexScanDesc scan , BlockNumber blocknumber)
+{
+ unsigned short int pfchx , pfchy , pfchz; /* indexes in BlockIdData array */
+
+ if ( scan->do_prefetch
+ && ((struct BlockIdData*)0 != scan->pfch_list)
+ /* no need to check for scan->pfch_next < target_prefetch_pages
+ ** since we will do nothing if scan->pfch_used == 0
+ */
+ ) {
+ /* search the prefetch list to find if the block is a member */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ if (BlockIdGetBlockNumber((scan->pfch_list)+pfchx) == blocknumber) {
+ /* shuffle all following the evictee to the left
+ ** and update next pointer if its element moves
+ */
+ pfchy = (scan->pfch_used - 1); /* current rightmost */
+ scan->pfch_used = pfchy;
+
+ while (pfchy > pfchx) {
+ pfchz = pfchx + 1;
+ BlockIdCopy((scan->pfch_list)+pfchx, (scan->pfch_list)+pfchz);
+ if (scan->pfch_next == pfchz) {
+ scan->pfch_next = pfchx;
+ }
+ pfchx++;
+ }
+ }
+ }
+ }
+}
+#endif /* USE_PREFETCH */
+
/* ----------------------------------------------------------------
* macros used in index_ routines
*
@@ -243,6 +282,10 @@ index_beginscan(Relation heapRelation,
*/
scan->heapRelation = heapRelation;
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_list = (struct BlockIdData*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -267,6 +310,10 @@ index_beginscan_bitmap(Relation indexRel
* up by RelationGetIndexScan.
*/
scan->xs_snapshot = snapshot;
+#ifdef USE_PREFETCH
+ scan->do_prefetch = 0; /* no prefetching by default */
+ scan->pfch_list = (struct BlockIdData*)0;
+#endif /* USE_PREFETCH */
return scan;
}
@@ -332,6 +379,12 @@ index_rescan(IndexScanDesc scan,
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf));
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -363,10 +416,28 @@ index_endscan(IndexScanDesc scan)
/* Release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf));
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
+#ifdef USE_PREFETCH
+ /* discard prefetched but unread buffers */
+ if ( scan->do_prefetch
+ && ((struct BlockIdData*)0 != scan->pfch_list)
+ ) {
+ unsigned short int pfchx; /* index in BlockIdData array */
+ for (pfchx = 0; pfchx < scan->pfch_used; pfchx++) {
+ DiscardBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber((scan->pfch_list)+pfchx));
+ }
+ }
+#endif /* USE_PREFETCH */
+
/* End the AM's scan */
FunctionCall1(procedure, PointerGetDatum(scan));
@@ -462,6 +533,12 @@ index_getnext_tid(IndexScanDesc scan, Sc
/* ... but first, release any held pin on a heap page */
if (BufferIsValid(scan->xs_cbuf))
{
+#ifdef USE_PREFETCH
+ /* if specified block number is present in the prefetch array, then evict it */
+ if (scan->do_prefetch) {
+ index_evict_block(scan , BlocknumOfBuffer(scan->xs_cbuf));
+ }
+#endif /* USE_PREFETCH */
ReleaseBuffer(scan->xs_cbuf);
scan->xs_cbuf = InvalidBuffer;
}
@@ -492,6 +569,11 @@ index_getnext_tid(IndexScanDesc scan, Sc
* enough information to do it efficiently in the general case.
* ----------------
*/
+#ifdef USE_PREFETCH
+extern Datum btgettuple(PG_FUNCTION_ARGS);
+extern Datum btpeeknexttuple(IndexScanDesc scan);
+#endif /* USE_PREFETCH */
+
HeapTuple
index_fetch_heap(IndexScanDesc scan)
{
@@ -505,10 +587,55 @@ index_fetch_heap(IndexScanDesc scan)
/* Switch to correct buffer if we don't have it already */
Buffer prev_buf = scan->xs_cbuf;
+#ifdef USE_PREFETCH
+ /* if the old block is different from new block, then evict old
+ ** block from prefetched array. It is arguable we should leave it
+ ** in the array because it's likely to remain in the buffer pool
+ ** for a while, but in that case , if we encounter the block
+ ** again, prefetching it again does no harm.
+ ** (although unfortunately , if it's not pinned, prefetching it will
+ ** not pin it since prefetch is a noop for a buffer in the buffer pool)
+ */
+ if ( scan->do_prefetch
+ && ( BufferIsValid(prev_buf) )
+ && (BlocknotinBuffer(prev_buf,scan->heapRelation,ItemPointerGetBlockNumber(tid)))
+ && (scan->pfch_next < target_prefetch_pages) /* ensure there is an entry */
+ ) {
+ index_evict_block(scan , BlocknumOfBuffer(prev_buf));
+ }
+
+#endif /* USE_PREFETCH */
scan->xs_cbuf = ReleaseAndReadBuffer(scan->xs_cbuf,
scan->heapRelation,
ItemPointerGetBlockNumber(tid));
+#ifdef USE_PREFETCH
+ /* try prefetching next data block
+ ** (next meaning one containing TIDs from matching keys
+ ** in same index page and different from any block
+ ** we previously prefetched and listed in prefetched array)
+ */
+ {
+ FmgrInfo *procedure;
+ bool found;
+
+ if (scan->do_prefetch) {
+ /* GET_SCAN_PROCEDURE(ampeeknexttuple); is correct but requires adding function to catalog which I did not do so instead */
+ procedure = &scan->indexRelation->rd_aminfo->ampeeknexttuple;
+ /* elog(LOG,"index_fetch_heap procedure= %p procedure->fn_addr= %p" ,procedure , (procedure ? procedure->fn_addr : 0)); */
+ /* if (procedure && procedure->fn_addr) { ** does the index access method support peektuple? - procedure->fn_addr is null since not in catalog so instead */
+ if (procedure) { /* does the index access method support peektuple? */
+ /* note we trust InitIndexScan verified this scan is forwards only and so set that */
+ /* found = DatumGetBool(FunctionCall1(procedure, PointerGetDatum(scan))); cant use fmgr to call it because not in catalog so instead */
+ found = DatumGetBool(btpeeknexttuple(scan));
+ if (found) {
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, BlockIdGetBlockNumber(scan->pfch_list + scan->pfch_next) , 0);
+ }
+ }
+ }
+ }
+#endif /* USE_PREFETCH */
+
/*
* Prune page, but only if we weren't already on this page
*/
--- src/include/executor/instrument.h.orig 2012-10-31 15:24:12.331163938 -0400
+++ src/include/executor/instrument.h 2012-11-01 11:36:52.915258321 -0400
@@ -28,6 +28,14 @@ typedef struct BufferUsage
long local_blks_written; /* # of local disk blocks written */
long temp_blks_read; /* # of temp blocks read */
long temp_blks_written; /* # of temp blocks written */
+
+ long aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ long aio_read_noblok; /* # of prefetches for which no available BufferAiocb */
+ long aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ long aio_read_wasted; /* # of aio reads for which disk block not used */
+ long aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ long aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
+
instr_time blk_read_time; /* time spent reading */
instr_time blk_write_time; /* time spent writing */
} BufferUsage;
--- src/include/catalog/pg_am.h.orig 2012-10-31 15:24:12.323163928 -0400
+++ src/include/catalog/pg_am.h 2012-11-01 11:48:17.841593881 -0400
@@ -67,6 +67,7 @@ CATALOG(pg_am,2601)
regproc amcanreturn; /* can indexscan return IndexTuples? */
regproc amcostestimate; /* estimate cost of an indexscan */
regproc amoptions; /* parse AM-specific parameters */
+ regproc ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} FormData_pg_am;
/* ----------------
@@ -117,19 +118,19 @@ typedef FormData_pg_am *Form_pg_am;
* ----------------
*/
-DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions ));
+DATA(insert OID = 403 ( btree 5 2 t f t t t t t t f t t 0 btinsert btbeginscan btgettuple btgetbitmap btrescan btendscan btmarkpos btrestrpos btbuild btbuildempty btbulkdelete btvacuumcleanup btcanreturn btcostestimate btoptions btpeeknexttuple ));
DESCR("b-tree index access method");
#define BTREE_AM_OID 403
-DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions ));
+DATA(insert OID = 405 ( hash 1 1 f f t f f f f f f f f 23 hashinsert hashbeginscan hashgettuple hashgetbitmap hashrescan hashendscan hashmarkpos hashrestrpos hashbuild hashbuildempty hashbulkdelete hashvacuumcleanup - hashcostestimate hashoptions - ));
DESCR("hash index access method");
#define HASH_AM_OID 405
-DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions ));
+DATA(insert OID = 783 ( gist 0 8 f t f f t t f t t t f 0 gistinsert gistbeginscan gistgettuple gistgetbitmap gistrescan gistendscan gistmarkpos gistrestrpos gistbuild gistbuildempty gistbulkdelete gistvacuumcleanup - gistcostestimate gistoptions - ));
DESCR("GiST index access method");
#define GIST_AM_OID 783
-DATA(insert OID = 2742 ( gin 0 5 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions ));
+DATA(insert OID = 2742 ( gin 0 5 f f f f t t f f t f f 0 gininsert ginbeginscan - gingetbitmap ginrescan ginendscan ginmarkpos ginrestrpos ginbuild ginbuildempty ginbulkdelete ginvacuumcleanup - gincostestimate ginoptions - ));
DESCR("GIN index access method");
#define GIN_AM_OID 2742
-DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions ));
+DATA(insert OID = 4000 ( spgist 0 5 f f f f f t f t f f f 0 spginsert spgbeginscan spggettuple spggetbitmap spgrescan spgendscan spgmarkpos spgrestrpos spgbuild spgbuildempty spgbulkdelete spgvacuumcleanup spgcanreturn spgcostestimate spgoptions - ));
DESCR("SP-GiST index access method");
#define SPGIST_AM_OID 4000
--- src/include/utils/rel.h.orig 2012-10-31 15:24:12.351163971 -0400
+++ src/include/utils/rel.h 2012-11-01 11:48:17.853593923 -0400
@@ -67,6 +67,7 @@ typedef struct RelationAmInfo
FmgrInfo amcanreturn;
FmgrInfo amcostestimate;
FmgrInfo amoptions;
+ FmgrInfo ampeeknexttuple; /* peek at the next tuple different from any blocknum in pfch_list without reading a new index page */
} RelationAmInfo;
--- src/include/access/heapam.h.orig 2012-10-31 15:24:12.319163922 -0400
+++ src/include/access/heapam.h 2012-11-01 11:59:40.307980735 -0400
@@ -160,7 +160,7 @@ extern void heap_page_prune_execute(Buff
extern void heap_get_root_tuples(Page page, OffsetNumber *root_offsets);
/* in heap/syncscan.c */
-extern void ss_report_location(Relation rel, BlockNumber location);
+extern void ss_report_location(Relation rel, BlockNumber location , BlockNumber *prefetchHWMp);
extern BlockNumber ss_get_location(Relation rel, BlockNumber relnblocks);
extern void SyncScanShmemInit(void);
extern Size SyncScanShmemSize(void);
--- src/include/access/relscan.h.orig 2012-11-01 08:58:22.980791419 -0400
+++ src/include/access/relscan.h 2012-11-01 11:36:52.895258256 -0400
@@ -43,6 +43,10 @@ typedef struct HeapScanDescData
bool rs_inited; /* false = scan not init'd yet */
HeapTupleData rs_ctup; /* current tuple in scan, if any */
BlockNumber rs_cblock; /* current block # in scan, if any */
+#ifdef USE_PREFETCH
+ int rs_prefetch_target; /* target distance (numblocks) for prefetch to reach beyond main scan */
+ BlockNumber rs_pfchblock; /* next block # to be prefetched in scan, if any */
+#endif /* USE_PREFETCH */
Buffer rs_cbuf; /* current buffer in scan, if any */
/* NB: if rs_cbuf is not InvalidBuffer, we hold a pin on that buffer */
ItemPointerData rs_mctid; /* marked scan position, if any */
@@ -74,8 +78,14 @@ typedef struct IndexScanDescData
/* signaling to index AM about killing index tuples */
bool kill_prior_tuple; /* last-returned tuple is dead */
bool ignore_killed_tuples; /* do not return killed entries */
- bool xactStartedInRecovery; /* prevents killing/seeing killed
- * tuples */
+ bool xactStartedInRecovery; /* prevents killing/seeing killed tuples */
+
+#ifdef USE_PREFETCH
+ struct BlockIdData* pfch_list; /* array of BlockIds which we will/have prefetched */
+ unsigned short int pfch_used; /* number of used elements in BlockIdData array */
+ unsigned short int pfch_next; /* next element for prefetch in BlockIdData array */
+ int do_prefetch; /* should I prefetch ? */
+#endif /* USE_PREFETCH */
/* index access method's private state */
void *opaque; /* access-method-specific info */
--- contrib/pg_stat_statements/pg_stat_statements.c.orig 2012-10-31 15:24:11.943163326 -0400
+++ contrib/pg_stat_statements/pg_stat_statements.c 2012-11-01 11:36:52.795257920 -0400
@@ -113,6 +113,14 @@ typedef struct Counters
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+
+ int64 aio_read_noneed; /* # of prefetches for which no need for prefetch as block already in buffer pool */
+ int64 aio_read_noblok; /* # of prefetches for which no available BufferAiocb */
+ int64 aio_read_failed; /* # of aio reads for which aio itself failed or the read failed with an errno */
+ int64 aio_read_wasted; /* # of aio reads for which disk block not used */
+ int64 aio_read_waited; /* # of aio reads for which disk block used but had to wait for it */
+ int64 aio_read_ontime; /* # of aio reads for which disk block used and ready on time when requested */
+
double blk_read_time; /* time spent reading, in msec */
double blk_write_time; /* time spent writing, in msec */
double usage; /* usage factor */
@@ -861,7 +869,21 @@ pgss_ProcessUtility(Node *parsetree, con
bufusage.temp_blks_read =
pgBufferUsage.temp_blks_read - bufusage_start.temp_blks_read;
bufusage.temp_blks_written =
- pgBufferUsage.temp_blks_written - bufusage_start.temp_blks_written;
+ pgBufferUsage.temp_blks_written - bufusage.temp_blks_written;
+
+ bufusage.aio_read_noneed =
+ pgBufferUsage.aio_read_noneed - bufusage.aio_read_noneed;
+ bufusage.aio_read_noblok =
+ pgBufferUsage.aio_read_noblok - bufusage.aio_read_noblok;
+ bufusage.aio_read_failed =
+ pgBufferUsage.aio_read_failed - bufusage.aio_read_failed;
+ bufusage.aio_read_wasted =
+ pgBufferUsage.aio_read_wasted - bufusage.aio_read_wasted;
+ bufusage.aio_read_waited =
+ pgBufferUsage.aio_read_waited - bufusage.aio_read_waited;
+ bufusage.aio_read_ontime =
+ pgBufferUsage.aio_read_ontime - bufusage.aio_read_ontime;
+
bufusage.blk_read_time = pgBufferUsage.blk_read_time;
INSTR_TIME_SUBTRACT(bufusage.blk_read_time, bufusage_start.blk_read_time);
bufusage.blk_write_time = pgBufferUsage.blk_write_time;
@@ -876,6 +898,7 @@ pgss_ProcessUtility(Node *parsetree, con
rows,
&bufusage,
NULL);
+
}
else
{
@@ -1037,6 +1060,14 @@ pgss_store(const char *query, uint32 que
e->counters.local_blks_written += bufusage->local_blks_written;
e->counters.temp_blks_read += bufusage->temp_blks_read;
e->counters.temp_blks_written += bufusage->temp_blks_written;
+
+ e->counters.aio_read_noneed += bufusage->aio_read_noneed;
+ e->counters.aio_read_noblok += bufusage->aio_read_noblok;
+ e->counters.aio_read_failed += bufusage->aio_read_failed;
+ e->counters.aio_read_wasted += bufusage->aio_read_wasted;
+ e->counters.aio_read_waited += bufusage->aio_read_waited;
+ e->counters.aio_read_ontime += bufusage->aio_read_ontime;
+
e->counters.blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_read_time);
e->counters.blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->blk_write_time);
e->counters.usage += USAGE_EXEC(total_time);
@@ -1066,7 +1097,7 @@ pg_stat_statements_reset(PG_FUNCTION_ARG
}
#define PG_STAT_STATEMENTS_COLS_V1_0 14
-#define PG_STAT_STATEMENTS_COLS 18
+#define PG_STAT_STATEMENTS_COLS 24
/*
* Retrieve statement statistics.
@@ -1177,6 +1208,14 @@ pg_stat_statements(PG_FUNCTION_ARGS)
values[i++] = Int64GetDatumFast(tmp.local_blks_written);
values[i++] = Int64GetDatumFast(tmp.temp_blks_read);
values[i++] = Int64GetDatumFast(tmp.temp_blks_written);
+
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noneed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_noblok);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_failed);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_wasted);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_waited);
+ values[i++] = Int64GetDatumFast(tmp.aio_read_ontime);
+
if (sql_supports_v1_1_counters)
{
values[i++] = Float8GetDatumFast(tmp.blk_read_time);
--- contrib/pg_stat_statements/pg_stat_statements--1.1.sql.orig 2012-10-31 15:24:11.943163326 -0400
+++ contrib/pg_stat_statements/pg_stat_statements--1.1.sql 2012-11-01 11:36:52.735257720 -0400
@@ -26,6 +26,12 @@ CREATE FUNCTION pg_stat_statements(
OUT local_blks_written int8,
OUT temp_blks_read int8,
OUT temp_blks_written int8,
+ OUT aio_read_noneed int8,
+ OUT aio_read_noblok int8,
+ OUT aio_read_failed int8,
+ OUT aio_read_wasted int8,
+ OUT aio_read_waited int8,
+ OUT aio_read_ontime int8,
OUT blk_read_time float8,
OUT blk_write_time float8
)
--- contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql.orig 2012-10-31 15:24:11.943163326 -0400
+++ contrib/pg_stat_statements/pg_stat_statements--1.0--1.1.sql 2012-11-01 11:36:52.727257692 -0400
@@ -29,6 +29,12 @@ CREATE FUNCTION pg_stat_statements(
OUT local_blks_written int8,
OUT temp_blks_read int8,
OUT temp_blks_written int8,
+ OUT aio_read_noneed int8,
+ OUT aio_read_noblok int8,
+ OUT aio_read_failed int8,
+ OUT aio_read_wasted int8,
+ OUT aio_read_waited int8,
+ OUT aio_read_ontime int8,
OUT blk_read_time float8,
OUT blk_write_time float8
)
On Tue, Oct 30, 2012 at 1:50 AM, Claudio Freire <klaussfreire@gmail.com> wrote:
On Mon, Oct 29, 2012 at 7:07 PM, Cédric Villemain
<cedric@2ndquadrant.com> wrote:
Well, informing linux hackers may help.
I agree. I'm a bit hesitant to subscribe to yet another mailing list
FYI you can send messages to linux-kernel without subscribing (there's
no moderation either).
Subscribing to linux-kernel is like drinking from a firehose :)
Regards,
Marti
On Thursday, November 01, 2012 05:53:39 PM Marti Raudsepp wrote:
On Tue, Oct 30, 2012 at 1:50 AM, Claudio Freire <klaussfreire@gmail.com>
wrote:
On Mon, Oct 29, 2012 at 7:07 PM, Cédric Villemain
<cedric@2ndquadrant.com> wrote:
Well, informing linux hackers may help.
I agree. I'm a bit hesitant to subscribe to yet another mailing list
FYI you can send messages to linux-kernel without subscribing (there's
no moderation either).
Subscribing to linux-kernel is like drinking from a firehose :)
linux-fsdevel is more reasonable though...
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Nov 1, 2012 at 1:37 PM, John Lumby <johnlumby@hotmail.com> wrote:
Claudio wrote :
Oops - forgot to effectively attach the patch.
I've read through your patch and the earlier posts by you and Cédric.
This is very interesting. You chose to prefetch index btree (key-ptr) pages
whereas I chose to prefetch the data pages pointed to by the key-ptr pages.
Never mind why -- I think they should work very well together - as both have
been demonstrated to produce improvements. I will see if I can combine them,
git permitting (as of course their changed file lists overlap).
Check the latest patch, it contains heap page prefetching too.
I was surprised by this design decision :
/* start prefetch on next page, but not if we're reading sequentially already, as it's counterproductive in those cases */
Is it really? Are you assuming that it's redundant with posix_fadvise in this case?
I think possibly when async_io is also in use by the postgresql prefetcher,
this decision could change.
async_io indeed may make that logic obsolete, but it's not redundancy with
posix_fadvise that's the trouble there; it's the fact that the kernel
stops doing read-ahead when a call to posix_fadvise comes. I noticed
the performance hit, and checked the kernel's code. It effectively
changes the prediction mode from sequential to fadvise, negating the
kernel's (assumed) prefetch logic.
However I think in some environments the async-io has significant benefits over
posix-fadvise, especially (of course!) where access is very non-sequential,
but even also for sequential if there are many concurrent conflicting sets of sequential
command streams from different backends
(always assuming the RAID can manage them concurrently).
I've mused about the possibility to batch async_io requests, and use
the scatter/gather API insead of sending tons of requests to the
kernel. I think doing so would enable a zero-copy path that could very
possibly imply big speed improvements when memory bandwidth is the
bottleneck.
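A rough sketch of what such batching could look like through POSIX lio_listio
(illustrative only; buffer handling and error paths are omitted, and as noted
further down the thread, glibc's librt may simply loop over the list internally):

    #include <aio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    /* Submit one batched read request for n (possibly non-contiguous) blocks. */
    static int
    prefetch_blocks_batched(int fd, const off_t *offsets, int n, size_t blksz)
    {
        struct aiocb  *cbs  = calloc(n, sizeof(struct aiocb));
        struct aiocb **list = calloc(n, sizeof(struct aiocb *));
        int            i;

        for (i = 0; i < n; i++)
        {
            cbs[i].aio_fildes     = fd;
            cbs[i].aio_offset     = offsets[i];
            cbs[i].aio_buf        = malloc(blksz);  /* shared buffers in reality */
            cbs[i].aio_nbytes     = blksz;
            cbs[i].aio_lio_opcode = LIO_READ;
            list[i] = &cbs[i];
        }

        /* LIO_NOWAIT: hand the whole batch to the library/kernel and return;
         * completion is later checked per-cb with aio_error()/aio_return(). */
        return lio_listio(LIO_NOWAIT, list, n, NULL);
    }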
On Thu, Nov 1, 2012 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:
I agree. I'm a bit hesitant to subscribe to yet another mailing list
FYI you can send messages to linux-kernel without subscribing (there's
no moderation either).
Subscribing to linux-kernel is like drinking from a firehose :)
linux-fsdevel is more reasonable though...
readahead logic is not at the filesystem level, but at the memory mapper's.
I'll consider posting without subscribing.
Claudio wrote :
Check the latest patch, it contains heap page prefetching too.
Oh yes I see. I missed that - I was looking in the wrong place.
I do have one question about the way you did it : by placing the
prefetch heap-page calls in _bt_next, which effectively means inside
a call from the index am index_getnext_tid to btgettuple, are you sure
you are synchronizing your prefetches of heap pages with the index am's
ReadBuffer's of heap pages? I.e. are you complying with this comment
from nodeBitmapHeapscan.c for prefetching its bitmap heap pages in
the bitmap-index-scan case:
* We issue prefetch requests *after* fetching the current page to try
* to avoid having prefetching interfere with the main I/O.
I can't really tell whether your design conforms to this, nor do I
know whether it is important, but I decided to do it in the same manner,
and so implemented the heap-page prefetching in index_fetch_heap.
async_io indeed may make that logic obsolete, but it's not redundant
posix_fadvise what's the trouble there, but the fact that the kernel
stops doing read-ahead when a call to posix_fadvise comes. I noticed
the performance hit, and checked the kernel's code. It effectively
changes the prediction mode from sequential to fadvise, negating the
(assumed) kernel's prefetch logic.
I did not know that. Very interesting.
I've mused about the possibility to batch async_io requests, and use
the scatter/gather API instead of sending tons of requests to the
kernel. I think doing so would enable a zero-copy path that could very
possibly imply big speed improvements when memory bandwidth is the
bottleneck.
I think you are totally correct on this point. If I recall, the
glibc (librt) aio does have an lio_listio, but it is either a no-op
or just loops over the list, I forget which (I don't have its source right now),
but in any case I am sure there is potential for implementing such a facility.
But to be really effective, it should be implemented in the kernel itself,
which we don't have today.
John
On 11/1/12 6:13 PM, Claudio Freire wrote:
posix_fadvise what's the trouble there, but the fact that the kernel
stops doing read-ahead when a call to posix_fadvise comes. I noticed
the performance hit, and checked the kernel's code. It effectively
changes the prediction mode from sequential to fadvise, negating the
(assumed) kernel's prefetch logic.
That's really interesting. There was a patch submitted at one point to
use POSIX_FADV_SEQUENTIAL on sequential scans, and that wasn't a
repeatable improvement either, so it was canned at
http://archives.postgresql.org/pgsql-hackers/2008-10/msg01611.php
The Linux posix_fadvise implementation never seemed like it was well
liked by the kernel developers. Quirky stuff like this popped up all
the time during that period, when effective_io_concurrency was being
added. I wonder how far back the fadvise/read-ahead conflict goes back.
I've mused about the possibility to batch async_io requests, and use
the scatter/gather API instead of sending tons of requests to the
kernel. I think doing so would enable a zero-copy path that could very
possibly imply big speed improvements when memory bandwidth is the
bottleneck.
Another possibly useful bit of history here for you. Greg Stark wrote a
test program that used async I/O effectively on both Linux and Solaris.
Unfortunately, it was hard to get that to work given how Postgres does
its buffer I/O, and using processes instead of threads. This looks like
the place he commented on why:
The part I think was relevant there from him:
"In the libaio view of the world you initiate io and either get a
callback or call another syscall to test if it's complete. Either
approach has problems for Postgres. If the process that initiated io
is in the middle of a long query it might take a long time, or might even
never get back to complete the io. The callbacks use threads...
And polling for completion has the problem that another process could
be waiting on the io and can't issue a read as long as the first
process has the buffer locked and io in progress. I think aio makes a
lot more sense if you're using threads so you can start a thread to
wait for the io to complete."
--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
On Thu, Nov 1, 2012 at 10:59 PM, Greg Smith <greg@2ndquadrant.com> wrote:
On 11/1/12 6:13 PM, Claudio Freire wrote:
posix_fadvise what's the trouble there, but the fact that the kernel
stops doing read-ahead when a call to posix_fadvise comes. I noticed
the performance hit, and checked the kernel's code. It effectively
changes the prediction mode from sequential to fadvise, negating the
(assumed) kernel's prefetch logic.
...
The Linux posix_fadvise implementation never seemed like it was well liked
by the kernel developers. Quirky stuff like this popped up all the time
during that period, when effective_io_concurrency was being added. I wonder
how far back the fadvise/read-ahead conflict goes back.
Well, to be precise it's not so much a problem in posix_fadvise
itself, it's a problem in how it interacts with readahead. Since
readahead works at the memory mapper level, and only when actually
performing I/O (which would seem at first glance quite sensible), it
doesn't get to see fadvise activity.
FADV_WILLNEED is implemented as a forced readahead, which doesn't
update any of the readahead context structures. Again, at first
glance, this would seem sensible (explicit hints shouldn't interfere
with pattern detection logic). However, since those pages are (after
the fadvise call) under async I/O, next time the memory mapper needs
that page, instead of requesting I/O through readahead logic, it will
wait for async I/O to complete.
IOW, what was sequential in fact, became invisible to readahead,
indistinguishable from random I/O. Whatever page fadvise failed to
predict will be treated as random I/O, and here the trouble lies.
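In concrete terms, the hint that triggers this behaviour boils down to a
posix_fadvise call of roughly the following shape (a minimal sketch of what an
effective_io_concurrency prefetch eventually issues on builds with USE_PREFETCH;
block-size arithmetic and error handling simplified):

    #include <fcntl.h>
    #include <sys/types.h>

    static void
    hint_block(int fd, off_t blkno, size_t blcksz)
    {
        /* Starts a forced readahead for this range, but - on the kernels
         * discussed here - without updating the readahead window state,
         * so a sequential pattern becomes invisible to the heuristic.   */
        (void) posix_fadvise(fd, (off_t) blkno * blcksz, blcksz,
                             POSIX_FADV_WILLNEED);
    }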
I've mused about the possibility to batch async_io requests, and use
the scatter/gather API instead of sending tons of requests to the kernel.
I think doing so would enable a zero-copy path that could very
possibly imply big speed improvements when memory bandwidth is the
bottleneck.
Another possibly useful bit of history here for you. Greg Stark wrote a
test program that used async I/O effectively on both Linux and Solaris.
Unfortunately, it was hard to get that to work given how Postgres does its
buffer I/O, and using processes instead of threads. This looks like the
place he commented on why:
The part I think was relevant there from him:
"In the libaio view of the world you initiate io and either get a
callback or call another syscall to test if it's complete. Either
approach has problems for Postgres. If the process that initiated io
is in the middle of a long query it might take a long time, or not even
never get back to complete the io. The callbacks use threads...
And polling for completion has the problem that another process could
be waiting on the io and can't issue a read as long as the first
process has the buffer locked and io in progress. I think aio makes a
lot more sense if you're using threads so you can start a thread to
wait for the io to complete."
I noticed that. I always envisioned async I/O as managed by some
dedicated process. One that could check for completion or receive
callbacks. Postmaster, for instance.
Claudio Freire wrote:
On Thu, Nov 1, 2012 at 10:59 PM, Greg Smith <greg@2ndquadrant.com> wrote:
On 11/1/12 6:13 PM, Claudio Freire wrote:
posix_fadvise what's the trouble there, but the fact that the kernel
stops doing read-ahead when a call to posix_fadvise comes. I noticed
the performance hit, and checked the kernel's code. It effectively
changes the prediction mode from sequential to fadvise, negating the
(assumed) kernel's prefetch logic.
The Linux posix_fadvise implementation never seemed like it was well liked
by the kernel developers. Quirky stuff like this popped up all the time
during that period, when effective_io_concurrency was being added. I wonder
how far back the fadvise/read-ahead conflict goes back.
Well, to be precise it's not so much as a problem in posix_fadvise
itself, it's a problem in how it interacts with readahead. Since
readahead works at the memory mapper level, and only when actually
performing I/O (which would seem at first glance quite sensible), it
doesn't get to see fadvise activity.
FADV_WILLNEED is implemented as a forced readahead, which doesn't
update any of the readahead context structures. Again, at first
glance, this would seem sensible (explicit hints shouldn't interfere
with pattern detection logic). However, since those pages are (after
the fadvise call) under async I/O, next time the memory mapper needs
that page, instead of requesting I/O through readahead logic, it will
wait for async I/O to complete.
IOW, what was sequential in fact, became invisible to readahead,
indistinguishable from random I/O. Whatever page fadvise failed to
predict will be treated as random I/O, and here the trouble lies.
And this may be one other advantage of async io over posix_fadvise in
the linux environment (with the present mmap behaviour):
that async io achieves the same objective of improving disk/processing overlap
without the mentioned interference with read-ahead.
Although confirming this would ideally require a 3-way comparison:
posix_fadvise + existing readahead behaviour
posix_fadvise + existing readahead behaviour modified not to force waiting
for the current async io
(i.e. just check the aio and continue normal readahead if in progress)
async io with no posix_fadvise
It seems in general preferable to have an implementation that is
less dependent on the specific behaviour of the OS read-ahead mechanism.
I've mused about the possibility to batch async_io requests, and use
the scatter/gather API instead of sending tons of requests to the kernel.
I think doing so would enable a zero-copy path that could very
possibly imply big speed improvements when memory bandwidth is the
bottleneck.
Another possibly useful bit of history here for you. Greg Stark wrote a
test program that used async I/O effectively on both Linux and Solaris.
Unfortunately, it was hard to get that to work given how Postgres does its
buffer I/O, and using processes instead of threads. This looks like the
place he commented on why:
The part I think was relevant there from him:
"In the libaio view of the world you initiate io and either get a
callback or call another syscall to test if it's complete. Either
approach has problems for Postgres. If the process that initiated io
is in the middle of a long query it might take a long time, or not even
never get back to complete the io. The callbacks use threads...
And polling for completion has the problem that another process could
be waiting on the io and can't issue a read as long as the first
process has the buffer locked and io in progress. I think aio makes a
lot more sense if you're using threads so you can start a thread to
wait for the io to complete."
I noticed that. I always envisioned async I/O as managed by some
dedicated process. One that could check for completion or receive
callbacks. Postmaster, for instance.
Thanks for mentioning this posting. Interesting.
However, the OP describes an implementation based on libaio.
Today what we have (for linux) is librt, which is quite different.
It is arguably worse than libaio (well, actually I am sure it is worse)
since it is essentially just an encapsulation of using threads to do
synchronous ios - you can look at it as making it easier to do what the
application could do itself if it set up its own pthreads. The linux
kernel does not know about it and so the CPU overhead of checking for
completion is higher.
But if async io is used *only* for prefetching, and not for the actual
ReadBuffer itself (which is what I did), then the problem
mentioned by the OP
"If the process that initiated io is in the middle of a long query
it might take a long time, or never get back to complete the io"
is not a problem. The model is simple:
1 . backend process P1 requests prefetch on a relation/block R/B,
which results in initiating aio_read using
a (new) shared control block (call it pg_buf_aiocb)
which tracks the request and also contains the librt's aiocb.
2 . any backend P2 (which may or may not be P1)
issues ReadBuffer or similar,
requesting access for read/pin to buffer containing R/B.
this backend discovers that the buffer is aio_in_progress
and calls check_completion(pg_buf_aiocb),
and waits (effectively on the librt thread) if not complete.
3 . any number of other backends may concurrently do same as 2.
I.e. If pg_buf_aiocb is still aio-in-progress, they all also wait
on the librt thread.
4. Each waiting backend receives the completion
and the last one does the housekeeping and returns the pg_buf_aiocb.
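A minimal sketch of the check in step 2, assuming a shared control block of
roughly this shape (all names here, including pg_buf_aiocb and check_completion,
are only illustrative of the flow being described, not the actual code):

    #include <aio.h>
    #include <errno.h>
    #include <sys/types.h>

    typedef struct pg_buf_aiocb
    {
        struct aiocb aiocb;      /* librt's control block                    */
        int          buf_id;     /* shared buffer being filled               */
        int          waiters;    /* backends currently waiting on this io    */
    } pg_buf_aiocb;

    /* Any backend wanting the buffer calls this; returns bytes read or -1. */
    static ssize_t
    check_completion(pg_buf_aiocb *pba)
    {
        const struct aiocb *list[1] = { &pba->aiocb };
        int                 err;

        while ((err = aio_error(&pba->aiocb)) == EINPROGRESS)
            (void) aio_suspend(list, 1, NULL);   /* wait on the librt thread */

        if (err != 0)
            return -1;                           /* the prefetch read failed */

        /* harvest the result; in the real design only the last waiter
         * would do this housekeeping and release the control block.    */
        return aio_return(&pba->aiocb);
    }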
What complicates it is managing the associated pinned buffer
in such a way that every backend takes the correct action
with the correct degree of serialization of the buffer descriptor
during critical sections, while still allowing all backends in
3. above to wait/check concurrently. After quite a lot of testing
I think I now have this correct. ("I just found the *last* bug!" :-)
John
On Fri, Nov 2, 2012 at 09:59:08AM -0400, John Lumby wrote:
Thanks for mentioning this posting. Interesting.
However, the OP describes an implementation based on libaio.
Today what we have (for linux) is librt, which is quite different.
It is arguably worse than libaio (well, actually I am sure it is worse)
since it is essentially just an encapsulation of using threads to do
synchronous ios - you can look at it as making it easier to do what the
application could do itself if it set up its own pthreads. The linux
kernel does not know about it and so the CPU overhead of checking for
completion is higher.
Well, good thing we didn't switch to using libaio, now that it is gone.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ It's impossible for everything to be true. +
Bruce Momjian wrote:
On Fri, Nov 2, 2012 at 09:59:08AM -0400, John Lumby wrote:
However, the OP describes an implementation based on libaio.
Today what we have (for linux) is librt, which is quite different.
Well, good thing we didn't switch to using libaio, now that it is gone.
Yes, I think you are correct. Although I should correct myself about the
status of libaio - it seems many distros continue to provide it, and at least
one other popular database (MySQL) uses it, but as far as I can tell
the content has not been updated by the original authors for around 10 years.
That is perhaps not surprising since it does very little other than wrap
the linux kernel syscalls.
Set against the CPU-overhead disadvantage of librt, I think the three
main advantages of librt vs libaio/kernel-aio for postgresql are :
. posix standard, and probably easier to provide very similar
implementation on windows (I see at least one posix aio lib for windows)
. no restrictions on the way files are accessed (kernel-aio imposes restrictions
on open() flags and buffer alignment etc)
. it seems (from the recent postings about the earlier attempt to implement
async io using libaio) that the posix threads style lends itself better to
fitting in with the postgresql backend model.
John