index prefetching
Hi,
At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.
Motivation
----------
Imagine we have a huge table (much larger than RAM), with an index, and
that we're doing a regular index scan (e.g. using a btree index). We
first walk the index to the leaf page, read the item pointers from the
leaf page and then start issuing fetches from the heap.
The index access is usually pretty cheap, because non-leaf pages are
very likely cached, so perhaps we do one I/O for the leaf page. But the
fetches from the heap are likely very expensive - unless the heap is
well correlated with the index, we'll do a random I/O for each item
pointer. That's easily ~200 or more I/O requests per leaf page. The
problem is that index scans do these
requests synchronously at the moment - we get the next TID, fetch the
heap page, process the tuple, continue to the next TID etc.
That is slow and can't really leverage the bandwidth of modern storage,
which requires longer queues. This patch aims to improve this by async
prefetching.
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly prefetching was implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there are many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.
But there are three shortcomings in that logic:
1) It's not clear that the threshold at which prefetching becomes
beneficial is the same as the threshold for switching to bitmap index
scans. And as I'll
demonstrate later, the prefetching threshold is indeed much lower
(perhaps a couple dozen matching tuples) on large tables.
2) Our estimates / planning are not perfect, so we may easily pick an
index scan instead of a bitmap scan. It'd be nice to limit the damage a
bit by still prefetching.
3) There are queries that can't do a bitmap scan (at all, or because
it's hopelessly inefficient). Consider queries that require ordering, or
queries by distance with GiST/SP-GiST index.
Implementation
--------------
When I started looking at this, I only really thought about btree. If
you look at BTScanPosData, which is what the index scans use to
represent the current leaf page, you'll notice it has "items", which is
the array of item pointers (TIDs) that we'll fetch from the heap. Which
is exactly the thing we need.
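For reference, this is roughly the relevant part of BTScanPosData
(abridged from nbtree.h) - the items array already holds the heap TIDs
in the order the scan will return them:

    typedef struct BTScanPosItem    /* what we remember about each match */
    {
        ItemPointerData heapTid;        /* TID of referenced heap item */
        OffsetNumber    indexOffset;    /* index item's location within page */
        LocationIndex   tupleOffset;    /* IndexTuple's offset in workspace */
    } BTScanPosItem;

    typedef struct BTScanPosData
    {
        ...
        int     firstItem;      /* first valid index in items[] */
        int     lastItem;       /* last valid index in items[] */
        int     itemIndex;      /* current index in items[] */

        BTScanPosItem items[MaxTIDsPerBTreePage];   /* MUST BE LAST */
    } BTScanPosData;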
The easiest thing would be to just do prefetching from the btree code.
But then I realized there's no particular reason why other index types
(except for GIN, which only allows bitmap scans) couldn't do prefetching
too. We could have a copy in each AM, of course, but that seems sloppy
and also a violation of layering. After all, bitmap heap scans do
prefetch from the executor, so the AM seems way too low-level.
So I ended up moving most of the prefetching logic up into indexam.c,
see the index_prefetch() function. It can't be entirely separate,
because each AM represents the current state in a different way (e.g.
SpGistScanOpaque and BTScanOpaque are very different).
So what I did is introduce an IndexPrefetch struct, which is part of
IndexScanDesc and maintains all the info about prefetching for that
particular scan - current/maximum distance, progress, etc.
It also contains two AM-specific callbacks (get_range and get_block),
which return the valid range of indexes (into the AM's internal array)
and the block number for a given index.
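To give a rough idea (see genam.h in the attached patch for the full
definition), the struct and the callback signatures look about like this:

    typedef void (*prefetcher_getrange_function) (IndexScanDesc scan,
                                                  ScanDirection dir,
                                                  int *start, int *end,
                                                  bool *reset);

    typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scan,
                                                         ScanDirection dir,
                                                         int index);

    typedef struct IndexPrefetchData
    {
        int         prefetchIndex;      /* how far we already prefetched */
        int         prefetchTarget;     /* current prefetch distance */
        int         prefetchMaxTarget;  /* maximum prefetch distance */
        int         prefetchReset;      /* distance to reset to on rescan */

        /* small LRU cache of recently prefetched blocks */
        BlockNumber cacheBlocks[8];
        int         cacheIndex;

        /* AM-specific callbacks */
        prefetcher_getrange_function get_range;
        prefetcher_getblock_function get_block;
    } IndexPrefetchData;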
This mostly does the trick, although index_prefetch() is still called
from the amgettuple() functions. That seems wrong - we should call it
from indexam.c right after calling amgettuple.
Problems / Open questions
-------------------------
There are a couple of issues I ran into; I'll try to list them in order
of importance (most serious ones first).
1) pairing-heap in GiST / SP-GiST
For most AMs, the index state is pretty trivial - matching items from a
single leaf page. Prefetching that is straightforward, even if the
current API is a bit cumbersome.
Distance queries on GiST and SP-GiST are a problem, though, because
those do not just read the pointers into a simple array, as the distance
ordering requires passing stuff through a pairing-heap :-(
I don't know how to best deal with that, especially not in the simple
API. I don't think we can "scan forward" stuff from the pairing heap, so
the only idea I have is actually having two pairing-heaps. Or maybe
using the pairing heap for prefetching, but stashing the prefetched
pointers into an array and then returning stuff from it.
In the patch I simply prefetch items before we add them to the pairing
heap, which is good enough for demonstrating the benefits.
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?
I'm also not entirely sure the way this interfaces with the AM (through
the get_range / get_block callbacks) is very elegant. It did the trick,
but it seems a bit cumbersome. I wonder if someone has a better/nicer
idea how to do this ...
3) prefetch distance
I think we can do various smart things about the prefetch distance.
The current code does about the same thing bitmap scans do - it starts
with distance 0 (no prefetching), and then simply ramps the distance up
until the maximum value from get_tablespace_io_concurrency(), which is
either effective_io_concurrency or the per-tablespace value.
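Concretely, the forward-scan path in index_prefetch() boils down to
roughly this (leaving out the leaf-page reset handling and the small
dedup cache):

    IndexPrefetch prefetch = scan->xs_prefetch;
    int           startIndex,
                  endIndex;
    bool          reset;

    /* gradually increase the prefetch distance, up to the maximum */
    prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
                                   prefetch->prefetchMaxTarget);

    /* don't prefetch anything until the distance ramps up above zero */
    if (prefetch->prefetchTarget <= 0)
        return;

    /* range of not-yet-processed entries on the current leaf page */
    prefetch->get_range(scan, dir, &startIndex, &endIndex, &reset);

    /*
     * Prefetch only up to the target ahead of the current position, and
     * skip entries we already prefetched earlier.
     */
    endIndex = Min(endIndex, startIndex + prefetch->prefetchTarget);
    startIndex = Max(startIndex, prefetch->prefetchIndex + 1);

    for (int i = startIndex; i <= endIndex; i++)
        PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM,
                       prefetch->get_block(scan, dir, i));

    prefetch->prefetchIndex = endIndex;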
I think we could be a bit smarter, and also consider e.g. the estimated
number of matching rows (but we shouldn't be too strict, because it's
just an estimate). We could also track some statistics for each scan and
use that during rescans (think index scan in a nested loop).
But the patch doesn't do any of that now.
4) per-leaf prefetching
The code currently only prefetches items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to fully
process one leaf page before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
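For context, the per-leaf restriction is also what the reset handling is
about - the AM sets a flag when it loads a new batch of item pointers,
and the prefetcher uses that to start over. Roughly (from the patch):

    /* in _bt_readpage(), before loading items from the next leaf page */
    so->currPos.didReset = true;

    /* in index_prefetch(), after the get_range callback reports the reset */
    if (reset)
    {
        prefetch->prefetchTarget = prefetch->prefetchReset;
        prefetch->prefetchIndex = startIndex;
    }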
5) index-only scans
I'm not sure what to do about index-only scans. On the one hand, the
point of IOS is not to read stuff from the heap at all, so why prefetch
it. OTOH if there are many allvisible=false pages, we still have to
access the heap anyway. And if that happens, we end up in the bizarre
situation where IOS is slower than a regular index scan. But to address
this, we'd
have to consider the visibility during prefetching.
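If we went that way, the per-block prefetch step might look something
like this hypothetical sketch (not part of the patch - the function name
and the vmbuffer argument are made up for illustration), using the
visibility map the same way IOS itself does:

    /* needs access/visibilitymap.h and storage/bufmgr.h */

    /*
     * Hypothetical: prefetch only heap pages that are not all-visible.
     * The caller would keep vmbuffer pinned across calls, like ioss_VMBuffer.
     */
    static void
    index_prefetch_block_ios(IndexScanDesc scan, BlockNumber block,
                             Buffer *vmbuffer)
    {
        /* IOS won't touch the heap for all-visible pages, so don't prefetch */
        if (VM_ALL_VISIBLE(scan->heapRelation, block, vmbuffer))
            return;

        PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
    }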
Benchmarks
----------
1) OLTP
For OLTP, I tested different queries with various index types, on data
sets constructed to have a certain number of matching rows, forcing
different types of query plans (bitmap, index, seqscan).
The data sets are ~34GB, which is much more than the available RAM (8GB).
For example for BTREE, we have a query like this:
SELECT * FROM btree_test WHERE a = $v
with data matching 1, 10, 100, ..., 100000 rows for each $v. The results
look like this:
rows      bitmapscan    master    patched    seqscan     [ms]
1               19.8      20.4       18.8    31875.5
10              24.4      23.8       23.2    30642.4
100             27.7      40.0       26.3    31871.3
1000            45.8     178.0       45.4    30754.1
10000          171.8    1514.9      174.5    30743.3
100000        1799.0   15993.3     1777.4    30937.3
This says that the query takes ~31s with a seqscan, 1.8s with a bitmap
scan and 16s with an index scan (on master). With the prefetching patch,
it takes about 1.8s, i.e. about the same as the bitmap scan.
I don't know exactly where the plan would switch from an index scan to a
bitmap scan, but the table has ~100M rows, so all of these row counts
are tiny. I'd bet most of these cases would use a plain index scan.
For a query with ordering:
SELECT * FROM btree_test WHERE a >= $v ORDER BY a LIMIT $n
the results look a bit different:
rows      bitmapscan    master    patched    seqscan     [ms]
1            52703.9      19.5       19.5    31145.6
10           51208.1      22.7       24.7    30983.5
100          49038.6      39.0       26.3    32085.3
1000         53760.4     193.9       48.4    31479.4
10000        56898.4    1600.7      187.5    32064.5
100000       50975.2   15978.7     1848.9    31587.1
This is a good illustration of a query where bitmapscan is terrible
(much worse than seqscan, in fact), and the patch is a massive
improvement over master (about an order of magnitude).
Of course, if you only scan a couple rows, the benefits are much more
modest (say 40% for 100 rows, which is still significant).
The results for other index types (HASH, GiST, SP-GiST) follow roughly
the same pattern. See the attached PDF for more charts, and [1] for the
complete results.
Benchmark / TPC-H
-----------------
I ran the 22 queries on a 100GB data set, with parallel query either
disabled or enabled. And I measured timing (and speedup) for each query.
The speedup results look like this (see the attached PDF for details):
query    serial    parallel
1          101%         99%
2          119%        100%
3          100%         99%
4          101%        100%
5          101%        100%
6           12%         99%
7          100%        100%
8           52%         67%
10         102%        101%
11         100%         72%
12         101%        100%
13         100%        101%
14          13%        100%
15         101%        100%
16          99%         99%
17          95%        101%
18         101%        106%
19          30%         40%
20          99%        100%
21         101%        100%
22         101%        107%
The percentage is (timing patched / timing master), so <100% means
faster and >100% means slower.
How much each query is affected depends on the query plan - many
queries are close to 100%, which means "no difference". For the serial
case, there are about 4 queries that improved a lot (6, 8, 14, 19),
while for the parallel case the benefits are somewhat less significant.
My explanation is that either (a) the parallel case used a different
plan with fewer index scans, or (b) the parallel query does more
concurrent I/O simply by using parallel workers. Or maybe both.
There are a couple of regressions too. I believe those are due to doing
too much prefetching in some cases; the heuristics mentioned earlier
should eliminate most of that.
regards
[1]: https://github.com/tvondra/index-prefetch-tests
[2]: https://github.com/tvondra/postgres/tree/dev/index-prefetch
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
index-prefetch-poc.patch (text/x-patch)
diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index efdf9415d15..9b3625d833b 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -193,7 +193,7 @@ extern bool blinsert(Relation index, Datum *values, bool *isnull,
IndexUniqueCheck checkUnique,
bool indexUnchanged,
struct IndexInfo *indexInfo);
-extern IndexScanDesc blbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc blbeginscan(Relation r, int nkeys, int norderbys, int prefetch, int prefetch_reset);
extern int64 blgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern void blrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c
index 6cc7d07164a..0c6da1b635b 100644
--- a/contrib/bloom/blscan.c
+++ b/contrib/bloom/blscan.c
@@ -25,7 +25,7 @@
* Begin scan of bloom index.
*/
IndexScanDesc
-blbeginscan(Relation r, int nkeys, int norderbys)
+blbeginscan(Relation r, int nkeys, int norderbys, int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
BloomScanOpaque so;
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3c6a956eaa3..5b298c02cce 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -324,7 +324,7 @@ brininsert(Relation idxRel, Datum *values, bool *nulls,
* holding lock on index, it's not necessary to recompute it during brinrescan.
*/
IndexScanDesc
-brinbeginscan(Relation r, int nkeys, int norderbys)
+brinbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
BrinOpaque *opaque;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index ae7b0e9bb87..3087a986bc3 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -22,7 +22,7 @@
IndexScanDesc
-ginbeginscan(Relation rel, int nkeys, int norderbys)
+ginbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
GinScanOpaque so;
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index e2c9b5f069c..7b79128f2ce 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -493,12 +493,16 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
if (GistPageIsLeaf(page))
{
+ BlockNumber block = ItemPointerGetBlockNumber(&it->t_tid);
+
/* Creating heap-tuple GISTSearchItem */
item->blkno = InvalidBlockNumber;
item->data.heap.heapPtr = it->t_tid;
item->data.heap.recheck = recheck;
item->data.heap.recheckDistances = recheck_distances;
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+
/*
* In an index-only scan, also fetch the data from the tuple.
*/
@@ -529,6 +533,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
UnlockReleaseBuffer(buffer);
+
+ so->didReset = true;
}
/*
@@ -679,6 +685,8 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
so->curPageData++;
+ index_prefetch(scan, ForwardScanDirection);
+
return true;
}
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 00400583c0b..fdf978eaaad 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -22,6 +22,8 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+static void gist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber gist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
/*
* Pairing heap comparison function for the GISTSearchItem queue
@@ -71,7 +73,7 @@ pairingheap_GISTSearchItem_cmp(const pairingheap_node *a, const pairingheap_node
*/
IndexScanDesc
-gistbeginscan(Relation r, int nkeys, int norderbys)
+gistbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
GISTSTATE *giststate;
@@ -111,6 +113,31 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
so->curBlkno = InvalidBlockNumber;
so->curPageLSN = InvalidXLogRecPtr;
+ /*
+ * XXX maybe should happen in RelationGetIndexScan? But we need to define
+ * the callbacks, so that needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = gist_prefetch_getblock;
+ prefetcher->get_range = gist_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
scan->opaque = so;
/*
@@ -356,3 +383,42 @@ gistendscan(IndexScanDesc scan)
*/
freeGISTstate(so->giststate);
}
+
+static void
+gist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->didReset;
+ so->didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->curPageData;
+ *end = (so->nPageData - 1);
+ }
+ else
+ {
+ *start = 0;
+ *end = so->curPageData;
+ }
+}
+
+static BlockNumber
+gist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->curPageData) || (index >= so->nPageData))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->pageData[index].heapPtr;
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index fc5d97f606e..01a25132bce 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -48,6 +48,9 @@ static void hashbuildCallback(Relation index,
bool tupleIsAlive,
void *state);
+static void _hash_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber _hash_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
+
/*
* Hash handler function: return IndexAmRoutine with access method parameters
@@ -362,7 +365,7 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
* hashbeginscan() -- start a scan on a hash index
*/
IndexScanDesc
-hashbeginscan(Relation rel, int nkeys, int norderbys)
+hashbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
HashScanOpaque so;
@@ -383,6 +386,31 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL;
so->numKilled = 0;
+ /*
+ * XXX maybe should happen in RelationGetIndexScan? But we need to define
+ * the callbacks, so that needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = _hash_prefetch_getblock;
+ prefetcher->get_range = _hash_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
scan->opaque = so;
return scan;
@@ -918,3 +946,42 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
else
LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
}
+
+static void
+_hash_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->currPos.didReset;
+ so->currPos.didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->currPos.itemIndex;
+ *end = so->currPos.lastItem;
+ }
+ else
+ {
+ *start = so->currPos.firstItem;
+ *end = so->currPos.itemIndex;
+ }
+}
+
+static BlockNumber
+_hash_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->currPos.firstItem) || (index > so->currPos.lastItem))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->currPos.items[index].heapTid;
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9ea2a42a07f..b5cea5e23eb 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -434,6 +434,8 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
currItem = &so->currPos.items[so->currPos.itemIndex];
scan->xs_heaptid = currItem->heapTid;
+ index_prefetch(scan, dir);
+
/* if we're here, _hash_readpage found a valid tuples */
return true;
}
@@ -467,6 +469,7 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
so->currPos.buf = buf;
so->currPos.currPage = BufferGetBlockNumber(buf);
+ so->currPos.didReset = true;
if (ScanDirectionIsForward(dir))
{
@@ -597,6 +600,7 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
Assert(so->currPos.firstItem <= so->currPos.lastItem);
+
return true;
}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 646135cc21c..b2f4eadc1ea 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
@@ -756,6 +757,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
PROGRESS_CLUSTER_INDEX_RELID
};
int64 ci_val[2];
+ int prefetch_target;
+
+ prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
/* Set phase and OIDOldIndex to columns */
ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -764,7 +768,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tableScan = NULL;
heapScan = NULL;
- indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
+ prefetch_target, prefetch_target);
index_rescan(indexScan, NULL, 0, NULL, 0);
}
else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 722927aebab..264ebe1d8e5 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
scan->xs_hitup = NULL;
scan->xs_hitupdesc = NULL;
+ /* set in each AM when applicable */
+ scan->xs_prefetch = NULL;
+
return scan;
}
@@ -440,8 +443,9 @@ systable_beginscan(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, irel,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
}
@@ -696,8 +700,9 @@ systable_beginscan_ordered(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, indexRelation,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b25b03f7abc..aa8a14624d8 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -59,6 +59,7 @@
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
@@ -106,7 +107,8 @@ do { \
static IndexScanDesc index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap);
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset);
/* ----------------------------------------------------------------
@@ -200,18 +202,36 @@ index_insert(Relation indexRelation,
* index_beginscan - start a scan of an index with amgettuple
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * prefetch_target determines if prefetching is requested for this index scan.
+ * We need to be able to disable this for two reasons. Firstly, we don't want
+ * to do prefetching for IOS (where we hope most of the heap pages won't be
+ * really needed. Secondly, we must prevent infinite loop when determining
+ * prefetch value for the tablespace - the get_tablespace_io_concurrency()
+ * does an index scan internally, which would result in infinite loop. So we
+ * simply disable prefetching in systable_beginscan().
+ *
+ * XXX Maybe we should do prefetching even for catalogs, but then disable it
+ * when accessing TableSpaceRelationId. We still need the ability to disable
+ * this and catalogs are expected to be tiny, so prefetching is unlikely to
+ * make a difference.
+ *
+ * XXX The second reason doesn't really apply after effective_io_concurrency
+ * lookup moved to caller of index_beginscan.
*/
IndexScanDesc
index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys)
+ int nkeys, int norderbys,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false,
+ prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -241,7 +261,8 @@ index_beginscan_bitmap(Relation indexRelation,
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
+ 0, 0); /* no prefetch */
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -258,7 +279,8 @@ index_beginscan_bitmap(Relation indexRelation,
static IndexScanDesc
index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap)
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
@@ -276,8 +298,8 @@ index_beginscan_internal(Relation indexRelation,
/*
* Tell the AM to open a scan.
*/
- scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys,
- norderbys);
+ scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, norderbys,
+ prefetch_target, prefetch_reset);
/* Initialize information for parallel scan. */
scan->parallel_scan = pscan;
scan->xs_temp_snap = temp_snap;
@@ -317,6 +339,16 @@ index_rescan(IndexScanDesc scan,
scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
orderbys, norderbys);
+
+ /* If we're prefetching for this index, maybe reset some of the state. */
+ if (scan->xs_prefetch != NULL)
+ {
+ IndexPrefetch prefetcher = scan->xs_prefetch;
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
+ prefetcher->prefetchReset);
+ }
}
/* ----------------
@@ -487,10 +519,13 @@ index_parallelrescan(IndexScanDesc scan)
* index_beginscan_parallel - join parallel index scan
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * XXX See index_beginscan() for more comments on prefetch_target.
*/
IndexScanDesc
index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
- int norderbys, ParallelIndexScanDesc pscan)
+ int norderbys, ParallelIndexScanDesc pscan,
+ int prefetch_target, int prefetch_reset)
{
Snapshot snapshot;
IndexScanDesc scan;
@@ -499,7 +534,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
RegisterSnapshot(snapshot);
scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
- pscan, true);
+ pscan, true, prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -557,6 +592,9 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
pgstat_count_index_tuples(scan->indexRelation, 1);
+ /* do index prefetching, if needed */
+ index_prefetch(scan, direction);
+
/* Return the TID of the tuple we found. */
return &scan->xs_heaptid;
}
@@ -988,3 +1026,228 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
return build_local_reloptions(&relopts, attoptions, validate);
}
+
+
+
+/*
+ * Do prefetching, and gradually increase the prefetch distance.
+ *
+ * XXX This is limited to a single index page (because that's where we get
+ * currPos.items from). But index tuples are typically very small, so there
+ * should be quite a bit of stuff to prefetch (especially with deduplicated
+ * indexes, etc.). Does not seem worth reworking the index access to allow
+ * more aggressive prefetching, it's best effort.
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of index pages visited
+ * and index tuples returned, to calculate avg tuples / page, and then
+ * use that to limit prefetching after switching to a new page (instead
+ * of just using prefetchMaxTarget, which can get much larger).
+ *
+ * XXX Obviously, another option is to use the planner estimates - we know
+ * how many rows we're expected to fetch (on average, assuming the estimates
+ * are reasonably accurate), so why not to use that. And maybe combine it
+ * with the auto-tuning based on runtime statistics, described above.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a bit wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ */
+void
+index_prefetch(IndexScanDesc scan, ScanDirection dir)
+{
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
+ /*
+ * No heap relation means bitmap index scan, which does prefetching at
+ * the bitmap heap scan, so no prefetch here (we can't do it anyway,
+ * without the heap)
+ *
+ * XXX But in this case we should have prefetchMaxTarget=0, because in
+ * index_beginscan_bitmap() we disable prefetching. So maybe we should
+ * just check that.
+ */
+ if (!prefetch)
+ return;
+
+ /* was it initialized correctly? */
+ // Assert(prefetch->prefetchIndex != -1);
+
+ /*
+ * If we got here, prefetching is enabled and it's a node that supports
+ * prefetching (i.e. it can't be a bitmap index scan).
+ */
+ Assert(scan->heapRelation);
+
+ /* gradually increase the prefetch distance */
+ prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+ prefetch->prefetchMaxTarget);
+
+ /*
+ * Did we already reach the point to actually start prefetching? If not,
+ * we're done. We'll try again for the next index tuple.
+ */
+ if (prefetch->prefetchTarget <= 0)
+ return;
+
+ /*
+ * XXX I think we don't need to worry about direction here, that's handled
+ * by how the AMs build the curPos etc. (see nbtsearch.c)
+ */
+ if (ScanDirectionIsForward(dir))
+ {
+ bool reset;
+ int startIndex,
+ endIndex;
+
+ /* get indexes of unprocessed index entries */
+ prefetch->get_range(scan, dir, &startIndex, &endIndex, &reset);
+
+ /*
+ * Did we switch to a different index block? if yes, reset relevant
+ * info so that we start prefetching from scratch.
+ */
+ if (reset)
+ {
+ prefetch->prefetchTarget = prefetch->prefetchReset;
+ prefetch->prefetchIndex = startIndex; /* maybe -1 instead? */
+ pgBufferUsage.blks_prefetch_rounds++;
+ }
+
+ /*
+ * Adjust the range, based on what we already prefetched, and also
+ * based on the prefetch target.
+ *
+ * XXX We need to adjust the end index first, because it depends on
+ * the actual position, before we consider how far we prefetched.
+ */
+ endIndex = Min(endIndex, startIndex + prefetch->prefetchTarget);
+ startIndex = Max(startIndex, prefetch->prefetchIndex + 1);
+
+ for (int i = startIndex; i <= endIndex; i++)
+ {
+ bool recently_prefetched = false;
+ BlockNumber block;
+
+ block = prefetch->get_block(scan, dir, i);
+
+ /*
+ * Do not prefetch the same block over and over again,
+ *
+ * This happens e.g. for clustered or naturally correlated indexes
+ * (fkey to a sequence ID). It's not expensive (the block is in page
+ * cache already, so no I/O), but it's not free either.
+ *
+ * XXX We can't just check blocks between startIndex and endIndex,
+ * because at some point (after the prefetch target gets ramped up)
+ * it's going to be just a single block.
+ *
+ * XXX The solution here is pretty trivial - we just check the
+ * immediately preceding block. We could check a longer history, or
+ * maybe maintain some "already prefetched" struct (small LRU array
+ * of last prefetched blocks - say 8 blocks or so - would work fine,
+ * I think).
+ */
+ for (int j = 0; j < 8; j++)
+ {
+ /* the cached block might be InvalidBlockNumber, but that's fine */
+ if (prefetch->cacheBlocks[j] == block)
+ {
+ recently_prefetched = true;
+ break;
+ }
+ }
+
+ if (recently_prefetched)
+ continue;
+
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+ pgBufferUsage.blks_prefetches++;
+
+ prefetch->cacheBlocks[prefetch->cacheIndex] = block;
+ prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
+ }
+
+ prefetch->prefetchIndex = endIndex;
+ }
+ else
+ {
+ bool reset;
+ int startIndex,
+ endIndex;
+
+ /* get indexes of unprocessed index entries */
+ prefetch->get_range(scan, dir, &startIndex, &endIndex, &reset);
+
+ /* FIXME handle the reset flag */
+
+ /*
+ * Adjust the range, based on what we already prefetched, and also
+ * based on the prefetch target.
+ *
+ * XXX We need to adjust the start index first, because it depends on
+ * the actual position, before we consider how far we prefetched (which
+ * for backwards scans is (end index).
+ */
+ startIndex = Max(startIndex, endIndex - prefetch->prefetchTarget);
+ endIndex = Min(endIndex, prefetch->prefetchIndex - 1);
+
+ for (int i = endIndex; i >= startIndex; i--)
+ {
+ bool recently_prefetched = false;
+ BlockNumber block;
+
+ block = prefetch->get_block(scan, dir, i);
+
+ /*
+ * Do not prefetch the same block over and over again,
+ *
+ * This happens e.g. for clustered or naturally correlated indexes
+ * (fkey to a sequence ID). It's not expensive (the block is in page
+ * cache already, so no I/O), but it's not free either.
+ *
+ * XXX We can't just check blocks between startIndex and endIndex,
+ * because at some point (after the prefetch target gets ramped up)
+ * it's going to be just a single block.
+ *
+ * XXX The solution here is pretty trivial - we just check the
+ * immediately preceding block. We could check a longer history, or
+ * maybe maintain some "already prefetched" struct (small LRU array
+ * of last prefetched blocks - say 8 blocks or so - would work fine,
+ * I think).
+ */
+ for (int j = 0; j < 8; j++)
+ {
+ /* the cached block might be InvalidBlockNumber, but that's fine */
+ if (prefetch->cacheBlocks[j] == block)
+ {
+ recently_prefetched = true;
+ break;
+ }
+ }
+
+ if (recently_prefetched)
+ continue;
+
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+ pgBufferUsage.blks_prefetches++;
+
+ prefetch->cacheBlocks[prefetch->cacheIndex] = block;
+ prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
+ }
+
+ prefetch->prefetchIndex = startIndex;
+ }
+}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1ce5b15199a..b1a02cc9bcd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -37,6 +37,7 @@
#include "utils/builtins.h"
#include "utils/index_selfuncs.h"
#include "utils/memutils.h"
+#include "utils/spccache.h"
/*
@@ -87,6 +88,8 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
+static void _bt_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber _bt_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -341,7 +344,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
* btbeginscan() -- start a scan on a btree index
*/
IndexScanDesc
-btbeginscan(Relation rel, int nkeys, int norderbys)
+btbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
BTScanOpaque so;
@@ -369,6 +372,31 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ /*
+ * XXX maybe should happen in RelationGetIndexScan? But we need to define
+ * the callbacks, so that needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = _bt_prefetch_getblock;
+ prefetcher->get_range = _bt_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -1423,3 +1451,42 @@ btcanreturn(Relation index, int attno)
{
return true;
}
+
+static void
+_bt_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->currPos.didReset;
+ so->currPos.didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->currPos.itemIndex;
+ *end = so->currPos.lastItem;
+ }
+ else
+ {
+ *start = so->currPos.firstItem;
+ *end = so->currPos.itemIndex;
+ }
+}
+
+static BlockNumber
+_bt_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->currPos.firstItem) || (index > so->currPos.lastItem))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->currPos.items[index].heapTid;
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 263f75fce95..762d95d09ed 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -47,7 +47,6 @@ static Buffer _bt_walk_left(Relation rel, Relation heaprel, Buffer buf,
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
/*
* _bt_drop_lock_and_maybe_pin()
*
@@ -1385,7 +1384,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
_bt_parallel_done(scan);
BTScanPosInvalidate(so->currPos);
-
return false;
}
else
@@ -1538,6 +1536,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
Assert(BufferIsValid(so->currPos.buf));
+ /*
+ * Mark the currPos as reset before loading the next chunk of pointers, to
+ * restart the prefetching.
+ */
+ so->currPos.didReset = true;
+
page = BufferGetPage(so->currPos.buf);
opaque = BTPageGetOpaque(page);
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index cbfaf0c00ac..79015194b73 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -16,6 +16,7 @@
#include "postgres.h"
#include "access/genam.h"
+#include "access/relation.h"
#include "access/relscan.h"
#include "access/spgist_private.h"
#include "miscadmin.h"
@@ -32,6 +33,10 @@ typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
SpGistLeafTuple leafTuple, bool recheck,
bool recheckDistances, double *distances);
+static void spgist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber spgist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
+
+
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
* KNN-searches currently only support NULLS LAST. So, preserve this logic
@@ -191,6 +196,7 @@ resetSpGistScanOpaque(SpGistScanOpaque so)
pfree(so->reconTups[i]);
}
so->iPtr = so->nPtrs = 0;
+ so->didReset = true;
}
/*
@@ -301,7 +307,7 @@ spgPrepareScanKeys(IndexScanDesc scan)
}
IndexScanDesc
-spgbeginscan(Relation rel, int keysz, int orderbysz)
+spgbeginscan(Relation rel, int keysz, int orderbysz, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
SpGistScanOpaque so;
@@ -316,6 +322,8 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
so->keyData = NULL;
initSpGistState(&so->state, scan->indexRelation);
+ so->state.heap = relation_open(scan->indexRelation->rd_index->indrelid, NoLock);
+
so->tempCxt = AllocSetContextCreate(CurrentMemoryContext,
"SP-GiST search temporary context",
ALLOCSET_DEFAULT_SIZES);
@@ -371,6 +379,31 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
so->indexCollation = rel->rd_indcollation[0];
+ /*
+ * XXX maybe should happen in RelationGetIndexScan? But we need to define
+ * the callbacks, so that needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = spgist_prefetch_getblock;
+ prefetcher->get_range = spgist_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
scan->opaque = so;
return scan;
@@ -453,6 +486,8 @@ spgendscan(IndexScanDesc scan)
pfree(scan->xs_orderbynulls);
}
+ relation_close(so->state.heap, NoLock);
+
pfree(so);
}
@@ -584,6 +619,13 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
isnull,
distances);
+ // FIXME prefetch here? or in storeGettuple?
+ {
+ BlockNumber block = ItemPointerGetBlockNumber(&leafTuple->heapPtr);
+
+ PrefetchBuffer(so->state.heap, MAIN_FORKNUM, block);
+ }
+
spgAddSearchItemToQueue(so, heapItem);
MemoryContextSwitchTo(oldCxt);
@@ -1047,7 +1089,12 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
index_store_float8_orderby_distances(scan, so->orderByTypes,
so->distances[so->iPtr],
so->recheckDistances[so->iPtr]);
+
so->iPtr++;
+
+ /* prefetch additional tuples */
+ index_prefetch(scan, dir);
+
return true;
}
@@ -1070,6 +1117,7 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
pfree(so->reconTups[i]);
}
so->iPtr = so->nPtrs = 0;
+ so->didReset = true;
spgWalk(scan->indexRelation, so, false, storeGettuple,
scan->xs_snapshot);
@@ -1095,3 +1143,42 @@ spgcanreturn(Relation index, int attno)
return cache->config.canReturnData;
}
+
+static void
+spgist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->didReset;
+ so->didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->iPtr;
+ *end = (so->nPtrs - 1);
+ }
+ else
+ {
+ *start = 0;
+ *end = so->iPtr;
+ }
+}
+
+static BlockNumber
+spgist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->iPtr) || (index >= so->nPtrs))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->heapPtrs[index];
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 190e4f76a9e..4aac68f0766 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -17,6 +17,7 @@
#include "access/amvalidate.h"
#include "access/htup_details.h"
+#include "access/relation.h"
#include "access/reloptions.h"
#include "access/spgist_private.h"
#include "access/toast_compression.h"
@@ -334,6 +335,9 @@ initSpGistState(SpGistState *state, Relation index)
state->index = index;
+ /* we'll initialize the reference in spgbeginscan */
+ state->heap = NULL;
+
/* Get cached static information about index */
cache = spgGetCache(index);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 15f9bddcdf3..0e41ffa8fc0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3558,6 +3558,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
!INSTR_TIME_IS_ZERO(usage->blk_write_time));
bool has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+ bool has_prefetches = (usage->blks_prefetches > 0);
bool show_planning = (planning && (has_shared ||
has_local || has_temp || has_timing ||
has_temp_timing));
@@ -3655,6 +3656,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
appendStringInfoChar(es->str, '\n');
}
+ /* As above, show only positive counter values. */
+ if (has_prefetches)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Prefetches:");
+
+ if (usage->blks_prefetches > 0)
+ appendStringInfo(es->str, " blocks=%lld",
+ (long long) usage->blks_prefetches);
+
+ if (usage->blks_prefetch_rounds > 0)
+ appendStringInfo(es->str, " rounds=%lld",
+ (long long) usage->blks_prefetch_rounds);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+
if (show_planning)
es->indent--;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1d82b64b897..e5ce1dbc953 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -765,11 +765,15 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* May have to restart scan from this point if a potential conflict is
* found.
+ *
+ * XXX Should this do index prefetch? Probably not worth it for unique
+ * constraints, I guess? Otherwise we should calculate prefetch_target
+ * just like in nodeIndexscan etc.
*/
retry:
conflict = false;
found_self = false;
- index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+ index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 9dd71684615..a997aac828f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -157,8 +157,13 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0);
+ /* Start an index scan.
+ *
+ * XXX Should this do index prefetching? We're looking for a single tuple,
+ * probably using a PK / UNIQUE index, so does not seem worth it. If we
+ * reconsider this, calculate prefetch_target like in nodeIndexscan.
+ */
+ scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0, 0);
retry:
found = false;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d2..434be59fca0 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
dst->local_blks_written += add->local_blks_written;
dst->temp_blks_read += add->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+ dst->blks_prefetches += add->blks_prefetches;
INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
@@ -257,6 +259,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0b43a9b9699..3ecb8470d47 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -87,12 +87,20 @@ IndexOnlyNext(IndexOnlyScanState *node)
* We reach here if the index only scan is not parallel, or if we're
* serially executing an index only scan that was planned to be
* parallel.
+ *
+ * XXX Maybe we should enable prefetching, but prefetch only pages that
+ * are not all-visible (but checking that from the index code seems like
+ * a violation of layering etc).
+ *
+ * XXX This might lead to IOS being slower than plain index scan, if the
+ * table has a lot of pages that need recheck.
*/
scandesc = index_beginscan(node->ss.ss_currentRelation,
node->ioss_RelationDesc,
estate->es_snapshot,
node->ioss_NumScanKeys,
- node->ioss_NumOrderByKeys);
+ node->ioss_NumOrderByKeys,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc = scandesc;
@@ -674,7 +682,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
node->ioss_VMBuffer = InvalidBuffer;
@@ -719,7 +728,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 4540c7781d2..71ae6a47ce5 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,6 +43,7 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
/*
* When an ordering operator is used, tuples fetched from the index that
@@ -85,6 +86,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ Relation heapRel = node->ss.ss_currentRelation;
/*
* extract necessary information from index scan node
@@ -103,6 +105,22 @@ IndexNext(IndexScanState *node)
if (scandesc == NULL)
{
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -111,7 +129,9 @@ IndexNext(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -198,6 +218,23 @@ IndexNextWithReorder(IndexScanState *node)
if (scandesc == NULL)
{
+ Relation heapRel = node->ss.ss_currentRelation;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -206,7 +243,9 @@ IndexNextWithReorder(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -1678,6 +1717,21 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
{
EState *estate = node->ss.ps.state;
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Maybe reduce the value with parallel workers?
+ */
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
index_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -1690,7 +1744,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1726,6 +1782,14 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->iss_ScanDesc =
@@ -1733,7 +1797,9 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076ea..0b02b6265d0 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6218,7 +6218,7 @@ get_actual_variable_endpoint(Relation heapRel,
index_scan = index_beginscan(heapRel, indexRel,
&SnapshotNonVacuumable,
- 1, 0);
+ 1, 0, 0, 0); /* XXX maybe do prefetch? */
/* Set it up for index-only scan */
index_scan->xs_want_itup = true;
index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 4476ff7fba1..80fec7a11f9 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -160,7 +160,9 @@ typedef void (*amadjustmembers_function) (Oid opfamilyoid,
/* prepare for index scan */
typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
int nkeys,
- int norderbys);
+ int norderbys,
+ int prefetch_maximum,
+ int prefetch_reset);
/* (re)start index scan */
typedef void (*amrescan_function) (IndexScanDesc scan,
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 97ddc925b27..f17dcdffd86 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -96,7 +96,7 @@ extern bool brininsert(Relation idxRel, Datum *values, bool *nulls,
IndexUniqueCheck checkUnique,
bool indexUnchanged,
struct IndexInfo *indexInfo);
-extern IndexScanDesc brinbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc brinbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern int64 bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern void brinrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index a3087956654..6a500c5aa1f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -152,7 +152,9 @@ extern bool index_insert(Relation indexRelation,
extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys);
+ int nkeys, int norderbys,
+ int prefetch_target,
+ int prefetch_reset);
extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
Snapshot snapshot,
int nkeys);
@@ -169,7 +171,9 @@ extern void index_parallelscan_initialize(Relation heapRelation,
extern void index_parallelrescan(IndexScanDesc scan);
extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
Relation indexrel, int nkeys, int norderbys,
- ParallelIndexScanDesc pscan);
+ ParallelIndexScanDesc pscan,
+ int prefetch_target,
+ int prefetch_reset);
extern ItemPointer index_getnext_tid(IndexScanDesc scan,
ScanDirection direction);
struct TupleTableSlot;
@@ -230,4 +234,45 @@ extern HeapTuple systable_getnext_ordered(SysScanDesc sysscan,
ScanDirection direction);
extern void systable_endscan_ordered(SysScanDesc sysscan);
+
+
+void index_prefetch(IndexScanDesc scandesc, ScanDirection direction);
+
+/*
+ * XXX not sure it's the right place to define these callbacks etc.
+ */
+typedef void (*prefetcher_getrange_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int *start, int *end,
+ bool *reset);
+
+typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int index);
+
+typedef struct IndexPrefetchData
+{
+ /*
+ * XXX We need to disable this in some cases (e.g. when using index-only
+ * scans, we don't want to prefetch pages). Or maybe we should prefetch
+ * only pages that are not all-visible, that'd be even better.
+ */
+ int prefetchIndex; /* how far we already prefetched */
+ int prefetchTarget; /* how far we should be prefetching */
+ int prefetchMaxTarget; /* maximum prefetching distance */
+ int prefetchReset; /* reset to this distance on rescan */
+
+ /*
+ * a small LRU cache of recently prefetched blocks
+ *
+ * XXX needs to be tiny, to make the (frequent) searches very cheap
+ */
+ BlockNumber cacheBlocks[8];
+ int cacheIndex;
+
+ prefetcher_getblock_function get_block;
+ prefetcher_getrange_function get_range;
+
+} IndexPrefetchData;
+
#endif /* GENAM_H */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 6da64928b66..b4bd3b2e202 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -384,7 +384,7 @@ typedef struct GinScanOpaqueData
typedef GinScanOpaqueData *GinScanOpaque;
-extern IndexScanDesc ginbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc ginbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern void ginendscan(IndexScanDesc scan);
extern void ginrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 3edc740a3f3..e844a9eed84 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -176,6 +176,7 @@ typedef struct GISTScanOpaqueData
OffsetNumber curPageData; /* next item to return */
MemoryContext pageDataCxt; /* context holding the fetched tuples, for
* index-only scans */
+ bool didReset; /* reset since last access? */
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
diff --git a/src/include/access/gistscan.h b/src/include/access/gistscan.h
index 65911245f74..adf167a60b6 100644
--- a/src/include/access/gistscan.h
+++ b/src/include/access/gistscan.h
@@ -16,7 +16,7 @@
#include "access/amapi.h"
-extern IndexScanDesc gistbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc gistbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern void gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
ScanKey orderbys, int norderbys);
extern void gistendscan(IndexScanDesc scan);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9e035270a16..743192997c5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -124,6 +124,8 @@ typedef struct HashScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
+ bool didReset;
+
HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
} HashScanPosData;
@@ -370,7 +372,7 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
struct IndexInfo *indexInfo);
extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
-extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
extern void hashendscan(IndexScanDesc scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d6847860959..8d053de461b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -984,6 +984,9 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
+ /* Was the position reset/rebuilt since the last time we checked it? */
+ bool didReset;
+
BTScanPosItem items[MaxTIDsPerBTreePage]; /* MUST BE LAST */
} BTScanPosData;
@@ -1019,6 +1022,7 @@ typedef BTScanPosData *BTScanPos;
(scanpos).buf = InvalidBuffer; \
(scanpos).lsn = InvalidXLogRecPtr; \
(scanpos).nextTupleOffset = 0; \
+ (scanpos).didReset = true; \
} while (0)
/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
@@ -1127,7 +1131,7 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
IndexUniqueCheck checkUnique,
bool indexUnchanged,
struct IndexInfo *indexInfo);
-extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index d03360eac04..c119fe597d8 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -106,6 +106,12 @@ typedef struct IndexFetchTableData
Relation rel;
} IndexFetchTableData;
+/*
+ * Forward declaration, defined in genam.h.
+ */
+typedef struct IndexPrefetchData IndexPrefetchData;
+typedef struct IndexPrefetchData *IndexPrefetch;
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -162,6 +168,9 @@ typedef struct IndexScanDescData
bool *xs_orderbynulls;
bool xs_recheckorderby;
+ /* prefetching state (or NULL if disabled) */
+ IndexPrefetchData *xs_prefetch;
+
/* parallel index scan information, in shared memory */
struct ParallelIndexScanDescData *parallel_scan;
} IndexScanDescData;
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index fe31d32dbe9..e1e2635597c 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -203,7 +203,7 @@ extern bool spginsert(Relation index, Datum *values, bool *isnull,
struct IndexInfo *indexInfo);
/* spgscan.c */
-extern IndexScanDesc spgbeginscan(Relation rel, int keysz, int orderbysz);
+extern IndexScanDesc spgbeginscan(Relation rel, int keysz, int orderbysz, int prefetch_maximum, int prefetch_reset);
extern void spgendscan(IndexScanDesc scan);
extern void spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index c6ef46fc206..e00d4fc90b6 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -144,7 +144,7 @@ typedef struct SpGistTypeDesc
typedef struct SpGistState
{
Relation index; /* index we're working with */
-
+ Relation heap; /* heap the index is defined on */
spgConfigOut config; /* filled in by opclass config method */
SpGistTypeDesc attType; /* type of values to be indexed/restored */
@@ -231,6 +231,7 @@ typedef struct SpGistScanOpaqueData
bool recheckDistances[MaxIndexTuplesPerPage]; /* distance recheck
* flags */
HeapTuple reconTups[MaxIndexTuplesPerPage]; /* reconstructed tuples */
+ bool didReset; /* reset since last access? */
/* distances (for recheck) */
IndexOrderByDistance *distances[MaxIndexTuplesPerPage];
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 87e5e2183bd..97dd3c2c421 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+ int64 blks_prefetch_rounds; /* # of prefetch rounds */
+ int64 blks_prefetches; /* # of buffers prefetched */
instr_time blk_read_time; /* time spent reading blocks */
instr_time blk_write_time; /* time spent writing blocks */
instr_time temp_blk_read_time; /* time spent reading temp blocks */
On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there's many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.
I have an educated guess as to why prefetching was limited to bitmap
index scans this whole time: it might have been due to issues with
ScalarArrayOpExpr quals.
Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
"natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
were supported by both index scans and index-only scans -- not just
bitmap scans, which could handle ScalarArrayOpExpr quals even without
nbtree directly understanding them. The commit was in late 2011,
shortly after the introduction of index-only scans -- which seems to
have been the real motivation. And so it seems to me that support for
ScalarArrayOpExpr was built with bitmap scans and index-only scans in
mind. Plain index scan ScalarArrayOpExpr quals do work, but support
for them seems kinda perfunctory to me (maybe you can think of a
specific counter-example where plain index scans really benefit from
ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
original motivation).
ScalarArrayOpExpr for plain index scans don't really make that much
sense right now because there is no heap prefetching in the index scan
case, which is almost certainly going to be the major bottleneck
there. At the same time, adding useful prefetching for
ScalarArrayOpExpr execution more or less requires that you first
improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
in mind that ScalarArrayOpExpr execution (whether for bitmap index
scans or index scans) is related to skip scan/MDAM techniques -- so
there are tricky dependencies that need to be considered together.
Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
descend the B-Tree for each array constant -- even though in principle
we could avoid all that work in cases that happen to have locality. In
other words we'll often descend the tree multiple times and land on
exactly the same leaf page again and again, without ever noticing that
we could have gotten away with only descending the tree once (it'd
also be possible to start the next "descent" one level up, not at the
root, intelligently reusing some of the work from an initial descent
-- but you don't need anything so fancy to greatly improve matters
here).
This lack of smarts around how many times we call _bt_first() to
descend the index is merely a silly annoyance when it happens in
btgetbitmap(). We do at least sort and deduplicate the array up-front
(inside _bt_sort_array_elements()), so there will be significant
locality of access each time we needlessly descend the tree.
Importantly, there is no prefetching "pipeline" to mess up in the
bitmap index scan case -- since that all happens later on. Not so for
the superficially similar (though actually rather different) plain
index scan case -- at least not once you add prefetching. If you're
uselessly processing the same leaf page multiple times, then there is
no way that heap prefetching can notice that it should be batching
things up. The context that would allow prefetching to work well isn't
really available right now. So the plain index scan case is kinda at a
gratuitous disadvantage (with prefetching) relative to the bitmap
index scan case.
Queries with (say) quals with many constants appearing in an "IN()"
are both common and particularly likely to benefit from prefetching.
I'm not suggesting that you need to address this to get to a
committable patch. But you should definitely think about it now. I'm
strongly considering working on this problem for 17 anyway, so we may
end up collaborating on these aspects of prefetching. Smarter
ScalarArrayOpExpr execution for index scans is likely to be quite
compelling if it enables heap prefetching.
But there's three shortcomings in logic:
1) It's not clear the thresholds for prefetching being beneficial and
switching to bitmap index scans are the same value. And as I'll
demonstrate later, the prefetching threshold is indeed much lower
(perhaps a couple dozen matching tuples) on large tables.
As I mentioned during the pgCon unconference session, I really like
your framing of the problem; it makes a lot of sense to directly
compare an index scan's execution against a very similar bitmap index
scan execution -- there is an imaginary continuum between index scan
and bitmap index scan. If the details of when and how we scan the
index are rather similar in each case, then there is really no reason
why the performance shouldn't be fairly similar. I suspect that it
will be useful to ask the same question for various specific cases,
that you might not have thought about just yet. Things like
ScalarArrayOpExpr queries, where bitmap index scans might look like
they have a natural advantage due to an inherent need for random heap
access in the plain index scan case.
It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.
That's what I was mostly trying to get at when I talked about testing
strategy at the unconference session (this may have been unclear at
the time). It could be done in a way that helps you to think about the
problem from first principles. It could be really useful as a way of
avoiding confusing cases where plain index scan + prefetch does badly
due to implementation restrictions, versus cases where it's
*inherently* the wrong strategy. And a testing strategy that starts
with very basic ideas about what I/O is truly necessary might help you
to notice and fix regressions. The difference will never be perfectly
crisp, of course (isn't bitmap index scan basically just index scan
with a really huge prefetch buffer anyway?), but it still seems like a
useful direction to go in.
Implementation
--------------
When I started looking at this, I only really thought about btree. If
you look at BTScanPosData, which is what the index scans use to
represent the current leaf page, you'll notice it has "items", which is
the array of item pointers (TIDs) that we'll fetch from the heap. Which
is exactly the thing we need.
So I ended up moving most of the prefetching logic up into indexam.c,
see the index_prefetch() function. It can't be entirely separate,
because each AM represents the current state in a different way (e.g.
SpGistScanOpaque and BTScanOpaque are very different).
Maybe you were right to do that, but I'm not entirely sure.
Bear in mind that the ScalarArrayOpExpr case already looks like a
single index scan whose qual involves an array to the executor, even
though nbtree more or less implements it as multiple index scans with
plain constant quals (one per unique-ified array element). Index scans
whose results can be "OR'd together". Is that a modularity violation?
And if so, why? As I've pointed out earlier in this email, we don't do
very much with that context right now -- but clearly we should.
In other words, maybe you're right to suspect that doing this in AMs
like nbtree is a modularity violation. OTOH, maybe it'll turn out that
that's exactly the right place to do it, because that's the only way
to make the full context available in one place. I myself struggled
with this when I reviewed the skip scan patch. I was sure that Tom
wouldn't like the way that the skip-scan patch doubles-down on adding
more intelligence/planning around how to execute queries with
skippable leading columns. But, it turned out that he saw the merit in
it, and basically accepted that general approach. Maybe this will turn
out to be a little like that situation, where (counter to intuition)
what you really need to do is add a new "layering violation".
Sometimes that's the only thing that'll allow the information to flow
to the right place. It's tricky.
4) per-leaf prefetching
The code is restricted to only prefetch items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
I tend to agree that this sort of thing doesn't need to happen in the
first committed version. But FWIW nbtree could be taught to scan
multiple index pages and act as if it had just processed them as one
single index page -- up to a point. This is at least possible with
plain index scans that use MVCC snapshots (though not index-only
scans), since we already drop the pin on the leaf page there anyway.
AFAICT nothing stops us from teaching nbtree to "lie" to the executor and tell
it that we processed 1 leaf page, even though it was actually 5 leaf pages
(maybe there would also have to be restrictions for the markpos stuff).
the results look a bit different:
rows bitmapscan master patched seqscan
1 52703.9 19.5 19.5 31145.6
10 51208.1 22.7 24.7 30983.5
100 49038.6 39.0 26.3 32085.3
1000 53760.4 193.9 48.4 31479.4
10000 56898.4 1600.7 187.5 32064.5
100000 50975.2 15978.7 1848.9 31587.1
This is a good illustration of a query where bitmapscan is terrible
(much worse than seqscan, in fact), and the patch is a massive
improvement over master (about an order of magnitude).
Of course, if you only scan a couple rows, the benefits are much more
modest (say 40% for 100 rows, which is still significant).
Nice! And, it'll be nice to be able to use the kill_prior_tuple
optimization in many more cases (possible by teaching the optimizer to
favor index scans over bitmap index scans more often).
--
Peter Geoghegan
On 6/8/23 20:56, Peter Geoghegan wrote:
On Thu, Jun 8, 2023 at 8:40 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there's many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.
I have an educated guess as to why prefetching was limited to bitmap
index scans this whole time: it might have been due to issues with
ScalarArrayOpExpr quals.
Commit 9e8da0f757 taught nbtree to deal with ScalarArrayOpExpr quals
"natively". This meant that "indexedcol op ANY(ARRAY[...])" conditions
were supported by both index scans and index-only scans -- not just
bitmap scans, which could handle ScalarArrayOpExpr quals even without
nbtree directly understanding them. The commit was in late 2011,
shortly after the introduction of index-only scans -- which seems to
have been the real motivation. And so it seems to me that support for
ScalarArrayOpExpr was built with bitmap scans and index-only scans in
mind. Plain index scan ScalarArrayOpExpr quals do work, but support
for them seems kinda perfunctory to me (maybe you can think of a
specific counter-example where plain index scans really benefit from
ScalarArrayOpExpr, but that doesn't seem particularly relevant to the
original motivation).
I don't think SAOP is the reason. I did a bit of digging in the list
archives, and found thread [1], which says:
Regardless of what mechanism is used and who is responsible for
doing it someone is going to have to figure out which blocks are
specifically interesting to prefetch. Bitmap index scans happen
to be the easiest since we've already built up a list of blocks
we plan to read. Somehow that information has to be pushed to the
storage manager to be acted upon.
Normal index scans are an even more interesting case but I'm not
sure how hard it would be to get that information. It may only be
convenient to get the blocks from the last leaf page we looked at,
for example.
So this suggests we simply started prefetching for the case where the
information was readily available, and it'd be harder to do for index
scans so that's it.
There's a couple more ~2008 threads mentioning prefetching, bitmap scans
and even regular index scans (like [2]). None of them even mentions SAOP
stuff at all.
[1]: /messages/by-id/871wa17vxb.fsf@oxford.xeocode.com
[2]: /messages/by-id/87wsnnz046.fsf@oxford.xeocode.com
ScalarArrayOpExpr for plain index scans don't really make that much
sense right now because there is no heap prefetching in the index scan
case, which is almost certainly going to be the major bottleneck
there. At the same time, adding useful prefetching for
ScalarArrayOpExpr execution more or less requires that you first
improve how nbtree executes ScalarArrayOpExpr quals in general. Bear
in mind that ScalarArrayOpExpr execution (whether for bitmap index
scans or index scans) is related to skip scan/MDAM techniques -- so
there are tricky dependencies that need to be considered together.
Right now, nbtree ScalarArrayOpExpr execution must call _bt_first() to
descend the B-Tree for each array constant -- even though in principle
we could avoid all that work in cases that happen to have locality. In
other words we'll often descend the tree multiple times and land on
exactly the same leaf page again and again, without ever noticing that
we could have gotten away with only descending the tree once (it'd
also be possible to start the next "descent" one level up, not at the
root, intelligently reusing some of the work from an initial descent
-- but you don't need anything so fancy to greatly improve matters
here).
This lack of smarts around how many times we call _bt_first() to
descend the index is merely a silly annoyance when it happens in
btgetbitmap(). We do at least sort and deduplicate the array up-front
(inside _bt_sort_array_elements()), so there will be significant
locality of access each time we needlessly descend the tree.
Importantly, there is no prefetching "pipeline" to mess up in the
bitmap index scan case -- since that all happens later on. Not so for
the superficially similar (though actually rather different) plain
index scan case -- at least not once you add prefetching. If you're
uselessly processing the same leaf page multiple times, then there is
no way that heap prefetching can notice that it should be batching
things up. The context that would allow prefetching to work well isn't
really available right now. So the plain index scan case is kinda at a
gratuitous disadvantage (with prefetching) relative to the bitmap
index scan case.
Queries with (say) quals with many constants appearing in an "IN()"
are both common and particularly likely to benefit from prefetching.
I'm not suggesting that you need to address this to get to a
committable patch. But you should definitely think about it now. I'm
strongly considering working on this problem for 17 anyway, so we may
end up collaborating on these aspects of prefetching. Smarter
ScalarArrayOpExpr execution for index scans is likely to be quite
compelling if it enables heap prefetching.
Even if SAOP (probably) wasn't the reason, I think you're right it may
be an issue for prefetching, causing regressions. It didn't occur to me
before, because I'm not that familiar with the btree code and/or how it
deals with SAOP (and didn't really intend to study it too deeply).
So if you're planning to work on this for PG17, collaborating on it
would be great.
For now I plan to just ignore SAOP, or maybe just disable prefetching
for SAOP index scans if it proves to be prone to regressions. That's not
great, but at least it won't make matters worse.
But there's three shortcomings in logic:
1) It's not clear the thresholds for prefetching being beneficial and
switching to bitmap index scans are the same value. And as I'll
demonstrate later, the prefetching threshold is indeed much lower
(perhaps a couple dozen matching tuples) on large tables.
As I mentioned during the pgCon unconference session, I really like
your framing of the problem; it makes a lot of sense to directly
compare an index scan's execution against a very similar bitmap index
scan execution -- there is an imaginary continuum between index scan
and bitmap index scan. If the details of when and how we scan the
index are rather similar in each case, then there is really no reason
why the performance shouldn't be fairly similar. I suspect that it
will be useful to ask the same question for various specific cases,
that you might not have thought about just yet. Things like
ScalarArrayOpExpr queries, where bitmap index scans might look like
they have a natural advantage due to an inherent need for random heap
access in the plain index scan case.
Yeah, although all the tests were done with a random table generated
like this:
insert into btree_test select $d * random(), md5(i::text)
from generate_series(1, $ROWS) s(i)
So it's damn random anyway. Although maybe it's random even for the
bitmap case, so maybe if the SAOP had some sort of locality, that'd be
an advantage for the bitmap scan. But what would such a table look like?
I guess something like this might be a "nice" bad case:
insert into btree_test select mod(i,100000), md5(i::text)
from generate_series(1, $ROWS) s(i)
select * from btree_test where a in (999, 1000, 1001, 1002)
The values are likely colocated on the same heap page, the bitmap scan
is going to do a single prefetch. With index scan we'll prefetch them
repeatedly. I'll give it a try.
It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.
I do agree, but what do you mean by "assessing"? Wasn't the agreement at
the unconference session that we'd not tweak costing? So ultimately, this
does not really affect which scan type we pick. We'll keep doing the
same planning decisions as today, no?
If we pick index scan and enable prefetching, causing a regression (e.g.
for the SAOP with locality), that'd be bad. But how is that related to
viability of index scans over bitmap index scans?
That's what I was mostly trying to get at when I talked about testing
strategy at the unconference session (this may have been unclear at
the time). It could be done in a way that helps you to think about the
problem from first principles. It could be really useful as a way of
avoiding confusing cases where plain index scan + prefetch does badly
due to implementation restrictions, versus cases where it's
*inherently* the wrong strategy. And a testing strategy that starts
with very basic ideas about what I/O is truly necessary might help you
to notice and fix regressions. The difference will never be perfectly
crisp, of course (isn't bitmap index scan basically just index scan
with a really huge prefetch buffer anyway?), but it still seems like a
useful direction to go in.
I'm all for building a more comprehensive set of test cases - the stuff
presented at pgcon was good for demonstration, but it certainly is not
enough for testing. The SAOP queries are a great addition, I also plan
to run those queries on different (less random) data sets, etc. We'll
probably discover more interesting cases as the patch improves.
Implementation
--------------
When I started looking at this, I only really thought about btree. If
you look at BTScanPosData, which is what the index scans use to
represent the current leaf page, you'll notice it has "items", which is
the array of item pointers (TIDs) that we'll fetch from the heap. Which
is exactly the thing we need.
So I ended up moving most of the prefetching logic up into indexam.c,
see the index_prefetch() function. It can't be entirely separate,
because each AM represents the current state in a different way (e.g.
SpGistScanOpaque and BTScanOpaque are very different).
Maybe you were right to do that, but I'm not entirely sure.
Bear in mind that the ScalarArrayOpExpr case already looks like a
single index scan whose qual involves an array to the executor, even
though nbtree more or less implements it as multiple index scans with
plain constant quals (one per unique-ified array element). Index scans
whose results can be "OR'd together". Is that a modularity violation?
And if so, why? As I've pointed out earlier in this email, we don't do
very much with that context right now -- but clearly we should.
In other words, maybe you're right to suspect that doing this in AMs
like nbtree is a modularity violation. OTOH, maybe it'll turn out that
that's exactly the right place to do it, because that's the only way
to make the full context available in one place. I myself struggled
with this when I reviewed the skip scan patch. I was sure that Tom
wouldn't like the way that the skip-scan patch doubles-down on adding
more intelligence/planning around how to execute queries with
skippable leading columns. But, it turned out that he saw the merit in
it, and basically accepted that general approach. Maybe this will turn
out to be a little like that situation, where (counter to intuition)
what you really need to do is add a new "layering violation".
Sometimes that's the only thing that'll allow the information to flow
to the right place. It's tricky.
There are two reasons why I think AM is not the right place:
- accessing table from index code seems backwards
- we already do prefetching from the executor (nodeBitmapHeapscan.c)
It feels kinda wrong in hindsight.
4) per-leaf prefetching
The code is restricted to only prefetch items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
I tend to agree that this sort of thing doesn't need to happen in the
first committed version. But FWIW nbtree could be taught to scan
multiple index pages and act as if it had just processed them as one
single index page -- up to a point. This is at least possible with
plain index scans that use MVCC snapshots (though not index-only
scans), since we already drop the pin on the leaf page there anyway.
AFAICT nothing stops us from teaching nbtree to "lie" to the executor and tell
it that we processed 1 leaf page, even though it was actually 5 leaf pages
(maybe there would also have to be restrictions for the markpos stuff).
Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
to do that. But it seems like work for future someone.
the results look a bit different:
rows bitmapscan master patched seqscan
1 52703.9 19.5 19.5 31145.6
10 51208.1 22.7 24.7 30983.5
100 49038.6 39.0 26.3 32085.3
1000 53760.4 193.9 48.4 31479.4
10000 56898.4 1600.7 187.5 32064.5
100000 50975.2 15978.7 1848.9 31587.1
This is a good illustration of a query where bitmapscan is terrible
(much worse than seqscan, in fact), and the patch is a massive
improvement over master (about an order of magnitude).
Of course, if you only scan a couple rows, the benefits are much more
modest (say 40% for 100 rows, which is still significant).
Nice! And, it'll be nice to be able to use the kill_prior_tuple
optimization in many more cases (possible by teaching the optimizer to
favor index scans over bitmap index scans more often).
Right, I forgot to mention that benefit. Although, that'd only happen if
we actually choose index scans in more places, which I guess would
require tweaking the costing model ...
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Normal index scans are an even more interesting case but I'm not
sure how hard it would be to get that information. It may only be
convenient to get the blocks from the last leaf page we looked at,
for example.
So this suggests we simply started prefetching for the case where the
information was readily available, and it'd be harder to do for index
scans so that's it.
What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.
Even if SAOP (probably) wasn't the reason, I think you're right it may
be an issue for prefetching, causing regressions. It didn't occur to me
before, because I'm not that familiar with the btree code and/or how it
deals with SAOP (and didn't really intend to study it too deeply).
I'm pretty sure that you understand this already, but just in case:
ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
page" in many important cases. Not really -- not in the sense that
you'd hope and expect. We're senselessly processing the same index
leaf page multiple times and treating it as a different, independent
leaf page. That makes heap prefetching of the kind you're working on
utterly hopeless, since it effectively throws away lots of useful
context. Obviously that's the fault of nbtree ScalarArrayOpExpr
handling, not the fault of your patch.
So if you're planning to work on this for PG17, collaborating on it
would be great.
For now I plan to just ignore SAOP, or maybe just disable prefetching
for SAOP index scans if it proves to be prone to regressions. That's not
great, but at least it won't make matters worse.
Makes sense, but I hope that it won't come to that.
IMV it's actually quite reasonable that you didn't expect to have to
think about ScalarArrayOpExpr at all -- it would make a lot of sense
if that was already true. But the fact is that it works in a way
that's pretty silly and naive right now, which will impact
prefetching. I wasn't really thinking about regressions, though. I was
actually more concerned about missing opportunities to get the most
out of prefetching. ScalarArrayOpExpr really matters here.
I guess something like this might be a "nice" bad case:
insert into btree_test select mod(i,100000), md5(i::text)
from generate_series(1, $ROWS) s(i)
select * from btree_test where a in (999, 1000, 1001, 1002)
The values are likely colocated on the same heap page, the bitmap scan
is going to do a single prefetch. With index scan we'll prefetch them
repeatedly. I'll give it a try.
This is the sort of thing that I was thinking of. What are the
conditions under which bitmap index scan starts to make sense? Why is
the break-even point whatever it is in each case, roughly? And, is it
actually because of laws-of-physics level trade-off? Might it not be
due to implementation-level issues that are much less fundamental? In
other words, might it actually be that we're just doing something
stoopid in the case of plain index scans? Something that is just
papered-over by bitmap index scans right now?
I see that your patch has logic that avoids repeated prefetching of
the same block -- plus you have comments that wonder about going
further by adding a "small lru array" in your new index_prefetch()
function. I asked you about this during the unconference presentation.
But I think that my understanding of the situation was slightly
different to yours. That's relevant here.
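As a minimal sketch, assuming the cacheBlocks[]/cacheIndex fields the patch
adds to IndexPrefetchData, a hypothetical helper along these lines could
answer "was this block prefetched recently?" -- it is really a tiny FIFO
ring rather than a true LRU, which is probably fine at this size:

/*
 * Hypothetical helper (not part of the patch): return true if the block
 * was prefetched recently, otherwise remember it and return false.
 */
static bool
index_prefetch_is_recent(IndexPrefetchData *prefetch, BlockNumber block)
{
	/* linear search is cheap because the array is deliberately tiny */
	for (int i = 0; i < (int) lengthof(prefetch->cacheBlocks); i++)
	{
		if (prefetch->cacheBlocks[i] == block)
			return true;		/* prefetched recently, skip it */
	}

	/* not found - remember it, overwriting the oldest slot */
	prefetch->cacheBlocks[prefetch->cacheIndex] = block;
	prefetch->cacheIndex = (prefetch->cacheIndex + 1) % lengthof(prefetch->cacheBlocks);

	return false;
}

index_prefetch() would then issue PrefetchBuffer() only when this returns false.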
I wonder if you should go further than this, by actually sorting the
items that you need to fetch as part of processing a given leaf page
(I said this at the unconference, you may recall). Why should we
*ever* pin/access the same heap page more than once per leaf page
processed per index scan? Nothing stops us from returning the tuples
to the executor in the original logical/index-wise order, despite
having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin
traffic that isn't truly necessary). We can sort the heap TIDs in
scratch memory, then do our actual prefetching + heap access, and then
restore the original order before returning anything.
This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and help a lot with some
kinds of queries. It might even be simpler than avoiding excessive
prefetching using an LRU cache thing.
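To make that a bit more concrete, here is a rough sketch of the kind of thing
I mean for nbtree -- the function and comparator are hypothetical, and a real
version would at least need to handle backward scans and the markpos
machinery:

#include "postgres.h"
#include "access/nbtree.h"
#include "storage/bufmgr.h"

/* order TIDs by heap block number (hypothetical comparator) */
static int
heap_block_cmp(const void *a, const void *b)
{
	BlockNumber ba = ItemPointerGetBlockNumber((const ItemPointerData *) a);
	BlockNumber bb = ItemPointerGetBlockNumber((const ItemPointerData *) b);

	if (ba < bb)
		return -1;
	if (ba > bb)
		return 1;
	return 0;
}

/*
 * Hypothetical sketch: prefetch each distinct heap block referenced by the
 * current leaf page exactly once, in block order, while leaving items[] in
 * its original index-wise order for the executor.
 */
static void
prefetch_leaf_heap_blocks(Relation heapRel, BTScanPos pos)
{
	ItemPointerData tids[MaxTIDsPerBTreePage];
	int			ntids = pos->lastItem - pos->firstItem + 1;
	BlockNumber prev = InvalidBlockNumber;

	/* copy the TIDs so the original ordering of items[] stays intact */
	for (int i = 0; i < ntids; i++)
		tids[i] = pos->items[pos->firstItem + i].heapTid;

	qsort(tids, ntids, sizeof(ItemPointerData), heap_block_cmp);

	for (int i = 0; i < ntids; i++)
	{
		BlockNumber block = ItemPointerGetBlockNumber(&tids[i]);

		if (block != prev)		/* each distinct heap block only once */
			PrefetchBuffer(heapRel, MAIN_FORKNUM, block);
		prev = block;
	}
}

Restoring the original order is then free, since only the scratch copy was
sorted; actually batching the heap accesses themselves (the pin/unpin part)
would of course take more work than this.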
I'm talking about problems that exist today, without your patch.
I'll show a concrete example of the kind of index/index scan that
might be affected.
Attached is an extract of the server log when the regression tests ran
against a server patched to show custom instrumentation. The log
output shows exactly what's going on with one particular nbtree
opportunistic deletion (my point has nothing to do with deletion, but
it happens to be convenient to make my point in this fashion). This
specific example involves deletion of tuples from the system catalog
index "pg_type_typname_nsp_index". There is nothing very atypical
about it; it just shows a certain kind of heap fragmentation that's
probably very common.
Imagine a plain index scan involving a query along the lines of
"select * from pg_type where typname like 'part%' ", or similar. This
query runs an instant before the example LD_DEAD-bit-driven
opportunistic deletion (a "simple deletion" in nbtree parlance) took
place. You'll be able to piece together from the log output that there
would only be about 4 heap blocks involved with such a query. Ideally,
our hypothetical index scan would pin each buffer/heap page exactly
once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all,
we're talking about a fairly selective query here, that only needs to
scan precisely one leaf page (I verified this part too) -- so why
wouldn't we expect "index scan parity"?
While there is significant clustering on this example leaf page/key
space, heap TID is not *perfectly* correlated with the
logical/keyspace order of the index -- which can have outsized
consequences. Notice that some heap blocks are non-contiguous
relative to logical/keyspace/index scan/index page offset number order.
We'll end up pinning each of the 4 or so heap pages more than once
(sometimes several times each), when in principle we could have pinned
each heap page exactly once. In other words, there is way too much of
a difference between the case where the tuples we scan are *almost*
perfectly clustered (which is what you see in my example) and the case
where they're exactly perfectly clustered. In other other words, there
is way too much of a difference between plain index scan, and bitmap
index scan.
(What I'm saying here is only true because this is a composite index
and our query uses "like", returning rows that match a prefix -- if our
index was on the column "typname" alone and we used a simple equality
condition in our query then the Postgres 12 nbtree work would be
enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect
that there are still relatively many important cases where we perform
extra PinBuffer()/UnpinBuffer() calls during plain index scans that
only touch one leaf page anyway.)
Obviously we should expect bitmap index scans to have a natural
advantage over plain index scans whenever there is little or no
correlation -- that's clear. But that's not what we see here -- we're
way too sensitive to minor imperfections in clustering that are
naturally present on some kinds of leaf pages. The potential
difference in pin/unpin traffic (relative to the bitmap index scan
case) seems pathological to me. Ideally, we wouldn't have these kinds
of differences at all. It's going to disrupt usage_count on the
buffers.
It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.
I do agree, but what do you mean by "assessing"?
I mean performance validation. There ought to be a theoretical model
that describes the relationship between index scan and bitmap index
scan, that has actual predictive power in the real world, across a
variety of different cases. Something that isn't sensitive to the
current phase of the moon (e.g., heap fragmentation along the lines of
my pg_type_typname_nsp_index log output). I particularly want to avoid
nasty discontinuities that really make no sense.
Wasn't the agreement at
the unconference session that we'd not tweak costing? So ultimately, this
does not really affect which scan type we pick. We'll keep doing the
same planning decisions as today, no?
I'm not really talking about tweaking the costing. What I'm saying is
that we really should expect index scans to behave similarly to bitmap
index scans at runtime, for queries that really don't have much to
gain from using a bitmap heap scan (queries that may or may not also
benefit from prefetching). There are several reasons why this makes
sense to me.
One reason is that it makes tweaking the actual costing easier later
on. Also, your point about plan robustness was a good one. If we make
the wrong choice about index scan vs bitmap index scan, and the
consequences aren't so bad, that's a very useful enhancement in
itself.
The most important reason of all may just be to build confidence in
the design. I'm interested in understanding when and how prefetching
stops helping.
I'm all for building a more comprehensive set of test cases - the stuff
presented at pgcon was good for demonstration, but it certainly is not
enough for testing. The SAOP queries are a great addition, I also plan
to run those queries on different (less random) data sets, etc. We'll
probably discover more interesting cases as the patch improves.
Definitely.
There are two reasons why I think AM is not the right place:
- accessing table from index code seems backwards
- we already do prefetching from the executor (nodeBitmapHeapscan.c)
It feels kinda wrong in hindsight.
I'm willing to accept that we should do it the way you've done it in
the patch provisionally. It's complicated enough that it feels like I
should reserve the right to change my mind.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
to do that. But it seems like work for future someone.
Right. You probably noticed that this is another case where we'd be
making index scans behave more like bitmap index scans (perhaps even
including the downsides for kill_prior_tuple that accompany not
processing each leaf page inline). There is probably a point where
that ceases to be sensible, but I don't know what that point is.
They're way more similar than we seem to imagine.
--
Peter Geoghegan
Hi,
On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.
I'm really excited about this work.
1) pairing-heap in GiST / SP-GiST
For most AMs, the index state is pretty trivial - matching items from a
single leaf page. Prefetching that is pretty trivial, even if the
current API is a bit cumbersome.
Distance queries on GiST and SP-GiST are a problem, though, because
those do not just read the pointers into a simple array, as the distance
ordering requires passing stuff through a pairing-heap :-(
I don't know how to best deal with that, especially not in the simple
API. I don't think we can "scan forward" stuff from the pairing heap, so
the only idea I have is actually having two pairing-heaps. Or maybe
using the pairing heap for prefetching, but stashing the prefetched
pointers into an array and then returning stuff from it.
In the patch I simply prefetch items before we add them to the pairing
heap, which is good enough for demonstrating the benefits.
I think it'd be perfectly fair to just not tackle distance queries for now.
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?
Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.
One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...
4) per-leaf prefetching
The code is restricted to only prefetch items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.
5) index-only scans
I'm not sure what to do about index-only scans. On the one hand, the
point of IOS is not to read stuff from the heap at all, so why prefetch
it. OTOH if there are many allvisible=false pages, we still have to
access that. And if that happens, this leads to the bizarre situation
that IOS is slower than regular index scan. But to address this, we'd
have to consider the visibility during prefetching.
That should be easy to do, right?
Benchmark / TPC-H
-----------------
I ran the 22 queries on 100GB data set, with parallel query either
disabled or enabled. And I measured timing (and speedup) for each query.
The speedup results look like this (see the attached PDF for details):
query serial parallel
1 101% 99%
2 119% 100%
3 100% 99%
4 101% 100%
5 101% 100%
6 12% 99%
7 100% 100%
8 52% 67%
10 102% 101%
11 100% 72%
12 101% 100%
13 100% 101%
14 13% 100%
15 101% 100%
16 99% 99%
17 95% 101%
18 101% 106%
19 30% 40%
20 99% 100%
21 101% 100%
22 101% 107%
The percentage is (timing patched / master, so <100% means faster, >100%
means slower).
The different queries are affected depending on the query plan - many
queries are close to 100%, which means "no difference". For the serial
case, there are about 4 queries that improved a lot (6, 8, 14, 19),
while for the parallel case the benefits are somewhat less significant.
My explanation is that either (a) parallel case used a different plan
with fewer index scans or (b) the parallel query does more concurrent
I/O simply by using parallel workers. Or maybe both.
There are a couple regressions too, I believe those are due to doing too
much prefetching in some cases, and some of the heuristics mentioned
earlier should eliminate most of this, I think.
I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching gains in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).
Greetings,
Andres Freund
On Thu, Jun 8, 2023 at 4:38 PM Peter Geoghegan <pg@bowt.ie> wrote:
This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and help a lot with some
kinds of queries. It might even be simpler than avoiding excessive
prefetching using an LRU cache thing.
I'll now give a simpler (though less realistic) example of a case
where "mini bitmap index scan" would be expected to help index scans
in general, and prefetching during index scans in particular.
Something very simple:
create table bitmap_parity_test(randkey int4, filler text);
create index on bitmap_parity_test (randkey);
insert into bitmap_parity_test select (random()*1000),
repeat('filler',10) from generate_series(1,250) i;
This gives me a table with 4 pages, and an index with 2 pages.
The following query selects about half of the rows from the table:
select * from bitmap_parity_test where randkey < 500;
If I force the query to use a bitmap index scan, I see that the total
number of buffers hit is exactly as expected (according to
EXPLAIN(ANALYZE,BUFFERS), that is): there are 5 buffers/pages hit. We
need to access every single heap page once, and we need to access the
only leaf page in the index once.
I'm sure that you know where I'm going with this already. I'll force
the same query to use a plain index scan, and get a very different
result. Now EXPLAIN(ANALYZE,BUFFERS) shows that there are a total of
89 buffers hit -- 88 of which must just be the same 5 heap pages,
again and again. That's just silly. It's probably not all that much
slower, but it's not helping things. And it's likely that this effect
interferes with the prefetching in your patch.
Obviously you can come up with a variant of this test case where
bitmap index scan does way fewer buffer accesses in a way that really
makes sense -- that's not in question. This is a fairly selective
index scan, since it only touches one index page -- and yet we still
see this difference.
(Anybody pedantic enough to want to dispute whether or not this index
scan counts as "selective" should run "insert into bitmap_parity_test
select i, repeat('actshually',10) from generate_series(2000,1e5) i"
before running the "randkey < 500" query, which will make the index
much larger without changing any of the details of how the query pins
pages -- non-pedants should just skip that step.)
--
Peter Geoghegan
On 6/9/23 02:06, Andres Freund wrote:
Hi,
On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.
I'm really excited about this work.
1) pairing-heap in GiST / SP-GiST
For most AMs, the index state is pretty trivial - matching items from a
single leaf page. Prefetching that is pretty trivial, even if the
current API is a bit cumbersome.
Distance queries on GiST and SP-GiST are a problem, though, because
those do not just read the pointers into a simple array, as the distance
ordering requires passing stuff through a pairing-heap :-(
I don't know how to best deal with that, especially not in the simple
API. I don't think we can "scan forward" stuff from the pairing heap, so
the only idea I have is actually having two pairing-heaps. Or maybe
using the pairing heap for prefetching, but stashing the prefetched
pointers into an array and then returning stuff from it.
In the patch I simply prefetch items before we add them to the pairing
heap, which is good enough for demonstrating the benefits.
I think it'd be perfectly fair to just not tackle distance queries for now.
My concern is that if we cut this from v0 entirely, we'll end up with an
API that'll not be suitable for adding distance queries later.
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?
Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.
One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...
Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?
4) per-leaf prefetching
The code is restricted to only prefetch items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.
Maybe. I'm open to that idea if you have an idea how to shape the API to
make this possible (although perhaps not in v0).
5) index-only scans
I'm not sure what to do about index-only scans. On the one hand, the
point of IOS is not to read stuff from the heap at all, so why prefetch
it. OTOH if there are many allvisible=false pages, we still have to
access that. And if that happens, this leads to the bizarre situation
that IOS is slower than regular index scan. But to address this, we'd
have to consider the visibility during prefetching.
That should be easy to do, right?
It doesn't seem particularly complicated (famous last words), and we
need to do the VM checks anyway so it seems like it wouldn't add a lot
of overhead either.
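For illustration, a rough sketch (not actual patch code) of what such a
check could look like - only issue the prefetch when the heap page is not
all-visible, reusing a VM buffer the way index-only scans already do:

#include "postgres.h"
#include "access/genam.h"
#include "access/relscan.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

/*
 * Hypothetical helper: prefetch a heap block for an index-only scan only
 * when it is not all-visible, since the IOS won't read all-visible pages
 * from the heap anyway. The caller is assumed to track vmbuffer the same
 * way nodeIndexonlyscan.c does for its visibility checks.
 */
static void
ios_prefetch_block(IndexScanDesc scan, BlockNumber block, Buffer *vmbuffer)
{
    /* all-visible heap pages won't be fetched by the IOS, so skip them */
    if (VM_ALL_VISIBLE(scan->heapRelation, block, vmbuffer))
        return;

    PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
}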
Benchmark / TPC-H
-----------------
I ran the 22 queries on a 100GB data set, with parallel query either
disabled or enabled. And I measured timing (and speedup) for each query.
The speedup results look like this (see the attached PDF for details):
query serial parallel
1 101% 99%
2 119% 100%
3 100% 99%
4 101% 100%
5 101% 100%
6 12% 99%
7 100% 100%
8 52% 67%
10 102% 101%
11 100% 72%
12 101% 100%
13 100% 101%
14 13% 100%
15 101% 100%
16 99% 99%
17 95% 101%
18 101% 106%
19 30% 40%
20 99% 100%
21 101% 100%
22 101% 107%
The percentage is (timing patched / timing master), so <100% means faster
and >100% means slower.
The different queries are affected depending on the query plan - many
queries are close to 100%, which means "no difference". For the serial
case, there are about 4 queries that improved a lot (6, 8, 14, 19),
while for the parallel case the benefits are somewhat less significant.
My explanation is that either (a) the parallel case used a different plan
with fewer index scans or (b) the parallel query does more concurrent
I/O simply by using parallel workers. Or maybe both.
There are a couple regressions too, I believe those are due to doing too
much prefetching in some cases, and some of the heuristics mentioned
earlier should eliminate most of this, I think.
I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?
I forgot to explain what "cached" means in the TPC-H case. It means
second execution of the query, so you can imagine it like this:
for q in `seq 1 22`; do
1. drop caches and restart postgres
2. run query $q -> uncached
3. run query $q -> cached
done
So the second execution has a chance of having data in memory - but
maybe not all, because this is a 100GB data set (so ~200GB after
loading), but the machine only has 64GB of RAM.
I think a likely explanation is some of the data wasn't actually in
memory, so prefetching still did something.
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).
That's certainly a good idea. I'll do that in the next round of tests. I
also plan to do a test on data set that fits into RAM, to test "properly
cached" case.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 6/9/23 01:38, Peter Geoghegan wrote:
On Thu, Jun 8, 2023 at 3:17 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
Normal index scans are an even more interesting case but I'm not
sure how hard it would be to get that information. It may only be
convenient to get the blocks from the last leaf page we looked at,
for example.
So this suggests we simply started prefetching for the case where the
information was readily available, and it'd be harder to do for index
scans so that's it.
What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.
Did you mean parallel index scan or bitmap index scan?
But yeah, I get the point that SAOP queries are an interesting example
of queries to explore. I'll add some to the next round of tests.
Even if SAOP (probably) wasn't the reason, I think you're right it may
be an issue for prefetching, causing regressions. It didn't occur to me
before, because I'm not that familiar with the btree code and/or how it
deals with SAOP (and didn't really intend to study it too deeply).
I'm pretty sure that you understand this already, but just in case:
ScalarArrayOpExpr doesn't even "get the blocks from the last leaf
page" in many important cases. Not really -- not in the sense that
you'd hope and expect. We're senselessly processing the same index
leaf page multiple times and treating it as a different, independent
leaf page. That makes heap prefetching of the kind you're working on
utterly hopeless, since it effectively throws away lots of useful
context. Obviously that's the fault of nbtree ScalarArrayOpExpr
handling, not the fault of your patch.
I think I understand, although maybe my mental model is wrong. I agree
it seems inefficient, but I'm not sure why it would make prefetching
hopeless. Sure, it puts index scans at a disadvantage (compared to
bitmap scans), but if we pick an index scan it should still be an
improvement, right?
I guess I need to do some testing on a range of data sets / queries, and
see how it works in practice.
So if you're planning to work on this for PG17, collaborating on it
would be great.
For now I plan to just ignore SAOP, or maybe just disable prefetching
for SAOP index scans if it proves to be prone to regressions. That's not
great, but at least it won't make matters worse.
Makes sense, but I hope that it won't come to that.
IMV it's actually quite reasonable that you didn't expect to have to
think about ScalarArrayOpExpr at all -- it would make a lot of sense
if that was already true. But the fact is that it works in a way
that's pretty silly and naive right now, which will impact
prefetching. I wasn't really thinking about regressions, though. I was
actually more concerned about missing opportunities to get the most
out of prefetching. ScalarArrayOpExpr really matters here.
OK
I guess something like this might be a "nice" bad case:
insert into btree_test select mod(i,100000), md5(i::text)
from generate_series(1, $ROWS) s(i);
select * from btree_test where a in (999, 1000, 1001, 1002);
The values are likely colocated on the same heap page, so the bitmap scan
is going to do a single prefetch. With an index scan we'll prefetch them
repeatedly. I'll give it a try.
This is the sort of thing that I was thinking of. What are the
conditions under which bitmap index scan starts to make sense? Why is
the break-even point whatever it is in each case, roughly? And, is it
actually because of laws-of-physics level trade-off? Might it not be
due to implementation-level issues that are much less fundamental? In
other words, might it actually be that we're just doing something
stoopid in the case of plain index scans? Something that is just
papered-over by bitmap index scans right now?
Yeah, that's partially why I do this kind of testing on a wide range of
synthetic data sets - to find cases that behave in an unexpected way (say,
seem like they should improve but don't).
I see that your patch has logic that avoids repeated prefetching of
the same block -- plus you have comments that wonder about going
further by adding a "small lru array" in your new index_prefetch()
function. I asked you about this during the unconference presentation.
But I think that my understanding of the situation was slightly
different to yours. That's relevant here.
I wonder if you should go further than this, by actually sorting the
items that you need to fetch as part of processing a given leaf page
(I said this at the unconference, you may recall). Why should we
*ever* pin/access the same heap page more than once per leaf page
processed per index scan? Nothing stops us from returning the tuples
to the executor in the original logical/index-wise order, despite
having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin
traffic that isn't truly necessary). We can sort the heap TIDs in
scratch memory, then do our actual prefetching + heap access, and then
restore the original order before returning anything.
I think that's possible, and I thought about that a bit (not just for
btree, but especially for the distance queries on GiST). But I don't
have a good idea if this would be 1% or 50% improvement, and I was
concerned it might easily lead to regressions if we don't actually need
all the tuples.
I mean, imagine we have TIDs
[T1, T2, T3, T4, T5, T6]
Maybe T1, T5, T6 are from the same page, so per your proposal we might
reorder and prefetch them in this order:
[T1, T5, T6, T2, T3, T4]
But maybe we only need [T1, T2] because of a LIMIT, and the extra work
we did on processing T5, T6 is wasted.
This is conceptually a "mini bitmap index scan", though one that takes
place "inside" a plain index scan, as it processes one particular leaf
page. That's the kind of design that "plain index scan vs bitmap index
scan as a continuum" leads me to (a little like the continuum between
nested loop joins, block nested loop joins, and merge joins). I bet it
would be practical to do things this way, and help a lot with some
kinds of queries. It might even be simpler than avoiding excessive
prefetching using an LRU cache thing.
I'm talking about problems that exist today, without your patch.
I'll show a concrete example of the kind of index/index scan that
might be affected.
Attached is an extract of the server log when the regression tests ran
against a server patched to show custom instrumentation. The log
output shows exactly what's going on with one particular nbtree
opportunistic deletion (my point has nothing to do with deletion, but
it happens to be convenient to make my point in this fashion). This
specific example involves deletion of tuples from the system catalog
index "pg_type_typname_nsp_index". There is nothing very atypical
about it; it just shows a certain kind of heap fragmentation that's
probably very common.
Imagine a plain index scan involving a query along the lines of
"select * from pg_type where typname like 'part%' ", or similar. This
query runs an instant before the example LP_DEAD-bit-driven
opportunistic deletion (a "simple deletion" in nbtree parlance) took
place. You'll be able to piece together from the log output that there
would only be about 4 heap blocks involved with such a query. Ideally,
our hypothetical index scan would pin each buffer/heap page exactly
once, for a total of 4 PinBuffer()/UnpinBuffer() calls. After all,
we're talking about a fairly selective query here, that only needs to
scan precisely one leaf page (I verified this part too) -- so why
wouldn't we expect "index scan parity"?
While there is significant clustering on this example leaf page/key
space, heap TID is not *perfectly* correlated with the
logical/keyspace order of the index -- which can have outsized
consequences. Notice that some heap blocks are non-contiguous
relative to logical/keyspace/index scan/index page offset number order.
We'll end up pinning each of the 4 or so heap pages more than once
(sometimes several times each), when in principle we could have pinned
each heap page exactly once. In other words, there is way too much of
a difference between the case where the tuples we scan are *almost*
perfectly clustered (which is what you see in my example) and the case
where they're exactly perfectly clustered. In other other words, there
is way too much of a difference between plain index scan and bitmap
index scan.
(What I'm saying here is only true because this is a composite index
and our query uses "like", returning rows that match a prefix -- if our
index was on the column "typname" alone and we used a simple equality
condition in our query then the Postgres 12 nbtree work would be
enough to avoid the extra PinBuffer()/UnpinBuffer() calls. I suspect
that there are still relatively many important cases where we perform
extra PinBuffer()/UnpinBuffer() calls during plain index scans that
only touch one leaf page anyway.)
Obviously we should expect bitmap index scans to have a natural
advantage over plain index scans whenever there is little or no
correlation -- that's clear. But that's not what we see here -- we're
way too sensitive to minor imperfections in clustering that are
naturally present on some kinds of leaf pages. The potential
difference in pin/unpin traffic (relative to the bitmap index scan
case) seems pathological to me. Ideally, we wouldn't have these kinds
of differences at all. It's going to disrupt usage_count on the
buffers.
I'm not sure I understand all the nuance here, but the thing I take away
is to add tests with different levels of correlation, and probably also
some multi-column indexes.
It's important to carefully distinguish between cases where plain
index scans really are at an inherent disadvantage relative to bitmap
index scans (because there really is no getting around the need to
access the same heap page many times with an index scan) versus cases
that merely *appear* that way. Implementation restrictions that only
really affect the plain index scan case (e.g., the lack of a
reasonably sized prefetch buffer, or the ScalarArrayOpExpr thing)
should be accounted for when assessing the viability of index scan +
prefetch over bitmap index scan + prefetch. This is very subtle, but
important.
I do agree, but what do you mean by "assessing"?
I mean performance validation. There ought to be a theoretical model
that describes the relationship between index scan and bitmap index
scan, that has actual predictive power in the real world, across a
variety of different cases. Something that isn't sensitive to the
current phase of the moon (e.g., heap fragmentation along the lines of
my pg_type_typname_nsp_index log output). I particularly want to avoid
nasty discontinuities that really make no sense.
Wasn't the agreement at the unconference session that we'd not tweak
costing? So ultimately, this
does not really affect which scan type we pick. We'll keep doing the
same planning decisions as today, no?
I'm not really talking about tweaking the costing. What I'm saying is
that we really should expect index scans to behave similarly to bitmap
index scans at runtime, for queries that really don't have much to
gain from using a bitmap heap scan (queries that may or may not also
benefit from prefetching). There are several reasons why this makes
sense to me.
One reason is that it makes tweaking the actual costing easier later
on. Also, your point about plan robustness was a good one. If we make
the wrong choice about index scan vs bitmap index scan, and the
consequences aren't so bad, that's a very useful enhancement in
itself.
The most important reason of all may just be to build confidence in
the design. I'm interested in understanding when and how prefetching
stops helping.
Agreed.
I'm all for building a more comprehensive set of test cases - the stuff
presented at pgcon was good for demonstration, but it certainly is not
enough for testing. The SAOP queries are a great addition, I also plan
to run those queries on different (less random) data sets, etc. We'll
probably discover more interesting cases as the patch improves.
Definitely.
There are two aspects why I think AM is not the right place:
- accessing table from index code seems backwards
- we already do prefetching from the executor (nodeBitmapHeapscan.c)
It feels kinda wrong in hindsight.
I'm willing to accept that we should do it the way you've done it in
the patch provisionally. It's complicated enough that it feels like I
should reserve the right to change my mind.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Yeah, I'm not saying it's impossible, and imagined we might teach nbtree
to do that. But it seems like work for future someone.
Right. You probably noticed that this is another case where we'd be
making index scans behave more like bitmap index scans (perhaps even
including the downsides for kill_prior_tuple that accompany not
processing each leaf page inline). There is probably a point where
that ceases to be sensible, but I don't know what that point is.
They're way more similar than we seem to imagine.
OK. Thanks for all the comments.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jun 9, 2023 at 3:45 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
What the exact historical timeline is may not be that important. My
emphasis on ScalarArrayOpExpr is partly due to it being a particularly
compelling case for both parallel index scan and prefetching, in
general. There are many queries that have huge in() lists that
naturally benefit a great deal from prefetching. Plus they're common.
Did you mean parallel index scan or bitmap index scan?
I meant parallel index scan (also parallel bitmap index scan). Note
that nbtree parallel index scans have special ScalarArrayOpExpr
handling code.
ScalarArrayOpExpr is kind of special -- it is simultaneously one big
index scan (to the executor), and lots of small index scans (to
nbtree). Unlike the queries that you've looked at so far, which really
only have one plausible behavior at execution time, there are many
ways that ScalarArrayOpExpr index scans can be executed at runtime --
some much faster than others. The nbtree implementation can in
principle reorder how it processes ranges from the key space (i.e.
each range of array elements) with significant flexibility.
I think I understand, although maybe my mental model is wrong. I agree
it seems inefficient, but I'm not sure why it would make prefetching
hopeless. Sure, it puts index scans at a disadvantage (compared to
bitmap scans), but if we pick an index scan it should still be an
improvement, right?
Hopeless might have been too strong of a word. More like it'd fall far
short of what is possible to do with a ScalarArrayOpExpr with a given
high end server.
The quality of the implementation (including prefetching) could make a
huge difference to how well we make use of the available hardware
resources. A really high quality implementation of ScalarArrayOpExpr +
prefetching can keep the system busy with useful work, which is less
true with other types of queries, which have inherently less
predictable I/O (and often have less I/O overall). What could be more
amenable to predicting I/O patterns than a query with a large IN()
list, with many constants that can be processed in whatever order
makes sense at runtime?
What I'd like to do with ScalarArrayOpExpr is to teach nbtree to
coalesce together those "small index scans" into "medium index scans"
dynamically, where that makes sense. That's the main part that's
missing right now. Dynamic behavior matters a lot with
ScalarArrayOpExpr stuff -- that's where the challenge lies, but also
where the opportunities are. Prefetching builds on all that.
I guess I need to do some testing on a range of data sets / queries, and
see how it works in practice.
If I can figure out a way of getting ScalarArrayOpExpr to visit each
leaf page exactly once, that might be enough to make things work
really well most of the time. Maybe it won't even be necessary to
coordinate very much, in the end. Unsure.
I've already done a lot of work that tries to minimize the chances of
regular (non-ScalarArrayOpExpr) queries accessing more than a single
leaf page, which will help your strategy of just prefetching items
from a single leaf page at a time -- that will get you pretty far
already. Consider the example of the tenk2_hundred index from the
bt_page_items documentation. You'll notice that the high key for the
page shown in the docs (and every other page in the same index) nicely
makes the leaf page boundaries "aligned" with natural keyspace
boundaries, due to suffix truncation. That helps index scans to access
no more than a single leaf page when accessing any one distinct
"hundred" value.
We are careful to do the right thing with the "boundary cases" when we
descend the tree, too. This _bt_search behavior builds on the way that
suffix truncation influences the on-disk structure of indexes. Queries
such as "select * from tenk2 where hundred = ?" will each return 100
rows spread across almost as many heap pages. That's a fairly large
number of rows/heap pages, but we still only need to access one leaf
page for every possible constant value (every "hundred" value that
might be specified as the ? in my point query example). It doesn't
matter if it's the leftmost or rightmost item on a leaf page -- we
always descend to exactly the correct leaf page directly, and we
always terminate the scan without having to move to the right sibling
page (we check the high key before going to the right page in some
cases, per the optimization added by commit 29b64d1d).
The same kind of behavior is also seen with the TPC-C line items
primary key index, which is a composite index. We want to access the
items from a whole order in one go, from one leaf page -- and we
reliably do the right thing there too (though with some caveats about
CREATE INDEX). We should never have to access more than one leaf page
to read a single order's line items. This matters because it's quite
natural to want to access whole orders with that particular
table/workload (it's also unnatural to only access one single item
from any given order).
Obviously there are many queries that need to access two or more leaf
pages, because that's just what needs to happen. My point is that we
*should* only do that when it's truly necessary on modern Postgres
versions, since the boundaries between pages are "aligned" with the
"natural boundaries" from the keyspace/application. Maybe your testing
should verify that this effect is actually present, though. It would
be a shame if we sometimes messed up prefetching that could have
worked well due to some issue with how page splits divide up items.
CREATE INDEX is much less smart about suffix truncation -- it isn't
capable of the same kind of tricks as nbtsplitloc.c, even though it
could be taught to do roughly the same thing. Hopefully this won't be
an issue for your work. The tenk2 case still works as expected with
CREATE INDEX/REINDEX, due to help from deduplication. Indexes like the
TPC-C line items PK will leave the index with some "orders" (or
whatever the natural grouping of things is) that span more than a
single leaf page, which is undesirable, and might hinder your
prefetching work. I wouldn't mind fixing that if it turned out to hurt
your leaf-page-at-a-time prefetching patch. Something to consider.
We can fit at most 17 TPC-C orders on each order line PK leaf page.
Could be as few as 15. If we do the wrong thing with prefetching for 2
out of every 15 orders then that's a real problem, but is still subtle enough
to easily miss with conventional benchmarking. I've had a lot of success
with paying close attention to all the little boundary cases, which is why
I'm kind of zealous about it now.
I wonder if you should go further than this, by actually sorting the
items that you need to fetch as part of processing a given leaf page
(I said this at the unconference, you may recall). Why should we
*ever* pin/access the same heap page more than once per leaf page
processed per index scan? Nothing stops us from returning the tuples
to the executor in the original logical/index-wise order, despite
having actually accessed each leaf page's pointed-to heap pages
slightly out of order (with the aim of avoiding extra pin/unpin
traffic that isn't truly necessary). We can sort the heap TIDs in
scratch memory, then do our actual prefetching + heap access, and then
restore the original order before returning anything.
I think that's possible, and I thought about that a bit (not just for
btree, but especially for the distance queries on GiST). But I don't
have a good idea if this would be 1% or 50% improvement, and I was
concerned it might easily lead to regressions if we don't actually need
all the tuples.
I get that it could be invasive. I have the sense that just pinning
the same heap page more than once in very close succession is just the
wrong thing to do, with or without prefetching.
I mean, imagine we have TIDs
[T1, T2, T3, T4, T5, T6]
Maybe T1, T5, T6 are from the same page, so per your proposal we might
reorder and prefetch them in this order:
[T1, T5, T6, T2, T3, T4]
But maybe we only need [T1, T2] because of a LIMIT, and the extra work
we did on processing T5, T6 is wasted.
Yeah, that's possible. But isn't that par for the course? Any
optimization that involves speculation (including all prefetching)
comes with similar risks. They can be managed.
I don't think that we'd literally order by TID...we wouldn't change
the order that each heap page was *initially* pinned. We'd just
reorder the tuples minimally using an approach that is sufficient to
avoid repeated pinning of heap pages during processing of any one leaf
page's heap TIDs. ISTM that the risk of wasting work is limited to
wasting cycles on processing extra tuples from a heap page that we
definitely had to process at least one tuple from already. That
doesn't seem particularly risky, as speculative optimizations go. The
downside is bounded and well understood, while the upside could be
significant.
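For concreteness, a minimal sketch (illustration only, not from the patch)
of the kind of per-leaf-page reordering I have in mind - sort the leaf
page's heap TIDs by block number for the prefetch/heap pass, while
remembering each TID's original position so tuples are still returned in
index order:

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Illustration: each heap TID from the current leaf page, paired with its
 * original position in the leaf's item array.
 */
typedef struct LeafTidSortItem
{
    ItemPointerData tid;        /* heap TID from the leaf page */
    int             origidx;    /* original offset in the leaf's item array */
} LeafTidSortItem;

/* qsort comparator: group TIDs by heap block number */
static int
leaf_tid_cmp(const void *a, const void *b)
{
    const LeafTidSortItem *ia = (const LeafTidSortItem *) a;
    const LeafTidSortItem *ib = (const LeafTidSortItem *) b;
    BlockNumber ba = ItemPointerGetBlockNumber(&ia->tid);
    BlockNumber bb = ItemPointerGetBlockNumber(&ib->tid);

    if (ba < bb)
        return -1;
    if (ba > bb)
        return 1;
    return 0;
}

/*
 * Sketch of use: qsort(items, nitems, sizeof(LeafTidSortItem), leaf_tid_cmp),
 * prefetch + access the heap in this block-grouped order, then walk the
 * array by origidx to emit tuples in the original index order.
 */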
I really don't have that much confidence in any of this just yet. I'm
not trying to make this project more difficult. I just can't help but
notice that the order that index scans end up pinning heap pages
already has significant problems, and is sensitive to things like
small amounts of heap fragmentation -- maybe that's not a great basis
for prefetching. I *really* hate any kind of sharp discontinuity,
where a minor change in an input (e.g., from minor amounts of heap
fragmentation) has outsized impact on an output (e.g., buffers
pinned). Interactions like that tend to be really pernicious -- they
lead to bad performance that goes unnoticed and unfixed because the
problem effectively camouflages itself. It may even be easier to make
the conservative (perhaps paranoid) assumption that weird nasty
interactions will cause harm somewhere down the line...why take a
chance?
I might end up prototyping this myself. I may have to put my money
where my mouth is. :-)
--
Peter Geoghegan
On Thu, Jun 8, 2023 at 11:40 AM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans
At the point Greg Stark was hacking on this, the underlying OS async I/O
features were tricky to fit into PG's I/O model, and both of us did much
review work just to find working common ground that PG could plug into.
Linux POSIX advisories were completely different from Solaris's async
model, the other OS used for validation that the feature worked, with the
hope being that designing against two APIs would be better than just
focusing on Linux. Since that foundation was all so brittle and limited,
scope was limited to just the heap scan, since it seemed to have the best
return on time invested given the parts of async I/O that did and didn't
scale as expected.
As I remember it, the idea was to get the basic feature out the door and
gather feedback about things like whether the effective_io_concurrency knob
worked as expected before moving onto other prefetching. Then that got
lost in filesystem upheaval land, with so much drama around Solaris/ZFS and
Oracle's btrfs work. I think it's just that no one ever got back to it.
I have all the workloads that I use for testing automated into
pgbench-tools now, and this change would be easy to fit into testing on
them as I'm very heavy on block I/O tests. To get PG to reach full read
speed on newer storage I've had to do some strange tests, like doing index
range scans that touch 25+ pages. Here's that one as a pgbench script:
\set range 67 * (:multiplier + 1)
\set limit 100000 * :scale
\set limit :limit - :range
\set aid random(1, :limit)
SELECT aid,abalance FROM pgbench_accounts WHERE aid >= :aid ORDER BY aid
LIMIT :range;
And then you use '-Dmultiplier=10' or such to crank it up. Database 4X
RAM, multiplier=25 with 16 clients is my starting point on it when I want
to saturate storage. Anything that lets me bring those numbers down would
be valuable.
--
Greg Smith greg.smith@crunchydata.com
Director of Open Source Strategy
Hi,
On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?
Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.
One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...
Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?
Yes, I meant that.
4) per-leaf prefetching
The code only prefetches items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.
Maybe. I'm open to that idea if you have an idea how to shape the API to
make this possible (although perhaps not in v0).
I'll try to have a look.
I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?
I forgot to explain what "cached" means in the TPC-H case. It means
second execution of the query, so you can imagine it like this:
for q in `seq 1 22`; do
1. drop caches and restart postgres
Are you doing it in that order? If so, the pagecache can end up being seeded
by postgres writing out dirty buffers.
2. run query $q -> uncached
3. run query $q -> cached
done
So the second execution has a chance of having data in memory - but
maybe not all, because this is a 100GB data set (so ~200GB after
loading), but the machine only has 64GB of RAM.
I think a likely explanation is some of the data wasn't actually in
memory, so prefetching still did something.
Ah, ok.
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).
That's certainly a good idea. I'll do that in the next round of tests. I
also plan to do a test on data set that fits into RAM, to test "properly
cached" case.
Cool. It'd be good to measure both the case of all data already being in s_b
(to see the overhead of the buffer mapping lookups) and the case where the
data is in the kernel pagecache (to see the overhead of pointless
posix_fadvise calls).
Greetings,
Andres Freund
On 6/10/23 22:34, Andres Freund wrote:
Hi,
On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
2) prefetching from executor
Another question is whether the prefetching shouldn't actually happen
even higher - in the executor. That's what Andres suggested during the
unconference, and it kinda makes sense. That's where we do prefetching
for bitmap heap scans, so why should this happen lower, right?
Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.
One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...
Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?
Yes, I meant that.
Ah, you meant that maybe we shouldn't have done that. Sorry, I
misunderstood.
4) per-leaf prefetching
The code only prefetches items from one leaf page. If the
index scan needs to scan multiple (many) leaf pages, we have to process
the first leaf page first before reading / prefetching the next one.
I think this is an acceptable limitation, certainly for v0. Prefetching
across multiple leaf pages seems way more complex (particularly for the
cases using pairing heap), so let's leave this for the future.
Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more towards the executor, I think it very well
could end up being just as simple to prefetch across index pages.
Maybe. I'm open to that idea if you have an idea how to shape the API to
make this possible (although perhaps not in v0).
I'll try to have a look.
I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the already cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?
I forgot to explain what "cached" means in the TPC-H case. It means
second execution of the query, so you can imagine it like this:
for q in `seq 1 22`; do
1. drop caches and restart postgres
Are you doing it in that order? If so, the pagecache can end up being seeded
by postgres writing out dirty buffers.
Actually no, I do it the other way around - first restart, then drop. It
shouldn't matter much, though, because after building the data set (and
vacuum + checkpoint), the data is not modified - all the queries run on
the same data set. So there shouldn't be any dirty buffers.
2. run query $q -> uncached
3. run query $q -> cached
done
So the second execution has a chance of having data in memory - but
maybe not all, because this is a 100GB data set (so ~200GB after
loading), but the machine only has 64GB of RAM.
I think a likely explanation is some of the data wasn't actually in
memory, so prefetching still did something.
Ah, ok.
I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).
That's certainly a good idea. I'll do that in the next round of tests. I
also plan to do a test on data set that fits into RAM, to test "properly
"cached" case.
Cool. It'd be good to measure both the case of all data already being in s_b
(to see the overhead of the buffer mapping lookups) and the case where the
data is in the kernel pagecache (to see the overhead of pointless
posix_fadvise calls).
OK, I'll make sure the next round of tests includes a sufficiently small
data set too. I should have some numbers sometime early next week.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 2023-06-08 at 17:40 +0200, Tomas Vondra wrote:
Hi,
At pgcon unconference I presented a PoC patch adding prefetching for
indexes, along with some benchmark results demonstrating the (pretty
significant) benefits etc. The feedback was quite positive, so let me
share the current patch more widely.
I added an entry to
https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Unconference
based on notes I took during that session.
Hope it helps.
--
Tomasz Rybak, Debian Developer <serpent@debian.org>
GPG: A565 CE64 F866 A258 4DDC F9C7 ECB7 3E37 E887 AA8C
On Thu, Jun 8, 2023 at 9:10 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
We already do prefetching for bitmap index scans, where the bitmap heap
scan prefetches future pages based on effective_io_concurrency. I'm not
sure why exactly was prefetching implemented only for bitmap scans, but
I suspect the reasoning was that it only helps when there's many
matching tuples, and that's what bitmap index scans are for. So it was
not worth the implementation effort.
One of the reasons IMHO is that in the bitmap scan before starting the
heap fetch TIDs are already sorted in heap block order. So it is
quite obvious that once we prefetch a heap block most of the
subsequent TIDs will fall on that block i.e. each prefetch will
satisfy many immediate requests. OTOH, in the index scan the I/O
request is very random so we might have to prefetch many blocks even
for satisfying the request for TIDs falling on one index page. I
agree that prefetching with an index scan will definitely help in
reducing the random I/O, but my guess is that prefetching with a
bitmap scan appeared more natural, and that would have been one of
the reasons for implementing this only for a bitmap scan.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Hi,
I have results from the new extended round of prefetch tests. I've
pushed everything to
https://github.com/tvondra/index-prefetch-tests-2
There are scripts I used to run this (run-*.sh), raw results and various
kinds of processed summaries (pdf, ods, ...) that I'll mention later.
As before, this tests a number of query types:
- point queries with btree and hash (equality)
- ORDER BY queries with btree (inequality + order by)
- SAOP queries with btree (column IN (values))
It's probably futile to go through details of all the tests - it's
easier to go through the (hopefully fairly readable) shell scripts.
But in principle, it runs some simple queries while varying both the data
set and workload:
- data set may be random, sequential or cyclic (with different length)
- the number of matches per value differs (i.e. equality condition may
match 1, 10, 100, ..., 100k rows)
- forces a particular scan type (indexscan, bitmapscan, seqscan)
- each query is executed twice - first run (right after restarting DB
and dropping caches) is uncached, second run should have data cached
- the query is executed 5x with different parameters (so 10x in total)
This is tested with three basic data sizes - fits into shared buffers,
fits into RAM and exceeds RAM. The sizes are roughly 350MB, 3.5GB and
20GB (i5) / 40GB (xeon).
Note: xeon has 64GB RAM, so technically the largest scale fits into RAM.
But it should not matter, thanks to drop-caches and restart.
I also attempted to pin the backend to a particular core, in an effort to
eliminate scheduling-related noise. It's mostly what taskset does, but I
did that from extension (https://github.com/tvondra/taskset) which
allows me to do that as part of the SQL script.
For the results, I'll talk about the v1 patch (as submitted here) first.
I'll use the PDF results in the "pdf" directory which generally show a
pivot table by different test parameters, comparing the results by
different parameters (prefetching on/off, master/patched).
Feel free to do your own analysis from the raw CSV data, ofc.
For example, this:
https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-point-queries-builds.pdf
shows how the prefetching affects timing for point queries with
different numbers of matches (1 to 100k). The numbers are timings for
master and patched build. The last group is (patched/master), so the
lower the number the better - 50% means patch makes the query 2x faster.
There's also a heatmap, with green=good, red=bad, which makes it easier
to spot cases that got slower/faster.
The really interesting stuff starts on page 7 (in this PDF), because the
first couple pages are "cached" (so it's more about measuring overhead
when prefetching has no benefit).
Right on page 7 you can see a couple cases with a mix of slower/faster
cases, roughly in the +/- 30% range. However, this is unrelated to
the patch because those are results for bitmapheapscan.
For indexscans (page 8), the results are invariably improved - the more
matches the better (up to ~10x faster for 100k matches).
Those were results for the "cyclic" data set. For the random data set (pages
9-11) the results are pretty similar, but for "sequential" data (11-13)
the prefetching is actually harmful - there are red clusters, with up to
500% slowdowns.
I'm not going to explain the summary for SAOP queries
(https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/patch-v1-saop-queries-builds.pdf),
the story is roughly the same, except that there are more tested query
combinations (because we also vary the pattern in the IN() list - number
of values etc.).
So, the conclusion from this is - generally very good results for random
and cyclic data sets, but pretty bad results for sequential. But even
for the random/cyclic cases there are combinations (especially with many
matches) where prefetching doesn't help or even hurts.
The only way to deal with this is (I think) a cheap way to identify and
skip inefficient prefetches, essentially by doing two things:
a) remembering more recently prefetched blocks (say, 1000+) and not
prefetching them over and over
b) ability to identify a sequential pattern, when readahead seems to do
a pretty good job already (although I heard some disagreement)
I've been thinking about how to do this - doing (a) seems pretty hard,
because on the one hand we want to remember a fair number of blocks and
we want the check "did we prefetch X" to be very cheap. So a hash table
seems nice. OTOH we want to expire "old" blocks and only keep the most
recent ones, and a hash table doesn't really support that.
Perhaps there is a great data structure for this, not sure. But after
thinking about this I realized we don't need perfect accuracy - it's
fine to have false positives/negatives - it's fine to forget we already
prefetched block X and prefetch it again. It's not
a matter of correctness, just a matter of efficiency - after all, we
can't know if it's still in memory, we only know if we prefetched it
fairly recently.
This led me to a "hash table of LRU caches" thing. Imagine a tiny LRU
cache that's small enough to be searched linearly (say, 8 blocks). And
we have many of them (e.g. 128), so that in total we can remember 1024
block numbers. Now, every block number is mapped to a single LRU by
hashing, as if we had a hash table
index = hash(blockno) % 128
and we only use that one LRU to track this block. It's tiny so we can
search it linearly.
To expire prefetched blocks, there's a counter incremented every time we
prefetch a block, and we store it in the LRU with the block number. When
checking the LRU we ignore old entries (with counter more than 1000
values back), and we also evict/replace the oldest entry if needed.
This seems to work pretty well for the first requirement, but it doesn't
allow identifying the sequential pattern cheaply. To do that, I added a
tiny queue with a couple of entries that can be checked to see if the last
couple of entries are sequential.
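To make the scheme concrete, here's a simplified standalone sketch
(illustration only - it simplifies things, e.g. by using a modulo instead
of hashing the block number, and it omits the sequential-pattern queue):

#include <stdbool.h>
#include <stdint.h>

#define LRU_COUNT   128                         /* number of tiny LRUs */
#define LRU_SIZE    8                           /* entries per tiny LRU */
#define CACHE_SIZE  (LRU_COUNT * LRU_SIZE)      /* "recent" horizon: 1024 requests */

typedef struct CacheEntry
{
    uint32_t    block;      /* block number */
    uint64_t    request;    /* request counter at last prefetch, 0 = unused */
} CacheEntry;

static CacheEntry cache[LRU_COUNT * LRU_SIZE];
static uint64_t reqno;      /* incremented for every prefetch request */

/* returns true if the block was prefetched within the last CACHE_SIZE requests */
static bool
recently_prefetched(uint32_t block)
{
    /* pick the tiny LRU for this block */
    CacheEntry *lru = &cache[(block % LRU_COUNT) * LRU_SIZE];
    int         victim = 0;

    for (int i = 0; i < LRU_SIZE; i++)
    {
        /* track the oldest (or unused) slot, in case we need to evict */
        if (lru[i].request < lru[victim].request)
            victim = i;

        if (lru[i].request != 0 && lru[i].block == block)
        {
            /* found it - was the previous request recent enough? */
            bool        recent = (lru[i].request + CACHE_SIZE >= reqno);

            lru[i].request = ++reqno;   /* refresh the entry */
            return recent;
        }
    }

    /* not found - remember the block in the oldest/unused slot */
    lru[victim].block = block;
    lru[victim].request = ++reqno;
    return false;
}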
And this is what the attached 0002+0003 patches do. There are PDFs with
results for this build, prefixed with "patch-v3", and the results are
pretty good - the regressions are largely gone.
It's even clearer in the PDFs comparing the impact of the two patches:
https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-point.pdf
https://github.com/tvondra/index-prefetch-tests-2/blob/master/pdf/comparison-saop.pdf
Which simply shows the "speedup heatmap" for the two patches, and the
"v3" heatmap has much less red regression clusters.
Note: The comparison-point.pdf summary has another group of columns
illustrating if this scan type would actually be used, with "green"
meaning "yes". This provides additional context, because e.g. for the
"noisy bitmapscans" it's all white, i.e. without setting the GUCs the
optimizer would pick something else (hence it's a non-issue).
Let me know if the results are not clear enough (I tried to cover the
important stuff, but I'm sure there's a lot of details I didn't cover),
or if you think some other summary would be better.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0003-ignore-seq-patterns-add-stats-v3.patch
From fc869af55678eda29045190f735da98c4b6808d9 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Thu, 15 Jun 2023 14:49:56 +0200
Subject: [PATCH 2/2] ignore seq patterns, add stats
---
src/backend/access/index/indexam.c | 80 ++++++++++++++++++++++++++++++
src/include/access/genam.h | 16 ++++++
2 files changed, 96 insertions(+)
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 557267aced9..6ab977ca284 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -378,6 +378,16 @@ index_endscan(IndexScanDesc scan)
if (scan->xs_temp_snap)
UnregisterSnapshot(scan->xs_snapshot);
+ /* If prefetching enabled, log prefetch stats. */
+ if (scan->xs_prefetch)
+ {
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
+ elog(LOG, "index prefetch stats: requests %lu prefetches %lu (%f)",
+ prefetch->prefetchAll, prefetch->prefetchCount,
+ prefetch->prefetchCount * 100.0 / prefetch->prefetchAll);
+ }
+
/* Release the scan data structure itself */
IndexScanEnd(scan);
}
@@ -1028,6 +1038,57 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
return build_local_reloptions(&relopts, attoptions, validate);
}
+/*
+ * Add the block to the tiny top-level queue (LRU), and check if the block
+ * is in a sequential pattern.
+ */
+static bool
+index_prefetch_is_sequential(IndexPrefetch prefetch, BlockNumber block)
+{
+ bool is_sequential = true;
+ int idx;
+
+ /* no requests */
+ if (prefetch->queueIndex == 0)
+ {
+ idx = (prefetch->queueIndex++) % PREFETCH_QUEUE_SIZE;
+ prefetch->queueItems[idx] = block;
+ return false;
+ }
+
+ /* same as immediately preceding block? */
+ idx = (prefetch->queueIndex - 1) % PREFETCH_QUEUE_SIZE;
+ if (prefetch->queueItems[idx] == block)
+ return true;
+
+ idx = (prefetch->queueIndex++) % PREFETCH_QUEUE_SIZE;
+ prefetch->queueItems[idx] = block;
+
+ for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
+ {
+ /* not enough requests */
+ if (prefetch->queueIndex < i)
+ {
+ is_sequential = false;
+ break;
+ }
+
+ /*
+ * -1, because we've already advanced the index, so it points to
+ * the next slot at this point
+ */
+ idx = (prefetch->queueIndex - i - 1) % PREFETCH_QUEUE_SIZE;
+
+ if ((block - i) != prefetch->queueItems[idx])
+ {
+ is_sequential = false;
+ break;
+ }
+ }
+
+ return is_sequential;
+}
+
/*
* index_prefetch_add_cache
* Add a block to the cache, return true if it was recently prefetched.
@@ -1081,6 +1142,19 @@ index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
uint64 oldestRequest = PG_UINT64_MAX;
int oldestIndex = -1;
+ /*
+ * First add the block to the (tiny) top-level LRU cache and see if it's
+ * part of a sequential pattern. In this case we just ignore the block
+ * and don't prefetch it - we expect read-ahead to do a better job.
+ *
+ * XXX Maybe we should still add the block to the later cache, in case
+ * we happen to access it later? That might help if we first scan a lot
+ * of the table sequentially, and then randomly. Not sure that's very
+ * likely with index access, though.
+ */
+ if (index_prefetch_is_sequential(prefetch, block))
+ return true;
+
/* see if we already have prefetched this block (linear search of LRU) */
for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
{
@@ -1206,6 +1280,8 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
if (prefetch->prefetchTarget <= 0)
return;
+ prefetch->prefetchAll++;
+
/*
* XXX I think we don't need to worry about direction here, that's handled
* by how the AMs build the curPos etc. (see nbtsearch.c)
@@ -1256,6 +1332,8 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
if (index_prefetch_add_cache(prefetch, block))
continue;
+ prefetch->prefetchCount++;
+
PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
pgBufferUsage.blks_prefetches++;
}
@@ -1300,6 +1378,8 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
if (index_prefetch_add_cache(prefetch, block))
continue;
+ prefetch->prefetchCount++;
+
PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
pgBufferUsage.blks_prefetches++;
}
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index c01c37951ca..526f280a44d 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -276,6 +276,12 @@ typedef struct PrefetchCacheEntry {
#define PREFETCH_LRU_COUNT 128
#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+/*
+ * Used to detect sequential patterns (and disable prefetching).
+ */
+#define PREFETCH_QUEUE_SIZE 8
+#define PREFETCH_SEQ_PATTERN_BLOCKS 4
+
typedef struct IndexPrefetchData
{
/*
@@ -291,6 +297,16 @@ typedef struct IndexPrefetchData
prefetcher_getblock_function get_block;
prefetcher_getrange_function get_range;
+ uint64 prefetchAll;
+ uint64 prefetchCount;
+
+ /*
+ * Tiny queue of most recently prefetched blocks, used first for cheap
+ * checks and also to identify (and ignore) sequential prefetches.
+ */
+ uint64 queueIndex;
+ BlockNumber queueItems[PREFETCH_QUEUE_SIZE];
+
/*
* Cache of recently prefetched blocks, organized as a hash table of
* small LRU caches.
--
2.40.1
0002-more-elaborate-prefetch-cache-v3.patch
From 2fdfbcabb262e2fea38f40465f60441c5f255096 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas.vondra@postgresql.org>
Date: Wed, 14 Jun 2023 15:08:55 +0200
Subject: [PATCH 1/2] more elaborate prefetch cache
---
src/backend/access/gist/gistscan.c | 3 -
src/backend/access/hash/hash.c | 3 -
src/backend/access/index/indexam.c | 156 +++++++++++++++++++---------
src/backend/access/nbtree/nbtree.c | 3 -
src/backend/access/spgist/spgscan.c | 3 -
src/backend/replication/walsender.c | 2 +
src/include/access/genam.h | 41 ++++++--
7 files changed, 141 insertions(+), 70 deletions(-)
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index fdf978eaaad..eaa89ea6c97 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -128,9 +128,6 @@ gistbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int pr
prefetcher->prefetchMaxTarget = prefetch_maximum;
prefetcher->prefetchReset = prefetch_reset;
- prefetcher->cacheIndex = 0;
- memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
-
/* callbacks */
prefetcher->get_block = gist_prefetch_getblock;
prefetcher->get_range = gist_prefetch_getrange;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 01a25132bce..6546d457899 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -401,9 +401,6 @@ hashbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int
prefetcher->prefetchMaxTarget = prefetch_maximum;
prefetcher->prefetchReset = prefetch_reset;
- prefetcher->cacheIndex = 0;
- memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
-
/* callbacks */
prefetcher->get_block = _hash_prefetch_getblock;
prefetcher->get_range = _hash_prefetch_getrange;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aa8a14624d8..557267aced9 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -54,6 +54,7 @@
#include "catalog/pg_amproc.h"
#include "catalog/pg_type.h"
#include "commands/defrem.h"
+#include "common/hashfn.h"
#include "nodes/makefuncs.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
@@ -1027,7 +1028,110 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
return build_local_reloptions(&relopts, attoptions, validate);
}
+/*
+ * index_prefetch_add_cache
+ * Add a block to the cache, return true if it was recently prefetched.
+ *
+ * When checking a block, we need to check if it was recently prefetched,
+ * where recently means within PREFETCH_CACHE_SIZE requests. This check
+ * needs to be very cheap, even with fairly large caches (hundreds of
+ * entries). The cache does not need to be perfect, we can accept false
+ * positives/negatives, as long as the rate is reasonably low. We also
+ * need to expire entries, so that only "recent" requests are remembered.
+ *
+ * A queue would allow expiring the requests, but checking if a block was
+ * prefetched would be expensive (linear search for longer queues). Another
+ * option would be a hash table, but that has issues with expiring entries
+ * cheaply (which usually degrades the hash table).
+ *
+ * So we use a cache that is organized as multiple small LRU caches. Each
+ * block is mapped to a particular LRU by hashing (so it's a bit like a
+ * hash table), and each LRU is tiny (e.g. 8 entries). The LRU only keeps
+ * the most recent requests (for that particular LRU).
+ *
+ * This allows quick searches and expiration, with false negatives (when
+ * a particular LRU has too many collisions).
+ *
+ * For example, imagine 128 LRU caches, each with 8 entries - that's 1024
+ * prefetch request in total.
+ *
+ * The recency is determined using a prefetch counter, incremented every
+ * time we end up prefetching a block. The counter is uint64, so it should
+ * not wrap (125 zebibytes, would take ~4 million years at 1GB/s).
+ *
+ * To check if a block was prefetched recently, we calculate hash(block),
+ * and then linearly search if the tiny LRU has entry for the same block
+ * and request less than PREFETCH_CACHE_SIZE ago.
+ *
+ * At the same time, we either update the entry (for the same block) if
+ * found, or replace the oldest/empty entry.
+ *
+ * If the block was not recently prefetched (i.e. we want to prefetch it),
+ * we increment the counter.
+ */
+static bool
+index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
+{
+ PrefetchCacheEntry *entry;
+
+ /* calculate which LRU to use */
+ int lru = hash_uint32(block) % PREFETCH_LRU_COUNT;
+ /* entry to (maybe) use for this block request */
+ uint64 oldestRequest = PG_UINT64_MAX;
+ int oldestIndex = -1;
+
+ /* see if we already have prefetched this block (linear search of LRU) */
+ for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+ {
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+ /* Is this the oldest prefetch request in this LRU? */
+ if (entry->request < oldestRequest)
+ {
+ oldestRequest = entry->request;
+ oldestIndex = i;
+ }
+
+ /* Request numbers are positive, so 0 means "unused". */
+ if (entry->request == 0)
+ continue;
+
+ /* Is this entry for the same block as the current request? */
+ if (entry->block == block)
+ {
+ bool prefetched;
+
+ /*
+ * Is the old request sufficiently recent? If yes, we treat the
+ * block as already prefetched.
+ *
+ * XXX We do add the cache size to the request in order not to
+ * have issues with uint64 underflows.
+ */
+ prefetched = (entry->request + PREFETCH_CACHE_SIZE >= prefetch->prefetchReqNumber);
+
+ /* Update the request number. */
+ entry->request = ++prefetch->prefetchReqNumber;
+
+ return prefetched;
+ }
+ }
+
+ /*
+ * We didn't find the block in the LRU, so store it either in an empty
+ * entry, or in the "oldest" prefetch request in this LRU.
+ */
+ Assert((oldestIndex >= 0) && (oldestIndex < PREFETCH_LRU_SIZE));
+
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + oldestIndex];
+
+ entry->block = block;
+ entry->request = ++prefetch->prefetchReqNumber;
+
+ /* not in the prefetch cache */
+ return false;
+}
/*
* Do prefetching, and gradually increase the prefetch distance.
@@ -1138,7 +1242,6 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
for (int i = startIndex; i <= endIndex; i++)
{
- bool recently_prefetched = false;
BlockNumber block;
block = prefetch->get_block(scan, dir, i);
@@ -1149,35 +1252,12 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
* This happens e.g. for clustered or naturally correlated indexes
* (fkey to a sequence ID). It's not expensive (the block is in page
* cache already, so no I/O), but it's not free either.
- *
- * XXX We can't just check blocks between startIndex and endIndex,
- * because at some point (after the pefetch target gets ramped up)
- * it's going to be just a single block.
- *
- * XXX The solution here is pretty trivial - we just check the
- * immediately preceding block. We could check a longer history, or
- * maybe maintain some "already prefetched" struct (small LRU array
- * of last prefetched blocks - say 8 blocks or so - would work fine,
- * I think).
*/
- for (int j = 0; j < 8; j++)
- {
- /* the cached block might be InvalidBlockNumber, but that's fine */
- if (prefetch->cacheBlocks[j] == block)
- {
- recently_prefetched = true;
- break;
- }
- }
-
- if (recently_prefetched)
+ if (index_prefetch_add_cache(prefetch, block))
continue;
PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
pgBufferUsage.blks_prefetches++;
-
- prefetch->cacheBlocks[prefetch->cacheIndex] = block;
- prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
}
prefetch->prefetchIndex = endIndex;
@@ -1206,7 +1286,6 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
for (int i = endIndex; i >= startIndex; i--)
{
- bool recently_prefetched = false;
BlockNumber block;
block = prefetch->get_block(scan, dir, i);
@@ -1217,35 +1296,12 @@ index_prefetch(IndexScanDesc scan, ScanDirection dir)
* This happens e.g. for clustered or naturally correlated indexes
* (fkey to a sequence ID). It's not expensive (the block is in page
* cache already, so no I/O), but it's not free either.
- *
- * XXX We can't just check blocks between startIndex and endIndex,
- * because at some point (after the pefetch target gets ramped up)
- * it's going to be just a single block.
- *
- * XXX The solution here is pretty trivial - we just check the
- * immediately preceding block. We could check a longer history, or
- * maybe maintain some "already prefetched" struct (small LRU array
- * of last prefetched blocks - say 8 blocks or so - would work fine,
- * I think).
*/
- for (int j = 0; j < 8; j++)
- {
- /* the cached block might be InvalidBlockNumber, but that's fine */
- if (prefetch->cacheBlocks[j] == block)
- {
- recently_prefetched = true;
- break;
- }
- }
-
- if (recently_prefetched)
+ if (index_prefetch_add_cache(prefetch, block))
continue;
PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
pgBufferUsage.blks_prefetches++;
-
- prefetch->cacheBlocks[prefetch->cacheIndex] = block;
- prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
}
prefetch->prefetchIndex = startIndex;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b1a02cc9bcd..1ad5490b9ad 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -387,9 +387,6 @@ btbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int pr
prefetcher->prefetchMaxTarget = prefetch_maximum;
prefetcher->prefetchReset = prefetch_reset;
- prefetcher->cacheIndex = 0;
- memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
-
/* callbacks */
prefetcher->get_block = _bt_prefetch_getblock;
prefetcher->get_range = _bt_prefetch_getrange;
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index 79015194b73..a1c6bb7b139 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -394,9 +394,6 @@ spgbeginscan(Relation rel, int keysz, int orderbysz, int prefetch_maximum, int p
prefetcher->prefetchMaxTarget = prefetch_maximum;
prefetcher->prefetchReset = prefetch_reset;
- prefetcher->cacheIndex = 0;
- memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
-
/* callbacks */
prefetcher->get_block = spgist_prefetch_getblock;
prefetcher->get_range = spgist_prefetch_getrange;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d3a136b6f55..c7248877f6c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1131,6 +1131,8 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
need_full_snapshot = true;
}
+ elog(LOG, "slot = %s need_full_snapshot = %d", cmd->slotname, need_full_snapshot);
+
ctx = CreateInitDecodingContext(cmd->plugin, NIL, need_full_snapshot,
InvalidXLogRecPtr,
XL_ROUTINE(.page_read = logical_read_xlog_page,
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 6a500c5aa1f..c01c37951ca 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -250,6 +250,32 @@ typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
ScanDirection direction,
int index);
+/*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches. Doesn't need to be perfectly accurate, but we
+ * aim to make false positives/negatives reasonably low.
+ */
+typedef struct PrefetchCacheEntry {
+ BlockNumber block;
+ uint64 request;
+} PrefetchCacheEntry;
+
+/*
+ * Size of the cache of recently prefetched blocks - it shouldn't be too
+ * small or too large. 1024 entries seems about right; at 8kB blocks it
+ * covers ~8MB of data. The value is somewhat arbitrary - there's no
+ * particular formula saying it should not be higher/lower.
+ *
+ * The cache is structured as an array of small LRU caches, so the total
+ * size needs to be a multiple of LRU size. The LRU should be tiny to
+ * keep linear search cheap enough.
+ *
+ * XXX Maybe we could consider effective_cache_size or something?
+ */
+#define PREFETCH_LRU_SIZE 8
+#define PREFETCH_LRU_COUNT 128
+#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+
typedef struct IndexPrefetchData
{
/*
@@ -262,17 +288,16 @@ typedef struct IndexPrefetchData
int prefetchMaxTarget; /* maximum prefetching distance */
int prefetchReset; /* reset to this distance on rescan */
- /*
- * a small LRU cache of recently prefetched blocks
- *
- * XXX needs to be tiny, to make the (frequent) searches very cheap
- */
- BlockNumber cacheBlocks[8];
- int cacheIndex;
-
prefetcher_getblock_function get_block;
prefetcher_getrange_function get_range;
+ /*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches.
+ */
+ uint64 prefetchReqNumber;
+ PrefetchCacheEntry prefetchCache[PREFETCH_CACHE_SIZE];
+
} IndexPrefetchData;
#endif /* GENAM_H */
--
2.40.1
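
To make the "hash table of tiny LRUs" idea above easier to follow outside the
patch context, here is a small standalone sketch of the same dedup cache. It's
only an illustration, not code from the patch - the hash is a crude stand-in
for hash_uint32(), and the names (LRU_SIZE, LRU_COUNT, cache_lookup_add, ...)
are made up to keep it self-contained:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LRU_SIZE    8       /* entries per mini-LRU (searched linearly) */
#define LRU_COUNT   128     /* number of mini-LRUs */
#define CACHE_SIZE  (LRU_SIZE * LRU_COUNT)  /* "recent" horizon, in requests */

typedef struct
{
    uint32_t    block;      /* block number */
    uint64_t    request;    /* request counter value, 0 = unused */
} CacheEntry;

typedef struct
{
    uint64_t    reqno;      /* global request counter */
    CacheEntry  cache[CACHE_SIZE];
} PrefetchCache;

/*
 * Returns true if "block" was requested within the last CACHE_SIZE requests
 * (so the caller can skip the prefetch), false otherwise. Either way the
 * block gets recorded with the current request number.
 */
static bool
cache_lookup_add(PrefetchCache *pc, uint32_t block)
{
    /* crude multiplicative hash, standing in for hash_uint32() */
    int         lru = (int) ((block * 2654435761u) % LRU_COUNT);
    CacheEntry *slots = &pc->cache[lru * LRU_SIZE];
    uint64_t    oldest = UINT64_MAX;
    int         victim = 0;

    for (int i = 0; i < LRU_SIZE; i++)
    {
        /* remember the oldest (or an unused) slot as replacement victim */
        if (slots[i].request < oldest)
        {
            oldest = slots[i].request;
            victim = i;
        }

        /* same block seen before? */
        if (slots[i].request != 0 && slots[i].block == block)
        {
            bool        recent = (slots[i].request + CACHE_SIZE >= pc->reqno);

            slots[i].request = ++pc->reqno;
            return recent;
        }
    }

    /* not found - overwrite the victim slot */
    slots[victim].block = block;
    slots[victim].request = ++pc->reqno;
    return false;
}

int
main(void)
{
    PrefetchCache pc;
    uint32_t    blocks[] = {10, 10, 11, 10, 12, 11};

    memset(&pc, 0, sizeof(pc));

    for (int i = 0; i < 6; i++)
        printf("block %u -> %s\n", (unsigned) blocks[i],
               cache_lookup_add(&pc, blocks[i]) ? "skip (recent)" : "prefetch");

    return 0;
}

The reason for the two-level structure is that the hash picks the right bucket
in O(1), and the linear search then only ever touches LRU_SIZE (8) entries,
which should be cheap enough to do for every TID considered for prefetching.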
Attachment: 0001-index-prefetch-poc-v1.patch (text/x-patch)
diff --git a/contrib/bloom/bloom.h b/contrib/bloom/bloom.h
index efdf9415d15..9b3625d833b 100644
--- a/contrib/bloom/bloom.h
+++ b/contrib/bloom/bloom.h
@@ -193,7 +193,7 @@ extern bool blinsert(Relation index, Datum *values, bool *isnull,
IndexUniqueCheck checkUnique,
bool indexUnchanged,
struct IndexInfo *indexInfo);
-extern IndexScanDesc blbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc blbeginscan(Relation r, int nkeys, int norderbys, int prefetch, int prefetch_reset);
extern int64 blgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern void blrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/contrib/bloom/blscan.c b/contrib/bloom/blscan.c
index 6cc7d07164a..0c6da1b635b 100644
--- a/contrib/bloom/blscan.c
+++ b/contrib/bloom/blscan.c
@@ -25,7 +25,7 @@
* Begin scan of bloom index.
*/
IndexScanDesc
-blbeginscan(Relation r, int nkeys, int norderbys)
+blbeginscan(Relation r, int nkeys, int norderbys, int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
BloomScanOpaque so;
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3c6a956eaa3..5b298c02cce 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -324,7 +324,7 @@ brininsert(Relation idxRel, Datum *values, bool *nulls,
* holding lock on index, it's not necessary to recompute it during brinrescan.
*/
IndexScanDesc
-brinbeginscan(Relation r, int nkeys, int norderbys)
+brinbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
BrinOpaque *opaque;
diff --git a/src/backend/access/gin/ginscan.c b/src/backend/access/gin/ginscan.c
index ae7b0e9bb87..3087a986bc3 100644
--- a/src/backend/access/gin/ginscan.c
+++ b/src/backend/access/gin/ginscan.c
@@ -22,7 +22,7 @@
IndexScanDesc
-ginbeginscan(Relation rel, int nkeys, int norderbys)
+ginbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
GinScanOpaque so;
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index e2c9b5f069c..7b79128f2ce 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -493,12 +493,16 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
if (GistPageIsLeaf(page))
{
+ BlockNumber block = ItemPointerGetBlockNumber(&it->t_tid);
+
/* Creating heap-tuple GISTSearchItem */
item->blkno = InvalidBlockNumber;
item->data.heap.heapPtr = it->t_tid;
item->data.heap.recheck = recheck;
item->data.heap.recheckDistances = recheck_distances;
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+
/*
* In an index-only scan, also fetch the data from the tuple.
*/
@@ -529,6 +533,8 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem,
}
UnlockReleaseBuffer(buffer);
+
+ so->didReset = true;
}
/*
@@ -679,6 +685,8 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
so->curPageData++;
+ index_prefetch(scan, ForwardScanDirection);
+
return true;
}
diff --git a/src/backend/access/gist/gistscan.c b/src/backend/access/gist/gistscan.c
index 00400583c0b..fdf978eaaad 100644
--- a/src/backend/access/gist/gistscan.c
+++ b/src/backend/access/gist/gistscan.c
@@ -22,6 +22,8 @@
#include "utils/memutils.h"
#include "utils/rel.h"
+static void gist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber gist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
/*
* Pairing heap comparison function for the GISTSearchItem queue
@@ -71,7 +73,7 @@ pairingheap_GISTSearchItem_cmp(const pairingheap_node *a, const pairingheap_node
*/
IndexScanDesc
-gistbeginscan(Relation r, int nkeys, int norderbys)
+gistbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
GISTSTATE *giststate;
@@ -111,6 +113,31 @@ gistbeginscan(Relation r, int nkeys, int norderbys)
so->curBlkno = InvalidBlockNumber;
so->curPageLSN = InvalidXLogRecPtr;
+ /*
+ * XXX Maybe this should happen in RelationGetIndexScan? But we need to
+ * define the callbacks, so it needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = gist_prefetch_getblock;
+ prefetcher->get_range = gist_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
scan->opaque = so;
/*
@@ -356,3 +383,42 @@ gistendscan(IndexScanDesc scan)
*/
freeGISTstate(so->giststate);
}
+
+static void
+gist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->didReset;
+ so->didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->curPageData;
+ *end = (so->nPageData - 1);
+ }
+ else
+ {
+ *start = 0;
+ *end = so->curPageData;
+ }
+}
+
+static BlockNumber
+gist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ GISTScanOpaque so = (GISTScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->curPageData) || (index >= so->nPageData))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->pageData[index].heapPtr;
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index fc5d97f606e..01a25132bce 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -48,6 +48,9 @@ static void hashbuildCallback(Relation index,
bool tupleIsAlive,
void *state);
+static void _hash_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber _hash_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
+
/*
* Hash handler function: return IndexAmRoutine with access method parameters
@@ -362,7 +365,7 @@ hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
* hashbeginscan() -- start a scan on a hash index
*/
IndexScanDesc
-hashbeginscan(Relation rel, int nkeys, int norderbys)
+hashbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
HashScanOpaque so;
@@ -383,6 +386,31 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL;
so->numKilled = 0;
+ /*
+ * XXX Maybe this should happen in RelationGetIndexScan? But we need to
+ * define the callbacks, so it needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = _hash_prefetch_getblock;
+ prefetcher->get_range = _hash_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
scan->opaque = so;
return scan;
@@ -918,3 +946,42 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf,
else
LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK);
}
+
+static void
+_hash_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->currPos.didReset;
+ so->currPos.didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->currPos.itemIndex;
+ *end = so->currPos.lastItem;
+ }
+ else
+ {
+ *start = so->currPos.firstItem;
+ *end = so->currPos.itemIndex;
+ }
+}
+
+static BlockNumber
+_hash_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ HashScanOpaque so = (HashScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->currPos.firstItem) || (index > so->currPos.lastItem))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->currPos.items[index].heapTid;
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 9ea2a42a07f..b5cea5e23eb 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -434,6 +434,8 @@ _hash_first(IndexScanDesc scan, ScanDirection dir)
currItem = &so->currPos.items[so->currPos.itemIndex];
scan->xs_heaptid = currItem->heapTid;
+ index_prefetch(scan, dir);
+
/* if we're here, _hash_readpage found a valid tuples */
return true;
}
@@ -467,6 +469,7 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
so->currPos.buf = buf;
so->currPos.currPage = BufferGetBlockNumber(buf);
+ so->currPos.didReset = true;
if (ScanDirectionIsForward(dir))
{
@@ -597,6 +600,7 @@ _hash_readpage(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
}
Assert(so->currPos.firstItem <= so->currPos.lastItem);
+
return true;
}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 646135cc21c..b2f4eadc1ea 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
@@ -756,6 +757,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
PROGRESS_CLUSTER_INDEX_RELID
};
int64 ci_val[2];
+ int prefetch_target;
+
+ prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
/* Set phase and OIDOldIndex to columns */
ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -764,7 +768,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tableScan = NULL;
heapScan = NULL;
- indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
+ prefetch_target, prefetch_target);
index_rescan(indexScan, NULL, 0, NULL, 0);
}
else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 722927aebab..264ebe1d8e5 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
scan->xs_hitup = NULL;
scan->xs_hitupdesc = NULL;
+ /* set in each AM when applicable */
+ scan->xs_prefetch = NULL;
+
return scan;
}
@@ -440,8 +443,9 @@ systable_beginscan(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, irel,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
}
@@ -696,8 +700,9 @@ systable_beginscan_ordered(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, indexRelation,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b25b03f7abc..aa8a14624d8 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -59,6 +59,7 @@
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
@@ -106,7 +107,8 @@ do { \
static IndexScanDesc index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap);
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset);
/* ----------------------------------------------------------------
@@ -200,18 +202,36 @@ index_insert(Relation indexRelation,
* index_beginscan - start a scan of an index with amgettuple
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * prefetch_target determines if prefetching is requested for this index scan.
+ * We need to be able to disable this for two reasons. Firstly, we don't want
+ * to do prefetching for IOS (where we hope most of the heap pages won't be
+ * needed). Secondly, we must prevent an infinite loop when determining the
+ * prefetch value for the tablespace - get_tablespace_io_concurrency()
+ * does an index scan internally, which would result in such a loop. So we
+ * simply disable prefetching in systable_beginscan().
+ *
+ * XXX Maybe we should do prefetching even for catalogs, but then disable it
+ * when accessing TableSpaceRelationId. We still need the ability to disable
+ * this and catalogs are expected to be tiny, so prefetching is unlikely to
+ * make a difference.
+ *
+ * XXX The second reason doesn't really apply anymore, now that the
+ * effective_io_concurrency lookup moved to the caller of index_beginscan.
*/
IndexScanDesc
index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys)
+ int nkeys, int norderbys,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false,
+ prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -241,7 +261,8 @@ index_beginscan_bitmap(Relation indexRelation,
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
+ 0, 0); /* no prefetch */
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -258,7 +279,8 @@ index_beginscan_bitmap(Relation indexRelation,
static IndexScanDesc
index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap)
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
@@ -276,8 +298,8 @@ index_beginscan_internal(Relation indexRelation,
/*
* Tell the AM to open a scan.
*/
- scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys,
- norderbys);
+ scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, norderbys,
+ prefetch_target, prefetch_reset);
/* Initialize information for parallel scan. */
scan->parallel_scan = pscan;
scan->xs_temp_snap = temp_snap;
@@ -317,6 +339,16 @@ index_rescan(IndexScanDesc scan,
scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
orderbys, norderbys);
+
+ /* If we're prefetching for this index, maybe reset some of the state. */
+ if (scan->xs_prefetch != NULL)
+ {
+ IndexPrefetch prefetcher = scan->xs_prefetch;
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
+ prefetcher->prefetchReset);
+ }
}
/* ----------------
@@ -487,10 +519,13 @@ index_parallelrescan(IndexScanDesc scan)
* index_beginscan_parallel - join parallel index scan
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * XXX See index_beginscan() for more comments on prefetch_target.
*/
IndexScanDesc
index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
- int norderbys, ParallelIndexScanDesc pscan)
+ int norderbys, ParallelIndexScanDesc pscan,
+ int prefetch_target, int prefetch_reset)
{
Snapshot snapshot;
IndexScanDesc scan;
@@ -499,7 +534,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
RegisterSnapshot(snapshot);
scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
- pscan, true);
+ pscan, true, prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -557,6 +592,9 @@ index_getnext_tid(IndexScanDesc scan, ScanDirection direction)
pgstat_count_index_tuples(scan->indexRelation, 1);
+ /* do index prefetching, if needed */
+ index_prefetch(scan, direction);
+
/* Return the TID of the tuple we found. */
return &scan->xs_heaptid;
}
@@ -988,3 +1026,228 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
return build_local_reloptions(&relopts, attoptions, validate);
}
+
+
+
+/*
+ * Do prefetching, and gradually increase the prefetch distance.
+ *
+ * XXX This is limited to a single index page (because that's where we get
+ * currPos.items from). But index tuples are typically very small, so there
+ * should be quite a bit of stuff to prefetch (especially with deduplicated
+ * indexes, etc.). Does not seem worth reworking the index access to allow
+ * more aggressive prefetching, it's best effort.
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of index pages visited
+ * and index tuples returned, to calculate avg tuples / page, and then
+ * use that to limit prefetching after switching to a new page (instead
+ * of just using prefetchMaxTarget, which can get much larger).
+ *
+ * XXX Obviously, another option is to use the planner estimates - we know
+ * how many rows we're expected to fetch (on average, assuming the estimates
+ * are reasonably accurate), so why not use that. And maybe combine it
+ * with the auto-tuning based on runtime statistics, described above.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such a filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a bit wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only a few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ */
+void
+index_prefetch(IndexScanDesc scan, ScanDirection dir)
+{
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
+ /*
+ * No heap relation means bitmap index scan, which does prefetching at
+ * the bitmap heap scan, so no prefetch here (we can't do it anyway,
+ * without the heap)
+ *
+ * XXX But in this case we should have prefetchMaxTarget=0, because in
+ * index_beginscan_bitmap() we disable prefetching. So maybe we should
+ * just check that.
+ */
+ if (!prefetch)
+ return;
+
+ /* was it initialized correctly? */
+ // Assert(prefetch->prefetchIndex != -1);
+
+ /*
+ * If we got here, prefetching is enabled and it's a node that supports
+ * prefetching (i.e. it can't be a bitmap index scan).
+ */
+ Assert(scan->heapRelation);
+
+ /* gradually increase the prefetch distance */
+ prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+ prefetch->prefetchMaxTarget);
+
+ /*
+ * Did we already reach the point to actually start prefetching? If not,
+ * we're done. We'll try again for the next index tuple.
+ */
+ if (prefetch->prefetchTarget <= 0)
+ return;
+
+ /*
+ * XXX I think we don't need to worry about direction here, that's handled
+ * by how the AMs build the curPos etc. (see nbtsearch.c)
+ */
+ if (ScanDirectionIsForward(dir))
+ {
+ bool reset;
+ int startIndex,
+ endIndex;
+
+ /* get indexes of unprocessed index entries */
+ prefetch->get_range(scan, dir, &startIndex, &endIndex, &reset);
+
+ /*
+ * Did we switch to a different index block? if yes, reset relevant
+ * info so that we start prefetching from scratch.
+ */
+ if (reset)
+ {
+ prefetch->prefetchTarget = prefetch->prefetchReset;
+ prefetch->prefetchIndex = startIndex; /* maybe -1 instead? */
+ pgBufferUsage.blks_prefetch_rounds++;
+ }
+
+ /*
+ * Adjust the range, based on what we already prefetched, and also
+ * based on the prefetch target.
+ *
+ * XXX We need to adjust the end index first, because it depends on
+ * the actual position, before we consider how far we prefetched.
+ */
+ endIndex = Min(endIndex, startIndex + prefetch->prefetchTarget);
+ startIndex = Max(startIndex, prefetch->prefetchIndex + 1);
+
+ for (int i = startIndex; i <= endIndex; i++)
+ {
+ bool recently_prefetched = false;
+ BlockNumber block;
+
+ block = prefetch->get_block(scan, dir, i);
+
+ /*
+ * Do not prefetch the same block over and over again.
+ *
+ * This happens e.g. for clustered or naturally correlated indexes
+ * (fkey to a sequence ID). It's not expensive (the block is in page
+ * cache already, so no I/O), but it's not free either.
+ *
+ * XXX We can't just check blocks between startIndex and endIndex,
+ * because at some point (after the prefetch target gets ramped up)
+ * it's going to be just a single block.
+ *
+ * XXX The solution here is pretty trivial - we keep a tiny array of
+ * the 8 most recently prefetched blocks and search it linearly. We
+ * could keep a longer history, or maintain a proper "already
+ * prefetched" structure (e.g. a small LRU keyed by block number).
+ */
+ for (int j = 0; j < 8; j++)
+ {
+ /* the cached block might be InvalidBlockNumber, but that's fine */
+ if (prefetch->cacheBlocks[j] == block)
+ {
+ recently_prefetched = true;
+ break;
+ }
+ }
+
+ if (recently_prefetched)
+ continue;
+
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+ pgBufferUsage.blks_prefetches++;
+
+ prefetch->cacheBlocks[prefetch->cacheIndex] = block;
+ prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
+ }
+
+ prefetch->prefetchIndex = endIndex;
+ }
+ else
+ {
+ bool reset;
+ int startIndex,
+ endIndex;
+
+ /* get indexes of unprocessed index entries */
+ prefetch->get_range(scan, dir, &startIndex, &endIndex, &reset);
+
+ /* FIXME handle the reset flag */
+
+ /*
+ * Adjust the range, based on what we already prefetched, and also
+ * based on the prefetch target.
+ *
+ * XXX We need to adjust the start index first, because it depends on
+ * the actual position, before we consider how far we prefetched (which
+ * for backwards scans is the end index).
+ */
+ startIndex = Max(startIndex, endIndex - prefetch->prefetchTarget);
+ endIndex = Min(endIndex, prefetch->prefetchIndex - 1);
+
+ for (int i = endIndex; i >= startIndex; i--)
+ {
+ bool recently_prefetched = false;
+ BlockNumber block;
+
+ block = prefetch->get_block(scan, dir, i);
+
+ /*
+ * Do not prefetch the same block over and over again.
+ *
+ * This happens e.g. for clustered or naturally correlated indexes
+ * (fkey to a sequence ID). It's not expensive (the block is in page
+ * cache already, so no I/O), but it's not free either.
+ *
+ * XXX We can't just check blocks between startIndex and endIndex,
+ * because at some point (after the prefetch target gets ramped up)
+ * it's going to be just a single block.
+ *
+ * XXX The solution here is pretty trivial - we keep a tiny array of
+ * the 8 most recently prefetched blocks and search it linearly. We
+ * could keep a longer history, or maintain a proper "already
+ * prefetched" structure (e.g. a small LRU keyed by block number).
+ */
+ for (int j = 0; j < 8; j++)
+ {
+ /* the cached block might be InvalidBlockNumber, but that's fine */
+ if (prefetch->cacheBlocks[j] == block)
+ {
+ recently_prefetched = true;
+ break;
+ }
+ }
+
+ if (recently_prefetched)
+ continue;
+
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+ pgBufferUsage.blks_prefetches++;
+
+ prefetch->cacheBlocks[prefetch->cacheIndex] = block;
+ prefetch->cacheIndex = (prefetch->cacheIndex + 1) % 8;
+ }
+
+ prefetch->prefetchIndex = startIndex;
+ }
+}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1ce5b15199a..b1a02cc9bcd 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -37,6 +37,7 @@
#include "utils/builtins.h"
#include "utils/index_selfuncs.h"
#include "utils/memutils.h"
+#include "utils/spccache.h"
/*
@@ -87,6 +88,8 @@ static BTVacuumPosting btreevacuumposting(BTVacState *vstate,
OffsetNumber updatedoffset,
int *nremaining);
+static void _bt_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber _bt_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
/*
* Btree handler function: return IndexAmRoutine with access method parameters
@@ -341,7 +344,7 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
* btbeginscan() -- start a scan on a btree index
*/
IndexScanDesc
-btbeginscan(Relation rel, int nkeys, int norderbys)
+btbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
BTScanOpaque so;
@@ -369,6 +372,31 @@ btbeginscan(Relation rel, int nkeys, int norderbys)
so->killedItems = NULL; /* until needed */
so->numKilled = 0;
+ /*
+ * XXX Maybe this should happen in RelationGetIndexScan? But we need to
+ * define the callbacks, so it needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = _bt_prefetch_getblock;
+ prefetcher->get_range = _bt_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
/*
* We don't know yet whether the scan will be index-only, so we do not
* allocate the tuple workspace arrays until btrescan. However, we set up
@@ -1423,3 +1451,42 @@ btcanreturn(Relation index, int attno)
{
return true;
}
+
+static void
+_bt_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->currPos.didReset;
+ so->currPos.didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->currPos.itemIndex;
+ *end = so->currPos.lastItem;
+ }
+ else
+ {
+ *start = so->currPos.firstItem;
+ *end = so->currPos.itemIndex;
+ }
+}
+
+static BlockNumber
+_bt_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ BTScanOpaque so = (BTScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->currPos.firstItem) || (index > so->currPos.lastItem))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->currPos.items[index].heapTid;
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 263f75fce95..762d95d09ed 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -47,7 +47,6 @@ static Buffer _bt_walk_left(Relation rel, Relation heaprel, Buffer buf,
static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
static inline void _bt_initialize_more_data(BTScanOpaque so, ScanDirection dir);
-
/*
* _bt_drop_lock_and_maybe_pin()
*
@@ -1385,7 +1384,6 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
*/
_bt_parallel_done(scan);
BTScanPosInvalidate(so->currPos);
-
return false;
}
else
@@ -1538,6 +1536,12 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
*/
Assert(BufferIsValid(so->currPos.buf));
+ /*
+ * Mark the currPos as reset before loading the next chunk of item
+ * pointers, to restart the prefetching.
+ */
+ so->currPos.didReset = true;
+
page = BufferGetPage(so->currPos.buf);
opaque = BTPageGetOpaque(page);
diff --git a/src/backend/access/spgist/spgscan.c b/src/backend/access/spgist/spgscan.c
index cbfaf0c00ac..79015194b73 100644
--- a/src/backend/access/spgist/spgscan.c
+++ b/src/backend/access/spgist/spgscan.c
@@ -16,6 +16,7 @@
#include "postgres.h"
#include "access/genam.h"
+#include "access/relation.h"
#include "access/relscan.h"
#include "access/spgist_private.h"
#include "miscadmin.h"
@@ -32,6 +33,10 @@ typedef void (*storeRes_func) (SpGistScanOpaque so, ItemPointer heapPtr,
SpGistLeafTuple leafTuple, bool recheck,
bool recheckDistances, double *distances);
+static void spgist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset);
+static BlockNumber spgist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index);
+
+
/*
* Pairing heap comparison function for the SpGistSearchItem queue.
* KNN-searches currently only support NULLS LAST. So, preserve this logic
@@ -191,6 +196,7 @@ resetSpGistScanOpaque(SpGistScanOpaque so)
pfree(so->reconTups[i]);
}
so->iPtr = so->nPtrs = 0;
+ so->didReset = true;
}
/*
@@ -301,7 +307,7 @@ spgPrepareScanKeys(IndexScanDesc scan)
}
IndexScanDesc
-spgbeginscan(Relation rel, int keysz, int orderbysz)
+spgbeginscan(Relation rel, int keysz, int orderbysz, int prefetch_maximum, int prefetch_reset)
{
IndexScanDesc scan;
SpGistScanOpaque so;
@@ -316,6 +322,8 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
so->keyData = NULL;
initSpGistState(&so->state, scan->indexRelation);
+ so->state.heap = relation_open(scan->indexRelation->rd_index->indrelid, NoLock);
+
so->tempCxt = AllocSetContextCreate(CurrentMemoryContext,
"SP-GiST search temporary context",
ALLOCSET_DEFAULT_SIZES);
@@ -371,6 +379,31 @@ spgbeginscan(Relation rel, int keysz, int orderbysz)
so->indexCollation = rel->rd_indcollation[0];
+ /*
+ * XXX Maybe this should happen in RelationGetIndexScan? But we need to
+ * define the callbacks, so it needs to happen here ...
+ *
+ * XXX Do we need to do something for so->markPos?
+ */
+ if (prefetch_maximum > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->prefetchIndex = -1;
+ prefetcher->prefetchTarget = -3;
+ prefetcher->prefetchMaxTarget = prefetch_maximum;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ prefetcher->cacheIndex = 0;
+ memset(prefetcher->cacheBlocks, 0, sizeof(BlockNumber) * 8);
+
+ /* callbacks */
+ prefetcher->get_block = spgist_prefetch_getblock;
+ prefetcher->get_range = spgist_prefetch_getrange;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
scan->opaque = so;
return scan;
@@ -453,6 +486,8 @@ spgendscan(IndexScanDesc scan)
pfree(scan->xs_orderbynulls);
}
+ relation_close(so->state.heap, NoLock);
+
pfree(so);
}
@@ -584,6 +619,13 @@ spgLeafTest(SpGistScanOpaque so, SpGistSearchItem *item,
isnull,
distances);
+ // FIXME prefetch here? or in storeGettuple?
+ {
+ BlockNumber block = ItemPointerGetBlockNumber(&leafTuple->heapPtr);
+
+ PrefetchBuffer(so->state.heap, MAIN_FORKNUM, block);
+ }
+
spgAddSearchItemToQueue(so, heapItem);
MemoryContextSwitchTo(oldCxt);
@@ -1047,7 +1089,12 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
index_store_float8_orderby_distances(scan, so->orderByTypes,
so->distances[so->iPtr],
so->recheckDistances[so->iPtr]);
+
so->iPtr++;
+
+ /* prefetch additional tuples */
+ index_prefetch(scan, dir);
+
return true;
}
@@ -1070,6 +1117,7 @@ spggettuple(IndexScanDesc scan, ScanDirection dir)
pfree(so->reconTups[i]);
}
so->iPtr = so->nPtrs = 0;
+ so->didReset = true;
spgWalk(scan->indexRelation, so, false, storeGettuple,
scan->xs_snapshot);
@@ -1095,3 +1143,42 @@ spgcanreturn(Relation index, int attno)
return cache->config.canReturnData;
}
+
+static void
+spgist_prefetch_getrange(IndexScanDesc scan, ScanDirection dir, int *start, int *end, bool *reset)
+{
+ SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+
+ /* did we rebuild the array of tuple pointers? */
+ *reset = so->didReset;
+ so->didReset = false;
+
+ if (ScanDirectionIsForward(dir))
+ {
+ /* Did we already process the item or is it invalid? */
+ *start = so->iPtr;
+ *end = (so->nPtrs - 1);
+ }
+ else
+ {
+ *start = 0;
+ *end = so->iPtr;
+ }
+}
+
+static BlockNumber
+spgist_prefetch_getblock(IndexScanDesc scan, ScanDirection dir, int index)
+{
+ SpGistScanOpaque so = (SpGistScanOpaque) scan->opaque;
+ ItemPointer tid;
+
+ if ((index < so->iPtr) || (index >= so->nPtrs))
+ return InvalidBlockNumber;
+
+ /* get the tuple ID and extract the block number */
+ tid = &so->heapPtrs[index];
+
+ Assert(ItemPointerIsValid(tid));
+
+ return ItemPointerGetBlockNumber(tid);
+}
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 190e4f76a9e..4aac68f0766 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -17,6 +17,7 @@
#include "access/amvalidate.h"
#include "access/htup_details.h"
+#include "access/relation.h"
#include "access/reloptions.h"
#include "access/spgist_private.h"
#include "access/toast_compression.h"
@@ -334,6 +335,9 @@ initSpGistState(SpGistState *state, Relation index)
state->index = index;
+ /* we'll initialize the reference in spgbeginscan */
+ state->heap = NULL;
+
/* Get cached static information about index */
cache = spgGetCache(index);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 15f9bddcdf3..0e41ffa8fc0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3558,6 +3558,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
!INSTR_TIME_IS_ZERO(usage->blk_write_time));
bool has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+ bool has_prefetches = (usage->blks_prefetches > 0);
bool show_planning = (planning && (has_shared ||
has_local || has_temp || has_timing ||
has_temp_timing));
@@ -3655,6 +3656,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
appendStringInfoChar(es->str, '\n');
}
+ /* As above, show only positive counter values. */
+ if (has_prefetches)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Prefetches:");
+
+ if (usage->blks_prefetches > 0)
+ appendStringInfo(es->str, " blocks=%lld",
+ (long long) usage->blks_prefetches);
+
+ if (usage->blks_prefetch_rounds > 0)
+ appendStringInfo(es->str, " rounds=%lld",
+ (long long) usage->blks_prefetch_rounds);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+
if (show_planning)
es->indent--;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1d82b64b897..e5ce1dbc953 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -765,11 +765,15 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* May have to restart scan from this point if a potential conflict is
* found.
+ *
+ * XXX Should this do index prefetch? Probably not worth it for unique
+ * constraints, I guess? Otherwise we should calculate prefetch_target
+ * just like in nodeIndexscan etc.
*/
retry:
conflict = false;
found_self = false;
- index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+ index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 9dd71684615..a997aac828f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -157,8 +157,13 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0);
+ /* Start an index scan.
+ *
+ * XXX Should this do index prefetching? We're looking for a single tuple,
+ * probably using a PK / UNIQUE index, so it does not seem worth it. If we
+ * reconsider this, calculate prefetch_target like in nodeIndexscan.
+ */
+ scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0, 0);
retry:
found = false;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d2..434be59fca0 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
dst->local_blks_written += add->local_blks_written;
dst->temp_blks_read += add->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+ dst->blks_prefetches += add->blks_prefetches;
INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
@@ -257,6 +259,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0b43a9b9699..3ecb8470d47 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -87,12 +87,20 @@ IndexOnlyNext(IndexOnlyScanState *node)
* We reach here if the index only scan is not parallel, or if we're
* serially executing an index only scan that was planned to be
* parallel.
+ *
+ * XXX Maybe we should enable prefetching, but prefetch only pages that
+ * are not all-visible (but checking that from the index code seems like
+ * a violation of layering etc).
+ *
+ * XXX This might lead to IOS being slower than plain index scan, if the
+ * table has a lot of pages that need recheck.
*/
scandesc = index_beginscan(node->ss.ss_currentRelation,
node->ioss_RelationDesc,
estate->es_snapshot,
node->ioss_NumScanKeys,
- node->ioss_NumOrderByKeys);
+ node->ioss_NumOrderByKeys,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc = scandesc;
@@ -674,7 +682,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
node->ioss_VMBuffer = InvalidBuffer;
@@ -719,7 +728,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 4540c7781d2..71ae6a47ce5 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,6 +43,7 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
/*
* When an ordering operator is used, tuples fetched from the index that
@@ -85,6 +86,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ Relation heapRel = node->ss.ss_currentRelation;
/*
* extract necessary information from index scan node
@@ -103,6 +105,22 @@ IndexNext(IndexScanState *node)
if (scandesc == NULL)
{
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -111,7 +129,9 @@ IndexNext(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -198,6 +218,23 @@ IndexNextWithReorder(IndexScanState *node)
if (scandesc == NULL)
{
+ Relation heapRel = node->ss.ss_currentRelation;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -206,7 +243,9 @@ IndexNextWithReorder(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -1678,6 +1717,21 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
{
EState *estate = node->ss.ps.state;
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Maybe reduce the value with parallel workers?
+ */
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
index_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -1690,7 +1744,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1726,6 +1782,14 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->iss_ScanDesc =
@@ -1733,7 +1797,9 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076ea..0b02b6265d0 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6218,7 +6218,7 @@ get_actual_variable_endpoint(Relation heapRel,
index_scan = index_beginscan(heapRel, indexRel,
&SnapshotNonVacuumable,
- 1, 0);
+ 1, 0, 0, 0); /* XXX maybe do prefetch? */
/* Set it up for index-only scan */
index_scan->xs_want_itup = true;
index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 4476ff7fba1..80fec7a11f9 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -160,7 +160,9 @@ typedef void (*amadjustmembers_function) (Oid opfamilyoid,
/* prepare for index scan */
typedef IndexScanDesc (*ambeginscan_function) (Relation indexRelation,
int nkeys,
- int norderbys);
+ int norderbys,
+ int prefetch_maximum,
+ int prefetch_reset);
/* (re)start index scan */
typedef void (*amrescan_function) (IndexScanDesc scan,
diff --git a/src/include/access/brin_internal.h b/src/include/access/brin_internal.h
index 97ddc925b27..f17dcdffd86 100644
--- a/src/include/access/brin_internal.h
+++ b/src/include/access/brin_internal.h
@@ -96,7 +96,7 @@ extern bool brininsert(Relation idxRel, Datum *values, bool *nulls,
IndexUniqueCheck checkUnique,
bool indexUnchanged,
struct IndexInfo *indexInfo);
-extern IndexScanDesc brinbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc brinbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern int64 bringetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
extern void brinrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index a3087956654..6a500c5aa1f 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -152,7 +152,9 @@ extern bool index_insert(Relation indexRelation,
extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys);
+ int nkeys, int norderbys,
+ int prefetch_target,
+ int prefetch_reset);
extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
Snapshot snapshot,
int nkeys);
@@ -169,7 +171,9 @@ extern void index_parallelscan_initialize(Relation heapRelation,
extern void index_parallelrescan(IndexScanDesc scan);
extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
Relation indexrel, int nkeys, int norderbys,
- ParallelIndexScanDesc pscan);
+ ParallelIndexScanDesc pscan,
+ int prefetch_target,
+ int prefetch_reset);
extern ItemPointer index_getnext_tid(IndexScanDesc scan,
ScanDirection direction);
struct TupleTableSlot;
@@ -230,4 +234,45 @@ extern HeapTuple systable_getnext_ordered(SysScanDesc sysscan,
ScanDirection direction);
extern void systable_endscan_ordered(SysScanDesc sysscan);
+
+
+extern void index_prefetch(IndexScanDesc scandesc, ScanDirection direction);
+
+/*
+ * XXX not sure it's the right place to define these callbacks etc.
+ */
+typedef void (*prefetcher_getrange_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int *start, int *end,
+ bool *reset);
+
+typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int index);
+
+typedef struct IndexPrefetchData
+{
+ /*
+ * XXX We need to disable this in some cases (e.g. when using index-only
+ * scans, we don't want to prefetch pages). Or maybe we should prefetch
+ * only pages that are not all-visible, that'd be even better.
+ */
+ int prefetchIndex; /* how far we already prefetched */
+ int prefetchTarget; /* how far we should be prefetching */
+ int prefetchMaxTarget; /* maximum prefetching distance */
+ int prefetchReset; /* reset to this distance on rescan */
+
+ /*
+ * a small LRU cache of recently prefetched blocks
+ *
+ * XXX needs to be tiny, to make the (frequent) searches very cheap
+ */
+ BlockNumber cacheBlocks[8];
+ int cacheIndex;
+
+ prefetcher_getblock_function get_block;
+ prefetcher_getrange_function get_range;
+
+} IndexPrefetchData;
+
#endif /* GENAM_H */
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index 6da64928b66..b4bd3b2e202 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -384,7 +384,7 @@ typedef struct GinScanOpaqueData
typedef GinScanOpaqueData *GinScanOpaque;
-extern IndexScanDesc ginbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc ginbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern void ginendscan(IndexScanDesc scan);
extern void ginrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/src/include/access/gist_private.h b/src/include/access/gist_private.h
index 3edc740a3f3..e844a9eed84 100644
--- a/src/include/access/gist_private.h
+++ b/src/include/access/gist_private.h
@@ -176,6 +176,7 @@ typedef struct GISTScanOpaqueData
OffsetNumber curPageData; /* next item to return */
MemoryContext pageDataCxt; /* context holding the fetched tuples, for
* index-only scans */
+ bool didReset; /* reset since last access? */
} GISTScanOpaqueData;
typedef GISTScanOpaqueData *GISTScanOpaque;
diff --git a/src/include/access/gistscan.h b/src/include/access/gistscan.h
index 65911245f74..adf167a60b6 100644
--- a/src/include/access/gistscan.h
+++ b/src/include/access/gistscan.h
@@ -16,7 +16,7 @@
#include "access/amapi.h"
-extern IndexScanDesc gistbeginscan(Relation r, int nkeys, int norderbys);
+extern IndexScanDesc gistbeginscan(Relation r, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern void gistrescan(IndexScanDesc scan, ScanKey key, int nkeys,
ScanKey orderbys, int norderbys);
extern void gistendscan(IndexScanDesc scan);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 9e035270a16..743192997c5 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -124,6 +124,8 @@ typedef struct HashScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
+ bool didReset;
+
HashScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
} HashScanPosData;
@@ -370,7 +372,7 @@ extern bool hashinsert(Relation rel, Datum *values, bool *isnull,
struct IndexInfo *indexInfo);
extern bool hashgettuple(IndexScanDesc scan, ScanDirection dir);
extern int64 hashgetbitmap(IndexScanDesc scan, TIDBitmap *tbm);
-extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc hashbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern void hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
extern void hashendscan(IndexScanDesc scan);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index d6847860959..8d053de461b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -984,6 +984,9 @@ typedef struct BTScanPosData
int lastItem; /* last valid index in items[] */
int itemIndex; /* current index in items[] */
+ /* Was the position reset/rebuilt since the last time we checked it? */
+ bool didReset;
+
BTScanPosItem items[MaxTIDsPerBTreePage]; /* MUST BE LAST */
} BTScanPosData;
@@ -1019,6 +1022,7 @@ typedef BTScanPosData *BTScanPos;
(scanpos).buf = InvalidBuffer; \
(scanpos).lsn = InvalidXLogRecPtr; \
(scanpos).nextTupleOffset = 0; \
+ (scanpos).didReset = true; \
} while (0)
/* We need one of these for each equality-type SK_SEARCHARRAY scan key */
@@ -1127,7 +1131,7 @@ extern bool btinsert(Relation rel, Datum *values, bool *isnull,
IndexUniqueCheck checkUnique,
bool indexUnchanged,
struct IndexInfo *indexInfo);
-extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys);
+extern IndexScanDesc btbeginscan(Relation rel, int nkeys, int norderbys, int prefetch_maximum, int prefetch_reset);
extern Size btestimateparallelscan(void);
extern void btinitparallelscan(void *target);
extern bool btgettuple(IndexScanDesc scan, ScanDirection dir);
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index d03360eac04..c119fe597d8 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -106,6 +106,12 @@ typedef struct IndexFetchTableData
Relation rel;
} IndexFetchTableData;
+/*
+ * Forward declaration, defined in genam.h.
+ */
+typedef struct IndexPrefetchData IndexPrefetchData;
+typedef struct IndexPrefetchData *IndexPrefetch;
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -162,6 +168,9 @@ typedef struct IndexScanDescData
bool *xs_orderbynulls;
bool xs_recheckorderby;
+ /* prefetching state (or NULL if disabled) */
+ IndexPrefetchData *xs_prefetch;
+
/* parallel index scan information, in shared memory */
struct ParallelIndexScanDescData *parallel_scan;
} IndexScanDescData;
diff --git a/src/include/access/spgist.h b/src/include/access/spgist.h
index fe31d32dbe9..e1e2635597c 100644
--- a/src/include/access/spgist.h
+++ b/src/include/access/spgist.h
@@ -203,7 +203,7 @@ extern bool spginsert(Relation index, Datum *values, bool *isnull,
struct IndexInfo *indexInfo);
/* spgscan.c */
-extern IndexScanDesc spgbeginscan(Relation rel, int keysz, int orderbysz);
+extern IndexScanDesc spgbeginscan(Relation rel, int keysz, int orderbysz, int prefetch_maximum, int prefetch_reset);
extern void spgendscan(IndexScanDesc scan);
extern void spgrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
ScanKey orderbys, int norderbys);
diff --git a/src/include/access/spgist_private.h b/src/include/access/spgist_private.h
index c6ef46fc206..e00d4fc90b6 100644
--- a/src/include/access/spgist_private.h
+++ b/src/include/access/spgist_private.h
@@ -144,7 +144,7 @@ typedef struct SpGistTypeDesc
typedef struct SpGistState
{
Relation index; /* index we're working with */
-
+ Relation heap; /* heap the index is defined on */
spgConfigOut config; /* filled in by opclass config method */
SpGistTypeDesc attType; /* type of values to be indexed/restored */
@@ -231,6 +231,7 @@ typedef struct SpGistScanOpaqueData
bool recheckDistances[MaxIndexTuplesPerPage]; /* distance recheck
* flags */
HeapTuple reconTups[MaxIndexTuplesPerPage]; /* reconstructed tuples */
+ bool didReset; /* reset since last access? */
/* distances (for recheck) */
IndexOrderByDistance *distances[MaxIndexTuplesPerPage];
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 87e5e2183bd..97dd3c2c421 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+ int64 blks_prefetch_rounds; /* # of prefetch rounds */
+ int64 blks_prefetches; /* # of buffers prefetched */
instr_time blk_read_time; /* time spent reading blocks */
instr_time blk_write_time; /* time spent writing blocks */
instr_time temp_blk_read_time; /* time spent reading temp blocks */
Hi,
attached is a v4 of the patch, with a fairly major shift in the approach.
Until now the patch very much relied on the AM to provide information about
which blocks to prefetch next (based on the current leaf index page).
This seemed like a natural approach when I started working on the PoC,
but over time I ran into various drawbacks:
* a lot of the logic is at the AM level
* can't prefetch across the index page boundary (have to wait until the
next index leaf page is read by the indexscan)
* doesn't work for distance searches (gist/spgist).
After thinking about this, I decided to ditch this whole idea of
exchanging prefetch information through an API, and to do the prefetching
almost entirely in the indexam code.
The new patch maintains a queue of TIDs (read from index_getnext_tid),
with up to effective_io_concurrency entries - calling getnext_slot()
adds a TID at the queue tail, issues a prefetch for the block, and then
returns the TID from the queue head.
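To illustrate the idea, here's a rough sketch of the queue logic (simplified
for illustration only - the TidQueue struct, tid_queue_next() and the fixed
QUEUE_SIZE are made up here, the actual logic lives in index_getnext_slot()
and index_prefetch() in indexam.c):

#include "postgres.h"
#include "access/genam.h"
#include "access/relscan.h"
#include "storage/bufmgr.h"

#define QUEUE_SIZE	32			/* stand-in for effective_io_concurrency */

typedef struct TidQueue
{
	ItemPointerData items[QUEUE_SIZE];
	uint64		head;			/* next TID to hand back to the caller */
	uint64		tail;			/* next free slot in the ring buffer */
	bool		done;			/* index has no more TIDs */
} TidQueue;

/*
 * Refill the queue from the index, prefetch the heap block of each TID as
 * it is added, then return the TID at the queue head (if any).
 */
static bool
tid_queue_next(IndexScanDesc scan, ScanDirection dir, TidQueue *q,
			   ItemPointerData *result)
{
	while (!q->done && (q->tail - q->head) < QUEUE_SIZE)
	{
		ItemPointer tid = index_getnext_tid(scan, dir);

		if (tid == NULL)
		{
			q->done = true;
			break;
		}

		q->items[q->tail++ % QUEUE_SIZE] = *tid;

		/* async hint for the heap page this TID points to */
		PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM,
					   ItemPointerGetBlockNumber(tid));
	}

	if (q->head == q->tail)
		return false;			/* queue empty and index exhausted */

	*result = q->items[q->head++ % QUEUE_SIZE];
	return true;
}

(The real code also deduplicates blocks and detects sequential patterns
before calling PrefetchBuffer, as described below.)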
Maintaining the queue is up to index_getnext_slot() - it can't be done
in index_getnext_tid(), because then it'd affect IOS (and prefetching the
heap would mostly defeat the whole point of IOS). And we can't do that
above index_getnext_slot() because that already fetched the heap page.
I still think prefetching for IOS is doable (and desirable), in mostly
the same way - except that we'd need to maintain the queue from some
other place, as IOS doesn't do index_getnext_slot().
FWIW there's also the "index-only filters without IOS" patch [1] which
switches even regular index scans to index_getnext_tid(), so maybe
relying on index_getnext_slot() is a lost cause anyway.
Anyway, this has the nice consequence that it makes AM code entirely
oblivious of prefetching - there's no need for an API, we just get TIDs as
before, and the prefetching magic happens after that. Thus it also works
for searches ordered by distance (gist/spgist). The patch got much
smaller (about 40kB, down from 80kB), which is nice.
I ran the benchmarks [2] with this v4 patch, and the results for the
"point" queries are almost exactly the same as for v3. The SAOP part is
still running - I'll add those results in a day or two, but I expect
a similar outcome as for the point queries.
regards
[1]: https://commitfest.postgresql.org/43/4352/
[2]: https://github.com/tvondra/index-prefetch-tests-2/
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: index-prefetch-v4.patch (text/x-patch)
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index e2c9b5f069c..9045c6eb7aa 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -678,7 +678,6 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
scan->xs_hitup = so->pageData[so->curPageData].recontup;
so->curPageData++;
-
return true;
}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 0755be83901..f0412da94ae 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
@@ -751,6 +752,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
PROGRESS_CLUSTER_INDEX_RELID
};
int64 ci_val[2];
+ int prefetch_target;
+
+ prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
/* Set phase and OIDOldIndex to columns */
ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -759,7 +763,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tableScan = NULL;
heapScan = NULL;
- indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
+ prefetch_target, prefetch_target);
index_rescan(indexScan, NULL, 0, NULL, 0);
}
else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 722927aebab..264ebe1d8e5 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
scan->xs_hitup = NULL;
scan->xs_hitupdesc = NULL;
+ /* set in each AM when applicable */
+ scan->xs_prefetch = NULL;
+
return scan;
}
@@ -440,8 +443,9 @@ systable_beginscan(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, irel,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
}
@@ -696,8 +700,9 @@ systable_beginscan_ordered(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, indexRelation,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b25b03f7abc..3722874948f 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -54,11 +54,13 @@
#include "catalog/pg_amproc.h"
#include "catalog/pg_type.h"
#include "commands/defrem.h"
+#include "common/hashfn.h"
#include "nodes/makefuncs.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
@@ -106,7 +108,10 @@ do { \
static IndexScanDesc index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap);
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset);
+
+static void index_prefetch(IndexScanDesc scan, ItemPointer tid);
/* ----------------------------------------------------------------
@@ -200,18 +205,36 @@ index_insert(Relation indexRelation,
* index_beginscan - start a scan of an index with amgettuple
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * prefetch_target determines if prefetching is requested for this index scan.
+ * We need to be able to disable this for two reasons. Firstly, we don't want
+ * to do prefetching for IOS (where we hope most of the heap pages won't be
+ * really needed). Secondly, we must prevent an infinite loop when determining
+ * prefetch value for the tablespace - the get_tablespace_io_concurrency()
+ * does an index scan internally, which would result in an infinite loop. So we
+ * simply disable prefetching in systable_beginscan().
+ *
+ * XXX Maybe we should do prefetching even for catalogs, but then disable it
+ * when accessing TableSpaceRelationId. We still need the ability to disable
+ * this and catalogs are expected to be tiny, so prefetching is unlikely to
+ * make a difference.
+ *
+ * XXX The second reason doesn't really apply after effective_io_concurrency
+ * lookup moved to caller of index_beginscan.
*/
IndexScanDesc
index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys)
+ int nkeys, int norderbys,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false,
+ prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -241,7 +264,8 @@ index_beginscan_bitmap(Relation indexRelation,
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
+ 0, 0); /* no prefetch */
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -258,7 +282,8 @@ index_beginscan_bitmap(Relation indexRelation,
static IndexScanDesc
index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap)
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
@@ -276,12 +301,27 @@ index_beginscan_internal(Relation indexRelation,
/*
* Tell the AM to open a scan.
*/
- scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys,
- norderbys);
+ scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, norderbys);
/* Initialize information for parallel scan. */
scan->parallel_scan = pscan;
scan->xs_temp_snap = temp_snap;
+ /* with prefetching enabled, initialize the necessary state */
+ if (prefetch_target > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->queueIndex = 0;
+ prefetcher->queueStart = 0;
+ prefetcher->queueEnd = 0;
+
+ prefetcher->prefetchTarget = 0;
+ prefetcher->prefetchMaxTarget = prefetch_target;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
return scan;
}
@@ -317,6 +357,20 @@ index_rescan(IndexScanDesc scan,
scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
orderbys, norderbys);
+
+ /* If we're prefetching for this index, maybe reset some of the state. */
+ if (scan->xs_prefetch != NULL)
+ {
+ IndexPrefetch prefetcher = scan->xs_prefetch;
+
+ prefetcher->queueStart = 0;
+ prefetcher->queueEnd = 0;
+ prefetcher->queueIndex = 0;
+ prefetcher->prefetchDone = false;
+
+ prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
+ prefetcher->prefetchReset);
+ }
}
/* ----------------
@@ -345,6 +399,17 @@ index_endscan(IndexScanDesc scan)
if (scan->xs_temp_snap)
UnregisterSnapshot(scan->xs_snapshot);
+ /* If prefetching enabled, log prefetch stats. */
+ if (scan->xs_prefetch)
+ {
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
+ elog(LOG, "index prefetch stats: requests %lu prefetches %lu (%f) skip cached %lu sequential %lu",
+ prefetch->countAll, prefetch->countPrefetch,
+ prefetch->countPrefetch * 100.0 / prefetch->countAll,
+ prefetch->countSkipCached, prefetch->countSkipSequential);
+ }
+
/* Release the scan data structure itself */
IndexScanEnd(scan);
}
@@ -487,10 +552,13 @@ index_parallelrescan(IndexScanDesc scan)
* index_beginscan_parallel - join parallel index scan
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * XXX See index_beginscan() for more comments on prefetch_target.
*/
IndexScanDesc
index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
- int norderbys, ParallelIndexScanDesc pscan)
+ int norderbys, ParallelIndexScanDesc pscan,
+ int prefetch_target, int prefetch_reset)
{
Snapshot snapshot;
IndexScanDesc scan;
@@ -499,7 +567,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
RegisterSnapshot(snapshot);
scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
- pscan, true);
+ pscan, true, prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -623,20 +691,74 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
bool
index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
{
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
for (;;)
{
+ /* with prefetching enabled, accumulate enough TIDs into the prefetch queue */
+ if (PREFETCH_ACTIVE(prefetch))
+ {
+ /*
+ * incrementally ramp up prefetch distance
+ *
+ * XXX Intentionally done first, so that with prefetching there's
+ * always at least one item in the queue.
+ */
+ prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+ prefetch->prefetchMaxTarget);
+
+ /*
+ * get more TIDs while there is empty space in the queue (considering the
+ * current prefetch target)
+ */
+ while (!PREFETCH_FULL(prefetch))
+ {
+ ItemPointer tid;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scan, direction);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ prefetch->prefetchDone = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+
+ prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)] = *tid;
+ prefetch->queueEnd++;
+
+ index_prefetch(scan, tid);
+ }
+ }
+
if (!scan->xs_heap_continue)
{
- ItemPointer tid;
+ if (PREFETCH_ENABLED(prefetch))
+ {
+ /* prefetching enabled, but reached the end and queue empty */
+ if (PREFETCH_DONE(prefetch))
+ break;
+
+ scan->xs_heaptid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)];
+ prefetch->queueIndex++;
+ }
+ else /* not prefetching, just do the regular work */
+ {
+ ItemPointer tid;
- /* Time to fetch the next TID from the index */
- tid = index_getnext_tid(scan, direction);
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scan, direction);
- /* If we're out of index entries, we're done */
- if (tid == NULL)
- break;
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ break;
+
+ Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+ }
- Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
}
/*
@@ -988,3 +1110,258 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
return build_local_reloptions(&relopts, attoptions, validate);
}
+
+/*
+ * Add the block to the tiny top-level queue (LRU), and check if the block
+ * is in a sequential pattern.
+ */
+static bool
+index_prefetch_is_sequential(IndexPrefetch prefetch, BlockNumber block)
+{
+ int idx;
+
+ /* If the queue is empty, just store the block and we're done. */
+ if (prefetch->blockIndex == 0)
+ {
+ prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+ prefetch->blockIndex++;
+ return false;
+ }
+
+ /*
+ * Otherwise, check if it's the same as the immediately preceding block (we
+ * don't want to prefetch the same block over and over.)
+ */
+ if (prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex - 1)] == block)
+ return true;
+
+ /* Not the same block, so add it to the queue. */
+ prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+ prefetch->blockIndex++;
+
+ /* check for a sequential pattern a couple of requests back */
+ for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
+ {
+ /* not enough requests to confirm a sequential pattern */
+ if (prefetch->blockIndex < i)
+ return false;
+
+ /*
+ * index of the already requested buffer (-1 because we already
+ * incremented the index when adding the block to the queue)
+ */
+ idx = PREFETCH_BLOCK_INDEX(prefetch->blockIndex - i - 1);
+
+ /* the blocks don't form a sequential pattern */
+ if (prefetch->blockItems[idx] != (block - i))
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * index_prefetch_add_cache
+ * Add a block to the cache, return true if it was recently prefetched.
+ *
+ * When checking a block, we need to check if it was recently prefetched,
+ * where recently means within PREFETCH_CACHE_SIZE requests. This check
+ * needs to be very cheap, even with fairly large caches (hundreds of
+ * entries). The cache does not need to be perfect, we can accept false
+ * positives/negatives, as long as the rate is reasonably low. We also
+ * need to expire entries, so that only "recent" requests are remembered.
+ *
+ * A queue would allow expiring the requests, but checking if a block was
+ * prefetched would be expensive (linear search for longer queues). Another
+ * option would be a hash table, but that has issues with expiring entries
+ * cheaply (which usually degrades the hash table).
+ *
+ * So we use a cache that is organized as multiple small LRU caches. Each
+ * block is mapped to a particular LRU by hashing (so it's a bit like a
+ * hash table), and each LRU is tiny (e.g. 8 entries). The LRU only keeps
+ * the most recent requests (for that particular LRU).
+ *
+ * This allows quick searches and expiration, with false negatives (when
+ * a particular LRU has too many collisions).
+ *
+ * For example, imagine 128 LRU caches, each with 8 entries - that's 1024
+ * prefetch requests in total.
+ *
+ * The recency is determined using a prefetch counter, incremented every
+ * time we end up prefetching a block. The counter is uint64, so it should
+ * not wrap (125 zebibytes, would take ~4 million years at 1GB/s).
+ *
+ * To check if a block was prefetched recently, we calculate hash(block),
+ * and then linearly search if the tiny LRU has an entry for the same block,
+ * requested less than PREFETCH_CACHE_SIZE requests ago.
+ *
+ * At the same time, we either update the entry (for the same block) if
+ * found, or replace the oldest/empty entry.
+ *
+ * If the block was not recently prefetched (i.e. we want to prefetch it),
+ * we increment the counter.
+ */
+static bool
+index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
+{
+ PrefetchCacheEntry *entry;
+
+ /* calculate which LRU to use */
+ int lru = hash_uint32(block) % PREFETCH_LRU_COUNT;
+
+ /* entry to (maybe) use for this block request */
+ uint64 oldestRequest = PG_UINT64_MAX;
+ int oldestIndex = -1;
+
+ /*
+ * First add the block to the (tiny) top-level LRU cache and see if it's
+ * part of a sequential pattern. In this case we just ignore the block
+ * and don't prefetch it - we expect read-ahead to do a better job.
+ *
+ * XXX Maybe we should still add the block to the later cache, in case
+ * we happen to access it later? That might help if we first scan a lot
+ * of the table sequentially, and then randomly. Not sure that's very
+ * likely with index access, though.
+ */
+ if (index_prefetch_is_sequential(prefetch, block))
+ {
+ prefetch->countSkipSequential++;
+ return true;
+ }
+
+ /* see if we already have prefetched this block (linear search of LRU) */
+ for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+ {
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+ /* Is this the oldest prefetch request in this LRU? */
+ if (entry->request < oldestRequest)
+ {
+ oldestRequest = entry->request;
+ oldestIndex = i;
+ }
+
+ /* Request numbers are positive, so 0 means "unused". */
+ if (entry->request == 0)
+ continue;
+
+ /* Is this entry for the same block as the current request? */
+ if (entry->block == block)
+ {
+ bool prefetched;
+
+ /*
+ * Is the old request sufficiently recent? If yes, we treat the
+ * block as already prefetched.
+ *
+ * XXX We do add the cache size to the request in order not to
+ * have issues with uint64 underflows.
+ */
+ prefetched = (entry->request + PREFETCH_CACHE_SIZE >= prefetch->prefetchReqNumber);
+
+ /* Update the request number. */
+ entry->request = ++prefetch->prefetchReqNumber;
+
+ prefetch->countSkipCached += (prefetched) ? 1 : 0;
+
+ return prefetched;
+ }
+ }
+
+ /*
+ * We didn't find the block in the LRU, so store it either in an empty
+ * entry, or in the "oldest" prefetch request in this LRU.
+ */
+ Assert((oldestIndex >= 0) && (oldestIndex < PREFETCH_LRU_SIZE));
+
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + oldestIndex];
+
+ entry->block = block;
+ entry->request = ++prefetch->prefetchReqNumber;
+
+ /* not in the prefetch cache */
+ return false;
+}
+
+/*
+ * Do prefetching, and gradually increase the prefetch distance.
+ *
+ * XXX This is limited to a single index page (because that's where we get
+ * currPos.items from). But index tuples are typically very small, so there
+ * should be quite a bit of stuff to prefetch (especially with deduplicated
+ * indexes, etc.). Does not seem worth reworking the index access to allow
+ * more aggressive prefetching, it's best effort.
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of index pages visited
+ * and index tuples returned, to calculate avg tuples / page, and then
+ * use that to limit prefetching after switching to a new page (instead
+ * of just using prefetchMaxTarget, which can get much larger).
+ *
+ * XXX Obviously, another option is to use the planner estimates - we know
+ * how many rows we're expected to fetch (on average, assuming the estimates
+ * are reasonably accurate), so why not to use that. And maybe combine it
+ * with the auto-tuning based on runtime statistics, described above.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a bit wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ */
+static void
+index_prefetch(IndexScanDesc scan, ItemPointer tid)
+{
+ IndexPrefetch prefetch = scan->xs_prefetch;
+ BlockNumber block;
+
+ /*
+ * No heap relation means bitmap index scan, which does prefetching at
+ * the bitmap heap scan, so no prefetch here (we can't do it anyway,
+ * without the heap).
+ *
+ * XXX But in this case we should have prefetchMaxTarget=0, because in
+ * index_beginscan_bitmap() we disable prefetching. So maybe we should
+ * just check that.
+ */
+ if (!prefetch)
+ return;
+
+ /* was it initialized correctly? */
+ // Assert(prefetch->prefetchIndex != -1);
+
+ /*
+ * If we got here, prefetching is enabled and it's a node that supports
+ * prefetching (i.e. it can't be a bitmap index scan).
+ */
+ Assert(scan->heapRelation);
+
+ prefetch->countAll++;
+
+ block = ItemPointerGetBlockNumber(tid);
+
+ /*
+ * Do not prefetch the same block over and over again.
+ *
+ * This happens e.g. for clustered or naturally correlated indexes
+ * (fkey to a sequence ID). It's not expensive (the block is in page
+ * cache already, so no I/O), but it's not free either.
+ */
+ if (!index_prefetch_add_cache(prefetch, block))
+ {
+ prefetch->countPrefetch++;
+
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+ pgBufferUsage.blks_prefetches++;
+ }
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 15f9bddcdf3..0e41ffa8fc0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3558,6 +3558,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
!INSTR_TIME_IS_ZERO(usage->blk_write_time));
bool has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+ bool has_prefetches = (usage->blks_prefetches > 0);
bool show_planning = (planning && (has_shared ||
has_local || has_temp || has_timing ||
has_temp_timing));
@@ -3655,6 +3656,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
appendStringInfoChar(es->str, '\n');
}
+ /* As above, show only positive counter values. */
+ if (has_prefetches)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Prefetches:");
+
+ if (usage->blks_prefetches > 0)
+ appendStringInfo(es->str, " blocks=%lld",
+ (long long) usage->blks_prefetches);
+
+ if (usage->blks_prefetch_rounds > 0)
+ appendStringInfo(es->str, " rounds=%lld",
+ (long long) usage->blks_prefetch_rounds);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+
if (show_planning)
es->indent--;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1d82b64b897..e5ce1dbc953 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -765,11 +765,15 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* May have to restart scan from this point if a potential conflict is
* found.
+ *
+ * XXX Should this do index prefetch? Probably not worth it for unique
+ * constraints, I guess? Otherwise we should calculate prefetch_target
+ * just like in nodeIndexscan etc.
*/
retry:
conflict = false;
found_self = false;
- index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+ index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 9dd71684615..a997aac828f 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -157,8 +157,13 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0);
+ /* Start an index scan.
+ *
+ * XXX Should this do index prefetching? We're looking for a single tuple,
+ * probably using a PK / UNIQUE index, so it does not seem worth it. If we
+ * reconsider this, calculate prefetch_target like in nodeIndexscan.
+ */
+ scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0, 0);
retry:
found = false;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d2..434be59fca0 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
dst->local_blks_written += add->local_blks_written;
dst->temp_blks_read += add->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+ dst->blks_prefetches += add->blks_prefetches;
INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
@@ -257,6 +259,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0b43a9b9699..3ecb8470d47 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -87,12 +87,20 @@ IndexOnlyNext(IndexOnlyScanState *node)
* We reach here if the index only scan is not parallel, or if we're
* serially executing an index only scan that was planned to be
* parallel.
+ *
+ * XXX Maybe we should enable prefetching, but prefetch only pages that
+ * are not all-visible (but checking that from the index code seems like
+ * a violation of layering etc).
+ *
+ * XXX This might lead to IOS being slower than plain index scan, if the
+ * table has a lot of pages that need recheck.
*/
scandesc = index_beginscan(node->ss.ss_currentRelation,
node->ioss_RelationDesc,
estate->es_snapshot,
node->ioss_NumScanKeys,
- node->ioss_NumOrderByKeys);
+ node->ioss_NumOrderByKeys,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc = scandesc;
@@ -674,7 +682,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
node->ioss_VMBuffer = InvalidBuffer;
@@ -719,7 +728,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 4540c7781d2..71ae6a47ce5 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,6 +43,7 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
/*
* When an ordering operator is used, tuples fetched from the index that
@@ -85,6 +86,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ Relation heapRel = node->ss.ss_currentRelation;
/*
* extract necessary information from index scan node
@@ -103,6 +105,22 @@ IndexNext(IndexScanState *node)
if (scandesc == NULL)
{
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -111,7 +129,9 @@ IndexNext(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -198,6 +218,23 @@ IndexNextWithReorder(IndexScanState *node)
if (scandesc == NULL)
{
+ Relation heapRel = node->ss.ss_currentRelation;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -206,7 +243,9 @@ IndexNextWithReorder(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -1678,6 +1717,21 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
{
EState *estate = node->ss.ps.state;
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Maybe reduce the value with parallel workers?
+ */
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
index_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -1690,7 +1744,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1726,6 +1782,14 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->iss_ScanDesc =
@@ -1733,7 +1797,9 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d3a136b6f55..c7248877f6c 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1131,6 +1131,8 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
need_full_snapshot = true;
}
+ elog(LOG, "slot = %s need_full_snapshot = %d", cmd->slotname, need_full_snapshot);
+
ctx = CreateInitDecodingContext(cmd->plugin, NIL, need_full_snapshot,
InvalidXLogRecPtr,
XL_ROUTINE(.page_read = logical_read_xlog_page,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076ea..0b02b6265d0 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6218,7 +6218,7 @@ get_actual_variable_endpoint(Relation heapRel,
index_scan = index_beginscan(heapRel, indexRel,
&SnapshotNonVacuumable,
- 1, 0);
+ 1, 0, 0, 0); /* XXX maybe do prefetch? */
/* Set it up for index-only scan */
index_scan->xs_want_itup = true;
index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index a3087956654..f3efffc4a84 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -17,6 +17,7 @@
#include "access/sdir.h"
#include "access/skey.h"
#include "nodes/tidbitmap.h"
+#include "storage/bufmgr.h"
#include "storage/lockdefs.h"
#include "utils/relcache.h"
#include "utils/snapshot.h"
@@ -152,7 +153,9 @@ extern bool index_insert(Relation indexRelation,
extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys);
+ int nkeys, int norderbys,
+ int prefetch_target,
+ int prefetch_reset);
extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
Snapshot snapshot,
int nkeys);
@@ -169,7 +172,9 @@ extern void index_parallelscan_initialize(Relation heapRelation,
extern void index_parallelrescan(IndexScanDesc scan);
extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
Relation indexrel, int nkeys, int norderbys,
- ParallelIndexScanDesc pscan);
+ ParallelIndexScanDesc pscan,
+ int prefetch_target,
+ int prefetch_reset);
extern ItemPointer index_getnext_tid(IndexScanDesc scan,
ScanDirection direction);
struct TupleTableSlot;
@@ -230,4 +235,108 @@ extern HeapTuple systable_getnext_ordered(SysScanDesc sysscan,
ScanDirection direction);
extern void systable_endscan_ordered(SysScanDesc sysscan);
+/*
+ * XXX not sure it's the right place to define these callbacks etc.
+ */
+typedef void (*prefetcher_getrange_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int *start, int *end,
+ bool *reset);
+
+typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int index);
+
+/*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches. Doesn't need to be perfectly accurate, but we
+ * aim to make false positives/negatives reasonably low.
+ */
+typedef struct PrefetchCacheEntry {
+ BlockNumber block;
+ uint64 request;
+} PrefetchCacheEntry;
+
+/*
+ * Size of the cache of recently prefetched blocks - shouldn't be too
+ * small or too large. 1024 seems about right, it covers ~8MB of data.
+ * It's somewhat arbitrary, there's no particular formula saying it
+ * should not be higher/lower.
+ *
+ * The cache is structured as an array of small LRU caches, so the total
+ * size needs to be a multiple of LRU size. The LRU should be tiny to
+ * keep linear search cheap enough.
+ *
+ * XXX Maybe we could consider effective_cache_size or something?
+ */
+#define PREFETCH_LRU_SIZE 8
+#define PREFETCH_LRU_COUNT 128
+#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+
+/*
+ * Used to detect sequential patterns (and disable prefetching).
+ */
+#define PREFETCH_QUEUE_HISTORY 8
+#define PREFETCH_SEQ_PATTERN_BLOCKS 4
+
+
+typedef struct IndexPrefetchData
+{
+ /*
+ * XXX We need to disable this in some cases (e.g. when using index-only
+ * scans, we don't want to prefetch pages). Or maybe we should prefetch
+ * only pages that are not all-visible, that'd be even better.
+ */
+ int prefetchTarget; /* how far we should be prefetching */
+ int prefetchMaxTarget; /* maximum prefetching distance */
+ int prefetchReset; /* reset to this distance on rescan */
+ bool prefetchDone; /* did we get all TIDs from the index? */
+
+ /* runtime statistics */
+ uint64 countAll; /* all prefetch requests */
+ uint64 countPrefetch; /* actual prefetches */
+ uint64 countSkipSequential;
+ uint64 countSkipCached;
+
+ /*
+ * Queue of TIDs to prefetch.
+ *
+ * XXX Sizing for MAX_IO_CONCURRENCY may be overkill, but it seems simpler
+ * than dynamically adjusting for custom values.
+ */
+ ItemPointerData queueItems[MAX_IO_CONCURRENCY];
+ uint64 queueIndex; /* next TID to prefetch */
+ uint64 queueStart; /* first valid TID in queue */
+ uint64 queueEnd; /* first invalid (empty) TID in queue */
+
+ /*
+ * A couple of the last prefetched blocks, used to check for certain access
+ * patterns and skip prefetching (e.g. for sequential access).
+ *
+ * XXX Separate from the main queue, because we only want to compare the
+ * block numbers, not the whole TID. In sequential access it's likely we
+ * read many items from each page, and we don't want to check many items
+ * (as that is much more expensive).
+ */
+ BlockNumber blockItems[PREFETCH_QUEUE_HISTORY];
+ uint64 blockIndex; /* index into blockItems (points to the
+ * first empty entry) */
+
+ /*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches.
+ */
+ uint64 prefetchReqNumber;
+ PrefetchCacheEntry prefetchCache[PREFETCH_CACHE_SIZE];
+
+} IndexPrefetchData;
+
+#define PREFETCH_QUEUE_INDEX(a) ((a) % (MAX_IO_CONCURRENCY))
+#define PREFETCH_QUEUE_EMPTY(p) ((p)->queueEnd == (p)->queueIndex)
+#define PREFETCH_ENABLED(p) ((p) && ((p)->prefetchMaxTarget > 0))
+#define PREFETCH_FULL(p) ((p)->queueEnd - (p)->queueIndex == (p)->prefetchTarget)
+#define PREFETCH_DONE(p) ((p) && ((p)->prefetchDone && PREFETCH_QUEUE_EMPTY(p)))
+#define PREFETCH_ACTIVE(p) (PREFETCH_ENABLED(p) && !(p)->prefetchDone)
+#define PREFETCH_BLOCK_INDEX(v) ((v) % PREFETCH_QUEUE_HISTORY)
+
#endif /* GENAM_H */
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index d03360eac04..c119fe597d8 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -106,6 +106,12 @@ typedef struct IndexFetchTableData
Relation rel;
} IndexFetchTableData;
+/*
+ * Forward declaration, defined in genam.h.
+ */
+typedef struct IndexPrefetchData IndexPrefetchData;
+typedef struct IndexPrefetchData *IndexPrefetch;
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -162,6 +168,9 @@ typedef struct IndexScanDescData
bool *xs_orderbynulls;
bool xs_recheckorderby;
+ /* prefetching state (or NULL if disabled) */
+ IndexPrefetchData *xs_prefetch;
+
/* parallel index scan information, in shared memory */
struct ParallelIndexScanDescData *parallel_scan;
} IndexScanDescData;
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 87e5e2183bd..97dd3c2c421 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+ int64 blks_prefetch_rounds; /* # of prefetch rounds */
+ int64 blks_prefetches; /* # of buffers prefetched */
instr_time blk_read_time; /* time spent reading blocks */
instr_time blk_write_time; /* time spent writing blocks */
instr_time temp_blk_read_time; /* time spent reading temp blocks */
Here's a v5 of the patch, rebased to current master and fixing a couple of
compiler warnings reported by cfbot (%lu vs. UINT64_FORMAT in some debug
messages). No other changes compared to v4.
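For reference, this is the kind of change involved (an illustrative fragment,
not the exact hunk - the counter here is just an example): uint64 values
can't be portably printed with %lu, so the format string is assembled with
the UINT64_FORMAT macro instead:

	uint64		count = 42;		/* example counter, not from the patch */

	/* not portable - uint64 may be "unsigned long long" on some platforms */
	elog(LOG, "index prefetch requests %lu", count);

	/* portable - UINT64_FORMAT expands to the right conversion specifier */
	elog(LOG, "index prefetch requests " UINT64_FORMAT, count);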
cfbot also reported a failure on Windows in pg_dump [1], but it seems
pretty strange:
[11:42:48.708] ------------------------------------- 8< -------------------------------------
[11:42:48.708] stderr:
[11:42:48.708] # Failed test 'connecting to an invalid database: matches'
The patch does nothing related to pg_dump, and the test works perfectly
fine for me (I don't have a Windows machine, but both 32-bit and 64-bit
Linux work fine for me).
regards
[1]: https://cirrus-ci.com/task/6398095366291456
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: index-prefetch-v5.patch (text/x-patch)
diff --git a/src/backend/access/gist/gistget.c b/src/backend/access/gist/gistget.c
index e2c9b5f069..9045c6eb7a 100644
--- a/src/backend/access/gist/gistget.c
+++ b/src/backend/access/gist/gistget.c
@@ -678,7 +678,6 @@ gistgettuple(IndexScanDesc scan, ScanDirection dir)
scan->xs_hitup = so->pageData[so->curPageData].recontup;
so->curPageData++;
-
return true;
}
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 5a17112c91..0b6c920ebd 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -44,6 +44,7 @@
#include "storage/smgr.h"
#include "utils/builtins.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
static void reform_and_rewrite_tuple(HeapTuple tuple,
Relation OldHeap, Relation NewHeap,
@@ -751,6 +752,9 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
PROGRESS_CLUSTER_INDEX_RELID
};
int64 ci_val[2];
+ int prefetch_target;
+
+ prefetch_target = get_tablespace_io_concurrency(OldHeap->rd_rel->reltablespace);
/* Set phase and OIDOldIndex to columns */
ci_val[0] = PROGRESS_CLUSTER_PHASE_INDEX_SCAN_HEAP;
@@ -759,7 +763,8 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
tableScan = NULL;
heapScan = NULL;
- indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0);
+ indexScan = index_beginscan(OldHeap, OldIndex, SnapshotAny, 0, 0,
+ prefetch_target, prefetch_target);
index_rescan(indexScan, NULL, 0, NULL, 0);
}
else
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 722927aeba..264ebe1d8e 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -126,6 +126,9 @@ RelationGetIndexScan(Relation indexRelation, int nkeys, int norderbys)
scan->xs_hitup = NULL;
scan->xs_hitupdesc = NULL;
+ /* set in each AM when applicable */
+ scan->xs_prefetch = NULL;
+
return scan;
}
@@ -440,8 +443,9 @@ systable_beginscan(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, irel,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
}
@@ -696,8 +700,9 @@ systable_beginscan_ordered(Relation heapRelation,
elog(ERROR, "column is not in index");
}
+ /* no index prefetch for system catalogs */
sysscan->iscan = index_beginscan(heapRelation, indexRelation,
- snapshot, nkeys, 0);
+ snapshot, nkeys, 0, 0, 0);
index_rescan(sysscan->iscan, key, nkeys, NULL, 0);
sysscan->scan = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index b25b03f7ab..0b8f136f04 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -54,11 +54,13 @@
#include "catalog/pg_amproc.h"
#include "catalog/pg_type.h"
#include "commands/defrem.h"
+#include "common/hashfn.h"
#include "nodes/makefuncs.h"
#include "pgstat.h"
#include "storage/bufmgr.h"
#include "storage/lmgr.h"
#include "storage/predicate.h"
+#include "utils/lsyscache.h"
#include "utils/ruleutils.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"
@@ -106,7 +108,10 @@ do { \
static IndexScanDesc index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap);
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset);
+
+static void index_prefetch(IndexScanDesc scan, ItemPointer tid);
/* ----------------------------------------------------------------
@@ -200,18 +205,36 @@ index_insert(Relation indexRelation,
* index_beginscan - start a scan of an index with amgettuple
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * prefetch_target determines if prefetching is requested for this index scan.
+ * We need to be able to disable this for two reasons. Firstly, we don't want
+ * to do prefetching for IOS (where we hope most of the heap pages won't be
+ * really needed). Secondly, we must prevent an infinite loop when determining
+ * prefetch value for the tablespace - the get_tablespace_io_concurrency()
+ * does an index scan internally, which would result in an infinite loop. So we
+ * simply disable prefetching in systable_beginscan().
+ *
+ * XXX Maybe we should do prefetching even for catalogs, but then disable it
+ * when accessing TableSpaceRelationId. We still need the ability to disable
+ * this and catalogs are expected to be tiny, so prefetching is unlikely to
+ * make a difference.
+ *
+ * XXX The second reason doesn't really apply after effective_io_concurrency
+ * lookup moved to caller of index_beginscan.
*/
IndexScanDesc
index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys)
+ int nkeys, int norderbys,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, norderbys, snapshot, NULL, false,
+ prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -241,7 +264,8 @@ index_beginscan_bitmap(Relation indexRelation,
Assert(snapshot != InvalidSnapshot);
- scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false);
+ scan = index_beginscan_internal(indexRelation, nkeys, 0, snapshot, NULL, false,
+ 0, 0); /* no prefetch */
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -258,7 +282,8 @@ index_beginscan_bitmap(Relation indexRelation,
static IndexScanDesc
index_beginscan_internal(Relation indexRelation,
int nkeys, int norderbys, Snapshot snapshot,
- ParallelIndexScanDesc pscan, bool temp_snap)
+ ParallelIndexScanDesc pscan, bool temp_snap,
+ int prefetch_target, int prefetch_reset)
{
IndexScanDesc scan;
@@ -276,12 +301,27 @@ index_beginscan_internal(Relation indexRelation,
/*
* Tell the AM to open a scan.
*/
- scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys,
- norderbys);
+ scan = indexRelation->rd_indam->ambeginscan(indexRelation, nkeys, norderbys);
/* Initialize information for parallel scan. */
scan->parallel_scan = pscan;
scan->xs_temp_snap = temp_snap;
+ /* with prefetching enabled, initialize the necessary state */
+ if (prefetch_target > 0)
+ {
+ IndexPrefetch prefetcher = palloc0(sizeof(IndexPrefetchData));
+
+ prefetcher->queueIndex = 0;
+ prefetcher->queueStart = 0;
+ prefetcher->queueEnd = 0;
+
+ prefetcher->prefetchTarget = 0;
+ prefetcher->prefetchMaxTarget = prefetch_target;
+ prefetcher->prefetchReset = prefetch_reset;
+
+ scan->xs_prefetch = prefetcher;
+ }
+
return scan;
}
@@ -317,6 +357,20 @@ index_rescan(IndexScanDesc scan,
scan->indexRelation->rd_indam->amrescan(scan, keys, nkeys,
orderbys, norderbys);
+
+ /* If we're prefetching for this index, maybe reset some of the state. */
+ if (scan->xs_prefetch != NULL)
+ {
+ IndexPrefetch prefetcher = scan->xs_prefetch;
+
+ prefetcher->queueStart = 0;
+ prefetcher->queueEnd = 0;
+ prefetcher->queueIndex = 0;
+ prefetcher->prefetchDone = false;
+
+ prefetcher->prefetchTarget = Min(prefetcher->prefetchTarget,
+ prefetcher->prefetchReset);
+ }
}
/* ----------------
@@ -345,6 +399,19 @@ index_endscan(IndexScanDesc scan)
if (scan->xs_temp_snap)
UnregisterSnapshot(scan->xs_snapshot);
+ /* If prefetching enabled, log prefetch stats. */
+ if (scan->xs_prefetch)
+ {
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
+ elog(LOG, "index prefetch stats: requests " UINT64_FORMAT " prefetches " UINT64_FORMAT " (%f) skip cached " UINT64_FORMAT " sequential " UINT64_FORMAT,
+ prefetch->countAll,
+ prefetch->countPrefetch,
+ prefetch->countPrefetch * 100.0 / prefetch->countAll,
+ prefetch->countSkipCached,
+ prefetch->countSkipSequential);
+ }
+
/* Release the scan data structure itself */
IndexScanEnd(scan);
}
@@ -487,10 +554,13 @@ index_parallelrescan(IndexScanDesc scan)
* index_beginscan_parallel - join parallel index scan
*
* Caller must be holding suitable locks on the heap and the index.
+ *
+ * XXX See index_beginscan() for more comments on prefetch_target.
*/
IndexScanDesc
index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
- int norderbys, ParallelIndexScanDesc pscan)
+ int norderbys, ParallelIndexScanDesc pscan,
+ int prefetch_target, int prefetch_reset)
{
Snapshot snapshot;
IndexScanDesc scan;
@@ -499,7 +569,7 @@ index_beginscan_parallel(Relation heaprel, Relation indexrel, int nkeys,
snapshot = RestoreSnapshot(pscan->ps_snapshot_data);
RegisterSnapshot(snapshot);
scan = index_beginscan_internal(indexrel, nkeys, norderbys, snapshot,
- pscan, true);
+ pscan, true, prefetch_target, prefetch_reset);
/*
* Save additional parameters into the scandesc. Everything else was set
@@ -623,20 +693,74 @@ index_fetch_heap(IndexScanDesc scan, TupleTableSlot *slot)
bool
index_getnext_slot(IndexScanDesc scan, ScanDirection direction, TupleTableSlot *slot)
{
+ IndexPrefetch prefetch = scan->xs_prefetch;
+
for (;;)
{
+ /* with prefetching enabled, accumulate enough TIDs into the prefetch queue */
+ if (PREFETCH_ACTIVE(prefetch))
+ {
+ /*
+ * incrementally ramp up prefetch distance
+ *
+ * XXX Intentionally done first, so that with prefetching there's
+ * always at least one item in the queue.
+ */
+ prefetch->prefetchTarget = Min(prefetch->prefetchTarget + 1,
+ prefetch->prefetchMaxTarget);
+
+ /*
+ * get more TIDs while there is empty space in the queue (considering the
+ * current prefetch target)
+ */
+ while (!PREFETCH_FULL(prefetch))
+ {
+ ItemPointer tid;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scan, direction);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ prefetch->prefetchDone = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+
+ prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueEnd)] = *tid;
+ prefetch->queueEnd++;
+
+ index_prefetch(scan, tid);
+ }
+ }
+
if (!scan->xs_heap_continue)
{
- ItemPointer tid;
+ if (PREFETCH_ENABLED(prefetch))
+ {
+ /* prefetching enabled, but reached the end and queue empty */
+ if (PREFETCH_DONE(prefetch))
+ break;
+
+ scan->xs_heaptid = prefetch->queueItems[PREFETCH_QUEUE_INDEX(prefetch->queueIndex)];
+ prefetch->queueIndex++;
+ }
+ else /* not prefetching, just do the regular work */
+ {
+ ItemPointer tid;
- /* Time to fetch the next TID from the index */
- tid = index_getnext_tid(scan, direction);
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scan, direction);
- /* If we're out of index entries, we're done */
- if (tid == NULL)
- break;
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ break;
+
+ Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
+ }
- Assert(ItemPointerEquals(tid, &scan->xs_heaptid));
}
/*
@@ -988,3 +1112,258 @@ index_opclass_options(Relation indrel, AttrNumber attnum, Datum attoptions,
return build_local_reloptions(&relopts, attoptions, validate);
}
+
+/*
+ * Add the block to the tiny top-level queue (LRU), and check if the block
+ * is in a sequential pattern.
+ */
+static bool
+index_prefetch_is_sequential(IndexPrefetch prefetch, BlockNumber block)
+{
+ int idx;
+
+ /* If the queue is empty, just store the block and we're done. */
+ if (prefetch->blockIndex == 0)
+ {
+ prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+ prefetch->blockIndex++;
+ return false;
+ }
+
+ /*
+ * Otherwise, check if it's the same as the immediately preceding block (we
+ * don't want to prefetch the same block over and over).
+ */
+ if (prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex - 1)] == block)
+ return true;
+
+ /* Not the same block, so add it to the queue. */
+ prefetch->blockItems[PREFETCH_BLOCK_INDEX(prefetch->blockIndex)] = block;
+ prefetch->blockIndex++;
+
+ /* check for a sequential pattern over the last couple of requests */
+ for (int i = 1; i < PREFETCH_SEQ_PATTERN_BLOCKS; i++)
+ {
+ /* not enough requests to confirm a sequential pattern */
+ if (prefetch->blockIndex <= i)
+ return false;
+
+ /*
+ * index of the already requested buffer (-1 because we already
+ * incremented the index when adding the block to the queue)
+ */
+ idx = PREFETCH_BLOCK_INDEX(prefetch->blockIndex - i - 1);
+
+ /* does this earlier block break the expected sequential pattern? */
+ if (prefetch->blockItems[idx] != (block - i))
+ return false;
+ }
+
+ return true;
+}
+
+/*
+ * index_prefetch_add_cache
+ * Add a block to the cache, return true if it was recently prefetched.
+ *
+ * When checking a block, we need to check if it was recently prefetched,
+ * where recently means within PREFETCH_CACHE_SIZE requests. This check
+ * needs to be very cheap, even with fairly large caches (hundreds of
+ * entries). The cache does not need to be perfect, we can accept false
+ * positives/negatives, as long as the rate is reasonably low. We also
+ * need to expire entries, so that only "recent" requests are remembered.
+ *
+ * A queue would allow expiring the requests, but checking if a block was
+ * prefetched would be expensive (linear search for longer queues). Another
+ * option would be a hash table, but that has issues with expiring entries
+ * cheaply (which usually degrades the hash table).
+ *
+ * So we use a cache that is organized as multiple small LRU caches. Each
+ * block is mapped to a particular LRU by hashing (so it's a bit like a
+ * hash table), and each LRU is tiny (e.g. 8 entries). The LRU only keeps
+ * the most recent requests (for that particular LRU).
+ *
+ * This allows quick searches and expiration, with false negatives (when
+ * a particular LRU has too many collisions).
+ *
+ * For example, imagine 128 LRU caches, each with 8 entries - that's 1024
+ * prefetch requests in total.
+ *
+ * The recency is determined using a prefetch counter, incremented every
+ * time we end up prefetching a block. The counter is uint64, so it should
+ * not wrap (125 zebibytes, would take ~4 million years at 1GB/s).
+ *
+ * To check if a block was prefetched recently, we calculate hash(block),
+ * and then linearly search the tiny LRU for an entry with the same block
+ * and a request number less than PREFETCH_CACHE_SIZE requests ago.
+ *
+ * At the same time, we either update the entry (for the same block) if
+ * found, or replace the oldest/empty entry.
+ *
+ * If the block was not recently prefetched (i.e. we want to prefetch it),
+ * we increment the counter.
+ */
+static bool
+index_prefetch_add_cache(IndexPrefetch prefetch, BlockNumber block)
+{
+ PrefetchCacheEntry *entry;
+
+ /* calculate which LRU to use */
+ int lru = hash_uint32(block) % PREFETCH_LRU_COUNT;
+
+ /* entry to (maybe) use for this block request */
+ uint64 oldestRequest = PG_UINT64_MAX;
+ int oldestIndex = -1;
+
+ /*
+ * First add the block to the (tiny) top-level LRU cache and see if it's
+ * part of a sequential pattern. In this case we just ignore the block
+ * and don't prefetch it - we expect read-ahead to do a better job.
+ *
+ * XXX Maybe we should still add the block to the main (LRU) cache, in case
+ * we happen to access it later? That might help if we first scan a lot
+ * of the table sequentially, and then randomly. Not sure that's very
+ * likely with index access, though.
+ */
+ if (index_prefetch_is_sequential(prefetch, block))
+ {
+ prefetch->countSkipSequential++;
+ return true;
+ }
+
+ /* see if we already have prefetched this block (linear search of LRU) */
+ for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
+ {
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + i];
+
+ /* Is this the oldest prefetch request in this LRU? */
+ if (entry->request < oldestRequest)
+ {
+ oldestRequest = entry->request;
+ oldestIndex = i;
+ }
+
+ /* Request numbers are positive, so 0 means "unused". */
+ if (entry->request == 0)
+ continue;
+
+ /* Is this entry for the same block as the current request? */
+ if (entry->block == block)
+ {
+ bool prefetched;
+
+ /*
+ * Is the old request sufficiently recent? If yes, we treat the
+ * block as already prefetched.
+ *
+ * XXX We do add the cache size to the request in order not to
+ * have issues with uint64 underflows.
+ */
+ prefetched = (entry->request + PREFETCH_CACHE_SIZE >= prefetch->prefetchReqNumber);
+
+ /* Update the request number. */
+ entry->request = ++prefetch->prefetchReqNumber;
+
+ prefetch->countSkipCached += (prefetched) ? 1 : 0;
+
+ return prefetched;
+ }
+ }
+
+ /*
+ * We didn't find the block in the LRU, so store it either in an empty
+ * entry, or in the "oldest" prefetch request in this LRU.
+ */
+ Assert((oldestIndex >= 0) && (oldestIndex < PREFETCH_LRU_SIZE));
+
+ entry = &prefetch->prefetchCache[lru * PREFETCH_LRU_SIZE + oldestIndex];
+
+ entry->block = block;
+ entry->request = ++prefetch->prefetchReqNumber;
+
+ /* not in the prefetch cache */
+ return false;
+}
+
+/*
+ * Do prefetching, and gradually increase the prefetch distance.
+ *
+ * XXX This is limited to a single index page (because that's where we get
+ * currPos.items from). But index tuples are typically very small, so there
+ * should be quite a bit of stuff to prefetch (especially with deduplicated
+ * indexes, etc.). Does not seem worth reworking the index access to allow
+ * more aggressive prefetching, it's best effort.
+ *
+ * XXX Some ideas how to auto-tune the prefetching, so that unnecessary
+ * prefetching does not cause significant regressions (e.g. for nestloop
+ * with inner index scan). We could track number of index pages visited
+ * and index tuples returned, to calculate avg tuples / page, and then
+ * use that to limit prefetching after switching to a new page (instead
+ * of just using prefetchMaxTarget, which can get much larger).
+ *
+ * XXX Obviously, another option is to use the planner estimates - we know
+ * how many rows we're expected to fetch (on average, assuming the estimates
+ * are reasonably accurate), so why not use that. And maybe combine it
+ * with the auto-tuning based on runtime statistics, described above.
+ *
+ * XXX The prefetching may interfere with the patch allowing us to evaluate
+ * conditions on the index tuple, in which case we may not need the heap
+ * tuple. Maybe if there's such filter, we should prefetch only pages that
+ * are not all-visible (and the same idea would also work for IOS), but
+ * it also makes the indexing a bit "aware" of the visibility stuff (which
+ * seems a bit wrong). Also, maybe we should consider the filter selectivity
+ * (if the index-only filter is expected to eliminate only few rows, then
+ * the vm check is pointless). Maybe this could/should be auto-tuning too,
+ * i.e. we could track how many heap tuples were needed after all, and then
+ * we would consider this when deciding whether to prefetch all-visible
+ * pages or not (matters only for regular index scans, not IOS).
+ *
+ * XXX Maybe we could/should also prefetch the next index block, e.g. stored
+ * in BTScanPosData.nextPage.
+ */
+static void
+index_prefetch(IndexScanDesc scan, ItemPointer tid)
+{
+ IndexPrefetch prefetch = scan->xs_prefetch;
+ BlockNumber block;
+
+ /*
+ * No heap relation means bitmap index scan, which does prefetching at
+ * the bitmap heap scan, so no prefetch here (we can't do it anyway,
+ * without the heap)
+ *
+ * XXX But in this case we should have prefetchMaxTarget=0, because in
+ * index_beginscan_bitmap() we disable prefetching. So maybe we should
+ * just check that.
+ */
+ if (!prefetch)
+ return;
+
+ /* was it initialized correctly? */
+ // Assert(prefetch->prefetchIndex != -1);
+
+ /*
+ * If we got here, prefetching is enabled and it's a node that supports
+ * prefetching (i.e. it can't be a bitmap index scan).
+ */
+ Assert(scan->heapRelation);
+
+ prefetch->countAll++;
+
+ block = ItemPointerGetBlockNumber(tid);
+
+ /*
+ * Do not prefetch the same block over and over again.
+ *
+ * This happens e.g. for clustered or naturally correlated indexes
+ * (fkey to a sequence ID). It's not expensive (the block is in page
+ * cache already, so no I/O), but it's not free either.
+ */
+ if (!index_prefetch_add_cache(prefetch, block))
+ {
+ prefetch->countPrefetch++;
+
+ PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);
+ pgBufferUsage.blks_prefetches++;
+ }
+}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8570b14f62..6ae445d62c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3558,6 +3558,7 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
!INSTR_TIME_IS_ZERO(usage->blk_write_time));
bool has_temp_timing = (!INSTR_TIME_IS_ZERO(usage->temp_blk_read_time) ||
!INSTR_TIME_IS_ZERO(usage->temp_blk_write_time));
+ bool has_prefetches = (usage->blks_prefetches > 0);
bool show_planning = (planning && (has_shared ||
has_local || has_temp || has_timing ||
has_temp_timing));
@@ -3655,6 +3656,23 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage, bool planning)
appendStringInfoChar(es->str, '\n');
}
+ /* As above, show only positive counter values. */
+ if (has_prefetches)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Prefetches:");
+
+ if (usage->blks_prefetches > 0)
+ appendStringInfo(es->str, " blocks=%lld",
+ (long long) usage->blks_prefetches);
+
+ if (usage->blks_prefetch_rounds > 0)
+ appendStringInfo(es->str, " rounds=%lld",
+ (long long) usage->blks_prefetch_rounds);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+
if (show_planning)
es->indent--;
}
diff --git a/src/backend/executor/execIndexing.c b/src/backend/executor/execIndexing.c
index 1d82b64b89..e5ce1dbc95 100644
--- a/src/backend/executor/execIndexing.c
+++ b/src/backend/executor/execIndexing.c
@@ -765,11 +765,15 @@ check_exclusion_or_unique_constraint(Relation heap, Relation index,
/*
* May have to restart scan from this point if a potential conflict is
* found.
+ *
+ * XXX Should this do index prefetch? Probably not worth it for unique
+ * constraints, I guess? Otherwise we should calculate prefetch_target
+ * just like in nodeIndexscan etc.
*/
retry:
conflict = false;
found_self = false;
- index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0);
+ index_scan = index_beginscan(heap, index, &DirtySnapshot, indnkeyatts, 0, 0, 0);
index_rescan(index_scan, scankeys, indnkeyatts, NULL, 0);
while (index_getnext_slot(index_scan, ForwardScanDirection, existing_slot))
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index e776524227..c0bb732658 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -204,8 +204,13 @@ RelationFindReplTupleByIndex(Relation rel, Oid idxoid,
/* Build scan key. */
skey_attoff = build_replindex_scan_key(skey, rel, idxrel, searchslot);
- /* Start an index scan. */
- scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0);
+ /*
+ * Start an index scan.
+ *
+ * XXX Should this do index prefetching? We're looking for a single tuple,
+ * probably using a PK / UNIQUE index, so it does not seem worth it. If we
+ * reconsider this, calculate prefetch_target like in nodeIndexscan.
+ */
+ scan = index_beginscan(rel, idxrel, &snap, skey_attoff, 0, 0, 0);
retry:
found = false;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d..434be59fca 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -235,6 +235,8 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
dst->local_blks_written += add->local_blks_written;
dst->temp_blks_read += add->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds;
+ dst->blks_prefetches += add->blks_prefetches;
INSTR_TIME_ADD(dst->blk_read_time, add->blk_read_time);
INSTR_TIME_ADD(dst->blk_write_time, add->blk_write_time);
INSTR_TIME_ADD(dst->temp_blk_read_time, add->temp_blk_read_time);
@@ -257,6 +259,8 @@ BufferUsageAccumDiff(BufferUsage *dst,
dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
+ dst->blks_prefetches += add->blks_prefetches - sub->blks_prefetches;
+ dst->blks_prefetch_rounds += add->blks_prefetch_rounds - sub->blks_prefetch_rounds;
INSTR_TIME_ACCUM_DIFF(dst->blk_read_time,
add->blk_read_time, sub->blk_read_time);
INSTR_TIME_ACCUM_DIFF(dst->blk_write_time,
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 0b43a9b969..3ecb8470d4 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -87,12 +87,20 @@ IndexOnlyNext(IndexOnlyScanState *node)
* We reach here if the index only scan is not parallel, or if we're
* serially executing an index only scan that was planned to be
* parallel.
+ *
+ * XXX Maybe we should enable prefetching, but prefetch only pages that
+ * are not all-visible (but checking that from the index code seems like
+ * a violation of layering etc).
+ *
+ * XXX This might lead to IOS being slower than plain index scan, if the
+ * table has a lot of pages that need recheck.
*/
scandesc = index_beginscan(node->ss.ss_currentRelation,
node->ioss_RelationDesc,
estate->es_snapshot,
node->ioss_NumScanKeys,
- node->ioss_NumOrderByKeys);
+ node->ioss_NumOrderByKeys,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc = scandesc;
@@ -674,7 +682,8 @@ ExecIndexOnlyScanInitializeDSM(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
node->ioss_VMBuffer = InvalidBuffer;
@@ -719,7 +728,8 @@ ExecIndexOnlyScanInitializeWorker(IndexOnlyScanState *node,
node->ioss_RelationDesc,
node->ioss_NumScanKeys,
node->ioss_NumOrderByKeys,
- piscan);
+ piscan,
+ 0, 0); /* no index prefetch for IOS */
node->ioss_ScanDesc->xs_want_itup = true;
/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 4540c7781d..71ae6a47ce 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -43,6 +43,7 @@
#include "utils/lsyscache.h"
#include "utils/memutils.h"
#include "utils/rel.h"
+#include "utils/spccache.h"
/*
* When an ordering operator is used, tuples fetched from the index that
@@ -85,6 +86,7 @@ IndexNext(IndexScanState *node)
ScanDirection direction;
IndexScanDesc scandesc;
TupleTableSlot *slot;
+ Relation heapRel = node->ss.ss_currentRelation;
/*
* extract necessary information from index scan node
@@ -103,6 +105,22 @@ IndexNext(IndexScanState *node)
if (scandesc == NULL)
{
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -111,7 +129,9 @@ IndexNext(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -198,6 +218,23 @@ IndexNextWithReorder(IndexScanState *node)
if (scandesc == NULL)
{
+ Relation heapRel = node->ss.ss_currentRelation;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Should this also look at plan.plan_rows and maybe cap the target
+ * to that? Pointless to prefetch more than we expect to use. Or maybe
+ * just reset to that value during prefetching, after reading the next
+ * index page (or rather after rescan)?
+ */
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
+
/*
* We reach here if the index scan is not parallel, or if we're
* serially executing an index scan that was planned to be parallel.
@@ -206,7 +243,9 @@ IndexNextWithReorder(IndexScanState *node)
node->iss_RelationDesc,
estate->es_snapshot,
node->iss_NumScanKeys,
- node->iss_NumOrderByKeys);
+ node->iss_NumOrderByKeys,
+ prefetch_target,
+ prefetch_reset);
node->iss_ScanDesc = scandesc;
@@ -1678,6 +1717,21 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
{
EState *estate = node->ss.ps.state;
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ /*
+ * Determine number of heap pages to prefetch for this index. This is
+ * essentially just effective_io_concurrency for the table (or the
+ * tablespace it's in).
+ *
+ * XXX Maybe reduce the value with parallel workers?
+ */
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_allocate(pcxt->toc, node->iss_PscanLen);
index_parallelscan_initialize(node->ss.ss_currentRelation,
@@ -1690,7 +1744,9 @@ ExecIndexScanInitializeDSM(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
@@ -1726,6 +1782,14 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
ParallelWorkerContext *pwcxt)
{
ParallelIndexScanDesc piscan;
+ Relation heapRel;
+ int prefetch_target;
+ int prefetch_reset;
+
+ heapRel = node->ss.ss_currentRelation;
+
+ prefetch_target = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);
+ prefetch_reset = Min(prefetch_target, node->ss.ps.plan->plan_rows);
piscan = shm_toc_lookup(pwcxt->toc, node->ss.ps.plan->plan_node_id, false);
node->iss_ScanDesc =
@@ -1733,7 +1797,9 @@ ExecIndexScanInitializeWorker(IndexScanState *node,
node->iss_RelationDesc,
node->iss_NumScanKeys,
node->iss_NumOrderByKeys,
- piscan);
+ piscan,
+ prefetch_target,
+ prefetch_reset);
/*
* If no run-time keys to calculate or they are ready, go ahead and pass
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index d27ef2985d..d65575fd10 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1131,6 +1131,8 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
need_full_snapshot = true;
}
+ elog(LOG, "slot = %s need_full_snapshot = %d", cmd->slotname, need_full_snapshot);
+
ctx = CreateInitDecodingContext(cmd->plugin, NIL, need_full_snapshot,
InvalidXLogRecPtr,
XL_ROUTINE(.page_read = logical_read_xlog_page,
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index c4fcd0076e..0b02b6265d 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6218,7 +6218,7 @@ get_actual_variable_endpoint(Relation heapRel,
index_scan = index_beginscan(heapRel, indexRel,
&SnapshotNonVacuumable,
- 1, 0);
+ 1, 0, 0, 0); /* XXX maybe do prefetch? */
/* Set it up for index-only scan */
index_scan->xs_want_itup = true;
index_rescan(index_scan, scankeys, 1, NULL, 0);
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index a308795665..f3efffc4a8 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -17,6 +17,7 @@
#include "access/sdir.h"
#include "access/skey.h"
#include "nodes/tidbitmap.h"
+#include "storage/bufmgr.h"
#include "storage/lockdefs.h"
#include "utils/relcache.h"
#include "utils/snapshot.h"
@@ -152,7 +153,9 @@ extern bool index_insert(Relation indexRelation,
extern IndexScanDesc index_beginscan(Relation heapRelation,
Relation indexRelation,
Snapshot snapshot,
- int nkeys, int norderbys);
+ int nkeys, int norderbys,
+ int prefetch_target,
+ int prefetch_reset);
extern IndexScanDesc index_beginscan_bitmap(Relation indexRelation,
Snapshot snapshot,
int nkeys);
@@ -169,7 +172,9 @@ extern void index_parallelscan_initialize(Relation heapRelation,
extern void index_parallelrescan(IndexScanDesc scan);
extern IndexScanDesc index_beginscan_parallel(Relation heaprel,
Relation indexrel, int nkeys, int norderbys,
- ParallelIndexScanDesc pscan);
+ ParallelIndexScanDesc pscan,
+ int prefetch_target,
+ int prefetch_reset);
extern ItemPointer index_getnext_tid(IndexScanDesc scan,
ScanDirection direction);
struct TupleTableSlot;
@@ -230,4 +235,108 @@ extern HeapTuple systable_getnext_ordered(SysScanDesc sysscan,
ScanDirection direction);
extern void systable_endscan_ordered(SysScanDesc sysscan);
+/*
+ * XXX not sure it's the right place to define these callbacks etc.
+ */
+typedef void (*prefetcher_getrange_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int *start, int *end,
+ bool *reset);
+
+typedef BlockNumber (*prefetcher_getblock_function) (IndexScanDesc scandesc,
+ ScanDirection direction,
+ int index);
+
+/*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches. Doesn't need to be perfectly accurate, but we
+ * aim to make false positives/negatives reasonably low.
+ */
+typedef struct PrefetchCacheEntry {
+ BlockNumber block;
+ uint64 request;
+} PrefetchCacheEntry;
+
+/*
+ * Size of the cache of recently prefetched blocks - shouldn't be too
+ * small or too large. 1024 seems about right, it covers ~8MB of data.
+ * It's somewhat arbitrary, there's no particular formula saying it
+ * should not be higher/lower.
+ *
+ * The cache is structured as an array of small LRU caches, so the total
+ * size needs to be a multiple of LRU size. The LRU should be tiny to
+ * keep linear search cheap enough.
+ *
+ * XXX Maybe we could consider effective_cache_size or something?
+ */
+#define PREFETCH_LRU_SIZE 8
+#define PREFETCH_LRU_COUNT 128
+#define PREFETCH_CACHE_SIZE (PREFETCH_LRU_SIZE * PREFETCH_LRU_COUNT)
+
+/*
+ * Used to detect sequential patterns (and disable prefetching).
+ */
+#define PREFETCH_QUEUE_HISTORY 8
+#define PREFETCH_SEQ_PATTERN_BLOCKS 4
+
+
+typedef struct IndexPrefetchData
+{
+ /*
+ * XXX We need to disable this in some cases (e.g. when using index-only
+ * scans, we don't want to prefetch pages). Or maybe we should prefetch
+ * only pages that are not all-visible, that'd be even better.
+ */
+ int prefetchTarget; /* how far we should be prefetching */
+ int prefetchMaxTarget; /* maximum prefetching distance */
+ int prefetchReset; /* reset to this distance on rescan */
+ bool prefetchDone; /* did we get all TIDs from the index? */
+
+ /* runtime statistics */
+ uint64 countAll; /* all prefetch requests */
+ uint64 countPrefetch; /* actual prefetches */
+ uint64 countSkipSequential;
+ uint64 countSkipCached;
+
+ /*
+ * Queue of TIDs to prefetch.
+ *
+ * XXX Sizing for MAX_IO_CONCURRENCY may be overkill, but it seems simpler
+ * than dynamically adjusting for custom values.
+ */
+ ItemPointerData queueItems[MAX_IO_CONCURRENCY];
+ uint64 queueIndex; /* next TID to prefetch */
+ uint64 queueStart; /* first valid TID in queue */
+ uint64 queueEnd; /* first invalid (empty) TID in queue */
+
+ /*
+ * The last couple of prefetched blocks, used to check for certain access
+ * patterns and skip prefetching (e.g. for sequential access).
+ *
+ * XXX Separate from the main queue, because we only want to compare the
+ * block numbers, not the whole TID. In sequential access it's likely we
+ * read many items from each page, and we don't want to check many items
+ * (as that is much more expensive).
+ */
+ BlockNumber blockItems[PREFETCH_QUEUE_HISTORY];
+ uint64 blockIndex; /* index into blockItems (points to the first
+ * empty entry) */
+
+ /*
+ * Cache of recently prefetched blocks, organized as a hash table of
+ * small LRU caches.
+ */
+ uint64 prefetchReqNumber;
+ PrefetchCacheEntry prefetchCache[PREFETCH_CACHE_SIZE];
+
+} IndexPrefetchData;
+
+#define PREFETCH_QUEUE_INDEX(a) ((a) % (MAX_IO_CONCURRENCY))
+#define PREFETCH_QUEUE_EMPTY(p) ((p)->queueEnd == (p)->queueIndex)
+#define PREFETCH_ENABLED(p) ((p) && ((p)->prefetchMaxTarget > 0))
+#define PREFETCH_FULL(p) ((p)->queueEnd - (p)->queueIndex == (p)->prefetchTarget)
+#define PREFETCH_DONE(p) ((p) && ((p)->prefetchDone && PREFETCH_QUEUE_EMPTY(p)))
+#define PREFETCH_ACTIVE(p) (PREFETCH_ENABLED(p) && !(p)->prefetchDone)
+#define PREFETCH_BLOCK_INDEX(v) ((v) % PREFETCH_QUEUE_HISTORY)
+
#endif /* GENAM_H */
diff --git a/src/include/access/relscan.h b/src/include/access/relscan.h
index d03360eac0..c119fe597d 100644
--- a/src/include/access/relscan.h
+++ b/src/include/access/relscan.h
@@ -106,6 +106,12 @@ typedef struct IndexFetchTableData
Relation rel;
} IndexFetchTableData;
+/*
+ * Forward declaration, defined in genam.h.
+ */
+typedef struct IndexPrefetchData IndexPrefetchData;
+typedef struct IndexPrefetchData *IndexPrefetch;
+
/*
* We use the same IndexScanDescData structure for both amgettuple-based
* and amgetbitmap-based index scans. Some fields are only relevant in
@@ -162,6 +168,9 @@ typedef struct IndexScanDescData
bool *xs_orderbynulls;
bool xs_recheckorderby;
+ /* prefetching state (or NULL if disabled) */
+ IndexPrefetchData *xs_prefetch;
+
/* parallel index scan information, in shared memory */
struct ParallelIndexScanDescData *parallel_scan;
} IndexScanDescData;
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 87e5e2183b..97dd3c2c42 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -33,6 +33,8 @@ typedef struct BufferUsage
int64 local_blks_written; /* # of local disk blocks written */
int64 temp_blks_read; /* # of temp blocks read */
int64 temp_blks_written; /* # of temp blocks written */
+ int64 blks_prefetch_rounds; /* # of prefetch rounds */
+ int64 blks_prefetches; /* # of buffers prefetched */
instr_time blk_read_time; /* time spent reading blocks */
instr_time blk_write_time; /* time spent writing blocks */
instr_time temp_blk_read_time; /* time spent reading temp blocks */
Hi,
Attached is a v6 of the patch, which rebases v5 (just some minor
bitrot), and also does a couple of changes which I kept in separate patches
to make it obvious what changed.
0001-v5-20231016.patch
----------------------
Rebase to current master.
0002-comments-and-minor-cleanup-20231012.patch
----------------------------------------------
Various comment improvements (remove obsolete ones, clarify a bunch of
other comments, etc.). I tried to explain the reasoning why some places
disable prefetching (e.g. in catalogs, replication, ...), explain how
the caching / LRU works etc.
0003-remove-prefetch_reset-20231016.patch
-----------------------------------------
I decided to remove the separate prefetch_reset parameter, so that all
the index_beginscan() methods only take a parameter specifying the
maximum prefetch target. The reset was added early on, when the prefetch
happened much lower in the AM code, at the index page level, and the
reset happened when moving to the next index page. Now that the prefetch
has moved to the executor, this doesn't make much sense - the resets happen
on rescans, and it seems right to just reset to 0 (just like for bitmap
heap scans).
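For illustration, a call site after this change might look roughly like
this (just a sketch - "prefetch_max" is an illustrative name, the point
being that callers now pass only the maximum prefetch distance):

    /* sketch: no separate prefetch_reset anymore, rescan simply resets to 0 */
    prefetch_max = get_tablespace_io_concurrency(heapRel->rd_rel->reltablespace);

    scandesc = index_beginscan(node->ss.ss_currentRelation,
                               node->iss_RelationDesc,
                               estate->es_snapshot,
                               node->iss_NumScanKeys,
                               node->iss_NumOrderByKeys,
                               prefetch_max);   /* was: prefetch_target, prefetch_reset */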
0004-PoC-prefetch-for-IOS-20231016.patch
----------------------------------------
This is a PoC adding the prefetch to index-only scans too. At first that
may seem rather strange, considering eliminating the heap fetches is the
whole point of IOS. But if the pages are not marked as all-visible (say,
the most recent part of the table), we may still have to fetch them. In
which case it'd be easy to see cases where IOS is slower than a regular
index scan (with prefetching).
The code is quite rough. It adds a separate index_getnext_tid_prefetch()
function, adding prefetching on top of index_getnext_tid(). I'm not sure
it's the right pattern, but it's pretty much what index_getnext_slot()
does too, except that it also does the fetch + store to the slot.
Note: There's a second patch adding index-only filters, which requires
switching the regular index scans from index_getnext_slot() to _tid() too.
The prefetching then happens only after checking the visibility map (if
requested). This part definitely needs improvements - for example
there's no attempt to reuse the VM buffer, which I guess might be expensive.
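To make that VM check concrete, the core idea is roughly the following
(a sketch only, not the exact shape of the 0004 code - "vmbuffer" is an
illustrative local variable):

    /*
     * Prefetch the heap block only if it's not all-visible - for
     * all-visible pages IOS won't fetch the heap tuple, so prefetching
     * them would be wasted work.
     */
    block = ItemPointerGetBlockNumber(tid);

    if (!VM_ALL_VISIBLE(scan->heapRelation, block, &vmbuffer))
        PrefetchBuffer(scan->heapRelation, MAIN_FORKNUM, block);

    /*
     * XXX vmbuffer should be kept and reused across calls - re-reading
     * the VM page for every TID is likely the expensive part mentioned
     * above.
     */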
index-prefetch.pdf
------------------
Attached is also a PDF with results of the same benchmark I did before,
comparing master vs. patched with various data patterns and scan types.
It's not 100% comparable to earlier results as I only ran it on a
laptop, and it's a bit noisier too. The overall behavior and conclusions
are however the same.
I was specifically interested in the IOS behavior, so I added two more
cases to test - indexonlyscan and indexonlyscan-clean. The first is the
worst-case scenario, with no pages marked as all-visible in VM (the test
simply deletes the VM), while indexonlyscan-clean is the good-case (no
heap fetches needed).
The results mostly match the expected behavior, particularly for the
uncached runs (when the data is expected to not be in memory):
* indexonlyscan (i.e. bad case) - About the same results as
"indexscans", with the same speedups etc. Which is a good thing
(i.e. IOS is not unexpectedly slower than regular indexscans).
* indexonlyscan-clean (i.e. good case) - Seems to have mostly the same
performance as without the prefetching, except for the low-cardinality
runs with many rows per key. I haven't checked what's causing this,
but I'd bet it's the extra buffer lookups/management I mentioned.
I noticed there's another prefetching-related patch [1] from Thomas
Munro. I haven't looked at it yet, so hard to say how much it interferes
with this patch. But the idea looks interesting.
[1]: /messages/by-id/CA+hUKGJkOiOCa+mag4BF+zHo7qo=o9CFheB8=g6uT5TUm2gkvA@mail.gmail.com
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
index-prefetch.pdf (application/pdf)